Software Heritage is the project for which I’ve been working during the past two and a half years now. The grand vision of the project is to build the universal software archive, which will collect, preserve and share the Software Commons.
Today, we’ve announced that Software Heritage is archiving the contents of Debian daily. I’m reposting this article on my blog as it will probably be of interest to readers of Planet Debian.
TL;DR: Software Heritage now archives all source packages of Debian as well as its security archive daily. Everything is ready for archival of other Debian derivatives as well. Keep on reading to get details of the work that made this possible.
History
When we first announced Software Heritage, back in 2016, we had archived the historical contents of Debian as present on the snapshot.debian.org service, as a one-shot proof of concept import.
This code was then left in a drawer and never touched again, until last summer when Sushant came do an internship with us. We’ve had the opportunity to rework the code that was originally written, and to make it more generic: instead of the specifics of snapshot.debian.org, the code can now work with any Debian repository. Which means that we could now archive any of the numerous Debian derivatives that are available out there.
This has been live for a few months, and you can find Debian package origins in the Software Heritage archive now.
Mapping a Debian repository to Software Heritage
The main challenge in listing and saving Debian source packages in Software Heritage is mapping the content of the repository to the generic source history data model we use for our archive.
Organization of a Debian repository
Before we start looking at a bunch of unpacked Debian source packages, we need to know how a Debian repository is actually organized.
At the top level of a Debian repository lays a set of suites, representing versions of the distribution, that is to say a set of packages that have been tested and are known to work together. For instance, Debian currently has 6 active suites, from wheezy (“old old stable” version), all the way up to experimental; Ubuntu has 8, from precise (12.04 LTS), up to bionic (the future 18.04 release), as well as a devel suite. Each of those suites also has a bunch of “overlay” suites, such as backports, which are made available in the archive alongside full suites.
Under the suites, there’s another level of subdivision, which Debian calls components, and Ubuntu calls areas. Debian uses its components to segregate packages along licensing terms (main, contrib and non-free), while Ubuntu uses its areas to denote the level of support of the packages (main, universe, multiverse, …).
Finally, components contain source packages, which merge upstream sources with distribution-specific patches, as well as machine-readable instructions on how to build the package.
Organization of the Software Heritage archive
The Software Heritage archive is project-centric rather than version-centric. What this means is that we are interested in keeping the history of what was available in software origins, which can be thought of as a URL of a repository containing software artifacts, tagged with a type representing the means of access to the repository.
For instance, the origin for the GitHub mirror of the Linux kernel repository has the following data:
For each visit of an origin, we take a snapshot of all the branches (and tagged versions) of the project that were visible during that visit, complete with their full history. See for instance one of the latest visits of the Linux kernel. For the specific case of GitHub, pull requests are also visible as virtual branches, so we fetch those as well (as branches named refs/pull/<pull request number>/head).
Bringing them together
As we’ve seen, Debian archives (just as well as archives for other “traditional” Linux distributions) are release-centric rather than package-centric. Mapping distributions to the Software Heritage archive therefore takes a little bit of gymnastics, to transpose the list of source packages available in each suite to a list of available versions per source package. We do this step by step:
- Download the Sources indices for all the suites and components known in the Debian repository
- Parse the Sources indices, listing all source packages inside
- For each source package, tell the Debian loader to load all the available versions (grouped by name), generating a complete snapshot of the state of the source package across the Debian repository
The source packages are mapped to origins using the following format:
- type: deb
- url: deb://<repository name>/packages/<source package name> (e.g. deb://Debian/packages/linux)
We use a repository name rather than the actual URL to a repository so that links can persist even if a given mirror disappears.
Loading Debian source packages
To load Debian source packages into the Software Heritage archive, we have to convert them: Debian-based distributions distribute source packages as a set of files, a dsc (Debian Source Control) and a set of tarballs (usually, an upstream tarball and a Debian-specific overlay). On the other hand, Software Heritage only stores version-control information such as revisions, directories, files.
Unpacking the source packages
Our philosophy at Software Heritage is to store the source code of software in the precise form that allows a developer to start working on it. For Debian source packages, this is the unpacked source code tree, with all patches applied. After checking that the files we have downloaded match the checksums published in the index files, we simply use dpkg-source -x to extract the source package, with patches applied, ready to build. This also means that we currently fail to import packages that don’t extract with the version of dpkg-source available in Debian Stretch.
Generating a synthetic revision
After walking the extracted source package tree, computing identifiers for all its contents, we get the identifier of the top-level tree, which we will reference in the synthetic revision.
The synthetic revision contains the “reproducible” metadata that is completely intrinsic to the Debian source package. With the current implementation, this means:
- the author of the package, and the date of modification, as referenced in the last entry of the source package changelog (referenced as author and committer)
- the original artifact (i.e. the information about the original source package)
- basic information about the history of the package (using the parsed changelog)
However, we never set parent revisions in the synthetic commits, for two reasons:
- there is no guarantee that packages referenced in the changelog have been uploaded to the distribution, or imported by Software Heritage (our update frequency is lower than that of the Debian archive)
- even if this guarantee existed, and all versions of all packages were available in Software Heritage, there would be no guarantee that the version referenced in the changelog is indeed the version we imported in the first place
This makes the information stored in the synthetic revision fully intrinsic to the source package, and reproducible. In turn, this allows us to keep a cache, mapping the original artifacts to synthetic revision ids, to avoid loading packages again once we have loaded them once.
Storing the snapshot
Finally, we can generate the top-level object in the Software Heritage archive, the snapshot. For instance, you can see the snapshot for the latest visit of the glibc package.
To do so, we generate a list of branches by concatenating the suite, the component, and the version number of each detected source package (e.g. stretch/main/2.24-10 for version 2.24-10 of the glibc package available in stretch/main). We then point each branch to the synthetic revision that was generated when loading the package version.
In case a version of a package fails to load (for instance, if the package version disappeared from the mirror between the moment we listed the distribution, and the moment we could load the package), we still register the branch name, but we make it a “null” pointer.
There’s still some improvements to make to the lister specific to Debian repositories: it currently hardcodes the list of components/areas in the distribution, as the repository format provides no programmatic way of eliciting them. Currently, only Debian and its security repository are listed.
Looking forward
We believe that the model we developed for the Debian use case is generic enough to capture not only Debian-based distributions, but also RPM-based ones such as Fedora, Mageia, etc. With some extra work, it should also be possible to adapt it for language-centric package repositories such as CPAN, PyPI or Crates.
Software Heritage is now well on the way of providing the foundations for a generic and unified source browser for the history of traditional package-based distributions.
We’ll be delighted to welcome contributors that want to lend a hand to get there.