Report from Debian SnowCamp: day 3

[Previously: day 1, day 2]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

As a starter, and on request from Valhalla, please enjoy an attempt at a group picture (unfortunately, missing a few people). Yes, the sun even showed itself for a few moments today!

One of the numerous SnowCamp group pictures

As for today’s activities… I’ve cheated a bit by doing stuff after sending yesterday’s report and before sleep: I reviewed some of Stefano’s dc18 pull requests; I also papered over (rather than properly fixed) the debexpo uscan bug.

After keeping eyes closed for a few hours, the day was then spent tickling the python-gitlab module, packaged by Federico, in an attempt to resolve in a generic way.

The features I intend to implement are mostly inspired by jcowgill’s multimedia-cli:

  • per-team yaml configuration of “expected project state” (access level, hooks and other integrations, enablement of issues, merge requests, CI, …)
  • new repository creation (according to a team config or a personal config, e.g. for the Debian group, formerly collab-maint)
  • audit of project configurations
  • mass-configuration changes for projects
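
To give an idea of what the audit part could look like, here is a minimal sketch in plain Python. Everything in it is hypothetical: the setting names, the EXPECTED dictionary (standing in for a parsed per-team YAML file) and both functions are made up for illustration, and it does not talk to GitLab at all.

```python
# Hypothetical sketch: these names are illustrative, not taken from
# python-gitlab or multimedia-cli.

# Expected state for a team, as it might look once loaded from a YAML file.
EXPECTED = {
    "issues_enabled": False,
    "merge_requests_enabled": True,
    "jobs_enabled": True,
}


def audit_project(actual, expected):
    """Return {setting: (actual_value, expected_value)} for each mismatch."""
    return {
        key: (actual.get(key), wanted)
        for key, wanted in expected.items()
        if actual.get(key) != wanted
    }


def audit_team(projects, expected):
    """Map each non-compliant project name to its mismatching settings."""
    report = {}
    for name, settings in projects.items():
        diff = audit_project(settings, expected)
        if diff:
            report[name] = diff
    return report


if __name__ == "__main__":
    projects = {
        "compliant": dict(EXPECTED),
        "drifted": {**EXPECTED, "issues_enabled": True},
    }
    for name, diff in audit_team(projects, EXPECTED).items():
        print(name, diff)
```

A mass-configuration change would then be a matter of pushing the expected value back for every entry in the report, which is where the real API wrapper would come in.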

There could also be some use for bits of group management, e.g. to handle the access control of the DebConf group and its subgroups, although I hear Ganneff prefers shell scripts.

My personal end goal is to (finally) do the 3D printer team repository migration, but e.g. the Python team would like to update configuration of all repos to use the new KGB hook instead of irker, so some generic interest in the tool exists.

As the tool has a few dependencies (because I really have better things to do than reimplement another wrapper over the GitLab API) I’m not convinced devscripts is the right place for it to live… We’ll see when I have something that does more than print a list of projects to show!

In the meantime, I have the feeling Stefano has lined up a new batch of DebConf website pull requests for me, so I guess that’s what I’m eating for breakfast “tomorrow”… Stay tuned!

My attendance at SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

Report from Debian SnowCamp: day 2

[Previously: day 1]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

Today’s pièce de résistance was the long overdue upgrade of the machine hosting to (jessie then) stretch. We’ve spent most of the afternoon doing the upgrades with Mattia.

The first upgrade to jessie was a bit tricky because we had to clean up a lot of cruft that had accumulated over the years. I even managed to force an unexpected database restore test 😇. After a few code fixes, and getting annoyed at apache2.4 for ignoring VirtualHost configs that don’t end with .conf (and losing an hour of debugging time in the process…), we managed to restore the functionality of the website.

We then did the stretch upgrade, which was somewhat smooth sailing in comparison… We had to remove some functionality which depended on packages that didn’t make it to stretch: fedmsg, and the SOAP interface. We also noticed that the gpg2 transition completely broke the… “interesting” GPG handling of mentors… An install of gnupg1 later everything should be working as it was before.

We’ve also tried to tackle our current need for a patched FTP daemon. To do so, we’re switching the default upload queue directory from / to /pub/UploadQueue/. Mattia has submitted bugs for dput and dupload, and will upload an updated dput-ng to switch the default. Hopefully we can do the full transition by the next time we need to upgrade the machine.

Known bugs: the uscan plugin now fails to parse the uscan output… But at least it “supports” version=4 now 🙃

Of course, we’re still sorely lacking volunteers who would really care about; the codebase is a pile of hacks upon hacks upon hacks, all relying on an old version of a deprecated Python web framework. A few attempts have been made at a smooth transition to a more recent framework, without really panning out, mostly for lack of time on the part of the people running the service. I’m still convinced things should restart from scratch, but I don’t currently have the energy or time to drive it… Ugh.

More stuff will happen tomorrow, but probably not on See you then!

Report from Debian SnowCamp: day 1

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

This morning, I arrived in Milan at “omfg way too early” (5:30AM, thanks to a 30 minute early (!) night train), and used the opportunity to walk the empty streets around the Duomo while the Milanese .oO(mapreri) were waking up. This gave me the opportunity to take very nice pictures of monuments without people, which is always appreciated!


After a short train ride to Laveno, we arrived at the hostel at around 10:30. Some people had already arrived the day before, so there already was a hacking kind of mood in the air. I’d post a panorama but apparently my phone generated a corrupt JPEG 🙄

After rearranging the tables in the common spaces to handle power distribution correctly (♥ Gaffer Tape), we could start hacking!

Today’s efforts were focused on the DebConf website: there were a bunch of pull requests made by Stefano that I reviewed and merged.

I’ve also written a modicum of code.

Finally, I have created the Debian 3D printing team on salsa in preparation for migrating our packages to git. But now is time to do the sleep thing. See you tomorrow?

Listing and loading of Debian repositories: now live on Software Heritage

Software Heritage is the project I’ve been working on for the past two and a half years. The grand vision of the project is to build the universal software archive, which will collect, preserve and share the Software Commons.

Today, we’ve announced that Software Heritage is archiving the contents of Debian daily. I’m reposting this article on my blog as it will probably be of interest to readers of Planet Debian.

TL;DR: Software Heritage now archives all source packages of Debian as well as its security archive daily. Everything is ready for archival of other Debian derivatives as well. Keep on reading to get details of the work that made this possible.


When we first announced Software Heritage, back in 2016, we had archived the historical contents of Debian as present on the service, as a one-shot proof of concept import.

This code was then left in a drawer and never touched again, until last summer when Sushant came to do an internship with us. We’ve had the opportunity to rework the code that was originally written, and to make it more generic: instead of the specifics of, the code can now work with any Debian repository. This means that we could now archive any of the numerous Debian derivatives that are available out there.

This has been live for a few months, and you can find Debian package origins in the Software Heritage archive now.

Mapping a Debian repository to Software Heritage

The main challenge in listing and saving Debian source packages in Software Heritage is mapping the content of the repository to the generic source history data model we use for our archive.

Organization of a Debian repository

Before we start looking at a bunch of unpacked Debian source packages, we need to know how a Debian repository is actually organized.

At the top level of a Debian repository lies a set of suites, representing versions of the distribution, that is to say a set of packages that have been tested and are known to work together. For instance, Debian currently has 6 active suites, from wheezy (“old old stable” version), all the way up to experimental; Ubuntu has 8, from precise (12.04 LTS), up to bionic (the future 18.04 release), as well as a devel suite. Each of those suites also has a bunch of “overlay” suites, such as backports, which are made available in the archive alongside full suites.

Under the suites, there’s another level of subdivision, which Debian calls components, and Ubuntu calls areas. Debian uses its components to segregate packages along licensing terms (main, contrib and non-free), while Ubuntu uses its areas to denote the level of support of the packages (main, universe, multiverse, …).

Finally, components contain source packages, which merge upstream sources with distribution-specific patches, as well as machine-readable instructions on how to build the package.

Organization of the Software Heritage archive

The Software Heritage archive is project-centric rather than version-centric. What this means is that we are interested in keeping the history of what was available in software origins, which can be thought of as a URL of a repository containing software artifacts, tagged with a type representing the means of access to the repository.

For instance, the origin for the GitHub mirror of the Linux kernel repository has the following data:

For each visit of an origin, we take a snapshot of all the branches (and tagged versions) of the project that were visible during that visit, complete with their full history. See for instance one of the latest visits of the Linux kernel. For the specific case of GitHub, pull requests are also visible as virtual branches, so we fetch those as well (as branches named refs/pull/<pull request number>/head).

Bringing them together

As we’ve seen, Debian archives (just as well as archives for other “traditional” Linux distributions) are release-centric rather than package-centric. Mapping distributions to the Software Heritage archive therefore takes a little bit of gymnastics, to transpose the list of source packages available in each suite to a list of available versions per source package. We do this step by step:

  1. Download the Sources indices for all the suites and components known in the Debian repository
  2. Parse the Sources indices, listing all source packages inside
  3. For each source package, tell the Debian loader to load all the available versions (grouped by name), generating a complete snapshot of the state of the source package across the Debian repository
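
Steps 2 and 3 can be sketched as follows. This is a simplified illustration rather than the actual lister code: it assumes the Sources indices have already been downloaded and decompressed into strings, the function names are made up, and the sample data (including the "hello" package and its version) is invented.

```python
import re
from collections import defaultdict


def parse_sources(text):
    """Yield one dict of fields per paragraph of a Sources index."""
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        fields, key = {}, None
        for line in paragraph.splitlines():
            if line[:1] in (" ", "\t") and key is not None:
                # Continuation line: belongs to the previous field.
                fields[key] += "\n" + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        if fields:
            yield fields


def group_versions(indices):
    """Transpose per-suite listings into per-package version lists.

    indices: iterable of (suite, component, sources_text) tuples.
    Returns {package name: [(suite, component, version), ...]}.
    """
    packages = defaultdict(list)
    for suite, component, text in indices:
        for entry in parse_sources(text):
            packages[entry["Package"]].append(
                (suite, component, entry["Version"])
            )
    return dict(packages)


# Invented sample data, shaped like a (heavily trimmed) Sources index.
SAMPLE = """\
Package: glibc
Version: 2.24-10
Files:
 0123456789abcdef 1234 glibc_2.24-10.dsc

Package: hello
Version: 1.0-1
"""

if __name__ == "__main__":
    print(group_versions([("stretch", "main", SAMPLE)]))
```

The transposition is the important bit: the input is organized by suite and component, the output by source package, which is the shape the loader needs.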

The source packages are mapped to origins using the following format:

  • type: deb
  • url: deb://<repository name>/packages/<source package name> (e.g. deb://Debian/packages/linux)

We use a repository name rather than the actual URL to a repository so that links can persist even if a given mirror disappears.

Loading Debian source packages

To load Debian source packages into the Software Heritage archive, we have to convert them: Debian-based distributions distribute source packages as a set of files, a dsc (Debian Source Control) and a set of tarballs (usually, an upstream tarball and a Debian-specific overlay). On the other hand, Software Heritage only stores version-control information such as revisions, directories, files.

Unpacking the source packages

Our philosophy at Software Heritage is to store the source code of software in the precise form that allows a developer to start working on it. For Debian source packages, this is the unpacked source code tree, with all patches applied. After checking that the files we have downloaded match the checksums published in the index files, we simply use dpkg-source -x to extract the source package, with patches applied, ready to build. This also means that we currently fail to import packages that don’t extract with the version of dpkg-source available in Debian Stretch.
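
As a rough sketch of that check-then-extract sequence (not the actual loader code): the helper names below are hypothetical, and the choice of SHA-256 is an assumption, since the text above only says that checksums from the index files are compared.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks, so large tarballs don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected):
    """Return the file names that are missing or whose checksum differs.

    expected: {file name: SHA-256 hex digest}, as taken from the index.
    """
    bad = []
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != checksum:
            bad.append(name)
    return bad


def extract(dsc_path, destination):
    """Unpack a source package with its patches applied.

    Requires dpkg-source (from dpkg-dev); the destination directory must
    not exist yet. Raises CalledProcessError if extraction fails.
    """
    subprocess.run(
        ["dpkg-source", "-x", str(dsc_path), str(destination)],
        check=True,
    )
```

Only when verify_files returns an empty list would extract be called on the dsc file.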

Generating a synthetic revision

After walking the extracted source package tree, computing identifiers for all its contents, we get the identifier of the top-level tree, which we will reference in the synthetic revision.
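
The text above doesn’t spell out how those identifiers are computed, so the sketch below assumes git-compatible object hashing, applied bottom-up over the tree. It works on an in-memory dictionary rather than a real directory, treats every file as a regular non-executable file, and ignores symlinks; the function names are made up.

```python
import hashlib


def _object_id(kind, payload):
    """SHA-1 over a git-style header followed by the object payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def blob_id(data):
    """Identifier of a file's contents (raw 20-byte digest)."""
    return _object_id("blob", data)


def tree_id(entries):
    """Identifier of a directory.

    entries: {name: bytes for a file, or a nested dict for a sub-directory}.
    """
    records = []
    for name, value in entries.items():
        if isinstance(value, dict):
            # Directories sort as if their name ended with a slash.
            records.append((name + "/", b"40000", name, tree_id(value)))
        else:
            records.append((name, b"100644", name, blob_id(value)))
    records.sort(key=lambda record: record[0].encode())
    payload = b"".join(
        mode + b" " + name.encode() + b"\0" + object_id
        for _, mode, name, object_id in records
    )
    return _object_id("tree", payload)


if __name__ == "__main__":
    package = {
        "debian": {"changelog": b"hello (1.0-1) unstable; urgency=low\n"},
        "hello.c": b"int main(void) { return 0; }\n",
    }
    print(tree_id(package).hex())
```

Because each directory identifier depends on the identifiers of everything below it, the top-level identifier changes whenever any file in the package changes, and stays the same otherwise.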

The synthetic revision contains the “reproducible” metadata that is completely intrinsic to the Debian source package. With the current implementation, this means:

  • the author of the package, and the date of modification, as referenced in the last entry of the source package changelog (referenced as author and committer)
  • the original artifact (i.e. the information about the original source package)
  • basic information about the history of the package (using the parsed changelog)

However, we never set parent revisions in the synthetic commits, for two reasons:

  • there is no guarantee that packages referenced in the changelog have been uploaded to the distribution, or imported by Software Heritage (our update frequency is lower than that of the Debian archive)
  • even if this guarantee existed, and all versions of all packages were available in Software Heritage, there would be no guarantee that the version referenced in the changelog is indeed the version we imported in the first place

This makes the information stored in the synthetic revision fully intrinsic to the source package, and reproducible. In turn, this allows us to keep a cache, mapping the original artifacts to synthetic revision ids, to avoid loading packages again once we have loaded them once.
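
To illustrate why this makes caching possible, here is a toy version of the idea. The field names, the JSON-based identifier and the load_once helper are all assumptions made for the example, not the real data model or hashing scheme, and the parsed changelog history is left out for brevity.

```python
import hashlib
import json


def synthetic_revision(directory_id, changelog_entry, original_artifact):
    """Build a parentless revision from data found in the source package only."""
    person, date = changelog_entry["author"], changelog_entry["date"]
    return {
        "directory": directory_id,
        "author": person,
        "committer": person,
        "date": date,
        "committer_date": date,
        "parents": [],  # never set, for the two reasons given above
        "metadata": {"original_artifact": original_artifact},
    }


def revision_id(revision):
    """Stand-in identifier: SHA-1 of canonical JSON (not the real scheme)."""
    canonical = json.dumps(revision, sort_keys=True).encode()
    return hashlib.sha1(canonical).hexdigest()


def load_once(cache, artifact_key, load):
    """Return the revision id for an artifact, loading it at most once.

    cache: dict mapping an artifact key (e.g. the dsc checksum) to a
    revision id; load: callable that does the expensive work and returns
    a revision.
    """
    if artifact_key not in cache:
        cache[artifact_key] = revision_id(load())
    return cache[artifact_key]
```

Since nothing in the revision depends on when or where the package was loaded, the same source package always yields the same identifier, so a cache hit is as good as loading it again.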

Storing the snapshot

Finally, we can generate the top-level object in the Software Heritage archive, the snapshot. For instance, you can see the snapshot for the latest visit of the glibc package.

To do so, we generate a list of branches by concatenating the suite, the component, and the version number of each detected source package (e.g. stretch/main/2.24-10 for version 2.24-10 of the glibc package available in stretch/main). We then point each branch to the synthetic revision that was generated when loading the package version.

In case a version of a package fails to load (for instance, if the package version disappeared from the mirror between the moment we listed the distribution, and the moment we could load the package), we still register the branch name, but we make it a “null” pointer.
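
Put together, building the snapshot for one source package could look like the sketch below. The dictionary layout and the build_snapshot name are illustrative assumptions; only the origin URL format and the suite/component/version branch naming come from the description above. The revision ids in the example are placeholders, and the sid version is invented.

```python
def build_snapshot(repository, package, loaded):
    """Assemble the origin and branch list for one source package.

    loaded: {(suite, component, version): revision id, or None when
    that version failed to load}.
    """
    branches = {}
    for suite, component, version in sorted(loaded):
        name = f"{suite}/{component}/{version}"
        # A failed load keeps its branch name, pointing at nothing.
        branches[name] = loaded[(suite, component, version)]
    return {
        "origin": {
            "type": "deb",
            "url": f"deb://{repository}/packages/{package}",
        },
        "branches": branches,
    }


if __name__ == "__main__":
    snapshot = build_snapshot(
        "Debian",
        "glibc",
        {
            ("stretch", "main", "2.24-10"): "placeholder-revision-id",
            ("sid", "main", "9.99-1"): None,  # invented version, failed load
        },
    )
    for name, target in snapshot["branches"].items():
        print(name, "->", target)
```

Keeping the failed branches around as null pointers means the snapshot still records that the version was listed at visit time, even though its contents could not be archived.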

There are still some improvements to make to the lister specific to Debian repositories: it currently hardcodes the list of components/areas in the distribution, as the repository format provides no programmatic way of eliciting them. Currently, only Debian and its security repository are listed.

Looking forward

We believe that the model we developed for the Debian use case is generic enough to capture not only Debian-based distributions, but also RPM-based ones such as Fedora, Mageia, etc. With some extra work, it should also be possible to adapt it for language-centric package repositories such as CPAN, PyPI or Crates.

Software Heritage is now well on the way to providing the foundations for a generic and unified source browser for the history of traditional package-based distributions.

We’ll be delighted to welcome contributors who want to lend a hand to get there.