What is this?

Welcome to metasnap.d.n. This service is an addition to snapshot.d.o and is meant to answer questions like:

Why is it useful?

For tools like debrebuild or debbisect we need to know which snapshot.d.o timestamp of a given distribution contains which package versions. snapshot.d.o cannot provide this functionality because it doesn't investigate Packages files. The first_seen attribute from the snapshot.d.o API indicates the timestamp at which a package file was first seen anywhere in Debian and is not useful beyond the unstable/sid distribution. Even when we limit ourselves to unstable/sid, the first_seen attribute doesn't help in situations where we want to find a minimal set of timestamps providing a set of multiple packages. For this functionality we would need a last_seen property, which would still not be sufficient because it could not be provided per suite unless Packages files are parsed. Lastly, the first_seen attribute is based on files as they appear in /pool, and files in there might never appear in any Packages file (example: ocaml-odoc 1.5.0-1) or might appear in /pool before being listed in the Packages file.

How does it work?

metasnap investigates the Packages files for all known snapshot timestamps, all suites and all architectures and then records for each suite, each architecture and each binary package, which version of it was seen in which snapshot.d.o timestamps in that suite. In detail:

  1. A script scrapes all available snapshot.d.o timestamps from HTML
  2. Another script goes through all the timestamps that haven't been processed yet, downloads Packages files (using pdiffs if possible) and creates a git commit, tagged with the timestamp, of text files containing the package versions per suite. Processing all of snapshot.d.o from 2005 until 2020 will take 14 weeks and download 160 GB of data.
  3. A third script checks out each commit and turns the data into timestamp ranges for each package version per suite. The initial import of all timestamps from 2005 to 2020 will take five days.
  4. Steps 1. to 3. are performed four times a day.
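
Step 3 can be pictured with the following minimal sketch. It is not the code metasnap runs. It only assumes that, for one package in one suite and architecture, we have a chronologically sorted list of (timestamp, version) observations, with None meaning the package was absent from that snapshot. The function name versions_to_ranges is made up for this illustration.

```python
from typing import Iterable, List, Optional, Tuple


def versions_to_ranges(
    observations: Iterable[Tuple[str, Optional[str]]],
) -> List[Tuple[str, str, str]]:
    """Collapse per-snapshot observations into (version, first, last) ranges.

    `observations` must be sorted by timestamp. A version of None means the
    package was not listed in that snapshot (an assumption of this sketch).
    """
    ranges: List[Tuple[str, str, str]] = []
    current = None  # [version, first_seen, last_seen] of the open range
    for timestamp, version in observations:
        if current is not None and current[0] == version:
            # Same version as in the previous snapshot: extend the open range.
            current[2] = timestamp
            continue
        if current is not None:
            # Version changed or package vanished: close the open range.
            ranges.append((current[0], current[1], current[2]))
        current = [version, timestamp, timestamp] if version is not None else None
    if current is not None:
        ranges.append((current[0], current[1], current[2]))
    return ranges
```

A version that disappears and later returns produces two separate ranges in this sketch, which is one way the same version can end up with several rows for one suite.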

How to access the data

The data is provided by a Python cgi script with read-only access to a git repository and a sqlite database. The order of arguments matters. Timestamps are given in the same format as snapshot.d.o uses: %Y%m%dT%H%M%SZ. The only valid component names are main, contrib and non-free. On success, HTTP status 200 is returned. If you submit invalid data, status 500 is returned. If the requested package(s) cannot be found, HTTP 404 is returned; if multiple packages cannot be found, the response body contains a list of the packages that could not be found. For other bad requests, error 400 is returned with the explanation following the HTTP error code. The cgi script supports the following queries:
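
As a rough client-side illustration, the sketch below handles the timestamp format and builds query URLs while keeping the argument order intact. The helper names are invented for this example, the base URL is the one used in the curl command further down, and treating the trailing Z as UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import List, Tuple
from urllib.parse import urlencode

API = "https://metasnap.debian.net/cgi-bin/api"
TS_FORMAT = "%Y%m%dT%H%M%SZ"  # same format as snapshot.d.o


def parse_timestamp(text: str) -> datetime:
    """Parse a snapshot timestamp; the 'Z' suffix is assumed to mean UTC."""
    return datetime.strptime(text, TS_FORMAT).replace(tzinfo=timezone.utc)


def format_timestamp(moment: datetime) -> str:
    """Format an aware datetime in the snapshot timestamp format."""
    return moment.astimezone(timezone.utc).strftime(TS_FORMAT)


def build_url(params: List[Tuple[str, str]]) -> str:
    """Build a query URL from an ordered list of (name, value) pairs.

    A list of pairs is used instead of a dict to make it explicit that the
    order of arguments is part of the request.
    """
    return API + "?" + urlencode(params)
```

Because the timestamp format is fixed-width and purely numeric apart from the T and Z, two timestamps can also be compared as plain strings without parsing them.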

cgi-bin/api?timestamps=<archive>

Returns a list of all timestamps from snapshot.d.o, one per line. The value <archive> controls the archive the timestamps were recorded for. The only supported archive right now is "debian".

Example: cgi-bin/api?timestamps=debian

cgi-bin/api?archive=debian&timestamp=<T>&suite=<S>&comp=<C>&pkg=<P>&arch=<A>

Use this interface if you have a timestamp T and you want to know which version a certain package P with architecture A in suite S and component C had at that time. The result is one or more versions, one per line. This is equivalent to downloading and parsing http://snapshot.debian.org/archive/debian/<T>/dists/<S>/<C>/binary-<A>/Packages.xz but has the advantage that only a few bytes need to be retrieved, instead of downloading and parsing several megabytes of data and thus putting unnecessary load on snapshot.d.o.

Example: cgi-bin/api?archive=debian&timestamp=20090825T163333Z&suite=unstable&comp=main&pkg=libflute-1.3-jfree-java-doc&arch=hurd-i386

cgi-bin/api?archive=debian&timestamp=<T>&suite=<S>&comp=<C>&arch=<A>

Like 4.2 above (the query with a pkg argument), but returns all binary packages with their versions for the given timestamp, suite, component and architecture, one per line, with name and version separated by a space.

Example: cgi-bin/api?archive=debian&timestamp=20090825T163333Z&suite=unstable&comp=main&arch=hurd-i386
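
A response of this form could be read into a dictionary along the following lines. This is only a sketch: the function name is hypothetical, and keeping a list of versions per package is a precaution in case a package is listed more than once.

```python
from typing import Dict, List


def parse_package_list(body: str) -> Dict[str, List[str]]:
    """Parse 'name version' lines into a mapping from name to versions."""
    versions: Dict[str, List[str]] = {}
    for line in body.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"unexpected line: {line!r}")
        name, version = fields
        versions.setdefault(name, []).append(version)
    return versions
```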

cgi-bin/api?archive=debian&pkg=<P>&arch=<A>&ver=<V>

Use this interface if all you have is a package name, architecture and version tuple and you want to know which suites and components contained that package for what range of timestamps. The result of this query is a plain text file with four space-separated entries per line. The first is the suite name, the second the component name, the third the first timestamp at which this version was seen in this suite and the fourth the last timestamp at which it was seen there.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1
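
The four-column output lends itself to a small record type. The sketch below is one possible way to parse it; the names SuiteRange and parse_ranges are invented for this example.

```python
from typing import List, NamedTuple


class SuiteRange(NamedTuple):
    suite: str
    component: str
    first_seen: str  # snapshot timestamp, e.g. 20090825T163333Z
    last_seen: str   # snapshot timestamp


def parse_ranges(body: str) -> List[SuiteRange]:
    """Parse 'suite component first last' lines into SuiteRange records."""
    rows: List[SuiteRange] = []
    for line in body.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        rows.append(SuiteRange(*fields))
    return rows
```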

cgi-bin/api?archive=debian&pkg=<P>&arch=<A>&ver=<V>&suite=<S>

Same as 4.4 above but the output doesn't contain the first column with the suite name, as the request is restricted to a specific suite. As a result the output only contains the component, the first timestamp and the last timestamp per line, separated by a space.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1&suite=rc-buggy

cgi-bin/api?archive=debian&pkg=<P>&arch=<A>&ver=<V>&suite=<S>&comp=<C>

Same as 4.4 and 4.5 above but the output doesn't contain the first and second column with the suite and component name, as the request is restricted to a specific suite and component. As a result the output only contains the first timestamp and the last timestamp per line, separated by a space.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1&suite=rc-buggy&comp=main

Even though a given three-tuple of package name, version and architecture should not appear in the same suite multiple times, this still sometimes happens. This is why the result of API 4.4, 4.5 and 4.6 can contain multiple rows for the same suite.

Example: cgi-bin/api?archive=debian&pkg=clang-3.4-doc&arch=s390x&ver=1:3.4~svn190846-1&suite=unstable&comp=main
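
Client code therefore should not assume a single row. As a minimal sketch, a check whether a version was present at a given timestamp can test every returned range. The function name is hypothetical, and the input is assumed to be the two-column output of the query restricted to suite and component.

```python
def seen_at(body: str, timestamp: str) -> bool:
    """Return True if `timestamp` falls into any 'first last' range in `body`.

    Timestamps in the %Y%m%dT%H%M%SZ format are fixed-width, so comparing
    them as strings gives the same result as comparing them as dates.
    """
    for line in body.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        first, last = fields
        if first <= timestamp <= last:
            return True
    return False
```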

cgi-bin/api?archive=debian&pkgs=<P1=V1[,...]>&arch=<A>&suite=<S>&comp=<C>

If you need to find a snapshot timestamp providing multiple packages at once, you can use this query instead of making multiple queries using the API endpoints above. You supply a list of binary package names with versions, a default architecture, a suite and a component, and the result will be a minimal list of timestamps which include all the requested packages in their specific versions. Each line in the result lists the architecture and the snapshot timestamp, separated by a space. The pkgs argument lists packages separated by a comma. Versions are appended to package names after an equal sign. If the architecture of a package differs from the default value given by the arch argument, it can be appended directly after the package name and before the version, separated from the name by a colon. This mimics the way that a package, version, architecture three-tuple is understood by the apt command line.

Example: cgi-bin/api?archive=debian&pkgs=dpkg%3D1.18.18,diffutils%3D1%3A3.5-1&arch=amd64&suite=unstable&comp=main
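
The pkgs value needs percent-encoding for the equal sign and the colon, as in the example above. The sketch below shows one way to assemble such a query; the function names are invented, and the comma is kept unescaped only to reproduce the form used in the example.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlencode

Package = Tuple[str, Optional[str], str]  # (name, architecture or None, version)


def format_pkgs(packages: Iterable[Package], default_arch: str) -> str:
    """Join packages as name=version or name:arch=version, comma-separated."""
    parts = []
    for name, arch, version in packages:
        if arch is None or arch == default_arch:
            parts.append(f"{name}={version}")
        else:
            # The architecture is only spelled out when it differs from the default.
            parts.append(f"{name}:{arch}={version}")
    return ",".join(parts)


def build_pkgs_query(packages: Iterable[Package], arch: str, suite: str, comp: str) -> str:
    """Build the relative query string, keeping the documented argument order."""
    params = [
        ("archive", "debian"),
        ("pkgs", format_pkgs(packages, arch)),
        ("arch", arch),
        ("suite", suite),
        ("comp", comp),
    ]
    # safe="," leaves the separator readable; '=' and ':' are still escaped.
    return "cgi-bin/api?" + urlencode(params, safe=",")
```

Called with dpkg 1.18.18 and diffutils 1:3.5-1 for amd64 in unstable main, build_pkgs_query reproduces the example query shown above.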

cgi-bin/api?archive=debian&pkgs=<P1=V1[,...]>&arch=<A>

Similar to 4.7 above, but as you do not specify the suite and component, the result will include all timestamp ranges containing the requested versions across all suites and components. So instead of issuing query 4.4 multiple times, you can retrieve the same data in a single query. You then have to compute a good set of timestamps that contain all your versions yourself. Each line of the output contains seven values, separated by a space: the package name, the architecture, the version, the suite, the component, the first timestamp at which the package was seen in that suite and, lastly, the last timestamp at which it was seen there.

Example: cgi-bin/api?archive=debian&pkgs=dpkg%3D1.18.18,diffutils%3D1%3A3.5-1&arch=amd64
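
One way to do that client-side computation is a greedy heuristic, sketched below. It is not the algorithm used by the service, and it does not guarantee a minimal result. It restricts the seven-column rows to a single suite and component, uses the last-seen timestamps as candidates and repeatedly picks the candidate covering the most packages that are still uncovered. The function name and its defaults are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pick_timestamps(body: str, suite: str = "unstable", comp: str = "main") -> List[str]:
    """Greedily choose timestamps so every listed package is covered once.

    Packages without any row for the given suite and component are simply
    absent from the result; callers should check for them separately.
    """
    ranges: Dict[Tuple[str, str, str], List[Tuple[str, str]]] = defaultdict(list)
    for line in body.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or unexpected lines
        name, arch, version, row_suite, row_comp, first, last = fields
        if (row_suite, row_comp) == (suite, comp):
            ranges[(name, arch, version)].append((first, last))

    def covered_by(timestamp: str, package: Tuple[str, str, str]) -> bool:
        # Fixed-width timestamps compare correctly as plain strings.
        return any(first <= timestamp <= last for first, last in ranges[package])

    # Every range end is a real snapshot timestamp, so it is a valid candidate.
    candidates = sorted({last for spans in ranges.values() for _, last in spans})
    uncovered = set(ranges)
    chosen: List[str] = []
    while uncovered:
        best = max(candidates, key=lambda ts: sum(covered_by(ts, p) for p in uncovered))
        chosen.append(best)
        uncovered = {p for p in uncovered if not covered_by(best, p)}
    return sorted(chosen)
```

The loop always terminates because each uncovered package is covered at least by the end of one of its own ranges.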

Downloading raw data

The cgi script uses a git repository to satisfy queries 4.1 and 4.2. That repository can be cloned:

git clone https://metasnap.debian.net/by-timestamp/debian.git/.git/

The repository is about 1.2 GB in size. The remaining queries use a sqlite database which can be downloaded here. The database is 3.5 GB in size.

Demo

To demonstrate how this data is useful, you can place a buildinfo file in the textarea below. Client-side JavaScript will then download all required files from this service and compute a minimal set of timestamps containing all package versions referenced by the buildinfo file. The script assumes that all packages were at some point part of Debian unstable main.

This JavaScript is only a demo and not intended for production use. It uses the api function 4.6 even though it would be better to use 4.8 to reduce the number of requests. Or even better, use api function 4.7 to do the whole computation server-side.

As a convenience function you can directly upload a buildinfo file using the form below.

buildinfo:

You can also fill the form using the following curl command:

curl -F 'buildinfo=<-' https://metasnap.debian.net/cgi-bin/api < foo.buildinfo

Limitations

Statistics

Code

The code that downloads Packages files from snapshot.d.o, puts them into a git repository and turns that into an sqlite database with ranges of timestamps per package, version and architecture can be found here:

https://salsa.debian.org/josch/metasnap

The work is licensed under MIT/Expat.

Bugs

Report them in the salsa issue tracker

Contact

Johannes Schauer Marin Rodrigues <josch [at] debian [dot] org>