What is this?

Welcome to metasnap.d.n. This service is an addition to snapshot.d.o and is meant to answer questions like:

Why is it useful?

For tools like debrebuild, debbisect or debootsnap we need to know which snapshot.d.o timestamp of a given distribution contains which package versions. snapshot.d.o cannot provide this functionality because it doesn't associate GPG-signed Release files and Packages files with the binary packages it offers for download. The first_seen attribute from the snapshot.d.o API indicates the timestamp at which a package file was first seen anywhere in Debian and is not useful beyond the unstable/sid distribution. Even when we limit ourselves to unstable/sid, the first_seen attribute doesn't help in situations where we want to find a minimal set of timestamps providing a set of multiple packages. For this functionality we would need a last_seen property, which would still not be sufficient because it could not be provided per suite unless Packages files are parsed. Lastly, the first_seen attribute is based on files as they appear in /pool, and files in there might never appear in any Packages file (example: ocaml-odoc 1.5.0-1) or might appear in /pool before being listed in a Packages file.

How does it work?

metasnap investigates the Packages files for all known snapshot timestamps, all suites and all architectures and then records for each suite, each architecture and each binary package, which version of it was seen in which snapshot.d.o timestamps in that suite. In detail:

  1. A script scrapes all available snapshot.d.o timestamps from HTML.
  2. Another script goes through all the timestamps that haven't been processed yet, downloads Packages files (using pdiffs if possible) and creates a git commit, tagged with the timestamp, with text files containing the package versions per suite. Processing all of snapshot.d.o from 2005 until 2020 will take 14 weeks and download 160 GB of data.
  3. A third script checks out each commit and turns the data into timestamp ranges for each package version per suite. The initial import of all timestamps from 2005 to 2020 will take five days.
  4. Steps 1. to 3. are performed four times a day.
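
The idea behind step 3 can be illustrated with a short Python sketch. This is not the code metasnap runs: the function name version_ranges and the input layout (a chronological list of timestamps, each with a mapping from package name to version for one suite and architecture) are assumptions made for illustration.

```python
def version_ranges(snapshots):
    """Collapse per-timestamp package listings into timestamp ranges.

    snapshots: list of (timestamp, {package: version}) in chronological
    order (hypothetical input layout, one suite and architecture).
    Returns a sorted list of (package, version, first_seen, last_seen).
    """
    open_ranges = {}  # package -> (version, first_seen, last_seen)
    closed = []
    for timestamp, packages in snapshots:
        # Close every range whose package vanished or changed its version.
        for pkg in list(open_ranges):
            version, first, last = open_ranges[pkg]
            if packages.get(pkg) != version:
                closed.append((pkg, version, first, last))
                del open_ranges[pkg]
        # Extend ranges that are still valid and open new ones.
        for pkg, version in packages.items():
            if pkg in open_ranges:
                _, first, _ = open_ranges[pkg]
                open_ranges[pkg] = (version, first, timestamp)
            else:
                open_ranges[pkg] = (version, timestamp, timestamp)
    for pkg, (version, first, last) in open_ranges.items():
        closed.append((pkg, version, first, last))
    return sorted(closed)
```

In this sketch, a version that disappears from a suite and later reappears produces two separate ranges for the same package and version.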

How to access the data

The data is provided by a Python CGI script with read-only access to a git repository and a sqlite database. The order of arguments matters. Timestamps are given in the same format as snapshot.d.o uses: %Y%m%dT%H%M%SZ. The only valid component names are main, contrib and non-free. On success, HTTP status 200 is returned. If you submit invalid data, status 500 is returned. If the requested package(s) cannot be found, HTTP 404 is returned. If multiple packages cannot be found, the response body contains a list of the packages that could not be found. For other bad requests, error 400 is returned with the explanation following the HTTP error code. The CGI script supports the following functionalities:
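
As a small illustration of the timestamp format, the following Python sketch converts between %Y%m%dT%H%M%SZ strings and datetime objects. The helper names are made up for this example, and it assumes that the trailing Z denotes UTC.

```python
from datetime import datetime, timezone

# Format string as stated above; the trailing "Z" is matched literally.
SNAPSHOT_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_timestamp(text):
    """Parse a snapshot-style timestamp into an aware datetime (UTC assumed)."""
    return datetime.strptime(text, SNAPSHOT_FORMAT).replace(tzinfo=timezone.utc)


def format_timestamp(moment):
    """Format an aware datetime as a snapshot-style timestamp in UTC."""
    return moment.astimezone(timezone.utc).strftime(SNAPSHOT_FORMAT)
```

Because the format is fixed-width and ordered from year down to second, comparing two timestamp strings lexicographically gives the same result as comparing them chronologically.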

GET /cgi-bin/api?timestamps=<R>

Returns a list of all timestamps from snapshot.d.o, one per line. The value <R> controls the archive the timestamps were recorded for. Possible values are "debian", "debian-archive", "debian-backports", "debian-debug", "debian-ports", "debian-security" and "debian-volatile".

Example: cgi-bin/api?timestamps=debian

GET /cgi-bin/api?archive=<R>&timestamp=<T>&suite=<S>&comp=<C>&pkg=<P>&arch=<A>

Use this interface if you have a timestamp T and you want to know which version a certain package P with architecture A in suite S and component C had at that time. The result is one or more versions, one per line. This is equivalent to downloading and parsing http://snapshot.debian.org/archive/debian/<T>/dists/<S>/<C>/binary-<A>/Packages.xz but has the advantage that, instead of downloading and parsing several megabytes of data and thus unnecessarily putting load on snapshot.d.o, only a few bytes need to be retrieved.

Example: cgi-bin/api?archive=debian&timestamp=20090825T163333Z&suite=unstable&comp=main&pkg=libflute-1.3-jfree-java-doc&arch=hurd-i386
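
Since the order of arguments matters, a client should build the query string from an ordered list of pairs. The following Python sketch shows one way to do that. The function name is hypothetical, and the base URL is taken from the curl command further down this page.

```python
from urllib.parse import urlencode

# Base URL as it appears in the curl example on this page.
BASE_URL = "https://metasnap.debian.net/cgi-bin/api"


def version_at_url(archive, timestamp, suite, comp, pkg, arch):
    """Build the query URL for 'which version did pkg have at timestamp'."""
    # A list of pairs keeps the arguments in exactly this order.
    params = [
        ("archive", archive),
        ("timestamp", timestamp),
        ("suite", suite),
        ("comp", comp),
        ("pkg", pkg),
        ("arch", arch),
    ]
    return BASE_URL + "?" + urlencode(params)
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen, and the response body is read as plain text.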

GET /cgi-bin/api?archive=<R>&timestamp=<T>&suite=<S>&comp=<C>&arch=<A>

Like 4.2 above but returns all binary packages with their version for the given timestamp, suite and component, one per line, separated by a space.

Example: cgi-bin/api?archive=debian&timestamp=20090825T163333Z&suite=unstable&comp=main&arch=hurd-i386

GET /cgi-bin/api?archive=<R>&pkg=<P>

Given a package name, returns a plain file with six space separated entries per line. The first is the architecture, the second the version, the third is the suite name, the fourth the component name, the fifth the first timestamp this version was seen in this suite and the sixth the last timestamp the package was seen in it.

Example: cgi-bin/api?archive=debian&pkg=base-files

GET /cgi-bin/api?archive=<R>&pkg=<P>&arch=<A>

Given a package name and an architecture, returns a plain file with five space separated entries per line. The first is the version, the second is the suite name, the third the component name, the fourth the first timestamp this version was seen in this suite and the fifth the last timestamp the package was seen in it.

Example: cgi-bin/api?archive=debian&pkg=base-files&arch=amd64

GET /cgi-bin/api?archive=<R>&pkg=<P>&arch=<A>&ver=<V>

All you have is a packagename, architecture, version tuple and you want to know which suites and components contained that package for what range of timestamps. The result of this query is a plain file with four space separated entries per line. The first is the suite name, the second the component name, the third the first timestamp this version was seen in this suite and the fourth the last timestamp the package was seen in it.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1
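
The space separated responses are easy to parse. Here is a minimal Python sketch for the four-column output of this query. The class and function names are invented for this example, and the input is whatever response body the client received.

```python
from typing import NamedTuple


class SuiteRange(NamedTuple):
    suite: str
    comp: str
    first_seen: str
    last_seen: str


def parse_suite_ranges(body):
    """Parse a response body with four space separated entries per line."""
    rows = []
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        rows.append(SuiteRange(*fields))
    return rows
```

The same approach works for the other queries by adjusting the field names and the expected number of columns.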

GET /cgi-bin/api?archive=<R>&pkg=<P>&arch=<A>&ver=<V>&suite=<S>

Same as 4.6 above but the output doesn't contain the first column with the suite name, as the request is restricted to a specific suite. As a result the output only contains the component, the first timestamp and the last timestamp per line, separated by a space.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1&suite=rc-buggy

GET /cgi-bin/api?archive=<R>&pkg=<P>&arch=<A>&ver=<V>&suite=<S>&comp=<C>

Same as 4.6 and 4.7 above but the output doesn't contain the first and second column with the suite and component name, as the request is restricted to a specific suite and component. As a result the output only contains the first timestamp and the last timestamp per line, separated by a space.

Example: cgi-bin/api?archive=debian&pkg=libcamel1.2-dev&arch=powerpc&ver=2.27.90-1&suite=rc-buggy&comp=main

Even though a given three-tuple of package name, version and architecture should not appear in the same suite multiple times, this still sometimes happens. This is why the result of API 4.4, 4.5, 4.6, 4.7 and 4.8 can contain multiple rows for the same suite.

Example: cgi-bin/api?archive=debian&pkg=clang-3.4-doc&arch=s390x&ver=1:3.4~svn190846-1&suite=unstable&comp=main

After pasting a buildinfo file into the textarea below and clicking the button, client-side JavaScript will use this API function and compute a minimal set of timestamps containing all package versions referenced by the buildinfo file. The script assumes that all packages were at some point part of Debian unstable main.

GET /cgi-bin/api?archive=<R>&pkgs=<P1=V1[,...]>&arch=<A>&suite=<S>&comp=<C>

If you need to find a snapshot timestamp providing multiple packages at once, then instead of making multiple queries using the API endpoints above, you can use this query. You supply a list of binary package names, a default architecture, a suite and a component, and the result will be a minimal list of timestamps which include all the requested packages in their specific version. Each line in the results will list the architecture and the snapshot timestamp, separated by a space. The pkgs argument lists packages separated by a comma. Versions are appended to package names after an equal sign. If the architecture differs from the default value given by the arch argument, then it can be appended directly after the package name and before the version with a colon. This mimics the way that a package, version, architecture three-tuple is understood by the apt command line.

Example: cgi-bin/api?archive=debian&pkgs=dpkg%3D1.18.18,diffutils%3D1%3A3.5-1&arch=amd64&suite=unstable&comp=main
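
The pkgs value can be assembled programmatically. The Python sketch below follows the format described above (name, optional colon and architecture, equal sign, version, joined by commas) and percent-encodes it the way the example does. The function names and the (name, arch, version) input layout are assumptions for illustration.

```python
from urllib.parse import quote


def format_pkgs(packages, default_arch):
    """Join (name, arch, version) triples into a pkgs value.

    arch may be None, meaning the default architecture applies.
    """
    items = []
    for name, arch, version in packages:
        if arch is None or arch == default_arch:
            items.append(f"{name}={version}")
        else:
            # A non-default architecture goes between name and version.
            items.append(f"{name}:{arch}={version}")
    return ",".join(items)


def encode_pkgs(value):
    """Percent-encode '=' and ':' but keep the separating commas readable."""
    return quote(value, safe=",")
```

For the two packages in the example above, encode_pkgs(format_pkgs(...)) yields the same string as the pkgs argument shown there.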

GET /cgi-bin/api?archive=<R>&pkgs=<P1=V1[,...]>&arch=<A>

Similar to 4.9 above, but as you do not specify the suite and component, the result will include all timestamp ranges containing the requested version across all suites and components. So instead of issuing query 4.6 multiple times, you can retrieve the same data in a single query. You then have to perform the computations to find a good set of timestamps that contain all your versions by yourself. Each line of the output contains seven values, separated by a space: The package name, the architecture, the version, the suite, the component, the first timestamp that the package was seen in that suite and lastly, the last timestamp the package was seen in it.

Example: cgi-bin/api?archive=debian&pkgs=dpkg%3D1.18.18,diffutils%3D1%3A3.5-1&arch=amd64
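
One way to perform that computation on the client is the classic greedy algorithm for covering intervals with as few points as possible. The Python sketch below is not necessarily what this service or its JavaScript does. It assumes the simplified case where every package has exactly one (first, last) range, all within the same suite and component; packages with several candidate ranges need a more general search.

```python
def covering_timestamps(ranges):
    """Pick a smallest set of timestamps so that every range contains one.

    ranges: dict mapping a package to its (first_seen, last_seen) pair,
    both in the %Y%m%dT%H%M%SZ format (hypothetical input layout).
    """
    chosen = []
    # Visit ranges ordered by their end. The fixed-width timestamp strings
    # sort lexicographically in chronological order.
    for first, last in sorted(ranges.values(), key=lambda r: r[1]):
        # If the latest chosen timestamp lies before this range starts,
        # the range is not covered yet: pick its last timestamp.
        if not chosen or chosen[-1] < first:
            chosen.append(last)
    return chosen
```

Each chosen value is the last_seen timestamp of some package, so it is a timestamp that exists on snapshot.d.o.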

POST-ing package list

Large package lists can easily go beyond the size limits of a GET request (32768 bytes), so an RFC 2388 multipart/form-data POST interface is provided for these situations. Required form fields are "archive" and "pkgs". Optional fields are "arch" (the default architecture), "suite" (limit to a single suite) and "comp" (limit to a single component). The value of the "archive" field can be "debian-backports", "debian-debug", "debian-ports", "debian-security" or "debian-volatile". The "pkgs" field has the same format as the "pkgs" argument in 4.9 and 4.10, but its entries may also be separated by whitespace or newlines instead of a comma. Without the optional "arch" field, all packages must declare their architecture explicitly after the package name and before the version, separated by a colon. Without the optional "suite" and "comp" fields, the result will be a text file with seven space separated entries per line. The first is the package name, the second the architecture, the third the version, the fourth the suite, the fifth the component, the sixth the first timestamp the package was seen in the respective suite and the seventh the last timestamp the package was seen. If the "suite" and/or "comp" fields are used, then the respective columns of the returned data are omitted.
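
A multipart/form-data body can be built with the Python standard library alone. The sketch below is only an illustration: the function names are hypothetical, only plain text fields are handled, and the URL is the one from the curl command on this page. It constructs the request but does not send it.

```python
import urllib.request
import uuid

API_URL = "https://metasnap.debian.net/cgi-bin/api"


def encode_multipart(fields):
    """Encode a dict of text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from content
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", body


def build_post_request(fields):
    """Create (but do not send) a POST request carrying the form fields."""
    content_type, body = encode_multipart(fields)
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": content_type}
    )
```

Passing the returned object to urllib.request.urlopen would submit the form; the "archive" and "pkgs" fields must be present in the dict.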

POST-ing buildinfo file

Form fields: archive, suite, component, buildinfo.

You can also fill the form using the following curl command:

curl -F 'buildinfo=<-' https://metasnap.debian.net/cgi-bin/api < foo.buildinfo

Downloading raw data

Archive            git repo    sqlite db
debian             1.2 GB      3.5 GB
debian-backports   11 MB       11 MB
debian-debug       896 MB      337 MB
debian-ports       2.3 GB      1.3 GB
debian-security    51 MB       60 MB
debian-volatile    8.3 MB      6.1 MB

Limitations

Statistics

Code

The code that downloads Packages files from snapshot.d.o, puts them into a git repository and turns that into an sqlite database with ranges of timestamps per package, version and architecture can be found here:

https://salsa.debian.org/metasnap-team/metasnap

The work is licensed under MIT/Expat.

Bugs

Report them in the salsa issue tracker

Contact

Johannes Schauer Marin Rodrigues <josch [at] debian [dot] org>