tl;dr: You can now use our searchable database to download Bitcoin timestamps for items in the Internet Archive.

While that title sounds like clickbait, the hard work of the Internet
Archive made it much more accurate than it sounds.
They’re a San Francisco non-profit digital library that provides free public access to
collections of digitized materials, including software applications/games,
music, movies/videos, moving images, and millions of public-domain books. But
they’re perhaps best known for the Wayback Machine,
an archive of hundreds of billions of website snapshots, providing a priceless
historical record of the evolution of the web.

In short, if it’s on the internet, there’s a pretty good chance the Internet
Archive has a copy of it.

But is that copy the right copy?

OpenTimestamps helps answer that question by cryptographically proving data
existed in the past, long before an attacker would have had an opportunity or reason to
forge or modify that data.

The OpenTimestamps team has timestamped every item in the Internet Archive –
about 750,000,000 files in total – and made those timestamps publicly
available via a searchable database. This means that right
now you can get timestamps for every book, movie, song, computer program, legal
document, etc. in the thousands of collections in the archive. In the future we
hope to be able to work with the Internet Archive to extend this to
timestamping website snapshots, and our infrastructure will continue
timestamping new items as they’re added to the archive.

Let’s look at an example attack on the archive, how a timestamp could prevent
it, and finally, the tech details behind this effort.

Disclaimer: this is not an official Internet Archive project and was done
entirely with publicly accessible APIs (though we did check with them in
advance to ensure they had no objections to the project).

Contents

  1. I’m Satoshi Nakamoto
    1. How Timestamps can (and can’t) Protect the Internet Archive
  2. The Tech
    1. Getting the Digests
    2. Generating the Merkle Tree
    3. The User-Interface
    4. Timestamping the Merkle Tree
    5. SHA1 is Good Enough for Timestamps!
  3. What’s Next?
    1. Browser Compatibility
    2. Improved Coverage
    3. Mirrors
    4. Web Captures
  4. Footnotes

I’m Satoshi Nakamoto

…and I’d like you to invest your money in my next project, mChain, which will
revolutionize the energy efficiency of proof-of-work with simulated quantum
computing. Still don’t believe me? Let’s prove it. First, here’s a PGP signed
statement:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Peter Todd is Bitcoin creator Satoshi Nakamoto.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJZIgp7AAoJEMRbeOy+vH8ry20IAIn1jGibaU39n6Z3Mn1MwKlA
AHksriNkxZSTivm0kHN5xjCatujFDXL7WSkQkuP30/TUhVuMfwU5Fiw7qHw9QfFA
f2JrLy+XcEv2xxsziA7IrdvjJPSAIjl39hQODgSrhBpj21+hQxIzTtlm6UaHAnVg
iiSOCOVkl35GFi6bMRU80apHEFMOAcakMEDje+qlv2C/p1J/0lPdigdfEJh9nOnw
7EOTps9aXG5LeCXG2IuGBW1CzqlMuD/KOfmkK2WQxVytC80TaNBmkN9i05xSYbnd
BsByB3rMEKAlNRMo2pHhreOzdww+badEB7/w4Dj1rsLgcyGqjq/ZKeeXR7j7MLI=
=SMVi
-----END PGP SIGNATURE-----

As we all know, the Australian scammer Craig Wright produced a similar
PGP message with a fake backdated PGP key. We know that key is a fake for a lot
of reasons, including the fact that it doesn’t match the Wayback Machine’s Jan
2011 snapshot of bitcoin.org.

And yes, I signed that message with a different key too. But I have an
explanation: you see, I stored all the Satoshi Nakamoto pseudonym stuff on a
MicroSD card, which I lost in a tragic house fire right after Gavin visited the
CIA. But you see, I actually had to delay the publication of Bitcoin a few
months when I realised I needed to add smart contracts to it, and I just found
a backup from that attempt. I uploaded it to the Internet Archive a few months
prior to releasing Bitcoin:

Fake Satoshi Key Search Result

Similarly the fingerprint is mentioned in the original whitepaper, also on the
Internet Archive:

Fake Satoshi Paper Search Result

Of course, those screenshots are photoshopped. But if I colluded with – or
coerced – an Internet Archive sysadmin I could easily make them all too
real. With Craig Wright alone allegedly scamming tens of millions of dollars,
it’s easy to see how there can be a lot of incentive to manipulate history.

How Timestamps can (and can’t) Protect the Internet Archive

By consistently timestamping all Internet Archive content, we make attacks like
the above easy to detect. The OpenTimestamps proofs we’ve generated are
traceable back to the Bitcoin blockchain, a widely witnessed data structure with
timestamps that can’t be backdated. Even with a sysadmin’s help, the best the
attacker could do is create a modified file that’s very suspiciously missing a
timestamp that all other files have.

However, it’s important to note timestamps are not a panacea: they’re just
evidence as to when a file existed; by themselves they can’t prove a file
is legit. For example, if I had known in 2008 that Satoshi was going to release
Bitcoin, I could have generated fake keys and fake Bitcoin papers with 100%
real timestamps. While such a scam is much less likely, it’s certainly not
impossible[1].

The Tech

The Internet Archive collection is massive, dozens of petabytes in size. It’s
so big that when Wayback Machine Director Mark Graham learned that we had
timestamped the Internet Archive, he sent me an email asking:

How are you able to do anything with “all” of the Internet Archive? 🙂

In fact, due to the excellent API the Internet Archive provides, this was way
easier than you might expect!

Getting the Digests

Every item in the archive consists of a set of files and some associated
metadata.
Here’s an example from an item I’ve uploaded:


    <mtime>1479664926</mtime>
    <size>407240704</size>
    <md5>8afba33859360da23c2d354b92ab5c47</md5>
    <crc32>d5b9f56b</crc32>
    <sha1>0f42d040ff8370ff8a3041dd3e659f7e8a3d6c8c</sha1>
    <format>ISO Image</format>

Importantly, we can get the SHA1 digest without actually downloading the file!

To get the complete set of digests I use the scraping API to search for all
items added on a given day. Using the excellent internetarchive library, the
above was implemented in about sixty lines of Python.
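As a rough illustration of the digest-extraction step (not the actual script), here is a minimal sketch. It assumes an item's metadata has already been fetched and parsed into a dict with a `files` list whose entries carry `name` and `sha1` keys. The `extract_sha1_digests` helper and the `example.iso` file name are hypothetical; only the digest values come from the example above.

```python
def extract_sha1_digests(item_metadata):
    """Return {file name: raw SHA1 digest} for one item's metadata.

    Assumes a dict with a "files" list whose entries have a "name" and,
    usually, a hex-encoded "sha1". Entries without a digest are skipped.
    """
    digests = {}
    for entry in item_metadata.get("files", []):
        sha1_hex = entry.get("sha1")
        if sha1_hex is None:
            continue
        digest = bytes.fromhex(sha1_hex)
        if len(digest) != 20:
            raise ValueError("unexpected SHA1 length for %r" % entry["name"])
        digests[entry["name"]] = digest
    return digests


# Hypothetical item metadata; the file names are made up for illustration.
sample_item = {
    "files": [
        {
            "name": "example.iso",
            "size": "407240704",
            "md5": "8afba33859360da23c2d354b92ab5c47",
            "sha1": "0f42d040ff8370ff8a3041dd3e659f7e8a3d6c8c",
            "format": "ISO Image",
        },
        {"name": "example_meta.xml", "format": "Metadata"},
    ]
}

digests = extract_sha1_digests(sample_item)
```

No file contents are touched at any point: the digests come straight out of the metadata.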

Generating the Merkle Tree

There have been a lot of ridiculously inefficient Bitcoin timestamping schemes,
using one, or even two Bitcoin transactions per timestamp. The insane thing is
these schemes actually get used on a large scale:

Not having $116 million of spare change lying around, I decided to use a merkle
tree instead – the advanced moon-math invented by Ralph Merkle in 1979[2].
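To make the idea concrete, here is a generic merkle tree sketch: many digests are hashed pairwise down to a single root, and each leaf gets a short path of sibling digests that links it to that root. This is an illustration only and not the exact construction OpenTimestamps uses; the function names, the choice of SHA256 for inner nodes, and the rule of promoting an unpaired node unchanged are all assumptions.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every level of the tree, leaves first, root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), 2):
            if i + 1 < len(prev):
                nxt.append(sha256(prev[i] + prev[i + 1]))
            else:
                nxt.append(prev[i])  # unpaired node is promoted unchanged
        levels.append(nxt)
    return levels


def proof_for(levels, index):
    """Return the sibling path [(side, digest), ...] from a leaf to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "left" if sibling < index else "right"
            path.append((side, level[sibling]))
        index //= 2
    return path


def root_from_proof(leaf, path):
    """Recompute the root from a leaf and its sibling path."""
    digest = leaf
    for side, sibling in path:
        if side == "left":
            digest = sha256(sibling + digest)
        else:
            digest = sha256(digest + sibling)
    return digest


# Five made-up "file digests" standing in for SHA1 digests from the archive.
leaves = [hashlib.sha1(b"file %d" % i).digest() for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
proofs = [proof_for(levels, i) for i in range(len(leaves))]
```

Only `root` needs to go into a Bitcoin transaction. Each proof grows with the logarithm of the number of leaves, which is why one transaction can cover hundreds of millions of digests.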

To save time I decided to re-use the OpenTimestamps Calendar Server codebase.
The way a calendar server normally works is it maintains an append-only journal
of commitments (digests) it has promised to timestamp, and a database of
timestamps for those commitments.

The actual database is a simple LevelDB key:value mapping, with the keys being
the messages, and the values being the commitment operations and
notary attestations (namely Bitcoin headers) that comprise the timestamp tree
nodes. When a client asks for a timestamp for a given message, the server just
walks the tree recursively.
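The lookup described above can be sketched with a plain dict standing in for LevelDB. The entry encoding here (tuples tagged `"op"` or `"attestation"`), the operation names, and the `walk` function are invented for illustration; the real calendar server serializes its nodes differently.

```python
import hashlib

# Operations that turn one message into the next message up the tree.
OPS = {
    "append": lambda msg, arg: msg + arg,
    "prepend": lambda msg, arg: arg + msg,
    "sha256": lambda msg, arg: hashlib.sha256(msg).digest(),
}


def walk(db, msg):
    """Recursively collect every path from msg to an attestation.

    db maps a message to a list of entries, each either
    ("op", name, argument) or ("attestation", description).
    Returns a list of (operations, attestation) pairs.
    """
    results = []
    for entry in db.get(msg, []):
        if entry[0] == "attestation":
            results.append(([], entry[1]))
        else:
            _, name, arg = entry
            next_msg = OPS[name](msg, arg)
            for ops, attestation in walk(db, next_msg):
                results.append(([(name, arg)] + ops, attestation))
    return results


# A two-leaf tree: both leaves lead to the same root and attestation.
leaf_a = hashlib.sha1(b"file a").digest()
leaf_b = hashlib.sha1(b"file b").digest()
joined = leaf_a + leaf_b
root = hashlib.sha256(joined).digest()

db = {
    leaf_a: [("op", "append", leaf_b)],
    leaf_b: [("op", "prepend", leaf_a)],
    joined: [("op", "sha256", b"")],
    root: [("attestation", "hypothetical Bitcoin block header")],
}

timestamp_a = walk(db, leaf_a)
timestamp_b = walk(db, leaf_b)
```

A message the database has never seen simply yields an empty list, which is the situation a forged file would find itself in.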

However, normally calendars will only timestamp a few thousand commitments at a
time. So the code to generate merkle trees and add them to the database isn’t
particularly efficient – it doesn’t have to be – and even worse, keeps
multiple copies of the entire tree in memory until written, one for each time
the fees are bumped with a replacement transaction.

I knew that code was going to fall over trying to timestamp hundreds of
millions of digests, so I quickly hacked up a more efficient incremental
import script for the initial merkle tree that wrote to the database
level-by-level incrementally, keeping nothing in memory. The idea being that
subsequent timestamping could be done with much smaller, per-day, trees.
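The level-by-level approach might look roughly like the sketch below, which is not the actual import script. It assumes each level is stored on disk as a flat file of fixed-size digests and that leaves have already been hashed to 32 bytes; a dict stands in for the LevelDB database, and the file layout and function names are hypothetical. Only two digests from the level being processed are held in memory at any moment.

```python
import hashlib
import os
import tempfile

DIGEST_LEN = 32  # every record on disk is one SHA256-sized digest


def read_digests(path):
    """Stream fixed-size digests from a file, one at a time."""
    with open(path, "rb") as f:
        while True:
            digest = f.read(DIGEST_LEN)
            if not digest:
                return
            yield digest


def write_next_level(in_path, out_path, db):
    """Pair up one level's digests, record the links in db, write the parents.

    Returns the number of digests written to the next level.
    """
    count = 0
    digests = read_digests(in_path)
    with open(out_path, "wb") as out:
        for left in digests:
            right = next(digests, None)
            if right is None:
                parent = left  # unpaired node is promoted unchanged
            else:
                parent = hashlib.sha256(left + right).digest()
                db[left] = ("append", right)
                db[right] = ("prepend", left)
            out.write(parent)
            count += 1
    return count


def import_tree(leaf_path, db, workdir):
    """Build the tree one level at a time and return the root digest."""
    level, path = 0, leaf_path
    while True:
        next_path = os.path.join(workdir, "level-%d" % (level + 1))
        count = write_next_level(path, next_path, db)
        if count == 0:
            raise ValueError("no leaves to import")
        if count == 1:
            with open(next_path, "rb") as f:
                return f.read(DIGEST_LEN)
        level, path = level + 1, next_path


db = {}
leaves = [hashlib.sha256(b"leaf %d" % i).digest() for i in range(5)]
with tempfile.TemporaryDirectory() as workdir:
    leaf_path = os.path.join(workdir, "level-0")
    with open(leaf_path, "wb") as f:
        f.write(b"".join(leaves))
    root = import_tree(leaf_path, db, workdir)
```

The trade-off is extra disk I/O in exchange for memory use that stays flat regardless of how many digests are imported. This sketch also assumes leaf digests are unique; duplicates would overwrite each other's database entries.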

The User-Interface

In parallel, Riccardo Casatta and Luca Vaccaro of Eternity Wall, and Igor
Barinov were working on the code and graphics design for the database UI. They
did 100% of the work on the website – this description is second hand – but
what they’ve told me is the site is essentially a wrapper around the Internet
Archive advanced search API that additionally queries the calendar I set up.

Timestamping the Merkle Tree

I gotta admit, I figured I’d get a few small donations; in the end I got dozens
of donations totalling 0.218 BTC! Those donations got combined into one output
in tx 564d27fc17068e8d4c997a86287fe79b37b07552b3fb5e3c11c1a3d4fd933882, with
the actual timestamp being 8465d34ede9e3387cfd7aacae880cfc86a5bdc603d8822e3b0e7c1369f8acfa8.

The final step of adding the transaction to the database was done manually with
the python-opentimestamps library, about an hour and a half prior to when I was
supposed to give a talk and live demo.

In fact, as I soon found out prepping for the demo, I’d managed to completely
miss an entire year in my initial merkle tree, and part of two other years, as
I forgot to copy a few directories! For the live demo I wanted to have the
audience pick what I’d search for; I was out of time at that point so I crossed
my fingers and hoped it’d work the first try, which it fortunately did.

SHA1 is Good Enough for Timestamps!

A limitation of our approach is that we’re restricted to timestamping the
digests the Internet Archive API gives us, the strongest of which is SHA1.
While it is true that SHA1 has been broken, that break isn’t relevant for
timestamping: while it is possible to generate two messages with the same SHA1
digest, both messages have to be generated simultaneously. For the purpose of a
timestamp proof this is totally OK! All we care about is preimage attacks that
find a message with a specific hash digest. SHA1 is not vulnerable to those
attacks; with the one exception of Snefru-2, there are no examples of any modern
cryptographic hash function being vulnerable to preimage attacks[3].

What’s Next?

Browser Compatibility

Currently the database UI works fine on Chrome, but has issues with Firefox and
Safari; we’re working towards supporting all major browsers.

Fixed!

Improved Coverage

Unfortunately our initial timestamping effort wasn’t 100% complete – as we
later found, some of the items in the archive are entirely missing the “added-on”
metadata field that we used to find all items. The Internet Archive is also not
a consensus system, so random errors may have left some items without
timestamps as well. So we’re doing another pass to make sure we have 100%
coverage.

Mirrors

While you can download the raw LevelDB database of
Internet Archive timestamps, we could use a better way to mirror this work. In
particular, we need a format for which it’s easy to upload the timestamps
generated to the Internet Archive. We also need a format that can be downloaded
incrementally, even as new timestamps are added. This problem exists for
OpenTimestamps in general, so I hope to solve it for both use-cases at once.

Web Captures

Behind the scenes the Wayback Machine uses the Web ARChive (WARC) archive format to
archive web crawl data. These archives are stored in the archive as items, with
dozens of collections available such as the Common Crawl.

This means that we have timestamped the underlying internet crawl data as
part of this effort. However we have two problems:

  1. Much of the crawl data isn’t publicly accessible for various reasons such
    as embargo agreements. Without that raw data, third parties like you or me can’t
    actually verify the timestamp proofs.

  2. The granularity of the timestamps isn’t on a per-capture basis, so even if
    you do get the raw data, it’s inconvenient to verify.

The second problem is obvious: we’ve only timestamped a few hundred million
individual files, while the Wayback Machine has captured over half a
trillion[4] individual web objects! Scaling up our effort to
the entire Wayback Machine is going to be a lot of work – and it’ll need the
direct involvement of the Internet Archive – but what I’d like to see in the
future is the ability for an advanced user to download a snapshot and all
resources referenced by the snapshot as some kind of WARC capture, extended
with timestamp proofs.

Footnotes

  1. In fact, as I’m on record as having discussed crypto-currencies with Adam Back and Hal Finney back in 2001, I guess I’m a possible suspect for such a fraud!

  2. “Method of providing digital signatures”, Ralph Merkle, US Patent 4,309,569

  3. “Lessons From The History Of Attacks On Secure Hash Functions”, Zooko Wilcox, accessed 2017-02-24

  4. “Defining Web pages, Web sites and Web captures”, Vinay Goel, Internet Archive Blogs, Oct 23rd 2016