Created by Materia for OpenMind Recommended by Materia
Start Rescuing the Forgotten Jewels of the Internet
21 November 2019

Rescuing the Forgotten Jewels of the Internet

Estimated reading time Time 4 to read

You only need to see how quickly a publication on a social network becomes obsolete to realize something intrinsic to digital activity: its ephemeral nature. For years now, UNESCO has been calling for the preservation of the “enormous treasure trove of information” produced online. The effort of different entities has allowed this mission to be launched, and the pioneering project is that of the Internet Archive organization: a massive rescue of the forgotten jewels of the Internet.

391 billion web pages, 20 million books and texts and more than 11.5 million audios, videos, images and software programs. This is the content of the enormous library of Internet Archive. And every day, thousands of other rescued items are added to the collection, which in addition to websites (here, for example, is the first saved version of OpenMind), has room for television channels, radio stations and academic articles.

Headquarters of Internet Archive in San Francisco. Credit: Girl2k

The WayBack Machine tool saves a replica of virtually any web page before it disappears (even at the request of users). A copy of all archived material occupies more than 45 petabytes. “And we store at least two copies of everything,” they say.

A resource for researchers

Why is so much work necessary? One reason is given by Mark Graham, the director of WayBack Machine: “Because if a web site goes down, a company goes out of business, a government changes, a content management system is changed without care, or the content of a web page is changed, that information may be forever lost,” he explains to OpenMind.            

“The web is obviously the media of our time. Content on the web is key to understanding society and will be an invaluable resource for future researchers,” adds Julien Masanès, former CEO of the Internet Memory Foundation, an institution for archiving web content on a European scale that was active until 2018.

Sometimes Internet Archive rescue operations are a real race against time. One of the most recent examples is that of the Yahoo! Groups pages, a set of public and restricted forums where users from all over the world have debated a variety of topics over the last few years. Yahoo has decided to delete all public content as of December 14, 2019 and the Internet Archive has quickly jumped into action. “The Internet Archive is on a mission to save as much as possible…,” it posted on LinkedIn.

Internet Archive has reviewed and edited the links on more than 14 million Wikipedia pages in 30 languages. Credit: Sai5

One of the criteria that guide the archivists of this institution is to pursue the “exhaustiveness” of the content, explains Graham. “There are more than 150,000 sources of “news” in the world,” he explains. And one important aspect is to find out the origin of the material, “to ensure the integrity of what is archived” and “to be able to trust that the source is the source and that the content has not been altered.”

Among the Internet Archive’s activities in recent years, one has been to review and edit the links on more than 14 million Wikipedia pages in 30 languages, store more than 11 million of them in its archives, and transform “130,000 book citations into direct links to 50,000 digitized volumes.” The goal, they insist, is to make the web “more reliable.”

Saving information from government restrictions

The commitment of this large, non-profit digital library based in San Francisco is also due to the concern that information from the digital world may suddenly disappear. The worst-case scenario imaginable, Graham says, is that of “nuclear war destroying vast amounts of humanities collected knowledge.” But digital heritage could also be compromised by governments that see a threat in the presence of a lot of documentary information, he adds.

Internet Archive transform 130,000 book citations into direct links to 50,000 digitized volumes. Credit: Dvortygirl

With this argument, the organization’s founder, Brewster Kahle, justified the creation of an entire copy of the archive in Canada shortly after Donald Trump won the U.S. election. He said the new president’s administration implied that there could be “greater restrictions.” “Government surveillance is not going away; indeed, it looks like it will increase,” he wrote in a post.

And, as Graham points out, “most countries in the world have no formal program to archive the digital content produced by their citizens or government.” “We currently miss large portions of public and valuable information that is published through platforms,” says Masanès. “Think about the importance of Twitter in the current political debate.”

Massive and selective sweeps

There are currently about 4.1 billion Internet users in the world, according to the International Telecommunication Union. Is it possible to store all the digital information they generate? How to choose what to store? Both Graham and Masanès agree that, with current resources, only a small part of the digital heritage can be saved.

Graham believes that there is content that “can or should be saved completely,” such as those of public administrations, NGOs and the academic world. But in other cases, such as in social networks, he considers it more affordable to determine selection criteria.

Mirror  servers of the Internet Archive in the Bibliotheca Alexandrina. Credit: Nikola Smolenski

The National Library of Spain, for example, does web archiving work based on both massive and selective gathering. As Mar Pérez, Director of Digital Processes and Services at the library, explains, for the last decade (relying on the Internet Archive until 2013) this public institution has carried out an annual sweep of all the sites registered with the .es domain and has stored copies of 1,900,000 sites. It also maintains collections of specific areas (national or regional press, websites, network publications, blogs, videos or, for example, digital materials related to a specific topic). “This rescue work is not well known now, but its importance will become clear in a few decades,” he says. “If we don’t keep what happens on the web, in 50 years we won’t have any record of what has happened in our time.”

Masanès believes that there is still work to be done, and among the guidelines to be followed, he indicates the creation of institutions specifically focused on digital archiving. After the foundation that led this initiative closed (according to him due to “lack of funding”), the existence of a single digital archive for the whole of Europe is lacking. “That’s really a pity since creating such an infrastructure would cost less than the budget of a mid-sized museum or public library. And we would only need one for the whole of Europe!” he says.

“We have to make preserving our digital heritage as important as we have made the preservation of books,” says Mark Graham.

Francesco Rodella


Comments on this publication

Name cannot be empty
Write a comment here…* (500 words maximum)
This field cannot be empty, Please enter your comment.
*Your comment will be reviewed before being published
Captcha must be solved