Heritage project › web archiving

Websites are created, updated or removed every day. Consequently, we started this initiative with the aim of preserving our knowledge and heritage.

What

We target old, personal or niche websites.

Starting from a selected theme, we browse the Web and pick appropriate resources to be archived. For instance, these can be amateur websites about ham radio or videos featuring the "hurdy gurdy" (see the projects).

How

For traditional web archiving (e.g. full-page archiving, general-purpose archiving), we use Squidwarc, a NodeJS program which features a crawl engine and uses the browser automation library Puppeteer.
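
To give an idea of how such a crawler works under the hood, here is a minimal Puppeteer sketch (not Squidwarc itself; the seed URL is just an example) that loads a page and records every network response, the raw material a WARC-writing crawler needs:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Collect every response the page triggers (documents, scripts, images...).
  const records = [];
  page.on('response', async (response) => {
    try {
      records.push({
        url: response.url(),
        status: response.status(),
        headers: response.headers(),
        body: await response.buffer(), // payload to be written into a WARC record
      });
    } catch (e) {
      // some responses (e.g. redirects) carry no body
    }
  });

  await page.goto('https://example.org', { waitUntil: 'networkidle0' });
  console.log(`Captured ${records.length} responses`);
  await browser.close();
})();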

For some specific sites and targets (e.g. media only, picture hosting sites) and for discovery, we develop our own archiving tools with the NodeJS Puppeteer library.
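
For example, a site-specific discovery script may simply scan the rendered DOM for media links. The snippet below is only a sketch of the idea, not one of our actual tools, and the selectors and URL are illustrative:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.org/gallery', { waitUntil: 'networkidle2' });

  // Extract candidate media URLs from the rendered page.
  const mediaUrls = await page.$$eval('img[src], video[src], source[src]', (els) =>
    els.map((el) => el.src)
  );

  console.log(mediaUrls);
  await browser.close();
})();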

Websites are saved in a dedicated archiving format (the WARC format) and stored in several places (a home server and a cloud hosting service, or a public directory when possible).
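
To give an idea of what a WARC file contains, the sketch below hand-writes a single "response" record; in practice we rely on libraries (such as node-warc) rather than this simplified code:

const fs = require('fs');
const crypto = require('crypto');

// Build one WARC/1.0 "response" record: a block of WARC headers,
// a blank line, the captured HTTP message, and a trailing blank line.
function warcResponseRecord(targetUri, httpPayload) {
  const headers = [
    'WARC/1.0',
    'WARC-Type: response',
    `WARC-Record-ID: <urn:uuid:${crypto.randomUUID()}>`,
    `WARC-Date: ${new Date().toISOString().replace(/\.\d+Z$/, 'Z')}`,
    `WARC-Target-URI: ${targetUri}`,
    'Content-Type: application/http; msgtype=response',
    `Content-Length: ${Buffer.byteLength(httpPayload)}`,
  ].join('\r\n');
  return headers + '\r\n\r\n' + httpPayload + '\r\n\r\n';
}

const payload = 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>...</html>';
fs.writeFileSync('example.warc', warcResponseRecord('https://example.org/', payload));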

Other tools we use:

Heritrix

An open-source Java program that archives an entire website into WARC files. It is developed and used by the Internet Archive.

HTTrack

A well-known, open-source, multi-platform program to scrape websites.

WARCreate

A Chrome and Firefox extension to create a WARC file from a webpage.

Squidwarc

An open-source NodeJS program that uses Puppeteer (a Google Chrome/Chromium browser automation library) to archive sites.

Brozzler

Another open-source program developed by the Internet Archive. It fills serious gaps in Heritrix for modern sites that rely more and more on JavaScript to render webpages: instead of a simple page downloader, it uses a headless Chrome browser to render them.

Crocoite

Another open-source program that uses headless Chrome.

An in-house Chrome extension is also being developed to help with data extraction and archiving of visited webpages.
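
Since the extension is not public yet, the following content-script fragment is only a hypothetical sketch of the idea: collect resource URLs from the visited page and hand them to the extension's background script for archiving.

// Hypothetical content script: gather candidate URLs on the current page
// and send them to the background script for archiving.
const urls = [...document.querySelectorAll('img[src], video[src], source[src], a[href]')]
  .map((el) => el.src || el.href);

chrome.runtime.sendMessage({ type: 'archive-candidates', page: location.href, urls });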

For the moment, websites can be added to the waiting list by contacting me.

Contribute

At the very least, you can notify the r/Archiveteam community on Reddit, as well as myself. If you can, you may also send me the location of the resource to be archived.

Infrastructure

While we own numerous machines for archiving websites, we use most of them for development and testing purposes. We rely on fully owned Virtual Private Servers (VPS) located in Paris and on public cloud providers such as Scaleway, DigitalOcean or Amazon, spread across several regions; these are the main source of expenditure (5 to 20 euros per month).

These VPS are still managed manually, but a multi-platform tool is being developed internally to set up machines according to the requested workload.

The OS and the web crawler (a fork of Squidwarc with logging capabilities) are monitored by a NodeJS app that collects data to be filtered and displayed on Grafana dashboards.
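
One possible shape for such a monitoring shim, assuming Grafana reads from a Prometheus-style data source (the metric names below are invented for illustration):

const http = require('http');

// Counters the crawler's log watcher would increment.
let pagesArchived = 0;
let warcBytesWritten = 0;

// Expose the counters in the Prometheus text format for scraping.
http.createServer((req, res) => {
  if (req.url === '/metrics') {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end(
      `crawler_pages_archived_total ${pagesArchived}\n` +
      `crawler_warc_bytes_written_total ${warcBytesWritten}\n`
    );
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(9100);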

For the software stack, we mainly rely on NodeJS for the "core business": running traditional crawls or site-specific crawls with internally developed tools. Since the other tools used for discovery, management or monitoring are fragmented and complex, we plan to build software that integrates them seamlessly into a distributed, web archiving-centric platform (the thot project).

Legislation

We archive web content that is publicly available, under the right of private copy. On the other hand, we may hand over archives from French locations to French archiving institutions at their request (they benefit from an exception as a public archiving service), and other resources to the Internet Archive.

For legal reasons, archives are only tested and are not shared with anyone, except with other recognized archiving organizations.

Complaints

We take intellectual property seriously. If your website was archived even though you did not want it to be, please fill in this form.

Content targeted by a complaint will be deleted from our servers.