Heritage project / web archiving

Websites are created, updated or removed every day. We therefore started this initiative to preserve our knowledge and heritage.

What

We target old, personal or niche websites.

How

We mainly use Heritrix, the web crawler engine developed and used by the Internet Archive.

Websites are saved in a dedicated archiving format (the WARC format) and stored in several places (a home server and a cloud hosting service).
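To give an idea of what a WARC file contains, here is a minimal sketch in Python that serializes a single WARC/1.0 "resource" record by hand. The function name build_warc_resource_record and the example URL are illustrative assumptions and are not part of our crawling setup. In practice Heritrix writes these records itself, together with additional headers such as digests.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        # Every record carries a globally unique identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # Capture time, expressed in UTC.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Length of the content block in bytes.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # A blank line separates the headers from the content block,
    # and two CRLF sequences terminate the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_resource_record("http://example.com/", b"<html>hi</html>")
    print(record.decode("utf-8"))
```

A WARC file is simply a sequence of such records written one after another, which is why a single file can hold a whole crawled website.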

Tools we use:

Heritrix

An open-source Java program to archive an entire website to a WARC archive. It is developed and used by the Internet Archive organization.

HTTrack

A well-known, open-source, multi-platform program to mirror websites.

WARCreate

A Chrome and Firefox extension to create a WARC file from a webpage.

Brozzler

Another open-source program developed by the Internet Archive. It fills serious gaps in Heritrix for relatively recent sites, which rely more and more on JavaScript to render webpages. To handle them, it uses a headless Chrome browser to render webpages instead of a simple page downloader.

Crocoite

Another open-source program using headless Chrome.
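To illustrate what rendering a page with a headless browser means, here is a minimal sketch in Python that asks a locally installed Chromium to load a page, run its JavaScript and print the resulting DOM. It only shows the general idea: Brozzler and Crocoite drive the browser in a more elaborate way, and this is not their code. The binary name "chromium" and the helper names are assumptions to adapt to your system.

```python
import shutil
import subprocess
from typing import List, Optional


def headless_dump_command(url: str, browser: str = "chromium") -> List[str]:
    """Build the command line that renders a page and prints its DOM."""
    # --headless runs the browser without a window,
    # --dump-dom prints the serialized DOM after the page has loaded.
    return [browser, "--headless", "--disable-gpu", "--dump-dom", url]


def render_dom(url: str, browser: str = "chromium",
               timeout: int = 60) -> Optional[str]:
    """Return the rendered DOM as text, or None if the browser is missing."""
    if shutil.which(browser) is None:
        # The assumed binary name is not installed under that name.
        return None
    result = subprocess.run(
        headless_dump_command(url, browser),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    html = render_dom("http://example.com/")
    print(html if html is not None else "No Chromium binary found.")
```

A simple page downloader only sees the HTML sent by the server, whereas the output above also includes whatever the page's scripts added after loading.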

An in-house Chrome extension is being developed to help with data extraction and archiving on visited webpages.

Websites can be added to the waiting list by contacting me (for the moment).

For legal reasons, archives are only tested and are not shared with anyone except other recognized archiving organizations.

Status

As of early July, the crawling system is about to be installed and tested.

Contribute

The only way to help is to give Heritrix something to eat. You can contact me with the link to the target website.

Legislation

We archive publicly available web content under the private copying exception. However, we may hand over an archive at the request of a French archiving institution (such institutions benefit from an exception as public archiving services).

Claim

We take intellectual property seriously. If your website was archived against your wishes, please fill out this form.

Websites subject to a complaint will be deleted from our servers.