Websites are created, updated, and removed every day. Consequently, we started this initiative with the goal of preserving our knowledge and heritage.
We target old, personal, and niche websites.
We mainly use Heritrix, the web crawler engine developed and used by the Internet Archive.
Websites are saved in a dedicated archival format (the WARC format) and stored in several places (a home server and a cloud hosting service).
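To give an idea of what a WARC file looks like inside: it is a plain-text container in which each record starts with a version line, a block of named header fields, a blank line, and then the captured payload. As a minimal sketch (for illustration only, not our production tooling, and using a made-up example URL), a single "response" record can be assembled with nothing but the Python standard library:

```python
import uuid
from datetime import datetime, timezone

def make_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Assemble one minimal WARC 'response' record (illustrative sketch)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A record is: version line, header fields, blank line, payload,
    # then two CRLFs terminating the record.
    return head.encode() + b"\r\n" + http_payload + b"\r\n\r\n"

# Hypothetical captured HTTP response (not real crawl data).
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = make_warc_record("http://example.com/", payload)
```

A real WARC file is simply a sequence of such records (requests, responses, metadata), usually gzip-compressed per record; in practice, libraries such as warcio handle the details.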
Other tools we use:
An open-source Java program that archives an entire website into a WARC file. It is developed and used by the Internet Archive.
A well-known, open-source, multi-platform program for scraping websites.
A Chrome and Firefox extension that creates a WARC file from a webpage.
Another open-source program that uses headless Chrome.
An in-house Chrome extension is under development to help with data extraction and archiving of visited webpages.
For now, websites can be added to the waiting list by contacting me.
For legal reasons, archives are only tested, not shared with anyone, except with other recognized archiving organizations.
As of early July, the crawling system is about to be installed and tested.
The only way to help is to give Heritrix something to eat: contact me with the target link.
We archive web content that is publicly available, under the right of private copy. We may, however, hand over an archive at the request of a French archiving institution (as public archiving services, they benefit from a legal exception).
We take intellectual property seriously. If your website was archived against your wishes, please fill out this form.
Any content that is the subject of a complaint will be deleted from our servers.