
A First Experience in Archiving the French Web

Presentation Transcript


  1. A First Experience in Archiving the French Web Serge Abiteboul, Grégory Cobéna (INRIA) Julien Masanès (BnF) Gérald Sédrati (Xyleme)

  2. Organization • Web Archiving • Dépôt Légal (legal deposit) • Goal and scope • Similar Projects • Building the Archive • Frontier of the French Web • Site vs. Page archiving • Data acquisition • Importance of pages • Site-based importance • New measures • Experiments on ranking • Representing Changes

  3. Web Archiving

  4. Dépôt légal (legal deposit) • Today • Books have been archived since 1537, by decision of King François I • The Web is an important and valuable source of information • What is different? • The number of content providers (≈300,000 site publishers vs. 5,000 traditional publishers) • The quantity of information (millions of pages, plus video and audio content) • The quality of information (much of it is not meaningful) • The relationship with the editors (freedom of publication vs. the traditional ‘push’ model) • Updates and changes occur continuously • The perimeter “contents published in France” does not apply easily to the Web

  5. Goal and Scope • Provide future generations with a representative archive of the cultural production, for cultural, political, sociological studies, etc. • The mission is to archive a wide range of material, because nobody knows what will be of interest for future research • In traditional publication, publishers filter content; the issue of selection arises again

  6. Similar Projects • Human-selection-based approach • Select a few hundred sites and choose a periodicity of archiving • Australia [15] and Canada [11] • The Nordic experience • Use a robot crawler to archive a significant part of the surface Web • Sweden, Finland, Norway [2] • Problems: • No updates of archived pages between two snapshots • The deep or invisible Web [17, 3]

  7. Orientation of this experiment • Goals: • Cover a large portion of the Web • Automatic content gathering is necessary • Adapt robots to provide a continuous archiving facility • Have frequent versions of the sites, at least for the most “important” ones • Research: • The notion of “important” sites • Building a coherent Web archive • Discovering and managing important sources of the deep Web

  8. Building the Archive

  9. The frontier of the French Web • The perimeter of the French Web is: “contents edited in France” • Many criteria may be used: • The French language, but many French sites use English, and other French-speaking countries or regions (e.g. Quebec) use French • Domain name or resource locators: .fr sites, but also .com or .org • Address of the site: physical location of the web servers, or address of the owner • Other criteria: the BnF has little interest in commercial sites • A purely librarian-driven process does not scale; a purely automatic one does not work • The process should involve librarians and their expertise (a combination of such criteria is sketched below)
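The combination of criteria can be prototyped as a simple scoring heuristic that flags candidate sites for librarian review. The sketch below is illustrative only: the thresholds, the tiny French word list and the idea of a single numeric score are assumptions, not the BnF's actual selection rules.

```python
# Illustrative sketch: combine several weak signals to decide whether a site is a
# candidate for the "French Web" perimeter. All thresholds and the word list are
# hypothetical; real selection involves librarians, not a cut-off alone.
from urllib.parse import urlparse

FRENCH_HINTS = {"le", "la", "les", "des", "une", "dans", "pour", "avec", "sur"}

def looks_french(text: str) -> bool:
    """Very rough language hint: share of common French function words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in FRENCH_HINTS)
    return hits / len(words) > 0.05  # hypothetical threshold

def perimeter_score(url: str, sample_text: str, server_in_france: bool) -> int:
    """Score a site on the criteria listed above; higher = more likely in scope."""
    host = urlparse(url).netloc
    score = 0
    if host.endswith(".fr"):
        score += 2                      # domain-name criterion
    if looks_french(sample_text):
        score += 1                      # language criterion (weak on its own)
    if server_in_france:
        score += 1                      # physical-location criterion
    return score

# Sites above a cut-off would be queued for librarian review, not archived blindly.
print(perimeter_score("http://www.example.fr", "la page pour les lecteurs", True))
```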

  10. Site vs. page archiving • The Web: • Physical granularity = HTML pages • + layout, images, … • The problem is inconsistent data and links • Read page P; one week later, read the pages pointed to by P: they may not exist anymore • Logical granularity? • Snapshot view of a web site • What is a site? • INRIA is www.inria.fr + www-rocq.inria.fr + … • www.free.fr hosts many different sites • There are technical issues (rapid firing, …); see the sketch below
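As a rough illustration of what site granularity and the rapid-firing issue mean in practice, the sketch below groups URLs by host and enforces a minimum delay between two requests to the same host. The one-host-one-site rule is a deliberate simplification of the problem described above (INRIA spans several hosts, while www.free.fr hosts many unrelated sites), and the delay value is an assumption.

```python
# Minimal sketch: group URLs into "sites" by host and throttle per-site fetches.
# The one-host-one-site rule and the 1-second delay are simplifying assumptions.
import time
from collections import defaultdict
from urllib.parse import urlparse

class PoliteScheduler:
    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay          # seconds between hits on one host
        self.last_hit = defaultdict(float)  # host -> timestamp of last fetch

    def site_of(self, url: str) -> str:
        return urlparse(url).netloc

    def wait_and_mark(self, url: str) -> None:
        host = self.site_of(url)
        elapsed = time.time() - self.last_hit[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # avoid rapid firing on one site
        self.last_hit[host] = time.time()

sched = PoliteScheduler()
for u in ["http://www.inria.fr/a", "http://www.inria.fr/b", "http://www.free.fr/x"]:
    sched.wait_and_mark(u)
    print("fetch", u)
```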

  11. Data acquisition • Crawl • For these experiments, we used the Xyleme [19] crawler • Discovery • The Web is more than 2 billion pages • The French Web is about 20 million URLs • First experiments using <*.fr> • Refresh • Based on the change rate of the data • We use a site change rate based on the pages’ change rates • Important pages are refreshed more often • The change rate of pages is unknown (a refresh-priority sketch follows below)
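One way to read the refresh policy is as a priority over pages that combines estimated change rate and importance. The sketch below is a hypothetical illustration of that idea; the Poisson staleness estimate and the toy numbers are assumptions, not Xyleme's actual scheduler.

```python
# Hypothetical refresh priority: pages that are important and likely to have
# changed since the last crawl come first. Numbers are illustrative only.
import math

def refresh_priority(importance: float, change_rate_per_day: float,
                     days_since_crawl: float) -> float:
    # Probability the page changed at least once, assuming Poisson-distributed changes.
    p_changed = 1.0 - math.exp(-change_rate_per_day * days_since_crawl)
    return importance * p_changed

pages = [
    ("news front page", 0.9, 3.0, 1.0),    # important, changes several times a day
    ("static about page", 0.9, 0.01, 30.0),
    ("obscure page", 0.1, 1.0, 1.0),
]
for name, imp, rate, age in sorted(pages, key=lambda p: -refresh_priority(*p[1:])):
    print(f"{name:20s} priority={refresh_priority(imp, rate, age):.3f}")
```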

  12. Importance of pages

  13. What is page importance? • The “Le Louvre” homepage is more important than an unknown person’s homepage • Important pages are pointed to by: • Other important pages • Many unimportant pages • Can be compared to bibliographical references • This leads to Google’s [5] definition of PageRank • Based on the graph and link structure • Used with remarkable success • Useful, but not sufficient for Web archiving; we need other criteria as well

  14. Page importance computation • Importance • Link matrix L • Page importance is the fixpoint X of the equation L*X = X (i.e. important pages are pointed to by important pages) • Storing the link matrix and computing page importance uses a lot of resources • We developed [1] a new, efficient technique to compute the fixpoint • Without having to store the link matrix • The technique adapts automatically to changes
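For intuition, the fixpoint of L*X = X can be approximated by textbook power iteration over the normalised link graph. The sketch below is exactly that classical version, with an explicit adjacency list; it is not the online, matrix-free algorithm of [1], and the damping factor and toy graph are assumptions.

```python
# Textbook power iteration for page importance: repeatedly apply the normalised
# link matrix until the importance vector stabilises (the fixpoint of L*X = X).
# This is NOT the online algorithm of [1], which avoids storing the link structure.
def page_importance(out_links, damping=0.85, iters=50):
    pages = list(out_links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            if not targets:
                continue
            share = damping * x[p] / len(targets)   # spread importance over out-links
            for q in targets:
                new[q] += share
        x = new
    return x

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(page_importance(toy_graph))
```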

  15. Using stronger link semantics • Limitations of page importance • Traditional page importance works well when links carry strong semantics (e.g. the author links to web pages that he likes) • More and more web pages are automatically generated, and most of their links carry little semantics • Refresh at the page level also presents drawbacks • So we use the link topology between sites, not only between pages • We also use the internal structure of a Web site to determine which links are more important

  16. Site-based importance • The “random walk” model is used to determine a site’s internal link structure and assign an importance to each link • From there, we define links between Web sites • Using the standard importance definition and the “random walk” model, the importance of a Web site is exactly the sum of its pages’ importances (see the aggregation sketch below)
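A direct consequence of the last point is that a site's importance can be obtained by summing the importances of its pages. The aggregation sketch below assumes page importances have already been computed (e.g. as in the previous sketch) and uses the host name as the site key, which is the same simplification flagged earlier.

```python
# Sketch: aggregate page importance into site importance by summing over pages.
# Using the host as the site key is a simplification of the real site definition.
from collections import defaultdict
from urllib.parse import urlparse

def site_importance(page_importance):
    sites = defaultdict(float)
    for url, imp in page_importance.items():
        sites[urlparse(url).netloc] += imp
    return dict(sites)

pages = {
    "http://www.inria.fr/index.html": 0.02,     # illustrative importances
    "http://www-rocq.inria.fr/verso/": 0.01,
    "http://www.louvre.fr/": 0.05,
}
print(site_importance(pages))
```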

  17. New criteria for importance • Frequent use of infrequent words (find pages dedicated to a specific topic) • Text weight (distinguish pages with textual content from raw-data pages) • Others (rough approximations of the first two are sketched below)
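The first two criteria can be approximated with simple text statistics: a TF-IDF-style score rewards frequent use of words that are rare in the collection, and a text-weight ratio separates prose pages from raw-data or markup-heavy pages. The sketch below is a hypothetical approximation; the measures actually used in the paper may differ.

```python
# Hypothetical approximations of the two criteria above.
import math
import re
from collections import Counter

def infrequent_word_score(page_text, doc_freq, n_docs):
    """Reward frequent use (on this page) of words that are rare in the collection."""
    words = re.findall(r"[a-zA-Zàâçéèêëîïôùûü]+", page_text.lower())
    counts = Counter(words)
    score = 0.0
    for w, tf in counts.items():
        df = doc_freq.get(w, 1)
        score += tf * math.log(n_docs / df)        # TF-IDF-style weight
    return score / max(len(words), 1)

def text_weight(html):
    """Share of characters that are visible text rather than markup."""
    text = re.sub(r"<[^>]*>", "", html)
    return len(text) / max(len(html), 1)

print(text_weight("<html><body><p>Une page avec du texte.</p></body></html>"))
```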

  18. Validation of the ‘notoriety’ parameter • Blind experiment with 8 librarians • A list of 900 sites with the notoriety parameter provided by Xyleme • 236 sites remained after excluding commercial sites and sites no longer existing at the time of the test

  19. Does the ranking correlate with the librarians’ choices? [Chart comparing the notoriety ranking against a random-choice baseline]

  20. A model for representing changes • Move from a discrete, snapshot-type archive to a more continuous one

  21. Representing changes • Goals • Provide a historical view of the Web • Issues • Provide persistent identification of web pages using their URL and date of crawl • Support temporal queries and provide means to access data efficiently • Handle mirror sites in order to save resources

  22. The “site-delta” representation of changes • The “site-delta” is an XML document • It is used to manage metadata about documents, in particular temporal metadata • Important aspects are: • Storage efficiency • Keep the crawled information without duplicates • Use a diff to understand changes when possible • Management of versions and updates • Support for queries and browsing (the no-duplicates idea is sketched below)
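To make the storage-efficiency point concrete, the sketch below records a crawl of one page: if the content hash matches the previous version, only metadata is appended (status "unchanged", pointing to the already-stored file); otherwise the new content would be stored. The status values and file-path style follow the example on the last slide, but the hash-based duplicate check is an assumption about how the check could be done.

```python
# Sketch of the storage-efficiency idea behind the site-delta: do not store a new
# file when the crawled content is identical to the previous version.
import hashlib

def record_crawl(page_versions, date, content):
    digest = hashlib.sha1(content).hexdigest()
    if page_versions and page_versions[-1]["hash"] == digest:
        entry = {"date": date, "status": "unchanged",
                 "file": page_versions[-1]["file"], "hash": digest}
    else:
        path = f"/data/{digest[:8]}.htm"
        # here the new content would be written to `path`
        entry = {"date": date, "status": "updated", "file": path, "hash": digest}
    page_versions.append(entry)
    return entry

versions = []
record_crawl(versions, "2002-jan-04", b"<html>v1</html>")
record_crawl(versions, "2002-jan-22", b"<html>v2</html>")
record_crawl(versions, "2002-mar-02", b"<html>v2</html>")   # unchanged, file reused
for v in versions:
    print(v["date"], v["status"], v["file"])
```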

  23. Browsing the archive • The archive must be prepared in several steps: • Use local links instead of Internet links (problems occur with JavaScript, with sessions, …) • Fix inconsistent data and links • Integrate the notion of time into links (a link-rewriting sketch follows) • Advanced: summarize several snapshots of data into a single document? • Consider, for example, the news site www.lemonde.fr • We want to give access to all news articles from January 2002 (and their versions)
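A first preparation step, rewriting links so they point inside the archive and carry a crawl date, could look roughly like the sketch below. The /archive/&lt;date&gt;/&lt;host&gt;/&lt;path&gt; URL scheme is purely hypothetical, and the regex only handles plain href attributes (JavaScript and session links, as noted above, are harder).

```python
# Hypothetical link rewriting for archive browsing: absolute links become
# archive-local, time-stamped links. The /archive/<date>/... scheme is assumed.
import re
from urllib.parse import urlparse

def rewrite_links(html, crawl_date):
    def to_archive(match):
        url = match.group(2)
        parts = urlparse(url)
        local = f"/archive/{crawl_date}/{parts.netloc}{parts.path or '/'}"
        return f'{match.group(1)}"{local}"'
    # Only rewrites straightforward href="http(s)://..." links.
    return re.sub(r'(href=)"(https?://[^"]+)"', to_archive, html)

page = '<a href="http://www.lemonde.fr/article1.html">article</a>'
print(rewrite_links(page, "2002-01-15"))
```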

  24. Conclusion and experiments • A crawl of the Web was performed • We used between 2 and 8 PCs for the Xyleme crawlers • We looked at more than 1 billion (of the most interesting) pages • We discovered 15 million *.fr pages (about 1.5%) • We discovered 150,000 *.fr sites • Discovery and refresh are based on page importance • They also take into account the change rate of pages • We analyzed the relevance of page importance for librarians • Comparison with ranking by librarians • Strong correlation with their rankings • Next, we plan to use classification and clustering techniques to refine the notion of site

  25. Merci Grégory Cobéna (INRIA)

  26. Example

  <website url="www.inria.fr">
    <page url="/index.html">
      <document date="2002-jan-04" status="updated" file="/data/fV453.htm"/>
      <document date="2002-jan-22" status="updated" file="/data/hX678.htm"/>
      <document date="2002-mar-02" status="unchanged" file="/data/hX678.htm"/>
      …
    </page>
    …
  </website>
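A temporal query over this structure ("which version of /index.html was live on a given date?") can be answered from the metadata alone. A minimal sketch, assuming the site-delta is stored as well-formed XML and that dates have been normalised to an ISO form so they sort as strings (both assumptions; the slide's date format would need conversion first):

```python
# Minimal sketch: answer "which version of this page was live on date D?"
# from site-delta metadata. Assumes well-formed XML and ISO-normalised dates.
import xml.etree.ElementTree as ET

SITE_DELTA = """
<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-01-04" status="updated"   file="/data/fV453.htm"/>
    <document date="2002-01-22" status="updated"   file="/data/hX678.htm"/>
    <document date="2002-03-02" status="unchanged" file="/data/hX678.htm"/>
  </page>
</website>
"""

def version_as_of(xml_text, page_url, date):
    root = ET.fromstring(xml_text)
    for page in root.findall("page"):
        if page.get("url") != page_url:
            continue
        # Keep only versions crawled on or before the requested date.
        candidates = [d for d in page.findall("document") if d.get("date") <= date]
        if candidates:
            return max(candidates, key=lambda d: d.get("date")).get("file")
    return None

print(version_as_of(SITE_DELTA, "/index.html", "2002-02-01"))  # -> /data/hX678.htm
```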
