A First Experience in Archiving the French Web Serge Abiteboul, Grégory Cobéna (INRIA) Julien Masanès (BnF) Gérald Sédrati (Xyleme)
Organization • Web Archiving • Dépôt Légal (legal deposit) • Goal and scope • Similar Projects • Building the Archive • Frontier of the French Web • Site vs. Page archiving • Data acquisition • Importance of pages • Site-based importance • New measures • Experiments on ranking • Representing Changes Grégory Cobéna (INRIA)
Dépôt légal (legal deposit) • Today • Books have been archived since 1537, following a decision by King François I • The Web is an important and valuable source of information • What is different? • The number of content providers (≈300,000 site publishers vs. 5,000 traditional publishers) • The quantity of information (millions of pages, plus video and audio content) • The quality of information (much of it is not meaningful) • The relationship with the editors (freedom of publication vs. the traditional ‘push’ model) • Updates and changes occur continuously • The perimeter “contents published in France” does not apply easily to the Web
Goal and Scope • Provide future generations with a representative archive of cultural production for cultural, political, and sociological studies, etc. • The mission is to archive a wide range of material, because nobody knows what will be of interest for future research • In traditional publishing, publishers filter the content; on the Web, the issue of selection arises again
Similar Projects • Human-selection-based approach • Select a few hundred sites and choose a periodicity of archiving • Australia [15] and Canada [11] • The Nordic experience • Use a robot crawler to archive a significant part of the surface Web • Sweden, Finland, Norway [2] • Problems are: • Lack of updates of archived pages between two snapshots • The deep or invisible Web [17,3]
Orientation of this experiment • Goals: • Cover a large portion of the Web • Automatic content gathering is necessary • Adapt robots to provide a continuous archiving facility • Have frequent versions of the sites, at least for the most “important” ones • Research: • The notion of “important” sites • Building a coherent Web archive • Discover and manage important sources of the deep Web
The frontier of the French Web • The perimeter of the French Web is “contents edited in France” • Many criteria may be used: • The French language: but many French sites use English, and other French-speaking countries or regions (e.g. Quebec) use French • Domain name or resource locator: .fr sites, but also .com or .org • Address of the site: physical location of the web servers or address of the owner • Other criteria: BnF has little interest in commercial sites • A purely librarian-driven process does not scale; a purely automatic one does not work • The process should involve librarians and their expertise
Site vs. Page archiving • The Web: • Physical granularity = HTML pages (+ layout, images, …) • The problem is inconsistent data and links • Read page P; one week later, read the pages pointed to by P: they may no longer exist • Logical granularity? • Snapshot view of a web site • What is a site? • INRIA is www.inria.fr + www-rocq.inria.fr + … • www.free.fr hosts many different sites • There are technical issues (rapid firing, …)
Data acquisition • Crawl • For these experiments, we used the Xyleme [19] crawler • Discovery • The Web is more than 2 billion pages • The French Web is about 20 million URLs • First experiments using <*.fr> • Refresh • Based on the change rate of the data • We use a site change rate based on the pages’ change rate • Important pages are refreshed more often • The change rate of pages is unknown
What is page importance? • The “Le Louvre” homepage is more important than an unknown person’s homepage • Important pages are pointed to by: • Other important pages • Many unimportant pages • Can be compared to bibliographical references • This leads to Google’s [5] definition of PageRank • Based on the graph and link structure • Used with remarkable success • Useful, but not sufficient for Web archiving: we need to use other criteria as well
Page Importance Computation • Importance • Link matrix L • Page importance is the fixpoint X of the equation L*X = X (i.e. important pages are pointed to by important pages) • Storing the link matrix and computing page importance uses lots of resources • We developed [1] a new, efficient technique to compute the fixpoint • Without having to store the link matrix • The technique adapts automatically to changes
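To make the fixpoint equation L*X = X concrete, here is a minimal sketch using plain power iteration. This is the textbook method, not the technique of [1]: it keeps the whole link graph in memory, which is exactly the cost the slide says [1] avoids. The function name, the toy graph, and the assumptions that every page has at least one outgoing link and that the graph is strongly connected and aperiodic are illustrative, not from the original.

```python
def page_importance(out_links, iterations=200):
    """Approximate the fixpoint X of L*X = X by power iteration.

    out_links maps each page to the list of pages it points to.
    Assumption: every page appears as a key and has at least one out-link.
    """
    pages = sorted(out_links)
    # Start from a uniform distribution of importance.
    x = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            # Each page splits its importance evenly among the pages it links to.
            share = x[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        x = nxt
    return x


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
importance = page_importance(graph)
```

On this toy graph the scores settle at 0.4 for `a`, 0.2 for `b`, and 0.4 for `c`: `c` is pointed to by two pages and passes all of its importance on to `a`.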
Using stronger link semantics • Limitations of page importance • Traditional page importance works well when links have strong semantics (e.g. the author links to web pages that he likes) • More and more web pages are automatically generated, and most of their links carry little meaning • Refresh at the page level has drawbacks • So we use the link topology between sites and not only between pages • We also use the internal structure of a Web site to determine which links are more important
Site-based importance • The “random walk” model is used to determine the internal link structure of a site and assign an importance to each link • From there we define links between Web sites as follows: • Using the standard importance definition and the “random walk” model, the importance of a Web site is exactly the sum of the importance of its pages
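The last statement, that a site's importance is the sum of the importance of its pages, can be sketched in a few lines. The page-to-site mapping is passed in explicitly because, as noted earlier in the talk, a site is not the same thing as a host name. The function name, the scores, and the `www.example.fr` entry are hypothetical.

```python
from collections import defaultdict


def site_importance(page_scores, site_of):
    """Sum page importance per site.

    page_scores maps a page URL to its importance;
    site_of maps a page URL to the (logical) site it belongs to.
    """
    totals = defaultdict(float)
    for page, score in page_scores.items():
        totals[site_of[page]] += score
    return dict(totals)


# Hypothetical scores; two hosts are grouped into the single logical site "inria".
page_scores = {
    "www.inria.fr/index.html": 0.5,
    "www-rocq.inria.fr/index.html": 0.25,
    "www.example.fr/index.html": 0.25,
}
site_of = {
    "www.inria.fr/index.html": "inria",
    "www-rocq.inria.fr/index.html": "inria",
    "www.example.fr/index.html": "example",
}
sites = site_importance(page_scores, site_of)
```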
New criteria for importance • Frequent use of infrequent words (Find pages dedicated to a specific topic) • Text Weight (Find text pages with text content vs. raw data pages) • Others
Validation of the ‘notoriety’ parameter • Blind experiment with 8 librarians • A list of 900 sites with the notoriety parameter provided by Xyleme • 236 sites remained after excluding commercial sites and sites that no longer existed at the time of the test
Does the ranking correlate with the librarians’ choices? (Figure not recovered; one label reads “Random choice”.)
A model for Representing Changes Move from a discrete snapshot-type archive to a more continuous one
Representing changes • Goals • Provide a historical view of the Web • Issues • Have a persistent identification of web pages using their URL and date of crawl • Support temporal queries and provide means to access data efficiently • Handle mirror sites in order to save resources
The “Site-Delta” representation of changes • The “site-delta” is an XML document • It is used to manage metadata about documents, and in particular temporal metadata • Important aspects are: • Storage efficiency • Keep crawled information without duplicates • Use diff to understand changes when possible • Management of versions and updates • Support for queries and browsing
Browsing the Archive • The archive must be prepared in several steps: • Use local links instead of Internet links (problems occur with JavaScript, with sessions, …) • Fix inconsistent data and links • Integrate the notion of time in links • Advanced: summarize several snapshots of data into a single document? • Consider for example the news site www.lemonde.fr • We want to give access to all news articles from January 2002 (and their versions)
Conclusion and Experiments • A crawl of the Web was performed • We used between 2 and 8 PCs for the Xyleme crawlers • We looked at more than 1 billion (most interesting) pages • We discovered 15 million *.fr pages (about 1.5%) • We discovered 150,000 *.fr sites • Discovery and refresh are based on page importance • The change rate of pages is also taken into account • We analyzed the relevance of page importance for librarians • Comparison with ranking by librarians • Strong correlation with their rankings • Next, we plan to use classification and clustering techniques to refine the notion of site
Thank you
Example
<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-jan-04" status="updated" file="/data/fV453.htm"/>
    <document date="2002-jan-22" status="updated" file="/data/hX678.htm"/>
    <document date="2002-mar-02" status="unchanged" file="/data/hX678.htm"/>
    …
  </page>
  …
</website>
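A site-delta document shaped like this example could support a simple temporal query such as "which stored file represents this page on a given day?". The sketch below is only an illustration under assumptions: it reuses the element and attribute names from the example (`website`, `page`, `document`, `url`, `date`, `status`, `file`) on a well-formed copy with the elided parts left out, and the `version_at` and `parse_date` helpers are hypothetical, not part of the Xyleme system.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Well-formed copy of the slide's example, without the elided parts.
SITE_DELTA = """
<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-jan-04" status="updated" file="/data/fV453.htm"/>
    <document date="2002-jan-22" status="updated" file="/data/hX678.htm"/>
    <document date="2002-mar-02" status="unchanged" file="/data/hX678.htm"/>
  </page>
</website>
"""

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def parse_date(text):
    """Parse dates written like '2002-jan-04' (format taken from the example)."""
    year, month, day = text.split("-")
    return date(int(year), MONTHS[month.lower()], int(day))


def version_at(site_delta_xml, page_url, when):
    """Return the stored file for page_url from the latest crawl on or before `when`.

    Returns None if the page had not been crawled yet at that date.
    """
    root = ET.fromstring(site_delta_xml)
    best = None  # (crawl date, file) of the most recent crawl not after `when`
    for page in root.findall("page"):
        if page.get("url") != page_url:
            continue
        for doc in page.findall("document"):
            crawled = parse_date(doc.get("date"))
            if crawled <= when and (best is None or crawled > best[0]):
                best = (crawled, doc.get("file"))
    return None if best is None else best[1]


early = version_at(SITE_DELTA, "/index.html", date(2002, 1, 10))
later = version_at(SITE_DELTA, "/index.html", date(2002, 2, 1))
before = version_at(SITE_DELTA, "/index.html", date(2001, 12, 31))
```

Here `early` resolves to `/data/fV453.htm`, `later` to `/data/hX678.htm`, and `before` is `None`. The `unchanged` entry points to the same file as the previous crawl, which is how the example avoids storing duplicates.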