Issues in Monitoring Web Data

Issues in Monitoring Web Data Serge Abiteboul INRIA and Xyleme Serge.Abiteboul@inria.fr Web Monitoring 2002

Organization • Introduction • What is there to monitor? • Why monitor? • Some applications of web monitoring • Web archiving • An experience: the archiving of the French web • Page importance and change frequency • Creation of a warehouse using web resources • An experience: the Xyleme Project • Monitoring in Xyleme • Queries and monitoring • Conclusion

1. Introduction

The Web Today • Billions of pages + millions of servers • Query = keywords to retrieve URLs: imprecise; query results are useless for further processing • Applications: based on ad-hoc wrapping; expensive; incomplete; short-lived, not adapted to the Web's constant changes • Poor quality: cannot be trusted (spamming, rumors…); often stale • Our vision of it is often out-of-date • Importance of monitoring

The HTML Web Structure [Figure. Source: IBM, AltaVista, Compaq]

HTML: Percentage covered by Crawlers [Figure. Source: searchenginewatch.com]

So much for the world knowledge… • Most of the web is not reached by crawlers (hidden web) • Some of the public HTML pages are never read • Most of what is on the web is junk anyway • Our knowledge of it may be stale • Do not junk the technology: improve it!

What is there to monitor? • Documents: HTML but also doc, pdf, ps… • Many data exchange formats such as ASN.1, BibTeX… • New official data exchange format: XML • Hidden web: database queries behind forms or scripts • Multimedia data: ignored here • Public vs. private (Intranet or Internet + password) • Static vs. dynamic

What is changing? • XML is coming: universal data exchange format; marriage of document and database worlds; standard query language: XQuery; growing quickly on intranets and very slowly on the public web (less than 1%) • Web services are coming: format for exporting services; format for encapsulating queries • More semantics to be expected: RDF for data; WSDL+UDDI for services

What is not changing fast or even getting worse • Massive quantity of data, most of it junk • Lots of stale data • Very primitive HTML query mechanisms (keywords) • No real change control mechanism expected soon • Compare database queries (fresh data) with web search engines (possibly stale) • Compare database triggers (based on push) with web notification services (most of the time based on pull/refresh)

The need to monitor the web • The web changes all the time • Users are often as interested in changes as in the data itself: new products, new press articles, new prices… • Discover new resources • Keep our vision of the web up-to-date • Be aware of changes that may be of interest or have an impact on our business

Analogy: databases • Databases: a query gives an instantaneous vision of the data; a trigger gives an alert/notification of some changes of interest • Web: a query needs monitoring to give a correct answer; monitoring supports alerts/notifications of changes of interest

Web vs. database monitoring • Quantity of data: larger on the web • Knowledge of data: structure and semantics known in databases • Reliability and availability: high in databases; none on the web • Data granularity: tuple vs. page in HTML or element in XML • Change control: databases have support from data sources/triggers; the web has no support, pull only in general

2. Some applications of web monitoring

Comparative shopping • Unique entry point to many catalogs • Data integration problem • Main issue: wrapping of web catalogs; semi-automatic, so limited to a few sites; simpler and closer to automatic with XML • Alternatives: mediation when data change very fast (prices and availability of plane tickets); warehousing otherwise → need to monitor changes

Web surveillance • Applications: anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products; business intelligence, e.g., discovering potential customers, partners, competitors • Find the data (crawl the web) • Monitor the changes: new pages, deleted pages, changes in a page • Classify information and extract data of interest: data mining, text understanding, knowledge representation and extraction, linguistics… Very AI

Copy tracking • Example: a press agency wants to check that people are not publishing copies of its wires without paying [Diagram labels: query to search engine, or specific crawl + pre-filter; flow of candidate documents; slice the document; filter; detection; stages numbered 1, 2, 3]
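The "slice the document" and "detection" stages of the copy-tracking slide can be illustrated with a small sketch. The slide does not say how documents are sliced or compared, so everything here is an assumption: slices are taken to be overlapping windows of five words, and the score is the fraction of the wire's slices that reappear in a candidate document. The names `slices` and `copied_fraction` are hypothetical.

```python
import re


def slices(text, size=5):
    """Cut a text into overlapping windows of `size` lowercase words.

    Assumption: word windows are one plausible way to "slice" a document;
    the slide does not specify the actual method.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts yield a single slice, or nothing if they are empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(wire, candidate, size=5):
    """Fraction of the wire's slices that also occur in the candidate."""
    wire_slices = slices(wire, size)
    if not wire_slices:
        return 0.0
    return len(wire_slices & slices(candidate, size)) / len(wire_slices)


if __name__ == "__main__":
    wire = "The minister announced a new budget for public research on Tuesday"
    page = "Breaking: the minister announced a new budget for public research on Tuesday, sources say."
    print(copied_fraction(wire, page))
```

A candidate whose score exceeds some threshold would then be passed on for closer inspection; the threshold and the slice size would have to be tuned on real wires.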

Web archiving • We will discuss an experience in archiving the French web

Creation of a data warehouse with resources found on the web • We will discuss some work in the Xyleme project on the construction of XML warehouses

3. Web archiving: an experience towards the archiving of the French web with the Bibliothèque Nationale de France

Dépôt légal (legal deposit) • Books have been archived since 1537, by decision of King François I • The Web is an important and valuable source of information that should also be archived • What is different? Number of content providers: 148,000 sites vs. 5,000 publishers; quantity of information: millions of pages + video/audio; quality of information: lots of junk; relationship with publishers: freedom of publication vs. traditional "push" model; updates and changes occur continuously; the perimeter is unclear: what is the French web?

Goal and Scope • Provide future generations with a representative archive of the cultural production • Provide material for cultural, political, sociological studies • The mission is to archive a wide range of material because nobody knows what will be of interest for future research • In traditional publishing, publishers filter content; there is no filter on the web

Similar Projects • The Internet Archive (www.archive.org): the Wayback Machine; largest collection of versions of web pages • Human-selection-based approach: select a few hundred sites and choose a periodicity of archiving; Australia and Canada • The Nordic experience: use a robot crawler to archive a significant part of the surface web; Sweden, Finland, Norway • Problems encountered: lack of updates of archived pages between two snapshots; the hidden Web

Orientation of our experiment • Goals: cover a large portion of the French web; automatic content gathering is necessary; adapt robots to provide a continuous archiving facility; have frequent versions of the sites, at least for the most "important" ones • Issues: the notion of "important" sites; building a coherent Web archive; discovering and managing important sources of the deep Web

First issue: the perimeter • The perimeter of the French Web: contents edited in France • Many criteria may be used: the French language, but many French sites use English (e.g., INRIA) and many French-language sites are from other French-speaking countries or regions (e.g., Quebec); domain name or resource locators: .fr sites, but many are also in .com or .org; address of the site: physical location of the web servers or address of the owner • Other criteria than the perimeter: little interest in commercial sites; possibly interest in foreign sites that discuss French issues • Purely automatic selection does not work → involve librarians

Second issue: site vs. page archiving • The Web: physical granularity = HTML pages • The problem is inconsistent data and links: read page P; one week later, read the pages pointed to by P, which may not exist anymore • Logical granularity? Snapshot view of a web site • What is a site? INRIA is www.inria.fr + www-rocq.inria.fr…; www.multimania.com is the provider of many sites • There are technical issues (rapid firing, …)

Importance of data

What is page importance? • The "Le Louvre" homepage is more important than an unknown person's homepage • Important pages are pointed to by: other important pages; many unimportant pages • This leads to Google's definition of PageRank: based on the link structure of the web; used with remarkable success by Google for ranking results • Useful but not sufficient for web archiving

Page Importance • Importance • Link matrix L • In short, page importance is the fixpoint X of the equation L*X = X • Storing the link matrix and computing page importance uses lots of resources • We developed a new efficient technique to compute the fixpoint, without having to store the link matrix • The technique adapts automatically to changes
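To make the fixpoint L*X = X concrete, here is a minimal sketch using plain power iteration. This is the textbook method, not the technique mentioned on the slide: it keeps the whole link structure in memory, which is exactly the cost the authors' technique is said to avoid. Assumptions: each page splits its importance evenly among the pages it points to, pages without out-links spread theirs over all pages, and no damping factor is used. The function name `page_importance` is hypothetical.

```python
def page_importance(links, iterations=100):
    """Approximate the fixpoint X of L*X = X by power iteration.

    `links` maps each page to the list of pages it points to.
    Assumption: L is the column-normalised link matrix, so every page
    hands out its current importance in equal shares along its links.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    x = {p: 1.0 / len(pages) for p in pages}  # start from a uniform vector
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            # A page with no out-links spreads its importance over all pages.
            targets = links.get(p) or pages
            share = x[p] / len(targets)
            for q in targets:
                nxt[q] += share
        x = nxt
    return x


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in page_importance(web).items():
        print(page, round(score, 3))
```

On this three-page example the iteration settles at about 0.4 for `a` and `c` and 0.2 for `b`: `c` is pointed to by both other pages, and `a` inherits all of `c`'s importance. Without damping, convergence is only guaranteed when the link graph is strongly connected and aperiodic, as it is here.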

Site vs. pages • Limitation of page importance: Google page importance works well when links have strong semantics; more and more web pages are automatically generated and most links have little semantics • Further limitation: refresh at the page level presents drawbacks • So we also use link topology between sites and not only between pages

Experiments • Crawl: we used between 2 and 8 PCs for Xyleme crawlers for 2 months; discovery and refresh based on page importance • Discovery: we looked at more than 1.5 billion (most interesting) web pages; we discovered more than 15 million *.fr pages, about 1.5% of the web; we discovered 150,000 *.fr sites • Refresh: important pages were refreshed more often; the change rate of pages is also taken into account • Analysis of the relevance of site importance for librarians: comparison with ranking by librarians; strong correlation with their rankings

Issues and ongoing work: other criteria for importance • Take into account indications by archivists: they know best (a man-machine interface issue) • Use classification and clustering techniques to refine the notion of site • Frequent use of infrequent words: find pages dedicated to specific topics • Text weight: find pages with text content (vs. raw data pages) • Others

4. Creation of a Warehouse from Web data: The Xyleme Project

Xyleme in short • The Xyleme project: initiated at INRIA; joint work with researchers from Orsay, Mannheim and CNAM-Paris universities • The Xyleme company (www.xyleme.com): started in 2000; about 30 people; mission: deliver a new generation of content technologies to unlock the potential of XML • Here: focus on the Xyleme project

Goal of the Xyleme project • Focus is on XML data (but HTML is also handled) • Semantics: understand tags, partition the Web into semantic domains, provide a simple view of each domain • Dynamicity: find and monitor relevant data on the web; control relevant changes in Web data • XML storage, index and queries: manage millions of XML documents efficiently and process millions of simultaneous queries

Corporate information environment with Xyleme [Diagram labels: Web; crawling & interpreting data; XML repository; repository query engine; Xyleme server; systematic updating; publishing; searches; queries; information system]

XML in short • Data exchange format • eXtensible Markup Language (child of SGML) • Promoted by W3C and major industry players • XML document: ordered labeled tree • Other essential gadgets: Unicode, namespaces, attributes, pointers, typing (XML Schema)…

XML magic in short • Presentation is given elsewhere (style sheet) • Semantics and structure are provided by labels • So it is easy to extract information • Universal format understood by more and more software (e.g., exported by most databases + read by more and more editors) • More and more tools available

It is easy to extract information

XML:
<product reference="X23">
  <designation> camera </designation>
  <price unit="Dollars"> 359.99 </price>
  <description> … </description>
</product>

Information System:
Ref     Name     Price
X23     Camera   359.99
R2D2    Robot    19350.00
Z25     PC       1299.99

Mapping:
Ref → product/reference
Name → product/designation
Price → product/price
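The mapping from XML paths to table columns on this slide can be sketched in a few lines with Python's standard `xml.etree.ElementTree` module. The `<catalog>` wrapper element, the markup of the second product, and the function name `extract_rows` are assumptions made for illustration; only the `product/reference`, `product/designation` and `product/price` paths come from the slide.

```python
import xml.etree.ElementTree as ET

# Assumption: products are wrapped in a <catalog> root element, and the
# R2D2 entry is written by analogy with the X23 entry shown on the slide.
CATALOG = """<catalog>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350.00 </price>
  </product>
</catalog>"""


def extract_rows(xml_text):
    """Turn each <product> element into a (Ref, Name, Price) row."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iter("product"):
        rows.append({
            "Ref": product.get("reference"),                             # product/reference
            "Name": product.findtext("designation", default="").strip(),  # product/designation
            "Price": float(product.findtext("price", default="nan")),     # product/price
        })
    return rows


if __name__ == "__main__":
    for row in extract_rows(CATALOG):
        print(row)
```

Because the labels carry the structure, no page-specific wrapper is needed: the same three paths work for any catalog that follows this layout.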

4.1 Xyleme: Functionality and architecture

The goal of the Xyleme project: XML Dynamic Data Warehouse • Many research issues: Query Processor; Semantic Classification; Data Monitoring; Native Storage; XML document versioning; XML automatic or user-driven acquisition; Graphical User Interface through the Web

Functional Architecture [Diagram labels: User Interface; Internet; Web Interface; Xyleme Interface; Acquisition & Crawler; Change Control; Semantic Module; Loader; Query Processor; Repository and Index Manager]

Architecture [Diagram labels: Internet; Web Interface; Crawler; two nodes labeled Interface, Change, Semantic, Global Query; Global Loader; Ethernet; three Index nodes; nodes labeled Loader, Query, Version, Repository; XML DOC extents for DTDi, DTDj; DTDk, DTDl; DTDm, ..; DTDp ...]

Prototype main choices • Network of Linux PCs • C++ on the server side • CORBA for communications between PCs • HTTP + SOAP for external communications • Exception for query processing

Scaling: parallelism based on • Partitioning: XML documents; URL table; indexes (semantic partitioning) • Memory replication • Autonomous machines (PCs) • Caches are used for data flow

4.2 Xyleme: Data Acquisition

Data Acquisition • Xyleme crawler visits the HTML/XML web • Management of metadata on pages • Sophisticated strategy to optimize network bandwidth: importance ranking of pages; change frequency and age of pages; publications (owners) & subscriptions (users) • Each crawler visits about 4 million pages per day • Each indexer may index 1 million pages per day
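One way to picture a refresh strategy that combines importance, change frequency and age is the sketch below. It is not Xyleme's actual strategy, which the slide does not detail. Assumptions: page changes follow a Poisson process, so the probability that a page has changed since its last visit is 1 - exp(-rate * age), and the refresh priority is that probability weighted by importance. Publications and subscriptions, which the slide also lists, are left out. `PageMeta`, `refresh_priority` and `next_to_refresh` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class PageMeta:
    """Per-page metadata kept by the crawler (hypothetical fields)."""
    url: str
    importance: float       # e.g. a link-based importance score
    changes_per_day: float  # estimated change frequency
    age_days: float         # time since the page was last fetched


def refresh_priority(page):
    """Importance times the estimated probability that the page changed.

    Assumption: changes arrive as a Poisson process with the given rate.
    """
    p_changed = 1.0 - math.exp(-page.changes_per_day * page.age_days)
    return page.importance * p_changed


def next_to_refresh(pages, budget):
    """Pick the `budget` pages most worth re-fetching."""
    return sorted(pages, key=refresh_priority, reverse=True)[:budget]


if __name__ == "__main__":
    pages = [
        PageMeta("http://example.org/news", 0.9, 1.0, 2.0),
        PageMeta("http://example.org/old", 0.1, 1.0, 2.0),
        PageMeta("http://example.org/fresh", 0.9, 1.0, 0.0),
    ]
    for page in next_to_refresh(pages, 2):
        print(page.url)
```

Under these assumptions a just-fetched page gets priority zero however important it is, so bandwidth goes first to important pages that are likely to be stale.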

4.3 Xyleme: Change Control

Change Management • Monitoring: subscriptions; continuous queries; versions • The Web changes all the time • Data acquisition: automatic and via publication

Subscription • Users can subscribe to certain events, e.g., changes in all pages of a certain DTD or of a certain semantic domain; insertion of a new product in a particular catalog or in all catalogs with a particular DTD • They may request to be notified at the time the event is detected by Xyleme, or regularly, e.g., once a week
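The subscription model on this slide (events filtered by DTD or semantic domain, delivered either immediately or in a periodic report) can be sketched as follows. This is an illustration under assumptions, not Xyleme's subscription language: the `Event` and `Subscription` classes, the event kinds, and the functions `dispatch` and `periodic_report` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    url: str
    kind: str    # hypothetical labels, e.g. "changed" or "new_product"
    dtd: str
    domain: str  # semantic domain of the page


@dataclass
class Subscription:
    user: str
    kind: str
    dtd: Optional[str] = None     # None means "any DTD"
    domain: Optional[str] = None  # None means "any semantic domain"
    immediate: bool = True        # False: collect events for a periodic report
    pending: List[Event] = field(default_factory=list)

    def matches(self, event):
        return (event.kind == self.kind
                and self.dtd in (None, event.dtd)
                and self.domain in (None, event.domain))


def dispatch(event, subscriptions):
    """Return (user, event) pairs to notify now; queue the rest."""
    notify_now = []
    for sub in subscriptions:
        if not sub.matches(event):
            continue
        if sub.immediate:
            notify_now.append((sub.user, event))
        else:
            sub.pending.append(event)
    return notify_now


def periodic_report(sub):
    """Hand over and clear the events queued since the last report."""
    report, sub.pending = sub.pending, []
    return report
```

A weekly subscriber would simply have `periodic_report` called once a week; how events are detected in the first place (crawling, versioning, diffing) is outside this sketch.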
