Monitoring XML Data on the Web
Benjamin Nguyen, Serge Abiteboul, Grégory Cobéna and Mihaï Preda
INRIA Rocquencourt, Projet Verso and Xyleme S.A., FRANCE
Contact: firstname.lastname@inria.fr or mihai.preda@xyleme.com
http://www-rocq.inria.fr/verso/ and http://www.xyleme.com
Organization
• Introduction
• Query Subscription
• Motivations
• Subscription System Architecture
• Subscription Language
• Complex Event Detection Algorithm
• Alerters
• Conclusion
SIGMOD'01 Santa-Barbara
Xyleme: A Dynamic Warehouse for the XML Data of the Web
“A complex tissue in the vascular system of higher plants… functions chiefly in conduction but also in support and storage.” (Webster)
A brief look back…
• 1999/2000: a group of researchers from:
  • Inria Rocquencourt, Verso Group
  • U. of Mannheim, Database Group
  • U. of Orsay, IASI Group
  • CNAM, Vertigo Group
• October 2000: creation of a start-up, http://www.xyleme.com/
The three aspects of Xyleme
• Webhouse
  • Xyleme stores huge quantities of data (terabytes)
  • Xyleme is more than a search engine (index only) or a mediator (virtual data only)
• XML
  • Xyleme is focused on XML, i.e., trees
• Dynamic
  • Xyleme is interested in data evolution/changes
Xyleme Global Architecture
[Figure: architecture diagram. Labels: Internet, User Interface, Web Interface, Xyleme Interface, Acquisition & Crawler, Loader, Change Control, Semantic Module, Query Processor, Repository and Index Manager.]
Runs on a cluster of Linux PCs. Implemented in C++.
The Web changes all the time
• Data acquisition + maintenance
  • Keep the warehouse up-to-date: “Acquisition and Maintenance of XML Data from the Web”, L. Mignet, M. Preda, S. Abiteboul, B. Amann, A. Marian, Tech. Report
• Version management
  • “Change-Centric Management of Versions in an XML Warehouse”, A. Marian, S. Abiteboul, G. Cobena, L. Mignet, VLDB’01
• Change monitoring
  • Query subscription
Query Subscription
• Users may subscribe to certain events
  • Changes in a page or a set of pages
  • Changes in pages from a particular semantic domain, containing some specific words, or with a particular DTD
  • Changes of particular elements somewhere (e.g., new products in a catalog)
• Users may request to be notified
  • Immediately, at the time the event is detected
  • Regularly, e.g., weekly
  • After a certain number of event detections
• Users want to be notified
  • By email
  • Upon login to our site
Architecture
[Figure: subscription system architecture. Labels: Xyleme Query Processor, Xyleme Alerter, Trigger Engine, Complex Event Detection, Reporter (Xyleme Reporter), Subscription Manager (Xyleme Subscription Manager), Web Browser; flows labelled “documents” and “SQL”.]
Step 1: Atomic Event Detection
• 5 million pages/day
• Atomic event 46: URL matches pattern www.musee-orsay.fr/*
• Atomic event 67: XML document contains the tag <painter> with the value “Monet”
[Figure: loading pipeline. Labels: loading, metadata manager, HTML parser, XML loader, complex event detection; under “document & alerts”, a document d is annotated with the atomic events detected so far (d, d/46, d/46,67).]
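The two atomic events on this slide can be illustrated with a minimal sketch. The event codes 46 and 67, the URL pattern and the <painter>/“Monet” condition come from the slide; the function name, the dictionaries and the use of shell-style wildcards via Python's fnmatch are assumptions made for illustration, not a description of how Xyleme's C++ alerters work.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Event codes 46 and 67 come from the slide; the table layout is hypothetical.
URL_PATTERN_EVENTS = {46: "www.musee-orsay.fr/*"}
TAG_VALUE_EVENTS = {67: ("painter", "Monet")}


def detect_atomic_events(url, xml_text):
    """Return the sorted list of atomic event codes raised by one document."""
    events = set()
    # URL events: shell-style wildcard match (an assumed pattern semantics).
    for code, pattern in URL_PATTERN_EVENTS.items():
        if fnmatch.fnmatchcase(url, pattern):
            events.add(code)
    # Content events: some element with the given tag contains the given word.
    root = ET.fromstring(xml_text)
    for code, (tag, word) in TAG_VALUE_EVENTS.items():
        if any(word in "".join(el.itertext()) for el in root.iter(tag)):
            events.add(code)
    return sorted(events)


doc = "<museum><painting><painter>Claude Monet</painter></painting></museum>"
print(detect_atomic_events("www.musee-orsay.fr/collections/42.xml", doc))
print(detect_atomic_events("www.example.org/page.xml", doc))
```

The returned list plays the role of the “d/46,67” annotation in the figure: the document identifier together with the codes of the atomic events it raised.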
Step 2: Complex Event Detection
• Millions of alerts/day
• Millions of subscriptions
• Complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
[Figure: pipeline diagram. Labels: HTML parser, XML loader, complex event detection.]
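A complex event is a conjunction of atomic events, so the simplest possible detector is a subset test. The sketch below is a naive baseline under that reading; complex event 12 and atomic events 46 and 67 come from the slide, while the function and variable names are hypothetical.

```python
# Complex event 12 = atomic event 67 AND atomic event 46 (codes from the slide).
COMPLEX_EVENTS = {12: frozenset({46, 67})}


def detect_complex_events(atomic_events, complex_events=COMPLEX_EVENTS):
    """Return the codes of all complex events whose atomic events are all present."""
    detected = set(atomic_events)
    return sorted(code for code, required in complex_events.items()
                  if required <= detected)


print(detect_complex_events([46, 67]))  # both conditions hold
print(detect_complex_events([67]))      # URL pattern did not match
```

This baseline scans every subscription for every document, which would presumably be too slow with millions of subscriptions; that cost is the motivation for a dedicated detection algorithm.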
Step 3: Notification Processor
• Millions of notifications/day
[Figure: notification pipeline. Labels: complex event detection, Reporter, continuous queries, triggers, clock, notification/monitoring alerts, notification/results.]
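The notification conditions described for query subscription (immediately, regularly, or after a certain number of detections) can be sketched as a small per-subscription buffer. This is a hypothetical illustration only: the class, its fields and its methods are invented here and do not describe the actual Xyleme Reporter.

```python
from dataclasses import dataclass, field


@dataclass
class ReportBuffer:
    """Hypothetical per-subscription buffer; not Xyleme's actual Reporter."""
    flush_after: int = 0  # 0 disables the count-based trigger
    pending: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def add(self, notification, immediate=False):
        """Store a notification; report at once if a trigger condition holds."""
        self.pending.append(notification)
        if immediate or (self.flush_after and len(self.pending) >= self.flush_after):
            self.flush()

    def on_clock_tick(self):
        """Called by a periodic clock (e.g. daily or weekly)."""
        if self.pending:
            self.flush()

    def flush(self):
        # Stand-in for sending an email or storing a report shown at next login.
        self.sent.append(list(self.pending))
        self.pending.clear()


buf = ReportBuffer(flush_after=3)
for i in range(4):
    buf.add(f"event {i}")
buf.on_clock_tick()
print(buf.sent)
```

In this run the first report is produced by the count trigger after three notifications and the second by the clock tick.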
Subscription Language
• SQL-like language.
• Combines the use of monitoring queries and continuous queries.
• The language can be extended by adding new types of atomic events.
• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report
Example
  subscription myPaintings
  % what are the new painting entries in Musee d’Orsay site
  monitoring newPainting
    select URL
    where URL extends www.musee-orsay.fr/*
      and <painter> contains “Monet”
  % manage the changes in the expositions
  continuous delta Exposition
    select ... from ... where
    when monthly
  notify daily % send me a daily report
Atomic Event Set Algorithm
Complex events and the atomic events that compose them:
  C0 = a0
  C1 = a0 a4
  C2 = a0 a1 a3
  C3 = a2
  C4 = a4 a5 a6 a7
[Figure: data structure linking the atomic events a0 to a7 to the complex events C0 to C4.]
Atomic Event Set Algorithm
• S = {a0 a2 a4}
• Detected Events: C0 C1 C3
[Figure: the same data structure as on the previous slide, traversed with the set S.]
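The example on these two slides can be reproduced with a small sketch. The slides do not spell out the data structure, so the prefix tree over sorted atomic events used below is an assumption, not necessarily the authors' structure; the definitions of C0 to C4 and the set S are taken from the slides, with atomic event a0 written as the integer 0, and so on.

```python
class Node:
    """One node of a prefix tree keyed on sorted atomic event ids (assumed design)."""

    def __init__(self):
        self.children = {}  # atomic event id -> Node
        self.complex = []   # complex events completed exactly at this node


def build(complex_events):
    """Insert each complex event as the sorted sequence of its atomic events."""
    root = Node()
    for name, atoms in complex_events.items():
        node = root
        for atom in sorted(atoms):
            node = node.children.setdefault(atom, Node())
        node.complex.append(name)
    return root


def detect(root, detected_atoms):
    """Return every complex event whose atomic events all occur in detected_atoms."""
    s = sorted(set(detected_atoms))
    found = []

    def walk(node, start):
        found.extend(node.complex)
        # Only follow branches labelled with a later element of s.
        for i in range(start, len(s)):
            child = node.children.get(s[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)


# Definitions from the slide: C0 = a0, C1 = a0 a4, C2 = a0 a1 a3, C3 = a2, C4 = a4 a5 a6 a7.
COMPLEX = {"C0": {0}, "C1": {0, 4}, "C2": {0, 1, 3}, "C3": {2}, "C4": {4, 5, 6, 7}}
tree = build(COMPLEX)
print(detect(tree, {0, 2, 4}))  # S = {a0 a2 a4}
```

With S = {a0, a2, a4} the walk reaches the nodes for C0, C1 and C3 and never completes C2 or C4, which agrees with the detected events listed on the slide. The point of such a structure is that the work depends on the atomic events present in the document rather than on the total number of subscriptions.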
Complexity results
• A formal study has been conducted.
• Experimental (simulation) values concur with this study.
• Results show that the algorithm is well suited for our application:
  • 10 million complex events
  • 1 million atomic events
  • 100 atomic events detected per document
• 0.8 ms to process a document.
• ~2 million documents per day.
Alerters
• Each Alerter can be viewed as a plugin that acts on a document flow.
• All sorts of atomic events can be detected: URL pattern detection, keywords, XML structure, page rank…
• Can be distributed.
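The plugin view of alerters can be sketched as a list of independent callables applied to each document of a flow. Everything below is hypothetical: the Document shape, the event codes 1 and 2, and the helper names are invented for illustration and are not Xyleme's interfaces.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set, Tuple

Document = Dict[str, str]                 # assumed shape: {"url": ..., "text": ...}
Alerter = Callable[[Document], Set[int]]  # a plugin returns the event codes it raises


def url_prefix_alerter(code: int, prefix: str) -> Alerter:
    """Plugin raising `code` when the document URL starts with `prefix`."""
    return lambda doc: {code} if doc["url"].startswith(prefix) else set()


def keyword_alerter(code: int, word: str) -> Alerter:
    """Plugin raising `code` when the document text contains `word` (case-insensitive)."""
    return lambda doc: {code} if word.lower() in doc["text"].lower() else set()


def run_alerters(docs: Iterable[Document],
                 alerters: List[Alerter]) -> Iterator[Tuple[str, List[int]]]:
    """Apply every plugin to every document of the flow."""
    for doc in docs:
        events: Set[int] = set()
        for alerter in alerters:
            events |= alerter(doc)
        yield doc["url"], sorted(events)


flow = [
    {"url": "www.musee-orsay.fr/monet.xml", "text": "Paintings by Monet"},
    {"url": "www.example.org/news.xml", "text": "Weather report"},
]
alerters = [url_prefix_alerter(1, "www.musee-orsay.fr/"), keyword_alerter(2, "monet")]
results = list(run_alerters(flow, alerters))
print(results)
```

Because each plugin only looks at the document it is given, adding a new kind of atomic event amounts to appending another callable, and the plugins could in principle run on different machines.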
Conclusion and Perspectives
• This work has been implemented and integrated in the Xyleme system.
• The core of our system is reusable.
• The system is extensible and can be used to trigger various other modules:
  • versioning of documents
  • semantic classification
Perspectives
• Re-use of the core of our system.
• Triggering of various other modules:
  • versioning of documents
  • semantic classification
The Coming of XML
                          HTML                            XML
  Origin                  comes from SGML                 also
  Nature                  hypertext language              semistructured data
  Tags                    fixed number of tags            not fixed
  Content/presentation    mixed                           not mixed
  Extracting data         very difficult                  very easy
  Status                  old standard                    new standard
XML = Semistructured Data
Information System:
  Ref     Name     Price
  X23     Camera   359.99
  R2D2    Robot    19350.00
  Z25     PC       1299.99
  ...
XML (data + structure; semistructured: more flexible):
  <product-table>
    <product reference="X23">
      <designation> camera </designation>
      <price unit="Dollars"> 359.99 </price>
      <description> … </description>
    </product>
    <product reference="R2D2">
      <designation> Robot </designation>
      <price unit="Dollars"> 19350 </price>
      <description> … </description>
    </product>
    ...
  </product-table>
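To make the correspondence between the table and the XML concrete, the sketch below parses a well-formed version of the slide's fragment back into rows using Python's standard xml.etree.ElementTree. The element and attribute names come from the slide; the function name is hypothetical, and the descriptions, which are elided on the slide, are simply left out to show that a missing element is tolerated.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the slide's fragment; <description> is omitted on purpose.
XML = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
  </product>
</product-table>
"""


def to_rows(xml_text):
    """Flatten <product> elements into (ref, name, price, unit, description) tuples."""
    rows = []
    for product in ET.fromstring(xml_text.strip()).iter("product"):
        price = product.find("price")
        rows.append((
            product.get("reference"),
            product.findtext("designation", default="").strip(),
            float(price.text),
            price.get("unit"),
            # Optional element: semistructured data may lack it without breaking anything.
            product.findtext("description", default="").strip(),
        ))
    return rows


for row in to_rows(XML):
    print(row)
```

The structure travels with the data, so the reader recovers column names from the tags instead of relying on a fixed schema.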