Xyleme, January 2001 -- Zurich • A Dynamic Warehouse for the XML data of the Web • Serge Abiteboul, INRIA & Xyleme SA • Serge.Abiteboul@inria.fr, Serge.Abiteboul@xyleme.com • http://www-rocq.inria.fr/verso, http://www.xyleme.com
Organization • The Web and XML • Xyleme • 1. Data Acquisition and Maintenance • 2. XML Repository • 3. Semantic Data Integration • 4. Query Processing • 5. Query Subscription • Conclusion
The Web and XML
The Web today • Terabytes of data • Private web: pages that are not publicly available • Deep web: data hidden behind forms • A lot of public pages • 1 billion in 06/2000 • several million servers
The Web today • Browsing • Search engines • Google indexes more than 1 billion pages (11/00) • in: list of words • out: sorted list of URLs • based on occurrence of words in documents • based on the link structure of the web
The Web today • Queries: keywords to retrieve URLs • Imprecise • Query results cannot be directly processed • Difficult to extract data of interest • Applications: based on hand-made wrappers • Expensive • Incomplete • Short-lived, not adapted to the Web's constant changes
The Coming of XML • HTML • comes from SGML • hypertext language • fixed number of tags • content and presentation are mixed • very difficult to extract data from a page • old standard • XML • also comes from SGML • semistructured data • tags are not fixed • content and presentation are not mixed • very easy to extract data from a page • new standard
HTML = Hypertext Language • Information System (Ref, Name, Price): (X23, Camera, 359.99), (R2D2, Robot, 19350.00), (Z25, PC, 1299.99) • HTML: The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. • Text + presentation: hard • Where is the data?
XML = Semistructured Data • Information System (Ref, Name, Price): (X23, Camera, 359.99), (R2D2, Robot, 19350.00), (Z25, PC, 1299.99), ... • XML: <product-table> <product reference="X23"> <designation> camera </designation> <price unit="Dollars"> 359.99 </price> <description> … </description> </product> <product reference="R2D2"> <designation> Robot </designation> <price unit="Dollars"> 19350 </price> <description> … </description> ... </product-table> • Data + Structure: easy • Semistructured: more flexible
XML: Tree Types • Semantics and structure are in paths • product-table/product/reference • product-table/product/price • Tree: product-table has child product; product has children reference, designation, price, description
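The idea that semantics and structure live in paths can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. The catalogue below is a hypothetical, shortened copy of the product table from the slides, and the helper names `label_paths` and `prices` are ours, not part of Xyleme.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue, modelled on the product table shown in the slides.
CATALOGUE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""


def label_paths(xml_text):
    """Return the sorted set of root-to-node label paths (attributes included)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for attribute in node.attrib:
            paths.add(f"{path}/{attribute}")
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths)


def prices(xml_text):
    """Follow product-table/product/price and key each price by its reference."""
    root = ET.fromstring(xml_text)
    return {
        product.get("reference"): float(product.findtext("price"))
        for product in root.findall("product")
    }


if __name__ == "__main__":
    for path in label_paths(CATALOGUE):
        print(path)
    print(prices(CATALOGUE))
```

Here attributes are treated as one more path step, which is one possible reading of the path product-table/product/reference.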
XML • Very active/noisy field - standards • schema (XML schema), stylesheet (XSL), resource description (RDF...) • WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the web? • so far rather slow (about 1% now of the visible web; much more in intranets) • much faster since the arrival of Explorer 5.5
Xyleme: A Dynamic Warehouse for the XML Data of the Web
Xyleme • Warehouse • Xyleme stores huge quantities of data (terabytes) • Xyleme is not a search engine (index only) or a mediator (virtual data only) • XML • Xyleme is focused on XML, i.e., trees • Dynamic • Xyleme is interested in data evolution/changes
Xyleme • September 1999: a group of researchers from • Inria Rocquencourt, Verso Group • U. of Mannheim, Database Group • U. of Orsay, IASI Group • CNAM, Vertigo Group • September 2000: creation of a start-up • November 2000: about 15 people
Web Corporate Information Today • Ad-hoc applications written by web experts, tailored for specific tasks and data, i.e., inflexible and expensive • Manual searches using browsers • Manual updates of the Information System
Web Corporate Information with Xyleme • Xyleme warehouse: crawling & interpreting data, Repository, Query Engine • Xyleme API: publishing, searches, queries, updates • Information System
Five Challenges • 1. Data Acquisition and Maintenance: discover data of interest and keep it up to date • 2. Repository: store this data and index it so that it can be processed efficiently • 3. Query Processing: efficiently support an SQL-style query language
Five Challenges - continued • 4. Semantic Integration: understand DTDs and tags, partition the Web into semantic domains, provide a simple view of each domain • 5. Change Control: monitor the Web and offer services such as Query Subscription
Challenges - continued • Scale to the web • Size of data: millions/billions of pages • Size of index: terabytes • Number of customers • thousands of simultaneous queries • millions of subscriptions
Functional Architecture (diagram) • INTERNET • User Interface, Web Interface, Xyleme Interface • Query Processor • Change Control • Semantic Module • Acquisition & Crawler • Loader • Repository and Index Manager
Architecture • Cluster of PCs • Developed with Linux and C++ • Communications • local: CORBA • external: HTTP • Distribution between autonomous machines
Architecture (diagram) • INTERNET • Change Control and Semantic Integration machines • Acquisition and Maintenance machines • Index machines • Loader | Query machines • Repository machines • all connected by ETHERNET
1. Data Acquisition and Maintenance
Goals • Discover XML pages on the web that are of interest to customers • To do this, crawl the web (HTML+XML) • Keep them up to date • Do this under bounded resources
Life Cycle of a page in Xyleme • The URL of D is discovered as a link in another page (or published by a customer) • The page scheduler decides to read D • The metadata of D is read • type, last_date_update... • The document D is loaded • The document D is (re)read regularly
Main Issues • Loading of pages • we can load up to 5 million pages/day on a standard PC • main cost is the Internet connection • Metadata management • Page scheduling • decide which page to read or refresh next
Metadata Management • Example: management of the link matrix • page i points to page j • for 1 billion URLs, about 30 children/URL • the matrix has $30 \times 10^9$ edges (very sparse) • For each page that is read, • find the IDs of the 30 children • 50 pages/second, i.e., 1500 database calls/second
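The figures on this slide follow from simple arithmetic, spelled out below as a small Python check. The constant names are ours; the values are the ones quoted on the slides, and the pages-per-day figure simply assumes the 50 pages/second rate is sustained for 24 hours.

```python
# Values quoted on the slides; the constant names are illustrative only.
URLS = 10**9              # about 1 billion URLs
CHILDREN_PER_URL = 30     # about 30 outgoing links per page
PAGES_PER_SECOND = 50     # pages read per second
SECONDS_PER_DAY = 24 * 60 * 60

# Number of non-zero entries (edges) of the link matrix.
edges = URLS * CHILDREN_PER_URL

# One ID lookup per child of every page read.
lookups_per_second = PAGES_PER_SECOND * CHILDREN_PER_URL

# Assumption: the read rate is sustained around the clock.
pages_per_day = PAGES_PER_SECOND * SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"edges: {edges:.1e}")
    print(f"database calls per second: {lookups_per_second}")
    print(f"pages per day: {pages_per_day}")
```

At 50 pages/second the crawler reads about 4.3 million pages per day, the same order of magnitude as the "up to 5 million pages/day" quoted for a standard PC.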
Page Scheduling • Decide which page to read next • discovery (read first) and refresh (read again) • Based on: • Importance of the page • read important pages often • also used to order query results • Change rate of the page • don't read a page that is probably up to date
Page Scheduling for Refresh • Determine the refresh frequency $f_i$ for each page $i$ to minimize a cost function • Minimize $\sum_{i=1}^{N} cost_i(f_i)$ • Under the constraint $\sum_{i=1}^{N} f_i \le G$ • where $cost_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
Cost Function • $cost_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page • Importance of the page • link structure • pub/sub • Staleness of the data • penalty for being out of date • penalty for aging
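The slides do not give the exact form of the cost function, so the following Python sketch is only an illustration under a stated assumption: each page has a positive weight w_i (for instance importance times change rate), its penalty is w_i / f_i, and the total refresh budget is fully used. Under that assumed cost, a Lagrange-multiplier argument gives frequencies proportional to the square root of the weights. The function names and the parameter `budget` are ours.

```python
import math


def refresh_frequencies(weights, budget):
    """Minimise sum(w_i / f_i) subject to sum(f_i) == budget.

    ASSUMPTION: cost_i(f_i) = w_i / f_i with w_i > 0. This is an illustrative
    cost, not the one used by Xyleme. Setting the derivative of the Lagrangian
    to zero gives f_i proportional to sqrt(w_i).
    """
    if budget <= 0 or any(w <= 0 for w in weights):
        raise ValueError("budget and weights must be positive")
    roots = [math.sqrt(w) for w in weights]
    total = sum(roots)
    return [budget * r / total for r in roots]


def total_cost(weights, frequencies):
    """Total penalty under the assumed cost w_i / f_i."""
    return sum(w / f for w, f in zip(weights, frequencies))


if __name__ == "__main__":
    weights = [4.0, 1.0, 1.0]   # hypothetical page weights
    freqs = refresh_frequencies(weights, budget=8.0)
    print(freqs, total_cost(weights, freqs))
```

Under this cost a page that is four times as "heavy" is refreshed only twice as often, which shows how the shape of the penalty, and not just the ranking of pages, drives the schedule.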
Evaluation of Change Rate • Based on the Last Date of Change • provided by the HTTP header of the page • in general reliable but … • Based on the number M of changes detected the last N times the page was refreshed • limits: we do not know the actual number of changes • The first one is more precise
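The second estimate (M changes detected over the last N refreshes) can be sketched as follows. The function name and the fixed refresh interval are assumptions made for the example; as the slide notes, several changes between two reads are detected as one, so the result is only a lower bound on the true change rate.

```python
def detected_change_rate(changed_flags, refresh_interval_days):
    """Estimate changes per day from the last N refreshes.

    changed_flags[j] is True if refresh j found the page modified.
    ASSUMPTION: refreshes are evenly spaced, refresh_interval_days apart.
    Multiple changes between two refreshes count once, so this
    underestimates the actual change rate.
    """
    n = len(changed_flags)
    if n == 0 or refresh_interval_days <= 0:
        raise ValueError("need at least one refresh and a positive interval")
    m = sum(1 for changed in changed_flags if changed)
    return m / (n * refresh_interval_days)


if __name__ == "__main__":
    # Hypothetical history: 2 changes seen in 4 refreshes, one refresh every 2 days.
    print(detected_change_rate([True, False, True, False], 2.0))
```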
Page Importance: Link Structure • Intuition: a page is important if many important pages reference it: fixpoint • Link Matrix • $M(i,j) = 1$ if page $i$ refers to page $j$, 0 otherwise • $M$ is a $10^9 \times 10^9$ matrix • $out(i)$: the outdegree of page $i$ • Fixpoint • $W_0(k) = 1/N$ (initialization) • $W_m(k) = \sum_i M(i,k) \cdot W_{m-1}(i) / out(i)$
Page Importance: Algorithm • M(i,-), line i of the matrix, is stored as a list • computation of $W_m$ line by line • for i = 1 to N do • [ read M(i,-) ; • process the line: for each k in M(i,-), $W_m(k)$ += $W_{m-1}(i) / out(i)$ ]
Page Importance: Fixpoint • Techniques for fixpoint convergence • Some results • convergence is fast (OK after 10 iterations) • single precision suffices • possible on a standard PC • Distribution and incremental evaluation
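The line-by-line fixpoint computation can be sketched in Python as below. Each row M(i,-) is held as a list of target page indices; the function name and the treatment of pages without outgoing links (they pass on nothing) are assumptions of this sketch, since the slides do not say how such pages are handled.

```python
def page_importance(out_links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    out_links[i] is the list of pages that page i points to, i.e. row M(i,-).
    ASSUMPTION: a page with no outgoing links contributes nothing.
    """
    n = len(out_links)
    weights = [1.0 / n] * n                    # W_0(k) = 1/N
    for _ in range(iterations):
        next_weights = [0.0] * n
        for i, row in enumerate(out_links):    # read M(i,-), then process the line
            if not row:
                continue
            share = weights[i] / len(row)      # W_{m-1}(i) / out(i)
            for k in row:
                next_weights[k] += share
        weights = next_weights
    return weights


if __name__ == "__main__":
    # Hypothetical 3-page web: 0 -> 1, 1 -> 0 and 2, 2 -> 0.
    links = [[1], [0, 2], [0]]
    print(page_importance(links))
```

Only one row of the matrix is needed at a time, which is what makes the computation feasible on a single machine when the rows are streamed from disk.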
Page Importance: Refresh • Standard importance for HTML/XML pages • HTML pages are useful only to discover XML • Taking pub/sub into account • Figure legend: circle = HTML, square = XML, triangle = pub/sub
2. XML Repository
Storing XML documents • Relational store (e.g., Oracle 8i) • binary large objects: not possible to access elements directly • strongly typed data in tables: efficient • otherwise: too many joins and inefficient • Object database store (ODMG) • better adapted • Native XML storage: Natix
Natix Repository • Goal • minimize I/O for direct access and scanning • efficient direct accesses using indexing • good compaction but not at the cost of access • Efficient storage of trees • use fixed length storage pages • variable length records inside a page • Main issue: tree balancing
Tree Balancing • Figure: a tree split across Record 1, Record 2, and Record 3
Tree Balancing - continued • Large collections may use several records
3. Semantic Data Integration
Web Heterogeneity • Semantic domains, e.g., cinema • Many possible types for data in this domain, many DTDs • Semantic Integration • one abstract DTD for the domain • gives the illusion that the system maintains a homogeneous database for this domain • 1 domain = 1 abstract DTD
Cluster DTDs and Documents • The relationship is not visible unless one knows the relationship between story and tale.
Discover the Domains • Cluster DTDs sharing similar « tags » using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains • Figure: many concrete DTDs (cdtd1 ... cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)
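As a rough illustration of grouping DTDs that share similar tags, the sketch below clusters DTDs greedily by Jaccard similarity of their tag sets. This is a deliberately simpler stand-in for the frequent-item-set mining and linguistic tools named on the slide, not the method used by Xyleme; the DTD names, tag sets, and threshold are all hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Share of tags common to both sets, between 0.0 and 1.0."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def cluster_dtds(dtd_tags, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough.

    ASSUMPTION: Jaccard similarity on raw tag names replaces the data mining
    and thesaurus-based techniques described on the slide.
    """
    clusters = []
    for name, tags in dtd_tags.items():
        for cluster in clusters:
            representative = dtd_tags[cluster[0]]
            if jaccard(tags, representative) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    # Hypothetical concrete DTDs, each described by its set of tags.
    dtd_tags = {
        "cdtd1": {"movie", "title", "actor"},
        "cdtd2": {"movie", "title", "director"},
        "cdtd3": {"painting", "artist", "museum"},
    }
    print(cluster_dtds(dtd_tags))
```

Raw tag overlap cannot see that story and tale belong together, which is why the slides bring in a thesaurus and other linguistic tools.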
WordNet: Useful Relationships • Synonyms: one concept, two terms • Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car • Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city
Choose an Abstract DTD / Domain • Automatically • The analysis of a cluster leads to « clusters of tags » • Use a thesaurus (e.g., WordNet) to build a hierarchy from the clusters of tags • Manually • Performed by a domain expert • Hybrid
Mapping Concrete to Abstract • For each concrete DTD in a domain, find how it relates to the abstract DTD: • Associate concrete tags to abstract tags using linguistic tools • Provide relationships between paths in the concrete and abstract DTDs, e.g., cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname • Possibly automatic, manual or hybrid
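Once path relationships are known, translating a path of the abstract DTD into concrete paths is a lookup. The Python sketch below shows this with a plain dictionary. The first mapping entry is the example from the slide; the second concrete path, the table layout, and the function name are hypothetical.

```python
# Abstract path -> concrete paths. The cdtd3 entry comes from the slide's
# example; the cdtd5 entry is a made-up second concrete DTD.
MAPPINGS = {
    "adtd2/book/author/name/firstname": [
        "cdtd3/œuvre/nom/prénom",
        "cdtd5/publication/writer/given-name",
    ],
}


def to_concrete(abstract_path, mappings):
    """Return the concrete paths for an abstract path, grouped by concrete DTD."""
    by_dtd = {}
    for path in mappings.get(abstract_path, []):
        dtd = path.split("/", 1)[0]          # first step names the concrete DTD
        by_dtd.setdefault(dtd, []).append(path)
    return by_dtd


if __name__ == "__main__":
    print(to_concrete("adtd2/book/author/name/firstname", MAPPINGS))
```

A query written against the abstract DTD can then be rewritten into one query per concrete DTD, which is what gives the user the illusion of a homogeneous database.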
4. Query Processing
Xyleme Query Language • Today: a mix of OQL and XQL • Tomorrow: the future W3C standard • Example: select product/name, product/price from doc in catalogue, product in doc/product where product//components contains "flash" and product/description contains "camera"
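To make the meaning of the example query concrete, the sketch below evaluates it by hand over a single small document with Python's standard `xml.etree.ElementTree`. It is not the Xyleme query processor: the document is hypothetical, `contains` is read as a plain substring test on the text under a node, and the function names are ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical document of the "catalogue" collection.
CATALOGUE = """
<catalogue>
  <product>
    <name>X23</name>
    <price>359.99</price>
    <description>A new camera</description>
    <components><item>flash</item><item>lens</item></components>
  </product>
  <product>
    <name>Z25</name>
    <price>1299.99</price>
    <description>A desktop PC</description>
    <components><item>flash memory</item></components>
  </product>
</catalogue>
"""


def contains(nodes, word):
    """ASSUMPTION: 'contains' means the word occurs in the text under a node."""
    return any(word in " ".join(node.itertext()) for node in nodes)


def run_query(xml_text):
    """select product/name, product/price ... where the two 'contains' hold."""
    doc = ET.fromstring(xml_text)
    results = []
    for product in doc.findall("product"):                        # product in doc/product
        if (contains(product.findall(".//components"), "flash")   # product//components
                and contains(product.findall("description"), "camera")):
            results.append((product.findtext("name"), product.findtext("price")))
    return results


if __name__ == "__main__":
    print(run_query(CATALOGUE))
```

The `//` step matches components at any depth below product, while the `/` steps match direct children only.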
Data Distribution • Cluster of documents = physical collection of documents (semantic domain) • Distribution • Storage machine • in charge of a cluster of documents • Index machine • index for a cluster