
Organization

A Dynamic Warehouse for the XML Data of the Web. Serge Abiteboul, INRIA & Xyleme SA. Serge.Abiteboul@inria.fr, Serge.Abiteboul@xyleme.com. http://www-rocq.inria.fr/verso, http://www.xyleme.com. Organization: The Web and XML; Xyleme; 1. Data Acquisition and Maintenance; 2. XML Repository


Presentation Transcript


  1. Xyleme, January 2001 -- Zurich • A Dynamic Warehouse for the XML Data of the Web • Serge Abiteboul • INRIA & Xyleme SA • Serge.Abiteboul@inria.fr • Serge.Abiteboul@xyleme.com • http://www-rocq.inria.fr/verso • http://www.xyleme.com

  2. Organization • The Web and XML • Xyleme • 1. Data Acquisition and Maintenance • 2. XML Repository • 3. Semantic Data Integration • 4. Query Processing • 5. Query Subscription • Conclusion

  3. The Web and XML

  4. The Web today • Terabytes of data • Private web: not publicly available pages • Deep web: data hidden behind forms • A lot of public pages • 1 billion pages as of 06/2000 • several million servers

  5. The Web today • Browsing • Search engines • Google indexes more than 1 billion pages (11/00) • in: list of words • out: sorted list of URLs • based on occurrence of words in documents • based on the link structure of the web

  6. The Web today • Queries: keywords to retrieve URLs • Imprecise • Query results cannot be directly processed • Difficult to extract data of interest • Applications: based on hand-made wrappers • Expensive • Incomplete • Short-lived, not adapted to the Web's constant changes

  7. The Coming of XML • HTML • comes from SGML • hypertext language • fixed number of tags • content and presentation are mixed • very difficult to extract data from a page • old standard • XML • also comes from SGML • semistructured data • number of tags not fixed • content and presentation not mixed • very easy to extract data from a page • new standard

  8. HTML = Hypertext Language • Information System table (Ref, Name, Price): X23, Camera, 359.99; R2D2, Robot, 19350.00; Z25, PC, 1299.99 • HTML page: The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. • Text + presentation • Where is the data? Hard to extract

  9. XML = Semistructured Data • Information System table (Ref, Name, Price): X23, Camera, 359.99; R2D2, Robot, 19350.00; Z25, PC, 1299.99; ... • XML: <product-table> <product reference="X23"> <designation> camera </designation> <price unit="Dollars"> 359.99 </price> <description> … </description> </product> <product reference="R2D2"> <designation> Robot </designation> <price unit="Dollars"> 19350 </price> <description> … </description> </product> ... </product-table> • Data + Structure • Easy to extract the data • Semistructured: more flexible

  10. XML: Tree Types • Semantics and structure are in paths • product-table/product/reference • product-table/product/price • Tree diagram: product-table has product children; each product has reference, designation, price, description
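
A minimal sketch of the "semantics and structure are in paths" idea, using Python's standard `xml.etree.ElementTree`. The sample document is a shortened, well-formed version of the product table from the previous slide; the helper name `collect_paths` is hypothetical, and treating an attribute such as `reference` as a path step follows the slide's notation rather than any standard.

```python
import xml.etree.ElementTree as ET

# Shortened version of the product table shown on the XML slide.
PRODUCT_TABLE = """<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>"""


def collect_paths(element, prefix=""):
    """Return the set of tag paths below `element`, attributes included."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    paths = {path}
    for attribute_name in element.attrib:
        # Attributes are written as one more path step, as on the slide.
        paths.add(f"{path}/{attribute_name}")
    for child in element:
        paths |= collect_paths(child, path)
    return paths


root = ET.fromstring(PRODUCT_TABLE)
paths = collect_paths(root)
# Following one path extracts the data directly, with no page scraping.
prices = [float(p.text) for p in root.findall("product/price")]

if __name__ == "__main__":
    for p in sorted(paths):
        print(p)
    print(prices)
```

The same path (`product/price`) works for every product in the table, which is what makes extraction from XML easy compared with the HTML version.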

  11. XML • Very active/noisy field - standards • schema (XML Schema), stylesheet (XSL), resource description (RDF...) • WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the web? • so far rather slow (about 1% of the visible web now; much more in intranets) • much faster since the arrival of Explorer 5.5

  12. Xyleme: A Dynamic Warehouse for the XML Data of the Web

  13. Xyleme • Warehouse • Xyleme stores huge quantities of data (terabytes) • Xyleme is not a search engine (index only) or a mediator (virtual data only) • XML • Xyleme is focused on XML, i.e., trees • Dynamic • Xyleme is interested in data evolution/changes

  14. Xyleme • September 1999: a group of researchers from • Inria Rocquencourt, Verso Group • U. of Mannheim, Database Group • U. of Orsay, IASI Group • CNAM, Vertigo Group • September 2000: creation of a start-up • November 2000: about 15 people

  15. Web Corporate Information Today • ad-hoc applications written by web experts, tailored for specific tasks and data, i.e., inflexible and expensive • manual searches using browsers • manual updates of the Information System

  16. Web Corporate Information with Xyleme • Xyleme warehouse: crawling & interpreting data, Repository, Query Engine • publishing • searches • Xyleme API: queries, updates • Information System

  17. Five Challenges • 1. Data Acquisition and Maintenance: discover data of interest and keep it up to date • 2. Repository: store this data and index it so that it can be processed efficiently • 3. Query Processing: efficiently support an SQL-style query language

  18. Five Challenges - continued • 4. Semantic Integration: understand DTDs and tags, partition the Web into semantic domains, provide a simple view of each domain • 5. Change Control: monitor the web and offer services such as Query Subscription

  19. Challenges - continued • Scale to the web • Size of data: millions/billions of pages • Size of index: terabytes • Number of customers • thousands of simultaneous queries • millions of subscriptions

  20. Functional Architecture • Components shown in the diagram: User Interface, Xyleme Interface, Web Interface, Query Processor, Change Control, Semantic Module, Repository and Index Manager, Loader, Acquisition & Crawler • Internet

  21. Architecture • Cluster of PCs • Developed with Linux and C++ • Communications • local: CORBA • external: HTTP • Distribution between autonomous machines

  22. Architecture • Diagram: machines linked by Ethernet, with access to the Internet • Change Control and Semantic Integration • Acquisition and Maintenance • Index • Loader | Query • Repository

  23. 1. Data Acquisition and Maintenance

  24. Goals • Discover XML pages on the web that are of interest for customers • To do so, crawl the web (HTML+XML) • Keep them up to date • Do this under bounded resources

  25. Life Cycle of a page in Xyleme • The URL of D is discovered as a link in another page (or published by a customer) • The page scheduler decides to read D • The metadata of D is read • type, last_date_update... • The document D is loaded • The document D is (re)read regularly

  26. Main Issues • Loading of pages • we can load up to 5 million pages/day on a standard PC • main cost is Internet connection • Metadata management • Page scheduling • decide which page to read or refresh next

  27. Metadata Management • Example: management of the link matrix • page i points to page j • for 1 billion URLs, about 30 children/URL • matrix has $30 \cdot 10^9$ edges (very sparse) • For each page that is read, • find the IDs of the 30 children • 50 pages/second → 1500 database calls/second
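
The figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the variable names are illustrative, and the density figure is a derived value that does not appear in the original.

```python
# Figures taken from the slide.
urls = 1_000_000_000      # 1 billion URLs
children_per_url = 30     # about 30 outgoing links per page
pages_per_second = 50     # pages read per second

# Number of non-zero entries (edges) in the link matrix.
edges = urls * children_per_url

# Each page read requires one ID lookup per child link.
lookups_per_second = pages_per_second * children_per_url

# Fraction of the urls x urls matrix that is non-zero (derived, not on the slide).
density = edges / (urls * urls)

if __name__ == "__main__":
    print(edges, lookups_per_second, density)
```

With a density of about 3 in 100 million, storing each row as a short list of child IDs is the natural representation.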

  28. Page Scheduling • Decide which page to read next • discovery (read first) and refresh (read again) • Based on: • Importance of the page • read important pages often • also used to order query results • Change rate of the page • do not read a page that is probably up to date

  29. Page Scheduling for Refresh • Determine the refresh frequency $f_i$ for each page $i$ to minimize a cost function • Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $\sum_{i=1}^{N} f_i \le G$ • where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page

  30. Cost Function • $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page • Importance of the page • link structure • pub/sub • Staleness of the data • penalty for being out of date • penalty for aging
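
A rough sketch of how such a constrained minimization could be approximated. The slides do not give the actual cost function or solver, so everything here is an assumption: the penalty `importance * change_rate / (change_rate + f)` is a hypothetical stand-in that decreases as the refresh frequency grows, and the greedy unit-by-unit allocation is a substitute technique, not Xyleme's scheduler.

```python
import heapq


def allocate_refresh(importance, change_rate, budget, step=1.0):
    """Greedily spend `budget` refresh units where they reduce the penalty most.

    Hypothetical cost model (not from the slides):
        cost_i(f) = importance_i * change_rate_i / (change_rate_i + f)
    """
    def cost(i, f):
        if change_rate[i] <= 0:
            return 0.0  # a page that never changes is never stale
        return importance[i] * change_rate[i] / (change_rate[i] + f)

    n = len(importance)
    freq = [0.0] * n
    # Max-heap (negated gains) of the penalty reduction from one more unit.
    heap = [(-(cost(i, 0.0) - cost(i, step)), i) for i in range(n)]
    heapq.heapify(heap)
    remaining = budget
    while remaining >= step and heap:
        _, i = heapq.heappop(heap)
        freq[i] += step
        remaining -= step
        gain = cost(i, freq[i]) - cost(i, freq[i] + step)
        heapq.heappush(heap, (-gain, i))
    return freq


if __name__ == "__main__":
    # Two equally important pages; the first changes far more often.
    print(allocate_refresh([1.0, 1.0], [10.0, 0.1], budget=4))
```

Because the assumed cost is convex and decreasing in the frequency, giving each unit to the page with the largest marginal gain keeps the total within the budget while favouring important, fast-changing pages.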

  31. Evaluation of Change Rate • Based on the Last Date of Change • provided by the HTTP header of the page • in general reliable but … • Based on the number M of changes detected the last N times the page was refreshed • limits: the actual number of changes is not known • The first method is more precise
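
The second estimate (M changes detected over the last N refreshes) can be sketched as follows. The function name and the assumption of a constant interval between refreshes are illustrative and not taken from the slides.

```python
def change_rate_from_refreshes(changed_flags, refresh_interval_days):
    """Estimate changes per day from the last N refreshes.

    `changed_flags[k]` is True when refresh k found the page modified.
    Several changes between two refreshes are seen as one, so the result
    is a lower bound on the true change rate (the limit noted on the slide).
    """
    n = len(changed_flags)
    if n == 0 or refresh_interval_days <= 0:
        raise ValueError("need at least one refresh and a positive interval")
    m = sum(1 for changed in changed_flags if changed)
    return m / (n * refresh_interval_days)


if __name__ == "__main__":
    # 2 changes detected over 4 refreshes spaced 2 days apart.
    print(change_rate_from_refreshes([True, False, True, False], 2.0))
```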

  32. Page Importance: Link Structure • Intuition: a page is important if many important pages reference it: fixpoint • Link Matrix • $M(i,j) = 1$ if page $i$ refers to page $j$ • $M$ is a $10^9 \times 10^9$ matrix • $\mathrm{out}(i)$: the outdegree of page $i$ • Fixpoint • $W_0(k) = 1/N$ (initialization) • $W_m(k) = \sum_i [ M(i,k) \cdot W_{m-1}(i) / \mathrm{out}(i) ]$

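  33. Page Importance: Algorithm • Diagram: while processing line $M(i,-)$, $W_m(k)$ is incremented by $W_{m-1}(i)/\mathrm{out}(i)$ for each page $k$ listed in the line • $M(i,-)$ is stored as a list • computation of $W_m$ line by line • for i = 1 to N do • [ read M(i,-) ; • process the line ]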
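
The line-by-line computation can be sketched in Python as below. This is an illustration of the fixpoint formula from the previous slide, not Xyleme's C++ implementation; the function name, the in-memory adjacency lists, and the choice to let pages without outgoing links contribute nothing are all assumptions.

```python
def page_importance(links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    `links[i]` is the list of pages that page i points to, i.e. line M(i,-).
    """
    n = len(links)
    weights = [1.0 / n] * n                # W_0(k) = 1/N
    for _ in range(iterations):
        next_weights = [0.0] * n
        for i, line in enumerate(links):   # read M(i,-) one line at a time
            if not line:
                continue                   # assumption: no out-links, no contribution
            share = weights[i] / len(line) # W_{m-1}(i) / out(i)
            for k in line:                 # process the line
                next_weights[k] += share
        weights = next_weights
    return weights


if __name__ == "__main__":
    # Page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
    print(page_importance([[1, 2], [2], [0]]))
```

Each pass reads every line once and keeps only two weight vectors in memory, which is why the slide stores M as lists and processes it sequentially.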

  34. Page Importance: Fixpoint • Techniques for fixpoint convergence • Some results • convergence is fast (OK after 10 iterations) • single precision suffices • possible on a standard PC • Distribution and incremental evaluation

  35. Page Importance: Refresh • Standard importance for HTML/XML pages • HTML pages are useful only to discover XML • Taking pub/sub into account • Diagram legend: circle = HTML, square = XML, triangle = pub/sub

  36. 2. XML Repository

  37. Storing XML documents • Relational store (e.g., Oracle 8i) • binary large objects: elements cannot be accessed directly • strongly typed data stored in tables: efficient • otherwise: too many joins, inefficient • Object database store (ODMG) • better adapted • Native XML storage: Natix

  38. Natix Repository • Goal • minimize I/O for direct access and scanning • efficient direct accesses using indexing • good compaction but not at the cost of access • Efficient storage of trees • use fixed length storage pages • variable length records inside a page • Main issue: tree balancing

  39. Tree Balancing • Diagram: a tree split across Record 1, Record 2, Record 3

  40. Tree Balancing - continued • Large collections may use several records

  41. 3. Semantic Data Integration

  42. Web Heterogeneity • Semantic domains, e.g., cinema • Many possible types for data in this domain, many DTDs • Semantic Integration • one abstract DTD for the domain • gives the illusion that the system maintains a homogeneous database for this domain • 1 domain = 1 abstract DTD

  43. Cluster DTDs and Documents • The relationship is not visible unless one knows the relationship between story and tale.

  44. Discover the Domains • Cluster DTDs sharing similar « tags » using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains • Diagram: many concrete DTDs (cdtd1 to cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)
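
To make the clustering step concrete, here is a deliberately simplified sketch. It does not use frequent item sets or linguistic tools as the slide describes; it substitutes plain Jaccard similarity between tag sets with single-link grouping, and the tag sets, the threshold, and the function names are invented for illustration.

```python
def jaccard(tags_a, tags_b):
    """Share of tags two DTDs have in common (0.0 when both are empty)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def cluster_dtds(dtd_tags, threshold=0.5):
    """Group DTDs whose tag sets overlap by at least `threshold`."""
    names = list(dtd_tags)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for index, a in enumerate(names):
        for b in names[index + 1:]:
            if jaccard(dtd_tags[a], dtd_tags[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical concrete DTDs described by their tag sets.
    example = {
        "cdtd1": {"movie", "title", "actor"},
        "cdtd2": {"movie", "title", "director"},
        "cdtd3": {"book", "author", "title"},
    }
    print(cluster_dtds(example))
```

Each resulting group plays the role of one domain, for which a single abstract DTD would then be chosen.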

  45. WordNet: Useful Relationships • Synonyms: one concept, two terms • Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car • Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city

  46. Choose an Abstract DTD / Domain • Automatically • The analysis of a cluster leads to « clusters of tags » • Use a thesaurus (e.g., WordNet) to build a hierarchy from the clusters of tags • Manually • Performed by a domain expert • Hybrid

  47. Mapping Concrete to Abstract • For each concrete DTD in a domain, find how it relates to the abstract DTD: • Associate concrete tags to abstract tags using linguistic tools • Provide relationships between paths in the concrete and abstract DTD • E.g.: cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname • Possibly automatic, manual or hybrid
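
One simple way to picture the path relationships is a lookup table from abstract paths to the concrete paths they stand for. The sketch below reuses the single example from the slide; the table layout and the function name `concrete_paths` are assumptions, not Xyleme's actual data structure.

```python
# Abstract path -> concrete paths it corresponds to (example from the slide).
PATH_MAPPINGS = {
    "adtd2/book/author/name/firstname": ["cdtd3/œuvre/nom/prénom"],
}


def concrete_paths(abstract_path, mappings=PATH_MAPPINGS):
    """Return the concrete paths to query for one abstract path."""
    return list(mappings.get(abstract_path, []))


if __name__ == "__main__":
    print(concrete_paths("adtd2/book/author/name/firstname"))
```

A query written against the abstract DTD can then be rewritten into one query per concrete path before it reaches the repository.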

  48. 4. Query Processing

  49. Xyleme Query Language • Today: a mix of OQL and XQL • Tomorrow: the future W3C standard • Example: select product/name, product/price from doc in catalogue, product in doc/product where product//components contains "flash" and product/description contains "camera"
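
To show what the example query asks for, the sketch below evaluates the same conditions over a small in-memory document with Python's standard `xml.etree.ElementTree`. This is not the Xyleme query engine: the sample catalogue is invented, and `contains` is assumed here to mean a case-insensitive substring match on the element's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue document, invented for this example.
CATALOGUE = """<catalogue>
  <product>
    <name>X23</name><price>359.99</price>
    <description>new camera</description>
    <kit><components>flash, strap</components></kit>
  </product>
  <product>
    <name>Z25</name><price>1299.99</price>
    <description>desktop PC</description>
    <components>keyboard</components>
  </product>
</catalogue>"""


def text_contains(element, word):
    """Assumed meaning of `contains`: substring match on all nested text."""
    return word in "".join(element.itertext()).lower()


def select_products(xml_text):
    """Return (name, price) for products matching both conditions."""
    doc = ET.fromstring(xml_text)
    results = []
    for product in doc.findall("product"):            # product in doc/product
        components = product.findall(".//components")  # product//components
        descriptions = product.findall("description")  # product/description
        if (any(text_contains(c, "flash") for c in components)
                and any(text_contains(d, "camera") for d in descriptions)):
            results.append((product.findtext("name"), product.findtext("price")))
    return results


if __name__ == "__main__":
    print(select_products(CATALOGUE))
```

The `//` step matches `components` at any depth below the product, which is why the first product qualifies even though its components are nested inside `kit`.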

  50. Data Distribution • Cluster of documents = physical collection of documents (semantic domain) • Distribution • Storage machine • in charge of a cluster of documents • Index machine • index for a cluster
