Organization

Xyleme, January 2001 -- Zurich • A Dynamic Warehouse for the XML data of the Web • Serge Abiteboul, INRIA & Xyleme SA • Serge.Abiteboul@inria.fr, Serge.Abiteboul@xyleme.com • http://www-rocq.inria.fr/verso, http://www.xyleme.com

Organization • The Web and XML • Xyleme • 1. Data Acquisition and Maintenance • 2. XML Repository • 3. Semantic Data Integration • 4. Query Processing • 5. Query Subscription • Conclusion

The Web and XML

The Web today • Terabytes of data • Private web: pages that are not publicly available • Deep web: data hidden behind forms • A lot of public pages • 1 billion in [06/2000] • several million servers

The Web today • Browsing • Search engines • Google indexes more than 1 billion pages (11/00) • in: list of words • out: sorted list of URLs • based on occurrence of words in documents • based on the link structure of the web

The Web today • Queries: keywords to retrieve URLs • Imprecise • Query results cannot be directly processed • Difficult to extract data of interest • Applications: based on hand-made wrappers • Expensive • Incomplete • Short-lived, not adapted to the Web's constant changes

The Coming of XML • HTML • comes from SGML • hypertext language • fixed number of tags • content and presentation are mixed • very difficult to extract data from a page • old standard • XML • also comes from SGML • semistructured data • tags are not fixed • content and presentation are not mixed • very easy to extract data from a page • new standard

HTML • HTML = Hypertext Language • Text + presentation
Information System table:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML page: "The new X23 camera replaces the X22. It comes equipped with a flash (worth 53.99 $ by itself) and provides great quality for only 359.99 $."
• Going from the HTML page back to the Information System is hard: where is the data?

XML • XML = Semistructured Data • Data + structure
Information System table:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...
XML document:
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
• Going from the XML document back to the Information System is easy • Semistructured: more flexible

XML: Tree Types • Semantics and structure are in paths • product-table/product/reference • product-table/product/price • [Figure: tree with nodes product-table, product, reference, price, designation, description]
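The path view of an XML tree can be illustrated with a minimal Python sketch. It assumes a small well-formed sample modeled on the product table (descriptions omitted) and uses the standard `xml.etree.ElementTree` module. The helper name `leaf_paths` is hypothetical, and treating an attribute such as `reference` as a path step is an assumption that mirrors the paths listed on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the product table.
SAMPLE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
""".strip()


def leaf_paths(element, prefix=""):
    """Yield (path, value) pairs for attributes and text leaves."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    # Attributes are exposed as one more step in the path (assumption).
    for name, value in element.attrib.items():
        yield f"{path}/{name}", value
    children = list(element)
    if not children and element.text and element.text.strip():
        yield path, element.text.strip()
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(SAMPLE)
pairs = list(leaf_paths(root))
prices = [v for p, v in pairs if p == "product-table/product/price"]
references = [v for p, v in pairs if p == "product-table/product/reference"]
```

Selecting by path alone is enough to pull the prices or references out of the document, which is the sense in which the semantics sit in the paths.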

XML • Very active/noisy field - standards • schema (XML schema), stylesheet (XSL), resource description (RDF...) • WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the web? • so far rather slow (about 1% now of the visible web; much more in intranets) • much faster since the arrival of Explorer 5.5

Xyleme • A Dynamic Warehouse for the XML Data of the Web

Xyleme • Warehouse • Xyleme stores huge quantities of data (terabytes) • Xyleme is not a search engine (only an index) or a mediator (only virtual data) • XML • Xyleme is focused on XML, i.e., trees • Dynamic • Xyleme is interested in data evolution/changes

Xyleme • September 1999: a group of researchers from • Inria Rocquencourt, Verso Group • U. of Mannheim, Database Group • U. of Orsay, IASI Group • CNAM, Vertigo Group • September 2000: creation of a start-up • November 2000: about 15 people

Corporate Information Today • Between the Web and the Information System: • ad-hoc applications written by web experts, tailored for specific tasks and data, i.e., inflexible and expensive • manual searches using browsers • manual updates

Corporate Information with Xyleme • [Figure labels: Web; Xyleme warehouse (crawling & interpreting data, repository, query engine); publishing; searches; Xyleme API; queries; updates; Information System]

Five Challenges • 1. Data Acquisition and Maintenance: discover data of interest and keep it up to date • 2. Repository: store this data and index it so that it can be processed efficiently • 3. Query Processing: efficiently support an SQL-style query language

Five Challenges - continued • 4. Semantic Integration: understand DTDs and tags, partition the Web into semantic domains, provide a simple view of each domain • 5. Change Control: monitor the Web and offer services such as Query Subscription

Challenges - continued • Scale to the web • Size of data: millions/billions of pages • Size of index: terabytes • Number of customers • thousands of simultaneous queries • millions of subscriptions

Functional Architecture • [Figure: modules User Interface, Xyleme Interface, Web Interface, Acquisition & Crawler, Loader, Change Control, Semantic Module, Query Processor, Repository and Index Manager; the Internet is shown as a separate layer]

Architecture • Cluster of PCs • Developed with Linux and C++ • Communications • local: Corba • external: HTTP • Distribution between autonomous machines

Architecture • [Figure: machines linked by Ethernet and connected to the Internet; machine roles: Change Control and Semantic Integration, Acquisition and Maintenance, Index, Loader | Query, Repository]

1. Data Acquisition and Maintenance

Goals • Discover XML pages on the web that are of interest to customers • For this, crawl the web (HTML+XML) • Keep them up to date • Do this under bounded resources

Life Cycle of a page in Xyleme • The URL of a document D is discovered as a link in another page (or published by a customer) • The page scheduler decides to read D • The metadata of D is read • type, last_date_update... • The document D is loaded • The document D is (re)read regularly

Main Issues • Loading of pages • we can load up to 5 million pages/day on a standard PC • main cost is the Internet connection • Metadata management • Page scheduling • decide which page to read or refresh next

Metadata Management • Example: management of the link matrix • page i points to page j • for 1 billion URLs, about 30 children/URL • matrix has 30 × 10^9 edges (very sparse) • For each page that is read, • find the IDs of the 30 children • 50 pages/second → 1500 database calls/second
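The orders of magnitude on this slide can be rechecked with a few lines of Python. The three input figures come from the slide; the pages-per-day figure and the matrix density are derived here and are not stated by the speaker.

```python
# Figures taken from the slide; everything else is derived arithmetic.
URLS = 10**9              # 1 billion URLs
CHILDREN_PER_URL = 30     # average number of out-links per page
PAGES_PER_SECOND = 50     # pages read per second

edges = URLS * CHILDREN_PER_URL                       # non-zero cells
lookups_per_second = PAGES_PER_SECOND * CHILDREN_PER_URL
pages_per_day = PAGES_PER_SECOND * 24 * 60 * 60
density = edges / (URLS * URLS)                       # fraction of non-zero cells
```

At 50 pages/second the crawler reads 4.32 million pages per day, in line with the "up to 5 million pages/day" quoted under Main Issues, and only about 3 cells in 100 million of the link matrix are non-zero.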

Page Scheduling • Decide which page to read next • discovery (read first) and refresh (read again) • Based on: • Importance of the page • read important pages often • also used to order query results • Change rate of the page • don’t read a page that is probably up-to-date

Page Scheduling for Refresh • Determine the refresh frequency $f_i$ for each page $i$ to minimize a cost function • Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $G \ge \sum_{i=1}^{N} f_i$ • where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page

Cost Function • $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page • Importance of the page • link structure • pub/sub • Staleness of the data • penalty for being out of date • penalty for aging
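To make the constrained minimization concrete, here is a minimal Python sketch under loudly stated assumptions. The slides do not give the actual cost function, so the sketch uses a hypothetical penalty cost_i(f_i) = w_i · r_i / (2 · f_i), with w_i the importance and r_i the change rate of page i, and it reads G as a total refresh budget. With that penalty, Lagrange multipliers give f_i proportional to sqrt(w_i · r_i). The function names and sample numbers are illustrative only.

```python
import math


def allocate_refresh(importance, change_rate, budget):
    """Split a total refresh budget across pages.

    Assumed (hypothetical) penalty: cost_i(f_i) = w_i * r_i / (2 * f_i).
    Minimizing the sum under sum(f_i) = budget gives f_i proportional
    to sqrt(w_i * r_i). All inputs must be strictly positive.
    """
    scores = [math.sqrt(w * r) for w, r in zip(importance, change_rate)]
    total = sum(scores)
    return [budget * s / total for s in scores]


def total_cost(importance, change_rate, freqs):
    """Sum of the assumed per-page penalties."""
    return sum(w * r / (2 * f)
               for w, r, f in zip(importance, change_rate, freqs))


importance = [0.5, 0.3, 0.2]     # hypothetical page weights
change_rate = [4.0, 1.0, 0.25]   # hypothetical changes per day
budget = 6.0                     # hypothetical refreshes per day (G)

freqs = allocate_refresh(importance, change_rate, budget)
uniform = [budget / len(importance)] * len(importance)
```

Comparing `total_cost` for `freqs` and for `uniform` shows the point of the optimization: under the same budget, refreshing important, fast-changing pages more often lowers the total penalty.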

Evaluation of Change Rate • Based on the Last Date of Change • provided by the HTTP header of the page • in general reliable but … • Based on the number M of changes detected the last N times the page was refreshed • limits: the actual number of changes is not known • The first one is more precise
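Both estimates can be sketched in a few lines of Python. This is an illustration under assumptions, not the Xyleme implementation: the function names are hypothetical, the refresh interval is assumed constant, and the header is assumed to be a standard HTTP `Last-Modified` value, parsed with the standard library's `email.utils.parsedate_to_datetime`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def rate_from_detections(changes_detected, refreshes, interval_days):
    """Estimate changes/day from M changes seen in the last N refreshes.

    This is a lower bound: several changes between two refreshes
    are detected as a single one.
    """
    return changes_detected / (refreshes * interval_days)


def days_since_last_change(last_modified_header, now):
    """Age of the page in days, from an HTTP Last-Modified value."""
    last_change = parsedate_to_datetime(last_modified_header)
    return (now - last_change).total_seconds() / 86400


# Hypothetical observations.
now = datetime(2001, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
age = days_since_last_change("Mon, 08 Jan 2001 12:00:00 GMT", now)
rate = rate_from_detections(changes_detected=3, refreshes=10, interval_days=2)
```

The detection-based estimate undercounts whenever a page changes more than once between two reads, which is the limit mentioned on the slide.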

Page Importance: Link Structure • Intuition: a page is important if many important pages reference it: fixpoint • Link Matrix • $M(i,j) = 1$ if page $i$ refers to page $j$, 0 otherwise • $M$ is a $10^9 \times 10^9$ matrix • $\mathrm{out}(i)$: the outdegree of page $i$ • Fixpoint • $W_0(k) = 1/N$ (initialization) • $W_m(k) = \sum_i M(i,k)\, W_{m-1}(i) / \mathrm{out}(i)$

Page Importance: Algorithm • M(i,-) is stored as a list • computation of $W_m$ line by line • for i = 1 to N do • [ read M(i,-); • process the line: for each k in M(i,-), $W_m(k)$ += $W_{m-1}(i)/\mathrm{out}(i)$ ]

Page Importance: Fixpoint • Techniques for fixpoint convergence • Some results • convergence is fast (OK after 10 iterations) • single precision suffices • possible on a standard PC • Distribution and incremental evaluation
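The line-by-line fixpoint computation can be sketched in Python as follows. This is a minimal in-memory illustration, assuming every page has at least one out-link and ignoring the distribution and incremental evaluation mentioned on the slide. The function name `page_importance` and the three-page example are hypothetical.

```python
def page_importance(out_links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    out_links[i] is the list form of row M(i,-): the pages that page i
    points to. Assumes every page has at least one out-link.
    """
    n = len(out_links)
    weights = [1.0 / n] * n                  # W_0(k) = 1/N
    for _ in range(iterations):
        new_weights = [0.0] * n
        for i, row in enumerate(out_links):  # read M(i,-), one line at a time
            share = weights[i] / len(row)    # W_{m-1}(i) / out(i)
            for k in row:                    # process the line
                new_weights[k] += share
        weights = new_weights
    return weights


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[1, 2], [2], [0]]
w = page_importance(links, iterations=60)
```

On this toy graph the weights settle at 0.4, 0.2, and 0.4: page 1 is referenced only by page 0, which splits its weight between two links, so it ends up half as important as the other two.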

Page Importance: Refresh • Standard importance for HTML/XML pages • HTML pages are useful only to discover XML • Taking pub/sub into account • [Figure legend: circle = HTML, square = XML, triangle = pub/sub]

2. XML Repository

Storing XML documents • Relational store (e.g., Oracle 8i) • binary large objects: elements cannot be accessed directly • strongly typed data stored in tables: efficient • otherwise: too many joins, hence inefficient • Object database store (ODMG) • better adapted • Native XML storage: Natix

Natix Repository • Goal • minimize I/O for direct access and scanning • efficient direct accesses using indexing • good compaction but not at the cost of access • Efficient storage of trees • use fixed length storage pages • variable length records inside a page • Main issue: tree balancing

Tree Balancing • [Figure: a tree stored across Record 1, Record 2, and Record 3]

Tree Balancing - continued • Large collections may use several records

3. Semantic Data Integration

Web Heterogeneity • Semantic domains, e.g., cinema • Many possible types for data in this domain, many DTDs • Semantic Integration • one abstract DTD for the domain • gives the illusion that the system maintains a homogeneous database for this domain • 1 domain = 1 abstract DTD

Cluster DTDs and Documents • The relationship is not visible unless one knows the relationship between "story" and "tale".

Discover the Domains • Cluster DTDs sharing similar « tags », using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.), to obtain domains • [Figure: many concrete DTDs (cdtd1 to cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)]
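The idea of grouping DTDs by shared tags can be illustrated with a small Python sketch. Note the substitution: instead of frequent item sets and linguistic tools, it uses plain Jaccard similarity on tag sets with a greedy one-pass clustering, which is only a stand-in for the techniques named on the slide. The DTD contents, the threshold, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Share of tags two DTDs have in common (tag sets must be non-empty)."""
    return len(a & b) / len(a | b)


def cluster_dtds(dtd_tags, threshold=0.3):
    """Greedy clustering: put each DTD in the first cluster whose
    first member is similar enough, else start a new cluster."""
    clusters = []
    for name, tags in dtd_tags.items():
        for cluster in clusters:
            if jaccard(tags, dtd_tags[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical concrete DTDs, reduced to their tag sets.
dtd_tags = {
    "cdtd1": {"movie", "title", "director", "actor"},
    "cdtd2": {"film", "title", "director", "cast"},
    "cdtd3": {"book", "title", "author", "publisher"},
    "cdtd4": {"novel", "title", "author", "isbn"},
}
clusters = cluster_dtds(dtd_tags)
```

Purely syntactic overlap misses pairs such as "movie" and "film"; this is where a thesaurus would raise the similarity between DTDs of the same domain.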

WordNet: Useful Relationships • Synonyms: one concept, two terms • Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car • Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city

Choose an Abstract DTD / Domain • Automatically • The analysis of a cluster leads to « clusters of tags » • Use a thesaurus (e.g., WordNet) to build a hierarchy from the clusters of tags • Manually • Performed by a domain expert • Hybrid

Mapping Concrete to Abstract • For each concrete DTD in a domain, find how it relates to the abstract DTD: • Associate concrete tags to abstract tags using linguistic tools • Provide relationships between paths in the concrete and abstract DTD E.g.: cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname • Possibly automatic, manual or hybrid
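Once the path relationships are known, a query written against the abstract DTD can be rewritten into concrete paths with a lookup table. The Python sketch below is a minimal illustration: only the first mapping entry comes from the slide, the second entry (`cdtd4/novel/writer/first`) is invented, and the function name `to_concrete` and the prefix-matching rule are assumptions.

```python
# Hypothetical mapping table for one domain: abstract path -> concrete paths.
# Only the first concrete path comes from the slide; the second is invented.
PATH_MAP = {
    "adtd2/book/author/name/firstname": [
        "cdtd3/œuvre/nom/prénom",
        "cdtd4/novel/writer/first",
    ],
}


def to_concrete(abstract_path, path_map=PATH_MAP):
    """Return the concrete paths that implement an abstract path,
    using the longest mapped prefix and keeping the unmapped suffix."""
    steps = abstract_path.split("/")
    for cut in range(len(steps), 0, -1):
        prefix = "/".join(steps[:cut])
        if prefix in path_map:
            suffix = steps[cut:]
            return ["/".join([p] + suffix) for p in path_map[prefix]]
    return []  # no concrete DTD is known to carry this information


concrete = to_concrete("adtd2/book/author/name/firstname")
unknown = to_concrete("adtd2/book/price")
```

One abstract path fans out into one path per concrete DTD, which is how a single query over the domain view can reach documents with different structures.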

4. Query Processing

Xyleme Query Language • Today: a mix of OQL and XQL • Tomorrow: the future W3C standard • Example:
select product/name, product/price
from doc in catalogue, product in doc/product
where product//components contains "flash"
  and product/description contains "camera"
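To show what the example query selects, here is a minimal Python sketch that evaluates the same conditions over one document with the standard `xml.etree.ElementTree` module. It is not the Xyleme query processor: the catalogue document is hypothetical, `contains` is interpreted as a plain substring test (an assumption), and the helper names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue document; element names follow the query.
CATALOGUE = """
<catalogue>
  <product>
    <name>X23</name><price>359.99</price>
    <description>compact camera</description>
    <kit><components>flash, strap</components></kit>
  </product>
  <product>
    <name>Z25</name><price>1299.99</price>
    <description>desktop PC</description>
    <kit><components>keyboard, mouse</components></kit>
  </product>
</catalogue>
""".strip()


def text_of(element):
    """All text below an element, concatenated."""
    return "".join(element.itertext())


def run_query(doc):
    """select name, price of products matching both 'contains' tests."""
    results = []
    for product in doc.findall("product"):             # product in doc/product
        components = product.findall(".//components")  # product//components
        description = product.find("description")      # product/description
        if (any("flash" in text_of(c) for c in components)
                and description is not None
                and "camera" in text_of(description)):
            results.append((product.findtext("name"),
                            product.findtext("price")))
    return results


rows = run_query(ET.fromstring(CATALOGUE))
```

The `//` step matches `components` at any depth below `product`, so the query does not need to know that the sample nests it under `kit`.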

Data Distribution • Cluster of documents = physical collection of documents (semantic domain) • Distribution • Storage machine • in charge of a cluster of documents • Index machine • index for a cluster
