XML Warehousing and Xyleme S. Abiteboul INRIA and Xyleme Serge.Abiteboul@inria.fr December 2002
Organization • The context and motivations • XML warehouse • Xyleme: An XML warehouse Zooms on some aspects of the technology • Scaling • Mass storage of XML • XML query processing • Semantic integration • Web page ranking • Query subscription • Xyleme : the company, in very brief
The context: the Web and XML are dramatically changing the world of distributed information
The Web of yesterday • Protocol: HTTP • Documents: HTML • Millions of independent web sites and billions of documents • Browsing and keyword search (full-text indexing) • Publication of databases using forms • Data management with the Web • HTML is primarily for humans • Data management applications on the Web • Based on hand-made wrappers • Expensive, incomplete, short-lived, not adapted to the Web's constant change • No real support for distributed data management!
What is changing: information used to live in islands and a lot of its value was wasted • Different formats: relational, meta data, documents and text, data exchange formats… • A Web standard for data exchange, XML, is fixing it • XML can capture all kinds of information across a wide spectrum • XML comes with a family of emerging standards: XML Schema, XSLT, XQuery, domain-specific schemas… • Different computers, platforms, languages, applications • Web services, e.g., SOAP, are fixing it • SOAP allows ubiquitous computing on the Internet • SOAP comes with a family of emerging standards: WSDL, UDDI
What is changing • XML and Web services provide uniform access to information, independent of platform, system, language, communication protocol and data format… • The dream for distributed data management • The gathering, integration, consolidation, and analysis of distributed information become feasible at a much lower cost
(1) XML covers the information spectrum [Figure: kinds of information placed on a spectrum labeled "Minimal structure", "Hierarchy + Meta data" and "Structured Data". Examples shown: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes]
XML covers the information spectrum • Very structured information such as databases • Most DBMS now export in XML • Semi-structured data such as data exchange formats (ASN.1, SGML), e.g., technical documentation • Documents • Meta data: author, date, status • Existing structure in them: chapter, section, table of contents and index • Possibly tagging of elements in them (citations, lists) • Links to other documents • Meta data for unstructured data such as images and sound • Plain text XML
XML’s asset: the marriage of text and structure labeled ordered trees where leaves are text • Marriage of document and database worlds • Marriage of full text indexing (keyword search) and structure indexing (SQL-style query) • Is it the ultimate data model? No • Purely syntax – more semantics needed • Is it OK for now? Definitely yes (because it is a standard)
XML’s asset: typing • Applications need typing and XML data can be typed if needed (DTD and XML Schema) • Trees • Logical granularity: neither page nor document level, but the piece of information that is needed • Semantics and structure are in tags and paths • product-table/product/reference • product-table/product/price [Figure: tree with nodes product-table, product, reference, price, designation, description]
HTML: hard. Text + presentation. Where is the data?
Data in the information system:
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
The same data published as HTML:
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. The new robot <b>R2D2</b>…
XML: easy. Data + Structure = Semistructured (presentation elsewhere)
Data in the information system:
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
...
The same data published as XML:
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
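As a minimal sketch of why the XML version is "easy", the following Python snippet recovers the table rows from a well-formed copy of the slide's document using only the standard library. The function name `extract_rows` and the empty `<description/>` elements are assumptions made for illustration; this is not Xyleme's loader.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the slide's example; descriptions are left empty
# because the slide elides them.
PRODUCT_TABLE = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description/>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description/>
  </product>
</product-table>
"""

def extract_rows(xml_text):
    """Return one (reference, designation, price, unit) tuple per product."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        price = product.find("price")
        rows.append((
            product.get("reference"),                  # attribute on <product>
            product.findtext("designation").strip(),   # text of a child element
            float(price.text),                         # float() tolerates the padding spaces
            price.get("unit"),
        ))
    return rows

rows = extract_rows(PRODUCT_TABLE)
for row in rows:
    print(row)
```

The tags and paths carry the structure, so no screen-scraping heuristics are needed, which is the contrast with the HTML slide.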
(2) Web services and ubiquitous distributed computing • Possibility to activate a method on some remote web server • Exchange information in XML: input and result are in XML • Ubiquitous XML distributed computing infrastructure • 2 main applications • E-commerce • Access to remote data • With XML and Web services, it is possible • To get information from virtually anywhere • To provide information to virtually anywhere
Accessing remote information [Figure: an application using gene banks queries data services that provide candidate genes and uses several processing services; the sources have heterogeneous formats, protocols, etc.]
Same with web services [Figure: the same application using gene banks now queries the data services and uses the processing services through the Web, with uniform access to information]
XML and Web services • Exchange of information • E-commerce, B2B, G2C • Cooperative work • Information brokers • Web sites, portals • Content publication in general • Mediation mode: get the XML pages when needed • Warehouse mode: load them in advance
Advantages of a warehouse approach • Allows for support of complex query processing with high performance • Allows for complex analysis of the data • Allows for enriching the information • Allows for better monitoring of information • Allows for versioning, archiving, temporal queries if needed • Mediator approach is preferable or compulsory in some applications • Supply chain • Comparative shopping • Typically for volatile information such as plane ticket price
Main functionalities [Figure: warehouse overview. Modules: Feeding, Repository, Enrichment, View & Integration, Exploitation. Interfaces: Admin GUI, User GUIs (Access, Reporting, Sub; Editing & Pub), APIs. Connected uses: Warehousing (data warehouse), Analysis (OLAP)]
Main functionalities (1) Feeding • Loading from the Web (Internet and Intranet) • Web search • Web crawl • Access Web data via forms or Web services • Plug-ins to load from • File systems, document management systems • Databases, LDAP • Newsgroups, emails • Other applications • Extraction and transformation • XSLT or XQuery mappings for XML sources • XML-izers to load data from other formats • Monitoring of the feeding
Main functionalities (1) Feeding (continued) • User feeding • Document editing • Meta data editing • Using WebDAV protocol • Publication • By GUI or from programs (SOAP-based API)
Main functionalities (2) Repository • Storage of massive volumes of XML (terabytes) • Indexing of massive volumes of XML • By structure • By full-text • Linguistic support: stemming, synonyms, etc. • Very efficient XML query processing • Importance ranking • Monitoring of the warehouse (support for subscriptions) • Access control and security • Versioning, archiving • Recovery • No full transaction mechanism
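To illustrate what indexing "by structure" and "by full-text" can mean together, here is a small in-memory sketch in Python. Everything in it (the two dictionaries, `index_document`, `search`, the document ids) is a hypothetical illustration, not Xyleme's index format; it ignores linguistic support such as stemming and synonyms, and it only indexes the text directly under each element.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

path_index = defaultdict(set)   # element path -> ids of documents containing it
word_index = defaultdict(set)   # word -> (document id, path of the enclosing element)

def index_document(doc_id, xml_text):
    """Record every element path and every word with the path where it occurs."""
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        path_index[path].add(doc_id)
        for word in re.findall(r"\w+", (elem.text or "").lower()):
            word_index[word].add((doc_id, path))
        for child in elem:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)

def search(word, path=None):
    """Keyword search, optionally restricted to one structural path."""
    hits = word_index.get(word.lower(), set())
    return sorted({doc for doc, p in hits if path is None or p == path})

index_document(
    "d1",
    "<product-table><product><designation>camera</designation>"
    "<price>359.99</price></product></product-table>",
)
index_document("d2", "<news><title>New camera released</title></news>")

print(search("camera"))                  # keyword only
print(search("camera", "news/title"))    # keyword within a given structure
```

Combining the two lookups is what lets a query ask for a word inside a particular element rather than anywhere in the document.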
Main functionalities (3) Enrichment • Global organization • Global schema management • Management of collections • Incorporate domain ontologies and thesauri • Document classification • Cleaning by filtering out documents from collections, etc. • Document enrichment • Concept extraction and tagging • Cleaning inside the document • Summarization, etc. • Relationships between documents • Tables of contents • Indexes • Cross referencing, etc.
Main functionalities (4) View and integration • View management • Document restructuring/mapping • Schema to schema mapping • Semantic integration • Manual for complex ones and (semi-) automatic for simple ones • Tools to analyze a set of schemas • Tools to integrate them • Processing for queries on integration view • Management of virtual data in a mediator style
Main functionalities (5) Exploitation • Access to the warehouse • Browsing • Querying by keywords, XPath or XQuery • Temporal queries • Query subscription • Reporting • Generation of complex reports with pointers to documents, counts, abstracts… • Organized by collections, content, domains… • By GUI or from programs (Web service-based API)
Admin: Specify the lifecycle of information in the warehouse starting from its acquisition • Specify with parameters (in red): documents to process • Add from a toolbox, some processing to apply (in pink) • Specify when processing should be applied (in green) [Figure: a processing flow. Loading from /u/news/*, start now → Transformation by some XML-izer X → Monitoring of Y → Concept Tagging (off) → Indexing → Storage in Collection Z → Classification (off)]
Specifying the enrichment • What processing should be performed • Applications that come with the system • Arbitrary processing provided as Web services • Interface of services • XML input: the documents or collection of documents in the warehouse to be processed • XML output: the result • Where to plug the result • Where to store the new documents (collections, names) • Where to put enrichments in existing documents • When to start the processing • At the time the document is loaded • At some later time, assuming some information has already been gathered (dependencies)
User: queries and reporting • FROM clause: choose the collections of interest • WHERE clause: choose the criteria of selection • SELECT clause: choose what to extract as a result • PREFER clause: quantity of results, preference ranking and possible relaxation • ORGANIZE clause: classify/group results for presentation and drilling • STYLE clause: choose presentation style
Example
From collections MuséeRodin, WebMuseum, LACMA
Where Art_Item/artist[Name="Rodin"]
Select Name, Owner, Annotations
Prefer • Rodin in title page • Owner is public or owner is in France • Get first 20
Organize as • Art_Item/material: sculpture, painting, others • Owner
Present as …
Xyleme: An XML warehouse. Zooms on some aspects of the technology
Xyleme: a dynamic XML warehouse • Scaling • Feeder • E.g., loading millions of Web documents per day with a single PC, and scaling up with more machines • Repository • E.g., storing and indexing terabytes of XML (and other formats, e.g., PDF) • Enrichment • E.g., tools (together with a partner) for classification and concept extraction • View and semantic integration • E.g., a suite of tools for XML integration • Exploitation • E.g., access via SOAP and graphic interfaces
The scaling • Size of data: billions of XML documents • Size of data and index: terabytes • Number of customers • thousands of simultaneous queries • millions of subscriptions An architecture based on distribution
Architecture • Cluster of PCs • Written in C++, runs on Linux (also Solaris) • Communications • local: CORBA (ORBacus) • external: HTTP, SOAP • Distribution between autonomous machines
Functional architecture [Figure: modules connected to the Internet. User Interface, Web Interface, Xyleme Interface, Acquisition & Crawler, Loader, Change Control, Semantic Module, Query Processor, Repository and Index Manager]
Architecture and scaling [Figure: machines linked by Ethernet and connected to the Internet. Several machines each run Change Control and Semantic Integration, Acquisition and Maintenance, Index, Loader/Query, or Repository]
2. Data Acquisition and Maintenance of Web pages (internet or intranet)
Crawl the Web • Discover HTML/XML pages on the web (intranet or internet) • Parse/load pages and follow links • Manage metadata for the known pages • Do this under bounded resources • Network bandwidth • Memory and disk resources • Tested on the Internet in October 2001 • Millions of pages crawled per day on each crawler • Up to 10 crawlers and close to 1 billion HTML/XML pages discovered in a couple of months
Optimization: Page Scheduling • Optimization problem • Decide which page to crawl or refresh next to optimize the quality of the warehouse • Criteria: • Read important pages more often • Based on customer’s preferences • Page importance can also be used to order query results • Don’t read a page that is probably up-to-date • Uses an estimate of the change frequency for each page • Advantages • Have a fresh view of useful portions of information
Page scheduling • Determine which page to read next • minimize a particular cost function under some constraint (bandwidth of crawlers) • The penalty for a page takes into account: • importance of the page (to be defined next) • customer needs (obtained via pub/sub) • staleness of the data • penalty for being out of date • penalty for aging • The page scheduler fully controls the crawling • vs. random crawling in classic search engines
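The slides do not give the cost function, so the following Python sketch only illustrates the general shape of such a scheduler under loudly stated assumptions: the penalty of a page is taken to be its importance times the probability that it has changed since the last read, and that probability assumes changes arrive as a Poisson process at the page's estimated rate. The `Page` fields, `penalty`, and `next_pages` are hypothetical names, and the bandwidth constraint is reduced to a fixed number of fetches.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    importance: float    # assumed to already combine link importance and customer needs
    change_rate: float   # estimated changes per day (assumption: Poisson process)
    last_read: float     # time of the last fetch, in days

def penalty(page, now):
    """Assumed penalty: importance weighted by the chance the stored copy is stale."""
    age = now - page.last_read
    prob_changed = 1.0 - math.exp(-page.change_rate * age)
    return page.importance * prob_changed

def next_pages(pages, now, budget):
    """Pick the `budget` pages whose refresh removes the most penalty."""
    return heapq.nlargest(budget, pages, key=lambda p: penalty(p, now))

pages = [
    Page("a", importance=0.9, change_rate=0.01, last_read=9.0),  # important, rarely changes
    Page("b", importance=0.5, change_rate=2.0, last_read=5.0),   # changes often, read long ago
    Page("c", importance=0.1, change_rate=5.0, last_read=0.0),   # minor but surely stale
]
chosen = next_pages(pages, now=10.0, budget=2)
print([p.url for p in chosen])
```

Note how the important page "a" is skipped here because it is probably up-to-date, which matches the criterion of not reading pages that are unlikely to have changed.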
Page Importance • Based on customer’s criteria and on the link structure of the web • Intuition: a page is important if many important pages reference it • Fixpoint definition: importance vector Imp • Proposed by IBM; used by search engines such as Google • Link matrix: M(i,j) = 1 if page i refers to page j, 0 otherwise • Outdegree of page i: out(i) • Initialization, with N the number of pages: $\mathrm{Imp}_0(k) = 1/N$ • Iteration: $\mathrm{Imp}_m(k) = \sum_i M(i,k)\,\mathrm{Imp}_{m-1}(i)/\mathrm{out}(i)$ • Imp is the limit of $\mathrm{Imp}_m$
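The fixpoint definition can be sketched as a plain iteration in Python, reading the iteration step as a sum over all pages i that refer to page k. This is the off-line textbook computation, not the on-line algorithm mentioned on the next slide. The function name, the dictionary representation of the link matrix, and the fixed iteration count are assumptions; the sketch also assumes every page has at least one outgoing link, since the slide does not say how pages with out(i) = 0 are handled.

```python
def page_importance(links, iterations=50):
    """links maps each page to the list of pages it refers to.

    Assumes every page appears as a key and has at least one outgoing link.
    """
    pages = sorted(links)
    n = len(pages)
    imp = {p: 1.0 / n for p in pages}           # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for i, targets in links.items():
            share = imp[i] / len(targets)       # Imp_{m-1}(i) / out(i)
            for k in targets:                   # every k with M(i,k) = 1
                nxt[k] += share
        imp = nxt
    return imp

# Tiny example web: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
imp = page_importance(links)
for page in sorted(imp):
    print(page, round(imp[page], 3))
```

On this small graph the values settle near 0.4 for a and c and 0.2 for b: page c is referenced by both other pages, and a inherits all of c's importance.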
Page Importance • Novel technology developed by Xyleme • Patent pending • On-line evaluation of page importance • Uses far fewer resources • Faster reaction to changes on the web
Storing XML • Document systems • Good for keyword search • No or inefficient support for structure search • Relational store (e.g., Oracle 8i) • Well adapted for some applications • Very typed data and tables: efficient • Otherwise: too many joins and inefficient • Object database store (e.g., eXcelon) and native XML databases (e.g., Tamino) • Same issues • Xyleme: native XML storage
Repository • Goal • minimize I/O for direct access and scanning • efficient direct access with both full-text indexing and structure indexing • good compaction but not at the cost of access • Efficient storage of trees • use fixed-length storage pages • variable-length records inside a page • Main issue: tree balancing
Tree Balancing [Figure: a document tree split across Record 1, Record 2 and Record 3]
Tree Balancing Large collections may use several records
Classification • Based on word occurrences in document and statistical resources • Classification by semantic domain • Classification by language • Use the XXX classifier
Semantic Integration • Web Heterogeneity • Many possible types for data in a particular domain, many DTDs • Semantic Integration • one abstract DTD for the domain • gives the illusion that the system maintains a homogeneous database for this domain • 1 domain = 1 abstract DTD
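One way to picture "1 domain = 1 abstract DTD" is a table that maps each path of the abstract DTD to the corresponding path in every concrete DTD, so that a query written once against the abstract path is rewritten per source. The Python sketch below is only an illustration: the DTD names `dtd_a` and `dtd_b`, the sample documents, and the hand-written mapping are hypothetical, and the slides do not describe how Xyleme actually represents or derives such mappings.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: abstract path -> concrete path (relative to the
# document root) for each concrete DTD of the domain.
MAPPING = {
    "Art_Item/artist/Name": {
        "dtd_a": "author",
        "dtd_b": "painter/name",
    },
}

# Hypothetical warehouse content: (DTD of the document, document text).
DOCS = [
    ("dtd_a", "<work><title>Le Penseur</title><author>Rodin</author></work>"),
    ("dtd_b", "<painting><painter><name>Monet</name></painter></painting>"),
    ("dtd_b", "<painting><painter><name>Rodin</name></painter></painting>"),
]

def query(abstract_path, value):
    """Return positions in DOCS of documents whose mapped path holds `value`."""
    hits = []
    for position, (dtd, xml_text) in enumerate(DOCS):
        concrete = MAPPING[abstract_path].get(dtd)
        if concrete is None:
            continue  # this DTD has no counterpart for the abstract path
        root = ET.fromstring(xml_text)
        if any((e.text or "").strip() == value for e in root.findall(concrete)):
            hits.append(position)
    return hits

print(query("Art_Item/artist/Name", "Rodin"))
```

The user only sees the abstract path; the two differently structured sources answer as if they shared one schema, which is the "illusion" of a homogeneous database.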