Integrating Keyword Search into XML Query Processing • XML Query Language (XML-QL) • Extending XML-QL with Keyword Search • Extended XML-QL Implementation Using RDBMS • Presentation by: Alex Kremer, Ariel Rosenblatt
Bibliography (well-formed, but invalid) • Article elements come from different sources • They carry the same information, but use different XML Schemas / DTDs (Document Type Definitions)
XML Queries • XML is becoming the Data Storage and Exchange Format of choice in many applications • Handling of XML data requires a rich and powerful Query Language • Allow for querying the content and structure of an XML document • Varying or unknown structures can make formulating queries very difficult
XML Queries: Why not SQL/OQL • XML is not rigidly structured • In XML the schema can exist with the data, as tag names • If a DTD is not available, the schema is built while the document is parsed • Elements may be missing or may occur multiple times • This flexibility is crucial for EDI (Electronic Data Interchange)
XML Query Requirements W3C Working Group • Goals: • Support different usage scenarios • Define data model + query operators • Define query language syntax • Interoperate with other XML working groups
XML Query Requirements: Usage Scenarios • Human-readable documents • Manuals, Books, Articles • Data-oriented documents • XML representation of: • Database data, Object data, … • XML representation might be either: • Physical or Virtual
XML Query Requirements: Usage Scenarios Contd. • Mixed-model documents: • Hybrid of document-oriented and data-oriented • Catalogues, Patient health records, … • Administrative data: • Configuration files, User profiles, Administrative logs
XML Query Requirements: Usage Scenarios Contd. • Filtering streams: • On-line: filtering / extracting / transforming / routing, of XML data streams • Logs of email messages, Network packets, Stock market data, Newswire feeds • Document Object Model (DOM) • Perform queries on DOM structures to return sets of nodes that meet the specified criteria
XML Query Requirements: Usage Scenarios Contd. • Multiple syntactic environments for queries embedded in: • URL, XML, JSP or ASP pages, a string in a general-purpose programming language • …
XML Query Requirements: Interoperability • Results must be returned in a DOM compatible manner • XPath (used in XPointer and XSLT) • XPath expressibility and search facilities should be used in query syntax • Usage of XML Schema (XSDL) and/or DTD
XML Query Languages: Proposals to W3C • XQL (heavily based on XPath) • XML-QL
XML-QL • It is declarative • It is “relationally complete”; in particular it can express joins • It is simple enough to enable optimizations • It can extract data from existing XML documents and construct new documents (transformations)
XML-QL: Syntax • The WHERE clause specifies how to filter data from the input XML dataset • The CONSTRUCT clause specifies how to assemble the query results in XML
WHERE ( xml-pattern [ ELEMENT_AS $elem_var ] )* IN url, ( predicate )*
CONSTRUCT xml-pattern | $variable
XML-QL: Example #1 • Find the articles that have an author whose name matches *Florescu*:
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like *Florescu*
CONSTRUCT <result> $E </result>
XML-QL Explained: The Data Model • A set of XML documents must be represented (the XML dataset) • XML elements in a dataset can be partitioned according to their types • The information must be represented losslessly (the original dataset must be recreatable from the representation)
XML-QL Explained: Data Model Representation • [Figure: graph representation of the bibliography dataset. The root node ID00 (Bibliography) has four article edges leading to nodes ID01, ID04, ID08 and ID14. Below them, edges labeled id, link, date, source, title, author and name lead to further element nodes (ID02, ID03, ID05 to ID07, ID09 to ID13) and to leaves holding data values such as “Daniela Florescu”, “W3C” and “20000815”.]
XML-QL Explained: Data Model Representation • Dataset D is represented as a graph GD: • Nodes: • Each element e becomes a node Ne, uniquely labeled IDe • Each data value v becomes a leaf Lv, uniquely labeled v • Edges: • (Ne, Ne’) labeled with the tag of e’, if e’ is directly nested within e (<e><e’>…</e’></e>) • (Ne, Lv) labeled with the empty string “”, if v is directly contained within e (<e>v</e>) • (Ne, Lv) labeled with the attribute name a, if v is the value of attribute a of element e (<e a=“v”>…</e>)
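The node and edge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the `SAMPLE` document, the `build_graph` helper and the `("leaf", value)` encoding of leaves are assumptions made for the example, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical one-article stand-in for bibliography.xml.
SAMPLE = """<bibliography>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""


def build_graph(xml_text):
    """Return (root_id, edges); each edge is a (source, label, target) triple."""
    ids = count()
    edges = []

    def visit(elem):
        node = f"ID{next(ids):02d}"          # element node, uniquely labeled
        for attr, value in elem.attrib.items():
            # attribute edge: labeled with the attribute name, ends in a leaf
            edges.append((node, attr, ("leaf", value)))
        text = (elem.text or "").strip()
        if text:
            # directly contained value: edge labeled with the empty string
            edges.append((node, "", ("leaf", text)))
        for child in elem:
            # nesting edge: labeled with the tag of the nested element
            edges.append((node, child.tag, visit(child)))
        return node

    return visit(ET.fromstring(xml_text)), edges
```

Because a leaf is identified only by its value, two elements holding the same text point at the same leaf, which mirrors the "uniquely labeled v" rule.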
XML-QL Explained: Query Processing • An XML pattern can also be modeled by a graph • Some labels in the graph are now variables • The result of evaluating query q on the input D is: • Every mapping from the graph Gq to the graph GD which preserves the constant labels • Each such mapping induces a substitution of the query variables by constant values
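A rough sketch of this evaluation, assuming a simplified setting: the pattern is matched directly against the element tree instead of an explicit graph, a pattern is a hypothetical `(tag, variable)` or `(tag, [sub-patterns])` tuple, and the `like` predicate is approximated with shell-style wildcards. The sample data is invented for the illustration.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical stand-in for bibliography.xml (not the original dataset).
SAMPLE = """<bibliography>
  <article id="1"><source>W3C</source><title>Sample title A</title></article>
  <article id="3"><title>Sample title B</title><author name="Daniela Florescu"/></article>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
    <author><name>Another Author</name></author>
  </article>
</bibliography>"""

# Query graph of Example #1.
PATTERN = ("article", [("author", [("name", "$N")]), ("title", "$T")])


def match(elem, pattern, binding):
    """Yield every extension of `binding` that maps `pattern` onto `elem`."""
    tag, content = pattern
    if elem.tag != tag:                 # constant labels must be preserved
        return
    if isinstance(content, str):        # a variable binds the element's text
        yield {**binding, content: (elem.text or "").strip()}
        return
    yield from match_children(elem, content, binding)


def match_children(elem, subpatterns, binding):
    """Map each sub-pattern onto some child of `elem`, threading the binding."""
    if not subpatterns:
        yield binding
        return
    first, rest = subpatterns[0], subpatterns[1:]
    for child in elem:
        for extended in match(child, first, binding):
            yield from match_children(elem, rest, extended)


def example1(xml_text):
    """Return (article id, title) for every mapping that satisfies the predicate."""
    results = []
    for article in ET.fromstring(xml_text).iter("article"):
        for b in match(article, PATTERN, {}):
            if fnmatchcase(b["$N"], "*Florescu*"):
                results.append((article.get("id"), b["$T"]))
    return results
```

On the sample, only the article whose author has a nested name element is returned; the article that stores the name as an attribute does not match, which is the limitation the later examples address.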
XML-QL Explained: A Query Graph for Example #1
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like *Florescu*
CONSTRUCT <result> $E </result>
[Figure: query graph. An article node with a title edge leading to $T, and an author edge followed by a name edge leading to “*Florescu*”.]
XML-QL Explained: Query Processing, Example #1 • [Figure: the query graph of Example #1 matched against the data graph of the bibliography dataset. Annotations: ID01 has no <author>; ID04 has no <name> element, because “name” is an attribute there; ID08 matches and is added to the results, with $E = ID08 and $T = “Integrating Keyword Search…”.]
XML-QL: Advanced Queries, Example #2 (More Florescu) • We now also look for articles where the author name is an attribute:
WHERE <article>
        <*><author><name>$N</name></author></*>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like *Florescu*
CONSTRUCT <result> $E </result>
union
WHERE <article>
        <*><author><_ name=$N></_></author></*>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like *Florescu*
CONSTRUCT <result> $E </result>
XML-QL: Disadvantages • We need to know the XML structure in order to query • We can still write more effective queries that retrieve all the available information, but • These queries can easily grow very complex, as seen previously
XML-QL: Keyword Search Extension • Addition of a special predicate called contains to XML-QL • Tests the existence of a given word within an XML element • Works on partially known or unknown XML structure • Allows querying several XML documents with different structures
Extended XML-QL: The contains Predicate • The contains predicate has 4 arguments, ($E, word, depth, location): • $E is an XML element variable • word is the word we are searching for • depth is an integer expression limiting the depth at which the word may be found within the element • location is a boolean expression over the set of constants {tag_name, attribute_name, content, attribute_value}
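The four arguments can be illustrated with a naive, index-free evaluation over a parsed element. Several things here are assumptions rather than part of the presentation: the depth convention (an element's own tag is at depth 0, its attribute names and text at depth 1, attribute values at depth 2, and each nesting level adds 1), the modelling of `location` as a plain set of allowed constants (a disjunction only), and the helper names.

```python
import re
import xml.etree.ElementTree as ET

ANY = {"tag_name", "attribute_name", "content", "attribute_value"}


def tokenize(text):
    """Split text into words; None is treated as empty."""
    return re.findall(r"\w+", text or "")


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for everything at or below `elem`."""
    yield elem.tag, depth, "tag_name"
    for name, value in elem.attrib.items():
        yield name, depth + 1, "attribute_name"
        for word in tokenize(value):
            yield word, depth + 2, "attribute_value"
    for word in tokenize(elem.text):
        yield word, depth + 1, "content"
    for child in elem:
        yield from occurrences(child, depth + 1)


def contains(elem, word, depth, locations=ANY):
    """True if `word` occurs within `depth` levels of `elem` at an allowed location."""
    return any(w == word and d <= depth and loc in locations
               for w, d, loc in occurrences(elem))
```

For instance, `contains(author, "Florescu", 3, {"content", "attribute_value"})` holds both for `<author><name>Daniela Florescu</name></author>` and for `<author name="Daniela Florescu"/>`, so one predicate covers the two structures that Example #2 needed a union for.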
Extended XML-QL: Example #3 • We can use the extended XML-QL to formulate a query which yields the same result as Example #2:
WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      contains($A, "Florescu", 3, content or attribute_value)
CONSTRUCT <result> $E </result>
Extended XML-QL: Example #4 • We are able to query unstructured data (full-text search) within a set of articles:
WHERE <article></article> ELEMENT_AS $E IN "bibliography.xml",
      contains($E, "Florescu", 3, any)
CONSTRUCT <result> $E </result>
Implementing the contains predicate • The authors suggest implementing the XML-QL extension on top of a commercial RDBMS: • Oracle 8, IBM DB2, MS-SQL, …
Implementation Using RDBMS • Reasons: • Easy to implement an extended XML query processor • Universally available • An RDBMS allows mixing XML data with other (relational) data • Very good performance over large volumes of data
Relational Support for Full-text Indexing • Extended Inverted Files are used to implement: • The contains predicate • Finding relevant XML data sources (URLs) in a distributed environment • We will use an RDBMS to implement the Inverted Files
Inverting Files • For our needs the inverted file will contain tuples of the following format: • <word, elID, depth, location> • Examples from bibliography.xml: • <“article”, elID01, 0, tag> • <“id”, elID01, 1, attr> • <“Requirements”, elID01, 2, value>
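A sketch of how such tuples could be produced from a document, under stated assumptions: elements are numbered in document order with made-up `elIDnn` labels, depth is measured relative to each enclosing element (tag at 0, attribute names and text one level below their element, attribute values two levels below), and locations use the four constants of the contains predicate instead of the abbreviations on the slide. The sample document is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment, chosen to reproduce the shape of the slide's examples.
SAMPLE = """<bibliography>
  <article id="1"><title>Sample Requirements</title></article>
</bibliography>"""


def inverted_file(xml_text):
    """Return <word, elID, depth, location> tuples for every enclosing element."""
    postings = []
    ancestors = []          # (elID, absolute level) of the currently open elements
    next_id = 0

    def emit(word, level, location):
        # One posting per open element, with depth relative to that element.
        for el_id, base in ancestors:
            postings.append((word, el_id, level - base, location))

    def visit(elem, level):
        nonlocal next_id
        ancestors.append((f"elID{next_id:02d}", level))
        next_id += 1
        emit(elem.tag, level, "tag_name")
        for name, value in elem.attrib.items():
            emit(name, level + 1, "attribute_name")
            for word in re.findall(r"\w+", value):
                emit(word, level + 2, "attribute_value")
        for word in re.findall(r"\w+", elem.text or ""):
            emit(word, level + 1, "content")
        for child in elem:
            visit(child, level + 1)
        ancestors.pop()

    visit(ET.fromstring(xml_text), 0)
    return postings
```

With this convention the article element (`elID01`) receives the postings `("article", "elID01", 0, "tag_name")`, `("id", "elID01", 1, "attribute_name")` and `("Requirements", "elID01", 2, "content")`, matching the depths of the three examples above.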
Storing Inverted Files in RDBMS: Unique Internal elIDs • Unique element IDs are modeled as records containing: • Document locators (URLs) • Element locators within the document • Using absolute positions (start, end) • Using unique identifiers specified by DTD (explicit id attribute) • Why not XPointer?
Storing Inverted Files in RDBMS: Unique elID Schema • After normalization the authors propose the following schema: • Elements(elID, docid, start_pos, end_pos, type, id_val) • Documents(docid, URL) • From this point on, elID can be used as an internal key for faster processing
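The two relations can be tried out with SQLite from the Python standard library. Column types, key constraints, the example URL and the positions are assumptions for illustration; only the table and column names come from the slide.

```python
import sqlite3


def resolve_element(el_id):
    """Create the two tables, load one sample row each, and locate `el_id`."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Documents (docid INTEGER PRIMARY KEY, URL TEXT NOT NULL);
        CREATE TABLE Elements (
            elID      INTEGER PRIMARY KEY,   -- internal key used by the index
            docid     INTEGER NOT NULL REFERENCES Documents(docid),
            start_pos INTEGER,               -- absolute position in the document
            end_pos   INTEGER,
            type      TEXT,                  -- tag name of the element
            id_val    TEXT                   -- explicit id attribute, if any
        );
        INSERT INTO Documents VALUES (1, 'http://example.org/bibliography.xml');
        INSERT INTO Elements VALUES (8, 1, 120, 340, 'article', '4');
    """)
    row = db.execute(
        "SELECT d.URL, e.start_pos, e.end_pos "
        "FROM Elements e JOIN Documents d ON d.docid = e.docid "
        "WHERE e.elID = ?",
        (el_id,),
    ).fetchone()
    db.close()
    return row
```

The index tables then only need to carry the compact elID; the URL and position are looked up once an element has qualified.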
Storing Inverted Files in RDBMS • The natural way is a single table with the schema: • contains(elID, word, depth, location) • That table is huge, so we partition it into one table per keyword <word> in the dataset: • <word>(elID, depth, location) • Virtually all IR (Information Retrieval) systems partition by word
Storing Inverted Files in RDBMS: Further Partitioning • We use further partitioning to optimize the query processing: • The type (tag) of the element is usually known at predicate evaluation time, by looking at the XML pattern of the query • We therefore partition the individual <word> tables by the type of the element they refer to: • <word>-<type>(elID, depth, location) • Table examples: Name-author, Florescu-name
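The effect of partitioning by word and then by element type can be mimicked with a dictionary keyed by the `<word>-<type>` table name. The postings below are hypothetical, and `partition` and `lookup` are illustrative helpers, not functions from the presentation.

```python
from collections import defaultdict

# Hypothetical postings: (word, elID, element type, depth, location).
POSTINGS = [
    ("Florescu", "elID05", "author", 2, "attribute_value"),
    ("Florescu", "elID09", "author", 2, "content"),
    ("Florescu", "elID11", "name", 1, "content"),
    ("Florescu", "elID08", "article", 3, "content"),
    ("Integrating", "elID08", "article", 2, "content"),
]


def partition(postings):
    """Group postings into one <word>-<type>(elID, depth, location) table each."""
    tables = defaultdict(list)
    for word, el_id, el_type, depth, location in postings:
        tables[f"{word}-{el_type}"].append((el_id, depth, location))
    return dict(tables)


def lookup(tables, word, el_type, max_depth):
    """Return the elIDs of `el_type` elements containing `word` within `max_depth`."""
    rows = tables.get(f"{word}-{el_type}", [])
    return [el_id for el_id, depth, _ in rows if depth <= max_depth]
```

A predicate on an author variable then reads only the small `Florescu-author` table, and a word that never occurs in that element type yields an empty result without scanning anything.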
Implementation: Extended XML-QL Query Processing • Two Ways: • Replicating the whole XML data in an RDBMS • XML-QL processing is entirely performed in an RDBMS • Distributed XML Query Processing • only index (contains) is stored in an RDBMS
Replicating the XML Data in an RDBMS • The binary table approach: • For each type (tag name or attribute name), a table is built with the following schema: • <type>(parent, element, value) • parent is the element that contains the element of type <type> • element is null if a <type> has no sub-elements or if <type> is an attribute name (in that case we are usually interested in the value)
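A small sketch of shredding a document into binary tables, kept in plain Python dictionaries for brevity. The integer IDs, the `shred` helper and the sample document are assumptions; as a small departure from the rule above, an element that has attributes but no sub-elements still receives an ID here so that its attribute rows have a parent to point at.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical one-article stand-in for bibliography.xml.
SAMPLE = """<bibliography>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""


def shred(xml_text):
    """Return {type: [(parent, element, value), ...]} for one document."""
    tables = defaultdict(list)
    ids = count(1)

    def visit(elem, parent_id):
        text = (elem.text or "").strip() or None
        # Leaf elements keep only a value; `element` stays None for them.
        elem_id = next(ids) if len(elem) or elem.attrib else None
        tables[elem.tag].append((parent_id, elem_id, text))
        for name, value in elem.attrib.items():
            # Attributes get a row in the table named after the attribute.
            tables[name].append((elem_id, None, value))
        for child in elem:
            visit(child, elem_id)

    visit(ET.fromstring(xml_text), None)
    return dict(tables)
```

Each parent/child step in a query pattern then becomes one equality join between the `element` column of one table and the `parent` column of another.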
Replicating the XML Data in an RDBMS: XML-QL Queries • Every XML-QL query can be translated into an equivalent SQL query • The SQL query will process the binary tables of the replicated XML data
XML-QL to SQL: Example #5 (from Example #1)
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like *Florescu*
CONSTRUCT <result> $E </result>
is translated into:
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent /* title exists */
  AND name.value LIKE '%Florescu%'
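The translated query can be run against hand-loaded binary tables with SQLite. The rows and integer IDs are invented for the illustration, and the wildcard pattern is written with SQL's `%`, which is how `*Florescu*` would be expressed in a LIKE predicate.

```python
import sqlite3


def run_example5():
    """Load a few hypothetical binary-table rows and run the Example #5 query."""
    db = sqlite3.connect(":memory:")
    for table in ("article", "author", "name", "title"):
        db.execute(f"CREATE TABLE {table} (parent INTEGER, element INTEGER, value TEXT)")
    # Two articles (elements 2 and 5) under the root element 1.
    db.executemany("INSERT INTO article VALUES (?, ?, ?)", [(1, 2, None), (1, 5, None)])
    db.executemany("INSERT INTO title VALUES (?, ?, ?)",
                   [(2, None, "Integrating Keyword Search"), (5, None, "Sample title B")])
    db.executemany("INSERT INTO author VALUES (?, ?, ?)",
                   [(2, 3, None), (2, 4, None), (5, 6, None)])
    db.executemany("INSERT INTO name VALUES (?, ?, ?)",
                   [(3, None, "Daniela Florescu"), (4, None, "Another Author"),
                    (6, None, "Another Author")])
    rows = db.execute("""
        SELECT article.element
        FROM article, author, name, title
        WHERE article.element = author.parent
          AND author.element = name.parent
          AND article.element = title.parent   -- title exists
          AND name.value LIKE '%Florescu%'
    """).fetchall()
    db.close()
    return [row[0] for row in rows]
```

Only article 2 qualifies: it has a title and one of its authors has a name containing the keyword.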
Extended XML-QL to SQL: Keyword Search • Processing the contains predicate involves usage of inverted file tables • The word-type table has to be joined with the previous result • The word-type table is the resulting table of the word by type partitioning
Extended XML-QL to SQL: Example #6
WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$Ttext</title> ELEMENT_AS $T
      </article> ELEMENT_AS $E IN "bibliography.xml",
      contains($A, "Florescu", 3, any),
      contains($T, "Integrating", 3, any)
CONSTRUCT <result> $Ttext </result>
is translated into:
SELECT title.value
FROM article, author, title, "Florescu-author", "Integrating-title"
WHERE article.element = author.parent
  AND author.element = "Florescu-author".elID
  AND article.element = title.parent
  AND title.element = "Integrating-title".elID
Distributed XML Query Processing • XML data can be indexed in the RDBMS, but • The XML data itself cannot be stored in the RDBMS • Reasons: volume (the entire WWW) or legal restrictions • The mediator (query interface): • Uses the inverted files in the RDBMS, but • Accesses the data sources to compute the full query result (expensive!) • Loads the relevant documents/elements into the RDBMS and processes the query as described before (XML-QL to SQL)
Distributed XML Query Processing: Elements Retrieval • Use of Inverted Files for the retrieval of relevant documents/elements: • Evaluate contains predicates to disqualify irrelevant elements • Further reduce the dataset needed to process the remaining basic XML-QL query • This is an optimization since retrieval of remote data is expensive • Load the relevant documents/elements
Distributed XML Query Processing: Reducing Retrieval
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $T like *XML*
CONSTRUCT <result> $N </result>
• Get the intersection of the elID sets from: • author-article • name-article • title-article • XML-article
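The pruning step amounts to a set intersection over the relevant `<word>-article` tables. In the sketch below the table contents are hypothetical and reduced to their elID column, depth and location filters are left out, and `candidate_elements` is an illustrative helper name.

```python
# Hypothetical elID columns of four <word>-article tables.
WORD_TABLES = {
    "author-article": {"elID04", "elID08"},
    "name-article": {"elID04", "elID08"},
    "title-article": {"elID01", "elID04", "elID08"},
    "XML-article": {"elID01", "elID08"},
}


def candidate_elements(word_tables, words, el_type):
    """Return the elIDs of `el_type` elements that contain every word in `words`."""
    sets = [word_tables.get(f"{word}-{el_type}", set()) for word in words]
    return set.intersection(*sets) if sets else set()
```

The result is a superset of the true answer (the word XML might occur outside the title), so the surviving articles still have to be fetched and checked against the full query; the saving is that every other article is never retrieved.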
Conclusions • XML-QL can be extended to support keyword search • Use of RDBMS: • Inverted Files can be stored and queried using an RDBMS • XML data itself can be replicated and queried in the RDBMS • Keyword search and overall XML query processing can be carried out very efficiently • Influence of data structure: • The more structure is known, the faster a query will be executed • Totally unstructured queries can be executed very fast • The more structure is known, the higher the quality of the query results