Learn about the various indexing techniques for XML data, including structural index, content index, and adaptive path index. Understand the requirements, data guides, and variations of each technique.
XML Indexing Techniques • Requirements • Dataguide and Variation • Index Fabric • Adaptive Path Index • Node Numbering Scheme • Compact Structural Summary • Conclusion
Requirements • XML queries navigate data using regular path expressions (e.g., XPath) • /Livre//Auteur[@specialite="informatique"] • Accessing all elements with the same name string • Ancestor-descendant relationship between elements • Content-based access on values included in text
Index Types • Structural index • Accessing all elements of given name • Ancestor-descendant and parent-child relationship between elements • Content index • Accessing elements containing given keywords • Supporting most text search functionalities
Classical Content Index • Classically based on inverted lists • For each term, gives the doc ID + location • Several variations allow different search types • Offset, relative, proximity • Generally stored in a B+-tree to optimize the search for a given word • Size is an important issue • Memory and disk • (word, location): fixed-size entry (word repeated) • (word, frequency, (location)*): variable-length entry • Example (word : locations) • t1 : doc1-100, doc1-300, doc3-200, … • t2 : doc2-30, doc4-70, … • t3 : doc4-87, doc5-754, …
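The inverted lists described above can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for the B+-tree, word positions are used as offsets, and the names `build_inverted_index` and `lookup` are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, word_offset) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # The offset is the position of the word in the document (0-based).
        for offset, match in enumerate(re.finditer(r"\w+", text.lower())):
            index[match.group()].append((doc_id, offset))
    return index


def lookup(index, term):
    """Return the postings of a term, or an empty list if it is absent."""
    return index.get(term.lower(), [])


docs = {"doc1": "XML index for XML data", "doc2": "A path index"}
index = build_inverted_index(docs)
print(lookup(index, "index"))
```

Storing the word once with a variable-length posting list corresponds to the (word, frequency, (location)*) layout; the frequency is simply the length of the list.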
Problem with XML • Support of element addressing • Doc ID should include NodeId (XPath) + offset • Index size becomes very large • XPaths are long • Support of typed data • Integer, float, simple types of XML Schema • Requires classical indexes for certain elements • Query processing • Structural joins • Text search • Exact search • Support of updates • Incremental updates would be a plus
Evaluation Criteria • Identifiers • Per node or per document • Descendant/Ancestor Search • By join algo. • By graph traversal • By OID comparison • Keyword Search • By element scan • By B-tree traversal • Update • Incremental • Index size • Entry number • Entry size
2-Dataguide and Variation • [Goldman & Widom, VLDB 97] • Dynamic schemas help in query formulation • Concise and accurate structural summaries • Every path in the database has one and only one corresponding path in the DataGuide with the same sequence of labels • A legal label path: Restaurant/Name • Target set for e = Restaurant/Entree is Ts(e) = {6, 10, 11} • DocId can be added to identifiers
Dataguide Principle • To achieve conciseness • a DataGuide describes every unique label path of a source exactly once. • To ensure accuracy • a DataGuide encodes no label path that does not appear in the source. • And for convenience • a DataGuide is itself an object (OEM or XML). • [Figure: dataguide and targeted dataguide; target sets include {2,3} and {6,10,11}]
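For tree-shaped data, the principle above can be sketched as a map from each distinct label path to its target set. The sample tree, its node identifiers, and the name `build_dataguide` are made up for illustration and are not the slide's figure; graph-shaped data would need a more involved construction.

```python
from collections import defaultdict


def build_dataguide(root):
    """root = (node_id, label, children). Returns {label_path: set of node ids}.

    Each distinct label path appears exactly once as a key (conciseness),
    and only paths present in the data are recorded (accuracy).
    """
    guide = defaultdict(set)
    stack = [(root, "")]
    while stack:
        (node_id, label, children), prefix = stack.pop()
        path = f"{prefix}/{label}" if prefix else label
        guide[path].add(node_id)
        for child in children:
            stack.append((child, path))
    return dict(guide)


# Hypothetical data: two restaurants under a common root.
tree = (1, "Guide", [
    (2, "Restaurant", [(4, "Name", []), (6, "Entree", [])]),
    (3, "Restaurant", [(5, "Name", []), (10, "Entree", []), (11, "Entree", [])]),
])
guide = build_dataguide(tree)
print(sorted(guide["Guide/Restaurant/Entree"]))
```

The value stored under a label path plays the role of the target set Ts(e) of a targeted dataguide.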
Dataguide Evaluation • Identifier • One per node • Descendant/Ancestor Search • By graph traversal • Keyword Search • By element scan • Update • Insertion is incremental • Deletion is complex • Index size • Entry number : Linear for tree; can be exponential in number of DB nodes • Entry size : number of elements for a path
T-Index • [Milo & Suciu, LNCS 1997] • T-index stands for Template-index • A path template t has the form • T1 x1 T2 x2 … Tn xn • where each Ti is either a regular path expression or one of the following two placeholders: P (any Path) and F (any Formula) • Example: //restaurant/ x P y /Address/City z F u • A query path q is obtained from t by instantiating: • P by any path ; F by any formula
Principle • T-index indexes all sequences of objects connected by a sequence of path expressions defined by a template. • Particular cases : • 1-index indexes = template any path P • Indexes all objects reachable through an arbitrary path expression P from a root: • two nodes are equivalent (same entry) if the set of paths into them from the root is the same. • 1-index is a non-deterministic version of the strong data guide • 2-index indexes = template P x P • all pairs of objects connected by an arbitrary path expression P
Building a T-index • Group objects into equivalence classes containing objects that are indistinguishable w.r.t. a class of paths defined by a path template • Finer equivalence classes are more efficient to construct using bisimulation • Construct a non-deterministic automaton • states represent the equivalence classes • transitions correspond to edges between objects in those classes. • T-index can be used to answer queries of more general forms than the template
3-Adaptive Path Index (APEX) • Adaptive Path Index for XML [Chung et al., SIGMOD 2002] • Summarizes paths that appear frequently in the query workload • Maintains all paths of length 1 • Efficient for partial match paths • Incremental update of index
APEX details • Each node has an identifier (nid) • Required paths for indexing ({label} + some composed paths) • APEX = graph (structural summary) + hash tree (incoming required paths to nodes of the graph) • The hash tree is used to find the nodes of the graph for a given label path, and also for incremental update • Frequently used paths are determined from the query workload using sequential pattern mining
APEX Example • [Figure: XML data structure; APEX hash tree and graph]
APEX Evaluation • Identifiers • One per node • Descendant/Ancestor Search • Hash tree access if required or graph traversal or join • Keyword Search • Not supported • Update • Insertion is incremental • Index size (two structures) • Entry number : Linear in number of nodes • Entry size : number of elements for a path
4-Index Fabric • [Cooper et al., "A Fast Index for Semistructured Data", VLDB 2001] • Extension of dataguide for text search • Keeps all label paths starting from the root • Encodes each label path with its data value as a string • Uses an efficient index for strings to store it (Patricia trie) • Performs queries on keywords for elements as string search • Does not keep information on non-terminal nodes
Patricia Trie • Trie = a tree for storing strings in which there is one node for every common prefix; the strings are stored in extra leaf nodes • A Patricia trie is a simple form of compressed trie that merges single-child nodes with their parents • More efficient for long keys (the non-common suffix is kept in one node) • [Figure: trie mapping keys to values]
Example
Doc 1:
<invoice>
  <buyer>
    <name>ABC Corp</name>
    <address>1 Industrial Way</address>
  </buyer>
  <seller>
    <name>Acme Inc</name>
    <address>2 Acme Rd.</address>
  </seller>
  <item count="3">saw</item>
  <item count="2">drill</item>
</invoice>
Doc 2:
<invoice>
  <buyer>
    <name>Oracle Inc</name>
    <phone>555-1212</phone>
  </buyer>
  <seller>
    <name>IBM Corp</name>
  </seller>
  <item>
    <count>4</count>
    <name>nail</name>
  </item>
</invoice>
Search on Paths • Example queries: • /invoice/buyer/name/[ABC Corp] • /invoice/buyer//[ABC Corp] • A key lookup operator searches for the path key corresponding to the path expression. • If the path can expand to an unbounded number of tag sequences (e.g., with //) • start by using a prefix key lookup operator, • then navigate through the children to check the rest
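The two lookups above can be sketched as follows. This is not the Index Fabric implementation: a sorted list with binary search stands in for the Patricia trie, keys use readable tag names in the same form as the queries above, and `path_keys` and `prefix_lookup` are hypothetical names. The sample is a trimmed copy of Doc 1.

```python
import bisect
import xml.etree.ElementTree as ET

DOC1 = """<invoice>
  <buyer><name>ABC Corp</name><address>1 Industrial Way</address></buyer>
  <seller><name>Acme Inc</name><address>2 Acme Rd.</address></seller>
</invoice>"""


def path_keys(xml_text):
    """Return sorted keys '/tag/.../tag/[value]', one per element with text."""
    keys = []

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            keys.append(f"{path}/[{text}]")
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(keys)


def prefix_lookup(keys, prefix):
    """All keys starting with prefix, found by binary search on the sorted list."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]


keys = path_keys(DOC1)
# /invoice/buyer/name/[ABC Corp]: exact key lookup.
exact = "/invoice/buyer/name/[ABC Corp]" in keys
# /invoice/buyer//[ABC Corp]: prefix lookup, then check the rest of each key.
hits = [k for k in prefix_lookup(keys, "/invoice/buyer/") if k.endswith("/[ABC Corp]")]
print(exact, hits)
```

Only keys ending in a value are stored, which mirrors the point that no information is kept on non-terminal nodes.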
Fabric Evaluation • Identifiers • One per document • Descendant/Ancestor Search • As string search; do not keep order of elements • Keyword Search • By Patricia trie leaves if expanded; value index otherwise • Update • Insertion is incremental • Deletion is complex • Index size (index stored with document) • Entry number : Linear for tree • Entry size : number of elements for a path
5-Node Numbering Scheme • Used for indexing elements • Each element gets a node identifier (NID) • The NID aims at replacing structural joins by simple function computation: • check parent & ancestor relationships • is_parent(NID1,NID2), is_ancestor(NID1,NID2) • determine parent & children • get_parent(NID1), get_children(NID1)
Virtual nodes (1) • [Lee & Yoo Digital Libraries 99] • Document structure mapped on a k-ary tree • Node identifier assigned according to the level-order tree traversal (root = 1) • parent(i) = ⌊(i-2)/k⌋ + 1 • child(i,j) = k(i-1) + j + 1, for the j-th child (1 ≤ j ≤ k)
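The two formulas above translate directly into integer arithmetic. In this sketch the root is assumed to have identifier 1 and the division is an integer (floor) division; `is_ancestor` is an added helper that simply climbs with `parent`.

```python
def parent(i, k):
    """Identifier of the parent of node i in a k-ary tree (root = 1)."""
    if i <= 1:
        raise ValueError("the root has no parent")
    return (i - 2) // k + 1


def child(i, j, k):
    """Identifier of the j-th child (1 <= j <= k) of node i."""
    return k * (i - 1) + j + 1


def is_ancestor(a, d, k):
    """True if a is a proper ancestor of d, found by climbing from d."""
    if a >= d:
        return False
    while d > a:
        d = parent(d, k)
    return d == a


k = 3
print([child(1, j, k) for j in (1, 2, 3)])  # children of the root
print(parent(8, k), is_ancestor(1, 8, k), is_ancestor(2, 8, k))
```

Because identifiers are computed and not stored, a slot may be numbered even when no real node occupies it, which is where the "virtual nodes" name comes from.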
Virtual nodes (2) • NID can be used to address elements in index of elements • Only certain nodes (e.g., leaves) have to be indexed as parent nodes can be determined by computation • Problems: • arity of tree – may be variable and large • determination of real existence of parent/child • update when arity increases ?
XML Tree Node Pre/Post Numbering • [Dietz 82] • Identification of nodes • Identifier = preorder rank || postorder rank • X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y) • Example: 1 < 5 and 7 > 3 => (1,7) is an ancestor of (5,3) • [Figure: XML tree with nodes labeled (1,7), (2,4), (3,1), (4,2), (5,3), (6,6), (7,5)]
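The numbering and the ancestor test above can be sketched as follows. The tree shape and the node names a to g are assumptions, chosen so that the computed pairs match the labels listed for the figure.

```python
def number_tree(tree):
    """tree = (name, children). Returns {name: (preorder rank, postorder rank)}."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1
        pre[name] = counters["pre"]      # rank on first visit
        for c in children:
            visit(c)
        counters["post"] += 1
        post[name] = counters["post"]    # rank once all children are done

    visit(tree)
    return {name: (pre[name], post[name]) for name in pre}


def is_ancestor(x, y):
    """x, y are (pre, post) pairs; True if x is a proper ancestor of y."""
    return x[0] < y[0] and x[1] > y[1]


# Assumed shape: root a with children b (leaves c, d, e) and f (leaf g).
tree = ("a", [("b", [("c", []), ("d", []), ("e", [])]), ("f", [("g", [])])])
ids = number_tree(tree)
print(ids)
print(is_ancestor(ids["a"], ids["e"]), is_ancestor(ids["b"], ids["g"]))
```

The test needs only two integer comparisons, so no join or traversal is required to decide the relationship between two identified nodes.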
Interval Encoding • [Li & Moon, VLDB 2001] • Identify each node by a pair of numbers <order, size> as follows: • For a tree node y with parent x: order(x) < order(y) and order(y) + size(y) <= order(x) + size(x) • For two sibling nodes x and y, if x precedes y in preorder traversal, then order(x) + size(x) < order(y) • Size keeps space for updates • [Figure: tree with nodes labeled (1,100), (10,30), (41,10), (11,5), (17,5), (25,5), (45,5)]
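A minimal sketch of the tests these constraints enable, using the <order, size> pairs listed for the figure. The ancestor test `order(x) < order(y) <= order(x) + size(x)` follows from the containment condition; the `Node` type and function names are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    order: int
    size: int


def is_valid_child(x, y):
    """Check the parent/child constraint for y placed under x."""
    return x.order < y.order and y.order + y.size <= x.order + x.size


def is_ancestor(x, y):
    """True if y's order falls inside the interval reserved for x's subtree."""
    return x.order < y.order <= x.order + x.size


root, left, right = Node(1, 100), Node(10, 30), Node(41, 10)
leaf = Node(25, 5)
print(is_valid_child(root, left), is_valid_child(left, leaf))
print(is_ancestor(left, leaf), is_ancestor(right, leaf))
```

Since (10,30) reserves orders up to 40 and its listed descendants end at 30, a new node can be inserted under it without renumbering, which is the slack that "size keeps space for updates" refers to.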
Relative Region Coordinates (1) • [Kha & Yoshikawa IEEE Data Engin. 2001] • The RRC of a node n of an XML tree is a pair [s, e] of addresses in the region of its parent, i.e., the start and end of n measured relative to the parent's start • [Figure: child region from s to e inside the parent region]
Relative Region Coordinates (2) • Absolute region coordinate (ARC) • Relative to the beginning of the root (from the Nth to the Mth byte) • Allows extracting the XML data • Can be derived from the RRCs of the parents and of the node itself: • $\mathrm{Begin} = \sum_{n \in \mathrm{parents} \cup \{\mathrm{self}\}} s(n) - (k-1)$ • $\mathrm{End} = \sum_{n \in \mathrm{parents}} s(n) + e(\mathrm{self}) - (k-1)$ • Advantages • Updates are kept local to a region • Efficient parent-child access • A B-tree-like structure is maintained (à la Natix).
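The derivation of an ARC from RRCs can be sketched as below. Two assumptions are made here that the slide does not state: relative addresses are 1-based, and k is the number of nodes on the path from the root to the node itself. The function name and sample coordinates are hypothetical.

```python
def absolute_region(path):
    """path: list of RRC pairs (s, e) from the root down to the node itself.

    Assumes 1-based relative addresses and k = len(path).
    """
    k = len(path)
    starts = [s for s, _ in path]
    begin = sum(starts) - (k - 1)
    end = sum(starts[:-1]) + path[-1][1] - (k - 1)
    return begin, end


# Root occupies bytes 1..200; a child sits at 11..60 inside the root;
# a grandchild sits at 5..20 inside that child.
print(absolute_region([(1, 200)]))
print(absolute_region([(1, 200), (11, 60)]))
print(absolute_region([(1, 200), (11, 60), (5, 20)]))
```

Under these assumptions the grandchild starts at absolute byte 15: the child starts at byte 11, and the grandchild's 5th relative byte is 4 bytes further on. Inserting bytes inside one region changes only the RRCs in that region, not those of nodes elsewhere.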
Xyleme • Generate a form of dataguide per cluster • Generalized DTD • Manage a label and value index (full index) • Keep document ID and element ID • Two forms of element ID: • Bit structured scheme: structure position • Prefix-postfix scheme: left-deep traversal • Stores XML DOM trees in pages • NATIX (Mannheim Univ.) technology
6-Compact Structural Summary • [Bremer & Gertz Tech Report 2003] • Compact addressing of words in XML doc. • Encode XPath as reference to a path in a document guide (path set, DTD or schema)
Managing a Compact Index • Naïve XML indexing: (Word, docId, (XPath)*) • Example • book/chapter[2]/resume/section[3] • article/author/name • Difficulties • Index size • Processing time (intersection of lists) • Problem: how to record the location of a word inside an element? • Solution [Bremer & Gertz 02] • Encode the XPath as a reference to a path in a document guide (path sequence or schema)
XPath Encoding • XPath encoded as a path ID (PID) of structure (N, (p1, p2, ...)) • N being a node identifier in the guide • (p1, p2, ...) being indices for repetitive ancestors from root to N • Example PID: (V, (1, 3)) = /db/article[1]/text/sect[3] • [Figure: document guide with numbered nodes db (I), article* (II), title (III), text (IV), sect* (V), techreport (VI), next to a sample document tree]
PID Ordering and Encoding • PID order: (IV, (1)) < (V, (1,2)) < (V, (1,3)) • Pre-order relationship: X parent of Y => PID(X) < PID(Y) • Compact PID encoding • Path number: integer (short) • Repetitive node: log2(n) bits, rounded up • Example: (V, (1, 3)) = /db/article[1]/text/sect[3] • 2 children: 1 bit • 1 child: 0 bits • 3 children: 2 bits • Total: 3 bits • [Figure: sample document tree with db, techreport, article, title, text, and sect nodes]
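The bit count in the example above can be reproduced with a small sketch. The bit layout used by `encode_positions` (each position stored as p-1 in a fixed-width field) is an assumption for illustration, not necessarily the encoding of Bremer & Gertz.

```python
def position_bits(n):
    """Bits needed to distinguish n siblings: ceil(log2(n)), and 0 when n == 1."""
    return (n - 1).bit_length()


def encode_positions(positions, sibling_counts):
    """Pack 1-based positions into a bit string, one field per path step."""
    bits = ""
    for p, n in zip(positions, sibling_counts):
        width = position_bits(n)
        if width:  # a step with a single candidate costs no bits
            bits += format(p - 1, f"0{width}b")
    return bits


# /db/article[1]/text/sect[3]: 2 article nodes, 1 text node, 3 sect nodes.
counts = [2, 1, 3]
print(sum(position_bits(n) for n in counts))
print(encode_positions([1, 1, 3], counts))
```

With the node number of the guide stored as a short integer, the positional part of this PID fits in 3 bits, compared with the full XPath string of the naïve index.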
Index Implementation • Entry: Word (stem) || Address • Address is: PID || (offset in element)* • Example: City (V(1,3); (9, 36)) • Sample document:
<livre>
  <titre>Les Misérables, Tome 1 : Fantine</titre>
  <auteur>Victor Hugo</auteur>
  <histoire> 1815. After every innkeeper in the town has turned him away, the convict Jean Valjean is given shelter by Mgr Myriel (whom the poor have named, after one of his given names, Mgr Bienvenu). The bishop of the town of Digne welcomes him kindly, has him eat at his table, and offers him a good bed. …. </histoire>
</livre>
XQuery Text Evaluator • Normalize the query through thesaurus • Translation • Synonyms • Conceptualization • Access to the text index • Intersection, union, difference of PIDs • Access to the relevant elements from PIDs • Verification of relevance
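The "intersection, union, difference of PIDs" step can be sketched with plain set operations. The tiny index below is invented for illustration, PIDs are written as (node, positions) tuples, and the function names are hypothetical.

```python
def pids_with_all(index, words):
    """PIDs of elements containing every word (intersection)."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()


def pids_with_any(index, words):
    """PIDs of elements containing at least one word (union)."""
    return set().union(*(index.get(w, set()) for w in words))


def pids_without(index, word, excluded):
    """PIDs of elements containing word but not excluded (difference)."""
    return index.get(word, set()) - index.get(excluded, set())


# Hypothetical text index: word -> set of PIDs.
index = {
    "xml": {("V", (1, 2)), ("V", (1, 3)), ("IV", (1,))},
    "index": {("V", (1, 3)), ("III", (1,))},
}
print(pids_with_all(index, ["xml", "index"]))
print(sorted(pids_without(index, "xml", "index")))
```

The resulting PIDs are then used to fetch the candidate elements, whose relevance still has to be verified as in the last step of the evaluator.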
7-Conclusion • Various indexing techniques for XML • Main dimensions of variations • Structural summary • Dataguide, Schema guide, Generalized DTD • Identification of nodes (NID) • Should keep parent-child relationship • Should be stable to updates • Index of keywords • Should be compact • Should give NID and offset of instances
Classification • XML Indexing Methods • Numbering Scheme • RRC Hierarchy • Pre/Post Order • Interval Encoding • Text Search • Fabric • Graph Traversal • Dataguide • T-Index • APEX
Index for XQuery Text • Facilitate the retrieval of: • Non-stop words • Suffixes, prefixes • Location of words in elements • Relevant nodes for a search • Entries should focus on elements • Word [(docId, NID)*]
Treeguide patterns • [Figure: two tree patterns, (a) and (b), over the nodes Book, Author, Category, @speciality, Company, Address, City]