Structural indexes of XML Databases

Dr. Vu Le Anh (lavu@ntt.edu.vn)

Outline
• Motivation
• Regular query processing over XML datasets
• Indexes over XML datasets
• Structural indexes
• Structural indexes for distributed XML datasets
• Summary

NCBI GEO dataset
• GEO is a public functional genomics data repository supporting MIAME-compliant data submissions.
• About 600 gigabytes (Feb. 2009).
• Data are stored in XML datasets.
[Figure: a gene map written as an XML file, and its XML graph.]

Virtual observatory
• A collection of interoperating data archives and software tools that use the internet to form a scientific research environment in which astronomical research programs can be conducted.
• IVOA (International Virtual Observatory Alliance): building an international community.
• Uses very large XML datasets for storing and exchanging data.

Problem
• Efficient query processing over big (distributed) XML databases.
• Two “interesting” ideas:
  • Storing the XML database in a relational database: rewriting the XML queries into SQL and Datalog, then rewriting and combining the results.
  • Indexing the XML database: using the indexes for query processing.

Data Graph: Data Model for XML
• Data graph: a directed, rooted, labelled graph, consisting of:
  • V: the set of nodes.
  • The set of label values.
  • The set of edges, made up of the set of basic edges and the set of reference edges.
  • The root.
  • The labeling function.

Publication XML document

<CSDepartment>
  <PhDStudents>
    <Student id="s1">
      <Name>John</Name>
      <Papers>
        <Paper id="pp1">
          <Title>ABC</Title>
          <Author>Dr.Ben</Author>
          <Author idref="p1"> </Author>
        </Paper>
      </Papers>
    </Student>
    <Student id="s2">
      <Name>Tom</Name>
    </Student>
  </PhDStudents>
  <Professors>
    <Professor id="p1">
      … …
      <Name>Dr. Kiss</Name>
      <Papers>
        <Paper idref="pp1"> </Paper>
        <Paper>
          <Title>DEF</Title>
        </Paper>
      </Papers>
    </Professor>
    <Professor id="p2">
      <Name>Dr. Baker</Name>
      <Papers>
        <Paper>
          <Title>XYZ</Title>
        </Paper>
      </Papers>
    </Professor>
  </Professors>
</CSDepartment>

XML data graph
[Figure: the data graph of the Publication XML document.]

Regular queries
• Query languages for XML: XQuery, XPath, UnQL, Lorel, XQL, XML-QL, etc.
• They are built around regular expressions.
• 3 basic operations:
  • Concatenation: . or /
  • Union: |
  • Iteration: *
• Shorthand: _ stands for any label value; // stands for (_)*, any sequence of label values.
• Example: //(Student | Professor)//Paper/Title

Regular queries
• A pair of nodes (u, v) matches the regular query R if there is a path from u to v whose label sequence matches R.
• The result of R for an input set I and an output set O: {(u, v) | u ∈ I, v ∈ O, (u, v) matches R}.
• General case: I = {root} and O = V.
• Every regular expression R can be represented by a nondeterministic finite automaton (NFA) that recognizes the language L(R). The query graph is the graph representing this automaton.

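Query processing based on the automaton
• The query graph of //B/D: q0 --*--> q0, q0 --B--> q1, q1 --D--> q2.
• Input: I = {0}; Output: O = {0, 1, …, 15}.
[Figure: a data tree with nodes 0 to 15, each annotated with the automaton states reached there. Node labels: 0 A, 1 A, 2 B, 3 D, 4 A, 5 C, 6 C, 7 A, 8 B, 9 A, 10 B, 11 D, 12 E, 13 D, 14 E, 15 F.]
• The result = {(0,3), (0,11), (0,13)}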
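The //B/D example can be replayed with a few lines of Python. This is only a sketch: the node labels and the automaton come from the slide, but the parent-child edges in `CHILDREN` are an assumption reconstructed from the preorder numbering, and the names `step` and `evaluate` are made up for illustration.

```python
# Node labels as listed on the slide.
LABELS = {0: "A", 1: "A", 2: "B", 3: "D", 4: "A", 5: "C", 6: "C", 7: "A",
          8: "B", 9: "A", 10: "B", 11: "D", 12: "E", 13: "D", 14: "E", 15: "F"}
# ASSUMPTION: tree shape reconstructed from the preorder numbers.
CHILDREN = {0: [1, 8], 1: [2, 6], 2: [3], 3: [4, 5], 6: [7],
            8: [9, 13], 9: [10], 10: [11, 12], 13: [14], 14: [15]}

# NFA for //B/D: q0 loops on any label, q0 -B-> q1, q1 -D-> q2 (accepting).
ANY = "*"
NFA = {("q0", ANY): {"q0"}, ("q0", "B"): {"q1"}, ("q1", "D"): {"q2"}}
START, ACCEPT = "q0", "q2"


def step(states, label):
    """All NFA states reachable from `states` by reading one node label."""
    nxt = set()
    for q in states:
        nxt |= NFA.get((q, label), set())
        nxt |= NFA.get((q, ANY), set())
    return nxt


def evaluate(root):
    """Return the pairs (root, v) such that the path root..v matches the query."""
    results = []
    # The root itself is the start node; reading begins at its children.
    stack = [(child, {START}) for child in CHILDREN.get(root, [])]
    while stack:
        node, states = stack.pop()
        states = step(states, LABELS[node])
        if ACCEPT in states:
            results.append((root, node))
        if states:  # prune branches on which the automaton is dead
            stack.extend((child, states) for child in CHILDREN.get(node, []))
    return sorted(results)


print(evaluate(0))
```

Under the assumed tree shape, the printed pairs coincide with the result given on the slide.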

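Transform to an edge-labeled graph
• The data graph is node-labeled, while the query graph is an edge-labeled graph.
• So the data graph is transformed into an edge-labeled graph.
[Figure: a node-labeled graph and the corresponding edge-labeled graph.]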
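The slide does not spell out the transformation rule. One common convention, used here purely as an assumption, is to move each node's label onto its incoming edges. The function name `to_edge_labeled` and the tiny sample graph are hypothetical.

```python
def to_edge_labeled(labels, edges):
    """Turn (parent, child) edges of a node-labeled graph into
    (parent, label, child) triples, where the label is the child's label.

    ASSUMPTION: under this convention the label of the root is dropped,
    because the root has no incoming edge.
    """
    return [(u, labels[v], v) for (u, v) in edges]


# Hypothetical sample: a few nodes in the style of the //B/D example.
labels = {0: "A", 1: "A", 8: "B", 13: "D"}
edges = [(0, 1), (0, 8), (8, 13)]
triples = to_edge_labeled(labels, edges)
print(triples)
```

The resulting triples have the same shape as the edges of a query graph, so both can later be stored in tables and joined on the label column.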

State-Data (SD) graph
• SD graph = query graph JOIN data graph.
• The SD graph may be disconnected.
• SD nodes: (data node, state node) pairs.
• SD edges: constructed by matching the labels of data edges against the labels of query-graph (state) edges.

Joining R := a/(b|c)*/a and a data graph
• Query graph: s0 --a--> s1, s1 --b,c--> s1, s1 --a--> s2.
• Data graph: nodes 1 to 5, edges labelled a, b and c.
• SD-graph nodes: (1,s0), (1,s1), (2,s0), (2,s1), (2,s2), (3,s0), (3,s1), (4,s1), (4,s2), (5,s1), (5,s2).
• Result: (1,4), (1,5)
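The join can be sketched as a plain label-equality join between two edge tables, followed by a reachability search in the joined graph. The query rows follow from R = a/(b|c)*/a; the data rows in `DATA` are a hypothetical graph, since the edges of the slide's figure are not recoverable. Function names are made up for illustration.

```python
# Query graph (NFA) for R = a/(b|c)*/a as (state, label, state) rows.
QUERY = [("s0", "a", "s1"), ("s1", "b", "s1"),
         ("s1", "c", "s1"), ("s1", "a", "s2")]
# ASSUMPTION: a hypothetical edge-labeled data graph as (node, label, node) rows.
DATA = [(1, "a", 2), (2, "b", 3), (3, "c", 2), (2, "a", 5), (3, "a", 4)]


def sd_edges(data, query):
    """SD edges: join data edges with query edges on equal labels."""
    return {((u, p), (v, q))
            for (u, a, v) in data
            for (p, b, q) in query
            if a == b}


def evaluate(data, query, sources, start="s0", accept="s2"):
    """Pairs (u, v): (v, accept) is reachable from (u, start) in the SD graph."""
    succ = {}
    for src, dst in sd_edges(data, query):
        succ.setdefault(src, set()).add(dst)
    results = set()
    for u in sources:
        seen, frontier = {(u, start)}, [(u, start)]
        while frontier:
            cur = frontier.pop()
            for nxt in succ.get(cur, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        results |= {(u, v) for (v, q) in seen if q == accept}
    return sorted(results)


print(evaluate(DATA, QUERY, sources=[1]))
```

Only part of the joined graph is reachable from (1, s0); the other SD nodes are built by the join but never visited, which is why the SD graph may be disconnected.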

SD-graph representation in a relational database [KissVu05]
• Main results:
  • The data graph and the query graph can be represented by tables.
  • SD graph (table) = join of the data table and the query table.
  • The result is computed from the SD table.
• Regular query processing → DATALOG + SQL.
• Building an index to support the SQL computation.

Step 1: Transform the data graph to an edge-labeled graph

Step 2: Query graph representation

Step 3: Using DATALOG and SQL for the computation

Step 4: Computation in the relational database. Result: {4,5,6}

Classes of XML indexes
• Indexing the basic values
  • The basic values are indexed (e.g. data(//emp/salary)).
  • Using a B+-tree.
• Indexing the text values
  • Keywords are indexed.
• Indexes for XML trees
  • Quickly checking and computing the label sequence of the path between a pair of nodes.
  • Also applicable to near-tree XML datasets.
• Structural indexes
  • Simulating the data graph by a smaller graph to reduce the cost of computation.

XML-tree pre/post computing [Dietz82]
• Tree preorder/postorder traversal for computing (pre(x), post(x)).
• x is a descendant of y <=> pre(y) < pre(x) and post(x) < post(y).
[Figure: a 7-node tree labelled with (pre, post) pairs: (1,7), (2,4), (3,1), (4,2), (5,3), (6,6), (7,5).]
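A short sketch of the pre/post numbering. The tree shape in `TREE` is the one implied by the (pre, post) pairs in the figure; the node names n1 to n7 are hypothetical.

```python
# Tree implied by the (pre, post) pairs of the figure; node names are made up.
TREE = {"n1": ["n2", "n6"], "n2": ["n3", "n4", "n5"], "n6": ["n7"]}


def number(tree, root):
    """Assign preorder and postorder ranks (both starting at 1)."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre[node] = counters["pre"]          # rank on the way down
        for child in tree.get(node, []):
            visit(child)
        counters["post"] += 1
        post[node] = counters["post"]        # rank on the way back up

    visit(root)
    return pre, post


def is_descendant(x, y, pre, post):
    """x lies below y iff y is entered before x and left after x."""
    return pre[y] < pre[x] and post[x] < post[y]


pre, post = number(TREE, "n1")
print(sorted((pre[n], post[n]) for n in pre))
print(is_descendant("n4", "n2", pre, post), is_descendant("n7", "n2", pre, post))
```

The descendant test needs only two integer comparisons per pair of nodes, with no traversal of the tree at query time.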

Tree-structure improvement [Li & Moon, VLDB 2001]
• Every node x is labelled with (order(x), size(x)).
• x is a descendant of y <=> order(y) < order(x) and order(x) <= order(y) + size(y).
[Figure: a tree labelled with (order, size) pairs: (1,100), (10,30), (11,5), (17,5), (25,5), (41,10), (45,5).]
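The (order, size) test can be checked directly on the pairs from the figure. The node names below are hypothetical; only the number pairs come from the slide.

```python
# (order, size) pairs from the figure; the node names are made up.
INTERVALS = {"root": (1, 100), "a": (10, 30), "a1": (11, 5), "a2": (17, 5),
             "a3": (25, 5), "b": (41, 10), "b1": (45, 5)}


def is_descendant(x, y):
    """x lies below y iff order(x) falls in (order(y), order(y) + size(y)]."""
    order_x, _ = INTERVALS[x]
    order_y, size_y = INTERVALS[y]
    return order_y < order_x <= order_y + size_y


print(is_descendant("a1", "a"), is_descendant("b1", "b"), is_descendant("b1", "a"))
```

Because size(y) may be larger than the number of nodes currently below y, the intervals leave unused order values, so a node can be inserted without renumbering the rest of the tree.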

Regular query processing over XML trees and near-trees
• Very efficient: based on tree-structured indexes.
• [KissVu06]: applying them to near-tree XML datasets.
  • Link graph: connects the link nodes.
  • Tree-structured indexes are used for the basic (tree) structure.

Family of Structural indexes

1-index [Milo & Suciu, LNCS 1997]
• Idea: grouping all “equivalent” data nodes into one index node.
• Computing the index nodes: bisimulation (≈) is used instead of equivalence (≡).
• The index graph is smaller than the data graph.
• Works for every regular query.
• Bisimulation can be computed in PTIME.

Bisimulation
• ≈ is a bisimulation if x1 ≈ x2 implies:
  • x1 and x2 have the same label;
  • if (y1, x1) is an edge, then there exists an edge (y2, x2) with y1 ≈ y2, and vice versa.
[Figure: y1 ≈ y2, both labelled b, with edges to x1 ≈ x2, both labelled a.]

Example: 1-index
[Figure: a data graph with nodes 1 to 18 (labels: paper, section, title, exp, algorithm, proof, about, uses) next to its 1-index, in which equivalent nodes are merged, e.g. {2,4,8,13}, {3,5,9,14}, {6,10}, {15,16}, {17,18}. Example query: /paper/section/algorithm.]

Using the 1-index?
• Good: works for all regular queries.
• Bad: not small enough!
• Idea: design the index graph only for the most frequently used queries; the index graph then becomes very small.
• A new equivalence relation between nodes has to be defined.
• If a query is not supported, the result is re-checked on the data graph.

Structural indexes and a given set of queries
• Important query classes and their indexes:
  • //a0/a1/…/ai (i <= k), paths not longer than k: A(k)-index
  • Dynamic indexes: APEX, D(k)-index
  • //S0/S1/…/Sk, SAPE queries: DL-1, DL-A*(k)-index
  • Forward-backward queries: F&B-index

A(k)-index [Kaushik et al. 02]
• Supports the //a0/a1/…/ai (i <= k) queries.
• Based on k-bisimulation (≈k):
  • u ≈0 v if u and v have the same label;
  • u ≈k v if u ≈k-1 v and
    • if (u', u) is an edge, there exists an edge (v', v) with u' ≈k-1 v';
    • if (v', v) is an edge, there exists an edge (u', u) with u' ≈k-1 v'.

A(k)-index: example
[Figure: a data graph and its A(0)-, A(1)- and A(2)-indexes.]
• Data graph: 1 imdb; 2 movie; 5 tv; 3, 6, 8 director; 4, 7, 9 name.
• A(0)-index: {1}, {2}, {5}, {3,6,8}, {4,7,9}.
• A(1)-index: {1}, {2}, {5}, {3}, {6,8}, {4,7,9}.
• A(2)-index (= 1-index): {1}, {2}, {5}, {3}, {6,8}, {4}, {7,9}.
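The k-bisimulation classes can be computed by naive round-by-round refinement. The sketch below is not the algorithm of the cited papers, just a direct reading of the definition. The labels come from the figure; the edge list `EDGES` is an assumption read off the imdb example, and `ak_partition` is a made-up name. Passing `k=None` refines until nothing changes, which yields the 1-index classes.

```python
LABELS = {1: "imdb", 2: "movie", 5: "tv", 3: "director", 6: "director",
          8: "director", 4: "name", 7: "name", 9: "name"}
# ASSUMPTION: parent -> child edges as suggested by the figure.
EDGES = [(1, 2), (1, 5), (2, 3), (5, 6), (5, 8), (3, 4), (6, 7), (8, 9)]


def ak_partition(labels, edges, k=None):
    """Group nodes into k-bisimulation classes (k=None: refine to a fixpoint)."""
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)
    # Round 0: two nodes are equivalent iff they carry the same label.
    block = dict(labels)
    rounds = 0
    while k is None or rounds < k:
        # New signature: the old class of the node plus the old classes
        # of all its parents.
        sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
               for v in labels}
        if len(set(sig.values())) == len(set(block.values())):
            break  # no class was split, so later rounds change nothing
        block, rounds = sig, rounds + 1
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, []).append(v)
    return sorted(sorted(g) for g in groups.values())


for k in (0, 1, 2, None):
    print(k, ak_partition(LABELS, EDGES, k))
```

On this graph the classes for k = 2 and for the fixpoint are the same, matching the figure's note that the A(2)-index equals the 1-index here.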

Split operation
[Figure: a data graph with root R, nodes A and B, and nodes C1 to C6, shown next to its A(2)-index (= 1-index), A(1)-index and A(0)-index; the index graphs merge some of the C nodes into groups such as {C2,C3}, {C5,C6}, {C1,C2,C3} and {C4,C5,C6}.]

Refinement (step 1)
[Figure: same layout: data graph, A(2)-index (= 1-index), A(1)-index, A(0)-index.]

Refinement (step 2)
[Figure: same layout: data graph, A(2)-index (= 1-index), A(1)-index, A(0)-index.]

DL-1-index [KissVu06]
• Supports //S0/S1/…/Sk queries (SAPE = Simple Alternation Path Expression).
• Dynamic index (dynamic labelling).

The //(d|e)/f SAPE query
• A SAPE query: //(d|e)/f
• R := S0/S1, S0 = {d, e}; S1 = {f}
• The pairs (4,9), (5,10), (6,11) and (7,12) match R.
• The result: TG(R) = {9,10,11,12}
[Figure: data graph with node labels 0 a, 1 c, 2 b, 3 b, 4 d, 5 e, 6 d, 7 d, 8 e, 9 f, 10 f, 11 f, 12 f, 13 g.]
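A SAPE query is a list of label sets, so its target set can be computed by stepping along edges one set at a time. In this sketch the labels and the four matching edges (4,9), (5,10), (6,11), (7,12) come from the slide; all other edges in `EDGES` are assumptions, and `sape_targets` is a made-up name.

```python
LABELS = {0: "a", 1: "c", 2: "b", 3: "b", 4: "d", 5: "e", 6: "d", 7: "d",
          8: "e", 9: "f", 10: "f", 11: "f", 12: "f", 13: "g"}
# Edges (4,9), (5,10), (6,11), (7,12) are from the slide.
# ASSUMPTION: the remaining edges are invented to complete the graph.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (2, 6), (3, 7), (3, 8),
         (4, 9), (5, 10), (6, 11), (7, 12), (8, 13)]


def sape_targets(steps, labels, edges):
    """Target nodes TG(R) of //S0/S1/.../Sk, with steps = [S0, ..., Sk]."""
    # //S0: any node whose label is in S0, wherever it sits in the graph.
    current = {v for v in labels if labels[v] in steps[0]}
    # /Si: follow one edge and keep the children whose label is in Si.
    for allowed in steps[1:]:
        current = {v for (u, v) in edges
                   if u in current and labels[v] in allowed}
    return current


print(sorted(sape_targets([{"d", "e"}, {"f"}], LABELS, EDGES)))
```

Node 13 is a child of an e-labelled node but carries the label g, so it is filtered out in the last step.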

Example: DL-1-index supporting the //(K|L) and //(B|C)/E queries
[Figure with four panels (a) to (d): the data graph with nodes 0 to 12 (labels A; K, L, M, N; B, C, D; E, F), the initial DL-1-index with index nodes {0}, {1,2,3,4}, {5,6,7,8} and {9,10,11,12}, and the indexes obtained after refining to support R1 = //(K|L) and then R2 = //(B|C)/E.]
• The DL-1-index at the beginning.
• The data graph and the 1-index are the same.
• Supporting R1 = //(K|L).
• Supporting R2 = //(B|C)/E.

DL-A*(k)-index [KissVu06]
• The A(i)-index is a special case of the DL-A*(k)-index.
• The DL-A*(k)-index supports a given set of SAPE queries of length at most k.

DL-A*(1)-index supporting the //(K|L) and //(B|C)/E queries
[Figure: the data graph (nodes 0 to 12), the initial index, the index after the //(K|L) refinement, and the index after the //(B|C)/E refinement.]

Experiments
• DL-1 vs. 1-index
• DL-A*(k) vs. A(k)-index
• 2 datasets:
  • XMark: 100 MB, 1,681,342 nodes.
  • TreeBank: 82 MB, 2,437,667 nodes.

Distributed XML tree
• XML tree = fragments (subtrees).
• Each server stores some fragments.
• There are link edges between fragments.
• Questions:
  • Finding an efficient protocol for regular query processing: waiting time vs. computing time.
  • Applying structural indexes?

Processing //a/b//a on an XML tree using 2 servers

Flow model (SPIDER algorithm)
• Processing begins from the root.
• On a link (F, q) → (F', q'):
  • Processing on F stops.
  • Processing on F' starts with state q'.
  • When processing over F' finishes, the result is sent to F.
  • F continues.
• Drawback: waiting time!

2-phase parallel model
• Servers: each computes every possible state on its own site and sends the link edges to the coordinator.
• The coordinator examines the link edges and requests the results from the servers.
• The servers send the results to the coordinator.
• Drawback: computing time!
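A single-process simulation can illustrate the two phases. It is a sketch under several assumptions: the tree is the (assumed) tree of the //B/D example, the split into three fragments in `FRAGMENT_ROOT` is hypothetical, and "servers" and "coordinator" are ordinary function calls rather than networked processes.

```python
LABELS = {0: "A", 1: "A", 2: "B", 3: "D", 4: "A", 5: "C", 6: "C", 7: "A",
          8: "B", 9: "A", 10: "B", 11: "D", 12: "E", 13: "D", 14: "E", 15: "F"}
# ASSUMPTION: tree shape reconstructed from the preorder numbers.
CHILDREN = {0: [1, 8], 1: [2, 6], 2: [3], 3: [4, 5], 6: [7],
            8: [9, 13], 9: [10], 10: [11, 12], 13: [14], 14: [15]}
# ASSUMPTION: hypothetical fragmentation, fragment id -> fragment root node.
FRAGMENT_ROOT = {"F0": 0, "F1": 2, "F2": 8}
ROOT_TO_FRAGMENT = {node: frag for frag, node in FRAGMENT_ROOT.items()}

ANY = "*"
NFA = {("q0", ANY): {"q0"}, ("q0", "B"): {"q1"}, ("q1", "D"): {"q2"}}  # //B/D
STATES, START, ACCEPT = ["q0", "q1", "q2"], "q0", "q2"


def step(states, label):
    nxt = set()
    for q in states:
        nxt |= NFA.get((q, label), set()) | NFA.get((q, ANY), set())
    return nxt


def local_run(fragment, q):
    """Phase 1 on one server: the fragment root is reached in state q."""
    matches, links = set(), set()
    stack = [(FRAGMENT_ROOT[fragment], {q})]
    while stack:
        node, states = stack.pop()
        if ACCEPT in states:
            matches.add(node)
        for child in CHILDREN.get(node, []):
            nxt = step(states, LABELS[child])
            if child in ROOT_TO_FRAGMENT:  # link edge into another fragment
                links |= {(ROOT_TO_FRAGMENT[child], s) for s in nxt}
            elif nxt:
                stack.append((child, nxt))
    return matches, links


def two_phase():
    # Phase 1: every (fragment, state) pair is tabulated, useful or not.
    table = {(f, q): local_run(f, q) for f in FRAGMENT_ROOT for q in STATES}
    # Phase 2: the coordinator follows the link edges from (F0, q0) and
    # collects results only from the entries that are actually reached.
    reachable, frontier, results = {("F0", START)}, [("F0", START)], set()
    while frontier:
        matches, links = table[frontier.pop()]
        results |= matches
        for key in links - reachable:
            reachable.add(key)
            frontier.append(key)
    return sorted(results), reachable


results, reachable = two_phase()
print(results, len(reachable), "of", len(FRAGMENT_ROOT) * len(STATES))
```

In this run only some of the tabulated (fragment, state) entries are reached from (F0, q0); the work spent on the others is the extra computing time that the slide points out.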

1-phase parallel model [KissVu07]
• The coordinator builds the structural Tree-index for the whole system to determine the connective (reachable) (F, q) states.
• Processing runs on the index first, to compute the connective states.
• Good: efficient processing.
• Bad: the index may be big.

Structural Tree-index
[Figure: the data tree of the //B/D example split into fragments F0 to F5 and annotated with automaton states, next to the Tree-index built over the fragments.]
• Connective states: (F0,q0), (F1,q0), …
• (F2,q1), (F2,q2): not connective.

Experiments
• 19 local Linux servers.
• Waiting time: 1IP : 2P : SP = 1 : 1.94 : 37.52
• Computing time: 1IP : 2P : SP = 1 : 1.77 : 2.75

Native XML database systems
http://www.rpbourret.com/xml/XMLDatabaseProds.htm#native

Product | Developer | License | Database type
Qizx/db | XMLMind | Commercial | Proprietary
Sedna XML DBMS | ISP RAS MODIS | Free | Proprietary
Sekaiju / Yggdrasill | Media Fusion | Commercial | Proprietary
SQL/XML-IMDB | QuiLogic | Commercial | Proprietary (native XML and relational)
Sonic XML Server | Sonic Software | Commercial | Object-oriented (ObjectStore)
Tamino | Software AG | Commercial | Proprietary. Relational through ODBC
TeraText DBS | TeraText Solutions | Commercial | Proprietary
TEXTML Server | IXIASOFT, Inc. | Commercial | Proprietary
TigerLogic XDMS | Raining Data | Commercial | Pick
Timber | University of Michigan | Open Source (non-commercial only) | Shore, Berkeley DB
TOTAL XML | Cincom | Commercial | Object-relational
Virtuoso | OpenLink Software | Commercial | Proprietary. Relational through ODBC
XDBM | Matthew Parry, Paul Sokolovsky | Open Source | Proprietary
XDB | ZVON.org | Open Source | Relational (PostgreSQL)
XediX TeraSolution | AM2 Systems | Commercial | Proprietary
X-Hive/DB | X-Hive Corporation | Commercial | Proprietary. Relational through JDBC
Xindice | Apache Software Foundation | Open Source | Proprietary
xml.gax.com | GAX Technologies | Commercial | Proprietary
Xpriori XMS | Xpriori | Commercial | Proprietary
XQuantum XML Database Server | Cognetic Systems | Commercial | Proprietary
XStreamDB Native XML Database | Bluestream Db. Soft. Corp. | Commercial | Proprietary
Xyleme Zone Server | Xyleme SA | Commercial | Proprietary

Summary
• Big XML is used in many applications.
• Our problem: efficient processing of regular queries over XML databases.
• Two ideas:
  • Using relational databases.
  • Building special indexes for XML databases.
