Structural indexes of XML Databases Dr. Vu Le Anh lavu@ntt.edu.vn
Outline • Motivation • Regular query processing over XML datasets • Indexes over XML datasets • Structural indexes • Structural indexes for distributed XML datasets • Summary
NCBI GEO dataset • GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. • About 600 gigabytes (February 2009). • Data are stored as XML datasets. [Figure: a gene map written as an XML file, and its XML graph.]
Virtual observatory • A collection of interoperating data archives and software tools that use the internet to form a scientific research environment in which astronomical research programs can be conducted. • IVOA (International Virtual Observatory Alliance): building an international community. • Uses very big XML datasets for storing and exchanging data.
Problem • Efficient query processing over big (distributed) XML databases. • Two “interesting” ideas: • Storing the XML database in a relational database: rewriting the XML queries into SQL and Datalog, then combining the results. • Indexing the XML database: using the indexes for query processing.
Data graph: data model for XML • A data graph is a directed, rooted, labelled graph consisting of: • a set of nodes (V) • a set of label values • a set of edges, made up of basic edges and reference edges • the root • a labelling function
Publication XML document:
<CSDepartment>
  <PhDStudents>
    <Student id="s1">
      <Name>John</Name>
      <Papers>
        <Paper id="pp1">
          <Title>ABC</Title>
          <Author>Dr.Ben</Author>
          <Author idref="p1"> </Author>
        </Paper>
      </Papers>
    </Student>
    <Student id="s2">
      <Name>Tom</Name>
    </Student>
  </PhDStudents>
  <Professors>
    <Professor id="p1">
      … …
      <Name>Dr. Kiss</Name>
      <Papers>
        <Paper idref="pp1"> </Paper>
        <Paper>
          <Title>DEF</Title>
        </Paper>
      </Papers>
    </Professor>
    <Professor id="p2">
      <Name>Dr. Baker</Name>
      <Papers>
        <Paper>
          <Title>XYZ</Title>
        </Paper>
      </Papers>
    </Professor>
  </Professors>
</CSDepartment>
Regular queries • Query languages for XML: XQuery, XPath, UnQL, Lorel, XQL, XML-QL, etc. • Built around regular expressions. • 3 basic operations: • Concatenation: . or / • Union: | • Iteration: * • Shorthand: _ matches any single label value; // stands for (_)*, any sequence of label values. • Example: //(Student | Professor)//Paper/Title
Regular queries • A pair of nodes (u, v) matches the regular query R if there is a path from u to v whose label sequence matches R. • The result of R for an input set I and an output set O: {(u, v) | u ∈ I, v ∈ O, (u, v) matches R}. • General case: I = {root} and O = V. • Every regular expression R can be represented by a nondeterministic finite automaton (NFA) that accepts the language L(R). The query graph is the graph representing this automaton.
Query processing based on the automata • The query graph of //B/D has states q0, q1, q2 and transitions labelled *, B, D. • Input: I={0}; Output: O={0,1,…,15} • The result = {(0,3),(0,11),(0,13)} [Figure: the query graph and a 16-node data graph (nodes 0 to 15, labels A to F) whose nodes are annotated with the query states they reach.]
Transform to edge-labelled graph • The query graph is an edge-labelled graph. • Transform the data graph into an edge-labelled graph. [Figure: a node-labelled graph and the corresponding edge-labelled graph.]
State-Data (SD) graph • SD graph = query graph joined with the data graph. • The SD graph may not be connected. • SD nodes: (data node, state node) pairs. • SD labelled edges: constructed by matching the labels of data edges and query-graph edges.
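Joining R := a/(b|c)*/a and the data graph • Query graph: states s0, s1, s2 with transitions labelled a, b, c. • Data graph: nodes 1 to 5 with edges labelled a, b, c. • SD graph nodes: (1,s0), (1,s1), (2,s0), (2,s1), (2,s2), (3,s0), (3,s1), (4,s1), (4,s2), (5,s1), (5,s2). • Result: (1,4), (1,5) [Figure: the query graph, the data graph and the resulting SD graph.]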
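The join of a query graph with a data graph can be sketched in a few lines of Python. This is only an illustration: the edge list `DATA` is a hypothetical edge-labelled graph (the edges of the slide's figure are not recoverable from the text), and the function name `sd_reachable` is made up for this example. The query transitions encode R := a/(b|c)*/a.

```python
from collections import deque

# Query graph for R := a/(b|c)*/a as (state, label) -> set of next states.
QUERY = {
    ("s0", "a"): {"s1"},
    ("s1", "b"): {"s1"},
    ("s1", "c"): {"s1"},
    ("s1", "a"): {"s2"},
}
START, FINAL = "s0", {"s2"}

# Hypothetical edge-labelled data graph: (source, label, target).
DATA = [(1, "a", 2), (2, "b", 3), (2, "c", 5), (2, "a", 5), (3, "a", 4), (5, "a", 4)]


def sd_reachable(data_edges, query, start_state, final_states, inputs):
    """Explore the SD graph (data node, state) on the fly and return
    the pairs (u, v) such that v is reached in a final state from u."""
    out_edges = {}
    for u, label, v in data_edges:
        out_edges.setdefault(u, []).append((label, v))

    result = set()
    for source in inputs:
        seen = {(source, start_state)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in final_states:
                result.add((source, node))
            # An SD edge exists when a data edge and a query edge share a label.
            for label, target in out_edges.get(node, []):
                for next_state in query.get((state, label), ()):
                    pair = (target, next_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return result


print(sd_reachable(DATA, QUERY, START, FINAL, inputs={1}))
```

Only SD nodes reachable from the input set are ever materialised, which is one reason the SD graph does not have to be connected as a whole.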
SD-graph representation in a relational database [KissVu05] • Main results: • The data graph and the query graph can be represented by tables. • SD graph (table) = join of the data table and the query table. • The result is computed from the SD table. • Regular query processing → Datalog + SQL. • Indexes are built to support the SQL computation.
Step 4: Computation in the relational database. Result: {4,5,6}
Classes of XML indexes • Indexing the basic values • Basic values are indexed (e.g. data(//emp/salary)) • Using a B+-tree • Indexing the text values • Keywords should be indexed • Indexes for XML trees • Quickly checking and computing the label sequence of the path between a pair of nodes • Also applicable to near-tree XML datasets • Structural indexes • Simulating the data graph by a smaller one to reduce the cost of computation
XML-tree pre/post numbering [Dietz82] • A preorder/postorder walk of the tree computes (pre(x), post(x)) for every node x. • x is an ancestor of y <=> pre(x) < pre(y) and post(x) > post(y). [Figure: a tree whose nodes are labelled (1,7), (2,4), (3,1), (4,2), (5,3), (6,6), (7,5).]
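The pre/post numbering can be sketched as follows. The node names and the tree shape are assumptions; they were chosen so that the computed pairs are the seven (pre, post) labels that appear on the slide.

```python
def number_tree(children, root):
    """Assign (pre, post) numbers with one depth-first walk."""
    pre, post = {}, {}
    counter = {"pre": 0, "post": 0}

    def visit(node):
        counter["pre"] += 1
        pre[node] = counter["pre"]        # numbered when first entered
        for child in children.get(node, []):
            visit(child)
        counter["post"] += 1
        post[node] = counter["post"]      # numbered when left for good

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """x is a proper ancestor of y: entered earlier and left later."""
    return pre[x] < pre[y] and post[x] > post[y]


# Hypothetical tree: r has children a and e; a has b, c, d; e has f.
TREE = {"r": ["a", "e"], "a": ["b", "c", "d"], "e": ["f"]}
PRE, POST = number_tree(TREE, "r")
print({n: (PRE[n], POST[n]) for n in PRE})
print(is_ancestor("a", "c", PRE, POST), is_ancestor("e", "c", PRE, POST))
```

Once the numbers are stored with the nodes, an ancestor test costs two integer comparisons and needs no traversal.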
Tree-structure improvement [Li & Moon, VLDB 2001] • Every node x is labelled (order(x), size(x)). • x is an ancestor of y <=> order(x) < order(y) and order(y) <= order(x) + size(x). [Figure: a tree whose nodes are labelled (1,100), (10,30), (11,5), (17,5), (25,5), (41,10), (45,5).]
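A small sketch of the (order, size) test, using the label pairs shown on the slide. The node names are hypothetical, and the inserted node `a4` is an assumption added to illustrate why the scheme leaves gaps in the numbering.

```python
# (order, size) labels taken from the slide; node names are made up.
LABELS = {
    "root": (1, 100),
    "a": (10, 30), "a1": (11, 5), "a2": (17, 5), "a3": (25, 5),
    "b": (41, 10), "b1": (45, 5),
}


def is_ancestor(x, y, labels=LABELS):
    """x is an ancestor of y iff order(y) lies in (order(x), order(x) + size(x)]."""
    order_x, size_x = labels[x]
    order_y, _ = labels[y]
    return order_x < order_y <= order_x + size_x


# The interval of "a" is (10, 40] and its children use only up to 30,
# so a new child can be labelled without renumbering any existing node.
LABELS["a4"] = (31, 5)

print(is_ancestor("a", "a3"), is_ancestor("b", "a3"), is_ancestor("a", "a4"))
```

Compared with plain pre/post numbers, the reserved size makes insertions cheaper, at the price of choosing the sizes in advance.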
Regular query processing over XML trees and near-trees • Very efficient when based on tree-structured indexes. • [KissVu06]: applied to near-tree XML datasets. • Link graph: connects the link nodes. • Tree-structured indexes are used for the basic structure.
1-index [Milo & Suciu, LNCS 1997] • Idea: group all “equivalent” data nodes into one index node. • The index nodes are computed with bisimulation instead of the equivalence ≡. • The index graph is smaller than the data graph. • Works for every regular query. • Bisimulation can be computed in polynomial time (PTIME).
Bisimulation • A bisimulation ≈ satisfies: • If x1 ≈ x2, then x1 and x2 have the same label. • If x1 ≈ x2 and (y1, x1) is an edge, then there exists an edge (y2, x2) with y1 ≈ y2. [Figure: edges (y1, x1) and (y2, x2), with node labels a and b.]
Example: 1-index • Query: /paper/section/algorithm [Figure: a data graph with nodes 1 to 18 labelled paper, section, title, exp, algorithm, proof, about, uses, and its 1-index, whose index nodes include {2,4,8,13}, {3,5,9,14}, {6,10}, {15,16} and {17,18}.]
Using the 1-index? • Good: works for all regular queries. • Bad: not small enough! • Idea: design the index graph only for the most frequently used queries; the index graph then becomes very small. • A new equivalence relation between nodes has to be defined. • If a query is not supported, it is re-checked on the data graph.
Structural indexes and a given set of queries • Important query classes and their indexes: • //a0/a1/…/ai (i <= k), paths no longer than k: A(k)-index • Dynamic indexes: APEX, D(k)-index • //S0/S1/…/Sk, SAPE queries: DL-1, DL-A*(k)-index • Forward-backward queries: F&B-index
A(k)-index [Kaushik et al. 02] • Supports the //a0/a1/…/ai (i <= k) queries. • Based on k-bisimulation ≈k: • u ≈0 v if u and v have the same label. • u ≈k v if u ≈k-1 v and • if (u', u) is an edge, there exists an edge (v', v) with u' ≈k-1 v', • if (v', v) is an edge, there exists an edge (u', u) with u' ≈k-1 v'.
A(k)-index example • Data graph: 1 imdb; 2 movie; 5 tv; 3, 6, 8 director; 4, 7, 9 name. • A(0)-index: imdb {1}, movie {2}, tv {5}, director {3,6,8}, name {4,7,9}. • A(1)-index: imdb {1}, movie {2}, tv {5}, director {3}, director {6,8}, name {4,7,9}. • A(2)-index (= 1-index): imdb {1}, movie {2}, tv {5}, director {3}, director {6,8}, name {4}, name {7,9}.
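The k-bisimulation behind the A(k)-index can be sketched as k rounds of partition refinement. The node labels below follow the imdb example; the edge list is an assumption (the edges are not recoverable from the slide text), chosen so that it is consistent with the A(0), A(1) and A(2) groupings shown. The function name `a_k_partition` is made up for this sketch.

```python
def a_k_partition(labels, edges, k):
    """Group nodes by k-bisimilarity: same label, and after each round
    the same set of parent classes from the previous round."""
    parents = {node: set() for node in labels}
    for u, v in edges:
        parents[v].add(u)

    # Round 0: the class of a node is its label.
    block = dict(labels)
    for _ in range(k):
        # Refine: keep the old class and add the set of parents' old classes.
        block = {
            node: (block[node], frozenset(block[p] for p in parents[node]))
            for node in labels
        }

    groups = {}
    for node, signature in block.items():
        groups.setdefault(signature, set()).add(node)
    return sorted(sorted(group) for group in groups.values())


LABELS = {1: "imdb", 2: "movie", 5: "tv",
          3: "director", 6: "director", 8: "director",
          4: "name", 7: "name", 9: "name"}
# Assumed edges: imdb -> movie/tv -> director -> name.
EDGES = [(1, 2), (1, 5), (2, 3), (5, 6), (5, 8), (3, 4), (6, 7), (8, 9)]

for k in range(3):
    print(k, a_k_partition(LABELS, EDGES, k))
```

The nested signatures grow with k, so this is only suitable for small examples; a practical implementation would renumber the classes after every round. Repeating the refinement until the partition stops changing gives the partition used by the 1-index.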
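Split operation [Figure: root R with children A and B and nodes C1 to C6, shown as data graph, A(2) (= 1-index), A(1) and A(0). Data graph: C1, C2, C3 and C4, C5, C6. A(2): C1, {C2,C3}, C4, {C5,C6}. A(1): C1, {C2,C3}, {C4,C5,C6}. A(0): {C1,C2,C3}, {C4,C5,C6}.]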
Refinement (step 1) [Figure: the data graph, A(2) (= 1-index), A(1) and A(0) for root R with children A and B and nodes C1 to C6.]
Refinement (step 2) [Figure: the data graph, A(2) (= 1-index), A(1) and A(0) for root R with children A and B and nodes C1 to C6.]
DL-1-index [KissVu06] • Supports //S0/S1/…/Sk queries (SAPE = Simple Alternation Path Expression). • Dynamic index (dynamic labelling).
A SAPE query: //(d|e)/f • R := S0/S1, where S0 = {d, e} and S1 = {f}. • The pairs (4,9), (5,10), (6,11) and (7,12) match R. • The result: TG(R) = {9,10,11,12} [Figure: data graph with nodes 0 to 13; nodes 4 to 8 are labelled d or e, nodes 9 to 12 are labelled f.]
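Evaluating a SAPE query directly on a node-labelled data graph can be sketched as below. The labels are read off the figure residue and should be treated as an assumption; only the four edges named on the slide plus one assumed edge (8, 13) are included, and the edges above nodes 4 to 8 are omitted because they do not affect this query. The function name `eval_sape` is made up.

```python
def eval_sape(labels, edges, steps):
    """Evaluate //S0/S1/.../Sk, where each Si is a set of allowed labels.
    Returns the nodes at which a matching label path ends."""
    # The leading // lets a match start at any node whose label is in S0.
    current = {node for node, label in labels.items() if label in steps[0]}
    for allowed in steps[1:]:
        # Follow one edge and keep the targets whose label is allowed.
        current = {v for u, v in edges if u in current and labels[v] in allowed}
    return current


LABELS = {0: "a", 1: "c", 2: "b", 3: "b",
          4: "d", 5: "e", 6: "d", 7: "d", 8: "e",
          9: "f", 10: "f", 11: "f", 12: "f", 13: "g"}
# Edges (4,9), (5,10), (6,11), (7,12) come from the slide; (8,13) is assumed.
EDGES = [(4, 9), (5, 10), (6, 11), (7, 12), (8, 13)]

print(sorted(eval_sape(LABELS, EDGES, [{"d", "e"}, {"f"}])))
```

Running the same function on an index graph instead of the data graph, and then expanding the index nodes it returns, is the way a structural index supporting the query would be used.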
Example: DL-1-index supporting the queries //(K|L) and //(B|C)/E • Initial DL-1-index. • The data graph and the 1-index are the same. • R1 = //(K|L) supported. • R2 = //(B|C)/E supported. [Figure, panels (a) to (d): the data graph with root A (node 0) and nodes 1 to 12 labelled K, L, M, N, B, C, D, E, F, and the DL-1-index after each refinement.]
The DL-A*(k)-index [KissVu06] • The A(i)-index is a special case of DL-A*(k). • The DL-A*(k)-index supports a given set of SAPE queries of length at most k.
DL-A*(1)-index supporting the queries //(K|L) and //(B|C)/E [Figure: the data graph (nodes 0 to 12, labels K, L, M, N, B, C, D, E, F), the initial index, the index after the //(K|L) refinement, and the index after the //(B|C)/E refinement.]
Experiments • DL-1 vs. 1-index • DL-A*(k) vs. A(k)-index • 2 datasets: • XMark: 100 MB, 1,681,342 nodes. • TreeBank: 82 MB, 2,437,667 nodes.
Distributed XML tree • XML tree = fragments (subtrees). • Servers store some of the fragments. • There are linking edges between fragments. • Questions: • How to find an efficient protocol for regular query processing? Waiting time and computing time. • Can structural indexes be applied?
Flow model (SPIDER algorithm) • Begins from the root. • On a transition (F, q) → (F', q'): • Processing on F stops. • Processing starts on F' with state q'. • When processing over F' finishes, the result is sent to F. • F continues. • Drawback: waiting time!
2-phase parallel model • Servers: each computes every possible state on its own site and sends the link edges to the coordinator. • The coordinator examines the link edges and requests the results from the servers. • The servers send the results to the coordinator. • Drawback: computing time!
1-phase parallel model [KissVu07] • The coordinator builds the structural tree-index for the whole system to determine the connected (F, q) states. • The index is processed first to compute the connected states. • Good: efficient processing. • Bad: the index may be big.
Structural tree-index • Connected states: (F0,q0), (F1,q0), … • (F2,q1) and (F2,q2) are not connected. [Figure: a data tree with nodes 0 to 15 split into fragments F0 to F5, annotated with the query states q0, q1, q2, and the corresponding index over the fragments.]
Experiments • 19 local Linux servers. • Waiting time: 1IP : 2P : SP = 1 : 1.94 : 37.52 • Computing time: 1IP : 2P : SP = 1 : 1.77 : 2.75
Native XML database systems
http://www.rpbourret.com/xml/XMLDatabaseProds.htm#native
Product | Developer | License | Database type
Qizx/db | XMLMind | Commercial | Proprietary
Sedna XML DBMS | ISP RAS MODIS | Free | Proprietary
Sekaiju / Yggdrasill | Media Fusion | Commercial | Proprietary
SQL/XML-IMDB | QuiLogic | Commercial | Proprietary (native XML and relational)
Sonic XML Server | Sonic Software | Commercial | Object-oriented (ObjectStore)
Tamino | Software AG | Commercial | Proprietary. Relational through ODBC
TeraText DBS | TeraText Solutions | Commercial | Proprietary
TEXTML Server | IXIASOFT, Inc. | Commercial | Proprietary
TigerLogic XDMS | Raining Data | Commercial | Pick
Timber | University of Michigan | Open Source (non-commercial only) | Shore, Berkeley DB
TOTAL XML | Cincom | Commercial | Object-relational
Virtuoso | OpenLink Software | Commercial | Proprietary. Relational through ODBC
XDBM | Matthew Parry, Paul Sokolovsky | Open Source | Proprietary
XDB | ZVON.org | Open Source | Relational (PostgreSQL)
XediX TeraSolution | AM2 Systems | Commercial | Proprietary
X-Hive/DB | X-Hive Corporation | Commercial | Proprietary. Relational through JDBC
Xindice | Apache Software Foundation | Open Source | Proprietary
xml.gax.com | GAX Technologies | Commercial | Proprietary
Xpriori XMS | Xpriori | Commercial | Proprietary
XQuantum XML Database Server | Cognetic Systems | Commercial | Proprietary
XStreamDB Native XML Database | Bluestream Db. Soft. Corp. | Commercial | Proprietary
Xyleme Zone Server | Xyleme SA | Commercial | Proprietary
Summary • Big XML datasets are used in many applications. • Our problem: efficient processing of regular queries over XML databases. • Two ideas: • Using relational databases • Building special indexes for XML databases