Cooperative XML Answering, by Wesley W. Chu and Shaorong Liu • Aditya Chintaluri, ITCS6050
Roadmap • XML today • Introduction to Query Relaxation • XML Data Model • XML Query Model • XML Query Answer • Relaxation Types • XPRESS • Cooperative Approach to Query Relaxation • System Architecture • Query Language • Relaxation Index (XTAH) • Query Relaxation Flow
Roadmap • XML content similarity • Weighted term frequency • Inverse element frequency • Vector Space Model • XML Answering Machine System Architecture • CoXML Answering Testbed • Summary
XML today • XML is more flexible and scalable than HTML and simpler than SGML. It is a textual representation of data in a tree-like hierarchical format. • Many query languages have been proposed for this tree-like data format, but they still have many problems. • Query relaxation was introduced to address these problems. It helps with: • Large datasets available on the Internet, where relaxation is important. • Querying differently structured databases from multiple data sources.
XML Data Model • Tree-like structure. • Each element is a node, and each element-to-element connection is an edge. • Each node u is identified by (id, label, <text>). • id uniquely identifies each node, e.g. 1, 2, …, 13. • label names the node, e.g. 1 = article, 2 = title, 3 = author. • <text> is optional and holds the content associated with the node.
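The data model above can be sketched in a few lines of Python. This is only an illustration: the class name `DataNode`, its methods, and the text values are assumptions made for the example; only the (id, label, text) triple and the labels article, title, and author come from the slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class DataNode:
    """One element of the XML data tree: (id, label, optional text)."""
    id: int
    label: str
    text: Optional[str] = None
    children: List["DataNode"] = field(default_factory=list)

    def add(self, child: "DataNode") -> "DataNode":
        """Attach a child element and return it."""
        self.children.append(child)
        return child

    def edges(self) -> Iterator[Tuple[int, int]]:
        """Yield every (parent id, child id) edge below this node."""
        for child in self.children:
            yield (self.id, child.id)
            yield from child.edges()


# Hypothetical document fragment using the labels from the slide.
article = DataNode(1, "article")
article.add(DataNode(2, "title", "Cooperative XML Answering"))
article.add(DataNode(3, "author", "W. W. Chu"))
edge_list = list(article.edges())
```

Each edge is stored implicitly through the `children` list, which mirrors the element-to-element connections described in the slide.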
XML Query Model • Every query is modeled as a twig T(root, V, E), where: • root is the root node of the query twig. • V is the set of nodes, each represented by (id, label, <cont>). • E is the set of edges in the twig. An edge from $u to $v is written e$u,$v and expresses a parent-child (or, after relaxation, ancestor-descendant) relationship. • For a twig T, we use T.root, T.V and T.E to denote its root, nodes and edges respectively. • For a node $v in a twig ($v ∈ T.V), we use $v.id, $v.label and $v.cont to denote the id, label and content condition of the node respectively. • A $ prefix marks a query node ($u); a plain name (u) denotes a data node.
XML Query Answer • For an XML data tree D and a query twig T, an answer to the query, A_T^D, is a set of data nodes such that: • For every query node $u ∈ T.V there is a unique data node u in the answer with $u.label = u.label; if $u.cont ≠ null and $u.cont is a database-style value constraint, then the text of the data node, u.text, satisfies that constraint. • For every edge e$u,$v ∈ T.E, let u and v be the data nodes in A_T^D that correspond to the query nodes $u and $v. The structural relationship between u and v must satisfy the edge constraint e$u,$v.
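The two answer conditions (label and value constraint per node, structural constraint per edge) can be illustrated with a small matcher. This is a simplified sketch, not the system's evaluation algorithm: it only reports the ids of data nodes that can serve as the twig root, the edge types "pc" (parent-child) and "ad" (ancestor-descendant) are encoded as strings, and all class names, function names, and sample data are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DataNode:
    id: int
    label: str
    text: str = ""
    children: List["DataNode"] = field(default_factory=list)


@dataclass
class QueryNode:
    id: int
    label: str
    cont: Optional[Callable[[str], bool]] = None  # value constraint, if any
    # Each entry is (edge_type, QueryNode), edge_type being "pc" or "ad".
    children: list = field(default_factory=list)


def descendants(u):
    """Yield all proper descendants of data node u."""
    for c in u.children:
        yield c
        yield from descendants(c)


def matches(q, u):
    """True if data node u satisfies query node q and its whole subtwig."""
    if q.label != u.label:
        return False
    if q.cont is not None and not q.cont(u.text):
        return False
    for edge, qc in q.children:
        candidates = u.children if edge == "pc" else descendants(u)
        if not any(matches(qc, c) for c in candidates):
            return False
    return True


def answers(q_root, d_root):
    """Ids of data nodes at which the twig rooted at q_root matches."""
    return [u.id for u in [d_root, *descendants(d_root)] if matches(q_root, u)]


# Hypothetical data tree: article/title and article/section/paragraph.
doc = DataNode(1, "article", children=[
    DataNode(2, "title", "search engine"),
    DataNode(3, "section", children=[DataNode(4, "paragraph", "xml retrieval")]),
])


def twig(paragraph_edge):
    return QueryNode(1, "article", children=[
        ("pc", QueryNode(2, "title", lambda s: "search" in s)),
        (paragraph_edge, QueryNode(3, "paragraph", lambda s: "xml" in s)),
    ])


q_pc, q_ad = twig("pc"), twig("ad")
```

With the parent-child edge the paragraph condition fails because the paragraph is a grandchild of the article; with the ancestor-descendant edge the same twig matches, which is exactly the situation that motivates edge relaxation.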
XML Relaxation Types • Value relaxation • Expands a value scope to allow matching of additional answers. • Structure relaxation • Derives approximate answers by relaxing the constraints on a node or an edge of a query twig. We focus on structure relaxation, which has three types: • Node relabeling: a node is relabeled to a similar or equivalent label according to domain knowledge, using rel($u, l). • Edge relaxation: a parent-child edge is relaxed to an ancestor-descendant edge, using gen(e$u,$v).
XML Relaxation Types • Node deletion: a node is deleted to obtain approximate answers, using del($v). When $v is a leaf node it can simply be removed. When it is an internal node, the children of $v are connected to the parent of $v with ancestor-descendant edges.
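The three structure relaxation operations can be sketched as pure functions on a twig. The representation below (a dict of node labels plus a dict of typed edges) and the sample labels are assumptions chosen for brevity; `delete` stands in for del($v) because `del` is a Python keyword.

```python
import copy


def make_twig(nodes, edges):
    """nodes: {id: label}; edges: {(parent, child): "pc" or "ad"}."""
    return {"nodes": dict(nodes), "edges": dict(edges)}


def rel(twig, u, label):
    """Node relabeling: give node u a similar or equivalent label."""
    t = copy.deepcopy(twig)
    t["nodes"][u] = label
    return t


def gen(twig, u, v):
    """Edge relaxation: parent-child edge (u, v) becomes ancestor-descendant."""
    t = copy.deepcopy(twig)
    t["edges"][(u, v)] = "ad"
    return t


def delete(twig, v):
    """Node deletion: remove v and reattach its children to v's parent."""
    t = copy.deepcopy(twig)
    parent = next((p for (p, c) in t["edges"] if c == v), None)
    if parent is None:
        raise ValueError("this sketch does not delete the twig root")
    del t["edges"][(parent, v)]
    for (p, c) in list(t["edges"]):
        if p == v:
            # Children of an internal node hang off the grandparent
            # through an ancestor-descendant edge.
            del t["edges"][(p, c)]
            t["edges"][(parent, c)] = "ad"
    del t["nodes"][v]
    return t


# Hypothetical twig: article/title and article/section/paragraph.
twig = make_twig(
    {1: "article", 2: "title", 3: "section", 4: "paragraph"},
    {(1, 2): "pc", (1, 3): "pc", (3, 4): "pc"},
)
without_section = delete(twig, 3)
```

Every operation returns a new twig and leaves the original untouched, so a sequence of relaxations can be explored without losing the user's query.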
XPRESS • Query relaxation can also be performed by transforming the XML schema into relational tables and then applying relational query relaxation techniques to the converted schema. • Fig. 4. The processing flow of XML query relaxation via schema conversion
XPRESS • XPRESS stands for XML Processing and Relaxing in Relational Storage System. • First, the XML schema information (such as the DTD) is extracted using tools like XML Spy. • The XML schema is then transformed into a relational schema using XPRESS. • Next, the XML documents are parsed, mapped into tuples and inserted into the relational database. • Relational relaxation techniques are then applied, so that semi-structured queries are relaxed as well. • Finally, the results in relational format are converted back to XML using XPRESS.
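To make the "parse, map into tuples, insert" step concrete, the sketch below shreds a tiny document into a single edge-style table and queries it with SQL. It is not XPRESS's mapping, which the slides do not describe in detail; the table layout, the function name `shred`, and the sample document are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
XML = ("<article><title>Cooperative XML</title>"
       "<author>Chu</author><author>Liu</author></article>")


def shred(xml_text, conn):
    """Store every element as one tuple (id, parent_id, label, text)."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "label TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?)",
            (node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)


conn = sqlite3.connect(":memory:")
shred(XML, conn)

# A parent-child condition (article/author) becomes a self-join.
authors = conn.execute(
    "SELECT c.text FROM node p JOIN node c ON c.parent_id = p.id "
    "WHERE p.label = 'article' AND c.label = 'author' ORDER BY c.id"
).fetchall()
```

Once the data is in tuples, value conditions can be relaxed with ordinary relational techniques, while structural conditions have to be expressed as joins, which hints at why structure relaxation is harder in this setting.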
Cooperative approach to Query Relaxation • Query relaxation is often user-specific: the appropriate relaxation depends on the user's requirements. • Relaxation that gives the user no control may lead to undesirable results. • Relaxation produces approximate answers, which have to be ranked according to structure and value constraints. • To overcome these limitations, a cooperative approach to query relaxation is proposed. • First, a relaxation language is developed; second, a relaxation structure is introduced that clusters relaxed twigs into groups based on relaxation types and distances. • A semantic-based tree editing distance is proposed to evaluate XML structure similarity.
The CoXML System Architecture • Once a query is posted, the relaxation engine passes it to the XML database engine. If enough answers matching the query are found, they are sent to the ranking module, which ranks them by relevance and then returns the results to the user. • If not enough answers are found, the relaxation engine, guided by the user-specified relaxation constructs and controls, consults the relaxation indexes for the best relaxed query. • The relaxed query is resubmitted to the XML database engine, the results are ranked by the ranking module, and the final results are returned to the user. This process is repeated until enough answers are found or the query cannot be relaxed further.
XML Query Relaxation Language • A relaxation language is proposed that, unlike other query languages, lets the user specify approximate conditions and control the relaxation process. • A relaxation-enabled query is defined as Q(T, R, C, S), where: • T is a twig. • R is a set of relaxation constructs specifying which conditions in T may be approximated when needed. • C is a Boolean combination of relaxation controls stating how the query may be relaxed. • S is a stop condition stating when to terminate the relaxation process.
XML Query Relaxation Language • The query is first evaluated exactly. If it yields enough exactly matched answers, no relaxation is required; otherwise the query is relaxed repeatedly until the stop condition is met or it cannot be relaxed further. • Q.T, Q.R, Q.C and Q.S denote the twig, relaxation constructs, control and stop condition of Q respectively. • The relaxation constructs are: • rel(u), where u ∈ Q.T.V, specifies that node u may be relabeled when needed. • del(u), where u ∈ Q.T.V, specifies that node u may be deleted if necessary. • gen(eu,v), where eu,v ∈ Q.T.E, specifies that edge eu,v may be generalized when needed.
XML Query Relaxation Language • The relaxation control C is a conjunction of conditions of the following kinds: • Node u cannot be relabeled or deleted, or edge eu,v cannot be generalized. • Node u should preferably be relabeled to the labels (l1, …, ln), in that order. • A set of labels is unacceptable for node u. • The constructs in R should be relaxed in the order (r1, …, rn). • A stop condition S is either: • AtLeast(n), where n is a positive integer giving the minimum number of answers to be retrieved. • d(Q.T, T′) ≤ τ, where T′ is a relaxed twig and τ is a distance threshold: relaxation terminates once the distance between T and T′ would exceed the threshold.
XML Query Relaxation Language • The example relaxation-enabled query (shown in the slide's figure) contains: • R, the relaxation constructs specified by the user. • C, the control conditions describing how relaxation is to be performed. • S, the stop condition.
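A relaxation-enabled query Q(T, R, C, S) can be modeled as a small record. The sketch below covers only two of the control kinds (non-relaxable constructs and relaxation order) and both stop conditions; the class names, field names, and the tuple encoding of constructs are assumptions made for illustration, not the language's actual syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StopCondition:
    """S: stop after at_least answers, or beyond distance max_distance."""
    at_least: Optional[int] = None          # AtLeast(n)
    max_distance: Optional[float] = None    # threshold tau

    def reached(self, num_answers, distance):
        if self.at_least is not None and num_answers >= self.at_least:
            return True
        if self.max_distance is not None and distance > self.max_distance:
            return True
        return False


@dataclass
class RelaxationQuery:
    twig: dict                                    # T
    constructs: List[Tuple]                       # R, e.g. ("gen", 1, 2)
    non_relaxable: List[Tuple] = field(default_factory=list)   # part of C
    relax_order: List[Tuple] = field(default_factory=list)     # part of C
    stop: StopCondition = field(default_factory=StopCondition)  # S

    def allowed(self):
        """Constructs that may be applied, in the user's preferred order."""
        blocked = set(self.non_relaxable)
        ordered = [r for r in self.relax_order
                   if r in self.constructs and r not in blocked]
        rest = [r for r in self.constructs
                if r not in blocked and r not in ordered]
        return ordered + rest


# Hypothetical query: three constructs, one blocked, one prioritized.
q = RelaxationQuery(
    twig={},
    constructs=[("gen", 1, 2), ("del", 3), ("rel", 4)],
    non_relaxable=[("rel", 4)],
    relax_order=[("del", 3)],
    stop=StopCondition(at_least=20),
)
```

Keeping R, C and S as separate fields mirrors the definition above: R says what may be relaxed, C constrains how, and S says when to stop.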
XML Relaxation Index • The XML type abstraction hierarchy (XTAH) builds on the type abstraction hierarchy (TAH) to provide systematic relaxation guidance. • A TAH represents objects at multiple levels, where higher-level objects are more general than lower-level ones. • The figure shows a TAH for brain tumor sizes. A query condition can be relaxed by moving up the tree (generalization) and then moving down the tree (specialization).
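A TAH for a numeric attribute can be sketched as a table of named ranges with parent links. The node names and numeric ranges below are invented for illustration, since the brain tumor size figure is not reproduced in the text; only the idea of generalizing by moving up the hierarchy comes from the slide.

```python
# Hypothetical TAH for tumor size; names and ranges are assumptions.
TAH = {
    "all":    {"range": (0.0, 15.0), "parent": None},
    "small":  {"range": (0.0, 3.0),  "parent": "all"},
    "medium": {"range": (3.0, 6.0),  "parent": "all"},
    "large":  {"range": (6.0, 15.0), "parent": "all"},
}


def is_leaf(name):
    """A node is a leaf when no other node names it as parent."""
    return all(node["parent"] != name for node in TAH.values())


def leaf_for(value):
    """Find the most specific TAH node whose range contains value."""
    for name, node in TAH.items():
        lo, hi = node["range"]
        if is_leaf(name) and lo <= value < hi:
            return name
    raise ValueError(f"{value} is outside the hierarchy")


def generalize(name):
    """Move one level up the hierarchy (the root generalizes to itself)."""
    parent = TAH[name]["parent"]
    return parent if parent is not None else name


def relaxed_range(value):
    """Relax the condition 'size = value' to the range one level up."""
    return TAH[generalize(leaf_for(value))]["range"]
```

For example, an exact condition on the value 4.0 is first abstracted to the leaf node that contains it and can then be relaxed to the range of that node's parent.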
XML Relaxation Index • The XML type abstraction hierarchy (XTAH) is introduced for XML; the XTAH of a twig structure T is denoted XT_T. • It represents the relaxed twigs of T at different levels of relaxation, organized by the operations they use and the distances between them. There are two types of nodes: • Internal: represents a cluster of relaxed twigs that use similar operations and are close to each other. • Leaf: a relaxed twig of T.
XML Relaxation Index • Every node has a unique ID. Internal node IDs are prefixed with I and leaf node IDs with T′. • For a relaxation operation r, let Ir be the internal node with label {r}. Ir represents a cluster of relaxed twigs whose common operation is r.
XML Relaxation Index • Each relaxed twig belongs to exactly one cluster, yet a twig may use several relaxation operations, so not every relaxed twig that uses r is in the group Ir. Example: T2′ uses the operations gen(e$1,$2) and gen(e$4,$5). It could belong to either I4 or I7, but it is closer to I4, so it is not included under I7, the internal node for gen(e$4,$5). • To overcome this, a virtual link is added from Ir to Ik, where Ik is not a descendant of Ir but all twigs within Ik use the operation r. With virtual links, every twig that uses r is reachable from Ir.
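The role of virtual links can be illustrated with a toy XTAH fragment. Only I4, I7 and T2′ and their operations come from the slide; the nodes I10 and T8′, the class `XTAHNode`, and both helper functions are hypothetical and exist only to make the example complete.

```python
from dataclasses import dataclass, field


@dataclass
class XTAHNode:
    id: str
    ops: frozenset                                 # operations shared below
    children: list = field(default_factory=list)   # empty for a leaf twig
    virtual: list = field(default_factory=list)    # virtual links


def twigs_under(node):
    """Leaf twigs physically stored in the subtree of node."""
    if not node.children:
        return [node.id]
    found = []
    for child in node.children:
        found.extend(twigs_under(child))
    return found


def twigs_using(node):
    """All twigs using node's operations, following virtual links too."""
    found = twigs_under(node)
    for target in node.virtual:
        for twig_id in twigs_under(target):
            if twig_id not in found:
                found.append(twig_id)
    return found


G12, G45 = "gen(e$1,$2)", "gen(e$4,$5)"
t2 = XTAHNode("T2'", frozenset({G12, G45}))
t8 = XTAHNode("T8'", frozenset({G45}))                    # hypothetical leaf
i10 = XTAHNode("I10", frozenset({G12, G45}), children=[t2])  # hypothetical
i4 = XTAHNode("I4", frozenset({G12}), children=[i10])
# T2' lives under I4, so I7 reaches it only through a virtual link.
i7 = XTAHNode("I7", frozenset({G45}), children=[t8], virtual=[i10])
```

Without the virtual link, a search starting at I7 would miss T2′ even though that twig uses gen(e$4,$5).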
Query Relaxation Flow • Given a relaxation-enabled query Q(T, R, C, S) and the XTAH for Q.T, the algorithm first searches for exactly matched answers. If there are enough answers, no relaxation is needed. • If relaxation is needed, the algorithm prunes internal nodes that involve unacceptable operations, unacceptable node labels or rejected relaxation types. This can be carried out efficiently using the internal node labels and virtual links.
Query Relaxation Flow • Based on the relaxation constructs and control, the XTAH is searched for a relaxed query that satisfies the user's specifications. • If further relaxation is needed, the search is repeated iteratively, each time looking for relaxed queries that are closest to the original query by distance. • Finally, the answers are ranked by their similarity to the structure and content conditions of the query.
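The flow above can be condensed into one function. In this sketch a flat list of relaxed twigs stands in for the XTAH traversal, and ranking is left out; the function name, its parameters, the dictionary layout, and the sample data are all assumptions.

```python
def relax_and_answer(exact_answers, relaxed_twigs, run_twig,
                     rejected_ops, at_least, max_distance):
    """Collect answers, relaxing only as far as needed.

    relaxed_twigs: list of {"id", "ops", "dist"} dictionaries.
    run_twig: callable that returns the answers of one relaxed twig.
    """
    answers = list(exact_answers)
    if len(answers) >= at_least:
        return answers                      # exact answers suffice

    # Prune relaxed twigs that use an operation the user rejected.
    candidates = [t for t in relaxed_twigs
                  if not set(t["ops"]) & set(rejected_ops)]

    # Try the remaining twigs from the closest to the farthest.
    for twig in sorted(candidates, key=lambda t: t["dist"]):
        if twig["dist"] > max_distance:
            break                           # distance threshold exceeded
        for answer in run_twig(twig["id"]):
            if answer not in answers:
                answers.append(answer)
        if len(answers) >= at_least:
            break                           # AtLeast(n) satisfied
    return answers


# Hypothetical relaxed twigs and the answers each one would return.
relaxed = [
    {"id": "T1'", "ops": ["gen(1,2)"], "dist": 0.2},
    {"id": "T2'", "ops": ["del(3)"], "dist": 0.1},
    {"id": "T3'", "ops": ["rel(4)"], "dist": 0.5},
]
store = {"T1'": ["a2", "a3"], "T2'": ["a9"], "T3'": ["a4"]}

result = relax_and_answer(["a1"], relaxed, lambda twig_id: store[twig_id],
                          rejected_ops={"del(3)"}, at_least=3,
                          max_distance=1.0)
```

Here T2′ is the closest relaxed twig but is pruned because it uses a rejected operation, so the search continues with T1′ and stops as soon as three answers are available.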
XML Content Similarity • Denoted by cont_sim(A, Q), • where A is the answer and Q is the query. • XML content similarity is measured with an extended vector space model, which is based on: • Weighted term frequency • Inverse element frequency
XML Content Similarity • Weighted Term Frequency • Occurrences of a term are weighted according to where they occur. • Example: an occurrence in the title carries more weight than one in a paragraph. • The weighted term frequency of a term t in a data node v is denoted tfw(v, t). • It is computed over the m paths p1, …, pm in data node v that contain term t, where f(v, pj, t) is the frequency of term t in node v via path pj.
XML Content Similarity • This measure assigns a weight to a term t in data node v based on both its occurrence frequency and its occurrence path. • For a data node v and a term t, an occurrence path of t in v is p = v1.v2…vk, where vk is a descendant node of v. • Let w(p) and w(vi) denote the weights of path p and node vi respectively. • The weight of path p = v1.v2…vk is a function of the weights of the nodes on the path, w(p) = f(w(v1), …, w(vk)), with the following properties: • f(w(v1), w(v2), …, w(vk)) is monotonically increasing with respect to each w(vi) (1 ≤ i ≤ k): the weight of a path increases if the weight of any node on it increases. • f(w(v1), w(v2), …, w(vk)) = 0 if any w(vi) = 0 (1 ≤ i ≤ k): if the weight of any node on the path is zero, the weight of the path is zero.
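A minimal sketch of the weighted term frequency follows. Two assumptions are made because the slide text does not preserve the formulas: the path weight function f is taken to be the product of the node weights (one simple choice that satisfies both stated properties for non-negative weights), and tfw(v, t) is taken to be the sum over the m paths of w(pj) · f(v, pj, t). The function names and sample weights are hypothetical.

```python
from math import prod


def path_weight(node_weights):
    """w(p) for p = v1.v2...vk, assumed here to be the product of w(vi).

    For non-negative weights the product never decreases when a node
    weight grows, and it is zero as soon as any node weight is zero.
    """
    return prod(node_weights)


def weighted_tf(occurrences):
    """tfw(v, t), assumed to be the sum of w(pj) * f(v, pj, t).

    occurrences: list of (node weights along pj, frequency of t via pj).
    """
    return sum(path_weight(weights) * freq for weights, freq in occurrences)


# Hypothetical weights: one hit via a title path, four via a paragraph path.
tf_w = weighted_tf([
    ([1.0, 2.0], 1),   # article.title, title weighted 2.0
    ([1.0, 0.5], 4),   # article.paragraph, paragraph weighted 0.5
])
```

With these numbers a single title occurrence contributes as much as four paragraph occurrences, which matches the intuition in the slide that a title occurrence weighs more.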
XML Content Similarity • Inverse Element Frequency • Inverse element frequency (ief) distinguishes terms with different discriminative powers. • Given a query Q and a term t, let $u be a node in the twig Q.T with t ∈ $u.cont. • DN is the set of data nodes that match the structure condition on $u. • The more frequently term t occurs in the data nodes of DN, the less discriminative power t has. • ief is computed from N1, the number of nodes in the set DN, and N2, the number of nodes in DN that contain the term t in their text parts.
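The slide names N1 and N2 but its formula is not preserved in the text. The sketch below assumes the usual logarithmic form, ief = log(N1 / N2), by analogy with inverse document frequency; treat the exact form as an assumption. The function name and its error handling are also illustrative.

```python
import math


def ief(n1, n2):
    """Inverse element frequency, assumed to be log(N1 / N2).

    n1: number of data nodes in DN.
    n2: number of nodes in DN whose text contains the term.
    """
    if not 0 < n2 <= n1:
        raise ValueError("need 0 < N2 <= N1")
    return math.log(n1 / n2)
```

Under this assumed form, a term that appears in every node of DN gets an ief of zero, and rarer terms get larger values, which agrees with the statement that frequent terms have less discriminative power.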
XML Content Similarity • Vector Space Model • It combines the weighted term frequency and the inverse element frequency. • Given a data node v and a query node $u, the content similarity cont_sim(v, $u) is computed over the terms t in the content condition of $u. • Here t is a term in the content condition of $u, m(t) is the modifier prefixed to term t, and w(m(t)) is the user-specified weight for that modifier.
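One plausible way to combine the three quantities is a weighted sum over the query terms: cont_sim(v, $u) = Σ w(m(t)) · tfw(v, t) · ief($u, t). The slide's own equation is not preserved, so this form is an assumption, as are the function signature, the "+" modifier, and all numbers in the example.

```python
def cont_sim(terms, tf_w, ief, modifier_weight):
    """Content similarity between a data node v and a query node $u.

    terms: list of (term, modifier) pairs from the content condition of $u.
    tf_w: weighted term frequency of each term in v.
    ief: inverse element frequency of each term with respect to $u.
    modifier_weight: user-specified weight w(m(t)) of each modifier.
    """
    return sum(modifier_weight[m] * tf_w[t] * ief[t] for t, m in terms)


# Hypothetical query "+xml search": "+" marks a term the user emphasizes.
score = cont_sim(
    terms=[("xml", "+"), ("search", "")],
    tf_w={"xml": 1.5, "search": 2.0},
    ief={"xml": 1.0, "search": 0.5},
    modifier_weight={"+": 2.0, "": 1.0},
)
```

In this example the emphasized term contributes 3.0 and the plain term 1.0, so the modifier weight lets the user shift the score toward the terms they care about most.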
XML Content Similarity • Semantic-based structure distance • Measures tree-to-tree similarity as the structural distance between an answer A and a query Q. • struct_dist(A, Q) is the editing distance between the twig Q.T and the least relaxed twig T′ that A matches. • It equals d(Q.T, T′), the total cost of the operations needed to relax Q.T into T′.
XML Content Similarity • Node relabel – rel(u,l) • Node deletion – del(u) • Edge generalization – gen(ev,u)
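Since the structure distance is the total cost of the operations that turn the query twig into the relaxed twig, it can be sketched as a sum over an operation list. The constant costs below are placeholders: the actual costs are semantic-based and are not given in the slide text, so both the numbers and the tuple encoding are assumptions.

```python
# Placeholder per-operation costs; the real costs depend on the semantic
# similarity of the nodes involved and are not constants.
COST = {"rel": 0.3, "del": 0.5, "gen": 0.2}


def struct_dist(operations):
    """Total cost of the operations that relax Q.T into T'.

    operations: tuples such as ("gen", 1, 2) or ("del", 3), where the
    first item names the operation and the rest identify nodes.
    """
    return sum(COST[op[0]] for op in operations)


# Hypothetical relaxed twig obtained by one edge generalization and
# one node deletion.
distance = struct_dist([("gen", 1, 2), ("del", 3)])
```

An exactly matched answer uses no operations, so its distance is zero, and each additional relaxation moves the answer farther from the original query.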
XML Content Similarity • Relevancy Ranking Model • The relevancy of an answer A to a query Q, sim(A, Q), is a function of two factors: the structure distance struct_dist(A, Q) and the content similarity cont_sim(A, Q). • In this function α is a constant between 0 and 1.
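A ranking function consistent with this description is sim(A, Q) = α^struct_dist(A, Q) · cont_sim(A, Q): content similarity is discounted exponentially as the structure distance grows. The slide's equation is not preserved in the text, so this form is an assumption, as are the function names, the default α = 0.5, and the sample answers.

```python
def sim(struct_dist, cont_sim, alpha=0.5):
    """Relevancy, assumed to be alpha ** struct_dist * cont_sim."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (alpha ** struct_dist) * cont_sim


def rank(answers, alpha=0.5):
    """Sort answers from most to least relevant."""
    return sorted(
        answers,
        key=lambda a: sim(a["struct_dist"], a["cont_sim"], alpha),
        reverse=True,
    )


# Hypothetical answers: a1 has better content but needed more relaxation.
ranked = rank([
    {"id": "a1", "struct_dist": 2, "cont_sim": 4.0},
    {"id": "a2", "struct_dist": 0, "cont_sim": 3.0},
])
```

Because α is below 1, an exactly matched answer keeps its full content score, while an answer that needed relaxation is penalized, so a2 outranks a1 here despite its lower content similarity.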
XML Answering Machine System Architecture • Data Source Mediator (DSM): provides a virtual database interface for querying different data sources with different schemas. • Query Parser Mediator (PM): parses the queries from the application layer and transforms them into query representation objects.
XML Answering Machine System Architecture • Relaxation Mediator (RM): • The core component, consisting of a preprocessor, the relaxation manager and a postprocessor.
XML Answering Machine System Architecture • A relaxation-enabled query is first presented to the preprocessor, where relaxation constructs are transformed into XML constructs. • All relaxation control operations are forwarded to the relaxation manager.
XML Answering Machine System Architecture • The modified query is presented to the underlying databases for execution. • If no answers are found, relaxation is applied using the XTAH until the stop condition is met or the query is no longer relaxable. • The final answers are returned to the postprocessor. • Directory Mediator (DM): provides the locations, characteristics and functionalities of all mediators in the system and is used by peer mediators to locate a mediator that performs a specific function.
XML Answering Machine System Architecture • XTAH Mediator (XTM): • Provides three separate but interlinked functions to peer mediators: the XTAH directory, the XTAH editor and XTAH management. • The XTAH directory is searchable by XML query tree structure. • XTAH management provides client mediators with traversal and data extraction functions. • The XTAH editor lets users edit XTAHs according to their needs.
CoXML Answering Testbed • Query Parser: checks the syntax of a query and, if it is correct, creates a query representation object. • Preprocessor: transforms relaxation constructs into XML constructs. • Relaxation Manager: builds a relaxation structure based on the specified relaxation constructs, obtains the required query conditions from the XTAH Manager, modifies the query accordingly and retrieves the exactly matched answers.
CoXML Answering Testbed • Database Manager: interacts with the XML database engine and returns exactly matched answers. • XTAH Manager: selects the appropriate XTAH based on the query tree structure. • Postprocessor: takes unsorted answers as input and ranks them.
Conclusion • XML query relaxation can be done either directly on the XML model or through schema conversion. • Schema conversion can lose data and does not support XML structure relaxation, so the XML model approach is preferred. • An XML system was developed that provides cooperative, user-specific query answering. • A query relaxation language was developed that allows users to specify approximate conditions and control relaxation. • The XTAH relaxation index structure was developed, which clusters relaxed twigs into groups based on relaxation types and distances. • A ranking model was introduced that combines content and structure similarity in evaluating the overall answer match. • Finally, a mediator-based CoXML architecture was presented.