Flexible and Efficient XML Search with Complex Full-Text Predicates
Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
Emiran Curtmola - University of California San Diego
Alin Deutsch - University of California San Diego
Introduction
• Need for complex full-text predicates beyond simple keyword search
  • Library of Congress (LoC)
  • Biomedical data
  • ACM, IEEE publications
  • INEX data collection
  • Wikipedia XML data set
SIGMOD, June 2006
XML real fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
[Figure: XML tree of a bill. Element names: bill, congress-info, legis-session, legis, legis-desc, nbr, sponsors, legis-body, action, action-desc, committee-name. Text content includes "HR2739", "109th", "House of Representatives", "Committees on education are headed by Jefferson", "Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson", and "The bill was reintroduced later and was referred to the committee on education and workforce sponsored by Joe Jefferson".]
Query with complex FT predicates
• Document fragments (nodes) that contain the keywords “Jefferson” and “education” and satisfy the predicates
  • within a window of 10 words,
  • with “Jefferson” ordered before “education”
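The two predicates can be illustrated on a flat list of words. This is a minimal sketch, not the evaluation method of the talk: the function names are hypothetical, and it assumes a "window of N words" means that the span covering both keyword occurrences is at most N words long.

```python
from itertools import product

def positions(words, keyword):
    """Return the 0-based word offsets at which `keyword` occurs."""
    return [i for i, w in enumerate(words) if w == keyword]

def satisfies(words, first, second, window=10):
    """True if some occurrence of `first` precedes some occurrence of
    `second` and the span covering both is at most `window` words."""
    for p1, p2 in product(positions(words, first), positions(words, second)):
        is_ordered = p1 < p2
        in_window = abs(p2 - p1) + 1 <= window  # assumed window definition
        if is_ordered and in_window:
            return True
    return False

# Fragment taken from the LoC example: "education" comes before "Jefferson".
loc_fragment = "Committees on education are headed by Jefferson".split()

# Made-up sentence (not in the LoC document) where the order is satisfied.
made_up = "Jefferson chairs the committee on education".split()

print(satisfies(loc_fragment, "Jefferson", "education"))
print(satisfies(made_up, "Jefferson", "education"))
```

The first fragment contains both keywords within 10 words but fails the ordering predicate, which is exactly the kind of distinction plain conjunctive keyword search cannot express.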
Example: LoC document
[Figure: XML tree of the LoC bill fragment.]
Existing languages
• Many XML full-text search languages
  • expressive power, semantics, scores [BAS-06]
  • XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery
• Efficient query evaluation limited to
  • Conjunctive keyword search (no predicates)
  • Full-text predicates in isolation
• Need for a universal optimization framework
  • Guarantee the universality of the solution
Contributions
• Formal semantics for XQFT-class
• Unified framework
• Capture family of tf*idf scoring methods
• Structure-aware algorithms to efficiently evaluate XQFT-class languages
• XFT full-text algebra
• Enable new optimizations inspired by relational rewritings
Talk Outline
• Motivation & Contributions
• Formalization of XML full-text search
• Efficient evaluation
• Experiments
• Conclusion
Formalization: design goals
• Capture existing full-text languages
• Language semantics in terms of
  • keyword patterns
  • pattern matches
  • predicates evaluated through matches
• Manipulate tuples
  • enable relational query evaluation and rewritings
Formalization: patterns
• Pattern = tuple of simultaneously matching keywords
• Query expression: “Jefferson” and “education”
  • within a window of 10 words,
  • with “Jefferson” ordered before “education”
Formalization: patterns
• Formalization specifies
  • patterns ← conjunction of keywords
  • set of patterns ← disjunction of keywords
  • exclusion patterns ← negation of keywords
    • No matches in the document
Formalization: matches
[Figure: XML tree of the LoC bill fragment.]
Formalization: matching tables
• Matching table represents
  • Nested relation
  • Each node in the document
  • Each pattern in the query
  • Set of matches
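A matching table can be pictured as a nested relation keyed by node, with a set of matches per node. The sketch below is one possible encoding under stated assumptions: the toy document, the dotted node identifiers, the word offsets, and the helper names are all hypothetical, and a match for a single keyword is modelled as a 1-tuple holding a word offset.

```python
from collections import defaultdict

# Hypothetical toy document: leaf node id -> list of (global word offset, word).
LEAVES = {
    "1.1.1": [(0, "Jefferson")],
    "1.1.3": [(1, "committee"), (2, "on"), (3, "education")],
    "1.2": [(4, "Jefferson")],
}

def ancestors_or_self(node_id):
    """'1.1.3' -> ['1.1.3', '1.1', '1'] for dotted hierarchical ids."""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

def get(keyword):
    """Build the matching table for one keyword: node id -> set of matches.
    Every node that contains the keyword (directly or through a descendant)
    gets a row, so a match is repeated at each ancestor."""
    table = defaultdict(set)
    for leaf, words in LEAVES.items():
        for offset, word in words:
            if word == keyword:
                for node in ancestors_or_self(leaf):
                    table[node].add((offset,))
    return dict(table)

for node, matches in sorted(get("Jefferson").items()):
    print(node, sorted(matches))
```

Note how the match at offset 0 appears in three rows (1.1.1, 1.1, and 1); this repetition is the redundancy that a more compact representation can avoid.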
Formalization: matching tables
[Figure: XML tree of the LoC bill fragment.]
XFT Algebra
• Similar to relational algebra
• Manipulate matching tables
• Leverage relational query evaluation + optimization techniques
• XFT operators
  • get(k): construct matching table Rk for each keyword k
  • manipulate matching tables: R1 or R2, R1 and R2, R1 minus R2, σtimes(R), σordered(R), σwindow(R), σdistance(R)
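Some of these operators can be sketched over matching tables encoded as Python dictionaries (node id → set of matches, each match a tuple of word offsets). This is an illustration under assumptions rather than the talk's definitions: the tables are made up, `minus` is assumed to keep the nodes of R1 that do not appear in R2, and the window is assumed to be the span covering all offsets of a match.

```python
# Toy matching tables; node ids and offsets are invented for illustration.
R_JEFFERSON = {"1": {(0,), (4,)}, "1.1": {(0,)}, "1.1.1": {(0,)}, "1.2": {(4,)}}
R_EDUCATION = {"1": {(3,)}, "1.1": {(3,)}, "1.1.3": {(3,)}}

def ft_and(r1, r2):
    """R1 and R2: equi-join on node id, concatenating matches pairwise."""
    return {n: {m1 + m2 for m1 in r1[n] for m2 in r2[n]}
            for n in r1.keys() & r2.keys()}

def ft_minus(r1, r2):
    """R1 minus R2 (assumed semantics): nodes of R1 absent from R2."""
    return {n: ms for n, ms in r1.items() if n not in r2}

def select(pred, r):
    """Selection: keep matches satisfying `pred`, drop nodes left empty."""
    filtered = {n: {m for m in ms if pred(m)} for n, ms in r.items()}
    return {n: ms for n, ms in filtered.items() if ms}

def ordered(match):
    """Offsets appear in increasing order, following keyword order."""
    return all(a < b for a, b in zip(match, match[1:]))

def window(size):
    """Predicate factory: all offsets fit in a span of `size` words."""
    return lambda match: max(match) - min(match) + 1 <= size

joined = ft_and(R_JEFFERSON, R_EDUCATION)
result = select(window(10), select(ordered, joined))
print(sorted(result.items()))
```

Because every operator consumes and produces the same table shape, operators compose freely, which is what makes relational-style rewritings possible.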
XFT Algebra
• Query: nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”
• Benefit: equivalent query rewritings
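One familiar relational rewriting is pushing a selection below a join. The following sketch checks, on made-up tables, that applying an ordering predicate on the first two keywords before or after joining a third keyword gives the same result. The table contents, the third keyword, and the helper names are assumptions for illustration only.

```python
def ft_and(r1, r2):
    """Equi-join on node id, concatenating matches pairwise."""
    return {n: {m1 + m2 for m1 in r1[n] for m2 in r2[n]}
            for n in r1.keys() & r2.keys()}

def select(pred, r):
    """Keep matches satisfying `pred`, drop nodes left without matches."""
    filtered = {n: {m for m in ms if pred(m)} for n, ms in r.items()}
    return {n: ms for n, ms in filtered.items() if ms}

def ordered_on(i, j):
    """Predicate: the offset in column i precedes the offset in column j."""
    return lambda match: match[i] < match[j]

# Invented single-node tables for three keywords.
A = {"1": {(0,), (4,)}}   # e.g. get("Jefferson")
B = {"1": {(3,)}}         # e.g. get("education")
C = {"1": {(7,), (9,)}}   # e.g. get("workforce")

# Selection applied last, on top of the full three-way join.
late = select(ordered_on(0, 1), ft_and(ft_and(A, B), C))

# Selection pushed below the second join.
pushed_input = select(ordered_on(0, 1), ft_and(A, B))
early = ft_and(pushed_input, C)

print(late == early)
print(len(ft_and(ft_and(A, B), C)["1"]), "vs", len(early["1"]))
```

Both plans return the same table, but the pushed version carries fewer matches into the second join, which is the intuition behind preferring the rewritten form.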
Talk Outline
• Motivation & Contributions
• Formalization of XML full-text search
• Efficient evaluation
• Experiments
• Conclusion
Query evaluation: AllNodes
• Straightforward implementation of the XFT algebra
• Each node is considered separately
• Each tuple is self-contained
• Relational-style evaluation
  • Joins → equi-joins
  • Predicates → selections on set of matches
Example: LoC document
[Figure: XML tree of the LoC bill fragment, annotated with hierarchical node identifiers 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2.]
The predicate operates on one tuple at a time
Query evaluation: SCU
• AllNodes = straightforward algorithm
• Reduce size of intermediate results
  • structural relationships between nodes
  • avoid redundant match representation
• SCU = Smallest Containing Unit
Matching tables → SCU tables: both capture the same information
• Equi-join does not work
• Need to compute LCA
1.1 is the LCA of 1.1.3 and 1.1.1
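With dotted hierarchical identifiers like the ones on the example tree, the lowest common ancestor (LCA) of two nodes is their longest common prefix of components. A minimal sketch, assuming identifiers are dot-separated strings and both nodes belong to the same tree (the function name is hypothetical):

```python
def lca(a, b):
    """Lowest common ancestor of two dotted node ids, computed as the
    longest common prefix of their components. Returns '' when the ids
    share no prefix (nodes from different trees)."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

print(lca("1.1.3", "1.1.1"))
```

Components are compared as whole strings, not character by character, so identifiers such as 1.12 and 1.1 are not confused.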
• Postorder
• Stack supports single scan
SCU summary
• Equivalent to AllNodes
• Structure-awareness reduces size of intermediate results
• Increase computation cost
  • Compute LCAs of nodes
  • Match propagation
• Stack-based techniques
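The trade-off can be made concrete with a toy example. The sketch below stores each match only at the smallest node that contains it, joins two such tables by computing LCAs, and then expands the result to all ancestors to compare against the AllNodes-style table. It is a naive nested-loop illustration under assumptions (dotted node ids, made-up offsets, hypothetical function names) and does not reproduce the stack-based single-scan technique of the talk.

```python
def lca(a, b):
    """Longest common prefix of the dotted id components."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def ancestors_or_self(node_id):
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

def scu_and(s1, s2):
    """Join two SCU-style tables: each combined match is stored once,
    at the LCA of the two nodes that contributed to it."""
    out = {}
    for n1, ms1 in s1.items():
        for n2, ms2 in s2.items():
            out.setdefault(lca(n1, n2), set()).update(
                m1 + m2 for m1 in ms1 for m2 in ms2)
    return out

def expand(scu):
    """Propagate every match to all ancestors (AllNodes-style table)."""
    full = {}
    for node, ms in scu.items():
        for anc in ancestors_or_self(node):
            full.setdefault(anc, set()).update(ms)
    return full

# Invented tables: each match is kept only at its smallest containing node.
S_JEFFERSON = {"1.1.1": {(0,)}, "1.2": {(4,)}}
S_EDUCATION = {"1.1.3": {(3,)}}

scu = scu_and(S_JEFFERSON, S_EDUCATION)
full = expand(scu)
print(sum(len(ms) for ms in scu.values()), "stored matches vs",
      sum(len(ms) for ms in full.values()))
```

The compact table stores fewer matches, yet the expanded form can be recovered from it, at the price of LCA computation and propagation work.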
Related work on LCA for XML
• LCA for conjunctive keyword search
  • XRank [GSBS-03]
  • Schema-free XQuery [LYJ-04]
  • XKSearch [XP-05]
• Shortcomings
  • No postprocessing, not compositional
  • Input in document order
  • Output postorder traversal
  • Support for complex predicates is not straightforward
Talk Outline
• Motivation & Contributions
• Formalization of XML full-text search
• Efficient evaluation
• Experiments
• Conclusion
Experimental goals
• AllNodes vs. SCU
  • AllNodes: redundant representation
  • SCU: smaller sizes, more computation
• SCU Overhead
  • Stack
  • Match propagation
• Benefit of Rewritings
  • Relational-style rewritings
Experimental setup
• Centrino 1.8GHz with 1GB of RAM
• XMark generated datasets
• Sizes range from 50 MB to 300 MB
Experiments: AllNodes vs. SCU
[Figure: varying document size (q1, query without predicates)]
• q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: SCU Overhead
• Queries
  • q4 = σwindow>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  • q5 = σwindow>90000000(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
• Recall that
  • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: SCU Overhead
[Figure: varying query predicates (not pushed)]
• q4 always true → no match propagation, just the stack overhead
• q5 always false → propagate all matches
Experiments: Benefit of Rewritings
• Queries
  • q2 = σorderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  • q3 = push selections in q2
• Recall that
  • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: Benefit of Rewritings
[Figure: varying document size (query with predicates)]
• 40% improvement for relational-like query rewritings
Conclusion
• A unified logical framework for XML full-text search languages
• Algebra admits
  • Efficient algorithms for operator evaluation
  • Rewritings of queries into more efficient forms
• Facilitates joint optimization of XML queries over both structure and text search
• Future work
  • Score-aware logical framework
Thank you!