
XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Yifei Lu¹, Wei Wang¹, Jianxin Li², and Chengfei Liu². ¹University of New South Wales, ²Swinburne University of Technology.


Presentation Transcript


  1. XClean: Providing Valid Spelling Suggestions for XML Keyword Queries • Yifei Lu¹, Wei Wang¹, Jianxin Li², and Chengfei Liu² • ¹University of New South Wales, ²Swinburne University of Technology

  2. XML Keyword Search • User: I want to find data mining papers coauthored by Jiawei Han • Query: jiawi minning • [Figure: DBLP XML tree with two paper sub-trees and one book sub-tree, whose title and author nodes contain text such as "Mining concept", "link mining", "Jiawei Han", "Jian Pei", and "Manning"]

  3. Challenges • Must offer highly plausible suggestions • The suggested query should have non-empty results • Must be highly efficient

  4. Poor Suggestion

  5. Empty Result

  6. Empty Result • Query: jiawi minning • Pu and Yu [PVLDB08] will suggest “jian manning” • Worse than “jiawei mining” • No meaningful connection • [Figure: the DBLP XML tree from slide 2]

  7. Problem Definition • Data: a set of XML document trees, combined into a single tree by adding a virtual root node • Query: {jiawi, minning} • Confusion Set: the valid words in the vocabulary whose edit distance from a keyword is ≤ a threshold (jiawi → jiawei, jian; minning → mining, manning) • Candidate Query Space: the queries formed by picking one word from each keyword's Confusion Set • Query Cleaning: find the top-k queries in the Candidate Query Space, ranked by the probability of each candidate given the observed query
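The confusion sets and the candidate query space can be illustrated with a minimal Python sketch. The vocabulary, the threshold of 2, and the function names are assumptions made for illustration; the slides do not give them.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def confusion_set(keyword, vocabulary, threshold):
    """Vocabulary words within `threshold` edits of the keyword."""
    return sorted(w for w in vocabulary
                  if edit_distance(keyword, w) <= threshold)


def candidate_query_space(query, vocabulary, threshold):
    """Every combination of one confusion-set word per query keyword."""
    sets = [confusion_set(k, vocabulary, threshold) for k in query]
    return list(product(*sets))


# Hypothetical toy vocabulary; a real one would come from the XML data.
vocabulary = {"jiawei", "jian", "mining", "manning", "han", "pei"}
candidates = candidate_query_space(["jiawi", "minning"], vocabulary, threshold=2)
```

With this toy vocabulary the space contains four candidates, the same four listed on the naïve-algorithm slide.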

  8. Ranking Candidate Queries • How to model the probability of a candidate query given the observed query • By Bayes’ Theorem • Rank by the product of the Query Likelihood Model and the Error Model
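The formulas on this slide did not survive in the transcript. A standard Bayes decomposition consistent with the two labels would read as follows; the symbols Q (the observed query) and C (a candidate query) are assumed notation, not taken from the slides.

```latex
P(C \mid Q) \;=\; \frac{P(Q \mid C)\,P(C)}{P(Q)}
\;\propto\;
\underbrace{P(Q \mid C)}_{\text{Error Model}} \cdot
\underbrace{P(C)}_{\text{Query Likelihood Model}}
```

Since P(Q) is the same for every candidate, dropping it does not change the ranking.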

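  9. Error Model • Models typographical errors • The more similar a word is to the typed keyword, the more likely it is the intended word • Similarity is measured by edit distance • Independence assumption • [Figure: words around the keyword "minning": mining and manning at edit distance 1; binding, running, linking, and finding at edit distance 2]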
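A minimal sketch of such an error model follows. The slide only states that more similar words are more likely; the exponential decay, the parameter `beta`, and the reading of the independence assumption as a product over keywords are all assumptions made here.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def keyword_error_score(observed: str, intended: str, beta: float = 1.0) -> float:
    """Unnormalized score that decays with edit distance (assumed form)."""
    return math.exp(-beta * edit_distance(observed, intended))


def query_error_score(observed_query, candidate_query, beta: float = 1.0) -> float:
    """Product of per-keyword scores, treating keywords as independent."""
    score = 1.0
    for observed, intended in zip(observed_query, candidate_query):
        score *= keyword_error_score(observed, intended, beta)
    return score


typed = ["jiawi", "minning"]
close = query_error_score(typed, ["jiawei", "mining"])  # total distance 2
far = query_error_score(typed, ["jian", "mining"])      # total distance 3
```

The scores are unnormalized, which is enough for ranking candidates of the same query against each other.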

  10. Query Likelihood Model • Models the query generation probability • A good query finds good results • The data is viewed as a set of disjoint entities (sub-trees) • Measure the query likelihood on each entity • Aggregate over all entities, weighted by an Entity Prior (assumed uniform) • [Figure: DBLP tree whose paper and book sub-trees are labeled as entities r1, r2, r3]

  11. Language Modeling • Models the query likelihood on entities • Extract the text in the sub-tree • Build a Language Model • Smoothing is used to avoid zero probability • [Figure: entity r1, a DBLP paper with booktitle "Data mining and knowledge discovery", author "Jiawei Han", and title "Mining concept drifting data"]
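The per-entity language model and the aggregation over entities can be sketched as below. The slides say only that smoothing is used; the unigram model, the linear interpolation with collection statistics, the weight `lam`, and the toy entities are assumptions for illustration.

```python
from collections import Counter


def smoothed_lm(entity_tokens, collection_counts, collection_len, lam=0.8):
    """Unigram model of one entity, interpolated with the whole collection
    so that words missing from the entity keep a non-zero probability."""
    counts = Counter(entity_tokens)
    n = len(entity_tokens)

    def prob(word):
        p_entity = counts[word] / n if n else 0.0
        p_collection = collection_counts[word] / collection_len
        return lam * p_entity + (1 - lam) * p_collection

    return prob


def query_likelihood(candidate, entities, lam=0.8):
    """Sum over entities of P(candidate | entity) * uniform entity prior."""
    all_tokens = [t for tokens in entities for t in tokens]
    collection_counts = Counter(all_tokens)
    prior = 1.0 / len(entities)
    total = 0.0
    for tokens in entities:
        prob = smoothed_lm(tokens, collection_counts, len(all_tokens), lam)
        likelihood = 1.0
        for word in candidate:
            likelihood *= prob(word)
        total += likelihood * prior
    return total


# Hypothetical entities: the text extracted from three sub-trees.
entities = [["jiawei", "jian"], ["mining", "jiawei"], ["jian", "manning"]]
good = query_likelihood(("jiawei", "mining"), entities)
poor = query_likelihood(("jian", "mining"), entities)
```

In this toy data "jiawei" and "mining" share an entity while "jian" and "mining" never do, so the first candidate scores higher even though smoothing keeps the second above zero.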

  12. Finding the Entities • How to find the entities • Each entity is a potential search result • Different semantics can be applied: SLCA, ELCA, etc. • Specific Return Type: one for each query; a popular type, but not too deep • [Figure: DBLP tree with return type p = /DBLP/paper]

  13. Summary: Ranking Framework • Error Model • Query likelihood on each entity • Entity Prior
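The summary formula is missing from the transcript. Combining the three labeled parts gives a score of the following shape, where Q is the observed query, C a candidate query, and R the set of entities; this notation is assumed, and the exact form on the original slide may differ.

```latex
\mathrm{score}(C) \;\propto\;
\underbrace{P(Q \mid C)}_{\text{Error Model}} \cdot
\sum_{r \in R}
\underbrace{P(C \mid r)}_{\text{Query likelihood on entity } r} \cdot
\underbrace{P(r)}_{\text{Entity Prior}}
```

With a uniform entity prior, P(r) = 1/|R| for every entity and can be factored out of the sum.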

  14. Algorithm • Naïve Algorithm: enumerate all possible candidate queries; find the entities and compute the score for each candidate query • Problems: multiple passes over the data; not all candidates are needed • Candidates: 1. Jiawei mining 2. Jian mining 3. Jiawei Manning 4. Jian Manning • [Figure: DBLP tree with the author and title nodes containing Jiawei, Jian, mining, and Manning]
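The naïve algorithm can be sketched as follows to make both problems visible. The scoring is deliberately simplified (error probability times the fraction of entities containing every word), and the error probabilities and entities are hypothetical values chosen for illustration.

```python
from itertools import product


def naive_clean(confusion_sets, entities, k=2):
    """confusion_sets: one dict per keyword, mapping variant -> error probability.
    entities: list of word sets, one per entity sub-tree."""
    scored = []
    for combo in product(*(cs.items() for cs in confusion_sets)):
        words = tuple(word for word, _ in combo)
        error = 1.0
        for _, prob in combo:
            error *= prob
        # One full pass over the data for every candidate query.
        support = sum(1 for entity in entities
                      if all(word in entity for word in words))
        if support:  # candidates with empty results are discarded
            scored.append((error * support / len(entities), words))
    scored.sort(reverse=True)
    return scored[:k]


confusion_sets = [{"jiawei": 0.5, "jian": 0.1},
                  {"mining": 0.5, "manning": 0.5}]
entities = [{"jiawei", "jian"}, {"mining", "jiawei"}, {"jian", "manning"}]
top = naive_clean(confusion_sets, entities)
```

All four candidates are enumerated and each triggers a scan of the entities, yet two of them ("jian mining" and "jiawei manning") match no entity and are thrown away.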

  15. XClean Example • Query: jiawi minning • Labeled tree: DBLP (1) with sub-trees paper (1.1), paper (1.2), and book (1.3) • Node lists per variant: p1 jiawei: 1.1.1.1.1, 1.2.2.1 • p2 jian: 1.1.1.2.1, 1.3.1.1.1 • p3 mining: 1.2.1.1 • p4 manning: 1.3.2.1.1

  16. XClean Example • Query: jiawi minning • Same labeled tree and node lists as slide 15 • “Jiawei mining” is generated • “Jian mining” is skipped

  17. XClean Example • Query: jiawi minning • Same labeled tree and node lists as slide 15 • “jian manning” is generated

  18. Experiment Settings • Algorithms: XClean; PY08 (Pu and Yu [PVLDB08]); SE1 (Search Engine 1); SE2 (Search Engine 2) • Measures: Mean Reciprocal Rank; Precision@N; Time

  19. Experiment Settings • Datasets • Queries • Clean: original clean queries (INEX: 285; DBLP: 49) • Random: random edit operations on each keyword • Rule: replace each word with a common misspelling

  20. Experiment Results • Mean Reciprocal Rank (MRR)

  21. Experiment Results • Precision@N: percentage of queries for which the correct suggestion is in the top-N suggestions
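Both effectiveness measures can be computed as in the sketch below. The function names and the sample runs are hypothetical; Precision@N follows the definition on the slide, and MRR uses its standard definition (the mean of 1/rank of the correct suggestion, counting 0 when it is absent).

```python
def reciprocal_rank(suggestions, correct):
    """1/rank of the correct suggestion, or 0.0 if it was not suggested."""
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked suggestions, correct query) pairs."""
    return sum(reciprocal_rank(s, c) for s, c in runs) / len(runs)


def precision_at_n(runs, n):
    """Fraction of queries whose correct suggestion is in the top n."""
    hits = sum(1 for suggestions, correct in runs if correct in suggestions[:n])
    return hits / len(runs)


# Hypothetical runs, not results from the experiments.
runs = [
    (["jiawei mining", "jian manning"], "jiawei mining"),  # correct at rank 1
    (["jian manning", "jiawei mining"], "jiawei mining"),  # correct at rank 2
    (["jian manning"], "jiawei mining"),                   # correct is missing
]
mrr = mean_reciprocal_rank(runs)
p_at_1 = precision_at_n(runs, 1)
p_at_2 = precision_at_n(runs, 2)
```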

  22. Experiment Results • Time: query processing time

  23. Conclusion • Contributions: a probabilistic framework for keyword query cleaning on XML databases; an Error Model based on edit distance; a Query Likelihood Model that exploits XML tree structures and keyword search semantics • Future work: concatenation/splitting of words; cognitive errors

  24. Thank you! Questions?

  25. XClean Algorithm • Find the variants of each query keyword and compute their error probabilities • Retrieve the XML nodes containing each variant through an inverted index • The nodes of all variants of a keyword form a virtual list • Find the entity nodes that have at least one child node from each virtual list • Compute the score for each candidate query found in each entity • Accumulate the scores in a global hash table • Output the top-k candidate queries
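The steps above can be sketched in Python using the node lists from the example slides. This is a simplified illustration, not the authors' implementation: it groups nodes by a fixed label prefix instead of merging sorted lists, it scores a candidate by its error probability alone, and the error probabilities, `entity_depth`, and the function name are assumptions.

```python
from collections import defaultdict
from itertools import product
import heapq


def xclean_sketch(variants, inverted_index, entity_depth=2, k=2):
    """variants: one dict per query keyword, mapping variant -> error probability.
    inverted_index: word -> list of node labels (tuples of integers)."""
    # entity label -> keyword position -> variants occurring in that entity
    per_entity = defaultdict(lambda: defaultdict(set))
    for position, confusion in enumerate(variants):
        for word in confusion:  # lists of one keyword act as one virtual list
            for label in inverted_index.get(word, []):
                per_entity[label[:entity_depth]][position].add(word)

    scores = defaultdict(float)  # the global hash table
    for found in per_entity.values():
        if len(found) < len(variants):
            continue  # entity lacks a node from some virtual list
        lists = [sorted(found[p]) for p in range(len(variants))]
        for candidate in product(*lists):
            error = 1.0
            for position, word in enumerate(candidate):
                error *= variants[position][word]
            # Placeholder: a fuller score would also include the entity's
            # query likelihood and prior.
            scores[candidate] += error
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


inverted_index = {
    "jiawei": [(1, 1, 1, 1, 1), (1, 2, 2, 1)],
    "jian": [(1, 1, 1, 2, 1), (1, 3, 1, 1, 1)],
    "mining": [(1, 2, 1, 1)],
    "manning": [(1, 3, 2, 1, 1)],
}
variants = [{"jiawei": 0.6, "jian": 0.2}, {"mining": 0.5, "manning": 0.5}]
result = xclean_sketch(variants, inverted_index)
```

Only candidates that occur together inside some entity are ever created, so "jian mining" is never generated, which matches the behavior shown on the example slides.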
