170 likes | 299 Views
Processing XML Keyword Search by Constructing Effective Structured Queries. Jianxin Li , Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology, Australia. Outline. Motivation of Keyword Search in XML Brief Review of Related Work Existing Problems
E N D
Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology, Australia
Outline • Motivation of Keyword Search in XML • Brief Review of Related Work • Existing Problems • Construct Structured Query Templates • Ranking Function • Processing Algorithms • Conclusions
Motivation of XML Keyword Search • Keyword search is easy-to-use • Users don’t need to know the structure of XML data and specific query languages. • The XML data with different structures can be searched equivalently by a keyword query because it doesn’t specify the structures of the retrieved results.
Brief Review of Related Work • We focus on 4 references using label and term as keyword query format: • [YunyaoLi2004VLDB] Schema-Free XQuery. • [DanielaFlorescu2002ComputerNetworks] Integrating keyword search into XML query processing. • [SaraCohen2003VLDB] XSEarch: A semantic search engine for XML. • [WeidongYang2007CIT] Schema-aware keyword search over xml streams. • Other relevant work can be found in our paper.
Brief Review of Related Work • All the four work utilized label and term as keyword query format. • The difference: the first three work shared the similar basic strategy that first retrieves the relevant keyword lists and then merges them into the results; while the last one first generate a big template that covers all the kinds of results w.r.t. XML schema and then cache the possible results over xml streams. Template-based strategy can obtain better performance[WeidongYang2007CIT]!
Existing Problems • [WeidongYang2007CIT] was used to query over XML streams, which is not enough because of the challenges: • Different templates may exist in one XML data repository. • Users prefer to see part of the results, e.g., top k results. • Domain knowledge can be helped to process the labels with the same meaning. • Therefore, it is required to study the problem of applying template-based keyword search strategy to XML data repository.
Construct Structured Query Templates • Example: There are two data sources that conform to t1 and t2 respectively. Schema t1 Schema t2 Keyword query – (year:2006, title:xml, author:philip)
Construct Structured Query Templates • Identifying context of keywords • Determine master entities using labels in keyword query and XML schema. • Generate FOR clause for each entity. • Judge the occurrences of every label under each master entity. • Once a time – Generate WHERE clauses • More than once – First cluster and then generate WHERE clauses.
Step 1: determine master entity and its corresponding label set • Q1 = “For $b in bibliography/books/book” • Q2 = “For $a in bibliography/articles/article” • Step 2: only one occurrence of each label in each master entity. • Q1 += “Where $b/year=‘2006’ and $b/title.contains(xml) and $b/author.contains(philip)” • Q2 += “Where $a/year=‘2006’ and $a/title.contains(xml) and $a/author.contains(philip)” Schema t1 Keyword query – (year:2006, title:xml, author:philip)
Step 1: determine master entity and its corresponding label set • Q = “For $bi in bibliography/bib” • Step 2: only two occurrences of each label in the master entity. Cluster title and author using book and article respectively • Q1 += Q + “For $bo in $bi/book” • Q2 += Q + “For $a in $bi/article” • Step 3: only one occurrence of each label in each cluster. • Q1 += “Where $bi/year=‘2006’ and $bo/title.contains(xml) and $bo/author.contains(philip)” • Q2 … Schema t2 Keyword query – (year:2006, title:xml, author:philip)
Construct Structured Query Templates • Identifying returned nodes • Step1: If the cardinality of a master entity satisfies “*” and no cluster operation is activated, we take the master entity as a return node in constructed queries; • Step 2: If the cardinality of a master entity satisfies “*” and clusters are generated, we first check the root node of each cluster in a recursive procedure (back to step 1); • Step 3: If the cardinality of a master entity does not satisfy “*”, we will probe its ancestor nodes one by one until this kind of node exists or the root of the xml schema.
Schema t1 Schema t2 • Master entities are the returned nodes. • Q1 += “$b” • Q2 += “$a” • Roots of clusters are the returned nodes. • Q1 += “$bo” • Q2 += “$a” The constructed queries can be read in our paper! Keyword query – (year:2006, title:xml, author:philip)
Ranking Function • vm is the master entity nodes; • ω(vi, ti) is calculated by using tf*idf weight model. Feature of the function: The Score() consists of two parts ContextScore() and tf*idf weight, and the former is the upper bound of the score of the results.
Processing Strategy • Algorithm 1 is used to generate structured queries with their corresponding context score. • Algorithm 2 is used to schedule the query plan according to the conditions: • Users’ requirements, e.g., number of results; • Context scores of all generated queries; • And the intermediate results.
Experiments • Dataset: • Sigmod record • three variant of DBLP • Keyword Queries: • q1 (author:David, title:XML) • q2 (year:2002, title:XML)
Experimental Results q1 q2 q2(k = 20) q1(k = 10)
Conclusions • XBridge is proposed to process keyword query over XML data repository, which can efficiently find the top k results by evaluating generated structured queries. • A precise ranking function is provided to evaluate the relevance of the results. • Limitation of this work: • We take XML schema as tree patterns; • We didn’t consider reference relationships of XML data.