Adaptive XML Search Dr Wilfred Ng Department of Computer Science The Hong Kong University of Science and Technology
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Why Do We Need an XML Search Engine? • Different nature of HTML and XML data • HTML data • Hyperlink-intensive • Declarative language • Tags have no semantic meaning • XML data • Self-describing tags • Extra structural information • XML search engines retrieve more accurate fragments
Why Do We Need an XML Search Engine? • Web searching • Document paradigm • Matching keywords vs. documents • Returns links to whole documents (web pages) • XML searching • Query keywords may be tags or data values • The structure of XML documents is diverse, e.g. DBLP and Shakespeare • Does not return the whole document, which may be 100 MB or larger • Returns fragments
DBLP
<dblp>
  <incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
    <author>Jurgen Annevelink</author>
    <title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
    <pages>42-68</pages>
    <year>1995</year>
    <booktitle>Modern Database Systems</booktitle>
    <url>db/books/collections/kim95.html</url>
  </incollection>
  ...
Shakespeare
<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>
Research Ideas • In the Information Retrieval community, many ranking techniques have been developed • Weighted keywords • Vector space • Searching and ranking XML as plain text using IR techniques is possible, but • Too simple • Does not exploit the advantages of XML data • Better accuracy can be achieved by using features of XML data: • Structures • Tag semantics
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Key-Tag Query vs. XQuery • Keywords in a Web search engine vs. SQL • The goals of key-tag queries and XQuery are different • XQuery: for $x in doc("some.xml") where $x/author[. ftcontains "Mary"] return $x/title • Too complicated for ordinary users! Will users input such complex XQuery in search engines? • Key-Tag Query: <author>Mary</author> • Simple • Easy to understand • Flexible
Key-Tag Search Query • For example, <author>Mary</author>, where author is the tag and Mary is the key
Key-Tag Query Semantics • A fragment is considered a result candidate if at least one key-tag is found in it. • If F1 and F2 both contain the same instance of the key-tag and F1 is a subtree of F2, F1 is chosen as the only answer. • For example, given the query <b>b</b> and the fragment <b><c><b>b</b></c></b>, both F1: <b>b</b> and F2: <b><c><b>b</b></c></b> contain the same key-tag instance, so F1 will be the answer. • In contrast, F1: <a><b>b</b>---------(B1)</a> and F2: <a><c><b>b</b>----------(B2)</c></a> contain different instances (B1 and B2), so neither subsumes the other.
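The smallest-subtree rule above can be sketched in a few lines of Python. This is a minimal illustration (not the paper's implementation): each element matching the key-tag is itself the smallest fragment containing that instance, so enclosing fragments are never returned.

```python
import xml.etree.ElementTree as ET

def key_tag_answers(xml_text, tag, key):
    """Minimal answers to a key-tag query <tag>key</tag>.

    A fragment is a candidate if it contains the key-tag; among nested
    candidates holding the same instance, the smallest subtree wins,
    which is the matching element itself.
    """
    root = ET.fromstring(xml_text)
    return [ET.tostring(el, encoding="unicode")
            for el in root.iter(tag)
            if (el.text or "").strip() == key]

# The fragment from the slide: F2 = <b><c><b>b</b></c></b> encloses the
# same key-tag instance as F1 = <b>b</b>, so only F1 is returned.
print(key_tag_answers("<b><c><b>b</b></c></b>", "b", "b"))  # ['<b>b</b>']
```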
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Introduction to MRM • Handles diversified XML documents and user preferences
[Architecture diagram: the Multi-Ranker Model. User profiles and RSSF feed the Adaptive Ranking Level (AR1, AR2, …, ARn, with weights w11–w14). It sits on the Standard Ranking Level with the four XRs (STR, DAT, DFT, CUS), which is in turn built on the Feature Ranking Level, comprising similarity features (Keyword, Access, Path, Element, Order, Category) and granularity features (Sibling, Children, Distance+, Distance-, Tag, Attribute).]
Adaptive Ranking Level (AR) • AR maintains a feature vector Φ, which adapts to the four XRs; the vector is weighted and trained by RSSF • Φ = (STR, DAT, DFT, CUS, STR, DAT, DFT, CUS) • The adaptive ranking of fragments is calculated as W · Φ, where the weight vector W is generated by RSSF (introduced later).
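The adaptive score W · Φ is a plain dot product over XR scores. A minimal sketch, with hypothetical weights and scores chosen purely for illustration (the real W comes from RSSF training):

```python
def adaptive_rank(weights, xr_scores):
    """Adaptive ranking as the dot product W . Phi, where Phi holds the
    scores a fragment receives from the standard rankers."""
    assert len(weights) == len(xr_scores)
    return sum(w * s for w, s in zip(weights, xr_scores))

# Hypothetical RSSF-trained weights and XR scores (STR, DAT, DFT, CUS)
W = [0.4, 0.3, 0.2, 0.1]
phi = [0.8, 0.5, 0.9, 0.2]
score = adaptive_rank(W, phi)  # ~ 0.67
```

Fragments are then sorted by this score; retraining W adapts the ranking to the user without touching the XRs themselves.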
Standard Ranking Level (XR) • Four XRs • Structure ranker (STR): focuses on ranking XML fragments based on their structure • Data ranker (DAT): ignores the structure and ranks XML fragments by their textual data • System default ranker (DFT): a balance of the structure and data rankers • Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiments, the low-level features are randomly picked
Feature Ranking Level • Similarity Features: Keyword, Access, Path, Element, Order, Category • For example, Q = {<author>Mary</author>, <title>XML</title>} • Keyword similarity = Access similarity = 3/7; Path similarity = 3/4; Element similarity = 2/7 • Order: order in Q: author > title; sibling order in F: author > title, author > year, title > year, first > last; ancestor order similarity = 0; sibling order similarity = 1/4 • Category: predefine Academic category = {article, title, author} and Sport category = {team, player, match, year}; category vector for Q: <2/3, 0>; category vector for F: <1, 1/4>; category similarity = distance = sqrt((1/3)^2 + (1/4)^2) = 0.4167
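The category computation above can be sketched directly: each tag set is mapped to a vector of per-category coverage fractions, and the similarity is the Euclidean distance between the query's and the fragment's vectors. The tag sets below reproduce the slide's example; the function names are illustrative, not from the paper.

```python
import math

# Predefined categories from the slide
CATEGORIES = {
    "academic": {"article", "title", "author"},
    "sport": {"team", "player", "match", "year"},
}

def category_vector(tags):
    """Fraction of each category's member tags that appear in `tags`."""
    return [len(tags & members) / len(members) for members in CATEGORIES.values()]

def category_similarity(q_tags, f_tags):
    """Euclidean distance between the two category vectors."""
    vq, vf = category_vector(q_tags), category_vector(f_tags)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vq, vf)))

q = {"author", "title"}                     # category vector <2/3, 0>
f = {"article", "title", "author", "year"}  # category vector <1, 1/4>
sim = category_similarity(q, f)             # sqrt((1/3)^2 + (1/4)^2) = 0.4167
```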
Feature Ranking Level • Granularity Features: involve statistical data in the database • For example, Q = {<author>Mary</author>, <title>XML</title>} • Sibling: number of fragments whose roots are dblp • Children: number of tags whose parent is dblp • Distance+: the length of the path from the root to the farthest leaf, e.g. dblp/article/author/first: length = 4 • Distance-: the length of the path from the root to the nearest leaf, e.g. dblp/article/title: length = 3 • Tag: number of tags in F: 7 • Attribute: number of attributes in F: 0
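Several of these granularity features can be read straight off a fragment's tree. A small sketch over a made-up DBLP-like fragment (the fragment and feature names are illustrative; the real system would also consult database-wide statistics for Sibling and Children):

```python
import xml.etree.ElementTree as ET

def leaf_depths(el, depth=1):
    """Yield the root-to-leaf path length for every leaf under `el`."""
    kids = list(el)
    if not kids:
        yield depth
    for k in kids:
        yield from leaf_depths(k, depth + 1)

def granularity_features(fragment):
    root = ET.fromstring(fragment)
    depths = list(leaf_depths(root))
    return {
        "distance_plus": max(depths),          # root to farthest leaf
        "distance_minus": min(depths),         # root to nearest leaf
        "tags": sum(1 for _ in root.iter()),   # number of tags in F
        "attributes": sum(len(e.attrib) for e in root.iter()),
    }

# Hypothetical fragment with paths dblp/article/author/first (length 4)
# and dblp/article/title (length 3)
frag = "<dblp><article><author><first>Mary</first></author><title>XML</title></article></dblp>"
feats = granularity_features(frag)  # distance_plus = 4, distance_minus = 3
```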
Highlights of MRM • Highly flexible • Adding or removing new features or a new XR is straightforward • Only requires updating the feature vector • "Ranking Level Independence" • Analogous to data independence in the relational model
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Features of RSSF • Input: a set of labeled fragments • Output: a trained ranker • Naïve Bayes is a successful algorithm for learning to classify text documents • It requires only a small amount of training data, but needs both positive and negative samples • In our setting, we only have labeled and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples
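One round of the spying idea can be sketched as follows: plant some labeled (positive) documents as "spies" in the unlabeled pool, train a Naïve Bayes classifier treating that pool as negative, and take as estimated negatives the unlabeled documents scoring below the lowest-scoring spy. This is a toy bag-of-words sketch with a minimal multinomial Naïve Bayes, not the paper's implementation; all names and parameters (e.g. `spy_ratio`) are assumptions.

```python
import math
import random
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Tiny multinomial Naive Bayes with add-one smoothing."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    def model(docs):
        cnt = Counter(w for d in docs for w in d)
        total = sum(cnt.values())
        return {w: (cnt[w] + 1) / (total + len(vocab)) for w in vocab}
    return model(pos_docs), model(neg_docs), len(pos_docs), len(neg_docs)

def posterior_pos(doc, nb):
    """P(positive | doc) under the trained model."""
    p_w, n_w, n_pos, n_neg = nb
    lp = math.log(n_pos / (n_pos + n_neg)) + sum(math.log(p_w[w]) for w in doc if w in p_w)
    ln = math.log(n_neg / (n_pos + n_neg)) + sum(math.log(n_w[w]) for w in doc if w in n_w)
    m = max(lp, ln)
    return math.exp(lp - m) / (math.exp(lp - m) + math.exp(ln - m))

def spy_negatives(positives, unlabeled, spy_ratio=0.3, seed=0):
    """One spy round: unlabeled docs scoring below every spy are
    returned as the estimated negative set."""
    rng = random.Random(seed)
    spies = rng.sample(positives, max(1, int(len(positives) * spy_ratio)))
    rest = [d for d in positives if d not in spies]
    nb = train_nb(rest, unlabeled + spies)
    threshold = min(posterior_pos(s, nb) for s in spies)
    return [d for d in unlabeled if posterior_pos(d, nb) < threshold]

positives = [["xml", "query"], ["xml", "rank"], ["xml", "search"], ["xml", "fragment"]]
unlabeled = [["xml", "index"], ["football", "goal"], ["football", "match"]]
est_neg = spy_negatives(positives, unlabeled)  # the football docs fall below the spy threshold
```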
Ranking SVM Techniques • Find a weight vector that makes the inequalities hold: F1 < F2 < F3
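The standard trick behind Ranking SVM is to turn ordering constraints into classification examples: whenever fragment i should precede fragment j, the difference of their feature vectors becomes a positive example. A linear SVM trained on these pairs yields a weight vector w with w · x_i > w · x_j for every required ordering. A sketch of the pairwise transform (the feature values are hypothetical):

```python
def pairwise_examples(ranking):
    """Turn a preference order (best first) over feature vectors into
    classification examples: (x_i - x_j, +1) whenever i precedes j,
    plus the mirrored (x_j - x_i, -1) example."""
    out = []
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            diff = [a - b for a, b in zip(ranking[i], ranking[j])]
            out.append((diff, +1))
            out.append(([-d for d in diff], -1))
    return out

# Hypothetical feature vectors for a preference F3 > F2 > F1
pairs = pairwise_examples([[3.0, 1.0], [2.0, 0.5], [1.0, 0.0]])  # 6 examples
```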
[Diagram: Voting Spy Naïve Bayes. Positive examples P1, P2, P3 act as spies in turn; each round trains a Naïve Bayes classifier over the Positive and Unclassified sets and produces an Estimated Negative set. Training is completed once every spy has produced its set.]
[Diagram: voting outcome. The spies P1, P2, P3 each vote for the fragments they estimate to be negative (F11, F12, F13, F14 receive votes); F11 collects the most votes, so the final Estimated Negative is F11.]
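The voting step depicted above amounts to counting, per fragment, how many spy rounds flagged it. A minimal sketch, with made-up fragment identifiers echoing the diagram:

```python
from collections import Counter

def vote_negatives(per_spy_negatives, threshold):
    """A fragment becomes a final estimated negative only if at least
    `threshold` spy rounds voted for it."""
    votes = Counter(f for negs in per_spy_negatives for f in negs)
    return {f for f, v in votes.items() if v >= threshold}

# Estimated-negative sets from three spies P1, P2, P3
runs = [{"F11", "F12"}, {"F11", "F13"}, {"F11", "F12", "F14"}]
final = vote_negatives(runs, threshold=3)  # only F11 has all three votes
```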
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Effect of Varying Voting Threshold • X-axis: voting threshold • Y-axis: relative average rank of labeled fragments, i.e. new average rank / original average rank
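The Y-axis metric above is easy to pin down in code: average the 1-based ranks of the labeled fragments before and after retraining and take the ratio, so values below 1 mean the trained ranker moved them up. A sketch with hypothetical fragment identifiers:

```python
def relative_average_rank(labeled, original_ranking, new_ranking):
    """new average rank / original average rank of the labeled fragments."""
    def avg_rank(ranking):
        return sum(ranking.index(f) + 1 for f in labeled) / len(labeled)
    return avg_rank(new_ranking) / avg_rank(original_ranking)

orig = ["F3", "F1", "F4", "F2"]  # labeled F1, F2 sit at ranks 2 and 4
new = ["F1", "F2", "F3", "F4"]   # moved up to ranks 1 and 2
r = relative_average_rank(["F1", "F2"], orig, new)  # 1.5 / 3.0 = 0.5
```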
Effectiveness of Low-Level Features on XR • In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision • The queries we use can be found in the appendix of the proposal
Comparison with TopX • TopX is a search engine for XML data, available online • State-of-the-art XML search engine • We measure MAP and precision@k • MAP: mean average precision — the average precision over 100 recall points for each query, averaged over all queries • precision@k: the number of relevant results in the top k, divided by k
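The two metrics can be made concrete with a short sketch. For simplicity this uses the common non-interpolated form of average precision (mean of precision@k at each rank where a relevant fragment appears) rather than the slide's 100-recall-point interpolation; the example ranking is made up.

```python
def precision_at_k(ranked, relevant, k):
    """precision@k = relevant results among the top k, divided by k."""
    return sum(1 for f in ranked[:k] if f in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k holding a relevant result.
    MAP is this value averaged over all queries."""
    hits, total = 0, 0.0
    for k, f in enumerate(ranked, start=1):
        if f in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

ranked = ["F1", "F2", "F3", "F4"]
relevant = {"F1", "F3"}
p2 = precision_at_k(ranked, relevant, 2)  # 1/2
ap = average_precision(ranked, relevant)  # (1/1 + 2/3) / 2 ~ 0.8333
```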
Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking Support Vector Machine in Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work
Further Remarks • Searching and ranking XML data are important, since existing Web search engines cannot handle them well • We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques with features specific to XML data
Ongoing Work – INEX 2007 • The Initiative for the Evaluation of XML Retrieval (INEX) • A community which aims to provide large test collections and scoring methods for researchers to evaluate their retrieval systems • It has been getting attention recently • We participated in INEX in 2006 and 2007 • The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents • We are running experiments using their data and queries
Ongoing Work – Merging • Displaying a list of fragments one by one to the user may not be adequate in the XML setting • Fragments may be scattered across the list • Duplicate fragments may appear in different structures • Refining a search query can yield more and better results • Idea: make use of the schema information (DTD), consider the fragments as entities, and merge them in a concise way
My Publications
• Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear: VLDB Journal, (2007).
• Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).
• James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).
• Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).
• Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).
• Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering, CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).
• Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).
• Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).