The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
Anja Theobald and Gerhard Weikum
University of the Saarland, Saarbrücken, Germany
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de
Conclusion
Problem:
• diversity of Web / Intranet data
• despite XML, a global schema is a myth
• users are swamped with results or are looking for needles in haystacks
Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with the XXL search engine prototype
Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments
XML Data Graph
Semistructured data: elements, attributes, and links organized as a labeled graph

<Uni> ETH Zürich
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung> ...
      </Lehre> ...
    </FR> ...
  </Fak> ...
</Uni>

<Uni> Uni Stuttgart
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung> ...
      </Lehre> ...
    </FR> ...
  </Fak> ...
</Uni>

<Uni> Uni Saarland
  <School> Math & Engineering
    <Dept> CS
      <Teaching> ...
        <GradStudies>
          <Course> Performance analysis
            <Lecturer> ... </>
            <Content> Queueing models .. </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Course>
          <Course> Speech processing
            <Content> ... Markov chains... </>
          </Course> ...
      </Teaching> ..
    </Dept> ..
  </School> ...
</Uni>

[Figure: XML data graph. Node labels include Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Saarland; School; Dept: ... CS ...; Teaching; GradStudies; Course: Speech processing; Course: Performance analysis; Content: ... Markov chains ...; Content: ... Queueing models; Lit; Dozent; URL=...; Inhalt; Uni: Uni Stuttgart; School: CS; Course: Mobile Comm.; Prerequisites: ... Markov processes ...; Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining]
XML Querying
Regular expressions over path labels + logical conditions over element contents

[Figure: XML data graph for www.allunis.de/unis.xml. Node labels include Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Stuttgart; School: CS; Course: Mobile comm.; Prerequisites: ... Markov processes; Uni: Uni Saarland; School; Dept: ... CS ...; Teaching; GradStudies; Course: Speech processing; Course: Performance analysis; Content: ... Markov chains ...; Content: ... Queueing models; Lit; Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining; Outline: ... statistical methods for classification ...]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"
XML Querying

[Figure: XML data graph for www.allunis.de/unis.xml, annotated with the query's conditions and the result variables U, C]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"

Query conditions shown individually:
• Uni As U
• U.#.School?.#.(Inst | Dept)+ As D
• D Like "%CS%"
• D.#.Course As C
• C.# Like "%Markov chain%"
Boolean vs. Ranked Retrieval
• There is no global schema for Intranets or the Web
• Relevance ranking of results is absolutely crucial!
Ranked Retrieval with XXL

[Figure: XML data graph for www.allunis.de/unis.xml (Book, Uni Stuttgart, Uni Saarland, Uni Augsburg)]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "CS"
And D.#.~Course As C
And C.# ~~ "Markov chain"
Ranked Retrieval with XXL
Result ranking of XML data based on semantic similarity

[Figure: XML data graph for www.allunis.de/unis.xml (Book, Uni Stuttgart, Uni Saarland, Uni Augsburg; additional node labels Dozent, URL=..., Inhalt)]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "Computer Science"
And D.#.~Course As C
And C.# ~~ "Markov chain"
Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments
XXL: Flexible XML Search Language
Extensible, simple core language:
• Where clause: conjunction of regular path expressions with binding of variables
• Elementary conditions on element/attribute names and contents

Select F, D, S
From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D
And F.#.Student As S
And D.Name = S.Name
And D.Area Like "%XML%"

Semantic similarity conditions on names and contents:
... F.#.~Lecturer As D And D.~Area ~~ "XML"
Based on:
• tf*idf similarity of contents
• ontological similarity of names
• probabilistic combination of conditions
XXL Result Ranking
Query:
Where Uni.#.School?.#.(Inst|Dept)+ As D
And D.#.~Lecturer As D
And D.~Area ~~ "XML"

[Figure: data graph and result graph. Node labels: Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Project: Digital libraries; Course: IR; Seminar: XML. Scores annotated in the result graph: 1.0, 1.0, 0.9, 0.8, 0.6]

Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6
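The arithmetic on this slide can be reproduced in a few lines. The sketch below assumes that the local scores along one result path are simply multiplied, which is how the slide's formula reads; the function name `path_relevance` is hypothetical and not part of XXL.

```python
from math import prod


def path_relevance(local_scores):
    """Combine the local similarity scores along one result path.

    Assumption: conditions are treated as independent, so the scores
    multiply (one reading of "probabilistic combination").
    """
    for score in local_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return prod(local_scores)


# Local scores as annotated in the result graph on the slide.
scores = [1.0, 1.0, 0.9, 0.8, 0.6]
print(round(path_relevance(scores), 3))  # prints 0.432
```

Because every factor is at most 1.0, a longer chain of approximate matches can only lower the score, which is what pushes exact matches to the top of the ranking.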
XXL Search Engine
[Figure: architecture. Components: WWW, XXL servlets, XXL applet, Path indexer, Content indexer, Ontology, Query processor]
• Query decomposition into index-supported subexpressions
• wide range of optimizations

Select ...
Where Uni.#.(Inst|Dept) As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Subexpressions:
• Uni.#.(Inst|Dept) As F
• F ~~ "Computer Science"
• F.#.~Course.# ~~ "Markov Chains"
• F.#.~Seminar.# ~~ "Markov Chains"
Index Structures
Element Path Index: materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
  Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>} }
  School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... } }
Element Content Index: precomputes all term occurrences in element contents, with frequency statistics
  Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
  XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
Element Ontology Index: contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
  Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>} {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
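To make the first two index structures concrete, here is a minimal in-memory sketch. It is not the XXL implementation: the dictionary layout, the function names, the `Course` entry, all tf values, and the collection size are assumptions chosen for illustration; only the element ids and names echo the slide.

```python
import math
from collections import deque

# Toy element path index: element id -> {child element name -> child ids}.
# Ids follow the slide; the Course entry under id31 is hypothetical.
CHILDREN = {
    "id1": {"School": ["id13", "id14"], "Prof": ["id111", "id117", "id119"]},
    "id13": {"Dean": ["id27"], "Dept": ["id31", "id32", "id33"]},
    "id31": {"Course": ["id46"]},
}


def is_connected(ancestor, descendant):
    """Check transitive connectivity by breadth-first search over
    the materialized (parent, child) pairs."""
    queue, seen = deque([ancestor]), {ancestor}
    while queue:
        node = queue.popleft()
        for child_ids in CHILDREN.get(node, {}).values():
            for child in child_ids:
                if child == descendant:
                    return True
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return False


# Toy element content index: term -> {element id -> term frequency}.
# The tf values and N_ELEMENTS are made up; the slide elides them.
CONTENT = {"xml": {"id46": 3, "id49": 1, "id53": 2}}
N_ELEMENTS = 100


def tf_idf(term, element_id):
    """Plain tf*idf weight of a term in one element (0.0 if absent)."""
    postings = CONTENT.get(term.lower(), {})
    tf = postings.get(element_id, 0)
    if tf == 0:
        return 0.0
    idf = math.log(N_ELEMENTS / len(postings))
    return tf * idf


print(is_connected("id1", "id46"))   # True: id1 -> id13 -> id31 -> id46
print(tf_idf("XML", "id46") > tf_idf("XML", "id49"))  # True
```

Storing only direct (parent, child) pairs keeps the index small; the price is the graph traversal at query time, which is the "dynamic" connectivity check named on the slide.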
Query Decomposition & Evaluation
• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph

Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Example of subquery NFSA: Uni.#.(Inst|Dept)+
[Figure: NFSA with transitions labeled Uni, %, Inst, Dept]
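The NFSA step can be sketched as a small state-set simulation. The automaton below is hand-built for the subquery Uni.#.(Inst|Dept)+ and is only one plausible reading of the figure: it assumes that `#` matches a label path of any length, including the empty one, and that the `%` transition in the figure stands for "any label". State numbers and the function name `accepts` are hypothetical.

```python
ANY = "%"  # wildcard label, assumed to mean "any element name"

# state -> list of (label, next_state); built by hand for Uni.#.(Inst|Dept)+
TRANSITIONS = {
    0: [("Uni", 1)],
    1: [(ANY, 1), ("Inst", 2), ("Dept", 2)],  # '#' loop, then first Inst/Dept
    2: [("Inst", 2), ("Dept", 2)],            # further repetitions of '+'
}
ACCEPTING = {2}


def accepts(labels):
    """Simulate the NFSA on a root-to-node label path by tracking
    the set of states that are reachable after each label."""
    states = {0}
    for label in labels:
        states = {
            nxt
            for state in states
            for expected, nxt in TRANSITIONS.get(state, [])
            if expected == ANY or expected == label
        }
        if not states:
            return False
    return bool(states & ACCEPTING)


print(accepts(["Uni", "School", "Dept"]))  # True
print(accepts(["Uni", "School"]))          # False: path must end in Inst or Dept
```

A top-down strategy would feed label paths starting at Uni elements into such an automaton; a bottom-up strategy would start from Inst or Dept elements and run the reversed automaton. Both directions are left out of this sketch.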
The Role of Ontologies
Observation: WWW / Intranet information becomes better searchable when it is more explicitly structured and canonically annotated

<Uni> Univ. Saarland
  <School> Engineering
    <Dept> Computer Science
      <Faculty> Prof. Dr. GW
        <Project> Semistructured Data ... XML</> ...

• $\forall c\,(\mathrm{Course}(c) \Rightarrow \exists s\,((\mathrm{Dept}(s) \lor \mathrm{Inst}(s)) \land \mathrm{Curriculum}(c,s)))$

"Poor man's ontology":
• graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
• quantitative reasoning ("semantic similarity" measures)

[Figure: concept graph with nodes University, Journal, Dept, Conference, Institute, Prof, Publication, Course, Research, Teaching, Seminar, Project]
Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments
Example Query
SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")
Example Ontology
thane -- (a feudal lord or baron in Scotland)
  => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
      => leader -- (a person who rules or guides or inspires others)
Example Ontology
woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)
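The two excerpts explain why a speech by a witch that mentions a thane can match the conditions ~ "Woman" and ~ "leader". The sketch below walks hypernym links transcribed from the excerpts and turns the number of steps into a score. The scoring rule (a fixed decay per step) and the value of `DECAY` are assumptions for illustration only; the slides do not state the similarity measure XXL actually uses.

```python
# Hypernym links ("is a kind of"), transcribed from the two excerpts.
HYPERNYM = {
    "thane": "lord",
    "lord": "male aristocrat",
    "male aristocrat": "leader",
    "witch": "woman",
    "wife": "woman",
    "lady": "woman",
}

DECAY = 0.8  # hypothetical penalty per hypernym step, not from the slides


def steps_up(term, target):
    """Number of hypernym steps from term up to target, or None if
    target is not reachable from term."""
    steps, current, seen = 0, term, set()
    while current != target:
        if current in seen or current not in HYPERNYM:
            return None
        seen.add(current)
        current = HYPERNYM[current]
        steps += 1
    return steps


def similarity(term, target):
    """Score 1.0 for an exact match, lower for each step up, else 0.0."""
    steps = steps_up(term, target)
    return 0.0 if steps is None else DECAY ** steps


print(steps_up("thane", "leader"))   # 3
print(steps_up("witch", "woman"))    # 1
print(similarity("woman", "witch"))  # 0.0: only the upward direction is checked
```

A real ontology index would also cover synonyms and the downward (hyponym) direction, and would precompute the distances rather than walk the chain per query.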
Example Results
Relevance = 0.0070400005
<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>
XXL Runtime Measurements
Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)

Q1: Select * From Index
    Where #.publication AS A
    And A.~headline ~~ "XML"
    And A.author% AS B

Q2: Select * From Index
    Where #.play AS A
    And A.#.personae AS B
    And B.~figure ~~ "King"
    And B.title AS C

Query   #results   top-down   bottom-up   w/ optimization
Q1      131        14.3 sec   694 sec     2.68 sec (incl. 0.37 sec), plan: 2bu 1bu 3td
Q2      58         8.5 sec    3.7 sec     4.64 sec (incl. 0.33 sec), plan: 1bu 2td 3td 4td

Plan notation: subquery number (1 2 3 4, in order of appearance in the Where clause) followed by bu (bottom-up) or td (top-down).
Conclusion
Research avenue: explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
Goal: for every search, it should be possible to find, in one day of computer time and with < 1 min of intellectual effort, the results that the best human experts can find with infinite time
• pursued in the CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)