TOSS: An Extension of TAX with Ontologies and Similarity Queries
Edward Hung, Yu Deng, V.S. Subrahmanian
Presentation by: Valentina Bonsi, Roberto Gamboni, Giuseppe Vitalone
Speaker: Roberto Gamboni
Outline
• Abstract
• TAX overview
• Quality problems
• TOSS architecture
• TOSS algebra
• Experiments
• Conclusions & Related works
Abstract
• Tree Algebra for XML (TAX)
  - an algebra developed for XML databases
  - 100% precision but low recall
  - semantics not considered
• TAX with Ontologies and Similarity Queries (TOSS)
  - ontology
  - similarity enhancement
  - improves recall
Much higher quality!
Tree Algebra for XML
• Semistructured instance: I = (V, E, t)
  - G = (V, E) is a set of rooted directed trees, where V is a set of nodes and E ⊆ V × V is a set of edges.
  - t assigns to each object o ∈ V a type for its tag and a type for its content, e.g. o.tag = string and o.content = int.
• Pattern tree: P = (T, F)
  - T = (V, E) is a tree whose objects are labeled with distinct integers and whose edges are labeled 'pc' (parent-child) or 'ad' (ancestor-descendant).
  - F is a selection condition applicable to the objects in T.
TAX selection example
• Pattern tree: node #1 with 'pc' edges to nodes #2 and #3
• Selection condition: #1.tag = car & #2.tag = price & #3.tag = carModel & #2.content < 15000
[Figure: DB1 holds three car records with the fields carModel (Toyota/Yaris, Vw/Golf, Vw/Polo), price (20000, 10000, 14000), year (2004, 2005, 2002), km (10000, 40000, 30000), carDealer (Pico, RBV, RBV S.p.A.) and fuelCons (12, 13, 10); values are listed per field. Witness trees: two car trees, for Toyota/Yaris and Vw/Polo, each keeping only carModel and price (prices 14000 and 10000).]
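The selection on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TAX implementation: the `Node` class, the `walk` and `select` helpers and the root tag `cars` are hypothetical names, the pairing of prices with car models is assumed for the example, and only a pattern with two 'pc' (parent-child) edges is handled.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def walk(node):
    """Yield every node of the tree rooted at `node`."""
    yield node
    for child in node.children:
        yield from walk(child)


def select(root, condition):
    """Pattern tree: #1 with 'pc' edges to #2 and #3; `condition` plays the role of F.

    Returns one witness tree per binding that satisfies the condition.
    """
    witnesses = []
    for n1 in walk(root):
        for n2, n3 in product(n1.children, repeat=2):
            if n2 is not n3 and condition(n1, n2, n3):
                # A witness tree keeps only the nodes bound by the pattern.
                witnesses.append(Node(n1.tag, n1.content, [n3, n2]))
    return witnesses


def car(model, price):
    return Node("car", children=[Node("carModel", model), Node("price", price)])


# Assumed pairing of models and prices, for illustration only.
db1 = Node("cars", children=[
    car("Toyota/Yaris", 14000),
    car("Vw/Golf", 20000),
    car("Vw/Polo", 10000),
])

witnesses = select(
    db1,
    lambda n1, n2, n3: n1.tag == "car" and n2.tag == "price"
    and n3.tag == "carModel" and n2.content < 15000,
)
for w in witnesses:
    print(w.children[0].content, w.children[1].content)
```

Under these assumptions the Golf is filtered out and two witness trees remain, each reduced to the car, carModel and price nodes.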
TAX similarity problems
• Pattern tree: node #1 with 'pc' edges to nodes #2 and #3
• Selection condition: #1.tag = book & #2.tag = title & #3.tag = author & #3.content = "W. Stallings"
[Figure: database biblio with two book records. Book 1: title [Operating Systems], author [W. Stallings], publisher [MacMillan], year [1992], price [45,50], ISBN [002945671]. Book 2: title [Cryptography], author [William Stallings], publisher [Prentice Hall], year [2003], price [42,50], ISBN [003456783].]
Low recall!!!
• "W. Stallings" and "William Stallings" are probably the same person, but TAX does not use any notion of similarity between terms.
• Solution: extend TAX with a similarity measure ds:
  - ds(W. Stallings, William Stallings) = 0.1 (very similar)
  - ds(W. Stallings, Shakespeare) = 5 (much less similar)
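The recall problem and the proposed fix can be shown side by side. The sketch below is only an illustration: the two ds values are the ones quoted on the slide, stored in a lookup table, while the helper names (`ds`, `exact_match`, `similar_match`), the threshold of 2 and the treatment of unknown pairs as infinitely distant are assumptions made for the example.

```python
# Distances quoted on the slide; any pair not listed is treated as
# infinitely distant (an assumption for this sketch).
DS_TABLE = {
    frozenset({"W. Stallings", "William Stallings"}): 0.1,
    frozenset({"W. Stallings", "Shakespeare"}): 5.0,
}


def ds(a, b):
    """Look up the distance between two strings (0 for identical strings)."""
    if a == b:
        return 0.0
    return DS_TABLE.get(frozenset({a, b}), float("inf"))


books = [
    {"title": "Operating Systems", "author": "W. Stallings"},
    {"title": "Cryptography", "author": "William Stallings"},
]


def exact_match(books, author):
    """TAX-style selection: the content must be equal to the query string."""
    return [b["title"] for b in books if b["author"] == author]


def similar_match(books, author, threshold):
    """Similarity-based selection: the content must be within the threshold."""
    return [b["title"] for b in books if ds(b["author"], author) < threshold]


print(exact_match(books, "W. Stallings"))
print(similar_match(books, "W. Stallings", threshold=2))
```

Exact matching finds only the first book, while the thresholded comparison also finds the book whose author is spelled "William Stallings".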
TAX multi-DB example
[Figure: two XML databases, with root tags cars and automobiles. DB1 holds car records with the fields carModel (Toyota/Yaris, Vw/Golf, Vw/Polo), price (20000, 10000, 14000), year (2004, 2005, 2002), km (10000, 40000, 30000), carDealer (RBV, Pico, RBV) and fuelCons (13, 10, 12). DB2 holds a vendor (dealerName [RVB], location [Bologna], feedback [5]) and car records with the fields make (AstonMartin, Ferrari, Volkswagen), model (360, Vanquish, Fox), year (2004, 2002, 2005), miles (30000, 10000, 15000), cost (70000, 5000, 80000) and fuelCons (6, 6, 15). Values are listed per field.]
TAX problems with multi-DB
• Different tags can refer to the same thing.
• The same content can be stored differently.
• Tags such as km and miles, or price and cost, may hold values expressed in different units (e.g. EUR or USD).
Ontology
• Inter-term lexical relationships
• Query: "Return all authors of papers written by someone in a Web Search Company"
• Pattern tree: node #1 with 'pc' edges to nodes #2 and #3
• Selection condition: #1.tag = author & #2.tag = lastName & #3.tag = company & #3.content = "Web Search Company"
[Figure: database authors with author records holding firstName (Marco, Samuele), lastName (Pivi, Salti) and company (Google, Eclipse Found.). Ontology: Google isa Web search Company isa Computer Company isa Company.]
Google's authors are never returned!
TOSS: Architecture's bird's-eye view
Goal: extend and enhance TAX to return high-quality answers using ontologies and similarity measures.
[Figure: architecture diagram with the labels XML files, Xindice system, WordNet, User-specified rules, Ontology Maker, Fusion of Ontologies, similarity measure, threshold, Similarity Enhancer, SEO, User queries, Query Executor, results.]
Ontology maker
[Figure: XML DB animals with three records: elephant (name [Fuffi], race [African], age [50]), dog (name [Fido], race [Collie], age [4]) and black widow (name [Pito], race [Mactans], age [7]). Derived ontology: elephant isa proboscidean isa mammal; dog isa canine isa carnivore isa mammal; black widow isa spider isa arachnid.]
Ontology Integration
[Figure: the tag hierarchies of the two car databases side by side. Tags shown: cars, automobiles, vendor, dealerName, location, feedback, car, carDealer, carModel, make, model, year, km, miles, price, cost, fuelCons.]
Interoperation Constraints (specified by user)
Fusion of Ontologies
[Figure: fused hierarchy with the tags cars, automobiles, vendor, dealerName, location, feedback, car, carModel, year, km, miles, price, cost, fuelCons.]
• make:2 and model:2 are both mapped into carModel.
• km, miles, price and cost are not grouped! As different units might be used in instances, the administrator has to define a conversion function to compare these values.
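A conversion function of the kind mentioned above could look like the following sketch. It is an assumption-laden illustration: the names `to_km` and `distance_km` and the record layout are hypothetical, the km and miles values are taken from the multi-DB example without knowing which car they belong to, and only the distance tags are handled (a price/cost conversion would need an exchange rate supplied by the administrator).

```python
KM_PER_MILE = 1.609344  # exact by definition of the international mile


def to_km(tag, value):
    """Hypothetical administrator-defined conversion to a common unit (km)."""
    if tag == "km":
        return float(value)
    if tag == "miles":
        return value * KM_PER_MILE
    raise ValueError(f"no conversion defined for tag {tag!r}")


def distance_km(car):
    """Return the distance driven in km, whichever tag the record uses."""
    for tag in ("km", "miles"):
        if tag in car:
            return to_km(tag, car[tag])
    raise KeyError("record has neither a km nor a miles field")


# Distance values from the two example databases, one field per record.
db1 = [{"km": 10000}, {"km": 40000}, {"km": 30000}]
db2 = [{"miles": 30000}, {"miles": 10000}, {"miles": 15000}]

# Cars driven less than 35000 km, across both databases.
under_35000 = [car for car in db1 + db2 if distance_km(car) < 35000]
print(under_35000)
```

Comparing the raw numbers would wrongly accept the car with 30000 miles, which corresponds to roughly 48280 km.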
Similarity Enhancer
Threshold = 2
• d(LAX, LB) = 1.5
• d(London City, London Heathrow) = 1
• d(London City, London Gatwick) = 1.3
• d(London Gatwick, London Heathrow) = 1.6
• d(London City, Roma Fiumicino) = 3.5
• d(Roma Fiumicino, LAX) = 9
[Figure: an airports ontology with the airports LAX – CA (Los Angeles), LB – CA (Long Beach), London City Airport, London BAA Heathrow, London Gatwick and Roma Fiumicino, and the airlines United Airlines, American Airlines, Delta Airlines, British Airways and Alitalia.]
Properties of the similarity enhancement:
1. It preserves the original partial order.
2. All nodes mapped into the same node are similar to each other.
3. Two strings are similar iff they are mapped into the same node.
4. There are no redundant nodes (no node is a subset of another).
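One way to read properties 2 to 4 is that the enhanced nodes are the maximal groups of pairwise similar strings, i.e. the maximal cliques of the graph that links two strings whenever their distance is below the threshold. The sketch below follows that reading using the basic Bron-Kerbosch enumeration; it is an interpretation, not the algorithm used by TOSS. Pairs whose distance is not given on the slide are assumed to be dissimilar, and the function names are hypothetical.

```python
from itertools import combinations

THRESHOLD = 2

# Distances quoted on the slide; unlisted pairs are assumed dissimilar.
DISTANCES = {
    frozenset({"LAX", "LB"}): 1.5,
    frozenset({"London City", "London Heathrow"}): 1.0,
    frozenset({"London City", "London Gatwick"}): 1.3,
    frozenset({"London Gatwick", "London Heathrow"}): 1.6,
    frozenset({"London City", "Roma Fiumicino"}): 3.5,
    frozenset({"Roma Fiumicino", "LAX"}): 9.0,
}
STRINGS = sorted(set().union(*DISTANCES))


def similarity_graph(strings, distances, threshold):
    """Adjacency sets linking strings whose distance is below the threshold."""
    adj = {s: set() for s in strings}
    for a, b in combinations(strings, 2):
        if distances.get(frozenset({a, b}), float("inf")) < threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


seo_nodes = maximal_cliques(similarity_graph(STRINGS, DISTANCES, THRESHOLD))
for node in seo_nodes:
    print(sorted(node))
```

With threshold 2 the three London airports fall into one node, LAX and LB into another, and Roma Fiumicino stays alone; because only maximal cliques are kept, no node is a subset of another.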
Query Executor
• Transforms a user query into a query that takes the similarity enhanced (and fused) ontology into account.
• Implements an ontology-extended algebra that improves on the TAX algebra.
• In the TOSS algebra, a simple selection condition is X op Y, where op ∈ {=, ≠, <, ≤, >, ≥, ~, instance_of, is_a, subtype_of, above, below} and X, Y are terms (attributes, types, etc.).
TOSS Algebra
• A selection condition is a simple selection condition, or a conjunction, disjunction or negation of selection conditions.
• C = X ~ Y is true iff some node in the SEO contains both X and Y;
• C = X instance_of Y is true iff the type of X is a subtype of Y and the value of X ∈ dom(Y);
• C = X subtype_of Y is true iff type(X) ≤ type(Y);
• C = X below Y is true iff X instance_of Y or X subtype_of Y;
• C = X above Y is true iff Y below X.
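The operator definitions above translate almost literally into code. The following sketch is a hedged illustration rather than the TOSS implementation: the isa hierarchy is the animal ontology from the Ontology maker slide, while `SEO_NODES`, the `DOM` table, the `Instance` type and all function names are hypothetical choices made for the example.

```python
from collections import namedtuple

# isa hierarchy taken from the derived animal ontology (child -> parent).
PARENT = {
    "elephant": "proboscidean", "proboscidean": "mammal",
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "black widow": "spider", "spider": "arachnid",
}

# Hypothetical SEO nodes: each node groups strings considered similar.
SEO_NODES = [frozenset({"W. Stallings", "William Stallings"}),
             frozenset({"Shakespeare"})]

# Hypothetical domains: the values each type may take.
DOM = {"mammal": {"Fuffi", "Fido"}, "arachnid": {"Pito"}}

# A term is either a type name (str) or a typed value.
Instance = namedtuple("Instance", ["value", "type"])


def similar(x, y):
    """X ~ Y: some SEO node contains both X and Y."""
    return any(x in node and y in node for node in SEO_NODES)


def subtype_of(tx, ty):
    """type(X) <= type(Y) in the reflexive, transitive isa order."""
    while tx is not None:
        if tx == ty:
            return True
        tx = PARENT.get(tx)
    return False


def instance_of(x, ty):
    """The type of X is a subtype of Y and the value of X lies in dom(Y)."""
    return subtype_of(x.type, ty) and x.value in DOM.get(ty, set())


def below(x, y):
    """X below Y: X instance_of Y or X subtype_of Y."""
    if isinstance(x, Instance):
        return instance_of(x, y)
    return subtype_of(x, y)


def above(x, y):
    """X above Y: Y below X."""
    return below(y, x)


print(similar("W. Stallings", "William Stallings"))
print(subtype_of("dog", "mammal"), subtype_of("black widow", "mammal"))
print(below(Instance("Fuffi", "elephant"), "mammal"), above("mammal", "dog"))
```

Representing the isa order as a child-to-parent map keeps `subtype_of` a simple upward walk; an ontology with multiple parents per term would need a graph search instead.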
Query Example
• Pattern tree: node #1 with 'pc' edges to nodes #2 and #3
• Selection condition: #1.tag = book & #2.tag = title & #3.tag = author & #3.content ~ "W. Stallings"
• ds(W. Stallings, William Stallings) < threshold
[Figure: database biblio with two book records. Book 1: title [Operating Systems], author [W. Stallings], publisher [MacMillan], year [1992], price [45,50], ISBN [002945671]. Book 2: title [Cryptography], author [William Stallings], publisher [Prentice Hall], year [2003], price [42,50], ISBN [003456783]. Result: two witness trees, book (title [Operating Systems], author [W. Stallings]) and book (title [Cryptography], author [William Stallings]).]
NOW all correct answers are returned!
Query Example (2)
• Query: "Return the list of all mammals"
[Figure: database animals with three records: elephant (Name [Fuffi], Age [50]), dog (Name [Fido], Age [4]) and black widow (Name [Pito], Age [7]). Mammal ???]
• From the ontology: Elephant IS A mammal, Dog IS A mammal.
• Result: elephant (Name [Fuffi], Age [50]) and dog (Name [Fido], Age [4]).
Implementation and Experiments
• TOSS is implemented in Java.
• It is built on top of the Xindice DBMS.
• Experiments over DBLP:
  - recall and precision of 12 selection queries on 3 data sets (each containing 100 random papers)
Recall and precision
[Plot: TAX compared with TOSS at threshold = 2 (marker X) and TOSS at threshold = 3 (marker +).]
• TAX always gets 100% precision but low recall!
• TOSS keeps its precision close to 1 with much higher recall!
• For the queries with the lowest TOSS precision, a precision loss of 1/3 corresponds to a threefold increase in recall.
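For reference, precision and recall can be computed as in the short sketch below. The paper identifiers and the two answer sets are invented purely to illustrate the trade-off described in the last bullet; they are not the DBLP results, and returning 1.0 for an empty set is a convention assumed here.

```python
def precision_recall(returned, relevant):
    """Precision = correct answers / returned answers;
    recall = correct answers / relevant answers."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical query with nine relevant papers (identifiers are made up).
relevant = {f"paper{i}" for i in range(1, 10)}

# Exact matching: few answers, all of them correct.
tax_answer = {"paper1", "paper2"}
# Similarity matching: three times as many correct answers, plus three wrong ones.
toss_answer = {f"paper{i}" for i in range(1, 7)} | {"other1", "other2", "other3"}

tax = precision_recall(tax_answer, relevant)
toss = precision_recall(toss_answer, relevant)
print("TAX  precision, recall:", tax)
print("TOSS precision, recall:", toss)
```

In this made-up case precision drops from 1 to 2/3 while recall triples from 2/9 to 6/9.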
Recall and precision (2)
[Plot: TAX compared with TOSS at threshold = 2 (marker X) and TOSS at threshold = 3 (marker +).]
• TOSS quality is always better than that of TAX!
Recall and precision (3)
[Plot: improvement of TOSS over TAX at threshold = 2 (marker X) and threshold = 3 (marker +).]
• In TOSS, most queries have their normalized recall more than doubled.
• TOSS results with threshold = 3 are not necessarily better than those with threshold = 2.
Conclusions & Related works
• Ontologies to improve the quality of answers to queries (Wiederhold's group);
• Merge ontologies under interoperation constraints;
• Semistructured instances with associated ontologies can be queried;
• Introduces the concept of similarity search in semistructured DBs.
• Scored pattern tree (TIX)
Bibliography
• H.V. Jagadish, L.V.S. Lakshmanan, D. Srivastava and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf., Rome, Italy, 2001.
• G.A. Miller et al. WordNet – a lexical database for English. Cognitive Science Laboratory, Princeton University.
• G. Wiederhold. Interoperation, mediation and ontologies. In International Symp. on Fifth Generation Computer Systems, Workshop on Heterogeneous Cooperative Knowledge Bases, ICOT, pages 33–48, 1994.
• SIGMOD Record in XML. Available at http://www.acm.org/sigmod/record/xml/, Nov 2002.