AN ADAPTATION OF THE VECTOR-SPACE MODEL FOR ONTOLOGY-BASED INFORMATION RETRIEVAL Authors: Pablo Castells, Miriam Fernández, and David Vallet PRESENTED BY: AMALA RANGNEKAR
OVERVIEW • INTRODUCTION • EARLIER MODELS’ ISSUES • PROPOSED SYSTEM • SEMI-AUTOMATIC ANNOTATION • WEIGHING ANNOTATIONS • ANNOTATION ISSUES • QUERY PROCESSING • RANKING ALGORITHM • ISSUES, COMBSUM SOLUTION • EXPERIMENTS • FINAL OBSERVATIONS • COMPARISON WITH CONVENTIONAL SYSTEM • STRENGTHS • CURRENT ISSUES
INTRODUCTION • Most search engines use keyword-based techniques to return documents in response to user queries. • This approach is Boolean: a document either matches the keywords or it does not ('yes/no'). • A more intelligent IR approach using semantic search is needed in combination with the present method. • Any reasons/examples as to why?
E.G. US POPULATION (FIG. 1)
EARLIER MODELS’ ISSUES • The absence of a ‘weight’ for each term in the query. • The ‘RELEVANCE’ of a term is not proportional to its ‘FREQUENCY’. • The ‘RARITY’ of a term is not exploited. HOW WOULD THIS HELP? E.g. Arachnocentric (of spiders)
PROPOSED SYSTEM • ‘Conceptual searching’ techniques for heterogeneous KBs have drawbacks. Do you know KIM? (https://www.ontotext.com/sites/default/files/publications/KIM_SAP_ISWC168.pdf) • Ranking: our concern is to rank the documents annotated with the query answers, not the answers themselves.
PROPOSED SYSTEM • Domain-Concept: the superclass (root) of all base concepts. • Topic: a ‘property’ of a class, used for classification. • Document: the proxy for the information source to be searched.
SEMI-AUTOMATIC ANNOTATION • Every Domain Concept instance stores a multi-valued property called ‘label’. (A label is the most usual text form of the instance.) • Whenever an occurrence of a label is found in a document, an annotation is created between the instance and the document. FIG. 3: Instance, Annotation, Document
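The label-matching step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `documents` and `labels` dictionaries, the instance names, and the whole-word, case-insensitive matching rule are hypothetical stand-ins, not the authors' implementation.

```python
import re
from collections import defaultdict


def annotate(documents, labels):
    """Count label occurrences per (instance, document) pair.

    documents: {doc_id: text}          (hypothetical corpus)
    labels:    {instance: [label, ...]} (hypothetical multi-valued 'label' property)
    Returns {(instance, doc_id): occurrence count}; one entry per annotation.
    """
    counts = defaultdict(int)
    for doc_id, text in documents.items():
        for instance, instance_labels in labels.items():
            for label in instance_labels:
                # Assumed matching rule: whole-word, case-insensitive.
                pattern = r"\b" + re.escape(label) + r"\b"
                hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
                if hits:
                    counts[(instance, doc_id)] += hits
    return dict(counts)


documents = {
    "d1": "Ping pong is fun. Table tennis clubs love ping pong.",
    "d2": "Picasso painted Guernica.",
}
labels = {
    "TableTennis": ["table tennis", "ping pong"],
    "Picasso": ["Picasso"],
}
annotations = annotate(documents, labels)
```

Giving one instance several labels is what lets both "table tennis" and "ping pong" count toward the same annotation.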
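WEIGHING ANNOTATIONS • A ‘weight’ is assigned to every annotation rather than to the document. • It reflects how relevant the instance is to the document. • The weight is computed by an adaptation of the TF-IDF algorithm. • Weight $d_x$ for any instance x occurring in document d: $d_x = \frac{freq_{x,d}}{\max_y freq_{y,d}} \cdot \log \frac{|D|}{n_x}$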
WEIGHING ANNOTATIONS Adaptation of the TF-IDF algorithm • $freq_{x,d}$: the number of occurrences in d of the keywords attached to x • $\max_y freq_{y,d}$: the frequency of the most repeated instance in d • $n_x$: the number of documents annotated with x • D: the set of all documents in the search space
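As a rough sketch of how these quantities combine, the Python below computes a TF-IDF-style weight per annotation: the normalized frequency of the instance in the document times the log of the inverse document frequency. The function name, the input layout, and the choice of natural logarithm are assumptions made for illustration.

```python
import math


def annotation_weights(freq, num_docs):
    """Compute a TF-IDF-style weight for every (document, instance) annotation.

    freq:     {doc_id: {instance: occurrence count}}  (hypothetical layout)
    num_docs: |D|, the size of the document search space
    Returns {doc_id: {instance: weight}}.
    """
    # n[x]: number of documents annotated with instance x.
    n = {}
    for counts in freq.values():
        for x in counts:
            n[x] = n.get(x, 0) + 1

    weights = {}
    for d, counts in freq.items():
        if not counts:
            weights[d] = {}
            continue
        # Frequency of the most repeated instance in d.
        max_freq = max(counts.values())
        weights[d] = {
            x: (f / max_freq) * math.log(num_docs / n[x])  # natural log assumed
            for x, f in counts.items()
        }
    return weights


freq = {"d1": {"TableTennis": 3, "Picasso": 1}, "d2": {"Picasso": 2}}
weights = annotation_weights(freq, num_docs=4)
```

In this toy input, "TableTennis" is both the most frequent instance in d1 and the rarer one across the collection, so it receives the highest weight.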
ANNOTATION ISSUES • METONYMY (Table tennis = Ping pong). SOLUTION? • Extending labeling schemes. UNRESOLVED ISSUES: • SYNECDOCHE (Picasso ... “The painter also ...”) • Counting imprecision
QUERY PROCESSING RDQL queries can express conditions on: • Ontology instances • Document properties • Classification values Query variables can be weighted: • Manually • Automatically
QUERY PROCESSING FIG. 4
RANKING ALGORITHM Semantic similarity value between query and document. • O: the set of all classes and instances in the ontology • D: the set of all documents • $Q_x$: extended query vector • $V_q$: the set of variables in the SELECT clause of q [Diagram labels: ranking, retrieval, annotation]
RANKING ALGORITHM • w: weight vector, with each weight in [0, 1] • T: the tuples in the query result set • D: the document search space • $d_x$: the weight of the annotation of document d with instance x • q ∈ Q: an RDQL query • Similarity: $sim(d, q) = \frac{d \cdot q}{|d| \cdot |q|}$
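A minimal Python sketch of this ranking step follows, using sparse dictionaries for the document and query vectors. The way the query vector is built here (an instance bound to variable v in some result tuple receives the weight of v, everything else is 0) is an assumption for illustration, as are all names and example values.

```python
import math


def query_vector(result_tuples, variable_weights):
    """Build a sparse query vector {instance: weight}.

    result_tuples:    list of {variable: instance} bindings (the set T)
    variable_weights: {variable: weight in [0, 1]} (the vector w)
    Assumption: an instance bound to variable v gets weight w_v.
    """
    q = {}
    for t in result_tuples:
        for var, instance in t.items():
            w = variable_weights.get(var, 1.0)
            q[instance] = max(q.get(instance, 0.0), w)
    return q


def cosine_similarity(d, q):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * q[x] for x, w in d.items() if x in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Hypothetical query result: two banks bound to the variable "company".
tuples = [{"company": "BankA"}, {"company": "BankB"}]
q = query_vector(tuples, {"company": 1.0})

# Hypothetical annotation weights d_x for one document.
doc = {"BankA": 0.8, "Nasdaq": 0.6}
score = cosine_similarity(doc, q)
```

A document with no annotation for any instance in the result set scores 0, which is why an incomplete KB depresses the similarity of otherwise relevant documents.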
ISSUES, COMBSUM SOLUTION • Normalization is required. • An incomplete KB results in a lower similarity value even for relevant docs. • The method needs to be combined with a keyword-based algorithm. Any suggestions for solutions? • CombSUM
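CombSUM can be sketched as summing each document's normalized scores across the rankings being combined. The version below uses min-max normalization, which is an assumption (the slide only says normalization is required), and the scores are made-up values.

```python
def min_max_normalize(scores):
    """Rescale {doc_id: score} to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(*rankings):
    """Sum the normalized scores of each document over all rankings."""
    combined = {}
    for scores in rankings:
        for doc, s in min_max_normalize(scores).items():
            combined[doc] = combined.get(doc, 0.0) + s
    return combined


# Hypothetical scores: the semantic ranking misses d3 (incomplete KB),
# while the keyword ranking misses d4.
semantic = {"d1": 0.9, "d2": 0.3, "d4": 0.6}
keyword = {"d1": 4.0, "d2": 6.0, "d3": 2.0}

combined = comb_sum(semantic, keyword)
ranking = sorted(combined, key=combined.get, reverse=True)
```

A document missing from one ranking simply contributes nothing for it, so documents the ontology cannot reach are still retrievable through their keyword score.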
EXPERIMENTS KIM domain ontology and KB Complete KB includes: • 281 classes • 138 properties • 35,689 instances Automatic generation of concept-keyword mapping • $3 \times 10^6$ annotations • Average observed response time below 30 sec • Weight of query variables set to 1
QUERY A: News about banks that trade on NASDAQ, with fiscal net income > 2 million dollars Keyword-based: • Limited expressive power • Fails to express the query condition Semantic Search: • Handles the condition • Annotates relevant instances Ontology: • KB is large, but not massive. • KB doesn’t contain all banks, hence precision is lower at 100% recall FIG. 5
QUERY B: News about telecom companies Keyword-based: • KB contains few instances Semantic: • The keyword-based result is better, so the linear combination value is also better Ontology: • Low precision • KB incomplete FIG. 6
QUERY C: News about insurance companies in the USA. Ontology: • Performance is spoiled by incorrect annotations. (Kaye is both a company name and a person’s name.) Semantic: • Since the keyword-based result is better, the linear combination value is also better. FIG. 7
FINAL OBSERVATIONS • An average comparison of the systems over 20 queries. • Results: Situations where ontology-only search performs poorly are compensated for on average. FIG. 8
STRENGTHS • Better recall: Query for specific instances using class hierarchies & rules. • Better precision: Using weights, reducing ambiguities (extending labels), using structured semantic queries. • Combination of conditions on concepts Better results: • With an increase in the number of clauses in the formal query • With a complete, high-quality KB
CURRENT ISSUES Further work needed as follows: • Automatic annotation. • Advanced NLP to replace human supervision. • Score combination strategy. • Model extension with a profile of user interests for personalized search. ANY MORE?