Using Domain Ontologies to Improve Information Retrieval in Scientific Publications Engineering Informatics Lab at Stanford
Data
TREC Genomics 2007 Data Set • Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine • Metadata available through MEDLINE • Tasks involve passage, document, and feature retrieval • Methodologies are evaluated on their response to 36 topics (‘queries’) • The topics are categorized based on 13 entity types (Proteins, Genes, etc.)
BioPortal • BioPortal is an integrated resource for biomedical ontologies • Currently indexes over 300 ontologies, including Medical Subject Headings (MeSH) and Gene Ontology • Provides a comprehensive web service that abstracts the formats and APIs of all underlying ontologies
Methodology
How is Domain Knowledge Integrated (1) Annotating documents prior to indexing • Response time is fast • Not flexible: the entire index has to be updated if a new ontology needs to be added • Indexes can grow very large (2) Query expansion • Response time is slower • Very flexible: ontologies can be chosen dynamically
Query Expansion • TREC queries are first manually pre-processed: “What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish” • [Tumor] indicates the term that has to be expanded • [MeSH] indicates the ontology that should be used
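The pre-processed format above can be read mechanically. The following is a minimal sketch, assuming every clause is separated by a literal AND and every expandable term is written exactly as [term][ontology]; the function name `parse_query` and the dictionary layout are hypothetical and not taken from the slides.

```python
import re

# Matches a "[term][ontology]" marker such as "[Tumor][MeSH]".
MARKER = re.compile(r"\[([^\[\]]+)\]\[([^\[\]]+)\]")


def parse_query(query):
    """Split a pre-processed query on AND and classify each clause.

    Clauses written as [term][ontology] are flagged for expansion;
    every other clause is kept as a plain keyword (ontology is None).
    """
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        match = MARKER.fullmatch(part)
        if match:
            clauses.append({"term": match.group(1), "ontology": match.group(2)})
        else:
            clauses.append({"term": part, "ontology": None})
    return clauses


parsed = parse_query("[Tumor][MeSH] AND zebrafish")
```

Here `parsed` holds one clause marked for expansion against MeSH and one plain keyword clause for zebrafish.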
Query Expansion • The pre-processed query is automatically expanded using BioPortal’s API: [Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …} [Figure: Tumor in MeSH with related nodes Melanoma, Adenocarcinoma, Leukemia, Nerve Sheath Neo…]
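One way to picture the expansion step is a walk over an ontology that collects a term, its synonyms, and its descendants. The sketch below uses a small in-memory dictionary as a stand-in for the ontology service; `TOY_ONTOLOGY`, its entries, and `expand_term` are illustrative assumptions, and the code does not call BioPortal or reproduce real MeSH content.

```python
# Toy stand-in for an ontology; entries are illustrative, not real MeSH data.
TOY_ONTOLOGY = {
    "Tumor": {"synonyms": ["Neoplasm"], "children": ["Carcinoma", "Leukemia"]},
    "Carcinoma": {"synonyms": [], "children": ["Adenocarcinoma"]},
    "Leukemia": {"synonyms": [], "children": []},
    "Adenocarcinoma": {"synonyms": [], "children": []},
}


def expand_term(term, ontology):
    """Return the term, its synonyms, and all descendants with their synonyms."""
    expanded, stack, seen = [], [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Terms missing from the ontology expand to themselves only.
        entry = ontology.get(current, {"synonyms": [], "children": []})
        for name in [current, *entry["synonyms"]]:
            if name not in expanded:
                expanded.append(name)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(entry["children"]))
    return expanded


terms = expand_term("Tumor", TOY_ONTOLOGY)
```

A term that the ontology does not contain comes back unchanged, which mirrors the limitation noted later in the deck: missing terms cannot contribute to the expansion.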
Which Domain Knowledge is Integrated • The use of synonymy results in inconsistent performance (2007 TREC genomics track) • Common reasons include: • Relevant terms may not be classified as expected • Some relevant terms may not be classified in a particular ontology • Incomplete information (such as synonyms) • Selection of the appropriate domain ontology is important
Enriching Existing Ontologies • Existing ontologies must be enriched to complete missing information • Multiple ontologies can be used to provide different classifications [Figure: MeSH, NCI]
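As a rough illustration of enrichment, synonym lists from a second ontology can be merged into the first. This is a sketch under stated assumptions: each ontology is reduced to a term-to-synonyms mapping, terms are matched by identical labels, and the sample entries are made up for the example rather than taken from MeSH or NCI. The slides do not describe the actual enrichment procedure.

```python
def enrich(primary, *others):
    """Merge term -> synonym maps without modifying the inputs.

    Synonyms from the primary ontology keep their order; synonyms from
    the other ontologies are appended if not already present
    (compared case-insensitively).
    """
    merged = {term: list(syns) for term, syns in primary.items()}
    for ontology in others:
        for term, syns in ontology.items():
            known = merged.setdefault(term, [])
            lowered = {s.lower() for s in known}
            for syn in syns:
                if syn.lower() not in lowered:
                    known.append(syn)
                    lowered.add(syn.lower())
    return merged


# Illustrative entries only.
mesh_like = {"Neoplasms": ["Tumor", "Cancer"]}
nci_like = {"Neoplasms": ["tumor", "Neoplasia"], "Zebrafish": ["Danio rerio"]}
enriched = enrich(mesh_like, nci_like)
```

Matching on identical labels is the weakest part of this sketch; aligning concepts across real ontologies generally needs an explicit mapping.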
Evaluations • Baseline • With Query Expansion (Suggested Sources) • Using Enriched Ontologies • Multiple Query Expansions per query
Queries
Baseline • Queries are used without modification, e.g., • “What [ANTIBODIES] have been used to detect protein PSD-95?” • “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?” • Document MAP: 0.277
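Document MAP is mean average precision over the topics. The sketch below follows the standard definition (precision at each rank where a relevant document appears, averaged over all relevant documents for the topic, then averaged over topics). It assumes each ranked list contains no duplicate document IDs; the function names and toy data are illustrative, and this is not the official evaluation script.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of per-topic average precision; runs is a list of (ranking, relevant) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Two toy topics with made-up document IDs.
runs = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3"]),
    (["d5", "d6", "d7"], ["d6"]),
]
score = mean_average_precision(runs)
```

For the toy data, the first topic scores (1/1 + 2/3) / 2 and the second scores 1/2, so `score` is their mean, 2/3.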
Query Expansion • Queries are formulated in ‘AND’ clauses: “[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish • Document MAP: 0.347
Multiple Query Expansion Terms • Expansion can be performed on multiple terms in the query • Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …} • [Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, Danio rerio, …) • Document MAP: 0.352
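Once each marked term has its expansion list, the final query can be assembled as a conjunction of groups. The sketch below reads each comma-separated group on the slide as a disjunction (OR), which is an interpretation rather than something the slides state; the quoting of multi-word terms and the name `build_boolean_query` are also assumptions, and the exact syntax depends on the search engine in use.

```python
def build_boolean_query(expansions):
    """Join each expansion list with OR, then join the groups with AND."""
    groups = []
    for terms in expansions:
        # Quote multi-word terms so they are treated as phrases.
        quoted = ['"%s"' % t if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = build_boolean_query([
    ["tumor", "neoplasm"],
    ["zebrafish", "Danio rerio"],
])
```

With these inputs, `query` is `(tumor OR neoplasm) AND (zebrafish OR "Danio rerio")`.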
Enriched Ontology • Marginal improvement over basic enhanced models • Document MAP: 0.352 • Why is the improvement only marginal? • Framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing in the ontology are still not included • Relevant terms that are classified differently are never included in the search
Visualization • Expert knowledge is valuable • We extend MINOE, a co-occurrence based visualization tool, originally designed for exploring marine ecosystems • User can browse (or search) documents through ontologies and visualize interactions between concepts SEE DEMO
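To make "co-occurrence based" concrete, the sketch below counts how often pairs of concepts appear in the same document; such counts could serve as edge weights in a concept graph. It is a generic illustration under simple assumptions (case-insensitive substring matching, one count per document) and does not describe how MINOE is implemented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, concepts):
    """Count, for each pair of concepts, the documents that mention both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Naive substring match; sorted so each pair has one canonical order.
        present = sorted(c for c in concepts if c.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Made-up document snippets for illustration.
docs = [
    "Melanoma models in zebrafish",
    "Zebrafish leukemia and melanoma screening",
    "Leukemia in mice",
]
counts = cooccurrence_counts(docs, ["melanoma", "zebrafish", "leukemia"])
```

In this toy collection, melanoma and zebrafish co-occur in two documents, while leukemia co-occurs once with each of the other two concepts.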
Summary • Search methodologies must be based on semantics in order to tackle terminology inconsistency • Domain ontologies provide these semantics • Domain ontologies need to be modified (or enriched) in order to fulfill information needs • User interaction is important
Future Work • Using multiple enriched ontologies may provide the necessary terms • MeSH Descriptors are provided for every publication during indexing and can potentially improve results • Implement Okapi model for scoring documents
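For reference, the Okapi model is commonly implemented as the BM25 ranking function. The sketch below is one widely used variant, assuming whitespace tokenization, a non-negative IDF of the form ln((N - n + 0.5) / (n + 0.5) + 1), and the common defaults k1 = 1.2 and b = 0.75. None of these choices come from the slides, and the collection is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query terms with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration.
docs = ["zebrafish tumor model", "mouse tumor study", "plant growth"]
scores = bm25_scores(["zebrafish", "tumor"], docs)
```

The first document matches both query terms and ranks highest, the second matches only one, and the third matches none and scores zero.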
Backup Slides
Motivation • Scientific literature is an important source of information • Retrieving relevant information from scientific publications is challenging • Domain terminology is used inconsistently in scientific publications • Increasing amounts of information amplify the problem • Improved methodologies based on semantics are required
Background • The Text REtrieval Conference (TREC), organized by NIST, has showcased many successful methods • The Genomics track focused on full-text scientific publications from 49 prominent journals • Methodologies involved: • Use of synonymy from ontologies • Language-based models • Query expansion and annotations • Okapi scoring model
Goals • Understand how domain ontologies can be leveraged • Understand which domain ontologies can be leveraged • Develop a knowledge-based approach to integrate domain knowledge with the search mechanism