Leveraging web semantics for user-friendly free-form search in scientific data, using NLP and ontology model matching. Explore system components and performance benefits in this innovative framework.
Ontological Framework for Enabling Free-Form Search in Scientific Discovery
Chaitali Gupta, Madhusudhan Govindaraju
Grid Computing Research Laboratory, SUNY Binghamton
E-science Microsoft Workshop 2008: Semantics Birds of a Feather Session
Motivation
• Most computer users today do not have to write programs
  • most end users of Grid and scientific data sets should be shielded from low-level details
• Web search engines search billions of web pages
  • use Natural Language Processing (NLP) and Information Retrieval (IR) technologies
  • return many links for any given search
• XML-based technology and ontologies can be used to categorize and organize information
  • in a machine-readable and understandable manner
  • to retrieve specific information from Grid/scientific services
Project Vision
• Our vision is that Web semantics can be leveraged to build search-engine-like interfaces even for Grid/scientific application metadata.
  • abstract away the fundamental complexity of XML-based service specifications and toolkits
• Add a search box on portal dashboards
• Automatically convert queries to job description specification formats
Related Work
• MDS
  • WSRF-compliant service to publish/retrieve resource information
• Condor ClassAds
  • Combines schema, data, and query in a simple but powerful query specification language
• Condor Gangmatching
  • Overcomes bilateral matching limitations of the ClassAds
Comparing with SPARQL
Scope of Free-Form Queries
• The problem of processing and acting upon arbitrary English is extremely challenging
  • actively addressed in the AI community
• We use many techniques from NLP and the Semantic Web
• The scope of our work is therefore limited
  • cannot accept arbitrary free-form queries
  • designed to accept a limited form of English with a vocabulary taken from the ontology
Example queries for New York State Grid (NYSGrid)
• List all sites of NYSGrid
• All Sites of NYSGrid with Xeon processors
• Processor configuration of nodes at Binghamton site of NYSGrid
• All machine names in NYSGrid with CPU speed greater than 2.0GHz speed
• Status of job ID 117 running on NYSGrid
• Names of 16 free nodes on the NYSGrid with at least 4GB of memory
• List all nodes of NYSGrid having CPU speed greater than 1Ghz and less than 4 Ghz
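
Several of the example queries carry numeric constraints ("greater than 1Ghz", "at least 4GB"). The slides do not show how the system extracts them, so the following is only a minimal sketch of one possible approach using a regular expression. The phrase table `COMPARATORS`, the pattern `CONSTRAINT_RE`, and the function `extract_constraints` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical mapping from comparison phrases to operators (not from the slides).
COMPARATORS = {
    "greater than": ">",
    "less than": "<",
    "at least": ">=",
    "at most": "<=",
}

# Matches e.g. "greater than 2.0GHz", "less than 4 Ghz", "at least 4GB".
CONSTRAINT_RE = re.compile(
    r"(greater than|less than|at least|at most)\s+(\d+(?:\.\d+)?)\s*(ghz|mhz|gb|mb)",
    re.IGNORECASE,
)


def extract_constraints(query):
    """Return a list of (operator, value, unit) tuples found in the query."""
    constraints = []
    for phrase, number, unit in CONSTRAINT_RE.findall(query):
        # Normalize the phrase and unit so differently cased input compares equal.
        constraints.append((COMPARATORS[phrase.lower()], float(number), unit.upper()))
    return constraints


query = "List all nodes of NYSGrid having CPU speed greater than 1Ghz and less than 4 Ghz"
print(extract_constraints(query))
```

A query with no comparison phrase, such as "List all sites of NYSGrid", simply yields an empty list, which a later stage could treat as "no filtering required".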
Example ontology model
System Components
• WSDL Processor
• User Query Interface
• Query Processor
• Match Processor
• Ontology Matcher
• Dictionary Matcher
  • direct matching, stripped matching, hypernyms, hyponyms
• Lexicon
  • how people use words, etc.
• Relevance Checker
  • glossary, input and output parameters of the Web service
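
The Dictionary Matcher's four strategies (direct, stripped, hypernym, hyponym) can be pictured as a fallback cascade. The sketch below assumes the strategies are tried in the order the slide lists them, which the slides do not confirm. The vocabulary, the hypernym and hyponym tables, and the helper names are all made up for illustration; a real system would presumably consult a lexical database such as WordNet rather than hand-written dictionaries.

```python
# Toy vocabulary and lexical tables; all entries are invented for illustration.
ONTOLOGY_TERMS = {"processor", "memory", "storage", "job", "node", "site"}
HYPERNYMS = {"xeon": "processor", "ram": "memory", "disk": "storage"}  # word -> broader term
HYPONYMS = {"hardware": ["memory", "processor", "storage"]}  # word -> narrower terms


def strip_word(word):
    """Crude normalization: lowercase and drop a plural 's' (not a real stemmer)."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def match_term(word):
    """Try direct, stripped, hypernym, then hyponym matching, in that order."""
    if word in ONTOLOGY_TERMS:
        return ("direct", [word])
    stripped = strip_word(word)
    if stripped in ONTOLOGY_TERMS:
        return ("stripped", [stripped])
    if HYPERNYMS.get(stripped) in ONTOLOGY_TERMS:
        return ("hypernym", [HYPERNYMS[stripped]])
    narrower = [t for t in HYPONYMS.get(stripped, []) if t in ONTOLOGY_TERMS]
    if narrower:
        return ("hyponym", narrower)
    return None  # no strategy produced a match


for word in ["memory", "Nodes", "Xeon", "hardware", "banana"]:
    print(word, "->", match_term(word))
```

Returning the strategy name alongside the matched terms makes it easy to see which words were resolved cheaply and which needed a lexical lookup.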
Example query that lights up the model
• The Ontology Matcher retrieves the ontologies from the ontology repository and matches them with the user query.
• Ontologies built in OWL for storing the vocabularies
  • concepts include “CPU”, “memory”, “storage”, “job”, etc.
  • use Jena to process OWL models/statements
  • <subject, predicate, object>
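
To make the statement structure concrete, here is a minimal Python sketch of matching query words against <subject, predicate, object> statements. It does not use Jena or OWL; it only mimics the idea with an in-memory list. The statements in `TRIPLES` and the function `match_query` are hypothetical, since the actual NYSGrid ontology is not reproduced in these slides.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Made-up statements standing in for an OWL model.
TRIPLES = [
    Triple("Node", "hasCPU", "CPU"),
    Triple("Node", "hasMemory", "Memory"),
    Triple("Site", "hasNode", "Node"),
    Triple("Job", "runsOn", "Node"),
]


def match_query(query, triples):
    """Return the statements whose subject or object appears as a word in the query."""
    words = set(query.lower().replace(",", " ").split())
    return [
        t for t in triples
        if t.subject.lower() in words or t.object.lower() in words
    ]


for hit in match_query("Status of job ID 117", TRIPLES):
    print(hit.subject, hit.predicate, hit.object)
```

This exact-word comparison finds "job" but would miss a plural such as "sites", which is the kind of case the dictionary-based strategies listed among the system components are meant to cover.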
System Components
• Queries resolved by the Ontology Matcher alone perform on average 95% to 96% better than those requiring both the Ontology Matcher and the Dictionary Matcher.
Performance of System Components
• Execution time taken by the major components
System Components
• Recall and precision increase when domain-dependent ontologies are considered.
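
For reference, recall and precision have their standard information-retrieval meaning: precision is the fraction of returned items that are relevant, and recall is the fraction of relevant items that are returned. The sketch below computes both from two sets. The service names are invented and the numbers are not results from this work.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved items against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four services returned, three actually relevant, two in common.
p, r = precision_recall({"svcA", "svcB", "svcC", "svcD"}, {"svcA", "svcB", "svcE"})
print(f"precision={p:.2f} recall={r:.2f}")
```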
Research Challenges
• Design algorithms to automatically infer the context of user queries and map them to an appropriate set of Grid and scientific services.
• Automatically extend and update domain knowledge using Semantic Web techniques and WordNet.
• Build a feedback loop for cases that don’t work
• Enable construction of simple workflows
  • multiple Grid services may be needed for a query
  • merging results from different services