
Efficient Selection & Integration of Data Sources for Answering Semantic Web Queries



Presentation Transcript


  1. Efficient Selection & Integration of Data Sources for Answering Semantic Web Queries • Abir Qasem¹, Dimitre Dimitrov², Jeff Heflin¹ • ¹Lehigh University, ²Tech-X Corporation • 11/11/07

  2. Outline • Challenges • Desiderata and overview of our approach • OWLII: the subset of OWL that our system supports • OBII: Ontology Based Information Integrator • Evaluation • Wrap up

  3. Challenge 1: Scalability • The (Semantic) Web is too big for any single system, regardless of advances in algorithms and/or smart hacks • We need to somehow identify a suitable subset that is relevant to a query and “work” on it • Sampling and refinement, as Fensel and van Harmelen (IEEE IC 2007) suggest? • Or get a good enough subset in one shot?

  4. Challenge 2: Heterogeneity • [Diagram: queries posed against many ontologies (O1, O2, O3, O4, …, ON) held in different stores such as Sesame, DLDB, and OWLIM] • Need alignments • Mapping tools • Third-party alignments • Need tools that exploit them

  5. Desiderata 1. Rely as much as possible on existing infrastructure. 2. Answer a query using any ontology, not just a globally accepted “query” ontology. 3. Identify a good enough subset of data sources that will yield useful answers. 4. Be able to “discover” alignment information even when the ontologies are not directly mapped to one another. 5. Account for the dynamic nature of the Web, where the content of data sources changes rapidly.

  6. Our approach (get a good enough subset in one shot) • Introduced the concept of source relevance (“REL” statements) • Allows data providers to advertise the relevance of their data to a query • If a source can express that it has relevant information, we can choose to query it rather than sources that express no such information (see the sketch below) • Adapted an information-integration algorithm to select relevant sources for a query, given relevance metadata and ontology alignments • Implemented and evaluated the system on synthetic data
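To make the advertising idea concrete, here is a minimal Python sketch. The source URLs, predicate names, and the dictionary encoding are our own illustration, not the system's actual data model:

    # Toy model of REL-based selection: each source advertises the predicates
    # it claims to hold relevant data for; only advertising sources are queried.
    rel_statements = {
        "http://sourceURL1": {"o1:TV", "o1:Monitor"},
        "http://sourceURL2": {"o2:LCD"},
    }

    def select_sources(query_predicates):
        """Return only the sources whose advertised relevance overlaps the query."""
        needed = set(query_predicates)
        return [src for src, advertised in rel_statements.items()
                if advertised & needed]

    print(select_sources(["o1:TV"]))  # -> ['http://sourceURL1']

Sources that stay silent are simply never contacted, which is what keeps the selected subset small.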

  7. PDMS • A fast and proven algorithm for query reformulation from the database community (Halevy et al., ICDE 2003) • Uses the LAV and GAV information-integration formalisms to describe maps and data sources • GAV • In first-order logic, an implication with multiple antecedents and a single consequent • Usually written like: O3:BigMonitor(x) :- O2:LCD(x), screen(x, big) • LAV • In first-order logic, an implication with a single antecedent and multiple consequents • Usually written like: O1:CinemaDisplay(x) ⊑ O2:LCD(x), screen(x, big)
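To ground the two formalisms, the slide's GAV rule can be encoded and unfolded directly. The encoding below (atoms as predicate/argument pairs) is our own toy representation, not the PDMS implementation:

    # Toy encoding of the slide's GAV rule; an atom is (predicate, args).
    # GAV: O3:BigMonitor(x) :- O2:LCD(x), screen(x, big)
    GAV = {
        "O3:BigMonitor": [("O2:LCD", ("x",)), ("screen", ("x", "big"))],
    }

    def unfold_gav(atom):
        """One GAV reformulation step: replace a defined atom with its body."""
        pred, _args = atom
        return GAV.get(pred, [atom])

    print(unfold_gav(("O3:BigMonitor", ("x",))))
    # -> [('O2:LCD', ('x',)), ('screen', ('x', 'big'))]

GAV atoms can be answered by simple unfolding as above; LAV views instead require a view-based rewriting algorithm (e.g., MiniCon), which is what makes the PDMS reformulation machinery necessary.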

  8. OWL for information integration (OWLII) • The ontology language our system supports • A subset of OWL DL (therefore decidable) • To represent LAV and GAV in OWL, we have extended the DHL language (Grosof et al. 2003) • REL statements are modeled as LAV statements • Details in a tech report: http://www3.lehigh.edu/images/userImages/jgs2/Page_7287/LU-CSE-07-007.pdf

  9. REL example

<meta:RelStatement>
  <meta:source rdf:resource="http://sourceURL2"/>
  <meta:contained>
    <owl:Class rdf:about="TV"/>
  </meta:contained>
  <meta:container>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#madeBy"/>
      <owl:hasValue rdf:resource="#Sony"/>
    </owl:Restriction>
  </meta:container>
</meta:RelStatement>

  10. Map example

<owl:Class rdf:about="&a;NovelAuthor">
  <rdfs:subClassOf rdf:resource="&b;Author"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&b;writes"/>
      <owl:someValuesFrom rdf:resource="&b;Novel"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

  11. Not-so-simple maps! • Maps are not always straightforward • For example: mapping a datatype property to an object property • “profession” is a datatype property in O1 • “Profession” is a class and hasProfession is an object property in O2 (domain Person, range Profession) • O1:Person ⊓ ∃O1:profession.{“teacher”} ⊑ O2:Person ⊓ ∃O2:hasProfession.{teacher}

  12. OBII • Ontology Based Information Integrator • Input • Domain ontologies (class and property hierarchies only) • Map ontologies (OWL files that import two ontologies and establish alignments using OWLII) • REL files (RDF files; a set of RDF triples enclosed in a RelStatement describes a source's relevance) • Data sources (OWL files that contain only individual and property assertions, i.e. an ABox, or a Sesame repository containing similar data) • SPARQL query • Output • Variable bindings in XML
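As an illustration of the input/output contract (SPARQL in, XML variable bindings out), here is a small sketch using rdflib; the library choice and the example data are ours, and the deck does not say what OBII itself is built on:

    from rdflib import Graph

    # A tiny ABox-style data source: a single individual assertion.
    g = Graph()
    g.parse(data="""
        @prefix a: <http://example.org/a#> .
        a:WarAndPeace a a:Novel .
    """, format="turtle")

    # SPARQL query in, variable bindings out, serialized as XML.
    results = g.query("SELECT ?x WHERE { ?x a <http://example.org/a#Novel> }")
    print(results.serialize(format="xml").decode())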

  13. OBII

  14. Evaluation • A baseline system: we load all the ontologies, all the maps, and all the sources into a DL reasoner and issue a query to get a sound and complete answer • A basic PDMS that selects sources without any taxonomic reasoning • OBII

  15. Metrics • Response time (restated as a formula below) • For the baseline system, we add the time to load all the data and the reasoning time to get the answers. • For the other two systems, load time is the time to load the ontologies used in the reformulation (map and domain ontologies) plus the selected data sources. • The response time for these two systems is then the sum of load time, reformulation time, and reasoning time. • Percentage of complete responses to queries. • In determining the completeness of queries, we take the baseline system's answers as the reference set. • This is reasonable because the baseline has all the data available to it and uses KAON2, a DL reasoner. • We only consider queries that have at least one answer.
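Written out (our restatement of the slide), the response-time metric for the two source-selecting systems is

    t_response = t_load + t_reformulation + t_reasoning

while the baseline has no reformulation term: t_response = t_load + t_reasoning.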

  16. Data • Real-world data is limited • It cannot be used to test the system completely • We decided to use synthetic data • We developed a workload generator, MOST (Maps Ontologies Sources Tester) • We plan to use some real data soon in the ISENS project

  17. Results (1) • Response time for each system as we vary the number of ontologies and the number of sources • Both basic PDMS and OBII are significantly faster than the baseline system (note: the chart is on a logarithmic scale) • Additionally, basic PDMS is typically twice as fast as OBII • Similar trends hold in other configurations • [Chart: response time per configuration, labeled “# of Onts - Diameter - # of sources”]

  18. Results (2) • Contribution of load time to response time • The main performance difference between OBII and basic PDMS is due to load time • OBII identifies more sources because it uses taxonomic reasoning • Since PDMS fails to identify these sources, it is incomplete for many queries (next chart)

  19. Results (3) • The percentage of complete query responses decreases in basic PDMS as we increase the number of data sources and the number of ontologies • OBII is 100% complete for all queries with respect to the baseline system

  20. Wrap up! • The Semantic Web needs to be connected in order for the “Semantics” to really pay off • We have implemented a fast source-selection algorithm for selecting and integrating Semantic Web data sources • Our initial evaluation shows promise, but there is a lot to be done • Complex ontologies, more expressive RELs, …

  21. Backups

  22. OWLII description

  23. Source Selection Algorithm • subPred returns the subclasses (or subproperties) of a predicate • This enables an enhanced “match” that lets us consider subclasses (and subproperties) when matching query predicates against sources • This is an improvement over the basic PDMS (a minimal sketch follows)
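The enhancement the slide describes can be sketched in a few lines of Python; the taxonomy and the names below are illustrative, not the OBII code:

    # Toy taxonomy: predicate -> its direct subclasses/subproperties.
    SUB = {
        "o1:Monitor": ["o1:LCD"],
        "o1:LCD": ["o1:CinemaDisplay"],
    }

    def sub_pred(pred):
        """Return pred together with all of its sub classes/properties."""
        result = {pred}
        for child in SUB.get(pred, []):
            result |= sub_pred(child)
        return result

    def match(query_pred, source_pred):
        """Enhanced match: a source is relevant if its predicate is the
        query predicate or any predicate below it in the taxonomy."""
        return source_pred in sub_pred(query_pred)

    print(match("o1:Monitor", "o1:CinemaDisplay"))  # True: two taxonomy hops

A plain PDMS match would test only predicate equality, which is why it misses sources (and answers) that OBII finds.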

  24. How MOST was used • OntGenerator • An average of 20 classes and 20 properties • The class and property taxonomies have an average branching factor of 4 and an average depth of 3 • MapGenerator • An even distribution of the various mapping axioms (this can be controlled) • Chose to map about 30% of the classes and 30% of the properties of a given domain ontology • The resulting map views contain an average of 5 conjuncts, with some maps containing up to 11 conjuncts • SourceGenerator • Creates instances of 30% of the classes and 30% of the properties of the domain ontology that a source commits to • On average, each data source contains 50 triples • QueryGenerator • Generates 200 random queries with 1 to 3 conjuncts (75% of conjuncts are properties as opposed to classes)
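For reference, the generator parameters above can be collected into a single configuration sketch; the dictionary keys are our own naming, since the deck does not show MOST's actual interface:

    most_config = {
        "OntGenerator":    {"avg_classes": 20, "avg_properties": 20,
                            "avg_branching_factor": 4, "avg_depth": 3},
        "MapGenerator":    {"mapped_fraction": 0.30, "avg_conjuncts": 5,
                            "max_conjuncts": 11},
        "SourceGenerator": {"instantiated_fraction": 0.30, "avg_triples": 50},
        "QueryGenerator":  {"num_queries": 200, "conjuncts_range": (1, 3),
                            "property_conjunct_fraction": 0.75},
    }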
