Towards a Generic Framework for Semantic Data Registration and Integration in Geosciences
Kai Lin, Chaitan Baru
San Diego Supercomputer Center
University of California, San Diego
www.geongrid.org
Data Integration Goal
• Query heterogeneous data sources as a single resource
  • Query: use an ad hoc, non-procedural query language instead of writing a program
  • Heterogeneous: each local resource controls the definition of its own data
  • Single resource: remove the burden of accessing each data source individually
Data Integration Challenges: Heterogeneities
• Syntactic heterogeneity: heterogeneous data formats, e.g. 02-04-2004 vs. 02/04/04
• Structural heterogeneity: heterogeneous data models and schemas, e.g. 02-04-2004 is stored as three columns or as one column
• Semantic heterogeneity: fuzzy metadata, terminology, "hidden" semantics, implicit assumptions
• GEON solution:
  • Data should first be semantically registered to GEON
  • Heterogeneities are resolved by registration
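The syntactic case can be illustrated with a small Java sketch that maps both date spellings from the slide to one canonical value. The class name DateNormalizer and the month-first reading of both formats are assumptions made for this illustration; the slides do not say how GEON resolves the ambiguity between month-first and day-first dates.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

public class DateNormalizer {
    // Assumption: both source formats are month-first. A two-digit year
    // ("uu") is resolved by java.time into the range 2000-2099.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("MM-dd-uuuu"),
            DateTimeFormatter.ofPattern("MM/dd/uu"));

    /** Parses a date in any known source format into one canonical LocalDate. */
    public static LocalDate normalize(String raw) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return LocalDate.parse(raw.trim(), format);
            } catch (DateTimeParseException e) {
                // Not this format; try the next one.
            }
        }
        throw new IllegalArgumentException("Unrecognized date format: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("02-04-2004")); // 2004-02-04
        System.out.println(normalize("02/04/04"));   // 2004-02-04
    }
}
```

Structural and semantic heterogeneity cannot be handled by a format table like this, which is why the deck moves on to registration.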
Levels of Registration
• Metadata-level registration
  • Register the metadata associated with a resource
  • Submit the required metadata, which has predefined semantics
• "Item" level registration
  • Register the "schema" of a resource, e.g. relational databases, shapefiles, …
  • Record the semantics of schema elements, e.g. table name, column name
• "Item-Detail" level registration
  • Register individual values in a dataset
  • Record the semantics of each item in a record/column
Registering Structured Data
• Relational databases
• Shapefiles → database tables
• Excel spreadsheets → database tables
• Delimited ASCII files → database tables
• Headers of scientific data files, e.g. netCDF
Item Level Database Registration and Access
[Diagram: the provider selects tables and views in the original database to register. The registered table and view definitions make up the published database, which an application reaches through the GEON JDBC Driver and the GEON Mediator.]
How to Connect to GEON Databases
• Download the GEON JDBC Driver
• Use the following code to create a connection

    import java.sql.Connection;
    import java.sql.DriverManager;

    // load driver
    Class.forName("org.geongrid.jdbc.driver.Driver");
    // set the mediator URL
    String url = "jdbc:geon://geon01.sdsc.edu:2532/GEON-63cb404c-6038-11d9-a69f";
    // open the connection
    Connection conn = DriverManager.getConnection(url, "geonuser", "geongrid");

• Parts of the mediator URL:
  • jdbc:geon is the GEON JDBC protocol
  • geon01.sdsc.edu:2532 is the host name and port number of the GEON Mediator
  • GEON-63cb404c-6038-11d9-a69f is the GEON ID
• Note: the original account information is not accessible to end users
GEON Mediator Enables Write Protection
[Diagram: an UPDATE request is stopped at the Mediator and does not reach the database.]
• Only accepts SELECT statements
• Rejects any request other than SELECT
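A minimal Java sketch of such a gate follows. The class SelectOnlyGate is hypothetical and not part of GEON; it uses a crude keyword check where a real mediator would presumably parse the statement, and it is deliberately conservative, refusing any input that contains more than one statement.

```java
import java.util.Locale;

public class SelectOnlyGate {
    /**
     * Returns true only for a single statement that starts with SELECT.
     * Assumption: a keyword check is enough for illustration. Any inner
     * semicolon is rejected, even inside a string literal, so the check
     * errs on the side of refusing.
     */
    public static boolean isAllowed(String sql) {
        if (sql == null) {
            return false;
        }
        String statement = sql.trim();
        // Tolerate one trailing semicolon, then refuse stacked statements.
        if (statement.endsWith(";")) {
            statement = statement.substring(0, statement.length() - 1).trim();
        }
        if (statement.contains(";")) {
            return false;
        }
        return statement.toUpperCase(Locale.ROOT).matches("(?s)SELECT\\s.*");
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("SELECT * FROM A"));    // true
        System.out.println(isAllowed("UPDATE B SET x = 1")); // false
    }
}
```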
Read Protection for Unregistered Tables and Views
[Diagram: SELECT * FROM A is sent to the Mediator; table A is not registered, so the request does not reach the database.]
• An unregistered table or view is invisible to end users
• Its data cannot be read with a SELECT statement
• Its schema cannot be fetched
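One way to picture this rule is a catalog that only knows about registered tables and views, so that an unregistered name looks the same as a name that does not exist. The class RegistrationCatalog below is a hypothetical Java sketch written for illustration, not GEON's implementation.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class RegistrationCatalog {
    // Registered table or view name (lower-cased) -> its column names.
    private final Map<String, List<String>> registered = new TreeMap<>();

    /** Publishes a table or view so that end users can see it. */
    public void register(String tableOrView, List<String> columns) {
        registered.put(tableOrView.toLowerCase(Locale.ROOT), List.copyOf(columns));
    }

    /** Lists only registered names; unregistered tables never appear. */
    public List<String> listTables() {
        return List.copyOf(registered.keySet());
    }

    /** Returns the schema of a registered table, or empty if it is not registered. */
    public Optional<List<String>> describe(String tableOrView) {
        return Optional.ofNullable(registered.get(tableOrView.toLowerCase(Locale.ROOT)));
    }
}
```

With B and C registered and A left out, listTables() reports only b and c, and describe("A") is empty, so neither the data nor the schema of A is reachable through the catalog.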
GEON Database Integration
• The GEON Mediator supports integration at three levels
• Level 1: Federation-Based Integration
  • End users need to be knowledgeable about each database
• Level 2: View-Based Integration
  • End users see "integrated views", which an intermediary designs
• Level 3: Ontology-Based Integration
  • End users can query using familiar concepts
  • Requires middleware and a formal representation of domain knowledge
Level 1: Federation-Based Integration
• Use SQL to query the federated database
• Structural and semantic heterogeneity must be resolved by the users themselves
[Diagram: the GEON Mediator federates two backends, one holding tables A, B, C, D and the other holding tables E, F, G. Users query the backend tables directly, e.g. SELECT * FROM A, E WHERE ……]
Level 2: View-Based Integration
• Allows views to be defined on top of the federated databases
• Allows the original backend schemas to be hidden
• Integration results can be shared and reused
[Diagram: the GEON Mediator exposes views V and W defined over two backends, one holding tables A, B, C, D and the other holding tables E, F, G. Users query the views, e.g. SELECT * FROM V, W WHERE ……]
Level 3: Ontology-Based Integration
• Requires ontology annotations for the backend databases
• Uses a simple ontology query language to query the integrated database
• End users do not need to know the backend schemas or local semantics
[Diagram: an ontology-based query is sent to the GEON Mediator, which sits over two backends, one holding tables A, B, C, D and the other holding tables E, F, G.]
GEON Ontology Based Data Integration
[Diagram: dataset1, dataset2, dataset3 and dataset4 registered to Ontology1, Ontology2 and Ontology3.]
• Ontology-enabled semantic integration: challenges for computer scientists and domain scientists
  • Computer scientists: build an integration system based on the ontological registration of datasets
  • Domain scientists: create domain ontologies
  • Data providers: register datasets to ontologies
Ontological Data Registration for Data Integration
• Registering a dataset to an ontology for data integration is a procedure that generates a partial model of the ontology from the dataset itself
• The model is partial because not all the constraints in the ontology are satisfied by the generated individuals
[Diagram: registration maps a dataset to individuals of the ontology.]
Registering Relational Tables to Ontology Classes
• Associate one or more columns, under an optional SQL condition, to a selected class in the ontology
• Provide a mapping method if no explicit names of individuals should be generated
• Example, class Location: (23.5, 47.9) is the name of an individual of the class Location. The same name indicates the same location.
• Example, class GeologicalAge: Precambrian, Cenozoic, Paleozoic
Registering Relational Tables to Ontology Object Properties
• Associate two entities that are already registered to the domain class and the range class of a selected object property in the ontology
• Example: the object property hasAge, with domain class Rock and range class GeologicAge
ODAL and SOQL
• ODAL (Ontological Database Annotation Language): register item / item-detail to an ontology
• SOQL (Simple Ontology Query Language): user queries
ODAL (Ontological Database Annotation Language)

    <odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
      <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
      <odal:Table>Samples</odal:Table>
      <odal:Table>RockTexture</odal:Table>
      <odal:Table>RockGeoChemistry</odal:Table>
      <odal:Table>ModalData</odal:Table>
      <odal:Table>MineralChemistry</odal:Table>
      <odal:Table>Images</odal:Table>
      <odal:Column>ssID</odal:Column>
    </odal:NamedIndividuals>

The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample.
[Diagram: a GUI generates ODAL, which is passed to the ODAL processor.]
• Creates a partial model of ontologies from databases
• Independent of the end interface
• Independent of specific database implementations
• The ODAL mapping is itself a "first-class" object
ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:

    <?xml version="1.0"?>
    <odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:owl="http://www.w3.org/2002/07/owl#"
               xmlns:odal="http://www.sdsc.edu/odal#">
      <odal:Ontology>
        <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
        <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
      </odal:Ontology>
      ……
    </odal:ODAL>
ODAL: Database Connection Declaration
The target database to annotate is declared as follows:

    <?xml version="1.0"?>
    <odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:owl="http://www.w3.org/2002/07/owl#"
               xmlns:odal="http://www.sdsc.edu/odal#">
      ……
      <odal:Database odal:id="PublicationDatabase">
        <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
        <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
        <odal:Host>oracle.sdsc.edu</odal:Host>
        <odal:Port>3456</odal:Port>
        <odal:DatabaseName>Publications</odal:DatabaseName>
      </odal:Database>
      ……
    </odal:ODAL>
ODAL: Simple Named Individuals

    <odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
      <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
      <odal:Schema>Collections</odal:Schema>
      <odal:Table>book-price</odal:Table>
      <odal:Column>ISBN</odal:Column>
    </odal:NamedIndividuals>

Suppose the Book ontology contains a class Book, and the schema Collections contains a table book-price with a column ISBN. The statement says that each value in the column ISBN represents a Book individual. odal:id gives a name to the declaration and represents the set of individuals generated by the statement.
ODAL: Named Individuals from Multiple Columns

    <odal:NamedIndividuals odal:id="LocationInTableRockSample">
      <odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
      <odal:Schema>California</odal:Schema>
      <odal:Table>Rock-Sample</odal:Table>
      <odal:Column>Latitude</odal:Column>
      <odal:Column>Longitude</odal:Column>
    </odal:NamedIndividuals>

Suppose an ontology contains a class Location, and a database contains a table Rock-Sample with two columns, Latitude and Longitude. The statement says that each pair of latitude and longitude values gives a location.
ODAL: Named Individuals with Conditions

    <odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
      <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
      <odal:Table>employee</odal:Table>
      <odal:Column>EmployeeId</odal:Column>
      <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
    </odal:NamedIndividuals>

    <odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee">
      <odal:Class odal:resource="http://www.abc.com/Employee#FemaleEmployee"/>
      <odal:Table>employee</odal:Table>
      <odal:Column>EmployeeId</odal:Column>
      <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
    </odal:NamedIndividuals>

The content of an odal:Condition element must be a boolean expression that is valid in the WHERE clause of an SQL query.
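Since the condition has to be valid in a WHERE clause, one plausible reading is that a NamedIndividuals declaration corresponds to a SELECT over the declared table and columns. The Java sketch below builds such a statement. The class NamedIndividualsQuery, the use of DISTINCT and the quoting of identifiers are assumptions made for illustration; the slides do not show what the ODAL processor actually generates.

```java
import java.util.List;

public class NamedIndividualsQuery {
    /**
     * Builds a SELECT that enumerates the individuals of one declaration.
     * schema and condition may be null. The condition is inserted verbatim,
     * as the slide requires it to be a valid WHERE-clause expression.
     */
    public static String toSql(String schema, String table,
                               List<String> columns, String condition) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringBuilder columnList = new StringBuilder();
        for (String column : columns) {
            if (columnList.length() > 0) {
                columnList.append(", ");
            }
            columnList.append(quote(column));
        }
        String from = (schema == null ? "" : quote(schema) + ".") + quote(table);
        String sql = "SELECT DISTINCT " + columnList + " FROM " + from;
        if (condition != null && !condition.trim().isEmpty()) {
            sql += " WHERE " + condition.trim();
        }
        return sql;
    }

    // Names such as book-price or Rock-Sample contain a hyphen and need
    // quoting. Note that quoted identifiers are case-sensitive in many databases.
    private static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toSql(null, "employee", List.of("EmployeeId"), "Gender='M'"));
        System.out.println(toSql("California", "Rock-Sample",
                List.of("Latitude", "Longitude"), null));
    }
}
```

For the MaleEmployee declaration this produces SELECT DISTINCT "EmployeeId" FROM "employee" WHERE Gender='M'. The multi-column Location declaration yields one row per distinct latitude and longitude pair.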
ODAL: Data Type Property Declaration
Table Person (sample row):

    … | SSN          | … | age | …
    … | 1234-56-7890 | … | 8   | …

Ontology: class Person has the datatype property hasAge with range double.

    <odal:NamedIndividuals odal:id="PersonInTablePerson">
      <odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
      <odal:Table>Person</odal:Table>
      <odal:Column>ssn</odal:Column>
    </odal:NamedIndividuals>

    <odal:OntologyProperty>
      <odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
      <odal:Table>person</odal:Table>
      <odal:Domain odal:resource="PersonInTablePerson" />
      <odal:Range odal:resource="age" />
    </odal:OntologyProperty>
Conditions for Joining Individuals from Different Resources
• To join data across independent resources, we need to know the correspondence between entities
• For example, does "10001" represent the same rock in the two resources? By default, we assume it does not.
• A set of datatype properties can be declared as a key for a class in the ontology. Joins across multiple resources are based on keys.
• E.g. {hasLatitude, hasLongitude} can be declared as a key of Location: two locations from different resources are the same if they have the same latitude and longitude
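The key rule can be sketched in Java as grouping individuals by their key values instead of by their local names. The class KeyJoin and its fields are hypothetical, and comparing coordinates for exact equality is a simplifying assumption; real data may need a tolerance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyJoin {
    /** A Location individual generated from one resource. */
    public static final class Location {
        public final String resource;
        public final String localName;
        public final double latitude;
        public final double longitude;

        public Location(String resource, String localName,
                        double latitude, double longitude) {
            this.resource = resource;
            this.localName = localName;
            this.latitude = latitude;
            this.longitude = longitude;
        }

        /** The declared key {hasLatitude, hasLongitude} as a comparable value. */
        public List<Double> key() {
            return List.of(latitude, longitude);
        }
    }

    /** Groups individuals that share a key; each group is one real-world location. */
    public static Map<List<Double>, List<Location>> groupByKey(List<Location> all) {
        Map<List<Double>, List<Location>> groups = new LinkedHashMap<>();
        for (Location location : all) {
            groups.computeIfAbsent(location.key(), k -> new ArrayList<>()).add(location);
        }
        return groups;
    }
}
```

Two individuals that are both called "10001" in different resources stay separate unless their key values agree, while individuals with different local names are merged when latitude and longitude match.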
SOQL (Simple Ontology Query Language)
[Ontology fragment: RockSample has the property location, whose range is Location (lat: float, long: float), and the property hasSiO2, whose range is ValueWithUnit (value: float, unit: string).]

    SELECT X.location.*
    FROM RockSample X
    WHERE X.location.lat > 60 AND
          X.location.long > 100 AND
          X.hasSiO2.value < 30 AND
          X.hasSiO2.unit = 'weightPercentage'

[Diagram: a GUI generates SOQL, which is passed to the SOQL processor.]
• Query single or integrated resources
  • via ontologies (i.e., high-level logical views)
  • independent of the schema-level representation
The Architecture of GEON Semantic Mediator
[Architecture diagram. Components shown:]
• Clients: GUI, portal or application, supplying OWL, ODAL and SOQL
• SOQL Processor: SOQL Parser, Semantic Query Rewriter, Ontology Reasoner, ODAL Processor
• Mediator: JDBC Driver, SQL Parser, Query Planning, Query Optimization, Query Execution, Internal Database. It receives spatial SQL against the federated schemas.
• Backends: Oracle, DB2, MySQL, SQL Server, PostgreSQL, PostGIS
Question: Find all seismic stations within 1 mile of railroads
• SOQL query from the GEON SOQL GUI:

    SELECT X.code, X.location.*
    FROM SeismicStation X, Railroad Y
    WHERE distance(X.location, Y.geometry) < 1

• SQL produced by the SOQL Processor and sent to the Schema Mediator:

    SELECT X2.stationcode, X2.lat, X2.lon
    FROM railroads_of_the_united_states X1, stationdatatable X2
    WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1

• Subquery for the railroad shapefile:

    SELECT X1.the_geom FROM railroads X1

• Subquery for the seismic stations:

    SELECT X2.stationcode, X2.lat, X2.lon
    FROM stationdatatable X2
    WHERE bounding box condition

• Remaining condition shown at the Schema Mediator: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1