GBD UFSC Data Base Group of Santa Catarina Federal University
A Method for Defining Semantic Similarities between GML Schemas
Angelo Augusto Frozza – UFSC / UNIPLAC
Ronaldo dos Santos Mello – UFSC
{frozza, ronaldo}@inf.ufsc.br
Summary
• Introduction
• Method overview
• Preprocessing
• Definition of the similarity score
• Mapping catalog
• Conclusion
Motivation
• GIS are used extensively by many kinds of organizations
• Organizations may need to interchange geographic data
• Problem: data heterogeneity
  • the same geographic entity may have different representations in different organizations
• Solutions that support geographic data interoperability among autonomous and heterogeneous sources are required
Motivation
• Information interchange among GIS must resolve heterogeneities at two levels:
  • syntactic
  • semantic
• Syntactic level: schema heterogeneity
  • requires conversion between export and import formats
  • does not ensure that the data have any meaning to new users
• Semantic level: do two geographic entities represent the same real-world fact?
Trend
• Current solutions for syntactic and semantic interoperability among GIS are based on standards and ontologies
• Main initiatives
  • Geography Markup Language (GML)
  • Web Ontology Language (OWL)
Proposal
• A method for semi-automated determination of semantic similarities between elements of distinct GML schemas
  • uses an ontology as a basis of common knowledge
  • may involve expert user intervention
• Contributions
  • Support for the development of GIS that require semantic interoperability
  • Solution applied to recent technologies for representing geographic data and ontologies
    • GML and OWL
  • The method is applied to the urban registration domain
    • Little explored in related work
    • Domain with large potential for practical applications
  • The method focuses on the integration of small, non-interconnected data sources
The Proposed Method
[Figure: method architecture. Input: domain ontology and GML’’ schema, each passed through a wrapper. Processing (on the GML’ schema home): (a) similarity definition, (b) mapping definition. Output.]
Data Preprocessing
• A wrapper is used to convert the ontology and the GML schemas into a canonical (tree) structure
[Figure: OWL tree with nodes O1 to O5 and GML tree with nodes G1 to G4; legend: complex element, attribute, relationship]
OWL
  O1 = “Parcel”
  O2 = “address” (string)
  O3 = “BlockNumber” (integer)
  O4 = “isPart” (“Block”, atomic)
  O5 = “hasRepresentation” (“geographicRepresentation”, multivalued)
GML
  G1 = “ParcelArea”
  G2 = “address” (string)
  G3 = “Block” (integer)
  G4 = “isPart” (“BlockMTR”, atomic)
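The canonical tree could be represented along the following lines. This is a minimal sketch, not the authors' wrapper: the `Node` class and its field names are hypothetical, and it assumes that O2 to O5 are children of O1 and G2 to G4 are children of G1, as the figure labels suggest.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the canonical tree (hypothetical layout)."""
    name: str
    kind: str                           # "complex", "attribute" or "relationship"
    data_type: Optional[str] = None     # attributes only, e.g. "string"
    target: Optional[str] = None        # relationships only: the related concept
    cardinality: Optional[str] = None   # relationships only: "atomic" / "multivalued"
    children: List["Node"] = field(default_factory=list)


# Ontology (OWL) side of the example: O1 with O2..O5 (assumed to be its children).
owl_parcel = Node("Parcel", "complex", children=[
    Node("address", "attribute", data_type="string"),
    Node("BlockNumber", "attribute", data_type="integer"),
    Node("isPart", "relationship", target="Block", cardinality="atomic"),
    Node("hasRepresentation", "relationship",
         target="geographicRepresentation", cardinality="multivalued"),
])

# GML side of the example: G1 with G2..G4 (assumed to be its children).
gml_parcel = Node("ParcelArea", "complex", children=[
    Node("address", "attribute", data_type="string"),
    Node("Block", "attribute", data_type="integer"),
    Node("isPart", "relationship", target="BlockMTR", cardinality="atomic"),
])

print(len(owl_parcel.children), len(gml_parcel.children))  # 4 3
```

Keeping attributes and relationships as children of the complex element makes the later node-by-node comparison a plain walk over two child lists.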
Similarity Score Definition
• Types of conflicts considered:
  • Nomenclature
  • Synonyms
  • Homonyms
  • Composition
  • Structure (properties)
  • Relationships
  • Generalization/Specialization
  • Association
Similarity Score Definition
• We adapt the metrics proposed by Dorneles et al. (2004):
  • Metrics for Complex Values (MCV)
    • applied to data structures (complex elements)
  • Metrics for Atomic Values (MAV)
    • applied to simple data (strings, dates, …)
    • application-domain dependent
• This metric set refers to a taxonomy appropriate for handling XML data
Similarity Score Definition
• Each GML schema tree node is tested against each ontology tree node
• The node name is first tested for equality against a table of synonyms
Similarity Score Definition
• If one or more corresponding synonyms are found, a structure similarity metric is applied to each positive result
[Figure: OWL tree (Parcel, O1 to O5) compared with GML tree (G1 to G4)]
Similarity Score Definition
• If no corresponding synonym is found, the synonym table is searched again using a name similarity metric
  • Example: “BlockMTR” ≈ “Block”
• Chosen metric: Jaro-Winkler
  • It extends the Jaro metric
  • It prevents strings that differ only at the end from having a large distance between them
  • It takes the common prefix into account
Similarity Score Definition
• If the similarity score is acceptable, the structure similarity metric is applied to each result
• The pair with the highest similarity score is chosen
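The matching steps described so far (synonym lookup, fallback to a name similarity metric, then choosing the best pair by structure similarity) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the synonym entries, and the 0.7 threshold are hypothetical, and `difflib`'s ratio stands in for the Jaro-Winkler metric so the block stays short.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional, Set


def stand_in_name_sim(a: str, b: str) -> float:
    # Stand-in only: difflib's ratio, NOT the Jaro-Winkler metric of the slides.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_nodes(
    gml_name: str,
    ontology_names: List[str],
    synonyms: Dict[str, Set[str]],
    name_sim: Callable[[str, str], float] = stand_in_name_sim,
    threshold: float = 0.7,  # hypothetical acceptance threshold
) -> List[str]:
    """Return the ontology node names worth passing to the structure metric."""
    # Step 1: equality test against the synonym table.
    exact = [o for o in ontology_names
             if gml_name == o or gml_name in synonyms.get(o, set())]
    if exact:
        return exact
    # Step 2: no synonym found, so search again with a name similarity metric.
    return [
        o for o in ontology_names
        if any(name_sim(gml_name, term) >= threshold
               for term in {o} | synonyms.get(o, set()))
    ]


def best_match(gml_name: str, ontology_names: List[str],
               synonyms: Dict[str, Set[str]],
               structure_sim: Callable[[str, str], float]) -> Optional[str]:
    """Apply the structure metric to each candidate and keep the best pair."""
    candidates = candidate_nodes(gml_name, ontology_names, synonyms)
    if not candidates:
        return None
    return max(candidates, key=lambda o: structure_sim(gml_name, o))


# Hypothetical synonym table, for illustration only.
synonyms = {"Parcel": {"ParcelArea", "Lot"}, "Block": set()}
print(candidate_nodes("ParcelArea", ["Parcel", "Block"], synonyms))  # ['Parcel']
print(candidate_nodes("BlockMTR", ["Parcel", "Block"], synonyms))    # ['Block']
```

Passing the name and structure metrics in as callables keeps the control flow independent of which metric is eventually chosen.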
Structure Similarity Metric
• εp: a node in set p
• εd: a node in set d
• p: set of element nodes from the GML schema tree
• d: set of class nodes from the ontology tree
• n and m: number of children of εp and εd, respectively
Similarity Score Definition
[Figure showing nodes εd and εp]
Simple Attribute Metric
• This metric is composed of
  • the Jaro-Winkler metric for names
  • data type compatibility analysis
• nameSim: attribute name similarity
• typeSim: data type similarity
• Names and data types have different weights
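The formula itself did not survive in this transcript. One plausible reading of the description above is a weighted sum of nameSim and typeSim; the sketch below assumes that form. The weights (0.6 and 0.4), the compatibility scores, and the function names are all hypothetical.

```python
# Hypothetical compatibility scores between data types (illustration only).
NUMERIC_TYPES = {"integer", "decimal", "double"}


def type_sim(type_a: str, type_b: str) -> float:
    """Data type compatibility: 1 for identical types, partial for numeric pairs."""
    if type_a == type_b:
        return 1.0
    if type_a in NUMERIC_TYPES and type_b in NUMERIC_TYPES:
        return 0.5  # assumed partial compatibility
    return 0.0


def attr_sim(name_sim: float, type_similarity: float,
             w_name: float = 0.6, w_type: float = 0.4) -> float:
    """Weighted sum of name and type similarity (assumed form, assumed weights)."""
    if abs(w_name + w_type - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return w_name * name_sim + w_type * type_similarity


# Identical names and identical types score 1 under any weights that sum to 1.
print(attr_sim(1.0, type_sim("string", "string")))  # 1.0
```

With weights that sum to 1 the result stays in [0, 1], which keeps it comparable with the other scores.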
Jaro-Winkler Metric
JaroWinklerScore(s,t) = JaroScore(s,t) + (prefixLength * PREFIXSCALE * (1 - JaroScore(s,t)))
• prefixLength: the length of the common prefix at the start of the two strings
• PREFIXSCALE: a constant scaling factor that controls how much the score is adjusted upwards for a common prefix
• Examples:
  • Block ≈ BlockMTR ≈ 0.875 + (0.5 * 0.125) = 0.9375
  • ParcelCTM ≈ ParcelTaxable ≈ 0.820 + (0.6 * 0.179) = 0.927
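The formula above can be tried out with a short sketch. Two choices here are assumptions inferred from the first example, where prefixLength * PREFIXSCALE is 0.5 for the five-character prefix "Block": PREFIXSCALE is taken as 0.1, and the prefix length is not capped. The usual Winkler formulation caps the prefix at four characters, so standard libraries return a slightly lower value.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted by the common prefix (uncapped: an assumption)."""
    prefix_length = 0
    for a, b in zip(s, t):
        if a != b:
            break
        prefix_length += 1
    base = jaro(s, t)
    return base + prefix_length * prefix_scale * (1 - base)


print(round(jaro("Block", "BlockMTR"), 4))          # 0.875
print(round(jaro_winkler("Block", "BlockMTR"), 4))  # 0.9375
```

Because the prefix is uncapped, a shared prefix longer than ten characters would push the score above 1 at this scale; a production version would need a cap or a smaller scale.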
Relationship Metric
• This metric is composed of
  • the Jaro-Winkler metric for names
  • concept similarity
  • cardinality constraint analysis
• nameSim: relationship name similarity
• concSim: concept similarity
• cardSim: cardinality similarity
• The components of the formula have different weights
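As with the attribute metric, the formula is missing from this transcript. The sketch below assumes a weighted sum of the three named components; the weights, the all-or-nothing cardinality comparison, and the function names are hypothetical.

```python
def card_sim(card_a: str, card_b: str) -> float:
    """Cardinality similarity: assumed to be 1 when both constraints agree, else 0."""
    return 1.0 if card_a == card_b else 0.0


def rel_sim(name_sim: float, conc_sim: float, cardinality_sim: float,
            weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted sum of nameSim, concSim and cardSim (assumed form and weights)."""
    w_name, w_conc, w_card = weights
    if abs(w_name + w_conc + w_card - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return w_name * name_sim + w_conc * conc_sim + w_card * cardinality_sim


# Two relationships with the same name, the same target concept and the same
# cardinality score 1 under any weights that sum to 1.
print(rel_sim(1.0, 1.0, card_sim("atomic", "atomic")))  # 1.0
```

A graded cardinality comparison (for example, partial credit for atomic versus multivalued) would slot into `card_sim` without changing `rel_sim`.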
Example of Similarity Definition
[Figure: OWL tree (O1 to O5) and GML tree (G1 to G4); legend: complex element, attribute, relationship]
• sim2 = attrSim(G2, O2) = 1 [address ≈ address]
• sim3 = attrSim(G3, O3) = 0.95 [BlockNumber ≈ Block]
• sim4 = relSim(G4, O4) = 0.98 [isPart ≈ isPart]
Example of Similarity Definition
• tupleSim() = (sim2 + sim3 + sim4) / 4
• tupleSim() = (1 + 0.95 + 0.98) / 4 = 0.73
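The arithmetic above can be checked with a few lines. The divisor 4 is read here as max(n, m), the larger of the two child counts (three children on the GML side, four on the ontology side); that reading is an assumption, since the formula slide did not survive in this transcript.

```python
def tuple_sim(child_sims, n: int, m: int) -> float:
    """Sum of the matched children's scores over the larger child count (assumed)."""
    return sum(child_sims) / max(n, m)


# sim2, sim3, sim4 from the example; O5 has no GML counterpart and adds nothing.
score = tuple_sim([1.0, 0.95, 0.98], n=3, m=4)
print(round(score, 2))  # 0.73
```

Dividing by the larger child count means an unmatched child such as O5 lowers the score even when every matched pair is nearly identical.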
Mapping Catalog
• The catalog is composed of two sets of tables
  • Information about the imported GML schemas (metadata)
  • Schema mappings
• Each element of the main GML schema may have an equivalent concept in the ontology
• Elements and similarities of the GML’’ schemas are related to the concepts of the main GML schema and the ontology
Mapping Catalog - Example
Conclusion
• The assumptions underlying our work are
  • Geographic data interchange happens mainly among domains with some affinity
  • Geographic data are better defined semantically within a specific domain than through domain generalization
• In this context, we expect our method to be useful as part of a system for GIS data integration
Main Contribution
• This work proposes a solution to the problem of semantic interoperability among GML schemas within the domain of urban registration
• Method characteristics
  • an ontology that represents the domain knowledge
  • semi-automated equivalence determination
Related Work
• Related work focuses on translating queries executed in closely interconnected heterogeneous environments
• This work focuses on data integration in environments that are not necessarily interconnected
• This research considers a scenario where:
  • small municipalities, individually, lack the means to maintain complex systems
  • geographic data are spread over many institutions
• As a consortium, however, they could promote data interchange through a mechanism that identifies the similarities among them
Future Work
• Define and run experiments to validate and improve the method
• Broaden the scope of the domain
• Extend the method to other domains
• Consider other ontologies
• Support the integration of GML instances
• Specify an environment for distributed geographic data queries based on the mappings
Thanks!
Angelo Augusto Frozza, Ronaldo dos Santos Mello
{frozza, ronaldo}@inf.ufsc.br
Application: Urban Register
• Ontology
Application: Urban Register
• GML schema