Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach
Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
{dani, casanova, milidiu}@inf.puc-rio.br
Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Department of Informatics
Summary
• Motivation
• Gazetteers & Thesauri
• Gazetteer Integration
• Instance-based Thesauri Mapping
  • Conceptual and Statistical Model
  • Experiments with Geographic Data
• Conclusions
Motivation
• Goal: gazetteer integration
  • how to migrate entries from gazetteer GB to gazetteer GA
• Problems
  • Elimination of duplicated entries: gazetteers may overlap, which requires detecting and eliminating duplicates
  • Reclassification of migrated entries: gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA
Gazetteers & Thesauri
• Gazetteer
  • a gazetteer is “a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information” [WordNet 2005]
  • a gazetteer is a catalog of geographic features, where each entry has as attributes:
    • a unique ID
    • a unique type: a term taken from a feature type thesaurus
    • a name
    • optionally, a location: an approximation of the feature footprint
WordNet (2005), “WordNet - a lexical database for the English language”. Cognitive Science Laboratory, Princeton University, Princeton, NJ, USA. Available at: http://wordnet.princeton.edu
Gazetteers & Thesauri
• Thesauri
  • a thesaurus is “a structured and defined list of terms which standardizes words used for indexing” [UNESCO 1995]
  • thesaurus relationships
    • NT: narrower term
    • BT: broader term
    • RT: related term
    • ...
UNESCO (1995), “UNESCO Thesaurus”. United Nations Educational, Scientific and Cultural Organization, 1995. http://www.ulcc.ac.uk/unesco
Gazetteers & Thesauri
• Example: the ADL Gazetteer and its ADL Feature Type Thesaurus
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
Gazetteers & Thesauri ADL Feature Type Thesaurus (sample terms rooted at ‘regions’)
Gazetteers & Thesauri ADL Feature Type Thesaurus (sample entry)
Gazetteers & Thesauri
[Figure: mediator architecture. Diagram labels: Mediator, Wrapper, DataSource, Reference Gazetteer, External Gazetteer, Local Catalogue, External Catalogue, Local DataSource]
Gazetteer Integration
• Gazetteer Integration Problem
  • how to migrate entries from gazetteer GB to gazetteer GA
[Figure: gazetteer GA (ADL Gazetteer) with thesaurus TA, and gazetteer GB (GEOnet) with thesaurus TB]
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS
Gazetteer Integration
• Elimination of duplicated entries:
  • gazetteers GA and GB may have entries that represent the same real-world features
  • use footprints to detect possible duplicates
[Figure: duplicate entries fa ≡ fb. Diagram labels: FA, FB, TA, TB, GA (ADL Gazetteer), GB (GEOnet)]
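The slide only says that footprints are used to detect possible duplicates, so the following is a minimal sketch under stated assumptions rather than the authors' procedure. It assumes each footprint is a bounding box and treats any two entries with overlapping boxes as candidate duplicates. The `Entry` class, both function names, the type labels, and the coordinates are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed footprint representation: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Entry:
    """Hypothetical gazetteer entry: unique ID, feature type, name, footprint."""
    id: str
    type: str
    name: str
    bbox: BBox


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True when the two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_duplicates(ga: List[Entry], gb: List[Entry]) -> List[Tuple[Entry, Entry]]:
    """Pair each entry of GA with every entry of GB whose footprint overlaps it."""
    return [(fa, fb) for fa in ga for fb in gb if boxes_overlap(fa.bbox, fb.bbox)]


# Made-up entries and coordinates, for illustration only.
ga = [Entry("a1", "islands", "Ilha Grande", (-44.4, -23.3, -44.0, -23.0))]
gb = [
    Entry("b1", "ISL", "Ilha Grande", (-44.38, -23.25, -44.05, -23.05)),
    Entry("b2", "BAY", "Baía de Guanabara", (-43.3, -23.0, -43.0, -22.6)),
]
pairs = candidate_duplicates(ga, gb)
print([(fa.id, fb.id) for fa, fb in pairs])
```

The nested loop compares every pair, which is fine for a sketch; a spatial index would be the usual choice for gazetteers of the sizes reported in the experiments.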
Gazetteer Integration
• Reclassification of migrated entries:
  • gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA
[Figure: mapping m( tb ) = ta from thesaurus TB to thesaurus TA. Diagram labels: TA, TB, GA (ADL Gazetteer), GB (GEOnet)]
Gazetteer Integration • Aligning terms does not work... ...
Gazetteer Integration
• Aligning term definitions is even worse...
  • (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.
  • (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.
  • (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.
Gazetteer Integration
• Formal approaches (based on DL) are hopeless...
...
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#OceanRegion" />
            <owl:Class rdf:about="#LandwaterRegion" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#LandRegion" />
</owl:Class>
...
</rdf:RDF>
SWEET (2006) The Semantic Web for Earth and Environmental Terminology (SWEET). Jet Propulsion Laboratory, California Institute of Technology. Available at: http://sweet.jpl.nasa.gov/index.html
Gazetteer Integration
• Instance-based Thesauri Mapping:
  • use duplicates to figure out how to map the classification scheme of GB to that of GA
[Figure: duplicate entries fa ≡ fb and the resulting mapping m( tb ) = ta. Diagram labels: FA, FB, TA, TB, GA (ADL Gazetteer), GB (GEOnet)]
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
• $n(t_a, t_b)$ = number of occurrences of pairs of objects $f_a$ and $f_b$ such that:
  • $f_a \in G_A$ and $f_b \in G_B$
  • $f_a \equiv f_b$
  • $t_a$ and $t_b$ are the types of $f_a$ and $f_b$, respectively
• $n(t_a)$ = the number of entries in $F_A$ classified as $t_a$
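As a rough illustration of the two counts defined on this slide, the sketch below tallies them with `collections.Counter`. The input lists and the type labels are invented for the example. How the set FA is obtained is not spelled out on the slide, so `fa_types` simply stands in for the types of its entries.

```python
from collections import Counter

# Hypothetical duplicate pairs fa ≡ fb, reduced to (type of fa, type of fb).
duplicate_types = [
    ("islands", "ISL"),
    ("islands", "ISL"),
    ("islands", "ISLS"),
    ("bays", "BAY"),
]

# Hypothetical types of the entries in FA (one item per entry).
fa_types = ["islands", "islands", "islands", "islands", "bays", "bays"]

n_pair = Counter(duplicate_types)  # n(ta, tb)
n_a = Counter(fa_types)            # n(ta)

print(n_pair[("islands", "ISL")], n_a["islands"])
```

A `Counter` returns 0 for pairs that never occur, which matches the case where two terms share no duplicate entries.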
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
• $P(t_a, t_b)$ = Mapping Rate Estimator
  • an estimate of the frequency with which the term $t_a$ maps to $t_b$, for each pair of terms $t_a \in T_A$ and $t_b \in T_B$
$$P(t_a, t_b) = \frac{n(t_a, t_b) + \Delta}{n(t_a) + 1}, \qquad \text{where } \Delta = \frac{1}{|T_B|}$$
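A small numeric sketch may help here. It assumes the estimator on this slide reads P(ta, tb) = (n(ta, tb) + Δ) / (n(ta) + 1) with Δ = 1 / |TB|, which is a reading of the slide layout rather than something stated in running text. The function name and the example counts are hypothetical; 642 is the number of TB terms reported in the experiments.

```python
def mapping_rate(n_ab: int, n_a: int, tb_size: int) -> float:
    """Mapping rate estimate under the assumed reading of the slide.

    n_ab    -- n(ta, tb), duplicate pairs typed (ta, tb)
    n_a     -- n(ta), entries of FA typed ta
    tb_size -- |TB|, number of terms in thesaurus TB
    """
    delta = 1.0 / tb_size
    return (n_ab + delta) / (n_a + 1)


# With no evidence at all, the estimate falls back to 1 / |TB|.
print(mapping_rate(0, 0, 642))

# Invented counts: 3 of 4 entries typed ta have duplicates typed tb.
print(mapping_rate(3, 4, 642))
```

Under this reading, the Δ and +1 terms keep the estimate defined when n(ta) is zero and pull small-sample estimates towards 1 / |TB|.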
Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
• $\tau$ = Threshold Mapping Rate
• $m(t_b) = t_a$ iff $P(t_a, t_b) \geq \tau$
• Problem: what is the value of $\tau$?
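The thresholding rule can be sketched as follows. Two points are assumptions, because the symbols did not survive in the slide text: the comparison is taken to be "greater than or equal to", and when several TA terms pass the threshold for the same tb, the one with the highest rate is chosen. The function name, the rate table, and the type labels are hypothetical; 0.4 is the threshold reported in the testing step.

```python
from typing import Dict, Optional, Tuple


def map_term(tb: str, rates: Dict[Tuple[str, str], float], threshold: float) -> Optional[str]:
    """Return the TA term that tb maps to, or None if no rate reaches the threshold."""
    # Keep only the TA terms whose estimated rate for this tb passes the threshold.
    passing = {ta: p for (ta, b), p in rates.items() if b == tb and p >= threshold}
    if not passing:
        return None
    # Assumed tie-breaking rule: prefer the highest estimated rate.
    return max(passing, key=passing.get)


# Invented mapping rates, keyed by (ta, tb).
rates = {
    ("islands", "ISL"): 0.6,
    ("reefs", "ISL"): 0.1,
    ("bays", "BAY"): 0.3,
}

print(map_term("ISL", rates, 0.4))
print(map_term("BAY", rates, 0.4))
```

Returning `None` leaves a term unmapped instead of forcing a weak alignment, which is one plausible way to handle terms with too little evidence.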
Experiments with Geographic Data
Data collection
• ADL Gazetteer (ADL Feature Type Thesaurus, TA)
  • Instances: 16783
  • Thesaurus terms: 210
• GEOnet Names Server (GEOnet Thesaurus, TB)
  • Instances: 87608
  • Thesaurus terms: 642
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS
Experiments with Geographic Data
Model Evaluation & Test
• The collected data was partitioned into 7 datasets
  • 6 for tuning
  • 1 for testing
Experiments with Geographic Data
[Figure: 6-fold cross-validation over the collected data, with the testing set held out. Panel labels: Training Set (Tk), Validation Set (Vk), Validation Step, Estimated Threshold Mapping Rate]
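The slides name the pieces of the tuning procedure (six tuning sets, a training set Tk and a validation set Vk per fold, an estimated threshold) but not how they are combined. The sketch below shows one conventional way to wire them together, as an assumption: each fold holds out one tuning set for validation, trains on the other five, and the candidate threshold with the best average score is kept. The `score` callable is supplied by the caller and is hypothetical, as are all names and the toy data.

```python
from statistics import mean
from typing import Callable, List, Sequence, TypeVar

Item = TypeVar("Item")


def tune_threshold(
    folds: Sequence[List[Item]],
    candidates: Sequence[float],
    score: Callable[[List[Item], List[Item], float], float],
) -> float:
    """Pick the candidate threshold with the best mean validation score."""

    def average_score(threshold: float) -> float:
        scores = []
        for k, validation in enumerate(folds):
            # Training set Tk: every tuning set except the k-th one.
            training = [item for j, fold in enumerate(folds) if j != k for item in fold]
            # Validation set Vk: the k-th tuning set.
            scores.append(score(training, validation, threshold))
        return mean(scores)

    return max(candidates, key=average_score)


# Toy data and a toy score, used only to exercise the loop.
toy_folds = [[1], [2], [3], [4], [5], [6]]


def toy_score(training: List[int], validation: List[int], threshold: float) -> float:
    return -abs(threshold - 0.5)


best = tune_threshold(toy_folds, [0.25, 0.5, 0.75], toy_score)
print(best)
```

In a real run, `score` would estimate mapping rates from the training pairs and compare the proposed alignments against the validation pairs; the held-out testing set would be used once, with the chosen threshold.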
Experiments with Geographic Data
Testing Step
• Threshold: 0.4
• Legend:
  • C: correct term alignments
  • P: proposed term alignments
• Example: aligned terms ...
Conclusions
• Conclusions:
  • duplicates help reclassification!
  • a “semantic approach” may work when “syntactic approaches” fail (badly)
• If you buy the idea, you also get...
  • a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)
  • a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)
• (Gazetteer for the Brazilian territory:
  • extracted from the ADL Gazetteer
  • entries classified according to 4 different (aligned) schemes
  • encapsulated by Web services)