Semantic Web Service Interoperability for Geospatial Decision Making (FP6-026514). Using Term-matching Algorithms for the Annotation of Geo-services. Miha Grčar (1), Eva Klien (2). (1) Jožef Stefan Institute, Slovenia; (2) Institute for Geoinformatics, Germany.
Introduction and motivation
• Geo-data
  • Provided by geo-services
  • Information about geographical features such as rivers, lakes, roads, quarries, geological structure…
• Geo-services
  • Web-based services
  • Defined by the Open Geospatial Consortium (OGC)
  • Web Feature Services (WFS)
  • Spatial filtering
  • Common interface (syntactically…)
  • HTTP/XML-based
• Semantic incompatibility (interoperability issue): this is what we are trying to solve
  • Synonymy (e.g. “Aegirite” and “Acmite” are the same mineral)
  • Data structured differently
  • Multilinguality (e.g. “river” and “fleuve” are the same thing)
• European project SWING: Semantic Web Service Interoperability for Geospatial Decision Making
  • STREP in the 6th Framework Programme
  • http://www.swing-project.org/
Outline of the talk
• Geo-service annotation
• Automating the annotation
  • Text mining
  • Web as the source of documents
• Evaluation
  • Preliminary evaluation
  • Larger-scale evaluation
• Conclusions and future work
Geo-service annotation
[Diagram labels: Domain ontology (axiomatized concept definitions that capture a specific view on the world); Web Feature Service; Real-world entities; Spatial information objects; arrows labelled “Represent”]
• Facilitates discovery and composition
• How to establish this “bridge”?
Geo-service annotation
[Figure labels: Domain ontology; WFS]
Automating the annotation
• Term matching is the main building block
• Using text mining techniques for term matching
  • Bag-of-words representation of documents, document similarity
  • Clustering and classification
  • Visualization techniques
• Using the Web as the source of documents for text mining
  • Search engines
  • On-line encyclopedias
  • Dictionaries, thesauruses…
Automating the annotation
[Diagram: a term from the geo-service schema (e.g. “open-pit mine”) and domain ontology concepts (e.g. D:Quarry, D:Legislation) are placed in a bag-of-words space; a classifier assesses the similarity between them.]
• Where do we get these documents?
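The bag-of-words similarity named on this slide can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes raw term counts and cosine similarity, and the function names and the three sample texts are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its words (a raw term-frequency vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical one-sentence "documents"; real ones would come from the Web.
open_pit_mine = bag_of_words(
    "an open-pit mine is a surface excavation for extracting rock and minerals")
quarry = bag_of_words(
    "a quarry is a surface excavation from which rock and stone are extracted")
legislation = bag_of_words(
    "legislation is a body of law enacted by a parliament")

# The schema term should land closer to D:Quarry than to D:Legislation.
print(cosine_similarity(open_pit_mine, quarry))
print(cosine_similarity(open_pit_mine, legislation))
```

Raw counts let frequent function words such as “a” and “is” contribute to the score; a weighting scheme such as TF-IDF is the usual remedy, but the slides do not say which weighting was used.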
One possible source of the documents
[Diagram labels: Search term; Context; Documents]
Preliminary evaluation
• Dataset: 150 mineral names together with their synonyms
• Train a classifier to distinguish between mineral names
[Diagram: documents for each mineral name (Aegirite, Alalite, Allanite, …, Zincblende, Zinc-spinel, Zinc vitriol) are collected from the Web and used to train the classifier. Documents for a synonym are then classified, in the example as Diopside, and the candidate names are sorted and recommended to the user.]
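The “sort and recommend” step can be illustrated with a small centroid-based ranker. This is a sketch under stated assumptions: the slides do not give the exact procedure, so the choice to compare the centroid of the query documents with the centroid of each name's documents is an assumption, and the document snippets are invented stand-ins for Web search results.

```python
import math
from collections import Counter


def unit_vector(text):
    """Bag-of-words vector of one non-empty document, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def centroid(documents):
    """Sum the unit vectors of the documents and rescale to unit length."""
    total = Counter()
    for doc in documents:
        for word, weight in unit_vector(doc).items():
            total[word] += weight
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {word: w / norm for word, w in total.items()}


def rank_labels(query_documents, training_documents):
    """Return (label, cosine similarity) pairs, most similar label first."""
    query = centroid(query_documents)
    scores = []
    for label, docs in training_documents.items():
        label_centroid = centroid(docs)
        score = sum(w * label_centroid.get(word, 0.0) for word, w in query.items())
        scores.append((label, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented snippets standing in for documents retrieved from the Web.
training = {
    "Aegirite": ["sodium iron silicate pyroxene dark green crystals",
                 "pyroxene mineral rich in sodium and iron"],
    "Diopside": ["calcium magnesium silicate pyroxene pale green crystals",
                 "pyroxene mineral rich in calcium and magnesium"],
    "Zincblende": ["zinc sulfide ore mineral",
                   "main ore of zinc sulfide crystals"],
}
acmite_documents = ["acmite is a sodium iron pyroxene",
                    "acmite forms dark crystals rich in iron"]

for label, score in rank_labels(acmite_documents, training):
    print(label, round(score, 3))
```

With these toy snippets the synonym “Acmite” is ranked closest to “Aegirite”, which is the kind of ordered recommendation list the user would be shown.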
Preliminary evaluation
[Chart: results of the preliminary evaluation; axis label: Sort order]
Larger-scale evaluation
• Datasets
  • STINET Thesaurus (STINET = Scientific and Technical Information Network)
    • 16,000 terms interlinked with broader-than, narrower-than, used-in-combination-for, used-alone-for… (2 more)
    • We took 1,000 term-pairs for each of the narrower-than and used-alone-for relations
  • GEMET (General Multilingual Environmental Thesaurus)
    • 6,000 terms interlinked with broader-than and related-to
    • We took 1,000 term-pairs for each of the two relations
  • Tourism ontology
    • 710 concepts interlinked with is-a
    • A set of instances (mostly named entities) belonging to the concepts
    • We took 1,000 named entities and their corresponding concepts, and the entire structure defined by the is-a relation
  • WordNet (lexical database for the English language)
    • 115,000 synsets (i.e. sets of synonymous words) interlinked with hypernymy, meronymy, entailment, cause for verbs… (6 more)
    • We took 1,000 word-pairs for each of 9 selected relations
    • We also considered the inverted relations for 3 selected relations (e.g. consists-of is the inverse of part-of)
Larger-scale evaluation
• Examples
  • GEMET
    • traffic infrastructure broader-than road network
    • mineral resource related-to mineral deposit
  • STINET
    • numerical methods and procedures used-alone-for gauss-seidel method
    • potassium narrower-than alkali metals
  • Tourism ontology
    • gliding field is-a sports institution
    • Warsaw instance-of city
  • WordNet
    • do drugs causes trip out
    • snore entails sleep
    • modify hypernym-of Europeanize
    • Cretaceous period instance-of geological period
    • shuffling meronym-of card game
    • rum meronym-of rum cocktail
    • housewife synonym-for homemaker
Larger-scale evaluation
• Experimental setting
  • Classification algorithm
    • k-NN
    • Centroid classifier
  • Quotes
    • Yes: exact occurrence
    • No: co-occurrence
• We ran experiments on 18 datasets, with 4 different settings on each dataset; this means roughly 4 × 18,000 term-pairs altogether
• We measured accuracy on the top 1, 3, 5, 10, 20, 40 “recommended” items
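Two parts of this setting lend themselves to a short sketch: the “quotes” switch and accuracy over the top N recommended items. Both are readings of the slide rather than confirmed details: the code assumes that “quotes” means wrapping the term in quotation marks in the search query, and that a term-pair counts as correct at N when the expected term appears among the first N recommendations. The function names and the sample rankings are hypothetical.

```python
def build_query(term, quotes):
    """Build a search-engine query for a term.

    With quotes the words must occur as an exact phrase; without them
    they only need to co-occur in the document.
    """
    return f'"{term}"' if quotes else term


def top_n_accuracy(rankings, expected, n):
    """Fraction of terms whose expected match is among the top n recommendations."""
    hits = sum(1 for term, ranked in rankings.items() if expected[term] in ranked[:n])
    return hits / len(rankings)


# Hypothetical recommendation lists (best first) for three query terms.
rankings = {
    "acmite": ["Diopside", "Aegirite", "Zincblende"],
    "open-pit mine": ["Quarry", "Legislation", "Road"],
    "fleuve": ["Lake", "Road", "River"],
}
expected = {"acmite": "Aegirite", "open-pit mine": "Quarry", "fleuve": "River"}

print(build_query("alkali metals", quotes=True))
print(build_query("alkali metals", quotes=False))
for n in (1, 2, 3):
    print(n, top_n_accuracy(rankings, expected, n))
```

Measured this way, accuracy can only stay the same or grow as N increases, which is why reporting it at several cut-offs shows how far down the list a user would have to look.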
Larger-scale evaluation: Synonymy… Meronymy… Verbs…
[Charts; axis label: Sort order. Series: GEMET related-to; WordNet synonymy; WordNet entailment, hypernymy; WordNet part meronymy; WordNet cause for verbs; WordNet substance meronymy; WordNet member meronymy; STINET used-alone-for]
Larger-scale evaluation: Hyper-/Hyponymy… Class membership (instance-of)...
[Charts. Hyper-/hyponymy series: WordNet hypernymy; STINET narrower-than; Tourism ontology is-a; GEMET broader-than. Class membership series: WordNet; Tourism ontology]
Conclusions
• The term’s lexical category (e.g. verb vs. noun) has the largest impact on the accuracy
• The dataset has a [much] larger impact on the accuracy than the choice of the classifier
  • General vs. specific vocabulary (works better for specific vocabulary or named entities)
  • Semantics of the relation (works best for synonymy)
• The centroid classifier is faster and [slightly] more accurate
• Quotes are useful on datasets that contain [technical] expressions (e.g. STINET)
• Inverting the relation has no major impact on the results
Future work
• Try SVM
• “Clean up” the document sets
  • Active learning
  • Clustering, removing irrelevant clusters
  • Both techniques require interaction with the user
• Visualize the “term space”
  • Latent Semantic Analysis (LSA), Multi-Dimensional Scaling (MDS)
  • Force-directed layout
• Use WordNet to infer relations between arbitrary words
  • Input: two words
  • Process: detect the corresponding synsets and explore inter-relations
  • Output: most probable relations (according to WordNet)
• Deal with the multilinguality issue
  • Kernel Canonical Correlation Analysis (KCCA)
  • Machine translation
Thank you...
• ...for your attention
• Any questions?