Using Term-matching Algorithms for the Annotation of Geo-services

Semantic Web Service Interoperability for Geospatial Decision Making (FP6-026514) Using Term-matching Algorithms for the Annotation of Geo-services Miha Grčar¹, Eva Klien² ¹Jožef Stefan Institute, Slovenia ²Institute for Geoinformatics, Germany

Introduction and motivation • Geo-data • Provided by geo-services • Information about geographical features such as rivers, lakes, roads, quarries, geological structure… • Geo-services • Web-based services • Defined by Open GIS Consortium (OGC) • Web Feature Services (WFS) • Spatial filtering • Common interface (syntactically…) • HTTP/XML-based • Semantic incompatibility (interoperability issue): this is what we are trying to solve • Synonymy (e.g. “Aegirite” and “Acmite” are the same mineral) • Data structured differently • Multilinguality (e.g. “river” and “fleuve” denote the same thing) • European project SWING – Semantic Web Service Interoperability for Geospatial Decision Making • STREP in the 6th Framework Programme • http://www.swing-project.org/

Outline of the talk • Geo-service annotation • Automating the annotation • Text mining • Web as the source of documents • Evaluation • Preliminary evaluation • Larger-scale evaluation • Conclusions and future work

Geo-service annotation • [Diagram: a domain ontology (axiomatized concept definitions that capture a specific view on the world) and a Web Feature Service, linked by the annotation; further labels: represent, real-world entities, spatial information objects] • Annotation facilitates discovery and composition • How to establish this “bridge”?

Geo-service annotation • [Diagram labels: domain ontology, WFS]

Automating the annotation • Term matching is the main building block • Using text mining techniques for term matching • Bag-of-words representation of documents, document similarity • Clustering and classification • Visualization techniques • Using the Web as the source of documents for text mining • Search engines • On-line encyclopedias • Dictionaries, thesauruses…
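The bag-of-words representation and document similarity mentioned on this slide can be illustrated with a minimal sketch. It assumes raw term counts and cosine similarity; the slides do not state which term weighting or similarity measure was actually used, and the three text snippets are invented stand-ins for documents retrieved from the Web.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets (assumption: not taken from the original study).
quarry = bag_of_words("A quarry is an open pit from which stone is extracted")
mine = bag_of_words("An open pit mine extracts rock or minerals from the ground")
law = bag_of_words("Legislation covers statutes enacted by parliament")

# The two mining-related snippets share terms, the legal one shares none.
print(cosine_similarity(quarry, mine) > cosine_similarity(quarry, law))
```

A practical system would probably also remove stop words and apply a weighting such as TF-IDF, so that function words like "an" and "from" do not drive the similarity score.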

Automating the annotation • [Diagram: a term from the geo-service schema (“open-pit mine”) and domain ontology entries (labels D:Quarry, D:Legislation) are placed in a bag-of-words space, where a classifier assesses their similarity] • Where do we get these documents?

One possible source of the documents • [Screenshot labels: context, search term, documents]
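As a rough illustration of how a search term and an optional context might be turned into a search-engine query, consider the sketch below. The function `build_query` and its parameters are hypothetical; the slides do not show how queries were actually composed. The `quotes` flag mirrors the "Quotes" setting from the experimental-setting slide (yes: exact occurrence, no: co-occurrence) and assumes the common convention that double quotes request an exact phrase.

```python
def build_query(term, context=None, quotes=True):
    """Compose a query string from a term and an optional context word.

    quotes=True wraps the term in double quotes (exact occurrence);
    quotes=False leaves the words free to merely co-occur.
    """
    query = f'"{term}"' if quotes else term
    if context:
        # Assumption: the context is simply appended as extra keywords.
        query = f"{query} {context}"
    return query


print(build_query("open-pit mine", context="geology"))
print(build_query("open-pit mine", quotes=False))
```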

Preliminary evaluation • Dataset: 150 mineral names together with their synonyms • Train a classifier to distinguish between mineral names

Preliminary evaluation • [Diagram labels: the Web; Diopside; synonym; classifier; term list Aegirite, Alalite, Allanite, …, Zincblende, Zinc-spinel, Zinc vitriol]

Preliminary evaluation • [Diagram labels: the Web; Diopside; classifier; term list Aegirite, Alalite, Allanite, …, Zincblende, Zinc-spinel, Zinc vitriol; sort and recommend to the user]
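The "train a classifier, then sort and recommend" idea can be sketched with a small centroid classifier. This is an illustration under stated assumptions, not the authors' implementation: each class (a mineral name) is represented by the sum of its unit-length bag-of-words document vectors, classes are ranked by cosine similarity to the query documents, and all text snippets are invented. The Aegirite/Acmite pair is the synonym example from the introduction.

```python
import math
import re
from collections import Counter, defaultdict


def unit_vector(text):
    """Bag-of-words vector of the text, scaled to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def centroid(texts):
    """Sum of the unit vectors of all texts (direction of the class centroid)."""
    total = defaultdict(float)
    for text in texts:
        for term, weight in unit_vector(text).items():
            total[term] += weight
    return dict(total)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_classes(query_texts, training):
    """Return (label, score) pairs sorted from best to worst match."""
    query = centroid(query_texts)
    scores = {label: cosine(query, centroid(texts))
              for label, texts in training.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents per mineral name (assumption: invented snippets).
training = {
    "Aegirite": ["aegirite is a sodium iron pyroxene mineral",
                 "dark green aegirite crystals contain sodium and iron"],
    "Diopside": ["diopside is a calcium magnesium pyroxene mineral",
                 "pale green diopside crystals contain calcium and magnesium"],
}
# Hypothetical document retrieved for the synonym "Acmite".
ranking = rank_classes(["acmite is a sodium iron silicate"], training)
for label, score in ranking:
    print(label, round(score, 3))
```

The sorted list is what would be shown to the user as recommendations; the position of the correct mineral in that list is the quantity examined in the evaluation.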

Preliminary evaluation • [Chart; axis label: sort order]

Larger-scale evaluation • Datasets • STINET Thesaurus (STINET = Scientific and Technical Information Network) • 16,000 terms interlinked with broader-than, narrower-than, used-in-combination-for, used-alone-for… (2 more) • We took 1,000 term-pairs for each of the narrower-than and used-alone-for relations • GEMET (General Multilingual Environmental Thesaurus) • 6,000 terms interlinked with broader-than and related-to • We took 1,000 term-pairs for each of the two relations • Tourism ontology • 710 concepts interlinked with is-a • A set of instances (mostly named entities) belonging to the concepts • We took 1,000 named entities and their corresponding concepts, and the entire structure defined by the is-a relation • WordNet (lexical database for the English language) • 115,000 synsets (i.e. sets of synonymous words) interlinked with hypernymy, meronymy, entailment, cause for verbs… (6 more) • We took 1,000 word-pairs for each of 9 selected relations • We also considered the inverted relations for 3 selected relations (e.g. consists-of is the inverse of part-of)

Larger-scale evaluation • Examples • GEMET • traffic infrastructure broader-than road network • mineral resource related-to mineral deposit • STINET • numerical methods and procedures used-alone-for gauss-seidel method • potassium narrower-than alkali metals • Tourism ontology • gliding field is-a sports institution • Warsaw instance-of city • WordNet • do drugs causes trip out • snore entails sleep • modify hypernym-of Europeanize • Cretaceous period instance-of geological period • shuffling meronym-of card game • rum meronym-of rum cocktail • housewife synonym-for homemaker

Larger-scale evaluation • Experimental setting • Classification algorithm • k-NN • Centroid classifier • Quotes • Yes – exact occurrence • No – co-occurrence • We ran experiments on 18 datasets, with 4 different settings on each dataset; this means roughly 4 × 18,000 term-pairs altogether • We measured accuracy on the top 1, 3, 5, 10, 20, and 40 “recommended” items
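Measuring accuracy on the top N recommended items can be sketched as follows. The sketch assumes that a test case counts as correct when the expected term appears among the first N recommendations; the function name `accuracy_at_n` and the toy data are hypothetical and do not reproduce any result from the study.

```python
def accuracy_at_n(recommendations, correct, n):
    """Fraction of test cases whose correct label is in the top-n list.

    recommendations: one ranked list of labels per test case
    correct: the expected label for each test case, in the same order
    """
    if not correct:
        return 0.0
    hits = sum(1 for ranked, gold in zip(recommendations, correct)
               if gold in ranked[:n])
    return hits / len(correct)


# Hypothetical ranked recommendations for four test terms.
recommendations = [
    ["city", "country", "hotel"],
    ["hotel", "city", "museum"],
    ["river", "lake", "city"],
    ["museum", "city", "hotel"],
]
correct = ["city", "city", "city", "airport"]

for n in (1, 2, 3):
    print(n, accuracy_at_n(recommendations, correct, n))
```

Accuracy can only stay the same or grow as N increases, which is why reporting it at several cut-offs (1, 3, 5, 10, 20, 40) shows how far down the list a user would have to look.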

Larger-scale evaluation • [Charts: Synonymy…, Meronymy…, Verbs…; axis label: sort order; series: GEMET related-to, WordNet synonymy, WordNet entailment, hypernymy, WordNet part meronymy, WordNet cause for verbs, WordNet substance meronymy, WordNet member meronymy, STINET used-alone-for]

Larger-scale evaluation • [Charts: Hyper-/Hyponymy…, Class membership (instance-of)…; series: WordNet hypernymy, STINET narrower-than, Tourism ontology is-a, GEMET broader-than, WordNet, Tourism ontology]

Conclusions • A term’s lexical category (e.g. verb vs. noun) has the largest impact on the accuracy • The dataset has a [much] larger impact on the accuracy than the choice of the classifier • General vs. specific vocabulary (works better for specific vocabulary or named entities) • Semantics of the relation (works best for synonymy) • The centroid classifier is faster and [slightly] more accurate • Quotes are useful on datasets that contain [technical] expressions (e.g. STINET) • Inverting the relation has no major impact on the results

Future work • Try SVM • “Clean up” the document sets • Active learning • Clustering, removing irrelevant clusters • Both techniques require interaction with the user • Visualize the “term space” • Latent Semantic Analysis (LSA), Multi-Dimensional Scaling (MDS) • Force-directed layout • Use WordNet to infer relations between arbitrary words • Input: two words • Process: detect the corresponding synsets and explore inter-relations • Output: most probable relations (according to WordNet) • Deal with the multilinguality issue • Kernel Canonical Correlation Analysis (KCCA) • Machine translation

Thank you... • ...for your attention • Any questions?
