Geographical Information Retrieval Instituto Superior Técnico - INESC-ID Data Management and Information Retrieval Group (DMIR) - TagusPark By Bruno Martins (bgmartins@gmail.com)
Motivation for Geographic IR • Geo-information associates things and events with places. • Geo-information is abundant on the Web and in digital libraries. • Collections of geo-referenced photographs. • Newsfeeds. • General databases of geo-referenced information. • Around 80% of Web pages contain references to places. • Many information needs are related to a given geographical context. • Find me the nearest restaurants. • Find me news about Lisboa. • Find me photographs taken in Sintra. • ... • Around 20% of Web searches are “local” in nature. • Geographic information is part of our everyday lives!
Existing Geographical IR Systems • Web search engines with “local search” • Yahoo! Local, Google Local, ... • Integration with navigation mechanisms. • Mostly exploit “Yellow-pages” information. • Web-based GIS platforms (virtual globes) • Google Earth, ... • Exploit databases of georeferenced information. • OGC standards for Web-GIS. • Photo repositories with “local search” • Flickr geo-tagging interface, ... • Exploit automatic “GPS” geo-referencing. • Many more location-based services • Advertisement, discussion communities, ... • Location is everywhere in information systems.
Challenges for Geographical IR • Very few systems exploit information on the Web directly. • They instead use databases of georeferenced information. • Geographic context is embedded in natural language descriptions. • This presents problems for automated processing. • Place names are ambiguous and get confused with names of organizations, people, buildings and streets. • Web queries depend on exact matching of text terms. • Handling structured queries (e.g. “concept, relation, location”). • Intelligent interpretation of spatial relationships (“near”, “west”, etc.). • Ranking results against some measure of geographic relevance.
Geographical Information Retrieval (GIR) • Geographic information retrieval (GIR) is concerned with the retrieval of geographically referenced information objects. • Information objects can be maps, images, digital geographic data or even textual (web) documents. • New multidisciplinary field • Combines techniques from database systems, information retrieval, digital libraries, user interfaces, geographical information systems, ... [Figure labels: Geographic IR; Knowledge Management; Geographic Information Systems; Information Retrieval]
The difference between GIR and GIS • GIS is concerned with exact spatial representations and complex analysis at the level of the individual spatial object or field. • Users are experts, information is structured and unambiguous! • GIR is concerned with retrieving geo-referenced information resources that may be relevant to a geographic query region. • Unstructured and ambiguous information, everyday applications! • Similar to the difference between search engines and relational database systems!
Geo-referencing and GIR [Figure: coordinate axes X and Y] • Information objects can be geo-referenced either by place names or by geographic coordinates (i.e. longitude and latitude). • Geographic coordinates represent an exact physical location. • Place names are ambiguous (the main problem of GIR). • Spatial relations may be either: • Geometric: distance and direction measured on a continuous scale. • Topological: spatially related but not directly measurable.
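The two kinds of spatial relation can be illustrated with a minimal Python sketch. It assumes points are given as (latitude, longitude) in degrees and that footprints are axis-aligned bounding boxes; the function names, the spherical Earth radius and the box layout are choices made for this example, not something prescribed by the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Geometric relation: great-circle distance in km between two points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Geometric relation: direction from point 1 to point 2 (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def bbox_contains(outer, inner):
    """Topological relation: is `inner` inside `outer`?

    Boxes are (min_lon, min_lat, max_lon, max_lat); the answer is yes/no,
    not a value on a continuous scale.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])
```

Distance and bearing answer questions such as "near" or "west of", while containment answers "in" without any measurement.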
Anatomy of a Geographical IR System [Architecture diagram. Components: User Interface; Mapping; Ontology (a.k.a. Gazetteer); Query disambiguation; Broker; Search Engine; Relevance Ranking; Geo-tagging; Text Indexing; Textual and Spatial Indexes; Document Footprints; Info. Resources. Flow labels: Query footprint; Search Request + Query footprint; Unranked Results; Ranked Results.]
Gazetteers / Geographic Ontology • Database containing place names, the spatial relationships among them and the associated geographical footprints. • Supports geo-referencing based on the place names found in text. • Many problems in using traditional gazetteers for GIR.
Roles of the Gazetteer in GIR [Diagram. Labels: gazetteer; document collection; Metadata Extraction; Geo-Tagging; document footprints; User Interface; Query Disambiguation; Query Expansion; Search Component (query footprint); Spatial Index; Relevance Ranking.]
Challenges to using Gazetteers in GIR • To be useful in GIR the gazetteer should support • Different locations and boundary changes, integrating data from multiple sources. • Synonymous and variant names with differing locations for the same entity. • Different relationships among concepts. • Names in multiple languages. • “Fuzzy” regions and intra-urban place names. • More than gazetteers, we need an ontology!
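A minimal Python sketch of a gazetteer record that covers some of these requirements (multilingual and variant names, relationships between places, a footprint) could look as follows. The `Place` and `Gazetteer` classes, their field names and the rough bounding boxes are assumptions made for illustration; they do not reproduce the schema of any gazetteer named in this talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


@dataclass
class Place:
    place_id: str
    names: Dict[str, List[str]]          # language code -> names and variants
    place_type: str                      # e.g. "city", "country"
    footprint: BBox                      # rough bounding box, illustrative only
    part_of: Optional[str] = None        # id of the parent place, if any
    adjacent_to: List[str] = field(default_factory=list)


class Gazetteer:
    def __init__(self) -> None:
        self._places: Dict[str, Place] = {}
        self._by_name: Dict[str, List[str]] = {}

    def add(self, place: Place) -> None:
        self._places[place.place_id] = place
        for names in place.names.values():
            for name in names:
                ids = self._by_name.setdefault(name.casefold(), [])
                if place.place_id not in ids:
                    ids.append(place.place_id)

    def lookup(self, name: str) -> List[Place]:
        """Return every place known under this name (names are ambiguous)."""
        return [self._places[i] for i in self._by_name.get(name.casefold(), [])]

    def ancestors(self, place_id: str) -> List[str]:
        """Follow part_of links upwards, nearest parent first."""
        chain: List[str] = []
        current = self._places[place_id].part_of
        while current is not None and current in self._places and current not in chain:
            chain.append(current)
            current = self._places[current].part_of
        return chain


# Hypothetical sample data with approximate footprints.
gaz = Gazetteer()
gaz.add(Place("pt", {"pt": ["Portugal"], "en": ["Portugal"]}, "country",
              (-9.6, 36.9, -6.1, 42.2)))
gaz.add(Place("lisboa", {"pt": ["Lisboa"], "en": ["Lisbon"]}, "city",
              (-9.23, 38.69, -9.09, 38.80), part_of="pt"))
```

`lookup` returns a list on purpose: the same name may map to several places, which is exactly the ambiguity that later steps have to resolve.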
Existing Gazetteer Systems/Services • Alexandria Digital Library (ADL) Gazetteer. • ~6 million entries • Has tried to standardize the format, description, and distribution of gazetteer data. • Has a published, detailed schema. • Basis for OGC standard. • Geonames website. • Integrates information from multiple sources. • Publishes OWL ontology. • ~6 million entries • EuroGeoNames project.
GeoTagging = GeoParsing + GeoCoding • Geo-parsing: recognizing geographic references, ignoring non-geographic uses of place terminology. • Geo-coding: attaching a unique quantitative location (footprint) to the extracted geographic references.
GeoParsing Textual Documents • The presence of place names can be recognised with the help of gazetteers/geo-ontologies (i.e. lists of names). • Some types of place references found in text: • The name of the place: Coimbra • An address: INESC-ID, Rua Alves Redol, 9 Lisboa • An address fragment: “Manuel lived near Largo do Rato in Lisboa” • A postcode / zip code: 2840-137 • A phone number: most Lisbon phone numbers start with +351 21
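As a rough illustration of list-based recognition, the Python sketch below finds place names from a given name list (longest names first, so that "Largo do Rato" wins over "Rato") and postcodes shaped like the 2840-137 example. The function name, the tuple layout and the postcode pattern are assumptions for this example; addresses and phone numbers are not handled.

```python
import re

# Pattern for postcodes shaped like the example 2840-137 (four digits, dash, three digits).
POSTCODE_RE = re.compile(r"\b\d{4}-\d{3}\b")


def find_place_mentions(text, place_names):
    """Return (start offset, matched text, kind) tuples, ordered by position."""
    mentions = []
    taken = []  # character spans already claimed by a longer name
    for name in sorted(place_names, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        for m in pattern.finditer(text):
            span = (m.start(), m.end())
            if any(s < span[1] and span[0] < e for s, e in taken):
                continue  # overlaps a longer match found earlier
            taken.append(span)
            mentions.append((m.start(), m.group(0), "name"))
    for m in POSTCODE_RE.finditer(text):
        mentions.append((m.start(), m.group(0), "postcode"))
    return sorted(mentions)


if __name__ == "__main__":
    sample = "Manuel lived near Largo do Rato in Lisboa, postcode 2840-137."
    for mention in find_place_mentions(sample, ["Lisboa", "Rato", "Largo do Rato"]):
        print(mention)
```

This step only spots candidates; deciding whether a candidate really refers to a place is a separate problem.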
Ambiguity in GeoParsing Documents • Examples of false place references: • Personal names: Smedes York, Jack London. • Business names: Dorchester Hotel, York Properties, ... • Street names: Oxford Street, London Road, ... • Common words: bath, battle, derby, over, well, ... • Approach for handling ambiguity: • Look for patterns in the surrounding context. • One reference per discourse.
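Looking for patterns in the surrounding context can be sketched with a few hand-written rules in Python. The cue-word lists and the rule order below are assumptions chosen to cover the examples on this slide; a real system would need far richer patterns or a trained classifier.

```python
import re

# Hypothetical cue lists, chosen only to cover the slide's examples.
NON_GEO_NEXT_WORDS = {"street", "road", "hotel", "properties"}
GEO_CUE_WORDS = {"in", "near", "at", "from"}


def looks_geographic(text: str, start: int, end: int) -> bool:
    """Guess whether the candidate text[start:end] is used as a place reference."""
    mention = text[start:end]
    before = re.findall(r"\w+", text[:start])
    after = re.findall(r"\w+", text[end:])
    prev_word = before[-1] if before else ""
    next_word = after[0] if after else ""
    if not mention[:1].isupper():
        return False  # lower-case use of a common word, e.g. "bath"
    if next_word.lower() in NON_GEO_NEXT_WORDS:
        return False  # street or business name, e.g. "Oxford Street"
    if prev_word.lower() in GEO_CUE_WORDS:
        return True   # spatial preposition, e.g. "in London"
    if prev_word[:1].isupper():
        return False  # likely part of a longer name, e.g. "Jack London"
    return True
```

The rules are deliberately crude: a capitalised word at the end of the previous sentence, for instance, would wrongly suppress a valid place name.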
GeoCoding place references in text • Many different places share the same name (referent ambiguity): Newport, Cambridge, Springfield, Lisboa, ... • Use context to decide: references to parent or nearby places. • Choose the most important one: by population or place type. • Optional step taken by some GIR approaches: • Finding a document’s encompassing geographic scope. • Combine all place references given in the document. • Use heuristics to guide the process.
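The two disambiguation heuristics and the optional scope step can be sketched in a few lines of Python. The `Candidate` class, the function names and the population figures are hypothetical; the scope rule used here (the most frequent parent region) is just one simple heuristic among many possible ones.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Candidate:
    name: str
    parent: str       # name of the containing region
    population: int   # made-up figures, for illustration only


def resolve(candidates: List[Candidate], context_places: Set[str]) -> Candidate:
    """Prefer candidates whose parent is mentioned in the text; break ties by population."""
    supported = [c for c in candidates if c.parent in context_places]
    pool = supported or candidates
    return max(pool, key=lambda c: c.population)


def document_scope(resolved: List[Candidate]) -> str:
    """Pick the parent region shared by most resolved references."""
    return Counter(c.parent for c in resolved).most_common(1)[0][0]


if __name__ == "__main__":
    springfields = [
        Candidate("Springfield", "Illinois", 110_000),       # made-up population
        Candidate("Springfield", "Massachusetts", 150_000),  # made-up population
    ]
    print(resolve(springfields, {"Illinois"}).parent)  # context wins over population
    print(resolve(springfields, set()).parent)         # falls back to population
```

Context is checked first because a mention of the parent region is usually stronger evidence than the size of the place.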
Document Indexing for Geographic IR • Different indexing strategies are possible: • Index documents based on gazetteer IDs. • Use document scopes to create document footprints (point, bounding rectangle, ...) and use the footprints to index documents. • Strategy for handling queries: • Convert the query to a query footprint/gazetteer ID. • Match the query footprint to document footprints/IDs. • Rank documents according to “relevance”.
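A bounding-rectangle footprint and the footprint matching step can be sketched as follows in Python. Points are assumed to be (longitude, latitude) pairs and boxes (min_lon, min_lat, max_lon, max_lat); both conventions and the function names are choices made for this example.

```python
def bounding_box(points):
    """Smallest axis-aligned rectangle covering all (lon, lat) points of a document."""
    lons = [lon for lon, _lat in points]
    lats = [lat for _lon, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


def intersects(a, b):
    """True when two boxes share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


if __name__ == "__main__":
    # Approximate coordinates for two places mentioned in one document.
    doc_footprint = bounding_box([(-9.14, 38.72), (-9.39, 38.80)])
    query_footprint = (-9.5, 38.6, -9.0, 38.9)
    print(doc_footprint, intersects(doc_footprint, query_footprint))
```

A single place reference yields a degenerate box (a point), which the same intersection test handles without special cases.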
Data structures for indexing in GIR • Typical strategy is to have separate indexes. • Inverted index for text. • R-tree for footprints. • Access spatial index with query footprint/gazetteer id. • Access text index with query terms. • Merge results and find the intersection.
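The separate-index strategy can be sketched in Python as below. To stay dependency-free, the spatial side is a plain linear scan over bounding boxes standing in for an R-tree; the class and method names are hypothetical, and the text side uses naive whitespace tokenisation.

```python
from collections import defaultdict


def _intersects(a, b):
    """Boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class SpatioTextualIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of documents containing it
        self._footprints = {}              # doc id -> bounding box

    def add(self, doc_id, text, footprint):
        for term in text.lower().split():
            self._postings[term].add(doc_id)
        self._footprints[doc_id] = footprint

    def text_search(self, terms):
        """Documents containing all query terms (inverted index lookup)."""
        sets = [self._postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def spatial_search(self, query_box):
        """Documents whose footprint overlaps the query footprint.

        Linear scan used here in place of an R-tree.
        """
        return {d for d, fp in self._footprints.items() if _intersects(fp, query_box)}

    def search(self, terms, query_box):
        """Merge step: intersection of the textual and spatial result sets."""
        return self.text_search(terms) & self.spatial_search(query_box)
```

Keeping the two indexes separate means either side can be replaced on its own, for example by swapping the linear scan for a real R-tree library.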
Ranking search results in GIR • Spatial similarity can indicate relevance. • Documents whose spatial content is more similar to the spatial content of the query should appear first. • But we need to consider both: • Thematic relevance: BM25, TF-IDF, ... • Geographic relevance: proximity, containment, ... • Geometric (e.g. distance) and non-geometric (e.g. topology). • Other importance metrics: PageRank. • The state of the art combines these scores linearly.
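A linear combination of thematic and geographic relevance can be sketched in Python as follows. The sketch assumes the text score (for example a normalised BM25 value) is already scaled to [0, 1], and it uses the overlap ratio of bounding boxes as the geographic score; the weight `alpha`, the function names and that choice of geographic measure are all assumptions for illustration.

```python
def box_area(b):
    """Area of a (min_x, min_y, max_x, max_y) box; zero for degenerate boxes."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def geo_similarity(doc_box, query_box):
    """Overlap area divided by the area of the union, in [0, 1]."""
    ix0, iy0 = max(doc_box[0], query_box[0]), max(doc_box[1], query_box[1])
    ix1, iy1 = min(doc_box[2], query_box[2]), min(doc_box[3], query_box[3])
    if ix0 >= ix1 or iy0 >= iy1:
        return 0.0
    inter = (ix1 - ix0) * (iy1 - iy0)
    union = box_area(doc_box) + box_area(query_box) - inter
    return inter / union if union > 0 else 0.0


def rank(docs, query_box, alpha=0.5):
    """docs: iterable of (doc_id, text_score, footprint); best document first.

    score = alpha * text_score + (1 - alpha) * geo_similarity
    """
    scored = [
        (alpha * text_score + (1 - alpha) * geo_similarity(box, query_box), doc_id)
        for doc_id, text_score, box in docs
    ]
    return [doc_id for _score, doc_id in sorted(scored, reverse=True)]
```

Setting `alpha` to 1.0 gives a purely textual ranking and 0.0 a purely geographic one. Point footprints have zero area, so a distance-based measure would suit them better than this overlap ratio.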
Existing GIR systems: MetaCarta • The MetaCarta system: • Pioneer system addressing all aspects given in this talk. • Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores. • Uses Natural Language Processing (NLP) to find possible location references. • Contains a gazetteer of ~14 million entries.
Other GIR Systems: Research projects • Prototype system from the SPIRIT EU project • Spatially-aware information retrieval on the Internet. • Geo-tagging of Web documents based on a geo-ontology. • Alexandria Digital Library • Digital library of geo-referenced materials. • Focus on development of a large gazetteer. • GREASE, GIPSY, Web-a-Where, GeoXWalk, ... • Many more research projects addressing GIR aspects individually. • GeoCLEF evaluation contest, similar to TREC. • Project DIGMAP under development at IST • Digital library for old maps and historical cartography resources. • Indexing metadata records for geographic retrieval.
Current Challenges in Geographic IR • Improve “conventional GIR” components and methods • Geo-tagging, spatio-textual indexing and geo-relevance ranking. • Improved understanding of spatial natural language terminology. • Principled approaches for integration and evaluation of GIR. • Better user interfaces for exploration of GIR results. • Integration of geographical with temporal aspects. • Everything we do happens in space and time! • Creation of rich place ontologies with world-wide coverage. • Fuzzy regions and intra-urban placenames present challenges • Open GeoInformation Web services and Geospatial Semantic Web.
Where To Find More Information • Georeferencing: The Geographic Associations of Information • By Linda L. Hill, MIT Press • Proceedings of the Workshops on Geographical IR • Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon) • Contact me by email at bgmartins@gmail.com