Research Paper Presentation – CS572 Summer 2011. Extracting Metadata for Spatially-Aware Information Retrieval on the Internet. Paper by Paul Clough (University of Sheffield, Western Bank). Presented by Donghee Sung.
Short Overview
• SPIRIT: spatial awareness for information systems, e.g.
- transport timetables
- routing systems for motorists
- map-based web sites
- location-based services
• Key part: extraction and use of geospatial information
Short Overview
• Criteria: speed, reliability, flexibility, multilingualism
• Geo-Parsing:
- Identifying geographic references
- Gazetteer lookup with context rules to filter out common-usage words and personal names
• Geo-Coding:
- Assigning spatial coordinates
- Based on information from a geographic resource
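The two stages above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `GAZETTEER` table, its approximate coordinates, and the single capitalisation rule used to drop common-usage words are assumptions made for illustration, and personal names are not handled here.

```python
import re

# Hypothetical toy gazetteer: lower-cased place name -> approximate (lat, lon).
GAZETTEER = {
    "sheffield": (53.38, -1.47),
    "reading": (51.45, -0.97),
    "london": (51.51, -0.13),
}


def geo_parse(text):
    """Geo-parsing: return words that match the gazetteer.

    Assumed context rule: a match must be capitalised, so the
    common-usage word "reading" is not taken for the town of Reading.
    """
    candidates = re.findall(r"[A-Za-z]+", text)
    return [w for w in candidates if w.lower() in GAZETTEER and w[0].isupper()]


def geo_code(places):
    """Geo-coding: assign each recognised place its gazetteer coordinates."""
    return {place: GAZETTEER[place.lower()] for place in places}


places = geo_parse("She was reading a guide to Sheffield and London.")
print(places)
print(geo_code(places))
```

A capitalisation rule alone misfires on sentence-initial words, which is one reason a real system would need further context rules.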
What’s the SPIRIT? < http://www.geo-spirit.org/ >
What’s the SPIRIT?
• SPIRIT: SPatially-Aware Information Retrieval on the InterneT
• A search engine to find documents and datasets on the web relating to places or regions
What’s the SPIRIT?
• Existing web search facilities are poor at finding information related to a particular location.
- Vicinity: find other places within a radius
- www.somewherenear.com
- Yellow pages services: find a specific place or post code
- Buyukkokten: associated administrators’ IP addresses with telephone area codes
- Stanford Research Institute: proposed a ‘.geo’ domain with cells defined by latitude and longitude
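A "places within a radius" search of the kind listed above can be sketched in Python with the haversine great-circle distance. This is an illustrative assumption, not how any of the named services works; the `PLACES` table and its approximate coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Hypothetical sample places with approximate (lat, lon) in degrees.
PLACES = {
    "Sheffield": (53.38, -1.47),
    "Rotherham": (53.43, -1.36),
    "London": (51.51, -0.13),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def places_within(center, radius_km, places=PLACES):
    """Names of all places no farther than radius_km from center."""
    return sorted(name for name, coord in places.items()
                  if haversine_km(center, coord) <= radius_km)


print(places_within(PLACES["Sheffield"], 20))
```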
What’s the SPIRIT?
• Resources relating to a place
- may not be found
- may not include places nearby
- may have another name
• Major shortcoming: cannot recognize
- alternative names
- modern/historical variants
- informal names
- names of contained places
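One way to address the shortcoming above is to expand a place name into its known variants and contained places before searching. The Python sketch below assumes a hypothetical hand-built `GAZETTEER`; a real system would draw these records from a geographic resource.

```python
# Hypothetical records: canonical name -> variant names and contained places.
GAZETTEER = {
    "Istanbul": {"variants": ["Constantinople", "Byzantium"], "contains": []},
    "New York City": {"variants": ["NYC", "Big Apple"],
                      "contains": ["Manhattan", "Brooklyn"]},
    "South Yorkshire": {"variants": [], "contains": ["Sheffield", "Rotherham"]},
}

# Reverse index: any known name (lower-cased) -> canonical name.
NAME_INDEX = {}
for canonical, record in GAZETTEER.items():
    for name in [canonical, *record["variants"]]:
        NAME_INDEX[name.lower()] = canonical


def expand_place(name):
    """Return the canonical name, its variants and its contained places.

    Unknown names are returned unchanged as a one-element list.
    """
    canonical = NAME_INDEX.get(name.lower())
    if canonical is None:
        return [name]
    record = GAZETTEER[canonical]
    return [canonical, *record["variants"], *record["contains"]]


print(expand_place("constantinople"))
print(expand_place("South Yorkshire"))
```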
What’s the SPIRIT?
• SPIRIT Project
- Query expansion / relevance ranking procedures
- Machine learning techniques: extraction of geographical context, generating metadata
- Multi-modal user interface: textual input, interactive map feedback
- Spatial indices for web collections
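The slides do not say which spatial index SPIRIT uses. As a generic illustration of the idea, the Python sketch below assumes a fixed latitude/longitude grid: each document is filed under the cell containing its coordinates, and a query inspects that cell plus its eight neighbours. `GridIndex`, `cell_of`, and the one-degree cell size are hypothetical names and values.

```python
from collections import defaultdict
from math import floor

CELL_DEG = 1.0  # assumed cell size in degrees


def cell_of(lat, lon, size=CELL_DEG):
    """Grid cell (row, column) that contains the point."""
    return (floor(lat / size), floor(lon / size))


class GridIndex:
    """Maps grid cells to the ids of documents located in them."""

    def __init__(self):
        self._cells = defaultdict(set)

    def add(self, doc_id, lat, lon):
        self._cells[cell_of(lat, lon)].add(doc_id)

    def query(self, lat, lon):
        """Documents in the cell containing the point and its 8 neighbours."""
        row, col = cell_of(lat, lon)
        hits = set()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                hits |= self._cells.get((row + d_row, col + d_col), set())
        return hits


index = GridIndex()
index.add("doc-sheffield", 53.38, -1.47)
index.add("doc-leeds", 53.80, -1.55)
index.add("doc-london", 51.51, -0.13)
print(sorted(index.query(53.43, -1.36)))
```

The grid keeps lookups cheap because only nine cells are inspected, at the cost of coarse results that would still need an exact distance check.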
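Data Sources
• Sources of spatial data: TGN (Getty Thesaurus of Geographic Names), OS (Ordnance Survey), SABE (Seamless Administrative Boundaries of Europe)
• A large web collection for SPIRIT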
Geo-Parsing Techniques
• Tokenization issues
- Stop-words
- Named-Entity Recognition (NER)
- Gazetteers
Geo-Parsing Techniques
• Named-Entity Recognition (NER): processing a text and identifying phrases that belong to particular categories of Named Entities (NE), such as People, Organization, Location, etc.
Geo-Parsing Techniques
• Tokenization Procedure
1) Tokenize on whitespace (Perl regular expression): @words = split(/\s+/, $sentence);
   "Isn't it ashame." -> Isn't / it / ashame.
2) Stemming / case conversion: isn't / it / asham
3) Removing stop-words
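The three steps can be sketched in Python, as a counterpart to the slide's Perl one-liner. The stop-word list and the suffix-stripping `naive_stem` are assumptions that stand in for a real stop list and a real stemmer; they are chosen only so that the slide's example sentence behaves as shown.

```python
import re

STOP_WORDS = {"it", "the", "a", "is"}  # hypothetical stop-word list


def split_whitespace(sentence):
    """Step 1: tokenize on whitespace, like Perl's split(/\\s+/, $sentence)."""
    return [w for w in re.split(r"\s+", sentence.strip()) if w]


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(words):
    """Step 2: lower-case, drop trailing punctuation, then stem."""
    return [naive_stem(w.lower().rstrip(".,;:!?")) for w in words]


def remove_stop_words(words):
    """Step 3: drop stop-words."""
    return [w for w in words if w not in STOP_WORDS]


tokens = split_whitespace("Isn't it ashame.")
print(tokens)
print(normalize(tokens))
print(remove_stop_words(normalize(tokens)))
```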
Geo-Parsing Techniques
• Default settings in indexing and retrieval
- Case sensitivity: Off
- Stop-word removal: Off
- Stemming: Off
• Stop-word removal / stemming -> reduce the size of index files
• But both can be useful:
- Stop-words: ‘in’, ‘inside’, or ‘of’ carry spatial meaning
- Stemming: yields “London” from both “London” and “Londoner”
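The effect of these switches on index size can be illustrated with a small Python sketch. The flag names, the stop-word list, and the `strip_suffix` stand-in for a stemmer are assumptions; only the three defaults (all off) come from the slide.

```python
STOP_WORDS = {"in", "inside", "of", "the"}  # hypothetical stop-word list


def strip_suffix(term):
    """Stand-in for a stemmer: only strips a trailing "ers" or "er"."""
    for suffix in ("ers", "er"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def index_terms(tokens, case_sensitive=False, remove_stop_words=False, stem=False):
    """Turn tokens into index terms; the defaults mirror the slide (all off)."""
    terms = []
    for token in tokens:
        term = token if case_sensitive else token.lower()
        if remove_stop_words and term.lower() in STOP_WORDS:
            continue
        if stem:
            term = strip_suffix(term)
        terms.append(term)
    return terms


tokens = ["Londoner", "in", "London", "inside", "the", "City", "of", "London"]
default_vocab = set(index_terms(tokens))
reduced_vocab = set(index_terms(tokens, remove_stop_words=True, stem=True))
print(len(default_vocab), len(reduced_vocab))
```

The reduced vocabulary is smaller, but the spatial words "in", "inside" and "of" are gone from it, which is the trade-off the slide points at.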
Geo-Parsing Techniques
• Filtering candidate locations using context rules to remove stop-words, references to people and organizations, and links to emails/URLs
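A minimal Python sketch of such context rules follows. The cue lists (`STOP_WORDS`, `PERSON_TITLES`, `ORG_SUFFIXES`), the link pattern, and the sample gazetteer are all hypothetical; the slides do not give the actual rules used in the paper.

```python
import re

STOP_WORDS = {"of", "over", "by"}                        # hypothetical stop list
PERSON_TITLES = {"mr", "mrs", "ms", "dr"}                # hypothetical person cues
ORG_SUFFIXES = {"university", "council", "ltd", "plc"}   # hypothetical organization cues
LINK_PATTERN = re.compile(r"\S+@\S+|https?://\S+|www\.\S+")


def filter_candidates(text, gazetteer):
    """Return gazetteer matches in text that survive the context rules."""
    # Rule 1: remove e-mail addresses and URLs before any matching.
    text = LINK_PATTERN.sub(" ", text)
    tokens = re.findall(r"[A-Za-z]+", text)
    kept = []
    for i, token in enumerate(tokens):
        low = token.lower()
        if low not in gazetteer:
            continue
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if low in STOP_WORDS:          # Rule 2: stop-word, e.g. "over"
            continue
        if prev_tok in PERSON_TITLES:  # Rule 3: person, e.g. "Dr Lincoln"
            continue
        if next_tok in ORG_SUFFIXES:   # Rule 4: organization, e.g. "Sheffield University"
            continue
        kept.append(token)
    return kept


sample_gazetteer = {"sheffield", "lincoln", "over", "york"}
sample_text = ("Dr Lincoln of Sheffield University moved over to York; "
               "see www.york.example or mail info@sheffield.example")
print(filter_candidates(sample_text, sample_gazetteer))
```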
Conclusion
• The geo-parsing method could be improved by enhancing gazetteer matching and filtering
• False hits could be reduced by generating a better list of stop-words and using further context rules
• The need for creating rules would be alleviated by using machine learning over features to generate further context rules