
Web-a-Where: Geotagging Web Content



Presentation Transcript


  1. Web-a-Where: Geotagging Web Content Einat Amitay, Nadav Har’El, Ron Sivan and Aya Soffer IBM Haifa Research Lab, Haifa, Israel (SIGIR 2004, pp. 273-280)

  2. Abstract • Web-a-Where • A system for associating geography with Web pages. • It locates mentions of places and determines which place each name refers to. • It assigns each page a geographic focus: a locality that the page discusses as a whole. • The tagger achieves a precision of up to 82% on individual geotags. • 91% of the reported foci are correct up to the country level.

  3. Introduction (1/3) • A page may have two types of geography associated with it: a source and a target. • Example: a news article about Northern Ireland (the target) appearing on the CNN site in the USA (the source). • Two types of ambiguity: • geo/non-geo: e.g., a person’s name (Berlin) or a common word (Turkey). • geo/geo: distinct places share the same name, e.g. London, England vs. London, Ontario.

  4. Introduction (2/3) • In the USA alone there are 18 cities named Jerusalem, 24 named Paris, and 63 Springfields spread over 34 states. • Of the place names mentioned on Web pages, 37% have several possible geographic meanings, and the average number of possible meanings per mention is roughly 2.

  5. Introduction (3/3) • Geographic name disambiguation: • NLP approaches employ machine learning to recognize names from their structure and context => too complex to run fast over the enormous volume of Web data. • The gazetteer approach recognizes names by looking them up in a geographic lexicon. • Web-a-Where employs the gazetteer approach. • Its goal is to identify all geographic mentions in Web pages, assign a geographic location and a confidence level to each, and derive a focus (or foci) for the entire page.

  6. Tagging Individual Place Names • Geotagger: finds and disambiguates geographic names, currently those of cities, states and countries, e.g. Paris/France/Europe (called a taxonomy node). • The list of geographic names, their canonical taxonomy nodes and other pertinent information is kept in a database known as a gazetteer. • The processing of a page is done in three phases: • Spotting • Disambiguation • Focus determination

  7. The Gazetteer (1/3) • The gazetteer contains a hierarchical view of the world, divided into continents, countries, states (for some countries) and cities. • Each place is associated with a canonical taxonomy node, a number of names and/or abbreviations, world coordinates and a population estimate. • Nearly 40,000 places are listed, together with alternative spellings and abbreviations, for a total of about 75,000 names: most in English and some in the vernacular.
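
To make the gazetteer structure concrete, here is a minimal sketch of how one entry might be held in memory. The class name, field names and example values are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GazetteerEntry:
    """One place in the gazetteer (illustrative schema, not the paper's)."""
    node: str          # canonical taxonomy node, e.g. "Paris/France/Europe"
    names: list        # accepted spellings and abbreviations
    lat: float         # world coordinates
    lon: float
    population: int    # population estimate (later used as the default-meaning tie-breaker)

# Hypothetical example entry, roughly in the spirit of the slide:
paris = GazetteerEntry(
    node="Paris/France/Europe",
    names=["Paris"],
    lat=48.86,
    lon=2.35,
    population=2_100_000,
)
```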

  8. The Gazetteer (2/3) • The gazetteer is automatically created from a number of freely available data sources. • The gazetteer also contains a non-geo section listing place names that are also very commonly used words, e.g. “To” (Myanmar) or “Of” (Turkey).

  9. The Gazetteer (3/3) • Two tests populate the non-geo section: • Names that appeared more than 100 times, but in most cases were not capitalized as a proper name should be, are included, e.g. “Asbestos” (Quebec) and “Humble” (Texas). • Names mentioned much more frequently than their population would warrant are also included, e.g. “Grove” (Spain, pop. 10,976) and “Atlantic” (Iowa, pop. 7,474), whose high mention frequency does not match their small populations. • A manual pass is still required: e.g. remove “Aspen”, add “Metro”.
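
The two statistical tests on this slide can be sketched as a single filter. The corpus statistics are assumed to be precomputed, and all thresholds except the 100-occurrence cutoff are illustrative guesses rather than the paper's values.

```python
def looks_non_geo(occurrences, capitalized, population,
                  min_occurrences=100, cap_ratio=0.5, mentions_per_capita=0.001):
    """Decide whether a gazetteer name belongs in the non-geo section.

    occurrences -- how many times the name was seen in a sample corpus
    capitalized -- how many of those occurrences were capitalized
    population  -- the place's population estimate
    The cap_ratio and mentions_per_capita thresholds are assumptions.
    """
    if occurrences <= min_occurrences:
        return False
    # Test 1: the name is usually not capitalized, so it is mostly used as a
    # common word (e.g. "asbestos", "humble").
    if capitalized / occurrences < cap_ratio:
        return True
    # Test 2: the name is mentioned far more often than its small population
    # would warrant (e.g. "Grove", "Atlantic").
    if population > 0 and occurrences / population > mentions_per_capita:
        return True
    return False
```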

  10. Spotting Place Name Candidates • The goal is to find (spot) all possible geographic names in each page. • Matching is case-insensitive. • The list of words to spot is the list of all names in the gazetteer. • Short abbreviations are not spotted since they are often too ambiguous, e.g. IN (Indiana or India, but also a common English preposition) or AT (Austria).
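
A minimal sketch of the spotting phase, assuming the gazetteer has been flattened into a map from lower-cased names to candidate entries. It handles only single-word names; multi-word names such as "San Francisco" would need a sliding window over tokens, and the length cutoff used to skip short abbreviations is an assumption.

```python
import re

def spot_candidates(text, name_index, min_len=3):
    """Return (offset, surface_form, candidate_entries) for each case-insensitive match.

    name_index -- dict mapping lower-cased gazetteer names to lists of entries
    min_len    -- skip very short tokens such as "IN" or "AT" (assumed cutoff)
    """
    spots = []
    for match in re.finditer(r"[A-Za-z][A-Za-z'\-]*", text):
        token = match.group(0)
        if len(token) < min_len:
            continue  # short abbreviations are too ambiguous to spot
        candidates = name_index.get(token.lower())
        if candidates:
            spots.append((match.start(), token, candidates))
    return spots
```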

  11. Disambiguating Spots (1/2) • If the tokens in the vicinity of a spot uniquely qualify it, as when a spot of “Chicago” is immediately followed by “IL”, the geotagger assigns that meaning to the spot with a confidence in the range 0.95–1; otherwise the spot is left unassigned for the moment. • Each unresolved spot is then assigned its default meaning, the candidate with the largest population, at confidence 0.5. • If the page contains multiple spots of the same name and only one of them is qualified, the meaning of the qualified spot is propagated to the others, with confidence in the range 0.8–0.9.

  12. Disambiguating Spots (2/2) • For the spots that are still unresolved (confidence < 0.7), a disambiguating context is now sought. • Example: “London” and “Hamilton” appear in the same page without further qualification. Candidate meanings of “London” include “England, UK” and “Ontario, Canada”, while “Hamilton” exists in “Ohio, USA” and “Ontario, Canada” as well as in New Zealand. The smallest disambiguating context common to both is “Ontario, Canada”. • The assigned confidence is between 0.65 and 0.75, depending on whether the chosen meaning matches the spot’s default meaning.
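
The "smallest disambiguating context" step can be reconstructed roughly as follows: collect the enclosing regions of every candidate meaning on the page, and prefer, for each unresolved spot, the candidate whose ancestors are shared with other spots. This is a simplified sketch of the idea, not the paper's exact procedure; the function name, input layout and tie-breaking are assumptions, while the 0.65/0.75 confidence values come from the slide.

```python
def resolve_by_shared_context(unresolved, default_meaning):
    """Pick a meaning for each unresolved spot via shared taxonomy context.

    unresolved      -- dict: spot name -> candidate taxonomy nodes, e.g.
                       {"London":   ["London/Ontario/Canada", "London/England/UK"],
                        "Hamilton": ["Hamilton/Ontario/Canada",
                                     "Hamilton/Ohio/USA",
                                     "Hamilton/New Zealand"]}
    default_meaning -- dict: spot name -> its default (largest-population) node
    Returns a dict: spot name -> (chosen node, confidence).
    """
    # Count how many distinct spots each enclosing region could belong to.
    votes = {}
    for name, candidates in unresolved.items():
        contexts = set()
        for node in candidates:
            parts = node.split("/")
            for i in range(1, len(parts)):          # proper ancestors, e.g. "Ontario/Canada"
                contexts.add("/".join(parts[i:]))
        for ctx in contexts:
            votes[ctx] = votes.get(ctx, 0) + 1

    resolved = {}
    for name, candidates in unresolved.items():
        # Prefer the candidate with the best-supported ancestor region.
        def support(node):
            parts = node.split("/")
            return max((votes.get("/".join(parts[i:]), 0)
                        for i in range(1, len(parts))), default=0)
        best = max(candidates, key=support)
        # Slide: confidence 0.65-0.75, higher if the choice matches the default meaning.
        confidence = 0.75 if best == default_meaning.get(name) else 0.65
        resolved[name] = (best, confidence)
    return resolved
```

On the slide's example, "Ontario/Canada" is the context supported by both "London" and "Hamilton", so both spots resolve to their Ontario meanings.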

  13. Determining Page Focus • Once the correct meaning of every geographical name mentioned in the page has been determined, we still need a way to decide which geographic mentions are incidental and which constitute the actual focus of the page.

  14. Rationale of the Focus Algorithm • The basic idea: if several cities from the same region are mentioned, that region may be the real focus. • E.g. San Francisco (Calif.), Los Angeles (Calif.) and San Diego (Calif.) => California. • E.g. San Jose (Calif.), Chicago (Ill.) => US. • A page may have several foci => try to coalesce many places into one region before declaring the foci. • If a small region is the real focus, a larger enclosing region should not be reported instead.

  15. Outline of the Focus Algorithm (1/3) • Each resolved mention adds a score to its taxonomy node (e.g. Paris/France/Europe) and lower scores to the enclosing regions (France/Europe and Europe). • The region reported by the focus-finding algorithm is itself a taxonomy node from the gazetteer. • A focus is therefore always a city, state, country or continent.

  16. Outline of the Focus Algorithm (2/3) • Example: a page contains 4 mentions of Orlando/Florida (.5), 3 of Texas (.75), 8 of Fort Worth/Texas (.75), 3 of Dallas/Texas (.75), 1 of Garland/Texas (.75), and 1 of Iraq (.5). • A human reader would say the page is about Texas and perhaps also Orlando. • Indeed, the page comes from the “Orlando Weekly” site, in a forum thread titled “Just a look at the Texas Local Music Scene”.

  17. Outline of the Focus Algorithm (3/3) • Texas gets the top score because several separate cities – Fort Worth, Dallas and Garland – contribute to it, and is chosen as a focus. • The US covers Texas, which is already a focus, so it is dropped. • Fort Worth is covered by Texas and is likewise dropped. • Orlando/Florida is taken as a second focus. • Iraq/Asia falls below the importance threshold (.9), so it is ignored.

  18. The Focus Scoring Algorithm • For a place with taxonomy node A/B/C and confidence p ∈ [0, 1]: • Add p² to A/B/C. • Add p²·d to the enclosing region B/C. • Add p²·d² to C, where 0 < d < 1 (here d = 0.7). • Sort the accumulated nodes by score and loop over them from highest to lowest. • Stop when the score drops below the threshold (.9), or once sufficiently many foci have been found (4). • Skip taxonomy levels that cover, or are covered by, one already selected as a focus; otherwise add the level to the list of foci.
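
The scoring pass on this slide maps almost directly to code. The sketch below uses the slide's parameters (d = 0.7, score threshold 0.9, at most 4 foci); representing taxonomy nodes as "/"-joined paths and testing cover/covered-by via path suffixes are implementation assumptions, not details from the paper.

```python
def page_foci(tags, d=0.7, threshold=0.9, max_foci=4):
    """Derive page foci from disambiguated geotags.

    tags -- one (taxonomy_node, confidence) pair per mention, e.g.
            [("Fort Worth/Texas/USA", 0.75), ("Dallas/Texas/USA", 0.75), ...]
    """
    # Score each node and all of its enclosing regions: p^2, p^2*d, p^2*d^2, ...
    scores = {}
    for node, p in tags:
        parts = node.split("/")
        for level in range(len(parts)):
            region = "/".join(parts[level:])
            scores[region] = scores.get(region, 0.0) + (p ** 2) * (d ** level)

    # Walk the regions from highest to lowest score.
    foci = []
    for region, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold or len(foci) >= max_foci:
            break
        # Skip regions that cover, or are covered by, an already chosen focus.
        if any(region.endswith("/" + f) or f.endswith("/" + region) for f in foci):
            continue
        foci.append(region)
    return foci
```

Run on the slide-16 example (reading the "Texas" mentions as the node Texas/USA and each city mention as City/Texas/USA), this sketch reproduces the slide-17 outcome: Texas is the top focus, Fort Worth and the US are skipped as covered/covering regions, Orlando/Florida becomes the second focus, and Iraq stays below the threshold.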

  19. Implementation • The Web-a-Where geotagger, implemented as a WebFountain miner, outputs the meaning for each place name in the text, together with a confidence figure for that meaning. • It also produces a set of up to four foci per page.

  20. Evaluating the Geotagger (1/3) • Evaluation corpus: 200 pages per collection. • Arbitrary collection: • Query Google with +the, +and, +in. • Collect the top 1000 results for each query. • .GOV collection: • Pages from the .gov domain crawl used for the standard TREC 2003 test. • Open Directory Project (ODP) collection: • Pages from the Regional sub-directory. • All chosen pages are larger than 3 KB.

  21. Evaluating the Geotagger (2/3)

  22. Evaluating the Geotagger (3/3)

  23. Testing Page Focus (1/2) • A random sample of 20,000 Web pages, each larger than 3 KB, taken from the ODP’s Regional section.

  24. Testing Page Focus (2/2) • Will enlarging the gazetteer by adding smaller and smaller towns improve the page-focus determination, or hurt it? => Web-a-Where is helped, not hurt, by an improved gazetteer.

  25. Conclusion • Experiments show that the system correctly tags individual place name occurrences about 80% of the time and identifies the correct focus of a page, up to the country level, 91% of the time. • The main source of errors is geo/non-geo ambiguity => a part-of-speech tagger could help. • Geo/geo accuracy can be improved by refining the “disambiguating context” heuristics and by devising additional ones.
