A confidence-based framework for disambiguating geographic terms
Erik Rauch, Michael Bukatin, and Kenneth Baker
MetaCarta, Inc.
Al Hamra (= ‘red’ in Arabic)
Local and non-local information
• More non-local information -> too many states to get probabilities
• Example terms: “Madison’s downtown”; “Wisconsin”; “Milwaukee”
Candidate places
Deir az Zor (candidate location, confidence):
• (32.10 N 41.11 E), 0.325
• (25.03 N 31.44 E), 0.151
• (….)
Other forms of geographic reference:
• 38°01'10.5"N 121°44'48.8"W
• four miles south of Lusaka
• (22.10 S 15.51 E)
Local context
• “Minister Ishihara”: candidate Ishihara, Japan (32.36 N 147.21 E)
• “resident of Madison”: candidates Madison, WI; Madison, ID; Madison, CT; Madison, KY…
Context affects confidence
• Increase or decrease c(p,n) based on the strength of context words
• “by Madison” vs. “President Madison”
• Context words can be added manually or automatically
• and/or use an HMM
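The adjustment above can be sketched as a multiplicative update on c(p,n). This is a minimal illustration, not the authors' method: the weight table, its values, the multiplicative rule, and the function name `adjust_confidence` are all assumptions.

```python
# Hypothetical context-word weights (assumed values, not from the slides):
# a weight above 1 raises confidence, a weight below 1 lowers it.
CONTEXT_WEIGHTS = {
    "by": 1.5,
    "in": 1.5,
    "president": 0.1,
    "minister": 0.1,
}


def adjust_confidence(base_conf, preceding_word):
    """Scale c(p, n) by the weight of the word before the name, capped at 1.0."""
    weight = CONTEXT_WEIGHTS.get(preceding_word.lower(), 1.0)
    return min(1.0, base_conf * weight)


geographic = adjust_confidence(0.4, "by")         # "by Madison"
personal = adjust_confidence(0.4, "President")    # "President Madison"
unknown = adjust_confidence(0.4, "visited")       # no entry: confidence unchanged
```

With these assumed weights, “by Madison” ends up above the starting confidence and “President Madison” well below it. A learned table or an HMM could replace the hand-written dictionary.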
Local context problems
• Example terms: “Madison family attractions”; “Milwaukee”
• Candidates: Madison, WI; Madison, ID; Madison, CT; Madison, KY…
Increase c(p,n) based on number of other references:
• Enclosing regions or nearby points
• Example: Madison, Wisconsin, Milwaukee
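One way to picture this boost is to count supporting references and add a fixed step per reference. The sketch below is an assumption throughout: the additive rule, the `radius_km` and `step` parameters, and the function names are hypothetical, and the coordinates are approximate.

```python
import math


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def boost(base_conf, candidate, other_points, enclosing_regions_mentioned,
          radius_km=200.0, step=0.15):
    """Raise c(p, n) once per supporting reference, never above 1.0.

    Support = mentions of regions enclosing the candidate, plus other
    resolved points within radius_km. The update rule is an assumption.
    """
    nearby = sum(1 for p in other_points if distance_km(candidate, p) <= radius_km)
    return min(1.0, base_conf + step * (enclosing_regions_mentioned + nearby))


# Approximate coordinates (lat, lon); west longitudes are negative.
madison_wi = (43.07, -89.40)
madison_ct = (41.28, -72.60)
milwaukee = (43.04, -87.91)

# The document also mentions "Wisconsin" (encloses Madison, WI) and "Milwaukee".
conf_wi = boost(0.3, madison_wi, [milwaukee], enclosing_regions_mentioned=1)
conf_ct = boost(0.3, madison_ct, [milwaukee], enclosing_regions_mentioned=0)
```

Here Madison, WI gains support from both the enclosing region and the nearby point, while Madison, CT gains none and keeps its starting confidence.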
Pitfalls
• Text: “Ishihara, Japan’s leading epidemiologist,”
• False match: Ishihara, Japan (32.36 N 147.21 E)
Training
• “Philadelphia” is usually geographic; “Bend” usually isn’t
• If name n often refers to point p in documents, give (n,p) a high confidence to start with
• Use the average confidence in a large corpus
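A minimal sketch of such a starting confidence, assuming a corpus in which every occurrence of a name is tagged either with the point it refers to or with `None` for a non-geographic use. The data below is a toy example invented for illustration, and `initial_confidences` is a hypothetical helper.

```python
from collections import Counter


def initial_confidences(tagged_mentions):
    """Estimate a starting c(p, n) as a relative frequency.

    tagged_mentions: iterable of (name, point_or_None), one per occurrence.
    Returns {(name, point): share of the name's occurrences that refer to point}.
    """
    totals = Counter()
    hits = Counter()
    for name, point in tagged_mentions:
        totals[name] += 1
        if point is not None:
            hits[(name, point)] += 1
    return {(n, p): count / totals[n] for (n, p), count in hits.items()}


# Toy tagged occurrences (made up for this sketch).
mentions = [
    ("Philadelphia", "Philadelphia, PA"),
    ("Philadelphia", "Philadelphia, PA"),
    ("Philadelphia", "Philadelphia, PA"),
    ("Philadelphia", None),
    ("Bend", "Bend, OR"),
    ("Bend", None),
    ("Bend", None),
    ("Bend", None),
]

conf = initial_confidences(mentions)
```

In this toy corpus “Philadelphia” starts with a much higher confidence than “Bend”, matching the intuition on the slide.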
Training cont’d
• Extract local linguistic contexts that often occur with geographic names in tagged corpora
• Or train an HMM
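Context extraction could look roughly like the following, which scores each word by how often the name that follows it is geographic. The tagging scheme (`"GEO"`, `"NAME"`, `"O"`), the restriction to the immediately preceding word, and the `min_count` threshold are assumptions made for this sketch.

```python
from collections import Counter


def context_scores(tagged_sentences, min_count=2):
    """Score preceding words by how often the following name is geographic.

    tagged_sentences: lists of (token, tag), where tag is "GEO" for a
    geographic name, "NAME" for any other name, and "O" otherwise
    (a hypothetical scheme). Words seen fewer than min_count times
    before a name are dropped as unreliable.
    """
    geo, total = Counter(), Counter()
    for sentence in tagged_sentences:
        for (prev_word, _), (_, tag) in zip(sentence, sentence[1:]):
            if tag in ("GEO", "NAME"):
                key = prev_word.lower()
                total[key] += 1
                if tag == "GEO":
                    geo[key] += 1
    return {word: geo[word] / n for word, n in total.items() if n >= min_count}


# Toy tagged corpus (made up for this sketch).
corpus = [
    [("in", "O"), ("Madison", "GEO")],
    [("in", "O"), ("Lusaka", "GEO")],
    [("President", "O"), ("Madison", "NAME")],
    [("President", "O"), ("Lincoln", "NAME")],
    [("near", "O"), ("Bend", "GEO")],
]

scores = context_scores(corpus)
```

Scores near 1 mark contexts that suggest a place, scores near 0 mark contexts that suggest a person or other non-geographic use; “near” is dropped here because it occurs only once.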
Relevance
• Several dimensions to relevance:
  • Traditional textual relevance of query terms
  • Georelevance
• Query: “cheese” in France
Georelevance
• Depends on:
  • Attributes of the geotext, e.g. document frequency, font size, position
  • Geoconfidence
• Aim: the combination reflects the user’s preferred balance between recall and correctness of the geographic reference
• e.g. Georelevance = query term relevance * geoconfidence
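The example formula can be applied directly to rank results. In this sketch the documents and their scores are invented for illustration; only the product rule comes from the slide.

```python
def georelevance(term_relevance, geoconfidence):
    """Georelevance = query term relevance * geoconfidence (the slide's example)."""
    return term_relevance * geoconfidence


# Hypothetical results for a query such as "cheese" in France.
results = [
    {"doc": "A", "term_relevance": 0.9, "geoconfidence": 0.2},
    {"doc": "B", "term_relevance": 0.6, "geoconfidence": 0.8},
    {"doc": "C", "term_relevance": 0.5, "geoconfidence": 0.5},
]

ranked = sorted(
    results,
    key=lambda r: georelevance(r["term_relevance"], r["geoconfidence"]),
    reverse=True,
)
order = [r["doc"] for r in ranked]
```

Document A matches the query terms best but its geographic reference is doubtful, so it ranks last; a different combination rule would shift the balance between recall and correctness.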
Conclusion
• The ambiguity problem is much worse with large gazetteers
• Can use probabilistic methods where feasible (local information) and combine them with confidence-based heuristics