Adaptive Context Features for Toponym Resolution in Streaming News. Group 12 Hari Kishan Bandaru V S P V S K Kumar Parimi Sneha Anand Yeluguri. Paper. Adaptive Context Features for Toponym Resolution in Streaming News Michael D. Lieberman, Hanan Samet
Adaptive Context Features for Toponym Resolution in Streaming News Group 12 Hari Kishan Bandaru V S P V S K Kumar Parimi Sneha Anand Yeluguri
Paper • Adaptive Context Features for Toponym Resolution in Streaming News • Michael D. Lieberman, Hanan Samet • Venue: In SIGIR’12: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval
Outline • Motivation • Related work • Problem Definition • Key concepts • Method • Validation/Results • Conclusion
Motivation • Demand for ever growing volumes of news and information. • People strive to stay up-to-date. • Internet-enabled mobile devices require location-based services.
Related work • Several commercial products for geotagging text are available, such as • MetaCarta’s Geotagger • Thomson Reuters’s OpenCalais • Yahoo!’s Placemaker
Problem Definition • Toponym resolution is the step of geotagging that assigns each toponym its correct lat/long values. • It is framed as a classification problem: each possible interpretation of each toponym is classified as correct or incorrect. • The paper addresses this classification problem using adaptive context features.
Introduction • News often has a strong geographic component. • Articles describe events that are relevant to geographic locations of interest to their readers. • Geotagging aims to understand the geographic content present in the articles.
Geotagging Steps • Toponym recognition • Finding all textual references to geographic locations. • Toponym resolution • Choosing the correct location interpretation for each toponym.
Key concepts • GEOTAGGING FRAMEWORK • Toponym Recognition • Toponym Resolution • Resolution Features • ADAPTIVE CONTEXT FEATURES • Proximity Features • Sibling Features • Feature Computation • Feature Propagation
Toponym Recognition • Toponym recognition is designed as a multifaceted process involving both rule-based and statistics-based techniques. • Lookups are performed in various tables of entity names, including location names, abbreviations, business names, and person names, as well as cue words.
Toponym Recognition • NLP tools are used, including an NER package that recognizes toponyms and other entities; its output is extensively post-processed to ensure higher quality. • Part-of-speech (POS) tagging is also performed to find phrases of proper nouns, since names of locations (and other types of entities) tend to be composed of proper nouns.
Toponym Resolution • Toponym resolution is implemented with supervised machine learning. • For a given toponym/interpretation pair (t, lt), the classifier decides whether the interpretation is correct or incorrect. • Location interpretations are drawn from a gazetteer.
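As an illustration of how toponym/interpretation pairs might be generated before classification, here is a minimal sketch. The gazetteer contents, the `Interpretation` type, and the `candidate_pairs` helper are all hypothetical names introduced for this example; the coordinates are approximate and the structure is an assumption, not the paper's data model.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interpretation(NamedTuple):
    """One possible location for a toponym (hypothetical record layout)."""
    name: str
    region: str
    lat: float
    lon: float


# Toy gazetteer for illustration only; coordinates are approximate.
GAZETTEER: Dict[str, List[Interpretation]] = {
    "Paris": [
        Interpretation("Paris", "France", 48.86, 2.35),
        Interpretation("Paris", "Texas, USA", 33.66, -95.56),
    ],
    "London": [
        Interpretation("London", "United Kingdom", 51.51, -0.13),
        Interpretation("London", "Ontario, Canada", 42.98, -81.25),
    ],
}


def candidate_pairs(toponyms: List[str]) -> List[Tuple[str, Interpretation]]:
    """Return every (t, l_t) pair; toponyms missing from the gazetteer yield nothing."""
    pairs = []
    for t in toponyms:
        for l_t in GAZETTEER.get(t, []):
            pairs.append((t, l_t))
    return pairs


pairs = candidate_pairs(["Paris", "London"])
```

Each resulting pair would then be turned into a feature vector and labeled correct or incorrect by the classifier.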
Toponym Resolution • The classifier is random forests, a decision tree-based ensemble method. • The random forests method constructs many decision trees based on different random subsets of the dataset, sampled with replacement. • Each decision tree is constructed using random subsets of features from the training feature vectors.
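The two sources of randomness described above can be sketched with the standard library alone. This is only an illustration of the sampling steps, not a random forest implementation: tree construction is omitted, the function names are hypothetical, and the feature subset is drawn once per tree for brevity, whereas typical random forest implementations re-draw it at each split.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_sample(rows: Sequence[int], rng: random.Random) -> List[int]:
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Pick k distinct features (sampling without replacement)."""
    return rng.sample(list(features), k)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
rows = list(range(10))                  # stand-ins for training feature vectors
features = ["I", "P", "A", "D", "L"]    # baseline feature names from the slides

# One (training sample, feature subset) pair per tree in a three-tree ensemble.
forest_inputs: List[Tuple[List[int], List[str]]] = [
    (bootstrap_sample(rows, rng), random_feature_subset(features, 2, rng))
    for _ in range(3)
]
```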
Previous Methods • One early proposed method used SVM regression to estimate a distance function from feature vector values, intended to capture the distance between a given lt and t’s ground truth interpretation.
Resolution Features • Several baseline toponym resolution features are used: • I: Number of interpretations for t. • P: The population of lt, where a larger population indicates that lt is better known. • A: Number of alternate names for lt in various languages. More names indicate greater renown of lt. • D: Geographic distance of lt from an interpretation of a dateline toponym, which establishes a general location context for a news article. • L: Geographic distance of lt from the newspaper’s local lexicon, the expected location of its primary audience, expressed as a lat/long point.
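A minimal sketch of how the five baseline features could be assembled for one pair (t, lt). The `Interp` record, the `baseline_features` helper, the use of haversine distance in kilometers, and the example population and alternate-name counts are all assumptions made for illustration; the slides do not specify the distance formula or units.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


@dataclass
class Interp:
    """Hypothetical gazetteer record for one interpretation l_t."""
    lat: float
    lon: float
    population: int
    alt_names: int


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def baseline_features(l_t: Interp, all_interps: List[Interp],
                      dateline: Point, local: Point) -> Dict[str, float]:
    """Compute I, P, A, D, L for one toponym/interpretation pair."""
    here = (l_t.lat, l_t.lon)
    return {
        "I": len(all_interps),            # number of interpretations for t
        "P": l_t.population,              # population of l_t
        "A": l_t.alt_names,               # number of alternate names
        "D": haversine_km(here, dateline),  # distance to dateline interpretation
        "L": haversine_km(here, local),     # distance to local lexicon point
    }


# Placeholder values for illustration, not real gazetteer data.
paris_fr = Interp(48.86, 2.35, population=2_000_000, alt_names=50)
paris_tx = Interp(33.66, -95.56, population=25_000, alt_names=2)
feats = baseline_features(paris_fr, [paris_fr, paris_tx],
                          dateline=(48.86, 2.35), local=(51.51, -0.13))
```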
Adaptive Context Features • Features reflect two aspects of toponym co-occurrence and the evidence that interpretations impart to each other • Proximate interpretations • Sibling interpretations
Proximity Features • These are based on geographic distance. • For each other toponym o in the window around t, find the interpretation lo that is closest to lt. • The proximity feature for (t, lt) is the average of the geographic distances to these closest interpretations. • The learning procedure can learn appropriate distance thresholds from its training data.
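The proximity computation described above can be sketched as follows. The haversine distance, the representation of interpretations as (lat, lon) tuples, and returning `None` for an empty window are assumptions of this sketch; coordinates are approximate.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def proximity_feature(l_t: Point, window: List[List[Point]]) -> Optional[float]:
    """Average, over the other toponyms o in the window, of the distance
    from l_t to o's closest interpretation l_o. None if the window is empty."""
    closest = [
        min(haversine_km(l_t, l_o) for l_o in candidates)
        for candidates in window
        if candidates  # skip toponyms with no interpretations
    ]
    return sum(closest) / len(closest) if closest else None


PARIS_FR = (48.86, 2.35)
PARIS_TX = (33.66, -95.56)
# One other toponym in the window, "London", with two interpretations:
# London, United Kingdom and London, Ontario.
window = [[(51.51, -0.13), (42.98, -81.25)]]

near = proximity_feature(PARIS_FR, window)
far = proximity_feature(PARIS_TX, window)
```

In this toy case the French interpretation gets the smaller feature value, because its closest "London" is much nearer than any "London" is to Paris, Texas.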
Sibling Features • Capture the relationships between textually proximate toponyms that share the same country, state, or other administrative division. • For each toponym/interpretation pair (t, lt), the sibling feature value is the number of other toponyms o in the window around t with an interpretation that is a sibling of lt at a given resolution.
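The sibling count can be sketched as below. Representing each interpretation by its administrative path, such as (country, state), and treating two interpretations as siblings at a given level when their paths agree up to that level, are assumptions of this example; the `sibling_feature` name is hypothetical.

```python
from typing import List, Tuple

AdminPath = Tuple[str, ...]  # e.g. (country, state); hypothetical layout


def sibling_feature(l_t: AdminPath, window: List[List[AdminPath]], level: int) -> int:
    """Count the other toponyms in the window that have at least one
    interpretation sharing the first `level` administrative units with l_t.
    level=1 compares countries, level=2 compares (country, state)."""
    target = l_t[:level]
    return sum(
        1
        for candidates in window
        if any(l_o[:level] == target for l_o in candidates)
    )


l_t = ("USA", "Texas")
window = [
    [("USA", "Texas"), ("USA", "Georgia")],  # toponym with two interpretations
    [("Canada", "Ontario")],
    [("USA", "Ohio")],
]

country_siblings = sibling_feature(l_t, window, level=1)
state_siblings = sibling_feature(l_t, window, level=2)
```

Each toponym is counted at most once, however many of its interpretations match.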
Feature Accuracy • Window breadth corresponds to the size of the window around t. • Window depth is the maximum number of interpretations to be considered for each toponym in the window. • Interpretations are ranked using various factors such as GeoNames, population of the location, and geographic distance.
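A minimal sketch of how the two window parameters could bound the work done per toponym. Measuring breadth in toponym positions on each side of t, and ranking interpretations by population alone, are simplifying assumptions; `context_window` and the example populations are hypothetical.

```python
from typing import Dict, List, Tuple

Toponym = Tuple[str, List[Dict[str, int]]]  # (name, candidate interpretations)


def context_window(toponyms: List[Toponym], index: int,
                   breadth: int, depth: int) -> List[Toponym]:
    """Return the toponyms within `breadth` positions of toponyms[index],
    excluding it, each trimmed to its `depth` highest-ranked interpretations."""
    lo = max(0, index - breadth)
    hi = min(len(toponyms), index + breadth + 1)
    window = []
    for i in range(lo, hi):
        if i == index:
            continue  # the window holds only the *other* toponyms
        name, candidates = toponyms[i]
        ranked = sorted(candidates, key=lambda c: c["population"], reverse=True)
        window.append((name, ranked[:depth]))
    return window


# Placeholder populations for illustration only.
doc = [
    ("A", [{"population": 10}]),
    ("B", [{"population": 5}, {"population": 500}, {"population": 50}]),
    ("C", [{"population": 1}]),
    ("D", [{"population": 7}, {"population": 70}]),
    ("E", [{"population": 3}]),
]

window = context_window(doc, index=2, breadth=1, depth=1)
```

Larger breadth and depth give the features more evidence to draw on, at the cost of more distance and sibling comparisons per pair.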
Validation/Results • Geotagging is difficult in general because of the large gazetteer and the large amount of toponym ambiguity. • Extensive experiments compare the adaptive method with competing geotagging methods: • Thomson Reuters’s OpenCalais, and • Yahoo!’s Placemaker • The adaptive context parameters (window breadth and depth) are varied to measure their effect on • feature computation time • accuracy of the adaptive method
Gazetteer Ambiguity Toponyms and the number of interpretations
Datasets Breakdown of location types within each of test corpora
Resolution Accuracy Resolution accuracy of various methods
Resolution Accuracy (Contd.) Importance of features used in the Adaptive method
Conclusion And Future Work • Adaptive context features serve as a flexible, useful addition to geotagging algorithms for streaming news and other textual domains. • Future work: test different toponym weightings in the window to judge their effect on resolution accuracy. • Future work: consider clusters of news articles about the same topic and design other features using these clusters.