Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03
Overview • Named Entity Recognition & Gazetteers • Data • Initial Algorithm • Bootstrapping approach • Evaluation • ToDo
NE Recognition National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.
State-of-the-art systems Standard approaches usually combine • Rules • Statistics • Gazetteers Classes distinguished: • Person • Organisation • Location
NE Recognition – with and without gazetteers (Mikheev, Moens, and Grover, 1999) ran their system in different modes
Fine-grained NER Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.
Manually created gazetteers Available resources: • Word lists from the Web • Atlases & maps • Digital gazetteers (e.g. Alexandria Digital Library)
Manually created gazetteers – drawbacks • Only positive data (no way to tell whether Mainau island does not exist or is simply not listed) • Difficult to adjust when new classes are required • Not available for most languages: Aquisgrana (the Italian name for Aachen)
Task We can get rid of manually compiled gazetteers by using the Internet. Task: subclassify locations using Internet counts (obtained from the AltaVista search engine). Offline vs. online processing
Data Manually created gazetteer (1260 items) Classes: • COUNTRY Pitcairn • REGION Bavaria/Bayern • RIVER Oder • ISLAND Savai‘i • MOUNTAIN Ohmberge • CITY Nancy Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION
Data Gazetteer example
Data For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING. CITY: ... REGION: ... COUNTRY: ... RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ... • TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Initial system For each class a set of keywords was created. ISLAND island islands archipelago
Initial system For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.
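A minimal sketch of how the two query forms could be generated for one item. The function name `build_queries` is hypothetical, and the actual search-engine call that returns the hit counts is left out.

```python
def build_queries(item, keywords):
    """Return the phrase queries "X KEYWORD" and "KEYWORD of X" for one item."""
    queries = []
    for kw in keywords:
        queries.append(f'"{item} {kw}"')     # "X KEYWORD"
        queries.append(f'"{kw} of {item}"')  # "KEYWORD of X"
    return queries


# Keywords for ISLAND as listed on the slide above.
island_keywords = ["island", "islands", "archipelago"]
queries = build_queries("Mainau", island_keywords)
```

Each query would then be sent to the search engine, and the returned counts become the features for the learners.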
Initial system Machine learners use the counts to induce classifications. Learners tested for this task: • C4.5 • TiMBL • Ripper
Initial system – drawbacks Still needs manually created resources: • Set of patterns • Initial gazetteer (TRAINING) Only online (slow) processing – the system can only classify items provided by the user; it cannot extract new names itself
Bootstrapping Riloff & Jones, 1999 – Bootstrapping for IE task ITEMS PATTERNS
Bootstrapping Main problem – noise: the pattern set can get infected Remedies: • Vaccine (external algorithm for evaluating patterns) • Stop lists • Human experts
Bootstrapping loop (diagram): Initial gazetteer → Extraction items → Collecting patterns → Discarding most general patterns → Learning classifiers → Extraction patterns and learned high-precision classifier → Collecting items → Discarding common names → Classifying items → back to Extraction items
Collecting patterns (step 1) • Go to AltaVista • ask for an item • download first n pages • match with a simple regexp • patterns
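The matching step can be sketched as follows. The slide only says "a simple regexp", so the expressions here are an assumption: they collect the single word to the left and to the right of the item and write the result as "w X" or "X w". The function name `collect_patterns` and the two sample pages are invented for illustration; downloading the pages is not shown.

```python
import re
from collections import Counter


def collect_patterns(pages, item):
    """Count one-word left and right contexts of `item` as 'w X' and 'X w'."""
    name = re.escape(item)
    left = re.compile(r"\b(\w+)\s+" + name + r"\b", re.IGNORECASE)
    right = re.compile(r"\b" + name + r"\s+(\w+)\b", re.IGNORECASE)
    counts = Counter()
    for text in pages:
        for m in left.finditer(text):
            counts[m.group(1).lower() + " X"] += 1
        for m in right.finditer(text):
            counts["X " + m.group(1).lower()] += 1
    return counts


# Invented page snippets standing in for downloaded search results.
pages = [
    "The island of Mainau is in Lake Constance.",
    "Ferries to Mainau and back.",
]
patterns = collect_patterns(pages, "Mainau")
```

Summing such counters over all items of a class gives a ranked pattern list per class.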
Example – step 1 10 best patterns for ISLAND (pattern, count): • of X 70 • the X 60 • X and 58 • X the 55 • to X 53 • in X 52 • and X 47 • X is 45 • X in 45 • on X 45
Rescoring (step 2) Goal: discard overly general patterns. The score of pattern p for class c includes a penalty for appearing in more than one class.
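The rescoring formula itself did not survive in this transcript. The sketch below therefore uses one simple, hypothetical penalty (the pattern's average count in the other classes is subtracted from its count in class c) only to show the idea; it is not the formula from the slide. All counts are invented toy values.

```python
def rescore(counts_by_class):
    """Map {class: {pattern: count}} to {class: {pattern: score}}.

    Hypothetical penalty: subtract the pattern's average count
    in all other classes from its count in the class at hand.
    """
    rescored = {}
    for cls, counts in counts_by_class.items():
        others = [c for c in counts_by_class if c != cls]
        rescored[cls] = {}
        for pattern, n in counts.items():
            total_elsewhere = sum(counts_by_class[o].get(pattern, 0) for o in others)
            penalty = total_elsewhere / max(len(others), 1)
            rescored[cls][pattern] = n - penalty
    return rescored


# Invented toy counts: "of X" is frequent everywhere, "X island" only for ISLAND.
counts = {
    "ISLAND": {"of X": 10, "X island": 6},
    "RIVER": {"of X": 9, "X river": 7},
}
scores = rescore(counts)
best_island = max(scores["ISLAND"], key=scores["ISLAND"].get)
```

With any penalty of this kind, general patterns such as "of X" drop in the ranking and class-specific ones rise.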
Example – step 2 10 best patterns for ISLAND (pattern, score): • X island 17 • island of X 9 • X islands 8 • island X 7 • islands X 7 • insel X 7 • the island X 6 • X elects 5 • of X islands 5 • zealand X 4
Learning classifiers (step 3) 20 best patterns are used to train Ripper (as in the initial system) Produced classifiers: • high-recall • high-accuracy • high-precision
Example – step 3 • High-recall classifier for ISLAND:
if #("X island")/#X >= 0.003879 classify X as +ISLAND
if #("and X islands")/#X >= 0.000002 classify X as +ISLAND
if #("insel X")/#X >= 0.017099 classify X as +ISLAND
otherwise classify X as –ISLAND
• Extraction patterns: "X island", "and X islands", "insel X"
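The high-recall rule list above can be applied as in the following sketch. The thresholds are taken from the slide; the function name `is_island` and the hit counts in the usage lines are invented for illustration.

```python
# (pattern, threshold on #(pattern)/#X), as listed on the slide.
HIGH_RECALL_ISLAND = [
    ("X island", 0.003879),
    ("and X islands", 0.000002),
    ("insel X", 0.017099),
]


def is_island(pattern_counts, item_count, rules=HIGH_RECALL_ISLAND):
    """Return True (+ISLAND) if any rule fires, else False (-ISLAND)."""
    if item_count == 0:
        return False  # no hits for X at all, so no ratio can be computed
    for pattern, threshold in rules:
        if pattern_counts.get(pattern, 0) / item_count >= threshold:
            return True
    return False


# Invented counts: 50 of 10000 pages match "X island" (ratio 0.005).
accepted = is_island({"X island": 50}, 10000)
# Invented counts: only 10 of 10000 pages match (ratio 0.001).
rejected = is_island({"X island": 10}, 10000)
```

The patterns of the rules that fire are the ones reused as extraction patterns in the next step.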
One more example – step 3 • High-accuracy classifier for ISLAND:
if #("X island")/#X >= 0.000636 classify X as +ISLAND
if #("and X islands")/#X >= 0.000002 and #("X sea")/#X >= 0.000013 and #("X geography") < 13 classify X as +ISLAND
if #("X islands")/#X >= 0.000056 and #("pacific islands X")/#X >= 0.000006 classify X as +ISLAND
otherwise classify X as –ISLAND
Collecting and discarding items (steps 4&5) Same procedure as in step 1: go to AltaVista, ask for the extraction patterns (cf. step 3), .. Discarding: common names (beginning with lower-case letters) and stop words (not necessary, but saves time)
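The discarding step might look like the sketch below. The stop list and the candidate strings are invented; the function name `keep_candidate` is hypothetical.

```python
# Hypothetical stop list; the one used by the author is not given on the slide.
STOP_WORDS = {"The", "This", "And"}


def keep_candidate(candidate, stop_words=STOP_WORDS):
    """Keep a candidate unless it starts with a lower-case letter or is a stop word."""
    if not candidate or candidate[0].islower():
        return False  # common name
    return candidate not in stop_words


# Invented candidates, as they might come out of the extraction patterns.
candidates = ["Mainau", "the", "beautiful", "The", "Savai'i"]
kept = [c for c in candidates if keep_candidate(c)]
```

Filtering here only saves queries: anything that slips through is still checked by the high-precision classifier in step 6.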
Example – steps 4 and 5 Extracted islands (alphabetically):
Classifying (step 6) High-precision classifier (cf. step 3) is run on collected items • rejected items are discarded • accepted items are used for extraction at the next loop
Example – step 6 Extracted islands (alphabetically):
Evaluation Classifiers: • initial system • bootstrapping from the seed gazetteer • bootstrapping from positive examples only Item lists: • bootstrapping from the seed gazetteer
Comparing the performance RIVER, MOUNTAIN, COUNTRY – the new system is better! ISLAND – the new system improved and became better after the 2nd loop. REGION – infected category ("departments of X"); however, the system is improving. CITY – very heterogeneous class (homonymy); 1st loop – "streets of X", 2nd loop – "km from X", "ort X".
Comparing the systems Bootstrapping (vs. the initial system): + patterns learned automatically + word lists produced • cheap seed gazetteer Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify properly
Learning from positives CITY: ... REGION: ... COUNTRY: ... RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ... Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY] Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
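One way to read the Before/Now contrast, stated as an assumption: before, the labels for a training item came from the full gazetteer; now, an item is positive only for the classes whose sampled list contains it, and negative everywhere else. The sketch below illustrates that reading. The list contents are a toy subset built from the examples on the Data slides, and the function name `training_labels` is hypothetical.

```python
def training_labels(item, class_lists):
    """Label `item` True for each class whose list contains it, else False."""
    return {cls: item in members for cls, members in class_lists.items()}


# Toy subset of the full gazetteer (Victoria is listed under five classes).
full_gazetteer = {
    "CITY": {"Victoria", "Nancy"},
    "REGION": {"Victoria", "Bavaria"},
    "COUNTRY": {"Pitcairn"},
    "RIVER": {"Victoria", "Oder"},
    "ISLAND": {"Victoria", "Savai'i"},
    "MOUNTAIN": {"Victoria", "Ohmberge"},
}

# Toy sampled lists: Victoria was drawn only for RIVER, ISLAND and MOUNTAIN.
sampled_positives = {
    "CITY": {"Nancy"},
    "REGION": {"Bavaria"},
    "COUNTRY": {"Pitcairn"},
    "RIVER": {"Victoria", "Oder"},
    "ISLAND": {"Victoria", "Savai'i"},
    "MOUNTAIN": {"Victoria", "Ohmberge"},
}

before = training_labels("Victoria", full_gazetteer)
now = training_labels("Victoria", sampled_positives)
```

Under this reading the negative labels are noisier, but no complete gazetteer is needed to build the training data.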
New items New ISLANDs: • true islands 121 (90.3%), of which found in the atlases 93, not found 28 • descriptions 5 (3.7%) • parts of names 3 (2.2%) • mistakes 5 (3.7%) • all 134
Conclusion Advantages of our approach: • very little manually collected data required (seed gazetteer) • no sophisticated engineering – patterns produced automatically • on-line classifiers provide negative information and are applicable to any entity • new items (off-line gazetteer) collected automatically