Fine-Grained Geographical Relation Extraction from Wikipedia André Blessing Hinrich Schütze University of Stuttgart Institute for Natural Language Processing (IMS)
Overview • motivation • why are fine-grained relations important? • self-annotation • automatic annotation using structured data • use this annotation to train a classifier • extraction framework • evaluation and conclusion
Geographical data provider • GeoNames • gazetteer • names, type, coordinates • 8 million entries • 2.6 million populated places • community-based • Creative Commons Attribution 3.0 License • Free to share
Task Definition • relation definition • R1_2 • ADM3-ADM4 • Landkreis (county) - Gemeinde (municipality) • R0_1 • ADM4-PPL • Gemeinde (municipality) - Ortsteil (suburb) • task • classify all possible binary relations between the named entities in one sentence
Example - binary relations between all NEs • Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany). • binary relations between NEs • (Gebroth, Bad Kreuznach) element of R1_2 • (Gebroth, Rheinland-Pfalz) • (Gebroth, Deutschland) • (Bad Kreuznach, Rheinland-Pfalz) • (Bad Kreuznach, Deutschland) • (Rheinland-Pfalz, Deutschland)
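A minimal sketch of how the candidate pairs in this example can be enumerated, assuming the named entities have already been recognized and are given in sentence order. The function name `candidate_pairs` is illustrative and not taken from the slides.

```python
from itertools import combinations


def candidate_pairs(entities):
    """Return every unordered pair of named entities, keeping sentence order."""
    return list(combinations(entities, 2))


# Named entities of the example sentence, in order of appearance.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
pairs = candidate_pairs(entities)

for pair in pairs:
    print(pair)
```

With four entities this yields six pairs, the same six listed on the slide. Each pair is one classification instance.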
Requirements for extraction system • fast to develop • requested relation types can change • avoid expensive manual annotation • fine-grained relation types • e.g. a simple part-of relation is not sufficient • the trained system needs no structured data • several input sources (Wikipedia, blogs, Twitter, news) • German data
Wikipedia as resource • structured data • templates (e.g. infoboxes), links, categories, tables, lists • unstructured data • written text • high quality • many users • WikiBots • structured data can be used to annotate unstructured data → self-annotation
Self-annotation - example • structured data (article Gebroth) • Landkreis: Bad Kreuznach (county) • unstructured data • Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • resulting annotation • R1_2(Gebroth, Bad Kreuznach)
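The self-annotation example can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the structured data is available as a dictionary, that exact string matching suffices for this sentence, and that unmatched pairs receive a hypothetical negative label `"NONE"`.

```python
from itertools import combinations


def self_annotate(title, infobox, entities, relation="R1_2", field="Landkreis"):
    """Label every entity pair of a sentence using the article's structured data.

    A pair is a positive example if it equals (article title, infobox value);
    all other pairs get the assumed negative label "NONE".
    """
    target = (title, infobox.get(field))
    return {
        pair: relation if pair == target else "NONE"
        for pair in combinations(entities, 2)
    }


# Hypothetical, simplified structured data of the article "Gebroth".
infobox = {"Landkreis": "Bad Kreuznach"}
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]

labels = self_annotate("Gebroth", infobox, entities)
for pair, label in labels.items():
    print(pair, label)
```

Only (Gebroth, Bad Kreuznach) is labeled R1_2. The labeled pairs can then serve as training data for a supervised classifier without manual annotation.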
Self-annotation - challenges • infoboxes are not always filled in completely, correctly, or coherently • matching with unstructured data • pattern matching is not sufficient • orthographic variants • morphology • multi-word expressions • matching needs some manual adjustment • only one relation per article
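To illustrate why plain pattern matching falls short, here is a rough sketch of a normalizing matcher. The prefix list, the hyphen handling, and the genitive rule are assumptions chosen for illustration. The slides do not describe the actual matching rules, and real morphology needs more than a suffix check.

```python
import unicodedata

# Assumed administrative type words that may precede a name in an infobox value.
TYPE_PREFIXES = ("landkreis ", "kreis ", "gemeinde ")


def normalize(name):
    """Lower-case, unify hyphens and whitespace, and drop a leading type word."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = " ".join(text.replace("-", " ").split())
    for prefix in TYPE_PREFIXES:
        if text.startswith(prefix):
            return text[len(prefix):]
    return text


def matches(infobox_value, mention):
    """True if the mention equals the infobox value, allowing a genitive -s."""
    value, found = normalize(infobox_value), normalize(mention)
    return found == value or found == value + "s"


print(matches("Landkreis Bad Kreuznach", "Bad Kreuznach"))  # type word dropped
print(matches("Bad Kreuznach", "Bad Kreuznachs"))           # genitive form
print(matches("Rheinland-Pfalz", "Rheinland Pfalz"))        # hyphen variant
```

Rules like these are language-specific and need to be tuned per relation type, which is the manual adjustment mentioned in the slide.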
Extraction framework • UIMA (Unstructured Information Management Architecture) • pipeline architecture • easy exchange of components • fast development • extended components • CollectionReader for Wikipedia • linguistic annotation • supervised classifier
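The pipeline idea can be illustrated with a language-neutral sketch. This is not the UIMA API. It only mimics the principle that each component reads and extends a shared document object, so that components can be swapped without touching the others. All names here are hypothetical.

```python
def tokenizer(doc):
    """Add a whitespace tokenization of the text to the shared document."""
    doc["tokens"] = doc["text"].split()
    return doc


def capitalized_marker(doc):
    """Add all capitalized tokens as crude entity candidates."""
    doc["candidates"] = [tok for tok in doc["tokens"] if tok[:1].isupper()]
    return doc


def run_pipeline(doc, components):
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    {"text": "Gebroth ist eine Ortsgemeinde"},
    [tokenizer, capitalized_marker],
)
print(doc["candidates"])
```

Replacing `capitalized_marker` with a better annotator only requires changing the component list, which is the property the slide refers to as easy exchange of components.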
Extraction pipeline (figure) • inputs • German Wikipedia (structured data and unstructured text, accessed via JWPL) • GeoNames • text • UIMA pipeline • CollectionReader • FSPar-Annotator (FSPar engine) • Self-Annotation • ClearTK (MaxEnt classifier) • Consumer
Linguistic processing • FSPar engine (Schiehlen 2003) • tokenizer • PoS tagger (based on TreeTagger) • chunker • partial dependency parser
Supervised classification • extended ClearTK-Annotator • feature sets • F0: NE distance (baseline) • F1: window-based (PoS, lemma, window size = 2) • F2: chunks (parent chunks of NEs) • F3: dependency parse (paths between NEs) • MaxEnt classifier
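A minimal sketch of what feature sets F0 and F1 might look like for one entity pair. The exact feature definitions are not given in the slides, so the feature names, the distance measure (number of tokens between the two NEs), and the hand-assigned lemmas and PoS tags below are assumptions for illustration, not FSPar output.

```python
def extract_features(tokens, ne1, ne2, window=2):
    """Build F0 (distance) and F1 (window) features for an entity pair.

    tokens: list of (lemma, pos) tuples for the sentence.
    ne1, ne2: (start, end) token spans, end exclusive, with ne1 before ne2.
    """
    # F0: number of tokens between the two named entities.
    features = {"F0_distance": ne2[0] - ne1[1]}

    # F1: lemma and PoS of up to `window` tokens left and right of each NE.
    for name, (start, end) in (("ne1", ne1), ("ne2", ne2)):
        for offset in range(1, window + 1):
            for side, idx in (("L", start - offset), ("R", end - 1 + offset)):
                if 0 <= idx < len(tokens):
                    lemma, pos = tokens[idx]
                    features[f"F1_{name}_{side}{offset}_lemma"] = lemma
                    features[f"F1_{name}_{side}{offset}_pos"] = pos
    return features


# "Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach" (tags assumed).
tokens = [
    ("Gebroth", "NE"), ("sein", "VAFIN"), ("ein", "ART"),
    ("Ortsgemeinde", "NN"), ("in", "APPRART"), ("Landkreis", "NN"),
    ("Bad", "NE"), ("Kreuznach", "NE"),
]
features = extract_features(tokens, ne1=(0, 1), ne2=(6, 8))
for key, value in features.items():
    print(key, value)
```

A feature dictionary of this kind could be passed to any maximum entropy (logistic regression) learner. F2 and F3 would add chunk and dependency path features in the same way.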
Evaluation • 9000 articles about German municipalities and suburbs • 5300 articles for training • 1800 articles for development • 1800 articles for final evaluation • R1_2 relation is also available from the Federal Statistical Office of Germany • used to evaluate the self-annotation • 99.9 % (1 error in 1304 sentences)
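The reported figures can be checked with a few lines of arithmetic. This sketch assumes that the 99.9 % is the share of correctly self-annotated sentences. The variable names are illustrative.

```python
# Self-annotation quality: 1 error in 1304 checked sentences.
errors, sentences = 1, 1304
accuracy = 1 - errors / sentences
print(f"self-annotation accuracy: {accuracy:.1%}")

# Article counts per split as given on the slide.
splits = {"training": 5300, "development": 1800, "evaluation": 1800}
total = sum(splits.values())
print(f"articles in splits: {total}")
```

The splits add up to 8900 articles, so the 9000 on the slide is presumably a rounded figure.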
Conclusion • text is an important resource for context-aware systems • self-annotation • automatic annotation using structured data • Wikipedia is a valuable resource • structured and unstructured data • containing fine-grained relations • UIMA-based implementation • fine-grained geographical relation extraction is possible
Questions? • www.nexus.uni-stuttgart.de