Analyzing Wikipedia and Gold-Standard Corpora for NER Training. Nothman et al., 2009, EACL. William Y. Wang, Computer Science. CS 6998: NLP for the Web, Columbia University. 04/22/2010
Outline • Motivation • NER and Gold-Standard Corpora • The Problem: Cross-corpora Performance • Wikipedia for NER • Results • Conclusion and My Observations
Motivation • Manual annotation is "expensive": (1) costly, (2) time-consuming, (3) brings extra problems (e.g. annotation inconsistencies) • Can we use existing linguistic resources to create NER corpora automatically? • How well do NER models trained on one corpus perform on another (cross-corpora performance)? • How can we utilize Web resources (e.g. Wikipedia) to improve NER?
NER Gold Corpora • MUC-7: locations (LOC), organizations (ORG), personal names (PER) • CoNLL-03: LOC, ORG, PER, miscellaneous (MISC) • BBN: 54 fine-grained tags, annotated over the Penn Treebank
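Cross-corpora evaluation requires collapsing these differing tag sets into one shared scheme. A minimal sketch, using an illustrative (hypothetical, not the paper's actual) subset of a BBN-to-CoNLL mapping:

```python
# Illustrative (hypothetical) subset of a mapping from BBN's fine-grained
# tags to CoNLL-03's four coarse classes; a full mapping covers all 54 tags.
BBN_TO_CONLL = {
    "PERSON": "PER",
    "ORGANIZATION:CORPORATION": "ORG",
    "ORGANIZATION:GOVERNMENT": "ORG",
    "GPE:CITY": "LOC",
    "GPE:COUNTRY": "LOC",
    "NORP:NATIONALITY": "MISC",
    "WORK_OF_ART:BOOK": "MISC",
}

def to_conll(bbn_tag: str) -> str:
    """Collapse a BBN tag to a CoNLL class; unmapped tags fall back to O."""
    return BBN_TO_CONLL.get(bbn_tag, "O")
```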
Corpus and Error Analysis • N-gram tag variation: • Check the tags of all n-grams that appear multiple times, to see whether their NE tags are consistent • Entity type frequency: • (1) POS tags paired with NE tags • (e.g. nationalities often carry JJ or NNPS) • (2) Wordtypes (word-shape patterns) • (3) Wordtypes with function words (e.g. Bank of New England -> Aaa of Aaa Aaa) • Tag sequence confusion: • Examine the details of the confusion matrix
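A minimal sketch of the wordtype and n-gram consistency checks, assuming a small hypothetical function-word list and simple shape-collapsing rules:

```python
import re
from collections import defaultdict

# Assumed, non-exhaustive function-word list; these are kept verbatim
# in wordtype patterns rather than collapsed to a shape.
FUNCTION_WORDS = {"of", "the", "and", "for", "in"}

def wordtype(token: str) -> str:
    """Collapse a token to its shape: 'Bank' -> 'Aaa', 'of' -> 'of'."""
    if token.lower() in FUNCTION_WORDS:
        return token.lower()
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]+", "aa", shape)
    shape = re.sub(r"[0-9]+", "00", shape)
    return shape

def phrase_wordtype(phrase: str) -> str:
    # phrase_wordtype("Bank of New England") == "Aaa of Aaa Aaa"
    return " ".join(wordtype(tok) for tok in phrase.split())

def inconsistent_ngrams(tagged_ngrams):
    """tagged_ngrams: iterable of (ngram, ne_tag) pairs over the corpus.
    Returns n-grams that occur more than once with more than one NE tag."""
    tags, counts = defaultdict(set), defaultdict(int)
    for ngram, tag in tagged_ngrams:
        tags[ngram].add(tag)
        counts[ngram] += 1
    return {g: ts for g, ts in tags.items() if counts[g] > 1 and len(ts) > 1}
```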
Using Wikipedia to Build NER Corpus • 1. Classify all articles into entity classes • 2. Split Wikipedia articles into sentences • 3. Label NEs according to link targets • 4. Select sentences for inclusion in a corpus
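A minimal sketch of steps 3 and 4, assuming a sentence is a list of (token, link_target) pairs and article_class is the title-to-class dict produced by step 1; the selection rule is a simplification of the paper's requirement that all capitalized words be labelable:

```python
def label_sentence(sentence, article_class):
    """Step 3: derive BIO NE labels from Wikipedia link targets.
    sentence: list of (token, link_target-or-None) pairs (assumed format);
    article_class: article title -> PER/LOC/ORG/MISC from step 1."""
    labels, prev = [], None
    for token, target in sentence:
        cls = article_class.get(target) if target else None
        if cls is None:
            labels.append("O")
            prev = None
        else:
            labels.append(("I-" if target == prev else "B-") + cls)
            prev = target
    return labels

def keep_sentence(sentence, labels):
    """Step 4 (simplified): discard sentences with capitalized,
    non-sentence-initial tokens left unlabeled, since these suggest
    entities that the links failed to cover."""
    for i, ((token, _), label) in enumerate(zip(sentence, labels)):
        if i > 0 and token[:1].isupper() and label == "O":
            return False
    return True
```

Rejected sentences would otherwise inject false O labels into training, which is why selection matters as much as labeling.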
Improving Wikipedia NER • Baseline: 58.9% and 62.3% on CoNLL and BBN • 1. Infer extra links using Wikipedia disambiguation pages • 2. Personal titles: not all preceding titles indicate PER • (e.g. Prime Minister of Australia) • 3. Previously missed JJ entities (e.g. American / MISC) • 4. Miscellaneous changes
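A minimal sketch of point 1, assuming an alias_targets dict (alias string -> candidate article titles, extracted from disambiguation pages elsewhere); only aliases whose candidates all share one entity class are used, so unlinked mentions can be labeled without real disambiguation:

```python
def build_alias_classes(alias_targets, article_class):
    """alias_targets: alias -> iterable of candidate article titles
    (assumed to come from disambiguation pages). Keep only aliases
    whose known candidates agree on a single entity class."""
    alias_classes = {}
    for alias, targets in alias_targets.items():
        classes = {article_class[t] for t in targets if t in article_class}
        if len(classes) == 1:
            alias_classes[alias] = classes.pop()
    return alias_classes
```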
Results • DEV set results reported (higher than, but similar in pattern to, the test set results)
Conclusion • The choice of NER training corpus has a huge impact on performance on its corresponding test set • Annotation-free Wikipedia-derived NER corpora were created • The Wikipedia data performs better on the cross-corpora NER task • Still much room for improvement
Comments • What I like about this paper: • The scope of this paper is unique (analogy: cross-cultural studies) • It utilizes novel linguistic resources to solve basic NLP problems • Good results • Relatively clear and easy to understand • What I don't like about this paper: • The overall method for improving Wikipedia NER training is not a principled approach
Overall Assessment: 8/10