
CS 6998 NLP for the Web Columbia University 04/22/2010

Analyzing Wikipedia and Gold-Standard Corpora for NER Training. Nothman et al. 2009, EACL. William Y. Wang Computer Science. CS 6998 NLP for the Web Columbia University 04/22/2010. Outline. Motivation NER and Gold-Standard Corpora The Problem: Cross-corpora Performance Wikipedia for NER


Presentation Transcript


  1. Analyzing Wikipedia and Gold-Standard Corpora for NER Training • Nothman et al. 2009, EACL • William Y. Wang, Computer Science • CS 6998 NLP for the Web, Columbia University, 04/22/2010

  2. Outline • Motivation • NER and Gold-Standard Corpora • The Problem: Cross-corpora Performance • Wikipedia for NER • Results • Conclusion and My Observation

  3. Motivation • Manual annotation is costly: (1) expense (2) time (3) extra problems • Can we use linguistic resources to create an NER corpus automatically? • What is the cross-corpus NER performance? • How can we utilize Web resources (e.g. Wikipedia) to improve NER?

  4. NER Gold Corpora • MUC-7: locations (LOC), organizations (ORG), personal names (PER) • CoNLL-03: LOC, ORG, PER, miscellaneous (MISC) • BBN: 54 entity tags annotated over the Penn Treebank

  5. Problem: Poor Cross-corpus Performance

  6. Corpus and Error Analysis • N-gram tag variation: check the tags of all n-grams that appear multiple times to see whether their NE tags are consistent • Entity type frequency: (1) POS tag paired with NE tag (e.g. nationalities are often tagged JJ or NNPS) (2) Wordtypes (3) Wordtypes with function words kept as-is (e.g. Bank of New England -> Aaa of Aaa Aaa) • Tag sequence confusion: look at the confusion matrix in detail
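The two analyses on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: the function names, the `FUNCTION_WORDS` list, and the rule of capping each run of same-class characters at two are assumptions, the last one chosen only because it reproduces the slide's example (Bank -> Aaa).

```python
from collections import defaultdict

# Assumed, illustrative list of function words; the slide does not give one.
FUNCTION_WORDS = {"of", "the", "and", "for", "in"}


def wordtype(token, keep_function_words=False):
    """Map a token to a coarse shape: uppercase -> A, lowercase -> a, digit -> 0.

    Runs of the same class are capped at two characters (an assumption made
    to match the slide's example).
    """
    if keep_function_words and token.lower() in FUNCTION_WORDS:
        return token
    out = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "0"
        else:
            cls = ch  # punctuation is kept unchanged
        if len(out) >= 2 and out[-1] == cls and out[-2] == cls:
            continue  # run already has two characters of this class
        out.append(cls)
    return "".join(out)


def tag_variation(sentences, n=2):
    """Return n-grams that occur with more than one NE tag sequence.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    """
    seen = defaultdict(set)
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            tokens = tuple(tok for tok, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[tokens].add(tags)
    return {toks: tagsets for toks, tagsets in seen.items() if len(tagsets) > 1}


shape = " ".join(wordtype(t, keep_function_words=True)
                 for t in "Bank of New England".split())

toy_corpus = [
    [("New", "LOC"), ("York", "LOC"), ("wins", "O")],
    [("New", "ORG"), ("York", "ORG"), ("Times", "ORG")],
]
inconsistent = tag_variation(toy_corpus, n=2)
```

On the toy corpus, the bigram "New York" is flagged because it appears once as LOC LOC and once as ORG ORG, which is the kind of inconsistency the n-gram check is meant to surface.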

  7. Using Wikipedia to Build NER Corpus • 1. Classify all articles into entity classes • 2. Split Wikipedia articles into sentences • 3. Label NEs according to link targets • 4. Select sentences for inclusion in a corpus
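Steps 3 and 4 can be illustrated with a small Python sketch. Everything here is hypothetical: the data layout (token spans paired with link target titles), the `article_class` dictionary standing in for the output of step 1, and in particular the selection rule in `keep_sentence`, which is an assumed stand-in criterion since the slide does not state how sentences are selected.

```python
def label_sentence(tokens, links, article_class):
    """Tag tokens with the entity class of the article each link points to.

    tokens: list of str
    links: list of (start, end, target_title) token spans, end exclusive
    article_class: dict mapping article title -> entity class (from step 1)
    """
    tags = ["O"] * len(tokens)
    for start, end, target in links:
        cls = article_class.get(target)
        if cls is None:
            continue  # target article was not classified; leave tokens as O
        for i in range(start, end):
            tags[i] = cls
    return list(zip(tokens, tags))


def keep_sentence(tagged):
    """Assumed selection rule: reject a sentence if any capitalized token
    after the first position is untagged, since it may be a missed entity."""
    for i, (tok, tag) in enumerate(tagged):
        if i > 0 and tok[:1].isupper() and tag == "O":
            return False
    return True


tokens = ["Holden", "is", "based", "in", "Australia", "."]
links = [(0, 1, "Holden"), (4, 5, "Australia")]

full = label_sentence(tokens, links, {"Holden": "ORG", "Australia": "LOC"})
partial = label_sentence(tokens, links, {"Holden": "ORG"})
```

With both link targets classified, the sentence is fully labeled and kept; when "Australia" has no class, the capitalized token stays untagged and the sentence is dropped under the assumed rule.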

  8. Improve Wikipedia NER • Baseline: 58.9% on CoNLL and 62.3% on BBN • 1. Inferring extra links using Wikipedia disambiguation pages • 2. Personal titles: not all preceding titles indicate PER (e.g. Prime Minister of Australia) • 3. Previously missed JJ entities (e.g. American / MISC) • 4. Miscellaneous changes

  9. Results • DEV set results (higher than, but similar to, the test set results)

  10. Conclusion • The choice of NER training corpus has a huge impact on performance on its corresponding test set • Annotation-free Wikipedia NER corpora were created • Wikipedia data performs better in the cross-corpus NER task • Still much room for improvement

  11. Comments • What I like about this paper: • The scope of this paper is unique (analogy: cross-cultural studies) • Utilizing novel linguistic resources to solve basic NLP problems • Good results • Relatively clear and easy to understand • What I don’t like about this paper: • The overall method for improving Wikipedia NER training is not a principled approach

  12. Overall Assessment: 8/10

  13. Thank you!
