The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition • Gina-Anne Levow • Fifth SIGHAN Workshop • July 22, 2006
Roadmap • Bakeoff Task Motivation • Bakeoff Structure: • Materials and annotations • Tasks and conditions • Participants and timeline • Results & Discussion: • Word Segmentation • Named Entity Recognition • Observations & Conclusions • Thanks
Bakeoff Task Motivation • Core enabling technologies for Chinese language processing • Word segmentation (WS) • Crucial tokenization step in the absence of whitespace • Supports POS tagging, parsing, reference resolution, etc. • Fundamental challenges: • “Word” is not well or consistently defined; humans disagree • Unknown words impede performance • Named Entity Recognition (NER) • Essential for reference resolution, IR, etc. • Common class of new unknown words
Data Source Characterization • Five corpora, providers • Annotation guidelines available, varied • Simplified and traditional characters • Range of encodings, all available in Unicode (UTF-8) • Provided in common XML, converted to train/test form (LDC)
Tasks and Tracks • Tasks: • Word Segmentation: • Training and truth: whitespace-delimited • End-of-word tags replaced with spaces; no other tags kept • Named Entity Recognition: • Training and truth: similar to CoNLL 2-column format • NAMEX only: LOC, PER, ORG (LDC: +GPE) • Tracks: • Closed: Only provided materials may be used • Open: Any materials may be used, but they must be documented
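The two data layouts named above can be read with a few lines of Python. This is a minimal sketch, not the bakeoff's own tooling: the function names, the sample sentences, and the `B-LOC`/`I-LOC`/`O` tag spelling are assumptions made for illustration, since the slide only says the NER files are similar to the CoNLL 2-column format.

```python
from typing import List, Tuple


def read_segmented_line(line: str) -> List[str]:
    """Split one whitespace-delimited training/truth line into words."""
    return line.split()


def read_two_column(block: str) -> List[Tuple[str, str]]:
    """Parse one sentence laid out as 'token tag', one pair per line.

    The tag inventory is whatever the file contains; the B-/I-/O
    spelling used in the example below is an assumption.
    """
    pairs = []
    for row in block.strip().splitlines():
        token, tag = row.split()
        pairs.append((token, tag))
    return pairs


# Hypothetical sample data, not taken from any bakeoff corpus.
words = read_segmented_line("我们 在 北京 工作")
sentence = read_two_column("北 B-LOC\n京 I-LOC\n很 O\n大 O")
```

Keeping the reader functions this thin makes it easy to swap in a different tag scheme or tokenization unit once the actual corpus files are inspected.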
Structure: Participants & Timeline • Participants: • 29 sites submitted runs for evaluation (36 initially registered) • 144 runs submitted: ~2/3 WS; ~1/3 NER • Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan, 1 each: Singapore, Korea, Hong Kong, Canada • Mix of commercial (MSRA, Yahoo!, Alias-I, FR Telecom, etc.) and academic sites • Timeline: • March 15: Registration open • April 17: Training data released • May 15: Test data released • May 17: Results due
Word Segmentation: Results • Contrasts: Left-to-right maximal match • Baseline: Uses only training vocabulary • Topline: Uses only testing vocabulary
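A left-to-right maximal match segmenter of the kind used for these contrast runs can be sketched as follows. This is an illustrative version under stated assumptions, not the organizers' scoring script: the function name `max_match`, the single-character fallback for unmatched text, and the toy vocabulary are all hypothetical. Passing the training vocabulary gives a baseline-style run; passing the test vocabulary gives a topline-style run.

```python
from typing import List, Set


def max_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation using the longest vocabulary match."""
    # Longest vocabulary entry bounds the search window (at least 1).
    max_len = max([len(w) for w in vocab] + [1])
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            # Assumption: an unmatched character is emitted as its own word.
            if j - i == 1 or text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
    return words


# Hypothetical toy vocabulary standing in for the training word list.
train_vocab = {"北京", "大学", "北京大学", "生"}
segmented = max_match("北京大学生", train_vocab)
```

Because the match is greedy, the example yields 北京大学 followed by 生 rather than 北京 / 大学生, which shows the kind of error a vocabulary-only method makes.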
Word Segmentation: CityU • Results panels: CityU Closed; CityU Open
Word Segmentation: CKIP • Results panels: CKIP Closed; CKIP Open
Word Segmentation: MSRA • Results panels: MSRA Closed; MSRA Open
Word Segmentation: UPUC • Results panels: UPUC Closed; UPUC Open
Word Segmentation: Overview • F-scores: 0.481-0.797 • Best score: MSRA Open Task (FR Telecom) • Best relative to topline: CityU Open: >99% • Most frequent top rank: MSRA • Both F-scores and OOV recall higher in Open • Overall good results: Most outperform baseline
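For readers unfamiliar with how segmentation F-scores are computed, here is a minimal sketch. It assumes the usual word-level definition, where a predicted word counts as correct only if both of its boundaries match the truth; the function names and the toy sentence are hypothetical, and the official scorer may differ in detail.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into (start, end) character offsets."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def score_segmentation(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return word-level precision, recall, and balanced F-score."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)  # words whose two boundaries both match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: the truth has three words, the system output two.
precision, recall, f_score = score_segmentation(
    ["北京", "大学", "生"], ["北京大学", "生"]
)
```

Here only 生 matches on both boundaries, so precision is 1/2, recall is 1/3, and the F-score is 0.4. A score "relative to topline" is then simply a system's F-score divided by the topline's F-score on the same corpus.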
Word Segmentation: Discussion • Continuing OOV challenges • Highest F-scores on MSRA • Also highest toplines and baselines • Lowest OOV rate • Lowest F-scores on UPUC • Also lowest toplines and baselines • Highest OOV rate (more than double that of any other corpus) • Smallest corpus (~1/3 the size of MSRA) • Best scores: most consistent corpus • Vocabulary, annotation • UPUC also varies in genre: train: CTB; test: CTB, NW, BN
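The OOV rate and OOV recall discussed here can be illustrated with a short sketch. It assumes the common definitions: OOV rate is the share of test word tokens absent from the training vocabulary, and OOV recall is the share of those tokens that the system segments with both boundaries correct. The function name and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def oov_rate_and_recall(
    train_vocab: Set[str], gold: List[str], pred: List[str]
) -> Tuple[float, float]:
    """Return (OOV rate, OOV recall) for one gold/predicted segmentation."""
    # Character offsets of every predicted word.
    pred_spans = set()
    start = 0
    for w in pred:
        pred_spans.add((start, start + len(w)))
        start += len(w)

    # Offsets of gold words that never occurred in training.
    oov_spans = []
    start = 0
    for w in gold:
        if w not in train_vocab:
            oov_spans.append((start, start + len(w)))
        start += len(w)

    rate = len(oov_spans) / len(gold) if gold else 0.0
    hits = sum(1 for span in oov_spans if span in pred_spans)
    # Assumption: recall is reported as 0.0 when there are no OOV words.
    recall = hits / len(oov_spans) if oov_spans else 0.0
    return rate, recall


# Hypothetical example: 生 is the only word missing from training.
rate, oov_recall = oov_rate_and_recall(
    {"北京", "大学"}, ["北京", "大学", "生"], ["北京", "大学生"]
)
```

In this example one of three gold words is OOV, and the system merges it into 大学生, so the OOV rate is 1/3 and the OOV recall is 0.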
NER Results • Contrast: Baseline • Label as Named Entity if unique tag in training
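One way to read this baseline is sketched below. It is an interpretation, not the organizers' implementation: it works on individual tokens, treats a token as taggable only if it carried exactly one tag everywhere in training, and labels everything else `O`. The function names, the flat `LOC`/`PER`/`O` tags, and the toy data are assumptions; the actual baseline may operate on whole entity strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def train_unique_tags(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each token to its tag if it had exactly one tag in training."""
    seen = defaultdict(set)
    for token, tag in pairs:
        seen[token].add(tag)
    return {tok: next(iter(tags)) for tok, tags in seen.items() if len(tags) == 1}


def baseline_tag(tokens: List[str], unique: Dict[str, str], default: str = "O") -> List[str]:
    """Tag tokens by lookup; unseen or ambiguous tokens get the default tag."""
    return [unique.get(tok, default) for tok in tokens]


# Hypothetical training pairs: 李 is ambiguous, 北京 is always LOC.
training = [("北京", "LOC"), ("李", "PER"), ("李", "O"), ("的", "O")]
unique = train_unique_tags(training)
tags = baseline_tag(["李", "在", "北京"], unique)
```

In the example, 北京 is tagged `LOC`, while the ambiguous 李 and the unseen 在 both fall back to `O`.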
NER Results: CityU • Results panels: CityU Closed; CityU Open
NER Results: LDC • Results panels: LDC Closed; LDC Open
NER Results: MSRA • Results panels: MSRA Closed; MSRA Open
NER: Overview • Overall results: • Best F-score: MSRA Open Track: 0.91 • Strong overall performance: • Only two results below baseline • Direct comparison of NER Open vs Closed • Difficult: only two sites performed both tracks • Only MSRA had large numbers of runs • Here Open outperformed Closed: top 3 Open > Closed
NER Observations • Named Entity Recognition challenges • Tagsets, variation, and corpus size • Results on MSRA/CityU much better than LDC • LDC corpus substantially smaller • Also larger tagset: GPE • GPE easily confused for ORG or LOC • NER results sensitive to corpus size, tagset, genre
Conclusions & Future Challenges • Strong, diverse participation in WS & NER • Many effective competitive results • Cross-task, cross-evaluation comparisons • Still difficult • Scores sensitive to corpus size, annotation consistency, tagset, genre, etc • Need corpus, config-independent measure of progress • Encourage submissions that support comparisons • Extrinsic, task-oriented evaluation of WS/NER • Continuing challenges: OOV, annotation consistency, encoding combinations and variation, code-switching
Thanks • Data Providers: • Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: • Keh-Jiann Chen, Henning Chiu • City University of Hong Kong: • Benjamin K. Tsou, Olivia Oi Yee Kwong • Linguistic Data Consortium: Stephanie Strassel • Microsoft Research Asia: Mu Li • University of Pennsylvania/University of Colorado: • Martha Palmer, Nianwen Xue • Workshop co-chairs: • Hwee Tou Ng and Olivia Oi Yee Kwong • All participants!