The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition • Gina-Anne Levow • Fifth SIGHAN Workshop • July 22, 2006
Roadmap • Bakeoff Task Motivation • Bakeoff Structure: • Materials and annotations • Tasks and conditions • Participants and timeline • Results & Discussion: • Word Segmentation • Named Entity Recognition • Observations & Conclusions • Thanks
Bakeoff Task Motivation • Core enabling technologies for Chinese language processing • Word segmentation (WS) • Crucial tokenization in the absence of whitespace • Supports POS tagging, parsing, reference resolution, etc. • Fundamental challenges: • "Word" is not well or consistently defined; humans disagree • Unknown words impede performance • Named Entity Recognition (NER) • Essential for reference resolution, IR, etc. • A common class of new, unknown words
Data Source Characterization • Five corpora from five providers • Annotation guidelines available, but varied across providers • Both simplified and traditional characters • Range of encodings; all available in Unicode (UTF-8) • Provided in a common XML format, converted to train/test form (LDC)
Tasks and Tracks • Tasks: • Word Segmentation: • Training and truth: whitespace-delimited • End-of-word tags replaced with a space; no other tags • Named Entity Recognition: • Training and truth: similar to the CoNLL 2-column format • NAMEX only: LOC, PER, ORG (LDC: +GPE) • Tracks: • Closed: only the provided materials may be used • Open: any materials may be used, but must be documented
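As a rough illustration of these two file formats, here is a minimal Python sketch that reads a whitespace-segmented sentence and a CoNLL-style 2-column NER block. The sample strings and the B-/I- tag labels are invented for illustration; this is not the bakeoff's distribution tooling.

```python
# Illustrative readers for the two data formats; the sample strings and the
# B-/I- tag labels below are invented, not taken from bakeoff data.

def read_segmented_line(line):
    """Word segmentation data: one sentence per line, words separated by whitespace."""
    return line.split()

def read_conll_ner(lines):
    """NER data: CoNLL-style 2-column lines (token, tag); blank lines end sentences."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # sentence boundary
        token, tag = line.split()
        pairs.append((token, tag))
    return pairs

print(read_segmented_line("我们 在 北京 开会"))                    # hypothetical training line
print(read_conll_ner(["我 O", "们 O", "北 B-LOC", "京 I-LOC"]))   # hypothetical NER lines
```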
Structure: Participants & Timeline • Participants: • 29 sites submitted runs for evaluation (36 initially registered) • 144 runs submitted: ~2/3 WS, ~1/3 NER • Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan, 1 each from Singapore, Korea, Hong Kong, and Canada • Mix of commercial (MSRA, Yahoo!, Alias-I, FR Telecom, etc.) and academic sites • Timeline: • March 15: registration opened • April 17: training data released • May 15: test data released • May 17: results due
Word Segmentation: Results • Contrast conditions: left-to-right maximal-match segmentation • Baseline: uses only the training vocabulary • Topline: uses only the test vocabulary
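A minimal sketch of how such a left-to-right maximal-match contrast segmenter can be implemented with a set-based dictionary; the function names, the word-length cap, and the toy vocabulary are mine, not the bakeoff's reference code.

```python
# Greedy left-to-right maximal match: at each position take the longest
# dictionary word that matches; if none matches, emit a single character.

def maximal_match(text, vocab, max_word_len=10):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

# Baseline uses the training-data vocabulary; topline uses the (oracle)
# vocabulary of the test data.
train_vocab = {"北京", "大学", "北京大学"}            # toy vocabulary
print(maximal_match("北京大学在北京", train_vocab))   # ['北京大学', '在', '北京']
```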
Word Segmentation: CityU • [Results figures for CityU Closed and CityU Open not reproduced]
Word Segmentation: CKIP • [Results figures for CKIP Closed and CKIP Open not reproduced]
Word Segmentation: MSRA • [Results figures for MSRA Closed and MSRA Open not reproduced]
Word Segmentation: UPUC • [Results figures for UPUC Closed and UPUC Open not reproduced]
Word Segmentation: Overview • F-scores: 0.481-0.797 • Best score: MSRA Open task (FR Telecom) • Best relative to topline: CityU Open, >99% of topline • Most frequent top rank: MSRA • Both F-scores and OOV recall are higher in the Open track • Overall good results: most systems outperform the baseline
Word Segmentation: Discussion • Continuing OOV challenges • Highest F-scores on MSRA • Also the highest topline and baseline • Lowest OOV rate • Lowest F-scores on UPUC • Also the lowest topline and baseline • Highest OOV rate (more than double that of any other corpus) • Smallest corpus (~1/3 the size of MSRA) • Best scores on the most consistent corpora (vocabulary and annotation) • UPUC also varies in genre: training is CTB; test includes CTB, newswire, and broadcast news
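For concreteness, a sketch of word-level scoring in this style: precision, recall, and F over words identified by their character spans, plus recall restricted to OOV words (gold words absent from the training vocabulary). This is an illustrative reimplementation, not the official bakeoff score script.

```python
# Word-level scoring sketch: P/R/F over (start, end) character spans of words,
# plus OOV recall (recall over gold words not in the training vocabulary).

def ws_scores(gold_words, pred_words, train_vocab):
    gold_spans, pos = {}, 0
    for w in gold_words:
        gold_spans[(pos, pos + len(w))] = w
        pos += len(w)
    pred_spans, pos = set(), 0
    for w in pred_words:
        pred_spans.add((pos, pos + len(w)))
        pos += len(w)
    correct = set(gold_spans) & pred_spans
    precision = len(correct) / len(pred_spans) if pred_spans else 0.0
    recall = len(correct) / len(gold_spans) if gold_spans else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    oov = {span for span, w in gold_spans.items() if w not in train_vocab}
    oov_recall = len(oov & correct) / len(oov) if oov else 0.0
    return precision, recall, f, oov_recall

# Toy example: gold and predicted segmentations of the same sentence.
print(ws_scores(["北京大学", "在", "北京"], ["北京", "大学", "在", "北京"], {"北京", "在"}))
```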
NER Results • Contrast: baseline tagger • Labels a token as a named entity if it has a unique NE tag in the training data
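One way to read this baseline: build a lexicon of tokens that always carried the same single non-O tag in training, and tag everything else O. A sketch under that interpretation; tag names and the toy data are illustrative.

```python
# NER contrast baseline (as I read it): a token gets a named-entity tag at test
# time only if it always carried that single non-O tag in training.

from collections import defaultdict

def build_unique_tag_lexicon(training_pairs):
    tags_seen = defaultdict(set)
    for token, tag in training_pairs:
        tags_seen[token].add(tag)
    # Keep only tokens annotated with exactly one tag, and that tag is not O.
    return {tok: tags.pop() for tok, tags in tags_seen.items()
            if len(tags) == 1 and "O" not in tags}

def baseline_tag(tokens, lexicon):
    return [lexicon.get(tok, "O") for tok in tokens]

train = [("京", "I-LOC"), ("美", "B-LOC"), ("国", "I-LOC"), ("国", "O")]  # toy data
lexicon = build_unique_tag_lexicon(train)          # {"京": "I-LOC", "美": "B-LOC"}
print(baseline_tag(["美", "国", "京"], lexicon))   # ['B-LOC', 'O', 'I-LOC']
```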
NER Results: CityU • [Results figures for CityU Closed and CityU Open not reproduced]
NER Results: LDC • [Results figures for LDC Closed and LDC Open not reproduced]
NER Results: MSRA • [Results figures for MSRA Closed and MSRA Open not reproduced]
NER: Overview • Overall results: • Best F-score: MSRA Open track, 0.91 • Strong overall performance: only two results fall below the baseline • Direct comparison of NER Open vs. Closed is difficult: only two sites participated in both tracks • Only MSRA had a large number of runs • There, Open outperformed Closed: the top 3 Open runs scored above the Closed runs
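When comparing runs like this, entity-level F-score is the usual measure: extract entity spans from the tag sequences and score exact matches. A sketch assuming BIO-style tags (the BIO scheme is an assumption about the column format, and this is not the official scorer):

```python
# Entity-level scoring sketch: extract (start, end, type) spans from BIO-style
# tag sequences and compute exact-match precision/recall/F.

def extract_entities(tags):
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        boundary = tag == "O" or tag.startswith("B-") or \
                   (tag.startswith("I-") and etype != tag[2:])
        if boundary and start is not None:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return entities

def ner_f1(gold_tags, pred_tags):
    g, p = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]   # toy sequences
pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(ner_f1(gold, pred))   # 0.5: one of the two gold entities matched exactly
```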
NER Observations • Named Entity Recognition challenges: tagsets, variation, and corpus size • Results on MSRA/CityU are much better than on LDC • The LDC corpus is substantially smaller • It also has a larger tagset, adding GPE • GPE is easily confused with ORG or LOC • NER results are sensitive to corpus size, tagset, and genre
Conclusions & Future Challenges • Strong, diverse participation in WS & NER • Many effective, competitive results • Cross-task, cross-evaluation comparisons remain difficult • Scores are sensitive to corpus size, annotation consistency, tagset, genre, etc. • Need a corpus- and configuration-independent measure of progress • Encourage submissions that support comparisons • Extrinsic, task-oriented evaluation of WS/NER • Continuing challenges: OOV words, annotation consistency, encoding combinations and variation, code-switching
Thanks • Data providers: • Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: • Keh-Jiann Chen, Henning Chiu • City University of Hong Kong: • Benjamin K. Tsou, Olivia Oi Yee Kwong • Linguistic Data Consortium: Stephanie Strassel • Microsoft Research Asia: Mu Li • University of Pennsylvania/University of Colorado: • Martha Palmer, Nianwen Xue • Workshop co-chairs: • Hwee Tou Ng and Olivia Oi Yee Kwong • All participants!