The Second International Chinese Word Segmentation Bakeoff Coordinated by Thomas Emerson
Roadmap • Contest Details • Corpora, Tracks, and Sites • Results • Baselines and Measures • Discussion • Thanks
Corpora • Four Corpora: 2 simplified chars, 2 traditional • All provide ground truth and segmentation standard
Tracks and Sites • Two tracks: • Open: Participants may use any data to train (external lexica, POS information, etc.) • Closed: Sites may ONLY use the training data set • 23 participating sites completed the bakeoff • 9 PRC, 4 HK, 4 US, 2 TW, 1 GB, 1 JP, 1 SG • 130 runs submitted
Results • Baseline: L-to-R MaxMatch w/training vocab: 0.83-0.93 • Topline: L-to-R MaxMatch w/test truth vocab: 0.99 • Measures: Recall, Precision, F-measure • Recall on OOV, Recall on in-vocab • Best F-score: Open 0.972, median 0.941 • Best closed: 0.964 (on MSR corpus) • Best OOV recall: Open 0.872; Closed 0.813 • Vs 2003: best F-score: 0.961: now 17 reach
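As an illustration of the baseline and measures named on this slide, here is a minimal Python sketch of left-to-right maximum matching and of word-level precision, recall, F-measure, and OOV recall. The function names (`max_match`, `score`, `oov_recall`) and the toy vocabulary are assumptions made for this example; the bakeoff's official scoring script may differ in its details.

```python
from typing import List, Set, Tuple


def max_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right maximum matching.

    At each position take the longest vocabulary word that matches;
    if none matches, emit a single character.
    """
    max_len = max((len(w) for w in vocab), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    out = set()
    start = 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def score(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and balanced F-measure."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)  # a word counts only if both boundaries match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def oov_recall(gold: List[str], pred: List[str], vocab: Set[str]) -> float:
    """Recall restricted to gold words that are absent from the vocabulary."""
    p = spans(pred)
    start = total = hit = 0
    for w in gold:
        span = (start, start + len(w))
        start += len(w)
        if w not in vocab:
            total += 1
            hit += span in p
    return hit / total if total else 0.0


# Toy example (hypothetical vocabulary, not bakeoff data).
vocab = {"研究", "研究生", "生命", "命", "起源"}
gold = ["研究", "生命", "起源"]
pred = max_match("研究生命起源", vocab)
print(pred)               # ['研究生', '命', '起源']
print(score(gold, pred))  # precision, recall, and F are each 1/3 here
```

In the toy sentence the greedy match takes 研究生 first and so misplaces two of the three word boundaries, which is the kind of error that keeps the training-vocabulary baseline below the best submitted systems. Building `vocab` from the test set's own ground truth instead corresponds to the topline configuration on the slide.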
Results • AS Closed: NAIST, Stanford • AS Open: SG, Yahoo!, Sheffield • MSR Closed: Stanford, UHK, Yahoo! • MSR Open: Harbin, SG, UHK
Thanks & Future • Thanks to participants and providers • Academia Sinica, ICL Beijing, CUHK, MSRA • Future Bakeoffs: • Different training/test registers? • Additional tasks? NER? • Suggestions?