How Large a Corpus do We Need: Statistical Method vs. Rule-based Method • Hai Zhao, Yan Song and Chunyu Kit • Department of Computer Science and Engineering, Shanghai Jiao Tong University, China • zhaohai@cs.sjtu.edu.cn • 2010.05.20
Motivation • If corpus scale is the only factor that affects learning performance, • then how large an annotated corpus do we need to reach a given value of a specific performance metric?
Zipf’s Law • Most words occur only rarely, so data sparseness becomes serious
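Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, which is why a large share of the vocabulary is seen only once or twice. The sketch below is only an illustration of that effect and is not taken from the slides: the function names `rank_frequency` and `singleton_ratio` and the English toy sentence are assumptions made for the example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def singleton_ratio(tokens):
    """Fraction of distinct words that occur exactly once in the tokens."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Hypothetical toy data, only to show the shape of the output.
toy_tokens = "the cat saw the dog and the dog saw a bird".split()

for rank, word, count in rank_frequency(toy_tokens):
    print(rank, word, count)
print("share of words seen once:", singleton_ratio(toy_tokens))
```

On a real corpus, plotting count against rank on log-log axes should give a roughly straight line if the data follow Zipf's law.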
Choosing the Task: Chinese Word Segmentation • A special case of tokenization in natural language processing (NLP), needed for the many languages that have no explicit word delimiters such as spaces. • Original: • 她来自苏格兰 • She comes from SU GE LAN (Meaningless!) • Segmented: • 她/来/自/苏格兰 • She comes from Scotland. (Meaningful!)
Why the Task (CWS) • A simple task • Both statistical and rule-based methods are available for this task • Multiple standard large-scale annotated corpora are available, too • A word-oriented task, which is exactly the level that Zipf’s law is concerned with
Performance Metric • Evaluation metric, F-score: F = 2RP/(R+P) • R: recall, the proportion of correctly segmented words among all words in the gold-standard segmentation • P: precision, the proportion of correctly segmented words among all words in the segmenter’s output
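The metric above can be sketched in a few lines. This is a minimal illustration, not the official scoring script used in the experiments: it assumes a word counts as correctly segmented when its character span matches a span in the gold standard exactly, and the helper names `to_spans` and `segmentation_prf` are made up for the example.

```python
def to_spans(words):
    """Convert a word sequence to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Return (precision, recall, F-score) for one segmented sentence."""
    gold = to_spans(gold_words)
    predicted = to_spans(predicted_words)
    correct = len(gold & predicted)  # words with identical boundaries
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    if recall + precision == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * recall * precision / (recall + precision)


# Gold segmentation from the slide, against a hypothetical system output.
gold = ["她", "来", "自", "苏格兰"]
predicted = ["她", "来自", "苏格兰"]
p, r, f = segmentation_prf(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Here two of the three predicted words match the gold boundaries, so P = 2/3, R = 2/4 and F = 4/7.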
Data Sets and Approaches • Corpus sizes are given in number of characters • Approaches • CRFs as the statistical method: learning from an annotated corpus • Forward maximal matching algorithm (FMM) as the rule-based method: performs segmentation based on a predefined lexicon • Comparable: • The FMM lexicon is extracted from the same annotated corpus that the CRFs are trained on
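Forward maximal matching can be sketched as follows. This is a common textbook formulation and not necessarily the exact variant used in the experiments: it assumes the lexicon is simply the set of distinct words in the segmented training corpus, and that a character with no lexicon match is emitted as a single-character word. The names `build_lexicon` and `fmm_segment` are chosen for illustration.

```python
def build_lexicon(segmented_sentences):
    """Collect the distinct words of a segmented (annotated) corpus."""
    return {word for sentence in segmented_sentences for word in sentence}


def fmm_segment(text, lexicon, max_word_len=None):
    """Segment text greedily, left to right, taking the longest lexicon match."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    max_word_len = max(1, max_word_len)  # always allow single characters

    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is accepted even if it is not in the lexicon.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Lexicon extracted from the segmented example on the earlier slide.
lexicon = build_lexicon([["她", "来", "自", "苏格兰"]])
print("/".join(fmm_segment("她来自苏格兰", lexicon)))
```

Because the matcher knows nothing beyond the lexicon, its quality depends entirely on lexicon coverage, which is what makes lexicon size a natural counterpart of corpus size for the statistical method.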
Data Splitting • Overcome data sparseness by splitting the training corpus
CRFs Performance vs. Corpus Scale • Exponential enlargement of the corpus gives only linear performance improvement
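One way to write this observation down, as a sketch only: if the F-score grows linearly while the corpus size s grows exponentially, then F is approximately linear in log s. The constants a and b below are hypothetical fitted parameters that do not appear on the slides, and F* is a target F-score.

```latex
% Assumed log-linear form; a > 0 and b are hypothetical fitted constants.
\[
  F(s) \approx a \log s + b
  \quad\Longrightarrow\quad
  s(F^{*}) \approx \exp\!\left(\frac{F^{*} - b}{a}\right)
\]
% Doubling the corpus then adds only a constant to the score:
\[
  F(2s) - F(s) \approx a \log 2
\]
```

Under this form, each fixed gain in F-score multiplies the required amount of annotated text by a constant factor.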
FMM: about the Lexicon • Let L denote the size of the lexicon and s the size of the corpus from which the lexicon is extracted; we will have • and the F-score given by FMM
OOV Issue • Of special interest in CWS: out-of-vocabulary (OOV) words are those that appear in the test corpus but are absent from the training corpus. • The OOV rate, the proportion of OOV words among all words in the test corpus, heavily affects segmentation performance.
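The OOV rate defined above is straightforward to compute. The sketch below assumes the rate is counted over word tokens (every occurrence in the test corpus counts), which is how the definition reads; the function name `oov_rate` and the toy word lists are illustrative.

```python
def oov_rate(train_words, test_words):
    """Proportion of test-corpus word tokens never seen in the training corpus."""
    vocabulary = set(train_words)
    test_words = list(test_words)
    if not test_words:
        return 0.0
    oov_count = sum(1 for word in test_words if word not in vocabulary)
    return oov_count / len(test_words)


# Hypothetical toy corpora: two of the four test words are unseen.
train = ["她", "来", "自", "苏格兰"]
test = ["他", "来", "自", "英格兰"]
print(oov_rate(train, test))
```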
Conclusions • Bad news: the statistical method requires an exponential increase in annotated corpus scale to overcome the sparseness caused by Zipf’s law. • Enlarging the annotated corpus is therefore not a good way to improve the statistical method’s performance. • A small surprise: the rule-based method only requires a negative inverse increase of corpus (lexicon) scale. • Is the rule-based method more effective than the statistical one? • A lexicon is much cheaper than an annotated corpus (text).