How Large a Corpus do We Need: Statistical Method vs. Rule-based Method

Hai Zhao, Yan Song and Chunyu Kit • Department of Computer Science and Engineering, Shanghai Jiao Tong University, China • zhaohai@cs.sjtu.edu.cn • 2010.05.20

Motivation • If corpus scale is the only factor that affects learning performance, then how large an annotated corpus do we need to reach a given value of a performance metric?

Zipf’s Law • Data sparseness becomes serious
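As a reminder of why Zipf's law leads to sparseness, the law is usually written as below. Here f(r) is the frequency of the word of rank r, and the exponent \alpha and constant C are generic fit parameters, not values taken from these slides.

```latex
f(r) \approx \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1
```

Under this form a few high-rank words account for most tokens, while most word types occur only a handful of times, so any finite corpus leaves many words unseen or barely seen.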

Choosing the task: Chinese word segmentation • A special case of tokenization in natural language processing (NLP), needed for the many languages that have no explicit word delimiters such as spaces. • Original: 她来自苏格兰 • Read character by character: She comes from SU GE LAN. Meaningless! • Segmented: 她/来/自/苏格兰 • She comes from Scotland. Meaningful!

Why the Task (CWS) • A simple task • Both statistical and rule-based methods are available for it • Multiple standard, large-scale annotated corpora are available, too • A word-oriented task, which is exactly what Zipf’s law is about

Performance Metric • Evaluation metric, F-score: F = 2RP/(R+P) • R: recall, the proportion of correctly segmented words among all words in the gold-standard segmentation • P: precision, the proportion of correctly segmented words among all words in the segmenter’s output
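The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script used for the experiments: it assumes a word counts as correct only when both of its character boundaries match the gold standard, and the function names and the toy sentence pair are hypothetical.

```python
def words_to_spans(words):
    """Map a word sequence to the set of (start, end) character spans it covers."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F-score) for one segmented sentence."""
    gold = words_to_spans(gold_words)
    pred = words_to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    if correct == 0:
        return precision, recall, 0.0
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score


# Toy example (hypothetical segmenter output, not experimental data).
gold = ["她", "来", "自", "苏格兰"]
pred = ["她", "来自", "苏格兰"]
precision, recall, f_score = segmentation_prf(gold, pred)
print(precision, recall, f_score)
```

In this toy case two of the three predicted words match the four gold words, so P = 2/3, R = 1/2 and F = 4/7.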

Data sets and Approaches • [Table: data sets, with sizes in number of characters] • Approaches • Conditional random fields (CRFs) as the statistical method: learned from an annotated corpus • Forward maximal matching algorithm (FMM) as the rule-based method: performs segmentation based on a predefined lexicon • Comparable: the FMM lexicon is extracted from the same annotated corpus that the CRFs are trained on
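The rule-based side of the comparison can be sketched as follows. This is a minimal greedy FMM, assuming that a character with no lexicon match is emitted as a single-character word; the function names, that fallback rule and the one-sentence training corpus are illustrative assumptions, not details given in the slides.

```python
def build_lexicon(segmented_sentences):
    """Collect every word type seen in an annotated (pre-segmented) corpus."""
    return {word for sentence in segmented_sentences for word in sentence}


def fmm_segment(text, lexicon, max_word_len=None):
    """Forward maximal matching: at each position take the longest lexicon word."""
    if max_word_len is None:
        max_word_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumed fallback: an unmatched character becomes its own word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy annotated corpus (hypothetical), reused as the source of the lexicon.
training = [["她", "来", "自", "苏格兰"]]
lexicon = build_lexicon(training)
segmented = fmm_segment("她来自苏格兰", lexicon)
print("/".join(segmented))
```

Because the lexicon is built from the same annotated sentences a statistical model would train on, both methods see the same information, which is the comparability condition stated above.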

Data Splitting • Overcome data sparseness by training corpus splitting

Learning Curves: CRFs vs. FMM

CRFs Performance vs. Corpus Scale • Exponential enlargement of the corpus gives only a linear performance improvement
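One way to read the statement above is as a log-linear learning curve. The sketch below assumes that form; the constants a and b are hypothetical fit parameters, not values reported in the slides, and s is the training corpus size.

```latex
F(s) \approx a \ln s + b
\quad\Longrightarrow\quad
F(ks) - F(s) \approx a \ln k
\quad\Longrightarrow\quad
\frac{s'}{s} \approx \exp\!\left(\frac{\Delta F}{a}\right)
```

Under this assumption every doubling of the corpus adds the same constant a ln 2 to the F-score, and a target gain of \Delta F requires multiplying the corpus size by exp(\Delta F / a).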

FMM: about the Lexicon • Let L denote the size of the lexicon and s the size of the corpus from which the lexicon is extracted; then we have [equation not preserved] • And the F-score given by FMM is [equation not preserved]

FMM Performance vs. Corpus Scale

FMM Lexicon Size vs. Performance

OOV issue • Of special interest in CWS: out-of-vocabulary (OOV) words are those that appear in the test corpus but are absent from the training corpus. • The OOV rate, the proportion of OOV words among all words in the test corpus, heavily affects segmentation performance.
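The OOV rate defined above can be computed directly from two word lists. This is a minimal sketch that assumes the rate is counted over word tokens (every occurrence in the test corpus counts); the function name and the toy corpora are hypothetical.

```python
def oov_rate(train_words, test_words):
    """Fraction of test-corpus word tokens that never occur in the training corpus."""
    if not test_words:
        return 0.0
    vocabulary = set(train_words)
    oov_count = sum(1 for word in test_words if word not in vocabulary)
    return oov_count / len(test_words)


# Toy corpora (hypothetical, not experimental data).
train_words = ["她", "来", "自", "苏格兰"]
test_words = ["他", "来", "自", "苏格兰"]
rate = oov_rate(train_words, test_words)
print(rate)
```

Here only 他 is unseen in training, so one of the four test tokens is OOV and the rate is 0.25.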

OOV rate vs. Corpus Scale

Fitting OOV Rate

Conclusions • Bad news: the statistical method asks for an exponential increase of annotated corpus scale to overcome the sparseness caused by Zipf’s law. • Enlarging the annotated corpus is therefore not a good way to improve the statistical method’s performance. • A small surprise: the rule-based method only asks for a negative-inverse increase of corpus (lexicon) scale. • Is the rule-based method more effective than the statistical one? • A lexicon is much cheaper than an annotated corpus (text).

Thanks!
