Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually-Segmented Corpora Wei Qiao and Maosong Sun Department of Computer Science and Technology Tsinghua University
Outline • Introduction • Motivations • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Introduction—Motivations (1) • Background: word frequency plays an important role in NLP applications, e.g. term frequency (TF) in information retrieval, word segmentation with statistical methods, and teaching Chinese as a second language • Key point of the research: counting word frequency is easy for English but hard for Chinese • Chinese word frequency approximation requires a correctly, manually segmented Chinese corpus • "What is a word in Chinese?" => inconsistency phenomena • Zipf's Law => a corpus with several hundred million characters is needed • Where are we? The resources we can use:
Introduction—Motivations (2) • Manually segmented corpora: give word frequency directly, but are small and suffer from segmentation inconsistency • Raw corpus + perfect segmenter: unrealistic • Raw corpus + complex segmenter: precision about 95%, but weak consistency • Raw corpus + MM segmenter: precision about 90%, better consistency • Raw corpus + character-string counts: consistent, but much higher than the actual word frequency
Where are we? • Introduction • The New Approximation Methods • Architecture • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Architecture——(1) • Combine the word frequencies counted in the manually segmented corpora with the MM approximation result obtained from the MM-segmented/raw corpus • Combine method: simply add them up • MM approximation follows Sun and Zhang (2006): for words of length 1-4, take the average of forward and backward MM counts; for length 5, use backward MM counts; for length 6 and above, use raw character-string counts
Architecture——(2) • The shorter the word, the better the estimate from the manually segmented corpora; the quality descends with word length (1, 2, 3, 4+) • A weighting factor balances the two sources; its initial value is adjusted through experiments • The factor accounts for corpus size and the word-length effect (a rough sketch of the combination follows)
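As referenced above, the length-conditioned combination can be sketched in a few lines of Python. This is a minimal illustration only: the function, the dictionaries of counts and the way the 1:3.9 factor is applied are assumptions, not the authors' actual implementation.

```python
def approximate_frequency(word, manual_counts, fmm_counts, bmm_counts,
                          string_counts, lam=3.9):
    """Combine counts from manually segmented corpora with an MM-based
    approximation from a large raw corpus (hypothetical sketch).

    manual_counts -- word counts from the (small) manually segmented corpora
    fmm_counts    -- counts from forward maximum-matching segmentation
    bmm_counts    -- counts from backward maximum-matching segmentation
    string_counts -- raw character-string occurrence counts in the raw corpus
    lam           -- weighting factor for the raw-corpus estimate; the slides
                     report a tuned ratio of 1:3.9, but which side it scales
                     is an assumption here
    """
    n = len(word)  # word length in characters
    if n <= 4:
        # Lengths 1-4: average of forward and backward MM counts
        raw_estimate = (fmm_counts.get(word, 0) + bmm_counts.get(word, 0)) / 2.0
    elif n == 5:
        # Length 5: backward MM counts only
        raw_estimate = bmm_counts.get(word, 0)
    else:
        # Length 6+: plain character-string counts in the raw corpus
        raw_estimate = string_counts.get(word, 0)

    # "Simply add them up", with the factor balancing the corpus sizes
    return manual_counts.get(word, 0) + lam * raw_estimate
```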
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Data Set • Data set for word frequency approximation • Manually segmented corpora • Tsinghua and Peking Language University corpus (HUAYU) • Peking University corpus (BEIDA) • Raw corpus (RC): 447,079,112 characters
Data Set • Standard corpus — the gold standard • Institute of National Applied Linguistics corpus, denoted YUWEI (25,000,309 words, 51,311,659 characters) • Distribution: 124 different fields, covering the years 1920 to 2001 — large and relatively balanced • Select the words with frequency above 3 from YUWEI: 99,660 words in total • Build a word rank sequence YWL by sorting these words by frequency in descending order • Distribution of YWL according to word length:
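A small sketch of how such a gold-standard rank list could be built from a frequency table; the function name and input format are assumptions, not the authors' tooling.

```python
def build_rank_list(word_freq, min_freq=3):
    """Build a gold-standard rank list (like YWL) from a word-frequency table.

    word_freq -- dict mapping word -> frequency counted in the reference corpus
    min_freq  -- keep only words whose frequency is above this threshold
    """
    kept = [(w, f) for w, f in word_freq.items() if f > min_freq]
    # Sort by frequency in descending order; the 1-based position is the rank
    kept.sort(key=lambda item: item[1], reverse=True)
    return {w: rank for rank, (w, _f) in enumerate(kept, start=1)}
```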
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Experiments design • Results and analysis • Conclusion and Future Work
Experiments design • Each of the 99,660 words w1 ... wn in the gold-standard rank list YWL (the standard ranking, with gold ranks r1 ... rn) is also assigned a rank by each approximation method (Mr1 ... Mrn, Rr1 ... Rrn, Zr1 ... Zrn), producing the rank lists RM, RR and RZ, which are compared against the gold list RYW
Results and Analysis ——(1) • Evaluation metric: Spearman Rank Correlation Coefficient (SRCC), computed between the gold rank list and each approximated rank list
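For reference, the SRCC between the gold ranks and an approximated ranking can be computed with the standard formula over rank differences. This minimal sketch assumes untied ranks and hypothetical dictionary inputs.

```python
def spearman_rcc(gold_ranks, approx_ranks):
    """Spearman rank correlation coefficient between two rank assignments.

    gold_ranks, approx_ranks -- dicts mapping each word to its rank;
    both must cover the same set of n words (no ties assumed here).
    """
    words = list(gold_ranks)
    n = len(words)
    # Sum of squared rank differences, then the classic closed form
    d_squared = sum((gold_ranks[w] - approx_ranks[w]) ** 2 for w in words)
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))
```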
Results and Analysis ——(2) • Adjusting the weighting factor: initial value 1:1; experimentally tuned value 1:3.9
Results and Analysis ——(3) • Coverage rate evaluation on YUWEI • Top 50,000 words
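If the coverage rate is read as the overlap between the top-N gold-standard words and the top-N approximated words, it could be computed roughly as below; this is an assumption about the metric, not the paper's exact definition.

```python
def top_n_coverage(gold_ranking, approx_ranking, n=50000):
    """Fraction of the gold top-n words that also appear in the approximated top-n.

    gold_ranking, approx_ranking -- lists of words sorted by descending frequency.
    """
    gold_top = set(gold_ranking[:n])
    approx_top = set(approx_ranking[:n])
    return len(gold_top & approx_top) / len(gold_top)
```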
Results and Analysis ——(4) • Test results on all YWL words
Results and Analysis ——(5) • For the top-N evaluation, YWL is split by frequency into three bands: high-frequency 8,076 words, middle-frequency 52,148 words, low-frequency 39,436 words
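For band-wise evaluation, the rank-ordered list can be sliced into the three bands above; the band sizes below are simply the counts from this slide, and the helper is illustrative only.

```python
def split_bands(ranked_words, high=8076, middle=52148):
    """Split a descending-frequency word list into high/middle/low bands.

    ranked_words -- list of words sorted by descending frequency (e.g. YWL)
    high, middle -- sizes of the high- and middle-frequency bands
                    (8,076 and 52,148 in the slides; the rest is the low band)
    """
    return {
        "high": ranked_words[:high],
        "middle": ranked_words[high:high + middle],
        "low": ranked_words[high + middle:],
    }
```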
Results and Analysis ——(6) • Test results on high-frequency words
Results and Analysis ——(7) • Test results on middle-frequency words
Results and Analysis ——(8) • Test results on low-frequency words • The approximation is surprisingly unreliable for low-frequency words
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Conclusion and Future Work • We propose a trade-off method for Chinese word frequency approximation based on a raw corpus, an MM-segmented corpus and manually segmented corpora • Experiments show that the method clearly benefits word frequency approximation when only small, unbalanced manually segmented corpora are available • Results are still not fully satisfactory in some cases • Future work: • Further study of low-frequency words • Evaluation in other NLP tasks
Thank you! Your questions and comments are welcome ^_^