Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually-Segmented Corpora Wei Qiao and Maosong Sun Department of Computer Science and Technology Tsinghua University
Outline • Introduction • Motivations • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Introduction—Motivations (1) • Background: word frequency plays an important role in NLP applications, e.g. term frequency (TF) in information retrieval, word segmentation with statistical methods, and teaching Chinese as a second language • Key point of the research: counting word frequency is easy for English but hard for Chinese • Chinese word frequency approximation requires a correctly, manually segmented Chinese corpus • "What is a word in Chinese?" => inconsistency phenomena • Zipf's Law => a corpus with several hundred million characters is needed • Where are we? The resources we can use:
Introduction—Motivations (2) • Manually segmented corpora: give word frequency directly, but are small and suffer from segmentation inconsistency • Raw corpus + perfect segmenter: unrealistic • Raw corpus + complex segmenter: precision about 95%, but weak consistency • Raw corpus + MM segmenter: precision about 90%, better consistency • Raw corpus + character-string counts: consistent, but much higher than the actual word frequency
Where are we? • Introduction • The New Approximation Methods • Architecture • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Architecture——(1) • Combine the word frequencies counted in the manually segmented corpora with the MM approximation result obtained from the MM-segmented/raw corpus • Combine method: simply add them up • MM approximation follows Sun and Zhang (2006): for words of length 1-4, take the average of forward and backward MM counts; for length 5, use backward MM counts; for length 6 and above, use raw character-string counts
Architecture——(2) • The shorter the word, the better the estimate from the manually segmented corpora; the quality descends with word length (1, 2, 3, 4+) • A weighting factor balances the two sources; its initial value is adjusted through experiments • The factor accounts for corpus size and the word-length effect (a rough sketch of the combination follows)
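As referenced above, the length-conditioned combination can be sketched in a few lines of Python. This is a minimal illustration only: the function, the dictionaries of counts and the way the 1:3.9 factor is applied are assumptions, not the authors' actual implementation.

```python
def approximate_frequency(word, manual_counts, fmm_counts, bmm_counts,
                          string_counts, lam=3.9):
    """Combine counts from manually segmented corpora with an MM-based
    approximation from a large raw corpus (hypothetical sketch).

    manual_counts -- word counts from the (small) manually segmented corpora
    fmm_counts    -- counts from forward maximum-matching segmentation
    bmm_counts    -- counts from backward maximum-matching segmentation
    string_counts -- raw character-string occurrence counts in the raw corpus
    lam           -- weighting factor for the raw-corpus estimate; the slides
                     report a tuned ratio of 1:3.9, but which side it scales
                     is an assumption here
    """
    n = len(word)  # word length in characters
    if n <= 4:
        # Lengths 1-4: average of forward and backward MM counts
        raw_estimate = (fmm_counts.get(word, 0) + bmm_counts.get(word, 0)) / 2.0
    elif n == 5:
        # Length 5: backward MM counts only
        raw_estimate = bmm_counts.get(word, 0)
    else:
        # Length 6+: plain character-string counts in the raw corpus
        raw_estimate = string_counts.get(word, 0)

    # "Simply add them up", with the factor balancing the corpus sizes
    return manual_counts.get(word, 0) + lam * raw_estimate
```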
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Data Set • Data set for word frequency approximation • Manually segmented corpora • Tsinghua and Peking Language University corpus (HUAYU) • Peking University corpus (BEIDA) • Raw corpus (RC): 447,079,112 characters
Data Set • Standard corpus — the gold standard • Institute of National Applied Linguistics corpus, denoted YUWEI (25,000,309 words, 51,311,659 characters) • Distribution: 124 different fields, covering the years 1920 to 2001 — large and relatively balanced • Select the words with frequency above 3 from YUWEI: 99,660 words in total • Build a word rank sequence YWL by sorting these words by frequency in descending order • Distribution of YWL according to word length:
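A small sketch of how such a gold-standard rank list could be built from a frequency table; the function name and input format are assumptions, not the authors' tooling.

```python
def build_rank_list(word_freq, min_freq=3):
    """Build a gold-standard rank list (like YWL) from a word-frequency table.

    word_freq -- dict mapping word -> frequency counted in the reference corpus
    min_freq  -- keep only words whose frequency is above this threshold
    """
    kept = [(w, f) for w, f in word_freq.items() if f > min_freq]
    # Sort by frequency in descending order; the 1-based position is the rank
    kept.sort(key=lambda item: item[1], reverse=True)
    return {w: rank for rank, (w, _f) in enumerate(kept, start=1)}
```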
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Experiments design • Results and analysis • Conclusion and Future Work
Experiments design • Each of the 99,660 words w1 ... wn in the gold-standard rank list YWL (the standard ranking, with gold ranks r1 ... rn) is also assigned a rank by each approximation method (Mr1 ... Mrn, Rr1 ... Rrn, Zr1 ... Zrn), producing the rank lists RM, RR and RZ, which are compared against the gold list RYW
Results and Analysis ——(1) • Evaluation metric: Spearman Rank Correlation Coefficient (SRCC), computed between the gold rank list and each approximated rank list
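For reference, the SRCC between the gold ranks and an approximated ranking can be computed with the standard formula over rank differences. This minimal sketch assumes untied ranks and hypothetical dictionary inputs.

```python
def spearman_rcc(gold_ranks, approx_ranks):
    """Spearman rank correlation coefficient between two rank assignments.

    gold_ranks, approx_ranks -- dicts mapping each word to its rank;
    both must cover the same set of n words (no ties assumed here).
    """
    words = list(gold_ranks)
    n = len(words)
    # Sum of squared rank differences, then the classic closed form
    d_squared = sum((gold_ranks[w] - approx_ranks[w]) ** 2 for w in words)
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))
```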
Results and Analysis ——(2) • Adjusting the weighting factor: initial value 1:1; experimentally tuned value 1:3.9
Results and Analysis ——(3) • Coverage rate evaluation on YUWEI • Top 50,000 words
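If the coverage rate is read as the overlap between the top-N gold-standard words and the top-N approximated words, it could be computed roughly as below; this is an assumption about the metric, not the paper's exact definition.

```python
def top_n_coverage(gold_ranking, approx_ranking, n=50000):
    """Fraction of the gold top-n words that also appear in the approximated top-n.

    gold_ranking, approx_ranking -- lists of words sorted by descending frequency.
    """
    gold_top = set(gold_ranking[:n])
    approx_top = set(approx_ranking[:n])
    return len(gold_top & approx_top) / len(gold_top)
```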
Results and Analysis ——(4) • Test results on all YWL words
Results and Analysis ——(5) • For the top-N evaluation, YWL is split by frequency into three bands: high-frequency 8,076 words, middle-frequency 52,148 words, low-frequency 39,436 words
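For band-wise evaluation, the rank-ordered list can be sliced into the three bands above; the band sizes below are simply the counts from this slide, and the helper is illustrative only.

```python
def split_bands(ranked_words, high=8076, middle=52148):
    """Split a descending-frequency word list into high/middle/low bands.

    ranked_words -- list of words sorted by descending frequency (e.g. YWL)
    high, middle -- sizes of the high- and middle-frequency bands
                    (8,076 and 52,148 in the slides; the rest is the low band)
    """
    return {
        "high": ranked_words[:high],
        "middle": ranked_words[high:high + middle],
        "low": ranked_words[high + middle:],
    }
```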
Results and Analysis ——(6) • Test results on high-frequency words
Results and Analysis ——(7) • Test results on middle-frequency words
Results and Analysis ——(8) • Test results on low-frequency words • The approximation is surprisingly unreliable for low-frequency words
Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work
Conclusion and Future Work • We propose a trade-off method for Chinese word frequency approximation based on a raw corpus, an MM-segmented corpus and manually segmented corpora • Experiments show that the method clearly benefits word frequency approximation when only small, unbalanced manually segmented corpora are available • Results are still not fully satisfactory in some cases • Future work: • Further study of low-frequency words • Evaluation in other NLP tasks
Thank you! Your questions and comments are welcome ^_^