Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches 黃瀚萱 2008
Outline • Motivation and Goals • Related Work • System Design • HMMs & CRFs • Experiments • Conclusion
CCSS: Classical Chinese Sentence Segmentation • Almost all pre-20th-century Chinese was written without any punctuation marks. • Nothing separates words from words, phrases from phrases, or sentences from sentences. • Explicit sentence and clause boundaries are absent • Readers have to identify these boundaries themselves while reading.
Example of CCSS from Zhuangzi (莊子) • Unsegmented: 北冥有魚其名為鯤鯤之大不知其幾千里也化而為鳥其名為鵬鵬之背不知其幾千里也怒而飛其翼若垂天之雲是鳥也海運則將徙於南冥南冥者天池也 • Segmented: 北冥有魚.其名為鯤.鯤之大.不知其幾千里也.化而為鳥.其名為鵬.鵬之背.不知其幾千里也.怒而飛.其翼若垂天之雲.是鳥也.海運則將徙於南冥.南冥者.天池也.
Challenges • CCSS is not a trivial problem; it is inherently ambiguous. • 道 / 可道 / 非常道 / 名 / 可名 / 非常名. • 道可道 / 非常道 / 名可名 / 非常名. • It is difficult to construct a set of rules or a practical procedure for CCSS. • Readers perform CCSS instinctively, relying on experience and a sense of the language rather than on any systematic procedure.
Automated CCSS • Innumerable documents in Classical Chinese from the centuries of Chinese history remain to be segmented. • To aid in processing these documents, an automated CCSS system is proposed • enabling segmentation tasks to be completed quickly and accurately.
Research Goals • Evaluation Metrics • Datasets • Training data • Benchmarking • Statistical segmenters.
Outline • Motivation and Goals • Related Work • System Design • HMMs & CRFs • Experiments • Conclusion
Related Areas (diagram): CCSS sits at the intersection of Linguistics, NLP (CWS, SBD, Tagging, Chunking), and Machine Learning (Classifiers).
Useful Chinese Features • Chinese characters • 則 and 而 usually appear at the beginning of sentences. • 也 and 矣 usually appear at the end of sentences. • Phonology • 反切, 平仄, 擬音 • POS • Verbs, nouns, adjectives, adverbs, etc. • Antithesis and couplets • 道 / 可道 / 非常道 / 名 / 可名 / 非常名
Sentence Boundary Detection • Distinguish periods used as end-of-sentence indicators from other uses • Parts of abbreviations, e.g. Dr. Wang. • Decimal points, e.g. 1.618. • Ellipses, e.g. “I don’t know…” • Metrics for segmentation • F-measure and NIST-SU error rate • SBD in speech.
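To make the disambiguation concrete, here is a minimal rule-based period classifier, a sketch only and not part of the system described in this talk; the abbreviation list and the specific heuristics are assumptions for illustration:

```python
import re

# Illustrative abbreviation list; a real SBD system would use a larger one.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "i.e.", "e.g."}

def split_sentences(text):
    """Treat a period as a sentence boundary only when it is not part of
    an abbreviation, a decimal number, or an ellipsis."""
    sentences, start = [], 0
    for m in re.finditer(r"\.", text):
        i = m.start()
        token = text[:i + 1].split()[-1]            # word ending at this period
        if token in ABBREVIATIONS:
            continue                                # abbreviation, e.g. "Dr. Wang"
        if i + 1 < len(text) and text[i + 1].isdigit():
            continue                                # decimal point, e.g. 1.618
        if text[i - 1:i] == "." or text[i + 1:i + 2] == ".":
            continue                                # part of an ellipsis "..."
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

Such hand-written rules work for English periods but, as noted earlier, do not transfer to Classical Chinese, where no boundary marks exist at all.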
Chinese Word Segmentation • Identify the boundaries of the words in a given text. • A non-trivial, inherently ambiguous problem • 日 文章 魚 怎麼 說 • 日文 章魚 怎麼 說 • Words can be handled with a dictionary. • Segmentation by character tagging [Xue, 2003]
Part-of-Speech Tagging • Tagging the words of a sentence with their word classes. • The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN]. • Tagging with information other than word class • Position-of-character tagging. • Classical Chinese POS [Huang et al., 2002] • Focused on sentence-segmented text.
Outline • Motivation and Goals • Related Work • System Design • HMMs & CRFs • Experiments • Conclusion
CCSS Framework (diagram): training data → training → segmentation model; test data + model → testing → testing outcome; outcome + evaluation metrics → performance measurement.
Sequential Data • Transform the sentence segmentation task into a character labeling task. • Tag each character with one of four position-of-character tags • Left boundary (LL) • Middle character (MM) • Right boundary (RR) • Single-character clause (LR) • 北冥有魚 / 其名為鯤 / 鯤之大 / 不知其幾千里也
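The mapping between clause boundaries and the four position-of-character tags can be sketched as follows (function names are mine, not from the original system):

```python
def clauses_to_tags(clauses):
    """Map each character of a clause list to a position-of-character tag:
    LL = left boundary, MM = middle, RR = right boundary,
    LR = single-character clause."""
    tags = []
    for clause in clauses:
        if len(clause) == 1:
            tags.append("LR")
        else:
            tags.append("LL")
            tags.extend("MM" for _ in clause[1:-1])
            tags.append("RR")
    return tags

def tags_to_clauses(text, tags):
    """Recover clause boundaries from a tag sequence: a clause ends
    wherever the tag is RR or LR."""
    clauses, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("RR", "LR"):
            clauses.append(current)
            current = ""
    if current:
        clauses.append(current)
    return clauses
```

For example, 北冥有魚 / 其名為鯤 becomes LL MM MM RR LL MM MM RR, and decoding that tag sequence over the unsegmented text restores the two clauses.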
Markov Chain for CCSS (state diagram): Start → LL → MM → RR → Finish, with LR as an alternative single-character-clause state; e.g. 北 is emitted from LL, 冥 and 有 from MM, and 魚 from RR.
Sequence Labeling Models • Hidden Markov Models • Maximum Entropy • Conditional Random Fields [Lafferty, 2001] • With Averaged Perceptron [Collins, 2002] • Large margin methods • Support Vector Machine • AdaBoost
Conditional Random Fields • The model • P( “LL, MM, MM, RR” | “北冥有魚” ) • Tagging x with the Viterbi algorithm • λ is estimated with the averaged perceptron algorithm.
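Written out in full, the linear-chain CRF that the slide's P(y | x) abbreviates takes the standard form [Lafferty, 2001], with feature functions f_k and learned weights λ_k:

```latex
P(y \mid x) \;=\; \frac{1}{Z(x)}\,
  \exp\!\Bigl(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\Bigr),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Bigl(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t)\Bigr)
```

Here x is the character sequence (e.g. 北冥有魚), y is the tag sequence (e.g. LL, MM, MM, RR), and Z(x) normalizes over all possible tag sequences.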
Datasets • Focused on the corpora of the Pre-Qin and Han dynasties (先秦兩漢) • The foundation of later Chinese. • Simpler syntax. • Shorter sentences. • Words are largely single characters. • Qing palace memorials (奏摺)
Evaluation Metrics • Specificity (1 − fallout) • Probability of correctly identifying the true negative cases. • F-measure (F1) • Harmonic mean of precision and recall. • NIST-SU error rate • Ratio of wrongly segmented boundaries to reference boundaries. • Can exceed 100% when mis-segmentation is severe.
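A boundary-level formulation of these metrics can be sketched as below; treating every inter-character position as a yes/no boundary decision is my illustrative assumption, not necessarily the exact formulation used in the thesis:

```python
def segmentation_metrics(ref_boundaries, hyp_boundaries, n_positions):
    """Compute specificity, F1, and NIST-SU error rate over boundary
    positions (each of the n_positions inter-character slots is a
    binary boundary decision)."""
    ref, hyp = set(ref_boundaries), set(hyp_boundaries)
    tp = len(ref & hyp)                       # correctly placed boundaries
    fp = len(hyp - ref)                       # spurious boundaries
    fn = len(ref - hyp)                       # missed boundaries
    tn = n_positions - tp - fp - fn           # correctly left unsegmented
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    nist_su = (fp + fn) / len(ref) if ref else 0.0   # can exceed 1.0 (i.e. 100%)
    return {"specificity": specificity, "f1": f1, "nist_su": nist_su}
```

Note that the NIST-SU error rate divides all boundary errors by the number of reference boundaries only, which is why it can exceed 100% on badly mis-segmented output.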
Outline • Motivation and Goals • Related Work • System Design • HMMs & CRFs • Experiments • Conclusion
Experiment Design • Experiment 1 • Evaluate the performance of HMMs and CRFs • 10-fold cross-validation • Experiment 2 • Find the best training data among the ancient Chinese texts. • Train the system on one dataset and test it on the others. • Experiment 3 • Cross-era evaluation. • Train the system on data from the Qing dynasty and test it on data from the Pre-Qin and Han dynasties, and vice versa. • 以古鑑近 (judging the recent by the ancient)? 以近鑑古 (judging the ancient by the recent)?
Result: Experiment 1 — evaluation in 10-fold cross-validation (results table omitted). CRFs trained with 5 iterations of the averaged perceptron and 100K feature functions.
Result: Experiment 2 (results table omitted). CRFs trained with 5 iterations of the averaged perceptron and 100K feature functions.
Experiment 3a: 以古鑑近 (diagram) — training data: 論語, 孟子, 莊子, 左傳, 史記 → HMMs/CRFs → test data: 清代奏摺 → validation.
Experiment 3b: 以近鑑古 (diagram) — training data: 清代奏摺 → HMMs/CRFs → test data: 論語, 孟子, 莊子, 左傳, 史記 → validation.
Outline • Motivation and Goals • Related Work • System Design • HMMs & CRFs • Experiments • Conclusion
Overall • Built an automated CCSS system. • Completed three tasks during system development: • A set of evaluation metrics. • A set of datasets. • Two segmentation models • HMMs • CRFs
Datasets • Evaluated several classics from the 5th century BCE to the 19th century, including 論語, 孟子, 莊子, 左傳, and 史記. • The system maintains its performance on test data that differs from the training data, as long as the gap between the eras in which the two were written is not too great. • 史記 is the best dataset for training.
Segmentation Models • Overall performance • Model comparison
Future Work • Incorporate more Chinese features • Phonology, POS, and antithesis. • Integrate pre-defined rules • Names, places, dates, numbers. • Mix several datasets to obtain a more general, robust training set.
Conditional Random Fields • The model • P( “LL, MM, MM, RR” | “北冥有魚” ) • Feature functions: f(y_{t-1}, y_t, x, t) • f(MM, RR, ‘曰’) = 1 • 曰 does appear at the end of clauses • f(RR, ‘孟’) = 0 • 孟 never appears at the end of clauses • λ_k determines the importance of f_k; λ_k is learned from the data.
Conditional Random Fields (cont.) • Tagging x • Implemented with the Viterbi algorithm • Parameter estimation (determining λ) • Convergence is guaranteed, but there is no closed-form solution; the values must be approximated iteratively. • High complexity, slow, and hard to implement • GIS, IIS, L-BFGS, etc. • The averaged perceptron is used in place of these traditional numerical methods.
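The combination of Viterbi decoding with averaged-perceptron training [Collins, 2002] can be sketched as below. The two feature templates (character emission and tag transition) are a minimal illustrative choice, far smaller than the 100K feature functions of the actual system:

```python
from collections import defaultdict

TAGS = ["LL", "MM", "RR", "LR"]

def features(prev_tag, tag, chars, t):
    # Indicator features in the f(y_{t-1}, y_t, x, t) form from the slides:
    # one character-emission feature and one tag-transition feature.
    return [("emit", tag, chars[t]), ("trans", prev_tag, tag)]

def local_score(weights, prev_tag, tag, chars, t):
    return sum(weights.get(f, 0.0) for f in features(prev_tag, tag, chars, t))

def viterbi(chars, weights):
    """Best tag sequence under the current weights (standard Viterbi)."""
    n = len(chars)
    score = [{t: float("-inf") for t in TAGS} for _ in range(n)]
    back = [{} for _ in range(n)]
    for t in TAGS:
        score[0][t] = local_score(weights, "<s>", t, chars, 0)
    for i in range(1, n):
        for t in TAGS:
            for p in TAGS:
                s = score[i - 1][p] + local_score(weights, p, t, chars, i)
                if s > score[i][t]:
                    score[i][t], back[i][t] = s, p
    best = max(TAGS, key=lambda t: score[n - 1][t])
    path = [best]
    for i in range(n - 1, 0, -1):
        best = back[i][best]
        path.append(best)
    return path[::-1]

def train(data, epochs=5):
    """Averaged-perceptron estimation: decode with Viterbi, then reward
    gold features and penalize predicted features where they differ;
    return the average of the weight vector over all updates."""
    w, total, steps = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for chars, gold in data:
            pred = viterbi(chars, w)
            if pred != gold:
                for i in range(len(chars)):
                    gp = gold[i - 1] if i else "<s>"
                    pp = pred[i - 1] if i else "<s>"
                    for f in features(gp, gold[i], chars, i):
                        w[f] += 1.0
                    for f in features(pp, pred[i], chars, i):
                        w[f] -= 1.0
            steps += 1
            for k, v in w.items():
                total[k] += v
    return {k: v / steps for k, v in total.items()}
```

Averaging the weights over all updates, rather than keeping only the final vector, is what makes the perceptron a practical stand-in for the slower GIS/IIS/L-BFGS estimators mentioned above.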
Raw Results (Better Case) • Reference: 齊宣王問曰.文王之囿.方七十里.有諸.孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿.方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里為阱於國中.民以為大.不亦宜乎. • System output: 齊宣王問曰.文王之囿方七十里.有諸孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里.為阱於國中.民以為大.不亦宜乎. • Training data: 史記, test data: 孟子
Raw Results (Worse Case) • Reference: 孟子曰.三代之得天下也.以仁.其失天下也以不仁.國之所以廢興存亡者亦然.天子不仁.不保四海.諸侯不仁.不保社稷.卿大夫不仁.不保宗廟.士庶人不仁.不保四體.今惡死亡而樂不仁.是由惡醉而強酒. • System output: 孟子曰.三代之.得天下也.以仁其失天下也.以不仁國之所以廢興.存亡者.亦然.天子不仁.不保四海.諸侯不仁.不保社.稷卿大夫.不仁不保.宗廟士.庶人不仁.不保四體.今惡死亡.而樂不仁.是由惡醉.而強酒. • Training data: 史記, test data: 孟子
Evaluation Measures (confusion-matrix diagram: tp, fp, fn, tn)
Statistical Machine Learning (diagram): training data → learner (model); input → model → output.
Conclusion: Evaluation Metrics • Primary metrics • Specificity • F-measure (F1) • The harmonic mean of recall and precision • NIST-SU error rate • ROC curves • Evaluate recall and specificity simultaneously • Compare multiple segmentation results
Statistical Segmenter Design • Statistical approach • The segmentation system can be viewed as a machine learner. • Rather than relying on manually pre-defined rules, it learns by adjusting itself on a large amount of training data. • Training-data requirements • Generality • Sufficient quantity