for Rocling 2011

Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for Rocling 2011 • Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu • Institute of Information Science, Academia Sinica • Department of Computer Science & Engineering, Yuan Ze University

Introduction • Term Contributed Boundary feature using Conditional Random Fields (2010) • A unified view of several unsupervised feature selection methods based on frequent strings

Flow chart

Toolkit • SRILM • YASA

SRILM • C++ libraries • The toolkit supports N-gram statistics for language modeling

YASA • Automatically extracts frequent strings from an unlabeled corpus

Flow chart

6-Tag

Extended Label • [0-9] + [B1|B2|B3|M|E|S]
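The 6-tag scheme and the extended label can be sketched as follows. This is a minimal illustration, assuming the common convention that B1, B2, B3 mark the first three characters of a word, E the last, M anything in between, and S a single-character word. The helper `extended_label` is hypothetical and simply assumes the numeric score is prefixed to the tag.

```python
def six_tags(word):
    """Return one 6-tag label per character of a single word."""
    n = len(word)
    if n == 1:
        return ["S"]
    heads = ["B1", "B2", "B3"]
    tags = []
    # Every character except the last is B1, B2, B3, then M.
    for i in range(n - 1):
        tags.append(heads[i] if i < 3 else "M")
    tags.append("E")
    return tags


def label_sentence(words):
    """Pair every character of a segmented sentence with its tag."""
    return [(ch, tag) for w in words for ch, tag in zip(w, six_tags(w))]


def extended_label(score, tag):
    """Hypothetical: concatenate a numeric score and a 6-tag label."""
    return f"{score}{tag}"
```

For instance, `six_tags` applied to a five-character word yields B1, B2, B3, M, E, and `extended_label(3, "B1")` yields the string `3B1`.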

Score • N-gram score • Frequent string score • Accessor variety score

Score • Converted from term frequency and N-gram frequency • Logarithm ranking mechanism
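A logarithm ranking mechanism maps a raw frequency to a small integer rank, so that frequencies of similar magnitude share one score. The slides do not give the base or an upper bound, so the base 2 and the cap of 9 below are assumptions, and `log_rank` is a hypothetical name.

```python
def log_rank(freq, cap=9):
    """Map an integer frequency to floor(log2(freq)), clipped to [0, cap].

    Base 2 and the cap are assumptions, not taken from the slides.
    """
    if freq < 1:
        return 0
    # For a positive int, bit_length() - 1 equals floor(log2(freq)) exactly.
    return min(freq.bit_length() - 1, cap)
```

Under these assumptions frequencies 8 through 15 all receive rank 3, which keeps the number of distinct feature values small.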

Score

Score • Considers the score of the outer pattern • Equation of AV
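The AV equation itself did not survive in this transcript. A minimal sketch, assuming the usual definition of accessor variety as the smaller of the number of distinct left neighbors and distinct right neighbors of a string, could look like this. Treating the sentence boundary as one special neighbor symbol is a simplifying assumption, as is the `max_len` parameter.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: min(#distinct left neighbors, #distinct right neighbors)}."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in sentences:
        # Boundary markers act as ordinary neighbors (an assumption).
        chars = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(chars) - 1):
            for n in range(1, max_len + 1):
                j = i + n
                if j > len(chars) - 1:
                    break
                s = "".join(chars[i:j])
                left[s].add(chars[i - 1])
                right[s].add(chars[j])
    return {s: min(len(left[s]), len(right[s])) for s in left}
```

A string that appears in many different contexts gets a high value, which is the intuition behind using AV as evidence that the string is a word.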

Score

Score • Scores are also used for filtering overlapping patterns

Overlapping and Non-overlapping

Non-overlapping • “塑膠原料的” (score 3) conflicts with “的生產” (score 1) • “的生產” is labeled as unseen

Overlapping information?

Overlapping String

Character-based N-Gram (CNG) • Character-based N-grams extracted by SRILM • Keeps overlapping information
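The slides use SRILM for this step. As a rough pure-Python stand-in (not SRILM's API), character N-gram counting over unsegmented text can be sketched as below; `char_ngram_counts` and `max_n` are hypothetical names.

```python
from collections import Counter


def char_ngram_counts(sentences, max_n=3):
    """Count every character n-gram of length 1..max_n in each sentence."""
    counts = Counter()
    for s in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts
```

Because every window is counted, N-grams that overlap each other are all retained, which matches the "keeps overlapping information" point above.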

Term Contributed Boundary (TCB) • Uses frequent strings from YASA • Selected by the forward maximum matching algorithm
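Forward maximum matching scans left to right and, at each position, takes the longest string found in the lexicon. The sketch below is a generic version of that algorithm, assuming the lexicon is a set of frequent strings; the function name and the `(piece, matched)` output shape are illustrative choices.

```python
def forward_maximum_matching(text, lexicon):
    """Greedy left-to-right longest match; unmatched characters are flagged False."""
    max_len = max((len(w) for w in lexicon), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                pieces.append((text[i:i + n], True))
                i += n
                break
        else:
            # No lexicon entry starts here: emit one unmatched character.
            pieces.append((text[i], False))
            i += 1
    return pieces


pieces = forward_maximum_matching("塑膠原料的生產", {"塑膠原料的", "的生產"})
```

On the example from the non-overlapping slide, the longer string “塑膠原料的” is matched first, so “的生產” can no longer match and the remaining characters come out unmatched.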

Term Contributed Frequency (TCF) • Uses frequent strings from YASA • Keeps overlapping information • Converts the score from the frequent string

Accessor Variety-based String (AVS) • Uses SRILM to generate N-grams • Measures how likely a substring is a Chinese word • Uses the logarithm ranking mechanism

AVS+TCB and AVS+TCF • Compound of AVS and TCB/TCF

CRF Labeling Scheme

Flow chart

Conditional Random Fields • Undirected graphical models trained to maximize the conditional probability of random variables X and Y • Feature instances are generated from a template file
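For reference, the conditional probability of a linear-chain CRF is commonly written as below. The symbols \lambda_k (feature weights), f_k (feature functions), T (sequence length), and Z (normalizer) are standard notation and are not taken from the slides.

```latex
p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)
```

Training chooses the weights \lambda_k that maximize this probability over the labeled sequences.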

Conditional Random Fields • Feature function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Feature template

Conditional Random Fields • Feature function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Example: 欲速則不達 • Feature template
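The six window features listed above can be expanded for one position as in the sketch below. The padding symbol `_B` for positions outside the sentence and the `/` joiner are assumptions made for the illustration; the actual template file is not shown in the transcript.

```python
def window_features(chars, i, pad="_B"):
    """Return the C-1, C0, C1 unigram and bigram features at position i."""
    def c(offset):
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else pad

    return {
        "C-1": c(-1),
        "C0": c(0),
        "C1": c(1),
        "C-1C0": c(-1) + "/" + c(0),
        "C0C1": c(0) + "/" + c(1),
        "C-1C1": c(-1) + "/" + c(1),
    }


feats = window_features(list("欲速則不達"), 2)
```

For the slide's example sentence, position 2 is the character 則, so C-1 is 速 and C1 is 不.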

Experiment • Data set • Academia Sinica (AS) • City University of Hong Kong (CityU) • Microsoft Research (MSR) • Peking University (PKU)

Evaluation Metric

F1 measure
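The F1 measure for word segmentation is the harmonic mean of word-level precision and recall. A minimal sketch, assuming a predicted word counts as correct only when both of its boundaries agree with the gold segmentation (function names are illustrative):

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f1(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one segmented sentence."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In a full evaluation the span counts would be accumulated over the whole test set before computing the three ratios.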

Rank score of F1 measure

Recall of Out-of-Vocabulary (ROOV)

Rank score of ROOV
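Out-of-vocabulary recall restricts recall to gold words that never appear in the training vocabulary. A minimal sketch, assuming the vocabulary is given as a set of training words and a gold word is recovered only when the prediction has the identical span (names are illustrative):

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def oov_recall(gold_words, pred_words, vocab):
    """Fraction of gold words outside vocab that the prediction segments exactly."""
    pred = to_spans(pred_words)
    start = 0
    oov = 0
    hit = 0
    for w in gold_words:
        span = (start, start + len(w))
        start += len(w)
        if w not in vocab:
            oov += 1
            if span in pred:
                hit += 1
    return hit / oov if oov else 0.0
```

This metric isolates how well a feature set generalizes to unseen words, which is where the conclusion reports gains for TCB/TCF.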

Conclusion • Feature collections that contain AVS obtain better F1 • TCB/TCF enhances the 6-tag approach on recall of out-of-vocabulary words • Overlapping labels keep useful information only when the features are of high quality

Thanks for your attention
