
for Rocling 2011

Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for Rocling 2011. Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu. Institute of Information Science, Academia Sinica



  1. Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for Rocling 2011 • Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu • Institute of Information Science, Academia Sinica • Department of Computer Science & Engineering, Yuan Ze University

  2. Introduction • Term Contributed Boundary feature using Conditional Random Fields (2010) • A unified view of several unsupervised feature selection methods based on frequent strings

  3. Flow chart

  4. Flow chart

  5. Toolkit • SRILM • YASA

  6. SRILM • C++ libraries • The toolkit supports N-gram statistics for language modeling

  7. YASA • Automatically extracts frequent strings from an unlabeled corpus

  8. Flow chart

  9. 6-Tag

  10. Extended Label • [0-9] + [B1|B2|B3|M|E|S]
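
The slides name the 6-tag set and the extended label pattern without spelling out the assignment rule. A minimal sketch, assuming the common convention that the first three characters of a word take B1, B2, B3, the last takes E, anything in between takes M, and a one-character word takes S. The function names `six_tags` and `extended_labels` are hypothetical, and prefixing a single-digit score to each tag is one reading of `[0-9] + [B1|B2|B3|M|E|S]`.

```python
def six_tags(word):
    """Return one tag per character of `word` under an assumed 6-tag convention."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    head = ["B1", "B2", "B3"]
    tags = []
    # Every character except the last: B1..B3 for the first three, then M.
    for i in range(n - 1):
        tags.append(head[i] if i < 3 else "M")
    # The last character of a multi-character word closes it.
    tags.append("E")
    return tags


def extended_labels(word, score):
    """Prefix each tag with a single-digit score (assumed reading of the slide)."""
    if not 0 <= score <= 9:
        raise ValueError("score must be a single digit")
    return [f"{score}{tag}" for tag in six_tags(word)]


print(six_tags("塑膠原料的"))
print(extended_labels("的生產", 1))
```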

  11. Score • N-Gram score • Frequent string score • Accessor variety score

  12. Score • Converted from term frequency and N-Gram frequency • Logarithm ranking mechanism
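
The transcript does not give the ranking formula. One plausible sketch is to take the integer part of the logarithm of a frequency and cap it so that the result fits the single digit used by the extended label. The base of 2, the cap of 9, and the name `log_rank` are all assumptions made for illustration.

```python
def log_rank(frequency, base=2, max_rank=9):
    """Map a raw frequency to floor(log_base(frequency)), capped at max_rank.

    Integer division is used instead of math.log to avoid floating-point
    rounding at exact powers of the base.
    """
    rank = 0
    while frequency >= base and rank < max_rank:
        frequency //= base
        rank += 1
    return rank


for freq in (1, 7, 8, 1000, 10**6):
    print(freq, log_rank(freq))
```

Under these assumptions, frequencies in the same power-of-two band share a rank, which keeps the number of distinct feature values small.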

  13. Score

  14. Score • Consider the score of the outer pattern • Equation of AV
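
The equation itself did not survive in the transcript. Accessor variety is usually defined as the smaller of the number of distinct characters seen immediately to the left of a string and the number seen immediately to its right. The sketch below follows that usual definition; the function name, the `max_len` limit, and counting each sentence boundary as one shared accessor (`<s>` and `</s>`) are simplifying assumptions, not details taken from the slides.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: min(#distinct left neighbours, #distinct right neighbours)}."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # Sentence boundaries are treated as one shared accessor each.
                left[s].add(sent[i - 1] if i > 0 else "<s>")
                right[s].add(sent[j] if j < n else "</s>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


av = accessor_variety(["abc", "xbcy", "abd"])
print(av["b"], av["bc"], av["a"])
```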

  15. Score

  16. Score • Scores are also used for filtering overlapping patterns

  17. Overlapping and Non-overlapping

  18. Non-overlapping • “塑膠原料的” (score 3) conflicts with “的生產” (score 1) • “的生產” is labeled as unseen
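
The conflict on this slide can be reproduced with a small greedy filter: candidates are visited from the highest score down, and a candidate is dropped when it shares a character position with one already kept. This is only a sketch of one way to get the behaviour shown. The slides do not state the exact procedure or how ties are broken, and `resolve_conflicts` and the `(start, end, score)` tuple layout are hypothetical.

```python
def resolve_conflicts(matches):
    """Split matches into (kept, dropped); each match is (start, end, score), end exclusive."""
    kept, dropped = [], []
    # Python's sort is stable, so equal scores keep their input order.
    for match in sorted(matches, key=lambda m: m[2], reverse=True):
        start, end, _ = match
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        (dropped if overlaps else kept).append(match)
    return sorted(kept), sorted(dropped)


text = "塑膠原料的生產"
matches = [(0, 5, 3), (4, 7, 1)]  # "塑膠原料的" with score 3, "的生產" with score 1
kept, dropped = resolve_conflicts(matches)
print([text[s:e] for s, e, _ in kept], [text[s:e] for s, e, _ in dropped])
```

The dropped span is the one whose characters would then be labeled as unseen.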

  19. Overlapping information?

  20. Overlapping String

  21. Overlapping String

  22. Overlapping String

  23. Overlapping String

  24. Overlapping String

  25. Character-based N-Gram (CNG) • Character-based N-grams extracted by SRILM • Keeps overlapping information

  26. Term Contributed Boundary (TCB) • Uses frequent strings from YASA • Selected by the forward maximum matching algorithm
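
Forward maximum matching scans the text left to right and, at each position, takes the longest string in the lexicon that starts there. A minimal sketch, assuming the frequent strings are available as a plain Python set; skipping one character when nothing matches is an assumption, since the slides do not say how unmatched characters are handled.

```python
def forward_maximum_matching(text, lexicon, max_len=None):
    """Return non-overlapping (start, end) spans of lexicon entries, end exclusive."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    spans = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon:
                spans.append((i, i + length))
                i += length
                break
        else:
            # No lexicon entry starts at position i: move on by one character.
            i += 1
    return spans


text = "塑膠原料的生產"
lexicon = {"塑膠原料的", "的生產", "生產"}
print([text[s:e] for s, e in forward_maximum_matching(text, lexicon)])
```

Because the longest match at position 0 consumes “的”, the overlapping candidate “的生產” is never selected, which is the information loss that the overlapping variants are meant to avoid.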

  27. Term Contributed Frequency (TCF) • Uses frequent strings from YASA • Keeps overlapping information • Converts the score from the frequent string

  28. Accessor Variety based String (AVS) • Uses SRILM to generate N-Grams • Measures how likely a substring is to be a Chinese word • Uses the logarithm ranking mechanism

  29. AVS+TCB and AVS+TCF • Compound of AVS and TCB/TCF

  30. CRF Labeling Scheme

  31. Flow chart

  32. Conditional Random Fields • Undirected graphical models trained to maximize the conditional probability of the label sequence Y given the observation sequence X • Feature instances are generated from a template file

  33. Conditional Random Fields: Feature Function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Feature template

  34. Conditional Random Fields: Feature Function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Example: 欲速則不達 • Feature template

  35. Conditional Random Fields: Feature Function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Example: 欲速則不達 • Feature template

  36. Conditional Random Fields: Feature Function • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens • Example: 欲速則不達 • Feature template
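
To make the template concrete, the sketch below expands the six context features for one position of the example sentence 欲速則不達. It illustrates what a template expansion produces and is not the toolkit's own template syntax; the function name `expand_features` and the `_` padding symbol for positions outside the sentence are assumptions.

```python
def expand_features(tokens, i, pad="_"):
    """Return the six context features for position i of a token sequence."""
    def tok(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else pad

    prev, cur, nxt = tok(-1), tok(0), tok(1)
    return {
        "C-1": prev,           # previous token
        "C0": cur,             # current token
        "C1": nxt,             # next token
        "C-1C0": prev + cur,   # previous and current tokens
        "C0C1": cur + nxt,     # current and next tokens
        "C-1C1": prev + nxt,   # previous and next tokens
    }


tokens = list("欲速則不達")
print(expand_features(tokens, 2))
```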

  37. Experiment • Data set • Academia Sinica (AS) • City University of Hong Kong (CityU) • Microsoft Research (MSR) • Peking University (PKU)

  38. Evaluation Metric

  39. Evaluation Metric
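
The metric definitions on these slides were lost in extraction. Word segmentation is conventionally scored by comparing word spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below computes precision, recall, F1, and the recall of out-of-vocabulary words (ROOV) under that convention; the function names and the use of a plain set as the training vocabulary are assumptions.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F1) over word spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def oov_recall(gold_words, pred_words, vocabulary):
    """Recall restricted to gold words that are absent from the vocabulary."""
    pred = to_spans(pred_words)
    start, total, hit = 0, 0, 0
    for w in gold_words:
        span = (start, start + len(w))
        start += len(w)
        if w not in vocabulary:
            total += 1
            hit += span in pred
    return hit / total if total else 0.0


gold = ["欲速", "則", "不達"]
pred = ["欲速", "則不", "達"]
print(prf(gold, pred), oov_recall(gold, pred, {"則", "不達"}))
```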

  40. F1 measure

  41. Rank score of F1 measure

  42. Recall of Out-Of-Vocabulary

  43. Rank score of ROOV

  44. Conclusion • Feature collections that contain AVS obtain better F1 • TCB/TCF enhances the 6-tag approach on the recall of out-of-vocabulary words • Overlapping labels keep useful information only when the features are of high quality

  45. Thanks for your attention
