
Unsupervised Segmentation of Words into Morphemes Morpho Challenge Workshop 2006



Presentation Transcript


  1. Unsupervised Segmentation of Words into Morphemes: Morpho Challenge Workshop 2006 Mikko Kurimo, Mathias Creutz, Krista Lagus

  2. Opening – Welcomes • Welcome to the Morpho Challenge workshop, everybody! • challenge participants • workshop speakers • other PASCAL researchers • others interested in the topic

  3. Motivation • To design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. • Get basic vocabulary units suitable for different tasks: • Speech and text understanding • Machine translation • Information retrieval • Statistical language modelling • Rule-based systems can split simple cases such as read + ing, but have difficulty with complex words and languages

  4. Workshop 12 April, final timetable • 0900 Opening • 0910 Introduction and evaluation report • 0950 Invited talk by Richard Sproat • 1050 Break • 1120 Morfessor baseline by Krista Lagus • 1150 Competitors' presentations • 1230 Lunch • 1400 Competitors (contd.) • 1500 Discussion • 1530 Conclusion

  5. Morning session • 09:10 Mikko Kurimo • Introduction and Evaluation report • 09:50 Prof. Richard Sproat (Invited Talk) • University of Illinois at Urbana-Champaign • "Computational Morphology and its Implications for the Theoretical Morphology" • 10:50 – 11:20 Coffee break

  6. Noon session • 11:20 Krista Lagus: "Morfessor in MorphoChallenge" • 11:50 Delphine Bernhard: "Morphological segmentation for the automatic acquisition of semantic relationships in the context of MorphoChallenge 2005" • 12:10 Stefan Bordag: "Two-step approach to unsupervised morpheme segmentation" • 12:30 – 14:00 Lunch

  7. Afternoon session • 14:00 Lars Johnsen: • "Learning morphology on tokens" • 14:20 Samarth Keshava and Emily Pitler: • "Reports - Quick and Simple Unsupervised Learning of Morphemes" • 14:40 Eric Atwell (Mikko Kurimo): • "Combinatory Hybrid Elementary Analysis of Text" • 15:00 Discussion • 15:30 Conclusion

  8. Discussion topics for afternoon • New ways to evaluate the obtained units? • New evaluation languages: German, Norwegian, French, Estonian, Arabic, ...? • Other application evaluations: SLU, IR, MT, ...? • New organizer partners? • MorphoChallenge2? • Journal special issue? • 2nd Morpho Challenge workshop? • ?

  9. Opening - Thanks • Thanks to all who made Morpho Challenge possible! • PASCAL network, coordinators, challenge program organizers • Morpho Challenge organizing committee • Morpho Challenge program committee • Morpho Challenge participants • Morpho Challenge evaluation team • Challenge workshop organizers

  10. Let’s start. It is my pleasure to welcome the first speaker, who is...

  11. Morpho Challenge – Introduction and evaluation report Mikko Kurimo, Mathias Creutz, Matti Varjokallio (Helsinki, FI) Ebru Arisoy, Murat Saraclar (Istanbul, TR)

  12. Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

  13. Motivation • To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes. • Get basic vocabulary units suitable for different tasks: • Speech and text understanding • Machine translation • Information retrieval • Statistical language modelling

  14. Motivation • The scientific goals of this challenge are: • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

  15. Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

  16. Call for participation • Part of the EU Network of Excellence PASCAL’s Challenge Program • Participation is open to all and free of charge • Word sets are provided for three languages: Finnish, English, and Turkish • Implement an unsupervised algorithm that segments the words of each language! • No language-specific tweaking parameters, please • Write a paper that describes your algorithm

  17. Rules • Segmented words are submitted to the organizers • Two different evaluations are made • Competition 1: Comparison to a linguistic morpheme segmentation "gold standard" • Competition 2: Speech recognition experiments, where statistical n-gram language models utilize the morphemes instead of entire words.

  18. Datasets • Word lists are downloadable at our home page • Each word in the list is preceded by its frequency • Sizes are given as word types / word tokens • Finnish: newspapers, books, newswires: 1.6M/32M • Turkish: web, newspapers, sports news: 0.6M/17M • English: Gutenberg, Gigaword, Brown: 170k/24M • Small gold standard sample in each language
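
The word-list format described on this slide (each word preceded by its frequency) can be read with a few lines of Python. This is only a minimal sketch: it assumes one whitespace-separated "frequency word" pair per line, and the sample lines and the helper name `read_word_list` are made up for illustration, not taken from the challenge data.

```python
from typing import Dict, Iterable


def read_word_list(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines of the assumed form '<frequency> <word>' into a dict."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] = int(freq)
    return counts


# Invented sample lines, for illustration only.
sample = ["1423 reading", "87 boulevard", "3 cupbearers'"]
counts = read_word_list(sample)

word_types = len(counts)            # number of distinct word forms
word_tokens = sum(counts.values())  # total occurrences in the corpus
print(word_types, word_tokens)
```

The two totals correspond to the two kinds of size figures that a frequency list yields: the number of distinct word forms and the number of running words.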

  19. Participants • A1 Choudri and Dang, Univ. Leeds, UK • A2 a,b, Bernhard, TIMC-IMAG, F • A3 'A.A.' Ahmad and Allendes, Univ. Leeds, UK • A4 'comb', 'lsv', Bordag, Univ. Leipzig, D • A5 Rehman and Hussain, Univ. Leeds, UK • A6 'RePortS', Pitler and Keshava, Yale Univ., USA • A7 Bonnier, Univ. Leeds, UK • A8 Kitching and Malleson, Univ. Leeds, UK • A9 'Pacman', Manley and Williamson, Univ. Leeds, UK • A10 Johnsen, Univ. Bergen, NO • A11 'Swordfish', Jordan, Healy and Keselj, Dalhousie Univ., CA • A12 'Cheat', Atwell and Roberts, Univ. Leeds, UK • M1-3 Morfessor, Categories-ML, MAP, Helsinki Univ. Tech, FI

  20. Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

  21. Competition 1: Word segmentation • Two samples: boule_vard, cup_bearer_s' • Gold standard: boulevard, cup_bear_er_s_' • Counting morpheme boundaries: 2 correct hits (H), 1 insertion (I), 2 deletions (D) • Precision = H / (H + I) = 2 / (2 + 1) = 0.67 • Recall = H / (H + D) = 2 / (2 + 2) = 0.50 • F-Measure = harmonic mean of precision and recall = 2H / (2H + I + D) = 4 / (4 + 1 + 2) = 0.57 • A secret (random) 10% subset of words evaluated • Morfessor Baseline: 54.2% FI, 51.3% TR, 66.0% EN
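
The boundary counting on this slide can be reproduced with a short Python sketch. It assumes that "_" marks a morpheme boundary and that hits, insertions and deletions are counted over boundary positions; the function names `boundaries` and `score` are illustrative and are not the organizers' evaluation script.

```python
from typing import Iterable, Set, Tuple


def boundaries(segmented: str, sep: str = "_") -> Set[int]:
    """Boundary positions, measured in letters of the unsegmented word."""
    positions: Set[int] = set()
    letters = 0
    for ch in segmented:
        if ch == sep:
            positions.add(letters)
        else:
            letters += 1
    return positions


def score(proposed: Iterable[str], gold: Iterable[str]) -> Tuple[int, int, int, float, float, float]:
    """Return (H, I, D, precision, recall, F-measure) over all word pairs."""
    hits = insertions = deletions = 0
    for p, g in zip(proposed, gold):
        pb, gb = boundaries(p), boundaries(g)
        hits += len(pb & gb)        # boundary in both segmentations
        insertions += len(pb - gb)  # proposed boundary missing from gold
        deletions += len(gb - pb)   # gold boundary that was not proposed
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    total = 2 * hits + insertions + deletions
    f_measure = 2 * hits / total if total else 0.0
    return hits, insertions, deletions, precision, recall, f_measure


# The two sample words and their gold-standard segmentations from the slide.
h, i, d, p, r, f = score(["boule_vard", "cup_bearer_s'"],
                         ["boulevard", "cup_bear_er_s_'"])
print(h, i, d, round(p, 2), round(r, 2), round(f, 2))  # 2 1 2 0.67 0.5 0.57
```

Here boule_vard contributes the single insertion, while the gold boundaries after "bear" and before the apostrophe in cup_bear_er_s_' are the two deletions.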

  22. Results: F-measure in Finnish data

  23. F-measure with reference algorithms

  24. F-measure in Turkish data

  25. F-measure with reference algorithms

  26. F-measure in English data

  27. F-measure with reference algorithms

  28. F-measure, the 3 languages task

  29. ...with reference algorithms

  30. Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

  31. Competition 2: Language modeling • A statistical N-gram LM is trained on the obtained morphemes using a large text corpus • Growing N-gram model for Finnish by HUT tools • 4-gram model for Turkish using SRILM • Free lexicon size (40,000 – 700,000) • ~10M N-grams (Finnish) or 50-70M bytes (Turkish)
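
To make concrete what training an N-gram model on morphemes instead of words involves, the sketch below simply counts N-grams over morph sequences. It is a toy illustration only: it does not reproduce the HUT tools or SRILM, does no smoothing or model growing, and the word-boundary token "<w>", the sentence markers and the tiny corpus are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def morph_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams over sentences given as lists of morph tokens."""
    counts: Counter = Counter()
    for morphs in sentences:
        # Pad so that the first morphs also have a full-length history.
        padded = ["<s>"] * (n - 1) + list(morphs) + ["</s>"]
        for start in range(len(padded) - n + 1):
            ngram: Tuple[str, ...] = tuple(padded[start:start + n])
            counts[ngram] += 1
    return counts


# Toy corpus: "reading books" and "read books", with "<w>" as an assumed
# word-boundary token between the morphs of adjacent words.
corpus = [
    ["read", "ing", "<w>", "book", "s"],
    ["read", "<w>", "book", "s"],
]
bigrams = morph_ngrams(corpus, 2)
print(bigrams[("book", "s")], bigrams[("read", "ing")])  # 2 1
```

Because the model units are morphs, the lexicon stays small even for languages where the number of distinct word forms runs into the hundreds of thousands.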

  32. Evaluation by speech recognition • Realistic benchmark application: Continuous reading of large-vocabulary texts (books and news) • Letter error rate LER% = (sub + ins + del) / letters • Baseline systems using LMs of Morfessor’s segments • Finnish recognizer made at HUT (HUT tools): speaker-dep., running speed 10-15 xRT, baseline 1.31% LER • Turkish made at Bogazici Univ. (HTK and AT&T tools): speaker-indep., running 2-3 xRT, baseline 13.7% LER
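
The letter error rate formula on this slide can be sketched with a standard Levenshtein (edit distance) computation. This is a minimal illustration, assuming that the minimum total of substitutions, insertions and deletions is used, that "letters" means the length of the reference text, and that every character (spaces included) counts as a letter; the recognizers' actual scoring tools may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn ref into hyp (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference letter
                cur[j - 1] + 1,          # insertion of a hypothesis letter
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def letter_error_rate(ref: str, hyp: str) -> float:
    """LER% = 100 * (sub + ins + del) / letters in the reference."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(letter_error_rate("abcd", "abxd"))  # one substitution in four letters: 25.0
```

Measuring errors per letter rather than per word avoids penalizing a whole long word for a single wrong suffix, which matters for morphologically rich languages such as Finnish and Turkish.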

  33. Speech recognition letter error rate (LER)

  34. LER for reference algorithms

  35. LER for grammatical rules and words, too

  36. Update for Turkish results NEW

  37. Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

  38. Conclusion • The scientific goals of this challenge are: • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

  39. Conclusion • 14 different unsupervised segmentation algorithms • 12 participating research groups • Evaluations for 3 languages • Full report and papers in the proceedings • Website: http://www.cis.hut.fi/morphochallenge2005

  40. Acknowledgments • Text and speech data providers in all languages! • Finnish and Turkish evaluation teams • Funding from PASCAL, Finnish Academy, Lang. Tech. Grad school, HUT, and Bogazici Univ. • LM and ASR tools in HUT, SRI, and AT&T • Competition participants!

  41. The second speaker today: Professor Richard Sproat, University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"

  42. Richard Sproat • Professor of Linguistics and Electrical and Computer Engineering at the University of Illinois and head of the Computational Linguistics Lab at the Beckman Institute. • Received his Ph.D. from MIT in 1985 and has since also worked at AT&T Bell Labs. • A well-known expert in language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, writing systems, and text-to-scene conversion.
