1 / 29

Unsupervised Morpheme Analysis - Morpho Challenge Workshop 2007

Join us at the workshop to design machine learning algorithms for morpheme discovery in language models. Explore new evaluation methods and applications in speech recognition, machine translation, and information retrieval.

cambella
Download Presentation

Unsupervised Morpheme Analysis - Morpho Challenge Workshop 2007

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1" 09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2" 10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis" 10:30 Break 11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis" 11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology" 11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR" 12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology" 12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules" 13:00 End of workshop

  2. Unsupervised Morpheme AnalysisMorpho Challenge Workshop 2007 Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen Helsinki University of Technology, Finland

  3. Opening Welcome to the Morpho Challenge 2007 workshop: • challenge participants • workshop speakers • other CLEF researchers • others interested in the topic

  4. Motivation • To design statistical machine learning algorithms that discover which morphemes words consist of • Follow-up to Morpho Challenge 2005 (segmentation of words into morphs) • Morphemes are useful as vocabulary units for statistical language modeling in: Speech recognition, Machine translation, Information retrieval

  5. Discussion topics for the end • New ways to evaluate morphemes ? • New test languages: Hungarian, Estonian, Russian, Arabic, Korean, Japanese, Chinese ? • New application evaluations: MT,..? • New organizing partners ? • Morpho Challenge 3 ? • Journal special issue ? • 3rd Morpho Challenge workshop ?

  6. Thanks Thanks to all who made Morpho Challenge 2007 possible: • PASCAL network, CLEF, Leipzig corpora collection • Morpho Challenge organizing committee • Morpho Challenge participants • Morpho Challenge program committee • Morpho Challenge evaluation team • CLEF 2007 workshop organizers

  7. 09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1" 09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2" 10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis" 10:30 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis" 11:00 Break 11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology" 11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR" 12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology" 12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules" 13:00 End of workshop

  8. 09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1" 09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2" 10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis" 10:30 Break 11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis" 11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology" 11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR" 12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology" 12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules" 13:00 End of workshop

  9. Unsupervised Morpheme Analysis Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1 Mikko Kurimo, Mathias Creutz, Matti Varjokallio

  10. Contents • Objectives • Call for participation, Rules, Datasets • Participants • Morfessor • New evaluation method • Results • Conclusion

  11. Scientific objectives • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

  12. Call for participation • Part of the EU Network of Excellence PASCAL’s Challenge Program • Organized in collaboration with CLEF • Participation is open to all and free of charge • Word sets are provided for: Finnish, English,German and Turkish • Implement an unsupervised algorithm that discovers morpheme analysis of words in each language!

  13. Rules • Morpheme analysis are submitted to the organizers and two different evaluations are made • Competition 1: Comparison to a linguistic morpheme "gold standard“ • Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words.

  14. Datasets • Word lists downloadable at our home page • Each word in the list is preceded by its frequency • Finnish: 3M sentences, 2.2M word types • Turkish: 1M sentences, 620K word types • German: 3M sentences, 1.3M word types • English: 3M sentences, 380K word types • Small gold standard sample available in each language

  15. Examples of gold standard analyses • English: baby-sitters baby_N sit_V er_s +PL • Finnish: linuxiin           linux_N +ILL • Turkish: kontrole         kontrol +DAT • German: zurueckzubehalten  zurueck_B zu be halt_V +INF

  16. New evaluation method • Problem: The unsupervised morphemes may have arbitrary names, not the same as the ”real” linguistic morphemes, nor just subword strings • Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs • Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

  17. Participants • Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt Univ. Tech., D) • Stefan Bordag, Univ. Leipzig, D • Paul McNamee and James Mayfield, JHU, USA • Daniel Zeman, Karlova Univ., CZ \\ • Christian Monson et al., CMU, USA • Emily Pitler and Samarth Keshava, Univ. Yale, USA • Morfessor MAP, Helsinki Univ. Tech., FI • (Michael Tepper, Univ. Washington, USA)

  18. Contents • Objectives • Call for participation, Rules, Datasets • Participants • Morfessor • New evaluation method • Results • Conclusion

  19. Contents • Objectives • Call for participation, Rules, Datasets • Participants • Morfessor • New evaluation method • Results • Conclusion

  20. New evaluation method • Problem: The unsupervised morphemes may have arbitrary names, not the same as the ”real” linguistic morphemes, nor just subword strings • Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs • Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

  21. Evaluation measures • F-measure = 1/(1/Precision + 1/Recall) • Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard • Recall is the proportion of word pairs sampled from the gold standardthat also have a morpheme in common according to the suggested algorithm

  22. Results: Finnish, 2.2M word types

  23. Results: Turkish, 620K word types

  24. Results: German, 1.3M word types

  25. Results: English, 380K word types

  26. Conclusion • 12 different unsupervised algorithms • 6 participating research groups • Evaluations for 4 languages • Good results in all languages and in IR • Full report and papers in the CLEF proceedings • Website: http://www.cis.hut.fi/morphochallenge2007/

  27. Acknowledgments • Data from Leipzig and CLEF • Gold standard providers in all languages! • Workshop organization by CLEF • Funding from PASCAL and Academy of Finland • Competition participants!

  28. 09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1" 09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2" 10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis" 10:30 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis" 11:00 Break 11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology" 11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR" 12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology" 12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules" 13:00 End of workshop

  29. 09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1" 09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2" 10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis" 10:30 Break 11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis" 11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology" 11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR" 12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology" 12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules" 13:00 End of workshop

More Related