Unsupervised Morpheme Analysis - Morpho Challenge Workshop 2007

09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007"
09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1"
09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2"
10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis"
10:30 Break
11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis"
11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology"
11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR"
12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology"
12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules"
13:00 End of workshop

Unsupervised Morpheme Analysis - Morpho Challenge Workshop 2007 Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen Helsinki University of Technology, Finland

Opening Welcome to the Morpho Challenge 2007 workshop: • challenge participants • workshop speakers • other CLEF researchers • others interested in the topic

Motivation • To design statistical machine learning algorithms that discover which morphemes words consist of • Follow-up to Morpho Challenge 2005 (segmentation of words into morphs) • Morphemes are useful as vocabulary units for statistical language modeling in: Speech recognition, Machine translation, Information retrieval

Discussion topics for the end • New ways to evaluate morphemes? • New test languages: Hungarian, Estonian, Russian, Arabic, Korean, Japanese, Chinese? • New application evaluations: MT, ...? • New organizing partners? • Morpho Challenge 3? • Journal special issue? • 3rd Morpho Challenge workshop?

Thanks Thanks to all who made Morpho Challenge 2007 possible: • PASCAL network, CLEF, Leipzig corpora collection • Morpho Challenge organizing committee • Morpho Challenge participants • Morpho Challenge program committee • Morpho Challenge evaluation team • CLEF 2007 workshop organizers

Unsupervised Morpheme Analysis Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1 Mikko Kurimo, Mathias Creutz, Matti Varjokallio

Contents • Objectives • Call for participation, Rules, Datasets • Participants • Morfessor • New evaluation method • Results • Conclusion

Scientific objectives • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

Call for participation • Part of the EU Network of Excellence PASCAL’s Challenge Program • Organized in collaboration with CLEF • Participation is open to all and free of charge • Word sets are provided for: Finnish, English, German and Turkish • Implement an unsupervised algorithm that discovers the morpheme analysis of words in each language!

Rules • Morpheme analyses are submitted to the organizers and two different evaluations are made • Competition 1: Comparison to a linguistic morpheme "gold standard" • Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words.

Datasets • Word lists downloadable at our home page • Each word in the list is preceded by its frequency • Finnish: 3M sentences, 2.2M word types • Turkish: 1M sentences, 620K word types • German: 3M sentences, 1.3M word types • English: 3M sentences, 380K word types • Small gold standard sample available in each language
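The word list format described above, where each word is preceded by its frequency, can be read with a few lines of Python. This is only a minimal sketch: the function name `read_word_list`, the assumption that whitespace separates the count from the word, and the sample counts are all illustrative and not taken from the challenge tooling.

```python
from typing import Dict, Iterable


def read_word_list(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines of the form '<frequency> <word>' into a word -> count map."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the count and the word are separated by whitespace.
        freq, word = line.split(None, 1)
        counts[word] = int(freq)
    return counts


# Hypothetical sample; the counts are made up for illustration.
sample = ["12 linuxiin", "3 kontrole", "", "1 baby-sitters"]
word_counts = read_word_list(sample)
```

In practice the same function could be fed an open file object, since it only iterates over lines.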

Examples of gold standard analyses • English: baby-sitters baby_N sit_V er_s +PL • Finnish: linuxiin linux_N +ILL • Turkish: kontrole kontrol +DAT • German: zurueckzubehalten zurueck_B zu be halt_V +INF
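As an illustration of how such analyses might be handled in code, the sketch below splits each example into the word and its list of morpheme labels. It assumes the layout shown on the slide (word first, then whitespace-separated labels); the actual gold standard file format may differ, and `parse_analysis` is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_analysis(line: str) -> Tuple[str, List[str]]:
    """Split '<word> <label> <label> ...' into the word and its morpheme labels."""
    word, *labels = line.split()
    return word, labels


# The four examples from the slide, in the layout assumed above.
examples = [
    "baby-sitters baby_N sit_V er_s +PL",
    "linuxiin linux_N +ILL",
    "kontrole kontrol +DAT",
    "zurueckzubehalten zurueck_B zu be halt_V +INF",
]
gold = dict(parse_analysis(line) for line in examples)
```

Note that the labels are abstract morpheme names rather than substrings of the word: `linuxiin` maps to `linux_N` and `+ILL`, with no claim about where the boundary falls.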

New evaluation method • Problem: The unsupervised morphemes may have arbitrary names, not the same as the "real" linguistic morphemes, nor just subword strings • Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs • Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

Participants • Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt Univ. Tech., D) • Stefan Bordag, Univ. Leipzig, D • Paul McNamee and James Mayfield, JHU, USA • Daniel Zeman, Karlova Univ., CZ • Christian Monson et al., CMU, USA • Emily Pitler and Samarth Keshava, Univ. Yale, USA • Morfessor MAP, Helsinki Univ. Tech., FI • (Michael Tepper, Univ. Washington, USA)

Evaluation measures • F-measure = 2/(1/Precision + 1/Recall), the harmonic mean of precision and recall • Precision is the proportion of word pairs sampled from the suggested analyses that also have a morpheme in common according to the gold standard • Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
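The pair-based measures can be sketched on a toy example. The code below is a simplified illustration, not the official evaluation script: it enumerates all morpheme-sharing word pairs instead of drawing a random sample, ignores alternative analyses, and uses the standard harmonic-mean F-measure. The function names, the arbitrary labels `m1` to `m4`, and the toy data are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Analyses = Dict[str, Set[str]]  # word -> set of morpheme labels


def sharing_pairs(analyses: Analyses) -> List[Tuple[str, str]]:
    """All word pairs whose analyses have at least one morpheme label in common."""
    return [
        (a, b)
        for a, b in combinations(sorted(analyses), 2)
        if analyses[a] & analyses[b]
    ]


def pair_agreement(source: Analyses, reference: Analyses) -> float:
    """Fraction of morpheme-sharing pairs in `source` that also share a morpheme in `reference`."""
    pairs = [(a, b) for a, b in sharing_pairs(source) if a in reference and b in reference]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if reference[a] & reference[b])
    return hits / len(pairs)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if either is zero)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)


# Toy gold standard with linguistic labels.
gold = {
    "cat": {"cat_N"},
    "cats": {"cat_N", "+PL"},
    "dogs": {"dog_N", "+PL"},
    "walked": {"walk_V", "+PAST"},
}
# Toy unsupervised output with arbitrary morpheme names.
proposed = {
    "cat": {"m1"},
    "cats": {"m1", "m2"},
    "dogs": {"m3", "m2"},
    "walked": {"m4", "m1"},
}

precision = pair_agreement(proposed, gold)  # pairs taken from the proposed analyses
recall = pair_agreement(gold, proposed)     # pairs taken from the gold standard
f_score = f_measure(precision, recall)
```

Because only the sharing of a label between two words is compared, the arbitrary names `m1` and `m2` never need to be mapped to `cat_N` or `+PL`. In this toy data the proposed analysis wrongly links `walked` to `cat` and `cats` through `m1`, which lowers precision but leaves recall untouched.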

Results: Finnish, 2.2M word types

Results: Turkish, 620K word types

Results: German, 1.3M word types

Results: English, 380K word types

Conclusion • 12 different unsupervised algorithms • 6 participating research groups • Evaluations for 4 languages • Good results in all languages and in IR • Full report and papers in the CLEF proceedings • Website: http://www.cis.hut.fi/morphochallenge2007/

Acknowledgments • Data from Leipzig and CLEF • Gold standard providers in all languages! • Workshop organization by CLEF • Funding from PASCAL and Academy of Finland • Competition participants!
