
Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF

This article provides an overview of Morpho Challenge 2007 in CLEF, which focused on unsupervised morpheme analysis. The challenge aimed to design statistical machine learning algorithms that discover which morphemes words consist of, with applications in speech recognition, machine translation, and information retrieval.


Presentation Transcript


  1. Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF • Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen • Helsinki University of Technology, Finland

  2. My job at Helsinki: Multimodal Interfaces @ Adaptive Informatics (Research Centre of Academy of Finland)

  3. Research topics of MMI group • Continuous Speech Recognition • Adaptive Natural Language Modelling • Content Based Image and Video Retrieval • Multimodal Interfaces: Proactive audio-visual information navigation, Effective multilingual interaction, Intermodal cross-over of semantics

  4. Motivation of Morpho Challenge • To design statistical machine learning algorithms that discover which morphemes words consist of • Follow-up to Morpho Challenge 2005 (segmentation of words into morphs) • Morphemes are useful as vocabulary units for statistical language modeling in: Speech recognition, Machine translation, Information retrieval

  5. The vocabulary problem • Many applications require a large vocabulary: e.g. speech recognition, information retrieval, machine translation. • Agglutinative and highly-inflected languages suffer from a severe vocabulary explosion • We need more efficient representation units [Figure: Unique words per corpus size; x-axis: Corpus size (million words), y-axis: Unique words (millions)]

  6. Scientific objectives • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages and tasks • To advance machine learning methodology

  7. Morpho Challenge 2007 • Part of the EU Network of Excellence PASCAL’s Challenge Program • Organized in collaboration with CLEF • Participation is open to all and free of charge • Word sets are provided for: Finnish, English, German and Turkish • Implement an unsupervised algorithm that discovers the morpheme analysis of words in each language!

  8. Thanks Thanks to all who made Morpho Challenge 2007 possible: • PASCAL network, CLEF, Leipzig corpora collection • Morpho Challenge organizing committee • Morpho Challenge program committee • Morpho Challenge participants • Morpho Challenge evaluation team • CLEF 2007 organizers!

  9. Rules • Morpheme analyses are submitted to the organizers and two different evaluations are made • Competition 1: Comparison to a linguistic morpheme "gold standard" • Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words.

  10. Training data • Word lists downloadable at our home page • Each word in the list is preceded by its frequency • Finnish: 3M sentences, 2.2M word types • Turkish: 1M sentences, 620K word types • German: 3M sentences, 1.3M word types • English: 3M sentences, 380K word types • Small gold standard sample available in each language
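The word list format described on this slide (each word preceded by its frequency) can be read with a few lines of Python. This is a minimal sketch that assumes one whitespace-separated `frequency word` pair per line; the function name `parse_word_list` and the sample frequencies are hypothetical, not taken from the challenge data.

```python
from typing import Dict, Iterable


def parse_word_list(lines: Iterable[str]) -> Dict[str, int]:
    """Map each word to its corpus frequency.

    Assumes each non-empty line looks like "<frequency> <word>".
    """
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] = int(freq)
    return counts


# Hypothetical sample; the real lists hold up to 2.2M word types.
sample = ["1 linuxiin", "5 kontrole", "12 baby-sitters"]
counts = parse_word_list(sample)
print(len(counts), "word types")
```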

  11. Examples of gold standard analyses • English: baby-sitters → baby_N sit_V er_s +PL • Finnish: linuxiin → linux_N +ILL • German: zurueckzubehalten → zurueck_B zu be halt_V +INF • Turkish: kontrole → kontrol +DAT

  12. 1. A new linguistic evaluation method • Problem: The unsupervised morphemes may have arbitrary names that are neither the same as the "real" linguistic morphemes nor plain subword strings • Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs • Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

  13. Evaluation measures • F-measure = 2/(1/Precision + 1/Recall), the harmonic mean of Precision and Recall • Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard • Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
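The pair-based measures can be illustrated with a small sketch. It is a simplification under stated assumptions: it enumerates every morpheme-sharing word pair instead of drawing a large random sample, it applies no per-word weighting, and it combines precision and recall with the harmonic mean. The names `sharing_pairs` and `pair_scores` and the toy analyses are hypothetical. Because only shared labels matter, the arbitrary names `m1`, `m2`, ... of the unsupervised morphemes do not need to match the gold standard labels.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Analysis = Dict[str, List[str]]  # word -> list of morpheme labels


def sharing_pairs(analysis: Analysis) -> Set[Tuple[str, str]]:
    """All word pairs that have at least one morpheme label in common."""
    pairs = set()
    for w1, w2 in combinations(sorted(analysis), 2):
        if set(analysis[w1]) & set(analysis[w2]):
            pairs.add((w1, w2))
    return pairs


def pair_scores(proposed: Analysis, gold: Analysis) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over morpheme-sharing word pairs."""
    words = set(proposed) & set(gold)  # only words analysed by both
    prop_pairs = sharing_pairs({w: proposed[w] for w in words})
    gold_pairs = sharing_pairs({w: gold[w] for w in words})
    hits = len(prop_pairs & gold_pairs)
    precision = hits / len(prop_pairs) if prop_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical): the proposed analysis misses the dog/dogs link.
gold = {"cat": ["cat_N"], "cats": ["cat_N", "+PL"],
        "dog": ["dog_N"], "dogs": ["dog_N", "+PL"]}
proposed = {"cat": ["m1"], "cats": ["m1", "m2"],
            "dog": ["m4"], "dogs": ["m3", "m2"]}
precision, recall, f_measure = pair_scores(proposed, gold)
print(precision, recall, f_measure)
```

In this toy case both proposed pairs are confirmed by the gold standard, while one of the three gold standard pairs is missed, so precision is higher than recall.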

  14. Participants • Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt, D) • Stefan Bordag, Univ. Leipzig, D • Paul McNamee and James Mayfield, JHU, USA • Daniel Zeman, Karlova Univ., CZ • Christian Monson et al., CMU, USA • Emily Pitler and Samarth Keshava, Yale Univ., USA • Morfessor MAP, Helsinki Univ. Tech, FI • (Michael Tepper, Univ. Washington, USA)

  15. Results: Finnish, 2.2M word types

  16. Results: Turkish, 620K word types

  17. Results: German, 1.3M word types

  18. Results: English, 380K word types

  19. 2. Practical evaluation • Real world application for morpheme analysis: Information Retrieval • Analysis is needed to handle morphology (inflection, compounding) • CLEF collections for Finnish, German and English

  20. Data sets • Finnish (CLEF 2004): 55K documents from articles in Aamulehti 94-95; 50 test queries and 23K binary relevance assessments • English (CLEF 2005): 107K documents from articles in Los Angeles Times 94 and Glasgow Herald 95; 50 test queries and 20K binary relevance assessments • German (CLEF 2003): 300K documents from short articles in Frankfurter Rundschau 94, Der Spiegel 94-95 and SDA 94-95; 60 test queries and 23K binary relevance assessments

  21. Reference methods • Morfessor Baseline: our public code since 2002 • Morfessor Categories-MAP: improved, public since 2006 • dummy: no segmentation • grammatical: gold standard segmentations • all: all alternatives included • first: only first alternative • Porter: LEMUR's default stemmer • Tepper: hybrid method based on Morfessor MAP

  22. Evaluation 1/2 • Words in the documents and queries were replaced by the submitted segmentations • New words: the CLEF collections contained words that were not in the original word list; additional segmentations were requested; if a segmentation was not provided, words were indexed as such
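The replacement step, including the fallback for words without a segmentation, can be sketched as follows. This is an illustration only: the function name `to_index_terms`, the dictionary layout and the example segmentation are assumptions, not the organizers' actual pipeline.

```python
from typing import Dict, List


def to_index_terms(tokens: List[str],
                   segmentations: Dict[str, List[str]]) -> List[str]:
    """Replace each word by its submitted morphemes.

    Words without a submitted segmentation are indexed as such.
    """
    terms: List[str] = []
    for token in tokens:
        terms.extend(segmentations.get(token, [token]))
    return terms


# Hypothetical segmentation of one Finnish word; "uusi" has none.
segmentations = {"linuxiin": ["linux", "iin"]}
terms = to_index_terms(["uusi", "linuxiin"], segmentations)
print(terms)
```

The same function would be applied to both documents and queries, so that query morphemes can match document morphemes.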

  23. Evaluation 2/2 • LEMUR toolkit (http://www.lemurproject.org/) • Okapi BM25 retrieval, default parameter settings • Okapi seems to handle common morphemes poorly => stoplist for the most common ones (above a fixed frequency threshold) • Also an alternative set of non-stoplisted results with TFIDF
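A frequency-based stoplist of the kind mentioned on this slide could be built along these lines. The slide does not say how frequency was counted or which threshold was used, so this sketch assumes total collection frequency and a made-up threshold; `build_stoplist` and `remove_stopped` are hypothetical helpers and do not use the LEMUR toolkit.

```python
from collections import Counter
from typing import List, Set


def build_stoplist(docs: List[List[str]], threshold: int) -> Set[str]:
    """Morphemes whose total count in the collection exceeds the threshold."""
    counts = Counter(term for doc in docs for term in doc)
    return {term for term, count in counts.items() if count > threshold}


def remove_stopped(docs: List[List[str]], stoplist: Set[str]) -> List[List[str]]:
    """Drop stoplisted morphemes from every document before indexing."""
    return [[term for term in doc if term not in stoplist] for doc in docs]


# Toy collection of already segmented documents (hypothetical).
docs = [["talo", "ssa"], ["auto", "ssa"], ["ssa", "kissa"]]
stoplist = build_stoplist(docs, threshold=2)
filtered = remove_stopped(docs, stoplist)
print(stoplist, filtered)
```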

  24. Results: Finnish

  25. Results: German

  26. Results: English

  27. Conclusions • Analysis of new words important for Finnish, less so for German and English • Porter stemming unbeaten for English (so far) • Unsupervised morpheme analysis works very well for IR!

  28. Future directions? • Finnish, Turkish, English, German, ...? • Language modeling, Speech recognition, Information Retrieval, ...? • Venice, Budapest, ...? • PASCAL, CLEF, ...?

  29. Summary 2007 • 14 different unsupervised algorithms • 8 participating research groups • Evaluations for 4 languages (3 for IR) • Good results in all languages and IR • Full report and papers in the CLEF proceedings • Details, presentations, links, info at website: http://www.cis.hut.fi/morphochallenge2007/

  30. Acknowledgments • Data from Leipzig and CLEF • Gold standard providers in all languages! • Workshop organization by CLEF • Funding from PASCAL and Academy of Finland • Competition participants!
