WP2: Metadata generation and glossary creation in eLearning

WP2: Metadata generation and glossary creation in eLearning Lothar Lemnitzer Review meeting, 29. August 2008 Luxemburg

Outline • Achievements at M24 • Improvement of tools, second cycle

Achievements (1) Achievements reached in the first two years of the project: • Corpora of learning objects, eight languages, linguistic annotation, keywords marked, definitions marked • Keyword extractor for metadata generation(KWE) • Definition extractor for glossary creation (GCD)

Achievements (2) • Integration of the tools into ILIAS LMS via webservices • Quantitative evaluation of the corpora and tools • Validation of the tools in user-centered usage scenarios for all languages (first round)

Activities of the final phase • Improve the performance of the tools • Implement / integrate linguistic processing chains • Proper embedding of tools in the context of teaching and learning ( WP5) • Documentation

Key Word Extractor • Implemented an additional distribution measure (Averaged Reduced Frequency), which measures word commonness • Implemented a voting mechanism, where each method has a vote and keyword candidates are ordered by votes • Slight modifications to the language models

Key Word Extractor – Lexical Chaining Outcome of the German pilot experiment: • Wordnets are too general to capture the LO-specific vocabulary • Domain-ontology is too specific to generate strong chains • Combination of both resources desirable However, the risk of failure to improve results lead us to the decision to abandon of these experiments

Glossary Candidate Detector • Used machine learning to improve precision of the tool (4 languages) • Machine learning methods: Naive Bayesisan classifier, Balanced Random Forests • Machine learning as single method (Polish) vs. ML as post-processing step (Dutch, Portuguese) • ML for the most frequent definition types (Dutch Portuguese) vs. for all definition types (Polish)

GCD Evaluation – Dutch

GCD Evaluation – Polish

GCD Evaluation – Portuguese, Copula-Definitions

Glossary Candidate Detector - Conclusions • ML (in combination with rule-based grammars) achieve state-of-the art results for all languages • The approach to be chosen depends on whether high precision or high recall is preferred • The ML method and the careful choice of the features are critical to the success

Linguistic Processing Chain • Languages: Czech, Dutch, English, Polish, Portuguese and Romanian • Rationale: enables us to add new documents in the mentioned languages • Based on linguistic processing tools of the partners • Integrated into ILIAS ( WP4)

Documentation • Javadoc documentation of the code (available from the website and sourceforge) • Documentation of the command line interface of stand-alone tools • Documentation of integration procedure and integrated tools / interface ( WP4)

Thank you for your attention

Demo We simulate a tutor who adds a learning objects and generates and edits additional data

WP2: Metadata generation and glossary creation in eLearning