Metadata generation and glossary creation in eLearning Lothar Lemnitzer Review meeting, Zürich, 25 January 2008
Outline • Demonstration of the functionalities • Where we stand • Evaluation of tools • Consequences for the development of the tools in the final phase
Demo We simulate a tutor who adds a learning object and generates and edits the additional data
Where we stand (1) Achievements reached in the first year of the project: • Annotated corpora of learning objects • Stand-alone prototype of keyword extractor (KWE) • Stand-alone prototype of glossary candidate detector (GCD)
Where we stand (2) Achievements reached in the second year of the project: • Quantitative evaluation of the corpora and tools • Validation of the tools in user-centered usage scenarios for all languages • Further development of tools in response to the results of the evaluation
Evaluation - rationale Quantitative evaluation is needed to • Inform the further development of the tools (formative) • Find the optimal setting / parameters for each language (summative)
Evaluation (1) Evaluation is applied to: • the corpora of learning objects • the keyword extractor • the glossary candidate detector In the following, I will focus on the tool evaluation
Evaluation (2) Evaluation of the tools comprises • measuring recall and precision against the manual annotation • measuring agreement on each task between different annotators • measuring acceptance of keywords / definitions (rated on a scale)
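To make the first measure concrete, here is a minimal Python sketch of recall and precision of extracted keywords against a manual gold annotation; the function name and the example keywords are invented for illustration and are not taken from the project's code.

```python
def precision_recall(extracted, gold):
    """Compare two keyword sets; exact string match only (illustrative)."""
    extracted, gold = set(extracted), set(gold)
    hits = extracted & gold
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example data
gold = {"metadata", "learning object", "keyword extraction"}
system = {"metadata", "keyword extraction", "glossary"}
print(precision_recall(system, gold))  # both 2/3 in this toy example
```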
KWE Evaluation – step 1 • One human annotator marked n keywords in document d • First n choices of the KWE for document d extracted • Measure overlap between the two sets • Partial matches are also counted
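A rough sketch of this step-1 overlap measure, comparing the annotator's n keywords with the first n KWE suggestions; the partial-match rule used here (at least one shared token) is an assumption, and the example data is hypothetical.

```python
def overlap(human_keywords, system_keywords):
    """Count exact and partial matches among the KWE's first n choices."""
    human = [k.lower() for k in human_keywords]
    system = [k.lower() for k in system_keywords[:len(human)]]  # first n choices
    exact = sum(1 for k in system if k in human)
    partial = sum(
        1 for k in system
        if k not in human and any(set(k.split()) & set(h.split()) for h in human)
    )
    return exact, partial

print(overlap(["learning object", "metadata"], ["metadata", "object model", "course"]))
# -> (1, 1): "metadata" matches exactly, "object model" overlaps "learning object"
```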
KWE Evaluation – step 2 • Measure Inter-Annotator Agreement (IAA) • Participants read a text (Calimera "Multimedia") • Participants assign keywords to that text (ideally not more than 15) • KWE produces keywords for the same text
KWE Evaluation – step 2 • Agreement is measured between human annotators • Agreement is measured between the KWE and the human annotators We have tested two measures / approaches • kappa according to Bruce / Wiebe • AC1, an alternative agreement coefficient suggested by Debra Haley at OU, based on Gwet
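For illustration, a simplified sketch of the two agreement coefficients for the two-rater, binary case (keyword vs. not keyword); the project's actual computations following Bruce / Wiebe and Gwet may differ in detail, and the decision vectors below are invented.

```python
def observed_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Chance agreement estimated from each rater's marginal probabilities."""
    po = observed_agreement(a, b)
    p_a, p_b = sum(a) / len(a), sum(b) / len(b)
    pe = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (po - pe) / (1 - pe)

def gwet_ac1(a, b):
    """AC1: chance agreement based on the mean prevalence of the category."""
    po = observed_agreement(a, b)
    pi = (sum(a) / len(a) + sum(b) / len(b)) / 2
    pe = 2 * pi * (1 - pi)
    return (po - pe) / (1 - pe)

# Hypothetical decisions for 8 candidate terms (1 = marked as keyword)
rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(rater_a, rater_b), gwet_ac1(rater_a, rater_b))
```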
KWE Evaluation – step 3 • Humans judge the adequacy of keywords • Participants read a text (Calimera "Multimedia") • Participants see 20 keywords generated by the KWE and rate them • Scale 1 – 4 (excellent – not acceptable) • 5 = not sure
GCD Evaluation – step 1 • A human annotator marked definitions in document d • GCD extracts defining contexts from the same document d • Measure overlap between the two sets • Overlap is measured on the sentence level, partial overlap counts
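A small sketch of how sentence-level overlap with partial matches could be computed, representing each defining context as a set of sentence indices; the data layout and example values are illustrative only.

```python
def definition_overlap(gold_defs, system_defs):
    """Both arguments are lists of sets of sentence indices; any shared sentence counts."""
    sys_hits = sum(1 for s in system_defs if any(s & g for g in gold_defs))
    gold_hits = sum(1 for g in gold_defs if any(g & s for s in system_defs))
    precision = sys_hits / len(system_defs) if system_defs else 0.0
    recall = gold_hits / len(gold_defs) if gold_defs else 0.0
    return precision, recall

gold = [{3, 4}, {10}]            # manually marked defining contexts
system = [{4}, {10, 11}, {20}]   # contexts extracted by the GCD
print(definition_overlap(gold, system))  # precision 2/3, recall 1.0
```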
GCD Evaluation – step 2 • Measure Inter-Annotator Agreement • Experiments run for Polish and Dutch • Prevalence-adjusted version of kappa used as a measure • Polish: 0.42; Dutch: 0.44 • IAA rather low for this task
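For the binary decision "sentence belongs to a definition" vs. "does not", a prevalence-adjusted (and bias-adjusted) kappa reduces to a simple function of the observed agreement; whether the project used exactly this PABAK variant is an assumption, and the annotator decisions below are invented.

```python
def pabak(a, b):
    """Prevalence- and bias-adjusted kappa for two raters, two categories."""
    po = sum(x == y for x, y in zip(a, b)) / len(a)
    return 2 * po - 1

# Hypothetical sentence-level decisions by two annotators (1 = definition)
ann1 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
ann2 = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(pabak(ann1, ann2))  # 0.6, with agreement on 8 of 10 sentences
```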
GCD Evaluation – step 3 • Judging quality of extracted definitions • Participants read text • Participants get definitions extracted by GCD for that text and rate quality • Scale 1 – 4 (excellent – not acceptable) • 5 = not sure
GCD Evaluation – step 3 Further findings • Relatively high variance (many "1" and "4" ratings) • Disagreement between users about the quality of individual definitions
Individual user feedback – KWE • The quality of the generated keywords remains an issue • Variance in the responses from different language groups • We suspect a correlation between the users' language and their satisfaction • Performance of the KWE relies on language-specific settings, which we have to investigate further
Individual user feedback – GCD • Not all the suggested definitions are real definitions. • Terms are ok, but definitions cited are often not what would be expected. • Some terms proposed in the glossary did not make any sense. • The ability to see the context where a definition has been found is useful.
Consequences - KWE • Use non-distributional information to rank keywords (layout, chains) • Present first 10 keywords to user, more keywords on demand • For keyphrases, present most frequent attested form • Users can add their own keywords
Consequences - GCD • Split definitions into types and tackle the most important types • Use machine learning alongside local grammars • Look into the part of the grammars which extract the defined term • Users can add their own definitions
Plans for final phase • KWE: work with lexical chains • GCD: extend ML experiments • Finalize documentation of the tools
Validation User scenarios with NLP tools embedded: • Content provider adds keywords and a glossary for a new learning object • Student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object
Validation • Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object
Validation We want to get feedback about • The users' general attitude towards the tools • The users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)
User feedback • Participants appreciate the option to add their own data • Participants found it easy to use the functions
Plans for the next phase Improve precision of extraction results: • KWE – implement the lexical chainer • GCD – use machine learning in combination with local grammars or as a substitute for these grammars • Finalize documentation of the tools
Corpus statistics – full corpus • Measuring lengths of corpora (# of documents, tokens) • Measuring token / type ratio • Measuring type / lemma ratio
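A minimal sketch of these corpus statistics over a tokenised and lemmatised corpus; the input format (one (token, lemma) pair per word) is assumed for illustration and is not the project's actual corpus format.

```python
def corpus_stats(docs):
    """docs: list of documents, each a list of (token, lemma) pairs."""
    tokens = [tok.lower() for doc in docs for tok, _ in doc]
    lemmas = [lem.lower() for doc in docs for _, lem in doc]
    types = set(tokens)
    return {
        "documents": len(docs),
        "tokens": len(tokens),
        "tokens_per_type": len(tokens) / len(types),
        "types_per_lemma": len(types) / len(set(lemmas)),
    }

# Tiny invented example document
doc = [("learning", "learning"), ("objects", "object"), ("object", "object")]
print(corpus_stats([doc]))
# {'documents': 1, 'tokens': 3, 'tokens_per_type': 1.0, 'types_per_lemma': 1.5}
```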
Corpus statistics – full corpus • Bulgarian, German and Polish corpora have a very low number of tokens per type (probably causing data sparseness problems) • English has by far the highest ratio • Czech, Dutch, Portuguese and Romanian are in between • The type / lemma ratio reflects the richness of the inflectional paradigms
To do • Please check / verify these numbers • Report, for the M24 deliverable, about improvements / re-analysis of the corpora (I am aware of such activities for Bulgarian, German, and English)
Corpus statistics – annotated subcorpus • Measuring lengths of annotated documents • Measuring distribution of manually marked keywords over documents • Measuring the share of keyphrases
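A small sketch of the annotated-subcorpus statistics: keywords per document and the share of multi-word keyphrases; the data layout and example annotations are invented for illustration.

```python
from statistics import mean

def keyword_stats(annotations):
    """annotations: dict mapping document id to the list of manually marked keywords."""
    counts = {doc_id: len(kws) for doc_id, kws in annotations.items()}
    all_kws = [kw for kws in annotations.values() for kw in kws]
    keyphrase_share = sum(1 for kw in all_kws if len(kw.split()) > 1) / len(all_kws)
    return counts, mean(counts.values()), keyphrase_share

annotations = {
    "doc1": ["metadata", "learning object", "keyword extraction"],
    "doc2": ["glossary", "definition"],
}
print(keyword_stats(annotations))
# ({'doc1': 3, 'doc2': 2}, 2.5, 0.4)
```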