Metadata generation and glossary creation in eLearning Lothar Lemnitzer Review meeting, Zürich, 25 January 2008
Outline • Demonstration of the functionalities • Where we stand • Evaluation of tools • Consequences for the development of the tools in the final phase
Demo We simulate a tutor who adds a learning object and then generates and edits the additional data
Where we stand (1) Achievements reached in the first year of the project: • Annotated corpora of learning objects • Stand-alone prototype of keyword extractor (KWE) • Stand-alone prototype of glossary candidate detector (GCD)
Where we stand (2) Achievements reached in the second year of the project: • Quantitative evaluation of the corpora and tools • Validation of the tools in user-centered usage scenarios for all languages • Further development of tools in response to the results of the evaluation
Evaluation - rationale Quantitative evaluation is needed to • Inform the further development of the tools (formative) • Find the optimal setting / parameters for each language (summative)
Evaluation (1) Evaluation is applied to: • the corpora of learning objects • the keyword extractor • the glossary candidate detector In the following, I will focus on the tool evaluation
Evaluation (2) Evaluation of the tools comprises • measuring recall and precision compared to the manual annotation • measuring agreement on each task between different annotators • measuring acceptance of keywords / definitions (rated on a scale)
KWE Evaluation – step 1 • One human annotator marked n keywords in document d • The first n choices of the KWE for document d are extracted • Measure overlap between both sets • Partial matches are also counted
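The overlap measurement can be sketched as follows (a minimal illustration; the function names and the exact partial-match rule are assumptions, not the project's scoring code):

    # Sketch of KWE evaluation step 1: overlap between manual and extracted keywords.
    # The partial-match rule (shared content word) is an illustrative assumption.

    def normalize(kw):
        return kw.lower().strip()

    def overlap_scores(manual_keywords, kwe_keywords):
        """Return exact matches, partial matches, precision and recall."""
        n = len(manual_keywords)
        manual = {normalize(k) for k in manual_keywords}
        extracted = [normalize(k) for k in kwe_keywords[:n]]  # first n KWE choices

        exact = sum(1 for k in extracted if k in manual)
        # Partial match: the extracted keyword shares at least one word with a manual keyword.
        partial = sum(
            1 for k in extracted
            if k not in manual and any(set(k.split()) & set(m.split()) for m in manual)
        )

        precision = (exact + partial) / len(extracted) if extracted else 0.0
        recall = (exact + partial) / len(manual) if manual else 0.0
        return exact, partial, precision, recall

    # Example: 3 manual keywords, top-3 KWE suggestions
    print(overlap_scores(["digital library", "metadata", "e-learning"],
                         ["metadata", "library services", "keyword extraction"]))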
KWE Evaluation – step 2 • Measure Inter-Annotator Agreement (IAA) • Participants read text (Calimera „Multimedia“) • Participants assign keywords to that text (ideally not more than 15) • KWE produces keywords for text
KWE Evaluation – step 2 • Agreement is measured between human annotators • Agreement is measured between KWE and human annotators We have tested two measures / approaches • kappa according to Bruce / Wiebe • AC1, an alternative agreement weighting suggested by Debra Haley at OU, based on Gwet
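For illustration, a minimal sketch of two-rater agreement on binary keyword decisions, using the standard Cohen's kappa and Gwet's AC1 formulas (the project uses the Bruce / Wiebe kappa variant and Debra Haley's AC1 adaptation, which may differ in detail):

    # Agreement sketch for two annotators deciding, per candidate term, whether it is a keyword.
    # Standard two-rater formulas; the project's exact variants may differ.

    def agreement(rater_a, rater_b):
        """rater_a, rater_b: lists of 0/1 decisions over the same candidate terms."""
        n = len(rater_a)
        p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n  # observed agreement

        p_a = sum(rater_a) / n  # proportion of 'keyword' votes by rater A
        p_b = sum(rater_b) / n  # proportion of 'keyword' votes by rater B

        # Cohen's kappa: chance agreement from the two raters' marginal distributions
        p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
        kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

        # Gwet's AC1: chance agreement based on the mean prevalence of the positive class
        pi = (p_a + p_b) / 2
        e_gamma = 2 * pi * (1 - pi)
        ac1 = (p_o - e_gamma) / (1 - e_gamma) if e_gamma < 1 else 1.0
        return kappa, ac1

    print(agreement([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]))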
KWE Evaluation – step 3 • Humans judge the adequacy of keywords • Participants read text (Calimera „Multimedia“) • Participants see 20 KW generated by the KWE and rate them • Scale 1 – 4 (excellent – not acceptable) • 5 = not sure
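A small sketch of how such ratings could be aggregated, treating '5 = not sure' as missing rather than as a score (an assumption; the actual analysis may handle it differently):

    # Aggregate acceptance ratings on the 1-4 scale; '5' (not sure) is excluded as missing.
    from statistics import mean, pvariance

    def summarize_ratings(ratings):
        valid = [r for r in ratings if r in (1, 2, 3, 4)]
        return {
            "n_valid": len(valid),
            "n_not_sure": sum(1 for r in ratings if r == 5),
            "mean": mean(valid) if valid else None,
            "variance": pvariance(valid) if valid else None,
        }

    print(summarize_ratings([1, 2, 4, 1, 5, 4, 2, 1]))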
GCD Evaluation - step 1 • A human annotator marked definitions in document d • GCD extracts defining contexts from same document d • Measure overlap between both sets • Overlap is measured on the sentence level, partial overlap counts
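A sketch of the sentence-level overlap check (illustrative only; representing each definition as a set of sentence indices is an assumption):

    # GCD evaluation step 1: sentence-level overlap between manually marked definitions
    # and GCD-extracted defining contexts. Any shared sentence counts as a (partial) match.

    def definition_overlap(manual_defs, extracted_defs):
        """Each definition is a set of sentence indices."""
        hits = sum(1 for ext in extracted_defs if any(ext & man for man in manual_defs))
        precision = hits / len(extracted_defs) if extracted_defs else 0.0
        matched = sum(1 for man in manual_defs if any(man & ext for ext in extracted_defs))
        recall = matched / len(manual_defs) if manual_defs else 0.0
        return precision, recall

    manual = [{3, 4}, {10}, {17, 18}]   # sentences marked by the annotator
    extracted = [{4}, {10, 11}, {25}]   # sentences extracted by the GCD
    print(definition_overlap(manual, extracted))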
GCD Evaluation – step 2 • Measure Inter-Annotator Agreement • Experiments run for Polish and Dutch • A prevalence-adjusted version of kappa is used as the measure • Polish: 0.42; Dutch: 0.44 • IAA is rather low for this task
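As a worked illustration, the prevalence-adjusted figure can be sketched as follows, assuming the common adjustment in which chance agreement is fixed at 0.5 for the binary decision (the exact variant used in the project may differ):

    # Prevalence-adjusted kappa for the binary 'is this sentence part of a definition?' decision.
    # With chance agreement fixed at 0.5, the measure reduces to 2 * p_o - 1.

    def prevalence_adjusted_kappa(rater_a, rater_b):
        n = len(rater_a)
        p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
        return 2 * p_o - 1

    # Observed agreement of about 0.71 yields roughly the 0.42 reported for Polish.
    print(prevalence_adjusted_kappa([1, 1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 1, 1]))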
GCD Evaluation – step 3 • Judging quality of extracted definitions • Participants read text • Participants get definitions extracted by GCD for that text and rate quality • Scale 1 – 4 (excellent – not acceptable) • 5 = not sure
GCD Evaluation – step 3 Further findings • Relatively high variance in the ratings (many scores of '1' and '4') • Disagreement between users about the quality of individual definitions
Individual user feedback - KWE • The quality of the generated keywords remains an issue • Variance in the responses from different language groups • We suspect a correlation between the users' language and their satisfaction • Performance of the KWE relies on the language settings, which we have to investigate further
Individual user feedback – GCD • Not all the suggested definitions are real definitions. • Terms are ok, but definitions cited are often not what would be expected. • Some terms proposed in the glossary did not make any sense. • The ability to see the context where a definition has been found is useful.
Consequences - KWE • Use non-distributional information to rank keywords (layout, chains) • Present first 10 keywords to user, more keywords on demand • For keyphrases, present most frequent attested form • Users can add their own keywords
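One possible shape for such a re-ranking, with purely illustrative features and weights (the actual combination of layout and lexical-chain evidence is still to be designed):

    # Illustrative re-ranking sketch: combine the KWE's distributional score with simple
    # non-distributional evidence (here only layout). Features and weights are assumptions.

    def rerank(candidates, top_n=10):
        """candidates: dicts with 'term', 'score' (distributional), 'in_heading' (bool)."""
        def combined(c):
            layout_boost = 0.2 if c["in_heading"] else 0.0  # assumed weight
            return c["score"] + layout_boost
        ranked = sorted(candidates, key=combined, reverse=True)
        return [c["term"] for c in ranked[:top_n]]          # first 10 shown, rest on demand

    candidates = [
        {"term": "metadata", "score": 0.61, "in_heading": True},
        {"term": "learning object", "score": 0.58, "in_heading": False},
        {"term": "glossary", "score": 0.44, "in_heading": True},
    ]
    print(rerank(candidates))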
Consequences - GCD • Split definitions into types and tackle the most important types • Use machine learning alongside local grammars • Look into the part of the grammars which extract the defined term • Users can add their own definitions
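A minimal sketch of the machine-learning side, assuming a simple bag-of-words sentence classifier on toy data (scikit-learn); in practice it would be combined with, or constrained by, the local grammars:

    # Toy sketch of a definition-sentence classifier to complement the local grammars.
    # Model choice, features and training data are illustrative assumptions only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    sentences = [
        "A learning object is a digital resource used to support learning.",
        "Metadata is data that describes other data.",
        "The workshop takes place in January.",
        "Participants were asked to read the text first.",
    ]
    labels = [1, 1, 0, 0]  # 1 = defining context, 0 = other

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(sentences, labels)

    print(clf.predict(["A glossary is a list of terms with their definitions."]))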
Plans for final phase • KWE, work with lexical chains • GCD, extend ML experiments • Finalize documentation of the tools
Validation User scenarios with NLP tools embedded: • Content provider adds keywords and a glossary for a new learning object • Student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object
Validation • Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object
Validation We want to get feedback about • The users' general attitude towards the tools • The users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)
User feedback • Participants appreciate the option to add their own data • Participants found it easy to use the functions
Plans for the next phase Improve precision of extraction results: • KWE – implement the lexical chainer • GCD – use machine learning in combination with local grammars or as a substitute for these grammars • Finalize documentation of the tools
Corpus statistics – full corpus • Measuring lengths of corpora (# of documents, tokens) • Measuring token / type ratio • Measuring type / lemma ratio
Corpus statistics – full corpus • Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with sparseness) • English has by far the highest ratio • Czech, Dutch, Portuguese and Romanian are in between • The type / lemma ratio reflects the richness of the inflectional paradigms
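A sketch of how these ratios can be computed (toy whitespace tokenisation; a language-specific lemmatizer would replace the placeholder for the type / lemma ratio):

    # Corpus statistics sketch: tokens per type and types per lemma.
    # Tokenisation and the 'lemmatize' helper are placeholders for language-specific tools.

    def corpus_stats(documents, lemmatize=lambda t: t.lower()):
        tokens = [tok for doc in documents for tok in doc.split()]
        types = set(tokens)
        lemmas = {lemmatize(t) for t in types}
        return {
            "documents": len(documents),
            "tokens": len(tokens),
            "types": len(types),
            "tokens_per_type": len(tokens) / len(types),
            "types_per_lemma": len(types) / len(lemmas),
        }

    print(corpus_stats(["Learning objects are reusable .",
                        "A learning object is a reusable resource ."]))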
To do • Please check / verify these numbers • Report, for the M24 deliverable, about improvements / re-analysis of the corpora (I am aware of such activities for Bulgarian, German, and English)
Corpus statistics – annotated subcorpus • Measuring lengths of annotated documents • Measuring the distribution of manually marked keywords over documents • Measuring the share of keyphrases
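A small sketch of these measures (the per-document keyword lists are an assumed representation of the manual annotation):

    # Annotated-subcorpus sketch: keywords per document and share of multi-word keyphrases.

    def keyword_stats(doc_keywords):
        """doc_keywords: mapping document id -> list of manually marked keywords."""
        counts = [len(kws) for kws in doc_keywords.values()]
        all_kws = [kw for kws in doc_keywords.values() for kw in kws]
        keyphrases = [kw for kw in all_kws if len(kw.split()) > 1]
        return {
            "keywords_per_doc": sum(counts) / len(counts),
            "keyphrase_share": len(keyphrases) / len(all_kws),
        }

    print(keyword_stats({
        "doc1": ["metadata", "learning object", "glossary"],
        "doc2": ["keyword extraction", "e-learning"],
    }))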