Applications of Natural Language Processing

Applications of Natural Language Processing Course 7 – 05 April 2012 Diana Trandabăț dtrandabat@info.uaic.ro

Content • NLP in eLearning • Generating test questions • Keyword identification • Extraction of definitions

eLearning • eLearning comprises all forms of electronically supported learning and teaching. • eLearning 2.0 arrived with the emergence of Web 2.0. • Conventional e-learning systems were based on instructional packets, which were delivered to students using assignments. Assignments were evaluated by the teacher. • In contrast, the new e-learning places increased emphasis on social learning and the use of social software such as blogs, wikis, podcasts etc.

NLP in eLearning • NLP techniques in educational applications working with textual data: • intelligent tutoring systems • automatic generation of exercises • assessment of learner-generated discourse • reading and writing assistance • These applications require an adaptation of NLP techniques to various types of discourse, e.g. tutoring dialogues, which are different from typical task-oriented spoken dialogue systems. • Moreover, educational applications place strong requirements on NLP systems, which have to be robust yet accurate.

Educational Natural Language Processing

Educational NLP • Definition: • Field of research exploring the use of NLP techniques in educational contexts • Why? • Large text repositories with user-generated discourse and user-generated metadata are created • These repositories need advanced information management and NLP to be efficiently accessed • Using these repositories to create structured knowledge bases can improve NLP

Computer-based Testing • Definition: All forms of assessment delivered with the help of computers • also called Computer Assisted/Aided Assessment (CAA) • Adequate question types for CAA (McKenna & Bull, 1999): • Multiple choice questions (MCQs) • True/False questions • Matching questions • Ranking questions • Sequencing questions • etc.

NLP for Computer Assisted Assessment • Generation of questions and exercises • Writing test questions, especially objective test items, is an extremely difficult and time-consuming task for teachers • Use of NLP to automatically generate objective test items, esp. for language learning • Assessment and evaluation of answers to subjective test items • Use of NLP to automatically: • Diagnose errors in short-answer essays • Grade essays

Automatic Generation of Test Items • Source data • Corpora: texts should be chosen according to • the learner model (level, mastered vocabulary) • the instructor model (target language, word category) • Lexical semantic resources, e.g. WordNet • Tools • Tokeniser and sentence splitter • Lemmatiser • Conjugation and declension tools • POS tagger • Parser and chunker

Multiple-Choice Questions • Choose the correct answer among a set of possible answers: • Who was voted the best international footballer for 2004? (a) Henry (b) Beckham (c) Ronaldinho (d) Ronaldo • Usually 3 to 5 alternative answers • Parts of an item: question focus, distractors, correct answer (key)

Distractors • Distractors (also distracters) are the incorrect answers presented as a choice in a multiple-choice test • Challenge: Generation of "good" distractors • Ensure that there is only one correct response for single-response MCQs • The key should not always occur at the same position in the list of answers • Distractors should be grammatically parallel with each other and approximately equal in length • Distractors should be plausible and attractive • However, distractors should not be too close to the correct answer and risk confusing students
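Two of the constraints above (a single correct response and a key that moves around) can be checked mechanically. The following is a minimal sketch, not a method from the course: `build_mcq` and `length_spread` are hypothetical helper names, and marking "Ronaldinho" as the key of the earlier example question is an assumption, since the slide text does not say which option is correct.

```python
import random

def build_mcq(question, key, distractors, rng=None):
    """Assemble a single-response MCQ with the key at a random position."""
    if rng is None:
        rng = random.Random()
    # Only one correct response: the key must not reappear among the distractors.
    if key in distractors:
        raise ValueError("key duplicated among distractors")
    if len(set(distractors)) != len(distractors):
        raise ValueError("duplicate distractors")
    options = [key] + list(distractors)
    rng.shuffle(options)  # the key should not always occupy the same slot
    labels = "abcdefgh"
    return {
        "question": question,
        "options": list(zip(labels, options)),
        "answer": labels[options.index(key)],
    }

def length_spread(options):
    """Difference in characters between the longest and the shortest option."""
    lengths = [len(text) for text in options]
    return max(lengths) - min(lengths)

mcq = build_mcq(
    "Who was voted the best international footballer for 2004?",
    "Ronaldinho",  # assumed key
    ["Henry", "Beckham", "Ronaldo"],
    rng=random.Random(0),
)
for label, text in mcq["options"]:
    print(f"({label}) {text}")
print("key:", mcq["answer"])
print("length spread:", length_spread([text for _, text in mcq["options"]]))
```

A large length spread is only a warning sign; plausibility and grammatical parallelism still need linguistic checks or human review.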

Multiple-Choice Questions 1. Selection of the key • Unknown words that appear in a reading • Domain-specific terms 2. Generation of the question focus • Constrained patterns • Transformation of source clauses to question focuses. Transitive verbs require objects → Which kind of verbs require objects?

Multiple-Choice Questions 3. Generation of the distractors • WordNet concepts which are semantically close to the key, e.g. hypernyms and co-hyponyms • "Which part of speech serves as the most central element in a clause?" • Key: "verb" • Distractors: "noun", "adjective", "preposition" • Same POS • Similar frequency range • For grammar questions, use a declension or a conjugation tool to generate different forms of the key, e.g. change case, number, person, mood, tense, etc. • Common student errors in the given context • Collocations: frequent co-occurrence with either the left or the right context
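The "same POS" and "similar frequency range" criteria can be sketched without any external resource. In the example below the lexicon, its tags and its frequency counts are all invented for illustration, and `pick_distractors` is a hypothetical helper. A real system would read frequencies from a corpus and could add WordNet relations as a further filter.

```python
import math

# Toy lexicon: word -> (part of speech, corpus frequency). All values are invented.
LEXICON = {
    "verb": ("NOUN", 120),
    "noun": ("NOUN", 150),
    "adjective": ("NOUN", 90),
    "preposition": ("NOUN", 60),
    "clause": ("NOUN", 110),
    "syntax": ("NOUN", 5),
    "run": ("VERB", 800),
    "the": ("DET", 50000),
}

def pick_distractors(key, lexicon, n=3, exclude=()):
    """Pick the n words with the key's POS whose log-frequency is closest to the key's."""
    key_pos, key_freq = lexicon[key]
    candidates = [
        (abs(math.log(freq) - math.log(key_freq)), word)
        for word, (pos, freq) in lexicon.items()
        if word != key and pos == key_pos and word not in exclude
    ]
    candidates.sort()
    return [word for _, word in candidates[:n]]

# Words already used in the question focus ("... in a clause?") are excluded.
print(pick_distractors("verb", LEXICON, exclude={"clause"}))
```

Comparing log-frequencies rather than raw counts is a design choice: it treats 60 vs. 120 occurrences as the same distance as 120 vs. 240.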

Fill-in-the-Blank Questions • Consist of a portion of text with certain words removed • The student is asked to "fill in the blanks" • Challenges: • Phrase the question so that only one correct answer is possible (e.g. verb to be conjugated)

Fill-in-the-Blank Question Generation • 1. Selection of an input corpus • 2. POS tagging • 3. Selection of the blanks in the input corpus • Every "n-th" (e.g. fifth or eighth) word in the text • Words in specified frequency ranges, e.g. only high-frequency or low-frequency words • Words belonging to a given grammatical category • Open-class words, given their POS • Machine learning, based on a pool of input questions used as training data • 4. Where needed, provide some information about the word in the blank, e.g. verb lemma when the test targets verb conjugation
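Two of the blank-selection strategies from step 3, together with the lemma hint from step 4, can be sketched as follows. The input is assumed to be already tokenised, POS-tagged and lemmatised (steps 1 and 2). The example sentence, the tag names and both function names are illustrative assumptions.

```python
def blank_every_nth(tokens, n):
    """Blank every n-th token (1-based); punctuation and numbers are left in place."""
    gapped, answers = [], []
    for position, token in enumerate(tokens, start=1):
        if position % n == 0 and token.isalpha():
            gapped.append("____")
            answers.append(token)
        else:
            gapped.append(token)
    return " ".join(gapped), answers

def blank_by_tag(tagged, target_tags, show_lemma=False):
    """Blank tokens whose POS tag is in target_tags; optionally show the lemma as a hint."""
    gapped, answers = [], []
    for token, tag, lemma in tagged:
        if tag in target_tags:
            gapped.append(f"____ ({lemma})" if show_lemma else "____")
            answers.append(token)
        else:
            gapped.append(token)
    return " ".join(gapped), answers

# (token, POS tag, lemma) triples, as a tagger and lemmatiser might produce them.
TAGGED = [
    ("She", "PRON", "she"), ("writes", "VERB", "write"),
    ("long", "ADJ", "long"), ("letters", "NOUN", "letter"),
    ("and", "CCONJ", "and"), ("reads", "VERB", "read"),
    ("old", "ADJ", "old"), ("books", "NOUN", "book"),
]
tokens = [token for token, _, _ in TAGGED]

print(blank_every_nth(tokens, 4))
print(blank_by_tag(TAGGED, {"VERB"}, show_lemma=True))
```

The second call corresponds to a conjugation exercise: the learner sees the lemma and must supply the inflected form.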

Overview of the assessment of learner-generated data • Short answer assessment • Learner's response, one or more target responses, question, source reading passage • Linguistic analysis: annotation, alignment, diagnosis • Essays • Plagiarism detection • Speech generation

Automatic Text Simplification • Related techniques: summarisation and sentence compression • Syntactic simplification: • Removal or replacement of difficult syntactic structures, using hand-built transformational rules applied to dependency and parse trees • Lexical simplification: • Replace difficult words with simpler ones • Difficult words are identified using the number of syllables and/or frequency counts in a corpus • Choose the simplest synonym for difficult words in WordNet
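The lexical simplification steps can be illustrated with a small sketch. The assumptions are substantial: the frequency table and synonym lists are invented stand-ins for corpus counts and WordNet synsets, syllables are approximated by counting vowel groups, and "simplest synonym" is taken to mean "most frequent synonym".

```python
import re

# Invented corpus frequencies and synonym sets (stand-ins for a corpus and WordNet).
FREQ = {"use": 900, "utilise": 12, "employ": 80,
        "help": 1200, "facilitate": 15, "assist": 150}
SYNONYMS = {"utilise": ["use", "employ"], "facilitate": ["help", "assist"]}

def count_syllables(word):
    """Rough syllable estimate: the number of vowel groups (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def is_difficult(word, max_syllables=3, min_freq=50):
    """A word is difficult if it is long (in syllables) or rare in the corpus."""
    return count_syllables(word) > max_syllables or FREQ.get(word.lower(), 0) < min_freq

def simplify(sentence):
    """Replace each difficult word that has known synonyms by its most frequent synonym."""
    simplified = []
    for token in sentence.split():
        word = token.lower()
        if is_difficult(word) and word in SYNONYMS:
            token = max(SYNONYMS[word], key=lambda s: FREQ.get(s, 0))
        simplified.append(token)
    return " ".join(simplified)

print(simplify("Teachers utilise tools that facilitate learning"))
```

The sketch ignores word sense and inflection, so a real system would need disambiguation before substitution and re-inflection afterwards.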

Vocabulary Assistance for Reading • Overall goal: support vocabulary acquisition during reading for: • children, who learn to read • foreign language learners, who read texts in a foreign language • Problem: a word's context may not provide enough information about its meaning • Solution: augment documents with dynamically generated annotations about (problematic) words

Automatic detection of definitions • A grammar is created for the automatic identification of definitions in texts • Types of definitions • “is_def” – “HTML este tot un protocol folosit de World Wide Web.” (HTML is also a protocol used by the World Wide Web.) • “verb_def” – “Poşta electronică reprezintă transmisia mesajelor prin intermediul unor reţele electronice.” (Electronic mail represents the sending of messages through electronic networks.) • “punct_def” – “Bit – prescurtarea pentru binary digit” (Bit – abbreviation for binary digit)

Types of definitions • layout_def • “pron_def” – “…definirii conceptului de baze de date. Acesta descrie metode de ….” (…defining the database concept. It describes methods of ….) • “other_def” – “triunghi echilateral, adică cu toate laturile egale” (equilateral triangle i.e. having all sides equal).

Distribution of the definitions

Rules • Simple grammar rules • Composed grammar rules • “is_def” grammar rule:
<rule name="may_be_term">
  <seq>
    <query match="tok[@base='fi' and substring(@ctag,1,5)='vmip3']"/>
    <first>
      <ref name="UndefNominal" />
      <ref name="DefNominal" />
    </first>
  </seq>
</rule>

Evaluation • Lxtransduce (Tobin 2005) is used to match the grammar against the input files

Question Answering • According to the answer type, questions are divided into the following types (Harabagiu, Moldovan 2007): • Factoid – “Who discovered oxygen?” or “When did Hawaii become a state?” or “What football team won the World Cup in 1992?”. • List – “What countries export oil?” or “What are the regions preferred by the Americans for holidays?”. • Definition – “What is a quasar?” or “What is a question-answering system?”

QA – Example • Question: Cine este Zeus? (Who is Zeus?) (Cine, zeus, PERSON) • Snippet: 0026#10014#1.0#Zeus#Zeus\zeus\NP este\fi\V3\ cel\cel\TSR\ mai\mai\R\ puternic\puternic\ASN\ dintre\dintre\S\ olimpieni\olimpieni\NPN\ ,\,\COMMA\ socotit\socoti\VP\ drept\drept\S\ stăpânul\stăpân\NSRY\ suprem\suprem\ASN\ al\al\TS\ oamenilor\om\NPOY\ şi\şi\CR\ al\al\TS\ zeilor\zeu\NPOY\ .\.\PERIOD\ (Zeus is the most powerful of the Olympians, regarded as the supreme master of men and of gods.) • Our pattern for “is_def” (\zeus\.*\NP .*\fi\V3\(.*)) matches the snippet
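The idea behind the “is_def” pattern (the question term tagged NP, then the verb "fi" tagged V3, then the definition) can be sketched on pre-split triples. This is an illustration only, not the course's lxtransduce grammar or its exact pattern: the snippet is shortened to its first clause, it is stored as (token, lemma, tag) triples rather than in the backslash-delimited format, and `match_is_def` is a hypothetical helper.

```python
# (token, lemma, tag) triples for a shortened version of the Zeus snippet.
SNIPPET = [
    ("Zeus", "zeus", "NP"),
    ("este", "fi", "V3"),
    ("cel", "cel", "TSR"),
    ("mai", "mai", "R"),
    ("puternic", "puternic", "ASN"),
    ("dintre", "dintre", "S"),
    ("olimpieni", "olimpieni", "NPN"),
    (".", ".", "PERIOD"),
]

def match_is_def(term, triples):
    """Return the text after 'TERM(NP) ... fi(V3)', or None if the pattern does not match."""
    term_found = False
    for index, (_, lemma, tag) in enumerate(triples):
        if not term_found:
            # First, locate the question term as a proper noun.
            term_found = lemma == term.lower() and tag == "NP"
        elif lemma == "fi" and tag == "V3":
            # Everything after the copula is taken as the definition.
            rest = [token for token, _, t in triples[index + 1:] if t != "PERIOD"]
            return " ".join(rest)
    return None

print(match_is_def("Zeus", SNIPPET))
print(match_is_def("Hera", SNIPPET))
```

For the question "Cine este Zeus?" the returned span, "cel mai puternic dintre olimpieni" (the most powerful of the Olympians), is the candidate answer.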

Keyword extraction • Using a training corpus of documents annotated with keywords • Measuring the distribution of manually marked keywords over documents

Reflection • Did the human annotators annotate keywords or domain terms? • Was the task adequately contextualised?

Keyword extraction • Good keywords have a typical, non-random distribution in and across documents • Keywords tend to appear more often at certain places in texts (headings etc.) • Keywords are often highlighted / emphasised by authors • Keywords express / represent the topic(s) of a text

Modelling Keywordiness • Linguistic filtering of KW candidates, based on part of speech and morphology • Distributional measures are used to identify unevenly distributed words • TF-IDF • Knowledge of text structure used to identify salient regions (e.g., headings) • Layout features of texts used to identify emphasised words and weight them higher • Finding chains of semantically related words
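As an illustration of the distributional measure named above, here is a minimal TF-IDF sketch. It assumes one common formulation (relative term frequency times the natural log of N over document frequency). The slides do not say which variant was used, and the three toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of non-empty token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the parser builds the tree".split(),
    "the tagger labels the word".split(),
    "the parser uses the tagger".split(),
]
scores = tf_idf(docs)
# Rank the terms of the first document; ties are broken alphabetically.
ranking = sorted(scores[0].items(), key=lambda item: (-item[1], item[0]))
for term, score in ranking:
    print(f"{term:8s} {score:.3f}")
```

"the" occurs in every document and therefore scores zero, while terms confined to one document rank highest. This is the "unevenly distributed" property the slide refers to.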

Challenges • Treating multi-word keywords • Assigning a combined weight which takes into account all the aforementioned factors • Multilinguality: finding good settings for all languages, balancing language-dependent and language-independent features

Treatment of keyphrases • Keyphrases have to be restricted with respect to length (max 3 words) and frequency (min 2 occurrences) • Keyphrase patterns must be restricted with respect to linguistic categories ("style of learning" is acceptable; "of learning styles" is not)
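The length and frequency restrictions translate directly into an n-gram filter. The sketch below approximates the linguistic-category restriction with one simple rule, which is an assumption rather than the course's pattern set: a candidate may not start or end with a function word. The function-word list and example text are invented.

```python
from collections import Counter

# Small illustrative stop list; a real system would use POS patterns instead.
FUNCTION_WORDS = {"of", "the", "a", "an", "in", "on", "and", "to", "for"}

def keyphrase_candidates(tokens, max_len=3, min_count=2):
    """Count n-grams of up to max_len words that do not start or end with a function word."""
    counts = Counter()
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            gram = tuple(tokens[start:start + size])
            if gram[0] in FUNCTION_WORDS or gram[-1] in FUNCTION_WORDS:
                continue  # rejects e.g. "of learning styles"
            counts[gram] += 1
    return {" ".join(gram): n for gram, n in counts.items() if n >= min_count}

text = ("each style of learning suits some learners and "
        "one style of learning may change")
candidates = keyphrase_candidates(text.split())
print(candidates)
```

"style of learning" survives because the function word is phrase-internal, whereas "of learning" and "style of" are rejected at the boundary check.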

KWE Evaluation • Human annotators marked n keywords in document d • First n choices of KWE for document d extracted • Measure overlap between both sets • measure also partial matches
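The evaluation protocol above can be sketched as follows. Exact matches are compared case-insensitively, and a "partial match" is defined here, as an assumption, as a predicted keyword that shares at least one word with some gold keyword. The keyword lists are invented.

```python
def evaluate_kwe(gold, ranked_predictions):
    """Compare the first n system keywords with the n human keywords of one document."""
    if not gold:
        raise ValueError("at least one gold keyword is required")
    n = len(gold)
    gold_set = {keyword.lower() for keyword in gold}
    exact = partial = 0
    for predicted in ranked_predictions[:n]:
        predicted = predicted.lower()
        if predicted in gold_set:
            exact += 1
        elif any(set(predicted.split()) & set(g.split()) for g in gold_set):
            partial += 1  # shares a word with a gold keyword
    return {"exact": exact / n, "partial": partial / n}

gold = ["learning style", "keyword extraction", "WordNet"]
system = ["keyword extraction", "learning", "parser", "wordnet"]
result = evaluate_kwe(gold, system)
print(result)
```

Here only the first three system keywords are considered, so "wordnet" in fourth position does not count even though it is in the gold set.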

NLP has lots to offer • Resources: • Lexical semantic resources, e.g. WordNet • Web 2.0 resources, e.g. Wikipedia, Wiktionary • Tools: • Tokenisation and sentence splitting • Morphological analysis • Part of speech tagging • Parsing and chunking • Word sense disambiguation • Summarisation • Keyword extraction

Tasks and applications • To assist instructors • Automatic generation of questions and exercises • Assessment of learner-generated discourse • To assist learners • Reading and writing assistance • Electronic career guidance • Educational question answering • For all users in the Web 2.0 • NLP for wikis • Quality assessment of user-generated content

A lot more research is done on: • Computer-Assisted Language Learning • Intelligent Tutoring Systems • Information search for eLearning • Educational blogging • Annotations and social tagging • Analysing collaborative learning processes automatically • Learners' corpora and resources • eLearning standards, e.g. SCORM

Requirements (Team: max 2 persons, Deadline: 12 April) • 1a) Extract definitions from a given Wikipedia page • 1b) Generate questions such as “what is …" or “what is the meaning of …" from the list above • 2) Automatic generation of “fill the blanks” questions • Dacă nu ai nimic planificat diseară, hai __ teatru. (If you have nothing planned for tonight, let's go __ the theatre.) • (a) la (b) de (c) pentru (d) null • Input: a sentence and the key • Dacă nu ai nimic planificat diseară, hai la teatru. • Key: la • Output: generate three distractors using different approaches: • Baseline: word frequencies • Collocations • "Creative" method, devised by the students

Further reading • Jill Burstein: Opportunities for Natural Language Processing Research in Education. Proceedings of CICLing '09, the 10th International Conference on Computational Linguistics and Intelligent Text Processing, Springer-Verlag, Berlin, Heidelberg, 2009. • Paola Monachesi, Eline Westerhout: What can NLP techniques do for eLearning? Presented at INFOS 2008, 27-29 March. • Adrian Iftene, Diana Trandabăţ, Ionuţ Pistol: Grammar-based Automatic Extraction of Definitions and Applications for Romanian. RANLP 2007 workshop: Natural Language Processing and Knowledge Representation for eLearning Environments.

Links • Educational Applications of NLP http://www.ets.org/research/topics/as_nlp/educational_applications

Thanks!

Types of plagiarism (1) Plagiarism of authorship: the direct case of putting your own name to someone else’s work. (2) Word-for-word plagiarism: copying of phrases or passages from published text without quotation or acknowledgement. (3) Paraphrasing plagiarism: words or syntax are changed (rewritten), but the source text can still be recognized. (4) Plagiarism of the form of a source: the structure of an argument in a source is copied (verbatim or rewritten). (5) Plagiarism of ideas: the reuse of an original thought from a source text without dependence on the words or form of the source. (6) Plagiarism of secondary sources: original sources are referenced or quoted, but obtained from a secondary source text without looking up the original.
