Applications of Natural Language Processing

Applications of Natural Language Processing. Course 7 – 05 April 2012. Diana Trandabăț, dtrandabat@info.uaic.ro. Content: NLP in eLearning; generating test questions; keyword identification; extraction of definitions.



  1. Applications of Natural Language Processing • Course 7 – 05 April 2012 • Diana Trandabăț, dtrandabat@info.uaic.ro

  2. Content • NLP in eLearning • Generating test questions • Keyword identification • Extraction of definitions

  3. eLearning • eLearning comprises all forms of electronically supported learning and teaching. • eLearning 2.0: arose with the emergence of Web 2.0 • Conventional e-learning systems were based on instructional packets, which were delivered to students using assignments. Assignments were evaluated by the teacher. • In contrast, the new e-learning places increased emphasis on social learning and the use of social software such as blogs, wikis and podcasts.

  4. NLP in eLearning • NLP techniques in educational applications working with textual data: • intelligent tutoring systems • automatic generation of exercises • assessment of learner-generated discourse • reading and writing assistance • These applications require an adaptation of NLP techniques to various types of discourse, e.g. tutoring dialogues, which are different from typical task-oriented spoken dialogue systems. • Moreover, educational applications place strong requirements on NLP systems, which have to be robust yet accurate.

  5. Educational Natural Language Processing

  6. Educational NLP • Definition: • Field of research exploring the use of NLP techniques in educational contexts • Why? • Large text repositories with user-generated discourse and user-generated metadata are created • These repositories need advanced information management and NLP to be efficiently accessed • Using these repositories to create structured knowledge bases can improve NLP

  7. Computer-based Testing • Definition: All forms of assessment delivered with the help of computers • also called Computer Assisted/Aided Assessment (CAA) • Adequate question types for CAA (McKenna & Bull, 1999): • Multiple choice questions (MCQs) • True/False questions • Matching questions • Ranking questions • Sequencing questions • etc.

  8. NLP for Computer Assisted Assessment • Generation of questions and exercises • Writing test questions, especially objective test items, is an extremely difficult and time-consuming task for teachers • Use of NLP to automatically generate objective test items, especially for language learning • Assessment and evaluation of answers to subjective test items • Use of NLP to automatically: • Diagnose errors in short-answer essays • Grade essays

  9. Automatic Generation of Test Items • Source data • Corpora: texts should be chosen according to • the learner model (level, mastered vocabulary) • the instructor model (target language, word category) • Lexical semantic resources, e.g. WordNet • Tools • Tokeniser and sentence splitter • Lemmatiser • Conjugation and declension tools • POS tagger • Parser and chunker

  10. Multiple-Choice Questions • Choose the correct answer among a set of possible answers: • Who was voted the best international footballer for 2004? (a) Henry (b) Beckham (c) Ronaldinho (d) Ronaldo • Usually 3 to 5 alternative answers • Components labelled on the slide: question focus, distractors, correct answer (key)

  11. Distractors • Distractors (also distracters) are the incorrect answers presented as a choice in a multiple-choice test • Challenge: Generation of "good" distractors • Ensure that there is only one correct response for single-response MCQs • The key should not always occur at the same position in the list of answers • Distractors should be grammatically parallel with each other and approximately equal in length • Distractors should be plausible and attractive • However, distractors should not be too close to the correct answer and risk confusing students

  12. Multiple-Choice Questions • 1. Selection of the key • Unknown words that appear in a reading • Domain-specific terms • 2. Generation of the question focus • Constrained patterns • Transformation of source clauses into question focuses, e.g. "Transitive verbs require objects." → "Which kind of verbs require objects?"

  13. Multiple-Choice Questions • 3. Generation of the distractors • WordNet concepts which are semantically close to the key, e.g. hypernyms and co-hyponyms • "Which part of speech serves as the most central element in a clause?" • Key: "verb" • Distractors: "noun", "adjective", "preposition" • Same POS • Similar frequency range • For grammar questions, use a declension or a conjugation tool to generate different forms of the key, e.g. change case, number, person, mode, tense, etc. • Common student errors in the given context • Collocations: frequent co-occurrence with either the left or the right context
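The "same POS" and "similar frequency range" criteria can be sketched as follows. This is a minimal illustration, not the course's implementation: the toy `LEXICON`, the function names and the `max_ratio` threshold are all assumptions made for the example, and a real system would take candidates from a tagged corpus or from WordNet neighbours of the key.

```python
import random

# Hypothetical toy lexicon: word -> (part of speech, corpus frequency).
# The numbers are invented for illustration only.
LEXICON = {
    "verb": ("NOUN", 120),
    "noun": ("NOUN", 150),
    "adjective": ("NOUN", 90),
    "preposition": ("NOUN", 60),
    "run": ("VERB", 300),
    "the": ("DET", 5000),
    "morpheme": ("NOUN", 4),
}


def pick_distractors(key, lexicon, k=3, max_ratio=3.0):
    """Return up to k words with the key's POS and a frequency within max_ratio of the key's."""
    key_pos, key_freq = lexicon[key]
    candidates = [
        word
        for word, (pos, freq) in lexicon.items()
        if word != key
        and pos == key_pos
        and key_freq / max_ratio <= freq <= key_freq * max_ratio
    ]
    # Prefer candidates whose frequency is closest to the key's frequency.
    candidates.sort(key=lambda w: abs(lexicon[w][1] - key_freq))
    return candidates[:k]


def build_options(key, lexicon, rng):
    """Shuffle key and distractors so the key does not always occupy the same position."""
    options = [key] + pick_distractors(key, lexicon)
    rng.shuffle(options)
    return options


if __name__ == "__main__":
    print(build_options("verb", LEXICON, random.Random(0)))
```

Filtering on frequency keeps out candidates such as "morpheme" that are much rarer than the key, and shuffling addresses the requirement that the key should not always appear at the same position.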

  14. Fill-in-the-Blank Questions • Consist of a portion of text with certain words removed • The student is asked to "fill in the blanks" • Challenges: • Phrase the question so that only one correct answer is possible (e.g. verb to be conjugated)

  15. Fill-in-the-Blank Question Generation • 1. Selection of an input corpus • 2. POS tagging • 3. Selection of the blanks in the input corpus • Every "n-th" (e.g. fifth or eighth) word in the text • Words in specified frequency ranges, e.g. only high-frequency or low-frequency words • Words belonging to a given grammatical category • Open-class words, given their POS • Machine learning, based on a pool of input questions used as training data • 4. Where needed, provide some information about the word in the blank, e.g. verb lemma when the test targets verb conjugation
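Two of the blank-selection strategies above (every n-th word, and words of a given grammatical category with an optional lemma hint) can be sketched as below. The function names and the `(word, tag)` input format are assumptions for the example; tokenisation and POS tagging are assumed to have been done by an external tool.

```python
def blank_every_nth(tokens, n):
    """Replace every n-th token (counting from 1) with a blank.

    Returns the cloze text and the list of removed words (the answers).
    """
    cloze, answers = [], []
    for i, tok in enumerate(tokens, start=1):
        if i % n == 0:
            cloze.append("____")
            answers.append(tok)
        else:
            cloze.append(tok)
    return " ".join(cloze), answers


def blank_by_pos(tagged, target_pos, hints=None):
    """Blank every token carrying target_pos.

    tagged is a list of (word, pos) pairs. hints optionally maps a word to
    extra information shown next to the blank, e.g. the verb lemma.
    """
    hints = hints or {}
    cloze, answers = [], []
    for tok, pos in tagged:
        if pos == target_pos:
            hint = hints.get(tok)
            cloze.append(f"____ ({hint})" if hint else "____")
            answers.append(tok)
        else:
            cloze.append(tok)
    return " ".join(cloze), answers


if __name__ == "__main__":
    words = "The student is asked to fill in the blanks".split()
    print(blank_every_nth(words, 5))
    tagged = [("She", "PRON"), ("walks", "VERB"), ("home", "ADV")]
    print(blank_by_pos(tagged, "VERB", hints={"walks": "walk"}))
```

The lemma hint corresponds to step 4: when the test targets verb conjugation, the learner sees the base form and must supply the inflected one.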

  16. Overview of the assessment of learner-generated data • Short answer assessment • Inputs: the learner's response, one or more target responses, the question, and the source reading passage • Linguistic analysis: annotation, alignment, diagnosis • Essays • Plagiarism detection • Speech generation

  17. Automatic Text Simplification • Related techniques: summarisation and sentence compression • Syntactic simplification: • Removal or replacement of difficult syntactic structures, using hand-built transformational rules applied to dependency and parse trees • Lexical simplification: • Replace difficult words with simpler ones • Difficult words are identified using the number of syllables and/or frequency counts in a corpus • Choose the simplest synonym for difficult words in WordNet
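The lexical simplification steps can be sketched as below. Everything here is an assumption made for illustration: the toy `FREQ` and `SYNONYMS` tables stand in for corpus counts and WordNet synsets, the syllable counter is a rough vowel-group heuristic, and "simplest synonym" is approximated by "most frequent synonym".

```python
import re

# Hypothetical toy data standing in for corpus frequencies and WordNet synonyms.
FREQ = {"use": 900, "utilise": 12, "employ": 80, "help": 800, "facilitate": 15}
SYNONYMS = {"utilise": ["use", "employ"], "facilitate": ["help"]}


def count_syllables(word):
    """Rough syllable estimate: the number of groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def is_difficult(word, min_freq=50, max_syllables=3):
    """A word counts as difficult if it is rare or has many syllables."""
    return FREQ.get(word, 0) < min_freq or count_syllables(word) > max_syllables


def simplify(tokens):
    """Replace each difficult word that has known synonyms by its most frequent synonym."""
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok, [])
        if is_difficult(tok) and candidates:
            out.append(max(candidates, key=lambda w: FREQ.get(w, 0)))
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    print(" ".join(simplify("we utilise tools".split())))
```

Words without an entry in the synonym table are left unchanged, so unknown words are never replaced by a guess.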

  18. Vocabulary Assistance for Reading • Overall goal: support vocabulary acquisition during reading for: • children, who learn to read • foreign language learners, who read texts in a foreign language • Problem: a word's context may not provide enough information about its meaning • Solution: augment documents with dynamically generated annotations about (problematic) words

  19. Automatic detection of definitions • A grammar is created for the automatic identification of definitions in texts • Types of definitions • “is_def” – “HTML este tot un protocol folosit de World Wide Web.” (HTML is also a protocol used by the World Wide Web.) • “verb_def” – “Poşta electronică reprezintă transmisia mesajelor prin intermediul unor reţele electronice.” (Electronic mail represents the transmission of messages through electronic networks.) • “punct_def” – “Bit – prescurtarea pentru binary digit” (Bit – abbreviation for binary digit)

  20. Types of definitions • layout_def • “pron_def” – “…definirii conceptului de baze de date. Acesta descrie metode de ….” (…defining the database concept. It describes methods of ….) • “other_def” – “triunghi echilateral, adică cu toate laturile egale” (equilateral triangle i.e. having all sides equal).
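To make the definition types concrete, here is a minimal sketch that classifies English sentences with surface regular expressions. It is a deliberate simplification and an assumption of this example: the course uses a grammar over POS-tagged Romanian text, whereas the patterns below (`PATTERNS`, `classify_definition`) are hypothetical and only mirror the "is_def", "verb_def" and "punct_def" types on plain strings.

```python
import re

# Hypothetical surface patterns, loosely mirroring three of the definition types.
TERM = r"(?P<term>[A-Z][\w ]*?)"
PATTERNS = [
    ("is_def", re.compile(rf"^{TERM} is (?:also )?(?:a|an|the) (?P<definition>.+)$")),
    ("verb_def", re.compile(rf"^{TERM} (?:represents|means|denotes) (?P<definition>.+)$")),
    ("punct_def", re.compile(rf"^{TERM} [–:-] (?P<definition>.+)$")),
]


def classify_definition(sentence):
    """Return (type, term, definition) for the first matching pattern, or None."""
    s = sentence.strip().rstrip(".")
    for label, pattern in PATTERNS:
        match = pattern.match(s)
        if match:
            return label, match.group("term"), match.group("definition")
    return None


if __name__ == "__main__":
    for sent in [
        "HTML is also a protocol used by World Wide Web.",
        "Electronic mail represents sending messages through electronic networks.",
        "Bit – abbreviation for binary digit",
    ]:
        print(classify_definition(sent))
```

String-level patterns like these over-generate quickly (any sentence of the form "X is a Y" matches), which is one reason to state the rules over POS tags and lemmas instead.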

  21. Distribution of the definitions

  22. Rules • Simple grammar rules • Composed grammar rules • “is_def” grammar rule:
      <rule name="may_be_term">
        <seq>
          <query match="tok[@base='fi' and substring(@ctag,1,5)='vmip3']"/>
          <first>
            <ref name="UndefNominal" />
            <ref name="DefNominal" />
          </first>
        </seq>
      </rule>

  23. Evaluation • Lxtransduce (Tobin 2005) is used to match the grammar in files

  24. Question Answering • According to the answer type, we have the following types of questions (Harabagiu, Moldovan 2007): • Factoid – “Who discovered oxygen?” or “When did Hawaii become a state?” or “What football team won the World Cup in 1992?”. • List – “What countries export oil?” or “What are the regions preferred by the Americans for holidays?”. • Definition – “What is a quasar?” or “What is a question-answering system?”

  25. QA – Example • Question: Cine este Zeus? (Who is Zeus?), analysed as (Cine, zeus, PERSON) • Snippet: 0026#10014#1.0#Zeus#Zeus\zeus\NP este\fi\V3\ cel\cel\TSR\ mai\mai\R\ puternic\puternic\ASN\ dintre\dintre\S\ olimpieni\olimpieni\NPN\ ,\,\COMMA\ socotit\socoti\VP\ drept\drept\S\ stăpânul\stăpân\NSRY\ suprem\suprem\ASN\ al\al\TS\ oamenilor\om\NPOY\ şi\şi\CR\ al\al\TS\ zeilor\zeu\NPOY\ .\.\PERIOD\ (Zeus is the most powerful of the Olympians, regarded as the supreme master of men and gods.) • Our pattern for “is_def”, (\zeus\.*\NP .*\fi\V3\(.*)), matches the snippet
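The matching step can be illustrated with a token-level sketch. This is not the authors' regular expression: it re-implements the idea on parsed `form\lemma\tag` triples and, as a simplifying assumption, requires the verb "fi" (to be) to follow the focus word directly, which is stricter than the `.*` in the pattern above. The function names are hypothetical and the snippet is shortened.

```python
def parse_tagged(snippet):
    """Split whitespace-separated 'form\\lemma\\tag' items into (form, lemma, tag) triples."""
    triples = []
    for item in snippet.split():
        fields = item.split("\\")
        if len(fields) >= 3:
            triples.append((fields[0], fields[1], fields[2]))
    return triples


def match_is_def(snippet, focus_lemma):
    """Return the text after '<focus> fi' if the snippet has the is_def shape, else None.

    The focus must be tagged NP and be directly followed by a token whose
    lemma is 'fi' and whose tag starts with V3 (third person verb).
    """
    toks = parse_tagged(snippet)
    for i in range(len(toks) - 1):
        _, lemma, tag = toks[i]
        _, next_lemma, next_tag = toks[i + 1]
        if (lemma == focus_lemma and tag == "NP"
                and next_lemma == "fi" and next_tag.startswith("V3")):
            return " ".join(form for form, _, _ in toks[i + 2:])
    return None


# Shortened version of the snippet from the slide.
SNIPPET = (r"Zeus\zeus\NP este\fi\V3\ cel\cel\TSR\ mai\mai\R\ "
           r"puternic\puternic\ASN\ dintre\dintre\S\ olimpieni\olimpieni\NPN\ .\.\PERIOD")

if __name__ == "__main__":
    print(match_is_def(SNIPPET, "zeus"))
```

Working on lemmas rather than surface forms is what lets the question word "Zeus" match regardless of inflection in the snippet.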

  26. Keyword extraction • Using a training corpus of documents annotated with keywords • Measuring the distribution of manually marked keywords over documents

  27. Reflection • Did the human annotators annotate keywords or domain terms? • Was the task adequately contextualised?

  28. Keyword extraction • Good keywords have a typical, non random distribution in and across documents • Keywords tend to appear more often at certain places in texts (headings etc.) • Keywords are often highlighted / emphasised by authors • Keywords express / represent the topic(s) of a text

  29. Modelling Keywordiness • Linguistic filtering of KW candidates, based on part of speech and morphology • Distributional measures are used to identify unevenly distributed words • TFIDF • Knowledge of text structure used to identify salient regions (e.g., headings) • Layout features of texts used to identify emphasised words and weight them higher • Finding chains of semantically related words
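As an illustration of a distributional measure, TF-IDF can be computed as below. The sketch assumes one common variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the slides do not say which variant is used, and the function name `tfidf` is chosen for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights.

    docs is a list of token lists. The result is one {term: weight} dict per
    document, with weight = (count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        ["html", "is", "a", "protocol"],
        ["a", "bit", "is", "a", "digit"],
    ]
    for doc_weights in tfidf(docs):
        print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

Terms that occur in every document, such as "is" and "a" here, receive weight zero, which is the sense in which the measure favours unevenly distributed words.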

  30. Challenges • Treating multi word keywords • Assigning a combined weight which takes into account all the aforementioned factors • Multilinguality: finding good settings for all languages, balancing language dependent and language independent features

  31. Treatment of keyphrases • Keyphrases have to be restricted with respect to length (max 3 words) and frequency (min 2 occurrences) • Keyphrase patterns must be restricted with respect to linguistic categories ("style of learning" is acceptable; "of learning styles" is not)
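The length, frequency and category restrictions can be sketched as a filter over n-grams of a POS-tagged text. The tag names and the rule "a keyphrase may not start or end with a preposition, determiner or conjunction" are assumptions made for this example; the slides only state that patterns are restricted by linguistic category.

```python
from collections import Counter

# Hypothetical restriction: phrases may not start or end with these categories.
BAD_EDGE_TAGS = {"ADP", "DET", "CONJ"}


def keyphrase_candidates(tagged, max_len=3, min_freq=2):
    """Return {phrase: count} for n-grams that pass all three restrictions.

    tagged is a list of (word, pos) pairs for one document.
    """
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if gram[0][1] in BAD_EDGE_TAGS or gram[-1][1] in BAD_EDGE_TAGS:
                continue  # e.g. "of learning styles" starts with a preposition
            counts[" ".join(word.lower() for word, _ in gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_freq}


if __name__ == "__main__":
    tagged = [
        ("style", "NOUN"), ("of", "ADP"), ("learning", "NOUN"), ("and", "CONJ"),
        ("style", "NOUN"), ("of", "ADP"), ("learning", "NOUN"),
    ]
    print(keyphrase_candidates(tagged))
```

In this toy input "style of learning" survives because it occurs twice and has nouns at both edges, while "learning and style" is dropped by the frequency threshold.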

  32. KWE Evaluation • Human annotators marked n keywords in document d • First n choices of KWE for document d extracted • Measure overlap between both sets • measure also partial matches
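The evaluation procedure can be sketched as follows. The definition of a partial match used here (a system keyword that is not an exact match but shares at least one word with some human keyword) is an assumption of this example, as is the function name `overlap_scores`.

```python
def overlap_scores(gold, predicted):
    """Compare human keywords with the top-n system keywords, n = number of human keywords.

    gold is the list of keywords marked by annotators; predicted is the
    system's ranked list. Returns the share of exact and of partial matches.
    """
    gold_set = {g.lower() for g in gold}
    n = len(gold_set)
    if n == 0:
        return {"exact": 0.0, "partial": 0.0}
    top_n = [p.lower() for p in predicted[:n]]
    exact = sum(1 for p in top_n if p in gold_set)
    # Partial match: not exact, but shares at least one word with a gold keyword.
    partial = sum(
        1
        for p in top_n
        if p not in gold_set
        and any(set(p.split()) & set(g.split()) for g in gold_set)
    )
    return {"exact": exact / n, "partial": partial / n}


if __name__ == "__main__":
    gold = ["learning style", "keyword", "corpus"]
    predicted = ["keyword", "style", "grammar", "corpus"]
    print(overlap_scores(gold, predicted))
```

Only the first n system choices are scored, so "corpus" in fourth position does not count even though the annotators marked it.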

  33. NLP has lots to offer • Resources: • Lexical semantic resources, e.g. WordNet • Web 2.0 resources, e.g. Wikipedia, Wiktionary • Tools: • Tokenising and sentence splitting • Morphological analysis • Part of speech tagging • Parsing and chunking • Word sense disambiguation • Summarisation • Keyword extraction

  34. Tasks and applications • To assist instructors • Automatic generation of questions and exercises • Assessment of learner-generated discourse • To assist learners • Reading and writing assistance • Electronic career guidance • Educational question answering • For all users in the Web 2.0 • NLP for wikis • Quality assessment of user generated contents

  35. A lot more research is done on: • Computer-Assisted Language Learning • Intelligent Tutoring Systems • Information search for eLearning • Educational blogging • Annotations and social tagging • Analysing collaborative learning processes automatically • Learners' corpora and resources • eLearning standards, e.g. SCORM

  36. Requirements (Team: max. 2 persons, Deadline: 12 April) • 1a) Extract definitions from a given Wikipedia page • 1b) Generate questions such as "what is …" or "what is the meaning of …" from the definitions extracted in 1a • 2) Automatic generation of "fill the blanks" questions • Dacă nu ai nimic planificat diseară, hai __ teatru. (If you have nothing planned for tonight, let's go __ the theatre.) • (a) la (b) de (c) pentru (d) null • Input: a sentence and the key • Dacă nu ai nimic planificat diseară, hai la teatru. • Key: la • Output: generate three distractors using different approaches: • baseline: word frequencies • collocations • a "creative" method, devised by the students

  37. Further reading • Jill Burstein: Opportunities for Natural Language Processing Research in Education. Proceedings of CICLing '09, the 10th International Conference on Computational Linguistics and Intelligent Text Processing, Springer-Verlag, Berlin, Heidelberg, 2009. • Paola Monachesi, Eline Westerhout: What can NLP techniques do for eLearning? Presented at INFOS 2008, 27-29 March. • Adrian Iftene, Diana Trandabăţ, Ionuţ Pistol: Grammar-based Automatic Extraction of Definitions and Applications for Romanian. RANLP 2007 workshop: Natural Language Processing and Knowledge Representation for eLearning Environments.

  38. Links • Educational Applications of NLP: http://www.ets.org/research/topics/as_nlp/educational_applications

  39. Thanks!

  40. Types of plagiarism • (1) Plagiarism of authorship: the direct case of putting your own name to someone else’s work • (2) Word-for-word plagiarism: copying of phrases or passages from published text without quotation or acknowledgement • (3) Paraphrasing plagiarism: words or syntax are changed (rewritten), but the source text can still be recognized • (4) Plagiarism of the form of a source: the structure of an argument in a source is copied (verbatim or rewritten) • (5) Plagiarism of ideas: the reuse of an original thought from a source text without dependence on the words or form of the source • (6) Plagiarism of secondary sources: original sources are referenced or quoted, but obtained from a secondary source text without looking up the original