A Framework for Automatic Generation of Grammar and Vocabulary Questions Ayako Hoshino (星野綾子) Lunan Huang Hiroshi Nakagawa (中川裕志) University of Tokyo (東京大学) WorldCALL 2008
Outline • Introduction • Related Work • The Data Structure • Preprocess • Extension to the Japanese Language • Summary
Introduction (1/5) • With the Internet, the latest information spreads throughout the world with almost no time lag. • One notable phenomenon in the "flat" world is the outsourcing of highly specialized work across the globe. • This raises the need for ESP (English for Special Purposes) education, through which professionals in the non-English-speaking world master English in their own specialties.
Introduction (2/5) • While there are abundant resources for learning a language online, there are very few materials and few teachers that can help with language learning in specialized areas. • For example, the latest news on the Internet would be the perfect reading material, as there are online news websites specialized in many areas. • Also, learning the language while learning about the latest topics would be an exciting experience, which helps keep the learner motivated.
Introduction (3/5) • An automatic question generator provides independent learners with inexhaustible materials for their practice. • It also makes it possible for a learner to practice with a variety of materials, from the latest news to a document of their own interest. • We present two applications of AQG (automatic question generation): Sakumon, a question-making assistance system, and SakumonChallenge, a CAT (Computer Adaptive Testing) system that administers automatically generated questions.
Introduction (5/5) • These questions share the same format: multiple-choice fill-in-the-blank. • We believe that this single question format can test different kinds of knowledge. • Question A tests vocabulary, B tests grammar, and C is a symmetric combination of the two.
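The multiple-choice fill-in-the-blank format described above can be sketched in a few lines. This is a minimal illustration, not the Sakumon implementation: the `BlankQuestion` class, the `make_question` function, the example sentence, and all distractors are made up for the example. It shows how the same format yields a vocabulary question or a grammar question depending only on which distractors are supplied.

```python
import random
from dataclasses import dataclass


@dataclass
class BlankQuestion:
    stem: str           # sentence with the target replaced by a blank
    choices: list[str]  # correct answer and distractors, shuffled
    answer_index: int   # position of the correct answer in `choices`


def make_question(sentence, target, distractors, seed=0):
    """Blank out the first occurrence of `target` and shuffle the choices."""
    if target not in sentence:
        raise ValueError(f"{target!r} does not occur in the sentence")
    stem = sentence.replace(target, "_____", 1)
    choices = [target, *distractors]
    random.Random(seed).shuffle(choices)  # seeded so the order is reproducible
    return BlankQuestion(stem, choices, choices.index(target))


sentence = "The company announced a new product."
# Hypothetical distractors: other words test vocabulary, other forms test grammar.
vocabulary_q = make_question(sentence, "announced", ["cancelled", "borrowed", "repaired"])
grammar_q = make_question(sentence, "announced", ["announce", "announcing", "to announce"])
print(vocabulary_q.stem)
print(grammar_q.choices)
```

The sketch matches the target as a plain substring, so a real system would need to select the target by token position instead.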
Related Work (1/3) • AQG has gained attention only recently as an application of NLP (natural language processing) and there have been only a few studies reported so far.
Related Work (2/3) • We take advantage of the output of a syntactic parser, a technology that analyzes a sentence into a nested phrase structure according to the language's grammar.
Related Work (3/3) • The result of sentence parsing is also called a parse tree, because of its resemblance to an upside-down tree with branches. • The lowest level, next to the words, shows POS (Part-Of-Speech) tags, one of which is assigned to each word. • In addition to these POS tags, a parse result tells us which adjacent words are grouped to make a phrase and which noun phrase goes with which verb.
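To make the parse tree idea concrete, here is a small sketch that reads a bracketed parse and recovers the two kinds of information mentioned above: the POS tag of each word and the words grouped into each phrase. The bracketed string is a hand-written example using Penn Treebank style labels, and the helper names `pos_tags` and `phrases` are assumptions for illustration; the slides do not specify the parser's exact output format.

```python
import re

# Hand-written example parse (Penn Treebank style labels assumed).
PARSE = ("(S (NP (DT The) (NN company)) "
         "(VP (VBD announced) (NP (DT a) (JJ new) (NN product))) (. .))")


def pos_tags(bracketed):
    """Return (word, tag) pairs from the leaves, which look like "(TAG word)"."""
    leaves = re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", bracketed)
    return [(word, tag) for tag, word in leaves]


def phrases(bracketed, label):
    """Return the text of every constituent carrying the given label."""
    tokens = iter(re.findall(r"\(|\)|[^\s()]+", bracketed))
    stack, found = [], []
    for tok in tokens:
        if tok == "(":
            stack.append((next(tokens), []))  # the label follows the bracket
        elif tok == ")":
            node_label, words = stack.pop()
            if node_label == label:
                found.append(" ".join(words))
            if stack:
                stack[-1][1].extend(words)    # pass words up to the parent
        else:
            stack[-1][1].append(tok)          # a word inside a leaf
    return found


print(pos_tags(PARSE)[:3])   # [('The', 'DT'), ('company', 'NN'), ('announced', 'VBD')]
print(phrases(PARSE, "NP"))  # ['The company', 'a new product']
```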
The Data Structure (1/5) • In place of the learning-object authoring tool found in general frameworks, our system has an authoring assistance system that allows the user to make questions on an online news article just by clicking on a word in the text and selecting from the suggested alternatives. • The data structure is designed to contain one article (or any passage that serves the same purpose) on which questions are generated.
The Data Structure (3/5) • At the beginning of the XML document, the basic information on the article, such as title, news source (which website it is from), and date of publication, is recorded. • Two main parts come after the heading: 1) article and 2) grammar distractor candidates.
The Data Structure (4/5) • The article part contains parsed sentences with up to seven candidate vocabulary alternatives attached to each word. • To each candidate vocabulary alternative, all inflectional forms are attached. • For example, for a verb, a candidate alternative contains the infinitive form, past, past participle, and gerund form.
The Data Structure (5/5) • The second part is called cphrases, which contains the grammar distractor candidates. • In our methodology, the grammar distractor candidates are generated by converting a phrase in the parse tree. • Each converted phrase refers to its original phrase through the ID given to that phrase.
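The layout described on these slides could look roughly like the following. This is only a sketch under stated assumptions: apart from `cphrases`, every element and attribute name (`document`, `token`, `candidate`, `ref`, and so on) is hypothetical, and the sample content is made up. It shows the heading, the article part with inflected vocabulary candidates per word, and converted phrases pointing back to their originals by ID.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the name "cphrases" comes from the slides.
SAMPLE = """
<document>
  <title>Example headline</title>
  <source>news.example.com</source>
  <date>2008-01-01</date>
  <article>
    <sentence id="s1">
      <phrase id="p1" label="VP">
        <token text="announced" pos="VBD" lemma="announce">
          <candidate base="cancel" past="canceled"
                     pastParticiple="canceled" gerund="canceling"/>
          <candidate base="postpone" past="postponed"
                     pastParticiple="postponed" gerund="postponing"/>
        </token>
        <token text="a" pos="DT" lemma="a"/>
        <token text="plan" pos="NN" lemma="plan"/>
      </phrase>
    </sentence>
  </article>
  <cphrases>
    <cphrase ref="p1">announce a plan</cphrase>
    <cphrase ref="p1">announcing a plan</cphrase>
  </cphrases>
</document>
"""


def vocabulary_alternatives(root, word, form):
    """Candidate alternatives for `word`, in the requested inflectional form."""
    return [cand.get(form)
            for tok in root.iter("token") if tok.get("text") == word
            for cand in tok.iter("candidate")]


def grammar_distractors(root):
    """Map each original phrase to the converted phrases that reference its ID."""
    originals = {p.get("id"): " ".join(t.get("text") for t in p.iter("token"))
                 for p in root.iter("phrase")}
    result = {}
    for cphrase in root.iter("cphrase"):
        result.setdefault(originals[cphrase.get("ref")], []).append(cphrase.text)
    return result


root = ET.fromstring(SAMPLE)
print(vocabulary_alternatives(root, "announced", "past"))  # ['canceled', 'postponed']
print(grammar_distractors(root))
```

Storing every inflectional form with each candidate means a question can be assembled by lookup alone, without running a morphological generator at question time.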
Preprocess (1/3) • The data in the framework we have defined are generated automatically. • During preprocessing, the data pass through a series of steps arranged as a pipeline. • HTML Parser: The raw text is extracted from downloaded HTML files. We retain paragraph tags (<p>). • Sentence Splitter: The sentence boundaries are determined.
Preprocess (2/3) • Sentence Parser: A sentence parser tokenizes sentences and analyzes them into a bracketed structure. • POS Tagger: The TreeTagger lemmatizes each token and annotates it with a POS tag. Word-frequency look-up and annotation are also done here. • Distractor Selector: By consulting WordNet, the system appends a list of candidate vocabulary alternatives to each document.
Preprocess (3/3) • Morphological Generator: A morphological generator is used to generate all possible inflectional forms for each word and each vocabulary alternative. • GrammarTarget Annotator: This finds the phrases matching the patterns and appends the converted phrases (grammar alternatives) to the document. • Distractor Indexer: The system indexes vocabulary alternatives for each token (to speed up the response time).
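The pipeline arrangement of these steps can be sketched as a chain of functions, each consuming the output of the previous one. The example below is a simplified illustration covering only the first two steps, using Python's standard `html.parser` and a naive regular-expression splitter; the function names are assumptions, and the later steps (sentence parser, TreeTagger, WordNet look-up, and so on) would be appended to `STEPS` as further functions in the same way.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Step 1 (HTML Parser): raw text per paragraph, whitespace normalized."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.paragraphs if p.strip()]


def split_sentences(paragraphs):
    """Step 2 (Sentence Splitter): naive split after ., ! or ? plus whitespace."""
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]


STEPS = [extract_paragraphs, split_sentences]  # later steps would follow here


def run_pipeline(data, steps=STEPS):
    for step in steps:
        data = step(data)
    return data


page = ("<html><body><h1>Title</h1>"
        "<p>Stocks rose on <b>Monday</b>. Analysts were surprised.</p>"
        "<p>Is it over? Not yet!</p></body></html>")
print(run_pipeline(page))
```

The naive splitter breaks on abbreviations such as "Mr. Smith", which is one reason a dedicated sentence-splitting step is needed for English.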
Extension to the Japanese Language (1/4) • Before porting the framework, one should pay attention to the differences between Japanese and English. • First, in Japanese sentences, all morphemes are joined without spaces. • Second, Japanese emphasizes dependency structure, while English emphasizes phrase structure. • Therefore, a Japanese test ought to focus on katsuyo (inflection) and ko-ou (adverb-predicate agreement) rather than on grammatical structures, as is done for English.
Extension to the Japanese Language (2/4) • Generation of a vocabulary question inherits the method based on frequency, which is language-dependent. • As mentioned above, there are eight steps to getting the final XML file. • First, the program should be adjusted for downloading Japanese news from designated websites.
Extension to the Japanese Language (3/4) • Since Japanese punctuation marks are simpler than those of English, a complex sentence-splitting algorithm is not necessary. • For the subsequent steps, we need to employ a Japanese processing tool, CaboCha, which recognizes the inflectional forms of verbs, assigns POS tags, and analyzes sentences into dependency structures. • The main manual work is to program the grammar target rules.
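The point about sentence splitting can be illustrated with a short sketch. It assumes that sentences end with one of the full-width marks 。, ！ or ？, which is a simplification chosen for this example rather than the rule used by the authors; the function name and sample text are also made up.

```python
import re

# Assumed sentence-final marks: full-width period, exclamation, and question mark.
SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def split_japanese_sentences(text):
    """Split text after each run of sentence-final marks, keeping the marks."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


text = "今日は晴れです。明日は雨でしょうか？傘を持って行きます。"
for sentence in split_japanese_sentences(text):
    print(sentence)
```

Because the full-width period is not also used for abbreviations or decimals the way the English period is, a single pattern goes a long way; quoted speech such as 「…。」 would still need extra handling.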
Extension to the Japanese Language (4/4) • Once the distractors are obtained and put into XML files, the Sakumon framework does the rest of the work. • The main work needed is to tag grammar targets. • In general, one should analyze the target language using the NLP tools available for that language.
Summary (1/2) • We have described a framework for automatic generation of grammar and vocabulary questions. • Currently, we have two applications based on this framework: Sakumon, a question-authoring assistance system, and SakumonChallenge, a computer adaptive testing system with automatically generated questions. • We have defined the data structure and the method for automatically generating the data.
Summary (2/2) • We have discussed possible extensions to this framework, using an example of extension to the Japanese language. • Lastly, we would like to remind the readers that the framework we have shown is only a working example. • Currently, we are working on improvements for future versions.