Automated Question Answering
Motivation: support for students • Demand is for 365 x 24 support • Students set aside time to complete task • If problem encountered immediate help required • Majority of responses direct students to teaching materials; so not a case of “not there” • Poor search forums • Search per forum - not course • Free-text search options fixed by RDBMS • No explicit operators (AND, OR, NEAR)
Research questions • Given the current level of development of natural language processing (NLP) tools, is it possible to: • Classify messages as question/non-question • Identify the topic of the question • Direct users to specific course resources
Natural Language Processing tools • Tokenisation (words, numbers, punctuation, whitespace) • Sentence detection • Part of speech tagging (verbs, nouns, pronouns, etc.) • Named entity recognition (names, locations, events, organisations) • Chunking/Parsing (noun/verb phrases and relationships) • Statistical modelling tools • Dictionaries, word-lists, WordNet, VerbNet • Corpora tools (Lucene, Lemur)
Question answering solutions • Open domain • No restrictions on question topic • Typically answers from web resources • Extensive literature • Closed domain • Restricted question topics • Typically answers from small corpus • Company documents • Structured data
Open domain QA research • Well established over two decades • TREC (Text REtrieval Conference) • funded by NIST/DARPA since 1992 • QA track 1999 – 2007, directed at ‘Factoids’ • CLEF (Cross Language Evaluation Forum) • 2001 – current • Information Retrieval, language resources • NTCIR (NII Test Collection for IR Systems) • 1997 – current • IR, question answering, summarization, extraction
TREC Factoids • Given a fact-based question: • How many calories in a Big Mac? • Who was the 16th President of the United States? • Where is the Taj Mahal? • Return an exact answer in 50/250 bytes • 540 calories • Abraham Lincoln • Agra, India
Minimal factoid process • Question analysis • Normalisation (verbs, auxiliaries, modifiers) • Identify entities (people, locations, events) • Pattern detection (who was X?, how high is Y?) • Query creation, expansion, and execution • Ordered terms, combined terms, weighted terms • Answer analysis • Match answer type to question type
OpenEphyra: open source QA Source: http://www.cs.cmu.edu/~nico/ephyra/doc/images/overall_architecture.jpg
OpenEphyra: question analysis • Question: ‘who was the fourth president of the USA’ • Normalization: ‘who be fourth president of USA’ • Answer type: NEproperName->NEperson • Interpretation: property: NAME, target: fourth president, context: USA
OpenEphyra: query expansion • "fourth president USA" • (fourth OR 4th OR quaternary) president (USA OR US OR U.S.A. OR U.S. OR "United States" OR "United States of America" OR "the States" OR America) • "fourth president" "USA" fourth president USA • "was fourth president of USA" • "fourth president of USA was"
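The expansion on this slide can be sketched as replacing each keyword with an OR-group of alternatives and adding phrase queries that mirror a likely answer sentence. This is only an illustration: the synonym table is copied by hand from the slide, the function names are hypothetical, and nothing here is taken from OpenEphyra's own code.

```python
# Hypothetical synonym table, copied from the expansions shown on the slide.
SYNONYMS = {
    "fourth": ["fourth", "4th", "quaternary"],
    "USA": ["USA", "US", "U.S.A.", "U.S.", "United States",
            "United States of America", "the States", "America"],
}


def quote(term):
    """Quote multi-word alternatives so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_keywords(keywords):
    """Replace each keyword with an OR-group of its known alternatives."""
    parts = []
    for word in keywords:
        alternatives = SYNONYMS.get(word, [word])
        if len(alternatives) == 1:
            parts.append(quote(alternatives[0]))
        else:
            parts.append("(" + " OR ".join(quote(a) for a in alternatives) + ")")
    return " ".join(parts)


def reformulations(target, context):
    """Phrase queries that mirror how an answer sentence might be worded."""
    return [f'"was {target} of {context}"', f'"{target} of {context} was"']


query = expand_keywords(["fourth", "president", "USA"])
print(query)
print(reformulations("fourth president", "USA"))
```

A real system would presumably draw alternatives from a resource such as WordNet rather than a hand-written table.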
OpenEphyra: result answer: James Madison score: 0.7561732 docid: http://www.squidoo.com/james-madison-presidentusa Document content: <meta property="og:title" content="James Madison - 4th President of USA"/> <h1>James Madison - 4th President of USA</h1> <div class="module_intro">James Madison (March 16, 1751 - June 28, 1836) was fourth President of the United States (1809-1817), and one of the Founding Fathers of the United States...
Shallow answer selection • Answer based on reformulation of question • Who was the fourth president of the <location>United States</location>? • <person>James Madison</person> was the fourth president of the <location>United States</location> Students don’t ask questions and we don’t provide answers!
Importance of named entities • Question processed for NEs • Search results tagged with NEs • Extracted NEs link question and answer [Diagram: search engine, answer matching]
Task list: the real work • Create database of forum messages • Adapt open source NLP tools • Tokenisation, sentence detection, Parts Of Speech, parsing • Establish question patterns • Create language analysis tools • Word frequency • Named-entities: define, build, and train models • Prepare corpus • Format and tag documents (doc, html, pdf) • Build Indri catalogue and search interface Iterative process: build, test, refine
NLP tools • Predominantly Java • Stanford, OpenNLP, LingPipe • GATE: complete analysis + processing system • IKVM permits use with .NET framework • Some C++, C# • WordNet, Lemur/Indri, NooJ, SharpNLP • Python NLTK • Complete NLP toolset and corpus • Lisp, Prolog
Message database • MySQL database for FirstClass messages • Extract: • Forum, Subject, Date, Author • Body • Use subject to classify as Original or Reply No clean-up or filtering of message content undertaken at this stage
Raw forum message (Sample 1)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Daniel Hughes 5">
<field id="3" index="0" type="number">-959014497</field>
<subject index="0" >Help Please!!!? Urgent</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I am trying to open an existing project but can't do it. It's driving me mad. I know the project folders are located in the workspaceblock4 folder. I have deleted all the open projects in the project explorer window (without deleting content). BUT how on earth do I know proceed to reload some of the projects without starting from scratch? When I select open file ... it doesn't let me open any projects files - only the individual files in the project folder. In other words I cannot get any project files to appear in the project explorer window. Please can anyone help me as I have booked a lot of time off work to concentrate on the project, but I am a dead end.
</body>
</firstclass>
Raw forum message (Sample 2)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Simon Shadbolt">
<field id="3" index="0" type="number">-962619805</field>
<subject index="0" >Block 4 Practical booklet 6 activity 4- Unable to get a fault!</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I have followed the set up and altered the fault to "none" and simulation to normal, but I do not get any faults at all or a listing that resembles the list on page 12, particularly line 12. I have attached my bpel file and my screenshot, any help appreciated. Simon
Process bpelEcho3pScope: Instance 1 created.
Process bpelEcho3pScope: Executing [/process]
Process Suspended [/process]
Receive ClientRequestMessage: Executing [/process/flow/receive[@name='ClientRequestMessage']]
.
Scope : Completed normally [/process/flow/scope]
Reply ClientResponseMessage: Executing [/process/flow/reply[@name='ClientResponseMessage']]
Reply ClientResponseMessage: Completed normally [/process/flow/reply[@name='ClientResponseMessage']]
Process bpelEcho3pScope: Completed normally [/process]
</body>
</firstclass>
Eclipse console listing or XML
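The fields listed for the message database (forum, subject, author, body) could be pulled out of an export like the samples above roughly as follows. This is a sketch under stated assumptions: the author is taken from the `objname` attribute, the forum from `tonames`, and a reply is recognised by a subject starting with "Re:". None of these rules is confirmed by the slides, and `parse_message` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of Sample 1, with a space restored in "<fcobject objtype=...>".
SAMPLE = """<?xml version="1.0"?>
<firstclass>
  <FCFORMSHEADER>
    <fcobject objtype="oConfItem" formid="141" objname="Daniel Hughes 5">
      <field id="3" index="0" type="number">-959014497</field>
      <subject index="0">Help Please!!!? Urgent</subject>
      <tonames index="0">T320 09B Eclipse Support</tonames>
    </fcobject>
  </FCFORMSHEADER>
  <body>I am trying to open an existing project but can't do it.</body>
</firstclass>"""


def parse_message(xml_text):
    """Extract the database fields from one exported message."""
    root = ET.fromstring(xml_text)
    header = root.find("FCFORMSHEADER/fcobject")
    subject = (header.findtext("subject") or "").strip()
    return {
        "forum": (header.findtext("tonames") or "").strip(),
        "subject": subject,
        # Assumption: objname holds the author's display name.
        "author": header.get("objname", ""),
        "body": (root.findtext("body") or "").strip(),
        # Assumption: replies carry a "Re:" subject prefix.
        "kind": "Reply" if subject.lower().startswith("re:") else "Original",
    }


message = parse_message(SAMPLE)
print(message["kind"], "|", message["subject"])
```

Messages whose body contains unescaped XML or BPEL fragments may not parse as well-formed XML, so a production import would need a more tolerant extraction step.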
T320 09B database properties • Total messages: 4246 • Non-replies: 1051 • Manually tagged questions: 777 • Average length (lines) 7.9 • Containing XML: 17 • Containing Eclipse content: 37
Creating question patterns • Extract text from forum messages (non-replies) • Create n-grams (‘n’ adjacent words) • Perform frequency analysis of n-grams • Manually review n-grams to create question patterns
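The n-gram counting step above can be sketched in a few lines. The two messages are invented for illustration, the word-splitting rule is a deliberately simple assumption, and the function names are hypothetical; the manual review of the counts is the part that produces the actual patterns.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return every run of n adjacent words in the text, lower-cased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(messages, n):
    """Count how often each n-gram occurs across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(ngrams(message, n))
    return counts


# Invented example messages, not taken from the forum data.
messages = [
    "Can anyone offer some help with this error?",
    "Could someone tell me what I am doing wrong? Can anyone offer some help?",
]

counts = ngram_frequencies(messages, 5)
for phrase, frequency in counts.most_common(3):
    print(frequency, phrase)
```

In this simple version n-grams run across sentence boundaries; splitting into sentences first would avoid that.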
5-word frequency analysis Top 20 results
Generalisation of patterns using POS
Question part              POS tag
any|some                   DT
advice|comment|guidance    NN
appreciated|welcomed       VB(N|D)
.                          ./.
Can/MD anyone/NN offer/VB some/DT help/NN ?/.
Can/MD someone/NN offer/VB some/DT help/NN ?/.
Can/MD anybody/RB give/VB some/DT guidance/NN ?/.
Could/MD somebody/RB give/VB some/DT direction/NN ?/.
POS pattern matching failed due to errors in assigning tags
Final question patterns: RegExs * Pattern derived from Eclipse error message 169 patterns using ‘explicit capture’
Poor message style Incorrect POS tagging due to spelling errors when/WRB I/PRP tried/VBD to/TO generate/VB the/DT sample/NN ,/, it/PRP said/VBD the/DT data/NNS is/VBZ available/JJ ./.
XML within messages Detected as single sentence
Eclipse console listing within message Line breaks not recognised as end of sentence
Open-source NLP problems • Sentence detection failures: • Bad style (capitalisation, punctuation) • Ellipsis (i tried... it failed... error message...) • XML, BPEL segments concatenated to single sentence • Tokenisation failures: • Multiple punctuation ???, !!! (student emphasis) • Abbreviations (im, cant, doesnt, etc.) • POS errors • Spelling, grammar
Purpose built tools • Tokeniser • Re-coded for typical forum content/style • Multiple punctuation • Abbreviations • Common contractions • Sentence detector • New detector based on token sequences • Pre-filter messages • Remove XML, console listing, error messages
Message pre-filters • Short-forms • i’m, im, i m → i am • can’t, cant, can t → can not • Line numbers • Repeated punctuation (!!!, ???, ...) • Smilies • Salutations (Hi all, Hiya, etc.) • Names, signature, course codes
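A few of the pre-filters listed above can be approximated with regular expressions. This is a minimal sketch, not the project's tool: the expressions and the `prefilter` function are assumptions written for this example, and line numbers, names, signatures and course codes are not handled.

```python
import re

# Short-form expansions from the slide (both apostrophe styles accepted).
SHORT_FORMS = [
    (re.compile(r"\b(?:i[’']m|im|i m)\b", re.IGNORECASE), "i am"),
    (re.compile(r"\b(?:can[’']t|cant|can t)\b", re.IGNORECASE), "can not"),
]
# Collapse runs such as "!!!", "???" or "..." to a single character.
REPEATED_PUNCT = re.compile(r"([!?.])\1+")
# Opening salutations; the list of greetings is an assumption.
SALUTATION = re.compile(r"^\s*(?:hi all|hiya|hi|hello)\b[\s,!.]*", re.IGNORECASE)
# Simple text smilies such as :) ;-) :( that stand apart from words.
SMILEY = re.compile(r"(?<!\w)[:;]-?[)(DPp](?!\w)")


def prefilter(text):
    """Apply the filters in a fixed order and tidy the whitespace."""
    text = SALUTATION.sub("", text)
    text = SMILEY.sub("", text)
    for pattern, replacement in SHORT_FORMS:
        text = pattern.sub(replacement, text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(prefilter("Hi all, im stuck and cant open the project!!! :("))
```

The smiley filter runs before punctuation is collapsed so that the characters of a smiley are not mistaken for sentence punctuation.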
Filtered message Raw message containing Eclipse console listing Filtered message ready to process
Message-set properties • Number of messages: 1051 (100%) • Number of questions (M): 777 (73.9% of messages; 100% baseline for the question percentages below) • Number of questions (A): 756 (97.3%) • False Positives (A not M): 58 (7.4%) • False Negatives (M not A): 79 (10.2%) • Approx 90% success rate • M = manually annotated question, A = automatically annotated question
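The counts on this slide can be cross-checked with a short calculation. Precision, recall and F1 are standard measures that the slide itself does not name; the sketch assumes "M not A" means manually tagged questions that the patterns missed and "A not M" means automatic hits that were not manually tagged.

```python
# Counts taken from the slide.
manual_questions = 777   # M: manually annotated questions
false_positives = 58     # A not M
false_negatives = 79     # M not A

# Questions found by both the manual and the automatic annotation.
true_positives = manual_questions - false_negatives
# Total flagged automatically; this should reproduce the slide's 756.
automatic_questions = true_positives + false_positives

precision = true_positives / automatic_questions
recall = true_positives / manual_questions
f1 = 2 * precision * recall / (precision + recall)

print(f"automatic={automatic_questions}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under these assumptions the recall comes out just under 0.90, which is consistent with the "approx 90% success rate" quoted on the slide.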
Message-set properties – cont. • Average # pattern matches: 2.7606 • Min # pattern matches: 1 • Max # pattern matches: 12 • Average # of lines (ASCII linefeed) 7.9 • Min # Lines in a message 1 • Max # Lines in a message 68 • Average # of sentences 5.0 • Min # Sentences in a message 1 • Max # Sentences in a message 89 • Messages containing XML 17 • Messages containing BPEL 37
Distribution of pattern match count [Chart: number of messages against number of pattern matches]
Messages matching question pattern [Chart: number of messages against pattern ID]
Common question patterns (10) • any • (advice|clarification|clue|comment| • further thought|guidance| • help|hint|idea|opinion| • pointer|reason|suggestion|taker)(s)? • .* • appreciated|welcome|welcomed Terms added over time to improve detection of questions 216 matches
Common question patterns (50) • get|getting|gives|got|receive • .* • error(s)? 102 matches
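Patterns 10 and 50 above translate fairly directly into regular expressions. The sketch below is an approximation: the slides do not show how the pieces are joined or which regex options were used, so the word boundaries, whitespace handling and case-insensitivity are assumptions, and `matching_pattern_ids` and `is_question` are hypothetical helpers.

```python
import re

# Pattern IDs follow the slide numbering; the regex details are assumptions.
QUESTION_PATTERNS = {
    10: re.compile(
        r"\bany\s+(?:advice|clarification|clue|comment|further thought|guidance|"
        r"help|hint|idea|opinion|pointer|reason|suggestion|taker)s?\b"
        r".*\b(?:appreciated|welcome|welcomed)\b",
        re.IGNORECASE,
    ),
    50: re.compile(
        r"\b(?:get|getting|gives|got|receive)\b.*\berrors?\b",
        re.IGNORECASE,
    ),
}


def matching_pattern_ids(message):
    """Return the IDs of every pattern that matches the message."""
    return [pid for pid, pattern in QUESTION_PATTERNS.items()
            if pattern.search(message)]


def is_question(message):
    """Treat a message as a question if at least one pattern matches."""
    return bool(matching_pattern_ids(message))


print(matching_pattern_ids("I have attached my bpel file, any help appreciated."))
print(matching_pattern_ids("I keep getting an error when I deploy."))
print(is_question("Thanks, that fixed it."))
```

Because `.` does not match a newline by default, each pattern here only matches within a single line of a message; whether the original patterns were applied per line, per sentence or per message is not stated.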
Discrimination vs Classification [Chart: number of messages against pattern ID] • Low discrimination: increases successful classification at the risk of false positives • High discrimination: reduces both successful classification and the risk of false positives
Does process transfer? • Tested against TT380 forums 04J – 07J • Preliminary results look promising • Need to manually tag >4000 messages • Review message pre-filters • Need access to Humanities course material