
NL Question-Answering using Naïve Bayes and LSA


Presentation Transcript


  1. NL Question-Answering using Naïve Bayes and LSA By Kaushik Krishnasamy

  2. Agenda • Problem • Key • Methodology • Naïve Bayes • LSA • Results & Comparison • Conclusion

  3. Problem: QA • QA/discourse is complex. • Overhead in knowledge extraction and decoding (grammar-based systems). • Restrictions due to inherent language and cultural constructs. • Context of word usage.

  4. Key: IR • Information is hidden in the relevant documents. • The frequency of a word indicates its importance. • The neighborhood of a word indicates its context.

  5. Methodology • The question posed is treated as a new document. • How close is this document to each document in the knowledge base (KB)? • Naïve Bayes: probabilistic approach (C#). • LSA: dimensionality reduction (MATLAB). • The closest document has the answer.

  6. Naïve Bayes • v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, a3, …, an) • Since all documents are equally likely targets, P(v1) = P(v2) = … = P(vj) = constant, so v_NB = argmax_{vj ∈ V} Π_i P(ai | vj) • Words are assumed to be conditionally independent given the document.

  7. Naïve Bayes - Algorithm • Pre-process all documents. • Store the number of unique words in each document (Ni). • Concatenate all documents and keep the words that occur at least twice as the unique words; the count of these unique words is the ‘Vocabulary’. • For each of these unique words and each document, estimate P(word | document) with Laplace smoothing: (frequency of the word in doc ‘i’ + 1) / (Ni + Vocabulary). • Store (word, doc, probability/frequency) triples to a file.
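The estimation step above can be sketched in Python (the original was written in C#; the function name and data layout here are illustrative, and Ni follows the slide's convention of counting unique words per document):

```python
from collections import Counter

def word_probabilities(docs):
    """Estimate P(word | doc) with Laplace smoothing, per the steps above.

    docs: list of token lists (pre-processed documents).
    Returns a dict mapping (word, doc_index) -> probability.
    """
    # Vocabulary: words occurring at least twice across all documents.
    all_counts = Counter(w for doc in docs for w in doc)
    vocab = {w for w, c in all_counts.items() if c >= 2}

    probs = {}
    for i, doc in enumerate(docs):
        counts = Counter(doc)
        n_i = len(set(doc))  # Ni: unique words in document i
        for w in vocab:
            # (freq of word in doc i + 1) / (Ni + Vocabulary)
            probs[(w, i)] = (counts[w] + 1) / (n_i + len(vocab))
    return probs
```

In practice the resulting triples would be written to a file, as the slide describes; here they are simply returned.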

  8. Contd… • Obtain an input query from the user. • Retrieve the individual words after pre-processing. • Penalize words that are not among the unique vocabulary words. • For each document, estimate the product of the probabilities of all retrieved words given that document, using the stored file: P(input|vi) = P(w1|vi) × P(w2|vi) × P(w3|vi) × … × P(wn|vi) • The document with the maximum P(input|vi) is the document containing the answer. • WordNet is used to resolve unknown input words.
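The scoring step can be sketched as follows (a Python illustration, not the original C# code; the slides say only that out-of-vocabulary words are "penalized", so the penalty probability here is an assumption, and log-probabilities are summed rather than multiplying raw probabilities, to avoid numeric underflow):

```python
import math

def best_document(query_words, probs, n_docs, penalty=1e-6):
    """Pick argmax_i P(query | doc_i) = prod_j P(w_j | doc_i).

    probs: dict (word, doc_index) -> P(word | doc), e.g. from the training step.
    penalty: assumed small probability assigned to unknown words.
    """
    known = {w for (w, _) in probs}
    scores = []
    for i in range(n_docs):
        score = 0.0
        for w in query_words:
            # Sum log-probabilities; unknown words incur the penalty.
            score += math.log(probs[(w, i)] if w in known else penalty)
        scores.append(score)
    return max(range(n_docs), key=lambda i: scores[i])
```

The WordNet lookup for unknown words mentioned on the slide is not reproduced here.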

  9. LSA: Latent Semantic Analysis • A method to extract and represent the contextual-usage meaning of words. • Sets of words are points in a very high-dimensional “semantic space”. • Uses SVD to reduce dimensionality. • Correlation analysis is applied in the reduced space to arrive at results.

  10. LSA: Algorithm • Obtain (word, doc, frequency) triples. • Basic Matrix: form the (word × doc) matrix with the frequency entries. • Pre-process the input query. • Query Matrix: form the (word × (doc + 1)) matrix with the query as the last column, using the individual word frequencies. • Perform the SVD: X = USV^T. • Keep the two largest singular values and reconstruct the matrix.

  11. Contd… • Find the document column that is most correlated with the query column. • That document contains the answer to the query.
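The LSA pipeline on slides 10 and 11 can be sketched with NumPy (the original used MATLAB; the function name is illustrative, and the rank parameter k=2 comes from the "two largest singular values" step):

```python
import numpy as np

def lsa_match(term_doc, query_col, k=2):
    """Find the KB document most correlated with the query column.

    term_doc: (words x docs) frequency matrix.
    query_col: (words,) frequency vector for the pre-processed query.
    """
    # Append the query as the last column of the term-document matrix.
    X = np.column_stack([term_doc, query_col]).astype(float)

    # SVD, then rank-k reconstruction from the k largest singular values.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Correlate each reconstructed document column with the query column.
    q = X_k[:, -1]
    corrs = [np.corrcoef(X_k[:, j], q)[0, 1] for j in range(term_doc.shape[1])]
    return int(np.argmax(corrs))
```

As a design note, the truncated SVD collapses co-occurring words onto shared dimensions, which is what lets a query correlate with a document even when their exact word overlap is small.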

  12. Testing • Documents: Basic Electrical Engineering (EXP, Lessons) • The documents average approximately 250 words and each deals with a new topic, so they cannot be partitioned into training and testing documents (11 + 46 = 57 docs). • Naïve Bayes: • Automated trivial input testing • Real input testing • LSA: • Trivial input testing • Real input testing (to be tested for Lesson)

  13. Results • Naïve Bayes: • Automated Trivial Input

  14. Results • Naïve Bayes • Real input, EXP docs (11 docs): Inputs of fewer than 10 words (e.g. “how do i use a dc power supply?”): accuracy 8/10. Inputs of 10 to 15 words (e.g. “what is the law that states that energy is neither created nor destroyed, but just changes from one form to another?”): accuracy 8/10. Lesson docs (46 docs), inputs of 5 to 15 words: accuracy 14/20.

  15. Results • LSA (flawless with trivial inputs of more than 20 words) • Without SVD (for EXP only) • Poor accuracy: 4/10 (fewer than 10 words) • Good accuracy: 8/10 (10 to 15 words) • With SVD • Very poor accuracy: 1/10 (fewer than 10 words) • Poor accuracy: 2/10 (10 to 15 words)

  16. Comparison • Naïve Bayes • Fails for acronyms and irrelevant queries. • Indirect references fail (word context). • Keywords determine success. • Documents with discrete concept content perform better (EXP). • LSA • Fails badly for short sentences (<15 words). • Very effective for long sentences (>20 words). • Insensitive to indirect references or context.

  17. Conclusion • The Naïve Bayes and LSA techniques were studied. • Software was written to test these methods. • Naïve Bayes is found to be very effective for short, question-answering-type sentences, with an approximate accuracy of 80%. • LSA without SVD is better than LSA with SVD for shorter sentences.
