NL Question-Answering using Naïve Bayes and LSA By Kaushik Krishnasamy
Agenda • Problem • Key • Methodology • Naïve Bayes • LSA • Results & Comparison • Conclusion
Problem: QA • QA/discourse is complex. • Overhead in knowledge extraction & decoding (grammar-based systems). • Restrictions due to inherent language & cultural constructs. • Context of word usage matters.
Key: IR • The information is hidden in the relevant documents. • Frequency of a word indicates its importance. • Neighborhood of a word indicates its context.
Methodology • The question posed is treated as a new document. • How close is this document to each document in the knowledge base (KB)? • Naïve Bayes: probabilistic approach (C#) • LSA: dimensionality reduction (MATLAB) • The closest document has the answer (sketched below).
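A minimal Python sketch of this retrieval loop, common to both methods; preprocess and score are hypothetical placeholders, since the original C#/MATLAB code is not shown here.

def answer(query, kb_docs, score, preprocess):
    # Treat the posed question as a new document: pre-process it the same way
    # as the knowledge-base documents, score it against every KB document,
    # and return the closest one -- that document holds the answer.
    query_words = preprocess(query)
    return max(kb_docs, key=lambda doc: score(query_words, doc))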
Naïve Bayes • vMAP = argmax over vj ∈ V of P(vj | a1, a2, a3, …, an) • Since all documents are possible target documents, P(v1) = P(v2) = … = P(vj) = constant, so vNB = argmax over vj ∈ V of ∏i P(ai | vj) • Words are assumed to be independent and identically distributed.
Naïve Bayes - Algorithm • Pre-process all documents. • Store the number of unique words in each document (Ni). • Concatenate all documents and keep the words that occur at least twice as the unique words; the count of these words is the 'Vocabulary'. • For each of these unique words and each document, estimate P(word | document) = (frequency of the word in doc 'i' + 1) / (Ni + Vocabulary). • Store (word, doc, probability/frequency) to a file (a sketch follows).
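A minimal Python sketch of these training steps, assuming each document is already pre-processed into a list of words (the original implementation was in C#; all names here are illustrative):

from collections import Counter

def train_naive_bayes(docs):
    # docs: {doc_id: list of pre-processed words}
    n_i = {d: len(set(words)) for d, words in docs.items()}         # Ni: unique words per document
    corpus_counts = Counter(w for words in docs.values() for w in words)
    unique_words = [w for w, c in corpus_counts.items() if c >= 2]   # words occurring at least twice
    vocabulary = len(unique_words)                                   # the 'Vocabulary'
    probs = {}                                                       # (word, doc) -> P(word | doc)
    for d, words in docs.items():
        freq = Counter(words)
        for w in unique_words:
            probs[(w, d)] = (freq[w] + 1) / (n_i[d] + vocabulary)    # smoothed estimate from the slide
    return probs, set(unique_words)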
Contd… • Obtain an input query from the user. • Retrieve the individual words after pre-processing. • Penalize words that are not among the stored unique words. • For each document, estimate from the file the product of the probabilities of all retrieved words given that document: P(input|vi) = P(w1|vi) * P(w2|vi) * P(w3|vi) * … * P(wn|vi) • The document with the maximum P(input|vi) is the document having the answer (a sketch follows). • WordNet: resolve unknown input words.
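A matching query-time sketch; the small penalty constant and the log-space sum (equivalent to the product for the argmax, but safe from underflow) are assumptions on top of what the slide states:

import math

def classify(query_words, probs, docs, penalty=1e-6):
    # Score each document by the (log) product of P(word | doc) over the query words.
    def log_score(d):
        return sum(math.log(probs.get((w, d), penalty)) for w in query_words)
    # The document with the maximum P(input | doc) holds the answer.
    return max(docs, key=log_score)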
LSA: Latent Semantic Analysis • A method to extract & represent the contextual-usage meaning of words. • Words and sets of words are points in a very high-dimensional "semantic space". • Uses SVD to reduce dimensionality. • Correlation analysis on the reduced representation yields the results.
LSA: Algorithm • Obtain (word, doc, frequency). • Basic Matrix: form the (word x doc) matrix with the frequency entries. • Pre-process the input query. • Query Matrix: form the (word x (doc+1)) matrix with the query as the last column, holding its individual word frequencies. • Perform SVD: X = U S Vᵀ • Select the two largest singular values and reconstruct the matrix.
Contd… • Find the document that is maximally correlated to the query column. • This document contains the answer to the query (a sketch follows).
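A minimal NumPy sketch of this LSA retrieval; the original implementation was in MATLAB, and k=2 matches the two singular values kept above:

import numpy as np

def lsa_retrieve(term_doc, query_col, k=2):
    # term_doc: (words x docs) frequency matrix; query_col: (words,) query word frequencies.
    X = np.column_stack([term_doc, query_col])        # query appended as the last column
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V^T
    s[k:] = 0                                         # keep only the k largest singular values
    X_k = (U * s) @ Vt                                # rank-k reconstruction
    query_vec = X_k[:, -1]
    # Correlate every document column with the reconstructed query column.
    corrs = [np.corrcoef(X_k[:, j], query_vec)[0, 1] for j in range(term_doc.shape[1])]
    return int(np.argmax(corrs))                      # index of the maximally correlated document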
Testing • Documents: Basic Electrical Engineering (EXP, Lessons) • The documents average approximately 250 words and each deals with a distinct topic, so they cannot be partitioned into training and testing docs (11 + 46 = 57 docs) • Naïve Bayes: • Automated trivial input testing • Real input testing • LSA: • Trivial input testing • Real input testing (still to be tested on the Lesson docs)
Results • Naïve Bayes: • Automated Trivial Input
Results • Naïve Bayes • Real input, EXP docs (11 docs): • Inputs of fewer than 10 words (e.g. "how do i use a dc power supply?"): accuracy 8/10 • Inputs of 10 to 15 words (e.g. "what is the law that states that energy is neither created nor destroyed, but just changes from one form to another?"): accuracy 8/10 • Lesson docs (46 docs), inputs of 5 to 15 words: accuracy 14/20
Results • LSA (flawless with trivial inputs of more than 20 words) • Without SVD (EXP only): • Poor accuracy: 4/10 (<10 words) • Good accuracy: 8/10 (10 to 15 words) • With SVD: • Very poor accuracy: 1/10 (<10 words) • Poor accuracy: 2/10 (10 to 15 words)
Comparison • Naïve Bayes • Fails for acronyms and irrelevant queries. • Indirect references fail: word context is lost. • Keywords determine success. • Documents with discrete concept content perform better (EXP). • LSA • Fails miserably for short sentences (<15 words). • Very effective for long sentences (>20 words). • Insensitive to indirect references or context.
Conclusion • The Naïve Bayes and LSA techniques were studied. • Software was written to test these methods. • Naïve Bayes is found to be very effective for short (question-answer) sentences, with an approximate accuracy of 80%. • For shorter sentences, LSA without SVD performs better than LSA with SVD.