Natural Language Processing Assignment – Final Presentation
Varun Suprashanth, 09005063
Tarun Gujjula, 09005068
Asok Ramachandran, 09005072

Part 1: POS Tagger
Tasks Completed
• Implementation of Viterbi – Unigram, Bigram.
• Five-Fold Evaluation.
• Per-POS Accuracy.
• Confusion Matrix.
Problem Statement
• Generate the unigram parameters P(t_i|w_i). You already have the annotated corpus.
• Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem.
• Compare the unigram performance of (2) with that of the HMM-based system.
Tasks Completed
• Generated the unigram parameters P(t_i|w_i).
• Computed the argmax of P(T|W).
• Compared the unigram performance of the above with the HMM-based system.
• The generative model produced better results on ambiguous sentences.
Discriminative
• P(T|W) = P(t_1 t_2 … t_n | w_1 w_2 … w_n)
• Assuming word-tag pairs to be independent:
• P(T|W) = P(t_1|w_1) · P(t_2|w_2) · … · P(t_n|w_n) = ∏_i P(t_i|w_i)
• Precision: 0.896788
• F-measure: 0.896788
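For illustration, a minimal sketch of how these unigram parameters might be estimated and applied. The corpus format (sentences as lists of (word, tag) pairs), the function names, and the NN1 fallback tag are assumptions, not the code behind the reported numbers.

from collections import Counter, defaultdict

def train_p_tag_given_word(tagged_sentences):
    # counts[word][tag] = c(word, tag); P(t|w) = c(w, t) / c(w).
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts

def tag_words(counts, words, default_tag="NN1"):
    # With word-tag pairs independent, argmax_T P(T|W) decomposes into a
    # per-word argmax of P(t|w); since c(w) is fixed for each word, the
    # argmax over raw counts suffices and no normalisation is needed.
    return [counts[w].most_common(1)[0][0] if w in counts else default_tag
            for w in words]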
Generative
• By Bayes' theorem, P(T|W) ∝ P(W|T) · P(T).
• Assuming a unigram tag model and word-tag pairs to be independent:
• P(T|W) ∝ P(w_1|t_1)·P(t_1) · … · P(w_n|t_n)·P(t_n) = ∏_i P(w_i|t_i) · P(t_i)
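The tasks slide also lists a bigram Viterbi implementation. As a companion to the formulas above, here is a minimal sketch of bigram Viterbi decoding over the tag lattice; the table names trans and emit, their dict layouts, and the '^'/'$' markers (taken from the A* problem statement later in the deck) are all assumptions, not the submitted code.

import math

def viterbi(words, tags, trans, emit, start="^", end="$"):
    # trans[(t_prev, t)] = P(t | t_prev); emit[(t, w)] = P(w | t).
    # Log-probabilities avoid underflow on long sentences.
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t] = (log-score of the best path ending in tag t, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for t in tags:
            new_best[t] = max(
                (score + lp(trans.get((tp, t), 0.0)) + lp(emit.get((t, w), 0.0)),
                 path + [t])
                for tp, (score, path) in best.items()
            )
        best = new_best
    # close the sequence with the transition into '$'
    _, path = max(
        (score + lp(trans.get((tp, end), 0.0)), path)
        for tp, (score, path) in best.items()
    )
    return path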
Tasks Completed
• Predicted the next word on the basis of patterns occurring in both corpora.
• The first corpus had untagged sentences; the second had POS-tagged sentences.
• The corpus with tagged words gives better results for word prediction.
Untagged Corpus
• P(w_n | w_1 … w_{n-1}) = c(w_1 … w_n) / c(w_1 … w_{n-1})
• where c(·) is the count.
• By the bigram assumption:
• P(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1})
• By the trigram assumption:
• P(w_n | w_{n-2} w_{n-1}) = c(w_{n-2} w_{n-1} w_n) / c(w_{n-2} w_{n-1})
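A sketch of the raw-text predictor under the bigram assumption; since c(w_{n-1}) is constant across candidates, the argmax can be taken over bigram counts directly. The corpus format and names are assumed for illustration.

from collections import Counter

def train_bigram_counts(sentences):
    # sentences: iterable of word lists (the untagged corpus)
    bigrams = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
    return bigrams

def predict_next(bigrams, w_prev):
    # argmax_w c(w_prev, w) / c(w_prev) = argmax_w c(w_prev, w)
    candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == w_prev}
    return max(candidates, key=candidates.get) if candidates else None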
Tagged Corpus
• P(w_n | w_1 t_1 … w_{n-1} t_{n-1}) = c(w_1 t_1 … w_{n-1} t_{n-1} w_n) / c(w_1 t_1 … w_{n-1} t_{n-1})
• Using the bigram assumption:
• P(w_n | w_{n-1}, t_{n-1}) = c(w_{n-1} t_{n-1} w_n) / c(w_{n-1} t_{n-1})
• Using the trigram assumption:
• P(w_n | w_{n-2} w_{n-1}, t_{n-2} t_{n-1}) = c(w_{n-2} t_{n-2} w_{n-1} t_{n-1} w_n) / c(w_{n-2} t_{n-2} w_{n-1} t_{n-1})
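The tagged-corpus predictor differs only in that the history is a (word, tag) pair rather than a bare word, which disambiguates histories such as 'book' as a noun versus a verb. Again a sketch under the same assumed corpus format; it predicts tagged words, matching the examples that follow.

from collections import Counter

def train_tagged_bigrams(tagged_sentences):
    # tagged_sentences: iterable of [(word, tag), ...] lists (assumed format)
    bigrams = Counter()
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            bigrams[((w1, t1), (w2, t2))] += 1
    return bigrams

def predict_next_tagged(bigrams, w_prev, t_prev):
    # argmax over c(w_{n-1} t_{n-1} w_n): the tag disambiguates histories
    # such as ('book', 'NN1') versus ('book', 'VVB').
    candidates = {wt: c for (h, wt), c in bigrams.items() if h == (w_prev, t_prev)}
    return max(candidates, key=candidates.get) if candidates else None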
Examples
• Example 1:
  Tagged input: TO0_to VBI_be CJC_or XX0_not TO0_to
  Tagged-LM prediction: VBI_be
  Raw input: to be or not to
  Raw-LM prediction: The
• Example 2:
  Tagged input: AJ0_complete CJC_and AJ0_utter
  Tagged-LM prediction: NN1_contempt
  Raw input: complete and utter
  Raw-LM prediction: Loud
Examples (cont.)
• Example 3:
  Tagged input: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite
  Tagged-LM prediction: NN1_gardening
  Raw input: who is your favourite
  Raw-LM prediction: is
Results
• Raw-text LM word prediction accuracy: 13.21%
• POS-tagged-text LM word prediction accuracy: 15.53%
Problem Statement
• The goal is to see which algorithm is better for POS tagging: Viterbi or A*.
• Look upon the column of POS tags above all the words as forming the state-space graph.
• The start state S is '^' and the goal state G is '$'.
• Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state-space graph.
• g(N) is the cost of the best path found so far to W from '^'.
• Run A* with this heuristic and see the result.
• Compare the result with Viterbi.
A* Implementation
• Precision: 0.937254
• F-measure: 0.937254
Heuristic
• h = g * (N - n) / n
• where N is the length of the sentence and n is the index of the current word in the sentence.
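A sketch of A* over the same tag lattice with this heuristic. Arc costs are negative log probabilities, so the cheapest path from '^' through all the words is the most probable tag sequence; the trans/emit tables are the same assumed inputs as in the Viterbi sketch, and the final transition into '$' is omitted for brevity.

import heapq
import math

def astar_tag(words, tags, trans, emit, start="^"):
    # A node is (position n, tag t); the arc into (n, t) from (n-1, t_prev)
    # costs -log( P(t | t_prev) * P(w_n | t) ).
    N = len(words)

    def cost(t_prev, t, w):
        p = trans.get((t_prev, t), 0.0) * emit.get((t, w), 0.0)
        return -math.log(p) if p > 0 else float("inf")

    # frontier entries: (g + h, g, position, tag, path so far)
    frontier = [(0.0, 0.0, 0, start, [])]
    while frontier:
        f, g, n, t, path = heapq.heappop(frontier)
        if n == N:                       # every word tagged: goal reached
            return path
        for t_next in tags:
            g2 = g + cost(t, t_next, words[n])
            if g2 == float("inf"):
                continue                 # zero-probability arc
            h2 = g2 * (N - (n + 1)) / (n + 1)   # the slide's h = g * (N - n) / n
            heapq.heappush(frontier, (g2 + h2, g2, n + 1, t_next, path + [t_next]))
    return None

Note that this h scales with g rather than lower-bounding the remaining cost, so it is not admissible in the textbook A* sense; the comparison with Viterbi is therefore an empirical one.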
Problem Statement
• Take as input two words and show a path between them, listing all the concepts encountered on the way.
• For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog - dog - mammal - cat - cheshire cat'. Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!).
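One way to sketch this with NLTK's WordNet interface: climb from each word's synset to their lowest common hypernym and splice the two halves. Taking only the first synset of each word and the helper names here are simplifying assumptions.

from nltk.corpus import wordnet as wn

def concept_path(word1, word2):
    # Use the first (most frequent) synset of each word; a fuller version
    # would search over all synset pairs for the shortest overall path.
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    lch = s1.lowest_common_hypernyms(s2)[0]

    def down_from(ancestor, synset):
        # a root-to-synset hypernym path, truncated to start at the ancestor
        for p in synset.hypernym_paths():
            if ancestor in p:
                return p[p.index(ancestor):]
        return [ancestor, synset]

    up = down_from(lch, s1)[::-1]          # word1 ... common ancestor
    return up + down_from(lch, s2)[1:]     # common ancestor ... word2

For instance, concept_path('bulldog', 'cat') climbs from bulldog through dog up to a common ancestor and back down to cat, much like the bulldog-dog-mammal-cat path above (requires nltk.download('wordnet')).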
Example
• English: Dhoni is the captain of India.
• Hindi: dhoni bhaarat ke kaptaan hai.
• Hindi parse: [ [ [dhoni]NN]NP [ [[[bhaarat]NNP]NP [ke]P ]PP [kaptaan]NN]NP [hai]VBZ ]VP ]S
• English parse: [ [ [Dhoni]NN]NP [ [is]VBZ [[the]ART [captain]NN]NP [[of]P [[India]NNP]NP]PP]VP ]S
Problems and Conclusions
• Many idioms in English are translated literally, even though they mean something else:
• e.g. phrases like "break a leg", "he lost his head", "French kiss", "flip the bird".
• Noise arises because of misalignments.
Natural Language Toolkit
• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) in the Python programming language.
• NLTK includes graphical demonstrations and sample data.
• It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
• It provides lexical resources such as WordNet.
• It has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
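A small usage sketch of the pipeline pieces mentioned above; the printed tags are illustrative, and the sentence is just the earlier example reused.

import nltk

# one-time downloads of the tokenizer, tagger model, and WordNet data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

tokens = nltk.word_tokenize("Dhoni is the captain of India.")   # tokenization
print(nltk.pos_tag(tokens))   # POS tagging with Penn Treebank tags, e.g. ('Dhoni', 'NNP')

print(wn.synsets('captain')[0].definition())   # WordNet as a lexical resource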