Natural Language Processing Assignment – Final Presentation
Varun Suprashanth, 09005063
Tarun Gujjula, 09005068
Asok Ramachandran, 09005072

Part 1: POS Tagger
Tasks Completed
• Implementation of Viterbi – Unigram, Bigram.
• Five-Fold Evaluation.
• Per-POS Accuracy.
• Confusion Matrix.
Problem Statement
• Generate the unigram parameters P(t_i|w_i). You already have the annotated corpus.
• Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem.
• Compare the unigram performance of (2) with that of the HMM-based system.
Tasks Completed
• Generated the unigram parameters P(t_i|w_i).
• Computed the argmax of P(T|W).
• Compared the unigram performance of the above with the HMM-based system.
• The generative model produced better results on ambiguous sentences.
Discriminative
• P(T|W) = P(t_1 t_2 … t_n | w_1 w_2 … w_n)
• Assuming word-tag pairs to be independent:
• P(T|W) = P(t_1|w_1) · P(t_2|w_2) · … · P(t_n|w_n) = ∏_i P(t_i|w_i)
• Precision: 0.896788
• F-measure: 0.896788
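For illustration, a minimal sketch of how these unigram parameters might be estimated and applied. The corpus format (sentences as lists of (word, tag) pairs), the function names, and the NN1 fallback tag are assumptions, not the code behind the reported numbers.

from collections import Counter, defaultdict

def train_p_tag_given_word(tagged_sentences):
    # counts[word][tag] = c(word, tag); P(t|w) = c(w, t) / c(w).
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts

def tag_words(counts, words, default_tag="NN1"):
    # With word-tag pairs independent, argmax_T P(T|W) decomposes into a
    # per-word argmax of P(t|w); since c(w) is fixed for each word, the
    # argmax over raw counts suffices and no normalisation is needed.
    return [counts[w].most_common(1)[0][0] if w in counts else default_tag
            for w in words]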
Generative
• By Bayes' theorem, P(T|W) ∝ P(W|T) · P(T).
• Assuming a unigram tag model and word-tag pairs to be independent:
• P(T|W) ∝ P(w_1|t_1)·P(t_1) · … · P(w_n|t_n)·P(t_n) = ∏_i P(w_i|t_i) · P(t_i)
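The tasks slide also lists a bigram Viterbi implementation. As a companion to the formulas above, here is a minimal sketch of bigram Viterbi decoding over the tag lattice; the table names trans and emit, their dict layouts, and the '^'/'$' markers (taken from the A* problem statement later in the deck) are all assumptions, not the submitted code.

import math

def viterbi(words, tags, trans, emit, start="^", end="$"):
    # trans[(t_prev, t)] = P(t | t_prev); emit[(t, w)] = P(w | t).
    # Log-probabilities avoid underflow on long sentences.
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t] = (log-score of the best path ending in tag t, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for t in tags:
            new_best[t] = max(
                (score + lp(trans.get((tp, t), 0.0)) + lp(emit.get((t, w), 0.0)),
                 path + [t])
                for tp, (score, path) in best.items()
            )
        best = new_best
    # close the sequence with the transition into '$'
    _, path = max(
        (score + lp(trans.get((tp, end), 0.0)), path)
        for tp, (score, path) in best.items()
    )
    return path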
Tasks Completed
• Predicted the next word on the basis of patterns occurring in both corpora.
• The first corpus had untagged sentences; the second had POS-tagged sentences.
• The corpus with tagged words gives better results for word prediction.
Untagged Corpus
• P(w_n | w_1 … w_{n-1}) = c(w_1 … w_n) / c(w_1 … w_{n-1})
• where c(·) is the count.
• By the bigram assumption:
• P(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1})
• By the trigram assumption:
• P(w_n | w_{n-2} w_{n-1}) = c(w_{n-2} w_{n-1} w_n) / c(w_{n-2} w_{n-1})
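A sketch of the raw-text predictor under the bigram assumption; since c(w_{n-1}) is constant across candidates, the argmax can be taken over bigram counts directly. The corpus format and names are assumed for illustration.

from collections import Counter

def train_bigram_counts(sentences):
    # sentences: iterable of word lists (the untagged corpus)
    bigrams = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
    return bigrams

def predict_next(bigrams, w_prev):
    # argmax_w c(w_prev, w) / c(w_prev) = argmax_w c(w_prev, w)
    candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == w_prev}
    return max(candidates, key=candidates.get) if candidates else None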
Tagged Corpus
• P(w_n | w_1 t_1 … w_{n-1} t_{n-1}) = c(w_1 t_1 … w_{n-1} t_{n-1} w_n) / c(w_1 t_1 … w_{n-1} t_{n-1})
• Using the bigram assumption:
• P(w_n | w_{n-1}, t_{n-1}) = c(w_{n-1} t_{n-1} w_n) / c(w_{n-1} t_{n-1})
• Using the trigram assumption:
• P(w_n | w_{n-2} w_{n-1}, t_{n-2} t_{n-1}) = c(w_{n-2} t_{n-2} w_{n-1} t_{n-1} w_n) / c(w_{n-2} t_{n-2} w_{n-1} t_{n-1})
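The tagged-corpus predictor differs only in that the history is a (word, tag) pair rather than a bare word, which disambiguates histories such as 'book' as a noun versus a verb. Again a sketch under the same assumed corpus format; it predicts tagged words, matching the examples that follow.

from collections import Counter

def train_tagged_bigrams(tagged_sentences):
    # tagged_sentences: iterable of [(word, tag), ...] lists (assumed format)
    bigrams = Counter()
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            bigrams[((w1, t1), (w2, t2))] += 1
    return bigrams

def predict_next_tagged(bigrams, w_prev, t_prev):
    # argmax over c(w_{n-1} t_{n-1} w_n): the tag disambiguates histories
    # such as ('book', 'NN1') versus ('book', 'VVB').
    candidates = {wt: c for (h, wt), c in bigrams.items() if h == (w_prev, t_prev)}
    return max(candidates, key=candidates.get) if candidates else None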
Examples
• Example 1:
  Tagged input: TO0_to VBI_be CJC_or XX0_not TO0_to
  Tagged-LM prediction: VBI_be
  Raw input: to be or not to
  Raw-LM prediction: The
• Example 2:
  Tagged input: AJ0_complete CJC_and AJ0_utter
  Tagged-LM prediction: NN1_contempt
  Raw input: complete and utter
  Raw-LM prediction: Loud
Examples (cont.)
• Example 3:
  Tagged input: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite
  Tagged-LM prediction: NN1_gardening
  Raw input: who is your favourite
  Raw-LM prediction: is
Results
• Raw-text LM word prediction accuracy: 13.21%
• POS-tagged-text LM word prediction accuracy: 15.53%
Problem Statement
• The goal is to see which algorithm is better for POS tagging: Viterbi or A*.
• Look upon the column of POS tags above all the words as forming the state-space graph.
• The start state S is '^' and the goal state G is '$'.
• Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state-space graph.
• g(N) is the cost of the best path found so far to W from '^'.
• Run A* with this heuristic and see the result.
• Compare the result with Viterbi.
A* Implementation
• Precision: 0.937254
• F-measure: 0.937254
Heuristic
• h = g * (N - n) / n
• where N is the length of the sentence and n is the index of the current word in the sentence.
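A sketch of A* over the same tag lattice with this heuristic. Arc costs are negative log probabilities, so the cheapest path from '^' through all the words is the most probable tag sequence; the trans/emit tables are the same assumed inputs as in the Viterbi sketch, and the final transition into '$' is omitted for brevity.

import heapq
import math

def astar_tag(words, tags, trans, emit, start="^"):
    # A node is (position n, tag t); the arc into (n, t) from (n-1, t_prev)
    # costs -log( P(t | t_prev) * P(w_n | t) ).
    N = len(words)

    def cost(t_prev, t, w):
        p = trans.get((t_prev, t), 0.0) * emit.get((t, w), 0.0)
        return -math.log(p) if p > 0 else float("inf")

    # frontier entries: (g + h, g, position, tag, path so far)
    frontier = [(0.0, 0.0, 0, start, [])]
    while frontier:
        f, g, n, t, path = heapq.heappop(frontier)
        if n == N:                       # every word tagged: goal reached
            return path
        for t_next in tags:
            g2 = g + cost(t, t_next, words[n])
            if g2 == float("inf"):
                continue                 # zero-probability arc
            h2 = g2 * (N - (n + 1)) / (n + 1)   # the slide's h = g * (N - n) / n
            heapq.heappush(frontier, (g2 + h2, g2, n + 1, t_next, path + [t_next]))
    return None

Note that this h scales with g rather than lower-bounding the remaining cost, so it is not admissible in the textbook A* sense; the comparison with Viterbi is therefore an empirical one.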
Problem Statement
• Take as input two words and show a path between them, listing all the concepts encountered on the way.
• For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog - dog - mammal - cat - cheshire cat'. Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!).
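One way to sketch this with NLTK's WordNet interface: climb from each word's synset to their lowest common hypernym and splice the two halves. Taking only the first synset of each word and the helper names here are simplifying assumptions.

from nltk.corpus import wordnet as wn

def concept_path(word1, word2):
    # Use the first (most frequent) synset of each word; a fuller version
    # would search over all synset pairs for the shortest overall path.
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    lch = s1.lowest_common_hypernyms(s2)[0]

    def down_from(ancestor, synset):
        # a root-to-synset hypernym path, truncated to start at the ancestor
        for p in synset.hypernym_paths():
            if ancestor in p:
                return p[p.index(ancestor):]
        return [ancestor, synset]

    up = down_from(lch, s1)[::-1]          # word1 ... common ancestor
    return up + down_from(lch, s2)[1:]     # common ancestor ... word2

For instance, concept_path('bulldog', 'cat') climbs from bulldog through dog up to a common ancestor and back down to cat, much like the bulldog-dog-mammal-cat path above (requires nltk.download('wordnet')).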
Example
• English: Dhoni is the captain of India.
• Hindi: dhoni bhaarat ke kaptaan hai.
• Hindi parse: [ [ [dhoni]NN]NP [ [[[bhaarat]NNP]NP [ke]P ]PP [kaptaan]NN]NP [hai]VBZ ]VP ]S
• English parse: [ [ [Dhoni]NN]NP [ [is]VBZ [[the]ART [captain]NN]NP [[of]P [[India]NNP]NP]PP]VP ]S
Problems and Conclusions
• Many idioms in English are translated literally, even though they mean something else:
• e.g. phrases like "break a leg", "he lost his head", "French kiss", "flip the bird".
• Noise arises because of misalignments.
Natural Language Toolkit
• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) in the Python programming language.
• NLTK includes graphical demonstrations and sample data.
• It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
• It provides lexical resources such as WordNet.
• It has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
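A small usage sketch of the pipeline pieces mentioned above; the printed tags are illustrative, and the sentence is just the earlier example reused.

import nltk

# one-time downloads of the tokenizer, tagger model, and WordNet data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

tokens = nltk.word_tokenize("Dhoni is the captain of India.")   # tokenization
print(nltk.pos_tag(tokens))   # POS tagging with Penn Treebank tags, e.g. ('Dhoni', 'NNP')

print(wn.synsets('captain')[0].definition())   # WordNet as a lexical resource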