Explore NLTK and the Brown corpus to build a word-based unigram model with Laplace smoothing, compute word probabilities, apply search tricks to reconstruct shredded text, and connect the result to information theory and code breaking.
CPSC 7373: Artificial Intelligence
Lecture 13: Natural Language Processing
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
NLP Assignment 2 • NLTK + Brown corpus • Word-based unigram model with Laplace smoothing, e.g.: • P(“this”) = 0.00424935611437 • log(P(“this”)) = -5.46 • P(“het”) = 8.2575905837e-07 • log(P(“het”)) = -14.01
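A minimal sketch of how such a model could be built, assuming the Brown corpus has been downloaded once via nltk.download('brown'); the helper names laplace_prob and logprob are illustrative, not part of any starter code:

```python
import math
from collections import Counter

from nltk.corpus import brown  # requires a one-time nltk.download('brown')

# Word frequencies over the whole Brown corpus (lowercased).
counts = Counter(w.lower() for w in brown.words())
total = sum(counts.values())
vocab_size = len(counts)

def laplace_prob(word):
    """P(word) with add-one (Laplace) smoothing, so an unseen
    word such as 'het' still gets a small nonzero probability."""
    return (counts[word.lower()] + 1) / (total + vocab_size)

def logprob(word):
    """Natural-log probability, convenient for summing over words."""
    return math.log(laplace_prob(word))

print(laplace_prob("this"), logprob("this"))  # roughly 4.2e-3 and -5.5
print(laplace_prob("het"), logprob("het"))    # tiny, roughly -14 in log space
```

Exact values depend on tokenization and smoothing details, but the two probes above should land close to the numbers on the slide.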
Step 1 • First run on the last row: • |hi| | |in| | | t| | | | |ye| |ar| |s | | |. | • (-17.015945081500426, [3, 6, 0, 15, 11, 13, 18], 'in this year. ') • (-17.015945081500426, [3, 6, 0, 15, 18, 11, 13], 'in this . year') • (-17.015945081500426, [3, 18, 6, 0, 15, 11, 13], 'in. this year') • (-17.015945081500426, [3, 18, 11, 13, 6, 0, 15], 'in. year this ') • (-17.015945081500426, [6, 0, 15, 3, 18, 11, 13], ' this in. year') • (-17.015945081500426, [11, 13, 18, 3, 6, 0, 15], 'year. in this ') • (-17.015945081500426, [18, 3, 6, 0, 15, 11, 13], '. in this year') • Other tricks?
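All seven candidates tie at -17.0159 because a unigram model is blind to word order: any arrangement of the same words scores identically, which is what the extra tricks must compensate for. One way to reproduce such a ranking is to brute-force the orderings of the non-blank strips and score each concatenation with the unigram model. A sketch reusing the logprob helper from the previous sketch, with strip contents transcribed from the slide:

```python
from itertools import permutations

# Non-blank 2-character strips of the last row, keyed by column index
# (blank columns are omitted here).
strips = {0: 'hi', 3: 'in', 6: ' t', 11: 'ye', 13: 'ar', 15: 's ', 18: '. '}

def score_order(order):
    """Unigram log-probability of the text formed by concatenating
    the strips in the given column order."""
    text = ''.join(strips[i] for i in order)
    # Naive tokenization; a real run would split off the trailing '.',
    # so exact scores will differ from the slide's.
    return sum(logprob(w) for w in text.split())

# Brute-force all 7! = 5040 orderings and print the top candidates.
ranked = sorted(((score_order(p), list(p), ''.join(strips[i] for i in p))
                 for p in permutations(strips)),
                reverse=True)
for cand in ranked[:7]:
    print(cand)
```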
Step 2 • Based on: (4.074449568896497e-08, [3, 6, 0, 15, 11, 13, 18], 'in this year. '), the Step 1 winner expressed as a raw probability (e^-17.0159 ≈ 4.07e-08) • Collapse [3, 6, 0, 15, 11, 13, 18] into one fixed string ('in this year. ') • Randomly pick 3 strips from the rest [1, 2, 4, 5, 7, 8, 9, 10, 12, 14, 16, 17]; and • Randomly pick another row (e.g., the 4th row) • |he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br| • (-27.422343645594154, [3, 6, 0, 15, 11, 13, 18, 2, 14, 17], 'of the code breaking') • Recurse… until every strip has been placed.
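A sketch of that collapse-and-extend move, under the same assumptions as the sketches above (best_extension is an illustrative name; a fuller search would also try placing the new strips before or around the fixed block, and would restart on unlucky draws):

```python
import random
from itertools import permutations

def best_extension(fixed_order, remaining, row_strips, k=3):
    """One Step-2 move: randomly pick k leftover columns, try every
    ordering of them appended after the already-fixed block, and keep
    the highest-scoring candidate on the chosen row."""
    picks = random.sample(sorted(remaining), k)
    best = None
    for perm in permutations(picks):
        order = fixed_order + list(perm)
        text = ''.join(row_strips[i] for i in order)
        score = sum(logprob(w) for w in text.split())
        if best is None or score > best[0]:
            best = (score, order, text)
    return best

# The 4th row, transcribed from the slide (columns 1 and 9 are blank).
row4 = {0: 'he', 1: ' ', 2: 'ea', 3: 'of', 4: 'ho', 5: ' m', 6: ' t',
        7: 'et', 8: 'ha', 9: ' ', 10: ' t', 11: 'od', 12: 'ds', 13: 'e ',
        14: 'ki', 15: ' c', 16: 't ', 17: 'ng', 18: 'br'}
fixed = [3, 6, 0, 15, 11, 13, 18]   # 'in this year. ' from Step 1
rest = [1, 2, 4, 5, 7, 8, 9, 10, 12, 14, 16, 17]
print(best_extension(fixed, rest, row4))
# A lucky draw of columns {2, 14, 17} reproduces the slide's winner:
# [3, 6, 0, 15, 11, 13, 18, 2, 14, 17] -> 'of the code breaking'.
```

Collapsing each winner into the fixed block and repeating places a few more strips per round until the whole message is reassembled.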
Final Claude Shannon founded information theory, which is the basis of probabilistic language models and of the code breaking methods that you would use to solve this problem, with the paper titled "A Mathematical Theory of Communication," published in this year.