CSCE 590 Web Scraping - NLTK • Topics • Introduction to NLTK • Parsing with the NLTK • Readings: • Online book February 21, 2017
http://www.nltk.org/book/ • 0. Preface • 1. Language Processing and Python • 2. Accessing Text Corpora and Lexical Resources • 3. Processing Raw Text • 4. Writing Structured Programs • 5. Categorizing and Tagging Words (minor fixes still required) • 6. Learning to Classify Text • 7. Extracting Information from Text • 8. Analyzing Sentence Structure • 9. Building Feature Based Grammars • 10. Analyzing the Meaning of Sentences (minor fixes still required) • 11. Managing Linguistic Data (minor fixes still required) • 12. Afterword: Facing the Language Challenge • Bibliography • Term Index • http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
Installing NLTK • Install Setuptools: http://pypi.python.org/pypi/setuptools • Install Pip: run sudo easy_install pip • Install Numpy (optional): run sudo pip install -U numpy • Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk • Test installation: run python then type import nltk
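The "test installation" step can also be scripted. The following is a minimal sketch, assuming Python 3; the helper name is_installed is hypothetical and not part of NLTK. It only checks that each module can be found for import, not that it works correctly.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return importlib.util.find_spec(module_name) is not None


# PyYAML is imported under the name "yaml"; numpy is optional for NLTK.
for module_name in ("nltk", "numpy", "yaml"):
    status = "installed" if is_installed(module_name) else "missing"
    print(f"{module_name}: {status}")
```

Any module reported as missing can be installed with the matching pip command from the slide above.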
Installing NLTK Data • >>> import nltk • >>> nltk.download()
Test NLTK Installation • 1) Test Brown Corpus: • >>> from nltk.corpus import brown • >>> brown.words()[0:10] • ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of'] • >>> brown.tagged_words()[0:10] • [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')] • >>> len(brown.words()) • 1161192
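Tagged words are plain (word, tag) tuples, so they can be processed with ordinary Python. The sketch below copies the ten tagged words shown above into a literal list (so it runs without the Brown corpus installed) and counts how often each tag occurs; the variable names are illustrative only.

```python
from collections import Counter

# The first ten tagged words of the Brown corpus, copied from the slide above.
tagged = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
    ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
    ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'),
]

# Count how many times each tag appears.
tag_counts = Counter(tag for _, tag in tagged)

# Group the words by their tag.
words_by_tag = {}
for word, tag in tagged:
    words_by_tag.setdefault(tag, []).append(word)

print(tag_counts)
print(words_by_tag['AT'])
```

With the corpus installed, the same code should work on brown.tagged_words() in place of the literal list.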
Sentence Tokenizing (sentence boundary detection, sentence segmentation), Word Tokenizing and POS Tagging: • >>> from nltk import sent_tokenize, word_tokenize, pos_tag • >>> text = "Machine learning …" • >>> sents = sent_tokenize(text) • >>> sents • >>> tokens = word_tokenize(text) • >>> tokens
Part of Speech Tagging • >>> len(tokens) • 161 • >>> tagged_tokens = pos_tag(tokens) • >>> tagged_tokens • [('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …
Recursive Descent Parsing with NLTK • Parsers • nltk.parse_cfg(grammar) # build a CFG from a grammar string • nltk.ChartParser(g) • nltk.RecursiveDescentParser(g) # build parser from grammar • nltk.app.rdparser_app.RecursiveDescentApp • nltk.app.srparser_app.ShiftReduceApp • Imports • import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import *
Groucho Grammar • groucho_grammar = nltk.parse_cfg(""" • S -> NP VP • PP -> P NP • NP -> Det N | Det N PP | 'I' • VP -> V NP | VP PP • Det -> 'an' | 'my' • N -> 'elephant' | 'pajamas' • V -> 'shot' • P -> 'in' • """)
The ChartParser program • sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'] • print sent • parser = nltk.ChartParser(groucho_grammar) • trees = parser.nbest_parse(sent) • for tree in trees: • print tree
Groucho Output • ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'] • (S • (NP I) • (VP • (V shot) • (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas)))))) • (S • (NP I) • (VP • (VP (V shot) (NP (Det an) (N elephant))) • (PP (P in) (NP (Det my) (N pajamas)))))
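The Groucho slides use the NLTK 2 / Python 2 API. In NLTK 3, nltk.parse_cfg was replaced by nltk.CFG.fromstring and nbest_parse by parse, which returns an iterator of trees. A sketch of the same example under that assumption (NLTK 3 and Python 3 installed):

```python
import nltk

# Same grammar as on the slide, built with the NLTK 3 constructor.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
print(sent)

parser = nltk.ChartParser(groucho_grammar)
# parse() yields trees lazily; collect them so they can be counted.
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
```

The sentence is ambiguous, so two trees are produced: one attaches "in my pajamas" to the elephant, the other to the shooting.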
Loading grammars • # NLTK - mygrammar.cfg - to illustrate loading of grammars • # grammar1 = nltk.data.load('file:mygrammar.cfg') • S -> NP VP • VP -> V NP • NP -> N | DET N • N -> 'Mary' | 'Bob' | 'dog' • V -> 'saw' • DET -> 'the' | 'a'
Example loading “mygrammar.cfg” • grammar1 = nltk.data.load('file:mygrammar.cfg') • print grammar1 • sent = "Mary saw Bob".split() • print sent • rd_parser = nltk.RecursiveDescentParser(grammar1) • for tree in rd_parser.nbest_parse(sent): • print tree
Checking the grammar • # to dump the grammar • grammar1 = nltk.data.load('file:mygrammar.cfg') • print grammar1 • # or you can iterate through the productions • for p in grammar1.productions(): print p
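The load, dump, and parse steps above can be combined into one script. This sketch assumes NLTK 3 and Python 3. To stay self-contained it writes mygrammar.cfg to a temporary directory and reads it back with CFG.fromstring, which is a substitute for the nltk.data.load call shown on the slides, not the same call.

```python
import os
import tempfile

import nltk

GRAMMAR_TEXT = """\
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""

# Write the grammar to a temporary file so the sketch needs no external files.
grammar_dir = tempfile.mkdtemp()
grammar_path = os.path.join(grammar_dir, "mygrammar.cfg")
with open(grammar_path, "w") as handle:
    handle.write(GRAMMAR_TEXT)

# Read the file back and build the grammar (NLTK 3 constructor).
with open(grammar_path) as handle:
    grammar1 = nltk.CFG.fromstring(handle.read())

# Iterate through the productions to check what was loaded.
for production in grammar1.productions():
    print(production)

# Parse a sentence; in NLTK 3, parse() replaces nbest_parse().
rd_parser = nltk.RecursiveDescentParser(grammar1)
sent = "Mary saw Bob".split()
trees = list(rd_parser.parse(sent))
for tree in trees:
    print(tree)
```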
Extending the grammar • sent = 'Mary saw a cat'.split() • for t in rd_parser.nbest_parse(sent): • print t • Traceback (most recent call last): • File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module> • for t in rd_parser.nbest_parse(sent): • File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse • self._grammar.check_coverage(tokens) • File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage • "input words: %r." % missing) • ValueError: Grammar does not cover some of the input words: "'cat'".
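The error means that 'cat' does not appear in any lexical production, so the fix is to add it to the N rule in mygrammar.cfg. The idea behind the coverage check can be sketched in plain Python; uncovered_words is a hypothetical helper written for illustration and is not NLTK's implementation.

```python
# Lexical productions of mygrammar.cfg as a plain dictionary.
lexicon = {
    'N': {'Mary', 'Bob', 'dog'},
    'V': {'saw'},
    'DET': {'the', 'a'},
}


def uncovered_words(tokens, lexicon):
    """Return the tokens that no lexical production can produce."""
    known = set().union(*lexicon.values())
    return [token for token in tokens if token not in known]


sent = 'Mary saw a cat'.split()
missing_before = uncovered_words(sent, lexicon)
print(missing_before)

# Extend the grammar: the equivalent of N -> 'Mary' | 'Bob' | 'dog' | 'cat'.
lexicon['N'].add('cat')
missing_after = uncovered_words(sent, lexicon)
print(missing_after)
```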
Tracing • RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, then the parser will report the steps that it takes as it parses a text. • rd_parser = nltk.RecursiveDescentParser(grammar1, 2) • Parsing 'Mary saw a dog' • [ * S ] • E [ * NP VP ] • E [ * N VP ] • E [ * 'Mary' VP ] • M [ 'Mary' * VP ] • E [ 'Mary' * V NP ] • E [ 'Mary' * 'saw' NP ] • M [ 'Mary' 'saw' * NP ] • E [ 'Mary' 'saw' * N ] • E [ 'Mary' 'saw' * 'Mary' ] • E [ 'Mary' 'saw' * 'Bob' ] • E [ 'Mary' 'saw' * 'dog' ] • E [ 'Mary' 'saw' * DET N ] • E [ 'Mary' 'saw' * 'the' N ] • … • (S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))
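The expand (E) and match (M) steps in the trace can be reproduced with a small backtracking parser. The sketch below is a plain-Python illustration of recursive descent over the mygrammar.cfg rules, not NLTK's implementation; the function names and the trace format are assumptions made for this example.

```python
# mygrammar.cfg as a dictionary: nonterminal -> list of right-hand sides.
GRAMMAR = {
    'S': [('NP', 'VP')],
    'VP': [('V', 'NP')],
    'NP': [('N',), ('DET', 'N')],
    'N': [('Mary',), ('Bob',), ('dog',)],
    'V': [('saw',)],
    'DET': [('the',), ('a',)],
}


def parse_symbol(symbol, tokens, pos, trace):
    """Yield (tree, next_pos) for every way symbol can start at pos."""
    if symbol not in GRAMMAR:  # terminal: try to match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            if trace:
                print(f"M {symbol!r} at position {pos}")
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # nonterminal: expand each production in turn
        if trace:
            print(f"E {symbol} -> {' '.join(rhs)} at position {pos}")
        for children, end in parse_sequence(rhs, tokens, pos, trace):
            yield (symbol, children), end


def parse_sequence(symbols, tokens, pos, trace):
    """Yield (trees, next_pos) for every way the symbol sequence can match."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse_symbol(symbols[0], tokens, pos, trace):
        for trees, end in parse_sequence(symbols[1:], tokens, mid, trace):
            yield [tree] + trees, end


def parse(tokens, trace=False):
    """Return all parses of tokens as nested (label, children) tuples."""
    # Note: like any naive recursive descent parser, this would loop forever
    # on a left-recursive rule such as VP -> VP PP.
    return [tree for tree, end in parse_symbol('S', tokens, 0, trace)
            if end == len(tokens)]


trees = parse('Mary saw a dog'.split(), trace=True)
print(trees)
```

Each failed expansion simply yields nothing, so the parser backtracks to the next production, which is the behavior visible in the trace above when N is tried before DET N.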
Example grammar L0 based on the ATIS corpus • S -> NP VP • NP -> Pronoun • | Proper-noun • | Det Nominal • Nominal -> Nominal Noun • VP -> Verb • | Verb NP • | Verb NP PP • | Verb PP • PP -> Preposition NP
Lexicon for L0 • Noun -> flights | breeze | trip | morning • Verb -> is | prefer | like | need | want | fly • …
nltk.app.rdparser_app Lines 864-886 • def app(): • """ Create a recursive descent parser demo, using a simple grammar and text. • """ • from nltk import parse_cfg • grammar = parse_cfg(""" • # Grammatical productions. • S -> NP VP • NP -> Det N PP | Det N • VP -> V NP PP | V NP | V • PP -> P NP • # Lexical productions. • NP -> 'I' • Det -> 'the' | 'a' • N -> 'man' | 'park' | 'dog' | 'telescope' • V -> 'ate' | 'saw' • P -> 'in' | 'under' | 'with' • """)
Example nltk.app.rdparser • import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import * • from nltk.app.rdparser_app import RecursiveDescentApp • # grammar is the parse_cfg grammar defined in app() on the previous slide • sent = 'the dog saw a man in the park'.split() • RecursiveDescentApp(grammar, sent).mainloop()
Example nltk.app.srparser • #import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import * • from nltk import parse_cfg • from nltk.app import * • nltk.app.srparser()