
CSCE 590 Web Scraping - NLTK

CSCE 590 Web Scraping - NLTK. Topics: Introduction to NLTK; Parsing with the NLTK. Readings: Online book, http://www.nltk.org/book/ (0. Preface; 1. Language Processing and Python; 2. Accessing Text Corpora and Lexical Resources; 3. Processing Raw Text). February 21, 2017.

elynch

Presentation Transcript


  1. CSCE 590 Web Scraping - NLTK • Topics • Introduction to NLTK • Parsing with the NLTK • Readings: • Online book February 21, 2017

  2. http://www.nltk.org/book/ • 0. Preface • 1. Language Processing and Python • 2. Accessing Text Corpora and Lexical Resources • 3. Processing Raw Text • 4. Writing Structured Programs • 5. Categorizing and Tagging Words (minor fixes still required) • 6. Learning to Classify Text • 7. Extracting Information from Text • 8. Analyzing Sentence Structure • 9. Building Feature Based Grammars • 10. Analyzing the Meaning of Sentences (minor fixes still required) • 11. Managing Linguistic Data (minor fixes still required) • 12. Afterword: Facing the Language Challenge • Bibliography • Term Index • http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk

  3. Installing NLTK • Install Setuptools: http://pypi.python.org/pypi/setuptools • Install Pip: run sudo easy_install pip • Install Numpy (optional): run sudo pip install -U numpy • Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk • Test installation: run python then type import nltk
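
The last installation step can also be checked from a script instead of the interactive prompt. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of NLTK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be located by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # PyYAML is imported under the name "yaml".
    for name in ("nltk", "numpy", "yaml"):
        status = "found" if is_installed(name) else "missing"
        print(name, status)
```

A "missing" line points back at the corresponding pip step above.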

  4. Installing NLTK Data • >>> import nltk • >>> nltk.download()

  5. Test NLTK Installation
      1) Test Brown Corpus:
      >>> from nltk.corpus import brown
      >>> brown.words()[0:10]
      ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
      >>> brown.tagged_words()[0:10]
      [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
      >>> len(brown.words())
      1161192
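
The tagged output pairs each word with a Brown-corpus tag. As a small illustration that needs no corpus download, the sketch below counts the tags in the ten pairs copied from the slide; with the corpus installed, the same counting should apply to `brown.tagged_words()`. The variable names are chosen for illustration.

```python
from collections import Counter

# The ten (word, tag) pairs shown on the slide.
sample = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
    ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
    ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _word, tag in sample)

for tag, count in tag_counts.most_common():
    print(tag, count)
```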

  6. Sentence tokenizing (sentence boundary detection, sentence segmentation), word tokenizing and POS tagging
      >>> from nltk import sent_tokenize, word_tokenize, pos_tag
      >>> text = "Machine learning …"
      >>> sents = sent_tokenize(text)
      >>> sents
      >>> tokens = word_tokenize(text)
      >>> tokens

  7. Part of Speech Tagging
      >>> len(tokens)
      161
      >>> tagged_tokens = pos_tag(tokens)
      >>> tagged_tokens
      [('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …
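
Once tokens are tagged, a common next step is to select words by tag. The sketch below works on the first ten pairs shown on the slide and relies on the Penn Treebank convention that noun tags begin with `NN` and verb tags with `VB`; the helper `words_with_tag_prefix` is a hypothetical name, not an NLTK function.

```python
# First ten (word, tag) pairs from the pos_tag output on the slide.
tagged_tokens = [
    ('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'),
    ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'),
    ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'),
]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix, in order."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged_tokens, 'NN')
verbs = words_with_tag_prefix(tagged_tokens, 'VB')
print(nouns)
print(verbs)
```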

  8. Parsing

  9. Recursive Descent Parsing with NLTK • Parsers • nltk.parse_cfg( grammar) # build cfg • nltk.ChartParser(g) • nltk.RecursiveDescentParser(g) # build parser from grammar • nltk.app.rdparser_app.RecursiveDescentApp • nltk.app.srparser_app.ShiftReduceApp • Imports • import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import *

  10. Groucho Grammar
      groucho_grammar = nltk.parse_cfg("""
      S -> NP VP
      PP -> P NP
      NP -> Det N | Det N PP | 'I'
      VP -> V NP | VP PP
      Det -> 'an' | 'my'
      N -> 'elephant' | 'pajamas'
      V -> 'shot'
      P -> 'in'
      """)

  11. The ChartParser program
      sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
      print sent
      parser = nltk.ChartParser(groucho_grammar)
      trees = parser.nbest_parse(sent)
      for tree in trees:
          print tree

  12. Groucho Output
      ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
      (S
        (NP I)
        (VP
          (V shot)
          (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
      (S
        (NP I)
        (VP
          (VP (V shot) (NP (Det an) (N elephant)))
          (PP (P in) (NP (Det my) (N pajamas)))))
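
The Groucho program above is written for NLTK 2 on Python 2. In NLTK 3, `nltk.parse_cfg` was replaced by `nltk.CFG.fromstring`, and `nbest_parse` by `parse`, which returns an iterator over trees. A sketch of the same program under that assumption (NLTK 3 on Python 3):

```python
import nltk

# Same grammar as on the slide, built with the NLTK 3 constructor.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)

# parse() is lazy, so collect the trees into a list before counting them.
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
```

The sentence is ambiguous under this grammar, so more than one tree is printed: the prepositional phrase can attach either to the noun phrase or to the verb phrase.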

  13. Loading grammars
      # NLTK - mygrammar.cfg - to illustrate loading of grammars
      # grammar1 = nltk.data.load('file:mygrammar.cfg')
      S -> NP VP
      VP -> V NP
      NP -> N | DET N
      N -> 'Mary' | 'Bob' | 'dog'
      V -> 'saw'
      DET -> 'the' | 'a'

  14. Example loading “mygrammar.cfg”
      grammar1 = nltk.data.load('file:mygrammar.cfg')
      print grammar1
      sent = "Mary saw Bob".split()
      print sent
      rd_parser = nltk.RecursiveDescentParser(grammar1)
      for tree in rd_parser.nbest_parse(sent):
          print tree
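
The loading example assumes that `mygrammar.cfg` already exists in the working directory. The sketch below is self-contained: it first writes the grammar file to a temporary directory (an assumption made only so the example can run anywhere, with a POSIX-style path), then loads it with `nltk.data.load` and parses with the NLTK 3 method `parse` in place of `nbest_parse`.

```python
import os
import tempfile

import nltk

GRAMMAR_TEXT = """\
# mygrammar.cfg - to illustrate loading of grammars
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the grammar to disk so it can be loaded through a file: URL.
    path = os.path.join(tmp_dir, "mygrammar.cfg")
    with open(path, "w") as handle:
        handle.write(GRAMMAR_TEXT)
    # The .cfg extension tells nltk.data.load to build a CFG object.
    grammar1 = nltk.data.load("file:" + path)

print(grammar1)

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
trees = list(rd_parser.parse(sent))
for tree in trees:
    print(tree)
```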

  15. Checking the grammar
      # to dump the grammar
      grammar1 = nltk.data.load('file:mygrammar.cfg')
      print grammar1
      # or you can iterate through the productions
      for p in grammar1.productions():
          print p

  16. Extending the grammar
      sent = 'Mary saw a cat'.split()
      for t in rd_parser.nbest_parse(sent):
          print t

      Traceback (most recent call last):
        File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module>
          for t in rd_parser.nbest_parse(sent):
        File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse
          self._grammar.check_coverage(tokens)
        File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage
          "input words: %r." % missing)
      ValueError: Grammar does not cover some of the input words: "'cat'".
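
The traceback shows that the parser calls `check_coverage` on the grammar before parsing and raises `ValueError` for words the grammar lacks. One possible way to handle this, sketched here for NLTK 3, is to run the same check explicitly and then extend the lexicon. The helper `parse_or_report` and the added rule for `'cat'` are illustrative assumptions, not part of the original slides.

```python
import nltk

BASE_RULES = """
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""


def parse_or_report(grammar, tokens):
    """Return a list of parse trees, or None if some token is not in the grammar."""
    try:
        grammar.check_coverage(tokens)
    except ValueError as err:
        print("Not covered:", err)
        return None
    return list(nltk.RecursiveDescentParser(grammar).parse(tokens))


sent = 'Mary saw a cat'.split()

# The original grammar has no rule for 'cat', so the coverage check fails.
grammar1 = nltk.CFG.fromstring(BASE_RULES)
before = parse_or_report(grammar1, sent)

# Extending the lexicon with one more N production fixes the coverage error.
extended = nltk.CFG.fromstring(BASE_RULES + "N -> 'cat'\n")
after = parse_or_report(extended, sent)
for tree in after:
    print(tree)
```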

  17. Tracing
      RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, then the parser will report the steps that it takes as it parses a text.
      rd_parser = nltk.RecursiveDescentParser(grammar1, 2)

      Parsing 'Mary saw a dog'
          [ * S ]
        E [ * NP VP ]
        E [ * N VP ]
        E [ * 'Mary' VP ]
        M [ 'Mary' * VP ]
        E [ 'Mary' * V NP ]
        E [ 'Mary' * 'saw' NP ]
        M [ 'Mary' 'saw' * NP ]
        E [ 'Mary' 'saw' * N ]
        E [ 'Mary' 'saw' * 'Mary' ]
        E [ 'Mary' 'saw' * 'Bob' ]
        E [ 'Mary' 'saw' * 'dog' ]
        E [ 'Mary' 'saw' * DET N ]
        E [ 'Mary' 'saw' * 'the' N ]
        …
      (S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))
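
In the trace, E marks an expand step (a nonterminal is replaced by one of its right-hand sides) and M marks a match step (the next terminal agrees with the next input word). To make those two operations concrete, here is a small pure-Python recognizer for the same grammar that records its steps. It is a simplified sketch written for this explanation, not NLTK's implementation: it only answers yes or no, builds no tree, and the names `GRAMMAR` and `recognize` are made up.

```python
# The mygrammar.cfg rules as a dictionary: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["N"], ["DET", "N"]],
    "N": [["Mary"], ["Bob"], ["dog"]],
    "V": [["saw"]],
    "DET": [["the"], ["a"]],
}


def recognize(symbols, tokens, log):
    """Return True if the symbol list can derive exactly the token list.

    Every expand (E) and match (M) step is appended to log, including
    steps on branches that are later abandoned by backtracking.
    """
    if not symbols:
        return not tokens  # success only if all input was consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Expand: try each production for the leftmost nonterminal in turn.
        for rhs in GRAMMAR[first]:
            log.append(("E", first, tuple(rhs)))
            if recognize(rhs + rest, tokens, log):
                return True
        return False
    # Match: the leftmost symbol is a terminal and must equal the next word.
    if tokens and tokens[0] == first:
        log.append(("M", first))
        return recognize(rest, tokens[1:], log)
    return False


steps = []
accepted = recognize(["S"], "Mary saw a dog".split(), steps)
print(accepted)
for step in steps:
    print(step)
```

Because the parser expands the leftmost nonterminal first, a left-recursive rule would make this kind of recognizer recurse without consuming input; the grammar used here has no such rule.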

  18. Example grammar L0 based on the ATIS corpus
      S -> NP VP
      NP -> Pronoun
          | Proper-noun
          | Det Nominal
      Nominal -> Nominal Noun
      VP -> Verb
          | Verb NP
          | Verb NP PP
          | Verb PP
      PP -> Preposition NP

  19. Lexicon for L0
      Noun -> flights | breeze | trip | morning
      Verb -> is | prefer | like | need | want | fly
      …

  20. nltk.app.rdparser_app Lines 864-886
      def app():
          """
          Create a recursive descent parser demo, using a simple grammar and text.
          """
          from nltk import parse_cfg
          grammar = parse_cfg("""
          # Grammatical productions.
              S -> NP VP
              NP -> Det N PP | Det N
              VP -> V NP PP | V NP | V
              PP -> P NP
          # Lexical productions.
              NP -> 'I'
              Det -> 'the' | 'a'
              N -> 'man' | 'park' | 'dog' | 'telescope'
              V -> 'ate' | 'saw'
              P -> 'in' | 'under' | 'with'
          """)

  21. Example nltk.app.rdparser
      import string
      import nltk
      from nltk import parse, tokenize, Tree, in_idle
      from nltk.draw.util import *
      from nltk.draw.tree import *
      from nltk.draw.cfg import *
      from nltk.app.rdparser_app import RecursiveDescentApp
      # grammar is the CFG built with parse_cfg on the previous slide
      sent = 'the dog saw a man in the park'.split()
      RecursiveDescentApp(grammar, sent).mainloop()

  22. Example nltk.app.srparser
      #import string
      import nltk
      from nltk import parse, tokenize, Tree, in_idle
      from nltk.draw.util import *
      from nltk.draw.tree import *
      from nltk.draw.cfg import *
      from nltk import parse_cfg
      from nltk.app import *
      nltk.app.srparser()
