CSCE 590 Web Scraping - NLTK

Topics • Introduction to NLTK • Parsing with the NLTK • Readings: online book • February 21, 2017

http://www.nltk.org/book/ • 0. Preface • 1. Language Processing and Python • 2. Accessing Text Corpora and Lexical Resources • 3. Processing Raw Text • 4. Writing Structured Programs • 5. Categorizing and Tagging Words (minor fixes still required) • 6. Learning to Classify Text • 7. Extracting Information from Text • 8. Analyzing Sentence Structure • 9. Building Feature Based Grammars • 10. Analyzing the Meaning of Sentences (minor fixes still required) • 11. Managing Linguistic Data (minor fixes still required) • 12. Afterword: Facing the Language Challenge • Bibliography • Term Index • http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk

Installing NLTK • Install Setuptools: http://pypi.python.org/pypi/setuptools • Install Pip: run sudo easy_install pip • Install Numpy (optional): run sudo pip install -U numpy • Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk • Test installation: run python then type import nltk

Installing NLTK Data • >>> import nltk • >>> nltk.download()
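The two install steps above can be checked from Python: does the package import, and which data packages are still missing? This is a minimal sketch, not part of the slides. The helper name `nltk_status` and the resource list are assumptions, and resource names differ between NLTK releases, so adjust the list to your install.

```python
import importlib.util

# Resource paths are assumptions; adjust them to your NLTK version.
RESOURCES = ["corpora/brown", "tokenizers/punkt"]

def nltk_status(resources=RESOURCES):
    """Report whether NLTK is importable and which data resources are missing."""
    if importlib.util.find_spec("nltk") is None:
        return {"installed": False, "version": None, "missing": list(resources)}
    import nltk
    missing = []
    for path in resources:
        try:
            nltk.data.find(path)      # raises LookupError when the data is absent
        except LookupError:
            missing.append(path)
    return {"installed": True, "version": nltk.__version__, "missing": missing}

status = nltk_status()
print(status)
```

Anything reported as missing can be fetched by package name, for example `nltk.download('brown')`, instead of opening the interactive downloader.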

Test NLTK Installation • 1) Test Brown Corpus: • >>> from nltk.corpus import brown • >>> brown.words()[0:10] • ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of'] • >>> brown.tagged_words()[0:10] • [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')] • >>> len(brown.words()) • 1161192

Sentence Tokenization (sentence boundary detection, sentence segmentation), Word Tokenization, and POS Tagging • >>> from nltk import sent_tokenize, word_tokenize, pos_tag • >>> text = "Machine learning …" • >>> sents = sent_tokenize(text) • >>> sents • >>> tokens = word_tokenize(text) • >>> tokens

Part of Speech Tagging • >>> len(tokens) • 161 • >>> tagged_tokens = pos_tag(tokens) • >>> tagged_tokens • [('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …
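The output of pos_tag is a plain list of (word, tag) tuples, so it can be processed with ordinary Python. The sketch below uses only the first ten pairs shown on the slide; the variable names `tag_counts`, `nouns`, and `verbs` are illustrative. It counts the tags and pulls out nouns and verbs by tag prefix.

```python
from collections import Counter

# First ten (word, tag) pairs copied from the pos_tag output on the slide.
tagged_tokens = [
    ("Machine", "NN"), ("learning", "NN"), ("is", "VBZ"), ("the", "DT"),
    ("science", "NN"), ("of", "IN"), ("getting", "VBG"), ("computers", "NNS"),
    ("to", "TO"), ("act", "VB"),
]

# How often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in tagged_tokens)

# Penn Treebank noun tags start with "NN", verb tags with "VB".
nouns = [word for word, tag in tagged_tokens if tag.startswith("NN")]
verbs = [word for word, tag in tagged_tokens if tag.startswith("VB")]

print(tag_counts.most_common(1))   # [('NN', 3)]
print(nouns)                       # ['Machine', 'learning', 'science', 'computers']
print(verbs)                       # ['is', 'getting', 'act']
```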

Parsing

Recursive Descent Parsing with NLTK • Parsers • nltk.parse_cfg(grammar) # build a CFG from a grammar string (NLTK 2; in NLTK 3 use nltk.CFG.fromstring) • nltk.ChartParser(g) # build a chart parser from grammar g • nltk.RecursiveDescentParser(g) # build a recursive descent parser from grammar g • nltk.app.rdparser_app.RecursiveDescentApp • nltk.app.srparser_app.ShiftReduceApp • Imports • import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import *

The ChartParser program • sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'] • print sent • parser = nltk.ChartParser(groucho_grammar) • trees = parser.nbest_parse(sent) • for tree in trees: • print tree

Groucho Output • ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'] • (S • (NP I) • (VP • (V shot) • (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas)))))) • (S • (NP I) • (VP • (VP (V shot) (NP (Det an) (N elephant))) • (PP (P in) (NP (Det my) (N pajamas)))))
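The ChartParser slide uses groucho_grammar without defining it. The sketch below fills that gap with an assumed grammar, modeled on the Groucho Marx example in chapter 8 of the NLTK book, and is written for the NLTK 3 API, where `nltk.CFG.fromstring` builds the grammar and `parse()` takes the place of `nbest_parse()`.

```python
import nltk

# Assumed grammar: the slides do not show groucho_grammar; this version is
# modeled on the example in chapter 8 of the NLTK book.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)

# parse() returns an iterator over all parse trees for the sentence.
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
```

The sentence is ambiguous under this grammar: the PP "in my pajamas" can attach to the noun phrase (NP -> Det N PP) or to the verb phrase (VP -> VP PP), which is why two trees come back.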

Loading grammars • # NLTK - mygrammar.cfg - to illustrate loading of grammars • # grammar1 = nltk.data.load('file:mygrammar.cfg') • S -> NP VP • VP -> V NP • NP -> N | DET N • N -> 'Mary' | 'Bob' | 'dog' • V -> 'saw' • DET -> 'the' | 'a'

Example loading “mygrammar.cfg” • grammar1 = nltk.data.load('file:mygrammar.cfg') • print grammar1 • sent = "Mary saw Bob".split() • print sent • rd_parser = nltk.RecursiveDescentParser(grammar1) • for tree in rd_parser.nbest_parse(sent): • print tree
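For readers on NLTK 3 and Python 3, the loading example can be sketched as follows. To keep the block self-contained it writes the grammar to a temporary file first; the temporary directory and the name `GRAMMAR_TEXT` are assumptions made for illustration, while `nltk.data.load` with a `file:` prefix is the call used on the slide.

```python
import os
import tempfile
import nltk

# Contents of mygrammar.cfg as shown on the "Loading grammars" slide.
GRAMMAR_TEXT = """\
# mygrammar.cfg - to illustrate loading of grammars
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mygrammar.cfg")
    with open(path, "w") as f:
        f.write(GRAMMAR_TEXT)
    # The .cfg extension tells nltk.data.load to build a CFG object.
    grammar1 = nltk.data.load("file:" + path)

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)

# NLTK 3: parse() replaces nbest_parse() and returns an iterator of trees.
trees = list(rd_parser.parse(sent))
for tree in trees:
    print(tree)
```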

Checking the grammar • # to dump the grammar • grammar1 = nltk.data.load('file:mygrammar.cfg') • print grammar1 • # or you can iterate through the productions • for p in grammar1.productions(): print p

Extending the grammar • sent = 'Mary saw a cat'.split() • for t in rd_parser.nbest_parse(sent): • print t • Traceback (most recent call last): • File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module> • for t in rd_parser.nbest_parse(sent): • File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse • self._grammar.check_coverage(tokens) • File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage • "input words: %r." % missing) • ValueError: Grammar does not cover some of the input words: "'cat'".
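The traceback says that 'cat' appears in no production, so the remedy is to add it to the N rule. A minimal sketch of that fix under NLTK 3 follows. The helper `build_grammar` and the name `BASE_RULES` are hypothetical; `check_coverage` is the grammar method named in the traceback.

```python
import nltk

# Rules of the toy grammar except the noun list, which is filled in below.
BASE_RULES = """
S -> NP VP
VP -> V NP
NP -> N | DET N
V -> 'saw'
DET -> 'the' | 'a'
"""

def build_grammar(nouns):
    """Build the toy grammar with the given nouns as the N rule (hypothetical helper)."""
    n_rule = "N -> " + " | ".join("'%s'" % n for n in nouns)
    return nltk.CFG.fromstring(BASE_RULES + n_rule + "\n")

sent = "Mary saw a cat".split()

# Original grammar: 'cat' is not covered, so check_coverage raises ValueError.
grammar1 = build_grammar(["Mary", "Bob", "dog"])
try:
    grammar1.check_coverage(sent)
    covered = True
except ValueError as err:
    covered = False
    print("not covered:", err)

# Extended grammar: with 'cat' added to N, the sentence parses.
grammar2 = build_grammar(["Mary", "Bob", "dog", "cat"])
trees = list(nltk.RecursiveDescentParser(grammar2).parse(sent))
for tree in trees:
    print(tree)
```

Calling `check_coverage` before parsing gives a chance to report unknown words cleanly instead of letting the parser raise.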

Tracing • RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, then the parser will report the steps that it takes as it parses a text. • rd_parser = nltk.RecursiveDescentParser(grammar1, 2) • Parsing 'Mary saw a dog' • [ * S ] • E [ * NP VP ] • E [ * N VP ] • E [ * 'Mary' VP ] • M [ 'Mary' * VP ] • E [ 'Mary' * V NP ] • E [ 'Mary' * 'saw' NP ] • M [ 'Mary' 'saw' * NP ] • E [ 'Mary' 'saw' * N ] • E [ 'Mary' 'saw' * 'Mary' ] • E [ 'Mary' 'saw' * 'Bob' ] • E [ 'Mary' 'saw' * 'dog' ] • E [ 'Mary' 'saw' * DET N ] • E [ 'Mary' 'saw' * 'the' N ] • … • (S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))
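In the trace, E marks an expand step (a nonterminal is replaced by one of its right-hand sides) and M marks a match step (the next terminal agrees with the next input word). The sketch below is not NLTK's implementation. It is a small pure-Python recursive descent search over the same toy grammar, with the assumed names `GRAMMAR` and `count_parses`, and it records the same kind of E and M steps.

```python
# Toy grammar from mygrammar.cfg as a dict: nonterminal -> list of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["N"], ["DET", "N"]],
    "N": [["Mary"], ["Bob"], ["dog"]],
    "V": [["saw"]],
    "DET": [["the"], ["a"]],
}

def count_parses(frontier, tokens, log):
    """Count derivations of `tokens` from `frontier`, logging E/M steps in `log`."""
    if not frontier:
        # Success only if every input word has been consumed.
        return 1 if not tokens else 0
    first, rest = frontier[0], frontier[1:]
    if first in GRAMMAR:
        # Expand: try each alternative for the leftmost nonterminal.
        total = 0
        for rhs in GRAMMAR[first]:
            log.append(("E", first, tuple(rhs)))
            total += count_parses(list(rhs) + rest, tokens, log)
        return total
    # Match: the leftmost symbol is a terminal and must equal the next word.
    if tokens and tokens[0] == first:
        log.append(("M", first))
        return count_parses(rest, tokens[1:], log)
    return 0  # mismatch: backtrack

log = []
n_parses = count_parses(["S"], "Mary saw a dog".split(), log)
print(n_parses)
for step in log[:4]:
    print(step)
```

The first logged steps follow the slide's trace: expand S, expand NP to N, expand N to 'Mary', then match 'Mary'. A naive search like this would not terminate on a left-recursive rule such as VP -> VP PP.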

Example nltk.app.rdparser • import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import * • from nltk.app.rdparser_app import RecursiveDescentApp • sent = 'the dog saw a man in the park'.split() • RecursiveDescentApp(grammar, sent).mainloop() # grammar must be a CFG that covers every word in sent

Example nltk.app.srparser • #import string • import nltk • from nltk import parse, tokenize, Tree, in_idle • from nltk.draw.util import * • from nltk.draw.tree import * • from nltk.draw.cfg import * • from nltk import parse_cfg • from nltk.app import * • nltk.app.srparser()
