Web Scraping Lecture 7 • Topics • More Beautiful Soup • Crawling • Crawling by Hand • Readings: • Chapter 3 January 26, 2017
Overview • Last Time: Lecture 6 Slides 1-29 • BeautifulSoup revisited • Crawling • Today: • Chapter 3: Lecture 6 Slides 29-40 • 3-crawlSite.py • 4-getExternalLinks.py • 5-getAllExternalLinks.py • Warnings • Chapter 4 • APIs • JSON • JavaScript • References • Scrapy site: https://doc.scrapy.org/en/latest/
Getting the code from the text again! • https://github.com/REMitchell/python-scraping
Warnings • Chapter 2 • p. 26 Regular expressions are not always regular • Chapter 3 • p. 35 Handle your exceptions • p. 38 Recursion limit – default depth limit of 1000 • p. 40 Multiple statements in one try block can make it unclear which statement raised the exception • p. 41 "Unknown Waters Ahead" – be prepared to run into sites that are not reputable • p. 43 "Don't put example programs into production"
Regular Expressions are not always regular • ls a*c // Unix shell glob • dir T*c // Windows cmd wildcard • Shell globs are wildcard patterns, not true regular expressions • Regex dialects also differ across tools: the POSIX standard vs. sh, bash, csh, and perl • A Python sketch contrasting globs with regexes follows below
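To make the distinction concrete, here is a minimal Python sketch (the file names are made up for illustration): the glob a*c and the regex a.*c select the same names, but the syntaxes are different languages, and fnmatch.translate() shows how Python rewrites a glob into a regex.

import fnmatch
import re

names = ["abc", "ac", "aXXc", "cab"]

# Glob-style match, as the shell would expand a*c
print(fnmatch.filter(names, "a*c"))                    # ['abc', 'ac', 'aXXc']

# Equivalent regular expression: * must follow something to repeat
print([n for n in names if re.fullmatch("a.*c", n)])   # ['abc', 'ac', 'aXXc']

# How Python rewrites the glob as a regex (exact string varies by version)
print(fnmatch.translate("a*c"))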
Recursion limit = 1000 • Python's default recursion depth is 1000; a runaway recursion (or a function like Ackermann) raises RecursionError. Two demonstrations:

# Fibonacci: each call shifts (current, previous) forward one term,
# so the depth is only the number of Fibonacci terms below limit
# (about 30 frames for limit = 1,000,000) -- well within the default.
def fib(current=1, previous=0, limit=100):
    new = current + previous
    print(new)
    if new < limit:
        fib(previous, new, limit)

fib(1, 0, 1000000)
print("completed")

# Ackermann: grows explosively; ackermann(3, 5) = 253 still completes,
# but ackermann(4, 2) recurses far past the limit and raises RecursionError.
def ackermann(m, n):
    if m == 0:
        return n + 1
    elif m > 0 and n == 0:
        return ackermann(m - 1, 1)
    elif m > 0 and n > 0:
        return ackermann(m - 1, ackermann(m, n - 1))
    else:
        print("Should not reach here, unless bad arguments are passed.")

print("ackermann(3,5) =", ackermann(3, 5))
print("ackermann(4,2) =", ackermann(4, 2))   # exceeds the recursion limit
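A minimal sketch of inspecting and adjusting the limit; raising it only postpones the failure and consumes more C stack, so deep crawls should prefer an iterative loop over recursion:

import sys

print(sys.getrecursionlimit())   # 1000 by default in CPython
sys.setrecursionlimit(5000)      # permit deeper recursion, at the cost of stack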
Anaconda3 • Note: sudo apt-get install anaconda3 assumes a third-party repository; Anaconda is not in the standard Ubuntu archives. On stock Ubuntu, download the installer from https://www.anaconda.com/ and run it: • bash Anaconda3-<version>-Linux-x86_64.sh
Crawling with Scrapy • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.” • https://doc.scrapy.org/en/latest/intro/overview.html • Tutorial • https://doc.scrapy.org/en/latest/intro/tutorial.html • Install • pip install scrapy
Walk-through of the example spider • Starting the example: • $ scrapy startproject wikiSpider • Scrapy replies: “You can start your first spider with: • $ cd wikiSpider • $ scrapy genspider example example.com”
Walk-through of an example spider • $ scrapy startproject wikiSpider • This generates scrapy.cfg, the configuration file for the Scrapy project, and a wikiSpider/ package, the code directory for the new project
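The generated layout, roughly as it looked at the time of the book (newer Scrapy versions also add a middlewares.py):

wikiSpider/
    scrapy.cfg          # deployment/configuration file
    wikiSpider/         # the project's Python package
        __init__.py
        items.py        # item definitions (the Article class below)
        pipelines.py
        settings.py
        spiders/        # spider modules go here
            __init__.py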
from scrapy.selector import Selector   # imported in the book's listing, though unused below
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        # Build one Article item per page, holding just the <h1> title
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
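The Article item imported above lives in wikiSpider/items.py; in the book it is essentially:

from scrapy import Item, Field

class Article(Item):
    # The only field this project scrapes is the page title
    title = Field()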
Running our crawler • $ scrapy crawl article
Logging with Scrapy • Add a logging level to the file settings.py: • LOG_LEVEL = 'ERROR' • There are five levels of logging in Scrapy, listed here from most to least severe: • CRITICAL • ERROR • WARNING • INFO • DEBUG • Logging can also be redirected to a file from the command line: • $ scrapy crawl article -s LOG_FILE=wiki.log
Varying the format of the output • $ scrapy crawl article -o articles.csv -t csv • $ scrapy crawl article -o articles.json -t json • $ scrapy crawl article -o articles.xml -t xml
Chapter 4: Using APIs • API: in computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. • https://en.wikipedia.org/wiki/Application_programming_interface • A web API is an application programming interface (API) for either a web server or a web browser. • The request is made over HTTP • The response comes back as XML or JSON
ECMA-404 The JSON Data Interchange Standard • JSON (JavaScript Object Notation) is a lightweight data-interchange format. • It is easy for humans to read and write. • It is easy for machines to parse and generate. • It is based on a subset of the JavaScript Programming Language • JSON is built on two structures: • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array. • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
JSON objects • Similar to Python dictionaries • Example • { "first": "John", "last": "Donne", "Phone number": 7773452 }
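A minimal sketch of how this maps onto Python: json.loads turns the JSON object into a dict (and a JSON array would become a list):

import json

record = json.loads('{"first": "John", "last": "Donne", "Phone number": 7773452}')
print(type(record))            # <class 'dict'>
print(record["first"])         # John
print(record["Phone number"])  # 7773452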
JSON Numbers • Decimal integers and floats, with an optional exponent (e.g. 2.5e3) • Note: no hex and no octal literals, and no leading zeros
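A quick check of that rule in Python:

import json

print(json.loads("[255, 2.5e3, -0.1]"))   # [255, 2500.0, -0.1]
try:
    json.loads("0xFF")                    # hex is not valid JSON
except json.JSONDecodeError as err:
    print("rejected:", err)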
FreeGeoIP – Where is this IP address? • http://freegeoip.net/json/50.78.253.58 • {"ip":"50.78.253.58","country_code":"US","country_name":"United States","region_code":"MA","region_name":"Massachusetts","city":"Boston","zip_code":"02116","time_zone":"America/New_York","latitude":42.3496,"longitude":-71.0746,"metro_code":506} • http://freegeoip.net/json/129.252.11.130 • {"ip":"129.252.11.130","country_code":"US","country_name":"United States","region_code":"SC","region_name":"South Carolina","city":"Columbia","zip_code":"29208","time_zone":"America/New_York","latitude":33.9937,"longitude":-81.02,"metro_code":546}
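A sketch in the spirit of the book's example for calling such an API (note: freegeoip.net has since shut down, so running this today requires substituting a current geo-IP JSON endpoint):

import json
from urllib.request import urlopen

def get_country(ip_address):
    # Fetch the JSON document for this IP and pull out a single field
    response = urlopen("http://freegeoip.net/json/" + ip_address).read().decode("utf-8")
    return json.loads(response).get("country_code")

print(get_country("50.78.253.58"))   # 'US'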
HTTP protocol revisited -- History • The term hypertext was coined by Ted Nelson in 1965 in the Xanadu Project, • which was in turn inspired by Vannevar Bush's vision (1930s) of the microfilm-based information retrieval and management "memex" system described in his essay As We May Think (1945). • Tim Berners-Lee and his team at CERN are credited with inventing the original HTTP along with HTML and the associated technology for a web server and a text-based web browser. • Berners-Lee first proposed the "WorldWideWeb" project in 1989 — now known as the World Wide Web. • The first version of the protocol had only one method, namely GET, which would request a page from a server.[3] • The response from the server was always an HTML page.[4] https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Get and Head Packets • GET • The GET method requests a representation of the specified resource. Requests using GET should only retrieve data and should have no other effect. • HEAD • The HEAD method asks for a response identical to that of a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
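A minimal sketch using Python's standard library (en.wikipedia.org is just a convenient public host): HEAD returns the same headers a GET would, but no body, which makes it a cheap way to check size or content type before downloading.

from http.client import HTTPSConnection

conn = HTTPSConnection("en.wikipedia.org")
conn.request("HEAD", "/wiki/Main_Page")
resp = conn.getresponse()
print(resp.status, resp.reason)        # e.g. 200 OK
print(resp.getheader("Content-Type"))  # e.g. text/html; charset=UTF-8
print(resp.read())                     # b'' -- a HEAD response carries no body
conn.close()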
Post and Put Packets • POST • The POST method requests that the server accept the entity enclosed in the request as a new subordinate of the web resource identified by the URI. The data POSTed might be, for example, an annotation for existing resources; a message for a bulletin board, newsgroup, mailing list, or comment thread; a block of data that is the result of submitting a web form to a data-handling process; or an item to add to a database.[14] • PUT • The PUT method requests that the enclosed entity be stored under the supplied URI. If the URI refers to an already existing resource, it is modified; if the URI does not point to an existing resource, then the server can create the resource with that URI.[15]
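A matching sketch for POST (httpbin.org is a public echo service, used here purely for illustration): passing data= to urlopen switches the request from GET to POST and sends the encoded form in the request body.

from urllib.parse import urlencode
from urllib.request import urlopen

form = urlencode({"name": "Ryan", "comment": "hello"}).encode("utf-8")
with urlopen("https://httpbin.org/post", data=form) as resp:  # data= makes it a POST
    print(resp.status)        # 200 if the POST was accepted
    print(resp.read()[:80])   # httpbin echoes the submitted form back as JSON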
Other HTTP commands • DELETE • The DELETE method deletes the specified resource. • TRACE • The TRACE method echoes the received request so that a client can see what (if any) changes or additions have been made by intermediate servers. • OPTIONS • The OPTIONS method returns the HTTP methods that the server supports for the specified URL. This can be used to check the functionality of a web server by requesting '*' instead of a specific resource. • CONNECT • The CONNECT method converts the request connection to a transparent TCP/IP tunnel, usually to facilitate SSL-encrypted communication (HTTPS) through an unencrypted HTTP proxy.[16][17][18] See HTTP CONNECT tunneling. • PATCH • The PATCH method applies partial modifications to a resource.
Authentication • APIs identify users with tokens, e.g. for billing and rate limiting • A token can be passed as a URL parameter: • http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>&name=guns%20n%27%20roses&format=json&start=0&results=100 • Or sent as a request header using urlopen:

from urllib.request import Request, urlopen

token = "<your api key>"
webRequest = Request("http://myapi.com", headers={"token": token})
html = urlopen(webRequest)
XML versus JSON • XML • <user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user> • which clocks in at 98 characters • JSON: • {"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}} • which clocks in at 73 characters
XML versus JSON, now prettified • XML:

<user>
    <firstname>Ryan</firstname>
    <lastname>Mitchell</lastname>
    <username>Kludgist</username>
</user>

• JSON:

{"user":
    {"firstname": "Ryan",
     "lastname": "Mitchell",
     "username": "Kludgist"}
}
Syntax of API calls • When retrieving data through a GET request: • the URL path describes how you would like to drill down into the data, • while the query parameters serve as filters or additional requests tacked onto the search. • Example • http://socialmediasite.com/users/1234/posts?from=08012014&to=08312014
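A minimal sketch of building such a call in Python (socialmediasite.com and its parameters are the slide's hypothetical example):

from urllib.parse import urlencode

base = "http://socialmediasite.com/users/1234/posts"         # path: drill down to user 1234's posts
params = urlencode({"from": "08012014", "to": "08312014"})   # query string: filter by date range
print(base + "?" + params)
# http://socialmediasite.com/users/1234/posts?from=08012014&to=08312014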
Echo Nest • Web-scrapes to identify music, instead of relying on human tagging as Pandora does • http://developer.echonest.com/api/v4/artist/search?api_key=<your api key>&name=monty%20python • This produces the following result: • {"response": {"status": {"version": "4.2", "code": 0, "message": "Success"}, "artists": [{"id": "AR5HF791187B9ABAF4", "name": "Monty Python"}, {"id": "ARWCIDE13925F19A33", "name": "Monty Python's SPAMALOT"}, {"id": "ARVPRCC12FE0862033", "name": "Monty Python's Graham Chapman"}]}} • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1438-1447). O'Reilly Media. Kindle Edition.
Twitter • Twitter is notoriously protective of its API, and rightfully so: with • over 230 million active users • and revenue of over $100 million a month, • the company is hesitant to let just anyone come along and have any data they want. • Twitter's rate limits (the number of calls it allows each user to make) fall into two categories: • 15 calls per 15-minute period, and • 180 calls per 15-minute period, depending on the type of call. • For instance, you can make up to 12 calls a minute to retrieve basic information about Twitter users, but only one call a minute to retrieve lists of those users' Twitter followers.
Yield in Python • Two fragments from a k-d-tree-style search (Stack Overflow's classic yield example); the first is a generator method, the second is the caller that consumes it:

def _get_child_candidates(self, distance, min_dist, max_dist):
    # Generator: lazily yields whichever children could contain matches
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

# Caller (a fragment: obj, min_dist, and max_dist come from enclosing code)
result, candidates = list(), [self]
while candidates:
    node = candidates.pop()
    distance = node._get_dist(obj)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result

http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
Iterators and Generators • When you create a list, you can read its items one by one; reading its items one by one is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print(i)
1
2
3

• Generators are iterators, but you can only iterate over them once, because they do not store all the values in memory; they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print(i)
0
1
4
Yield • yield is a keyword used like return, except the function returns a generator:

>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()  # create a generator
>>> print(mygenerator)               # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4
Next Time: Requests Library and DB • Requests: HTTP for Humans

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}