Explore web scraping techniques using Beautiful Soup and Scrapy. Learn about crawling websites, handling exceptions, recursion limits, and more. Dive into APIs, JSON, and JavaScript data interchange. Includes tutorials and examples.
Web Scraping Lecture 7 • Topics • More Beautiful Soup • Crawling • Crawling by Hand • Readings: • Chapter 3 • January 26, 2017
Overview • Last Time: Lecture 6 Slides 1-29 • BeautifulSoup revisited • Crawling • Today: • Chapter 3: Lecture 6 Slides 29-40 • 3-crawlSite.py - • 4-getExternalLinks.py – • 5-getAllExternalLinks.py – • Warnings • Chapter 4 • APIs • JSON • Javascript • References • Scrapy site:
Getting the code from the text again! • https://github.com/REMitchell/python-scraping
Warnings • Chapter 2 • p. 26 Regular expressions are not always regular • Chapter 3 • p. 35 Handle your exceptions • p. 38 Recursion limit: Python's default depth limit is 1000 (ridiculous) • p. 40 Multiple elements in a try block might lead to confusion as to which one caused the exception • p. 41 "Unknown Waters Ahead": be prepared to run into sites that are not respectable • p. 43 "Don't put example programs into production"
Regular Expressions are not always regular • ls a*c // Unix command line; • dir T*c // Windows cmd • POSIX Standard • sh, bash, csh • perl
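The shell patterns on this slide look like regular expressions but follow glob rules, where * means "any run of characters" rather than "zero or more of the preceding item". A minimal sketch of the difference, using Python's standard fnmatch and re modules (the sample file names are made up for illustration):

```python
import fnmatch
import re

PATTERN = "a*c"  # the same text, read two different ways
names = ["abc", "ac", "aaac", "c"]  # hypothetical file names

# Shell-style glob: an "a", then anything, then a "c"
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, PATTERN)]

# Regular expression: zero or more "a" characters, then a "c"
regex_hits = [n for n in names if re.fullmatch(PATTERN, n)]

print("glob :", glob_hits)
print("regex:", regex_hits)
```

The two lists differ, which is the practical meaning of "not always regular": the dialect has to be checked before a pattern is reused.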
Recursion limit = 1000
# Fibonacci
def fib(current=1, previous=0, limit=100):
    new = current + previous
    print(new)
    if new < limit:
        fib(previous, new, limit)

fib(1, 0, 1000000)
print("completed")

# Ackermann
def ackermann(m, n):
    if m == 0:
        return n + 1
    elif m > 0 and n == 0:
        return ackermann(m - 1, 1)
    elif m > 0 and n > 0:
        return ackermann(m - 1, ackermann(m, n - 1))
    else:
        print("Should not reach here, unless bad arguments are passed.")

print("ackermann(3,5)=", ackermann(3, 5))
print("ackermann(4,2)=", ackermann(4, 2))  # exceeds the recursion limit
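To see the limit in action without crashing a script, the sketch below catches RecursionError and contrasts it with an iterative loop that needs no stack depth at all. The helper names (count_down, try_depth, fib_below) are illustrative and not from the text.

```python
import sys

def count_down(n):
    """Recurse n levels deep and return the depth reached."""
    return 0 if n == 0 else 1 + count_down(n - 1)

def try_depth(n):
    """Return the depth reached, or None if the recursion limit is hit."""
    try:
        return count_down(n)
    except RecursionError:
        return None

def fib_below(limit):
    """Iterative Fibonacci: no recursion, so no depth limit applies."""
    values, previous, current = [], 0, 1
    while current < limit:
        values.append(current)
        previous, current = current, previous + current
    return values

recursion_limit = sys.getrecursionlimit()  # 1000 by default in CPython
print(try_depth(50), try_depth(recursion_limit + 100))
print(fib_below(100))
```

For a crawler, the same idea applies: an explicit queue of pages to visit avoids the depth limit that a recursive link-follower runs into.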
Anaconda3 • sudo apt-get install anaconda3
Crawling with Scrapy • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.” • https://doc.scrapy.org/en/latest/intro/overview.html • Tutorial • https://doc.scrapy.org/en/latest/intro/tutorial.html • Download • pip install scrapy
Walk-through of the example spider • Starting the example • $ scrapy startproject wikiSpider • "You can start your first spider with: • $ cd wikiSpider • $ scrapy genspider example example.com"
Walk-through of an example spider • $ scrapy startproject wikiSpider • Configuration file for our scrapy projects • Code directory for our new project
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
Running our crawler • $ scrapy crawl article
Logging with Scrapy • Add a logging level to the file settings.py • LOG_LEVEL = 'ERROR' • There are five levels of logging in Scrapy, listed here from most to least severe: • CRITICAL • ERROR • WARNING • INFO • DEBUG • $ scrapy crawl article -s LOG_FILE=wiki.log
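Scrapy builds on Python's standard logging module, so the five level names carry the same numeric severities there. A small sketch, using only the standard library, of how a LOG_LEVEL setting filters messages (by_severity and is_shown are illustrative helpers, not Scrapy functions):

```python
import logging

# The five level names, as defined by the standard logging module
LEVELS = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]

def by_severity(names):
    """Sort level names from most to least severe by numeric value."""
    return sorted(names, key=lambda name: getattr(logging, name), reverse=True)

def is_shown(message_level, log_level):
    """A message is emitted only if its level is at least the configured one."""
    return getattr(logging, message_level) >= getattr(logging, log_level)

for name in LEVELS:
    print(name, getattr(logging, name), is_shown(name, "ERROR"))
```

With LOG_LEVEL set to ERROR, only CRITICAL and ERROR messages pass the check.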
Varying the format of the output • $ scrapy crawl article -o articles.csv -t csv • $ scrapy crawl article -o articles.json -t json • $ scrapy crawl article -o articles.xml -t xml
Chapter 4: Using APIs • API - In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. • https://en.wikipedia.org/wiki/Application_programming_interface • A web API is an application programming interface (API) for either a web server or a web browser. • The program sends its request over HTTP • The response comes back as XML or JSON
ECMA-404 The JSON Data Interchange Standard • JSON (JavaScript Object Notation) is a lightweight data-interchange format. • It is easy for humans to read and write. • It is easy for machines to parse and generate. • It is based on a subset of the JavaScript Programming Language • JSON is built on two structures: • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array. • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
JSON objects • Similar to Python dictionaries • Example • { "first" : "John", "last" : "Donne", "Phone number" : 7773452 }
JSON Numbers • Note no Hex and no octal
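The restriction on number formats can be checked with Python's standard json module, which rejects hexadecimal literals and leading-zero (octal-style) literals. The helper name is_valid_json is illustrative.

```python
import json

def is_valid_json(text):
    """Return True if text parses as a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

decimal_ok = is_valid_json("255")      # plain decimal integer
exponent_ok = is_valid_json("2.55e2")  # fraction and exponent are allowed
hex_ok = is_valid_json("0xFF")         # hexadecimal is not JSON
octal_ok = is_valid_json("0377")       # leading zeros are not JSON

print(decimal_ok, exponent_ok, hex_ok, octal_ok)
```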
FreeGeoIP – Where is this IP address? • http://freegeoip.net/json/50.78.253.58 • {"ip":"50.78.253.58","country_code":"US","country_name":"United States","region_code":"MA","region_name":"Massachusetts","city":"Boston","zip_code":"02116","time_zone":"America/New_York","latitude":42.3496,"longitude":-71.0746,"metro_code":506} • http://freegeoip.net/json/129.252.11.130 • {"ip":"129.252.11.130","country_code":"US","country_name":"United States","region_code":"SC","region_name":"South Carolina","city":"Columbia","zip_code":"29208","time_zone":"America/New_York","latitude":33.9937,"longitude":-81.02,"metro_code":546}
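Once a response like the ones above is in hand, the standard json module turns it into a dictionary. The sketch below parses a shortened copy of the first response as a string literal, so it makes no network call and does not assume the service is still reachable.

```python
import json

# Shortened copy of the first response shown on the slide
body = ('{"ip":"50.78.253.58","country_code":"US",'
        '"region_name":"Massachusetts","city":"Boston",'
        '"latitude":42.3496,"longitude":-71.0746}')

record = json.loads(body)  # JSON object -> Python dict
location = "{}, {}".format(record["city"], record["region_name"])
coords = (record["latitude"], record["longitude"])  # JSON numbers -> floats

print(location, coords)
```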
HTTP protocol revisited -- History • The term hypertext was coined by Ted Nelson in 1965 in the Xanadu Project, which was in turn inspired by Vannevar Bush's vision (1930s) of the microfilm-based information retrieval and management "memex" system described in his essay As We May Think (1945). • Tim Berners-Lee and his team at CERN are credited with inventing the original HTTP along with HTML and the associated technology for a web server and a text-based web browser. • Berners-Lee first proposed the "WorldWideWeb" project, now known as the World Wide Web, in 1989. • The first version of the protocol had only one method, namely GET, which would request a page from a server. • The response from the server was always an HTML page. • https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Get and Head Packets • GET • The GET method requests a representation of the specified resource. Requests using GET should only retrieve data and should have no other effect. • HEAD • The HEAD method asks for a response identical to that of a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
Post and Put Packets • POST • The POST method requests that the server accept the entity enclosed in the request as a new subordinate of the web resource identified by the URI. The data POSTed might be, for example, an annotation for existing resources; a message for a bulletin board, newsgroup, mailing list, or comment thread; a block of data that is the result of submitting a web form to a data-handling process; or an item to add to a database. • PUT • The PUT method requests that the enclosed entity be stored under the supplied URI. If the URI refers to an already existing resource, it is modified; if the URI does not point to an existing resource, then the server can create the resource with that URI.
Other HTTP commands • DELETE • The DELETE method deletes the specified resource. • TRACE • The TRACE method echoes the received request so that a client can see what (if any) changes or additions have been made by intermediate servers. • OPTIONS • The OPTIONS method returns the HTTP methods that the server supports for the specified URL. This can be used to check the functionality of a web server by requesting '*' instead of a specific resource. • CONNECT • The CONNECT method converts the request connection to a transparent TCP/IP tunnel, usually to facilitate SSL-encrypted communication (HTTPS) through an unencrypted HTTP proxy. See HTTP CONNECT tunneling. • PATCH • The PATCH method applies partial modifications to a resource.
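In Python's standard library the method is a property of the request object, so it can be inspected before anything is sent. A minimal sketch with urllib.request.Request, assuming a placeholder URL; no request actually goes over the network here.

```python
from urllib.request import Request

URL = "http://example.com/resource"  # placeholder URL, nothing is sent

get_req = Request(URL)                           # no body -> GET
post_req = Request(URL, data=b"comment=hello")   # body present -> POST
head_req = Request(URL, method="HEAD")           # headers only
put_req = Request(URL, data=b"new content", method="PUT")

methods = [r.get_method() for r in (get_req, post_req, head_req, put_req)]
print(methods)
```

Passing any of these objects to urlopen would perform the corresponding request.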
Authentication • Identifies users, e.g. for billing • http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>%20&name=guns%20n%27%20roses&format=json&start=0&results=100 • Using urlopen
import urllib.request
from urllib.request import urlopen

token = "<your api key>"
# Send the key in a request header instead of the URL
webRequest = urllib.request.Request("http://myapi.com",
                                    headers={"token": token})
html = urlopen(webRequest)
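The two ways of passing a key, in the query string or in a header, can be compared side by side without contacting a server. In this sketch the key value and the /artist/songs path on myapi.com are hypothetical; only the request objects are built.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_KEY = "my-secret-key"  # hypothetical key, for illustration only

# Option 1: the key travels in the query string; urlencode escapes the values
query = urlencode({"api_key": API_KEY, "name": "guns n' roses", "format": "json"})
url_with_key = "http://myapi.com/artist/songs?" + query  # hypothetical path

# Option 2: the key travels in a request header
req = Request("http://myapi.com", headers={"token": API_KEY})

print(url_with_key)
print(req.header_items())
```

Note that urllib capitalizes header names when storing them, so the header is looked up as "Token".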
XML versus JSON • XML • <user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user> • which clocks in at 98 characters, • JSON: • {"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}} • which clocks in at 73 characters
XML versus JSON; now prettified
XML
<user>
    <firstname>Ryan</firstname>
    <lastname>Mitchell</lastname>
    <username>Kludgist</username>
</user>
JSON:
{"user":
    {"firstname": "Ryan",
     "lastname": "Mitchell",
     "username": "Kludgist"}
}
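The character counts quoted for the compact forms (98 for XML, 73 for JSON) can be checked directly, along with the fact that both encodings carry the same three fields. A short sketch using only the standard library:

```python
import json
import xml.etree.ElementTree as ET

xml_text = ("<user><firstname>Ryan</firstname><lastname>Mitchell</lastname>"
            "<username>Kludgist</username></user>")
json_text = '{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}'

xml_len, json_len = len(xml_text), len(json_text)

# Parse both and reduce them to the same dictionary shape
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}
from_json = json.loads(json_text)["user"]

print(xml_len, json_len, from_xml == from_json)
```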
Syntax of API calls • When retrieving data through a GET request, • the URL path describes how you would like to drill down into the data, • while the query parameters serve as filters or additional requests tacked onto the search. • Example • http://socialmediasite.com/users/1234/posts?from=08012014&to=08312014
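The split between path (drill-down) and query parameters (filters) is what the standard urllib.parse module exposes. A minimal sketch that takes the example URL apart, with each query parameter written as a name=value pair:

```python
from urllib.parse import urlparse, parse_qs

url = "http://socialmediasite.com/users/1234/posts?from=08012014&to=08312014"
parts = urlparse(url)

# Path: which resource we are drilling down to
path_segments = parts.path.strip("/").split("/")

# Query: filters applied to that resource (each value arrives as a list)
filters = parse_qs(parts.query)

print(path_segments)
print(filters)
```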
Echo Nest • Uses web scraping to identify and categorize music, rather than human tagging as Pandora does • http://developer.echonest.com/api/v4/artist/search?api_key=<your api key>&name=monty%20python • This produces the following result: • {"response": {"status": {"version": "4.2", "code": 0, "message": "Success"}, "artists": [{"id": "AR5HF791187B9ABAF4", "name": "Monty Python"}, {"id": "ARWCIDE13925F19A33", "name": "Monty Python's SPAMALOT"}, {"id": "ARVPRCC12FE0862033", "name": "Monty Python's Graham Chapman"}]}} • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1438-1447). O'Reilly Media. Kindle Edition.
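A response of this shape is a nested dictionary once parsed: a status block plus a list of artist records. The sketch below works on the response text as a string literal rather than calling the API, so it does not assume the service is still available.

```python
import json

# Response text as shown on the slide
body = '''{"response": {"status": {"version": "4.2", "code": 0,
"message": "Success"}, "artists": [
{"id": "AR5HF791187B9ABAF4", "name": "Monty Python"},
{"id": "ARWCIDE13925F19A33", "name": "Monty Python's SPAMALOT"},
{"id": "ARVPRCC12FE0862033", "name": "Monty Python's Graham Chapman"}]}}'''

data = json.loads(body)["response"]
succeeded = data["status"]["code"] == 0       # check status before using data
names = [artist["name"] for artist in data["artists"]]

print(succeeded, names)
```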
Twitter • Twitter is notoriously protective of its API and rightfully so. With • over 230 million active users • and a revenue of over $ 100 million a month, • the company is hesitant to let just anyone come along and have any data they want. • Twitter’s rate limits (the number of calls it allows each user to make) fall into two categories: • 15 calls per 15-minute period, and • 180 calls per 15-minute period, depending on the type of call. • For instance, you can make up to 12 calls a minute to retrieve basic information about Twitter users, but only one call a minute to retrieve lists of those users’ Twitter followers.
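The per-minute figures follow from dividing each limit by its 15-minute window, and the same arithmetic gives the pause a script would need between calls to stay under the limit. A small sketch (the function names are illustrative):

```python
def calls_per_minute(calls, window_minutes=15):
    """Average number of calls allowed per minute."""
    return calls / window_minutes

def min_interval_seconds(calls, window_minutes=15):
    """Even spacing between calls that stays within the limit."""
    return window_minutes * 60 / calls

for limit in (15, 180):
    print(limit, calls_per_minute(limit), min_interval_seconds(limit))
```

A crawler that sleeps for at least this interval between requests spreads its calls evenly across the window.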
Yield in Python
# The generator:
def _get_child_candidates(self, distance, min_dist, max_dist):
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

# The caller:
result, candidates = list(), [self]
while candidates:
    node = candidates.pop()
    distance = node._get_dist(obj)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result
http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
Iterators and Generators • When you create a list, you can read its items one by one. Reading its items one by one is called iteration: • >>> mylist = [1, 2, 3] • >>> for i in mylist: • ... print(i) • Generators are iterators, but you can only iterate over them once. It's because they do not store all the values in memory, they generate the values on the fly: • >>> mygenerator = (x*x for x in range(3)) • >>> for i in mygenerator: • ... print(i)
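The "only once" property is easy to verify: a second pass over a generator produces nothing, while a list can be read repeatedly. A minimal sketch:

```python
squares = (x * x for x in range(3))  # generator: values made on the fly

first_pass = list(squares)   # consumes the generator
second_pass = list(squares)  # nothing left to produce

as_list = [x * x for x in range(3)]  # list: values stored in memory
list_twice = (list(as_list), list(as_list))

print(first_pass, second_pass, list_twice)
```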
Yield • yield is a keyword that is used like return, except that the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()  # create a generator
>>> print(mygenerator)  # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
Next Time Requests Library and DB • Requests: HTTP for Humans • >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) • >>> r.status_code • 200 • >>> r.headers['content-type'] • 'application/json; charset=utf8' • >>> r.encoding • 'utf-8' • >>> r.text • u'{"type":"User"...' • >>> r.json() • {u'private_gists': 419, u'total_private_repos': 77, ...}