COMP4332/RMBI4310 Crawling
Prepared by Raymond Wong
Presented by Raymond Wong
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
1. Overview
• There are many webpages containing a lot of information
• Suppose that we could obtain this information and store it in a database on our own computer
• Then, we could perform data analysis easily
• This could benefit a company a lot
Crawling is the process of extracting data from websites.
• Other terms for the same process:
• Web spidering
• Web scraping
• Web harvesting
• Web data extraction
The program used for crawling is called a
• Web crawler
• Spider
In Python, we could use a package called “scrapy” for data crawling
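If Scrapy is not installed yet, it can typically be installed with pip (the exact command may differ depending on how Python is set up on your machine).

Command Line
pip install scrapy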
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
When we want to start “crawling”, we need to create a “Scrapy” project
• Go to the folder (e.g., “C:\Project”) in which you want to create the project
• Type the following command

Command Line
scrapy startproject testCrawl
Output
New Scrapy project 'testCrawl', using template directory 'c:\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Project\testCrawl

You can start your first spider with:
    cd testCrawl
    scrapy genspider example example.com
Project/
    testCrawl/              A new sub-folder called “testCrawl”
        scrapy.cfg          A configuration file of “scrapy”
        testCrawl/          Another new sub-folder called “testCrawl”
            __init__.py
            items.py        A “project item definition” file
            pipelines.py    A “project pipeline” file
            middlewares.py  A “middleware” file
            settings.py     A “project setting” file
            spiders/        A sub-folder called “spiders”
                __init__.py
Let us create a Python file called “OneWebpageSpider.py” under the sub-folder “spiders”
• This Python file accesses a webpage (“SimpleWebpage.html”) and then stores the content of this webpage in the working directory of our computer
The webpage (SimpleWebpage.html) to be crawled is shown as follows.
HTML
<!DOCTYPE html>
<html>
...
<body>
<h1>Webpage Heading (H1)</h1>
...
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
...
<a href="Table.html">link (No. 2)</a>
...
<a href="#ImageHeading">link (No. 3)</a>
...
<a href="#ListHeading">link (No. 4)</a>
...
<h2><a name="ImageHeading">Image Heading (H2)</a></h2>
...
<h3><a name="ListHeading">Ordered/Underordered List Heading (H3)</a></h3>
...
</body>
</html>
python
# We need to import "scrapy"
import scrapy

# We need to define a sub-class of "scrapy.Spider"
class OneWebpageSpider(scrapy.Spider):
    # We need to specify the "unique" name within a project
    name = "OneWebpage"
    # We can give a list of URLs where crawling starts
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # We need to define this "parse" function.
    # This function will be called by "scrapy" during crawling.
    def parse(self, response):
        # Obtain the filename, which is the part of the URL just after the last "/"
        crawlFilename = response.url.split("/")[-1]
        # Write the content of the HTML file to the file with this filename
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        # Write to the "debug" console of "scrapy"
        self.log("Saved File {} ".format(crawlFilename))
After we create this Python file, we should type the following command under “C:\Project\testCrawl” to execute this Python file.

Command Line
scrapy crawl OneWebpage

Here, “OneWebpage” is the name specified in the Python script.
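Before running the crawl, it may also help to confirm that Scrapy can see the new spider. The “scrapy list” command prints the names of all spiders found in the current project, so “OneWebpage” should appear in its output.

Command Line
scrapy list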
After running this command, the following output is created by “scrapy”.
Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
 'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)
We could see a file called “SimpleWebpage.html” stored under our working directory (i.e., “C:\Project\testCrawl”)
Let us modify the previous Python script to do the following.
• access a webpage
• obtain a list of all links found in "href" of the "a" HTML tags and save the list in a file called "listOfLink.txt"
Here, we need to find elements that satisfy some “required” conditions. In Scrapy, this task is done by a selector. We will use the technology of “XPath” for this purpose. There is another technology, “CSS” (Cascading Style Sheets) selectors, for the same purpose. However, “XPath” is more powerful for writing expressions that describe the required conditions. In practice, we could use the two technologies interchangeably; a short comparison is shown below.
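For illustration, the same link extraction can be written with either kind of selector; the sketch below is not part of the original spider, but both lines should return the same list of "href" values.

python
# XPath form (used in the rest of these notes)
listOfLink = response.xpath("//a[@href]/@href").extract()

# Equivalent CSS selector form supported by Scrapy
listOfLink = response.css("a[href]::attr(href)").extract()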
python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Find a list of all links found in "href" of the "a" HTML tags.
        # If we want to obtain the first one only (not a list), we could use
        # "extract_first()" instead of "extract()".
        listOfLink = response.xpath("//a[@href]/@href").extract()

        # Save the list in a file called "listOfLink.txt"
        linkFilename = "listOfLink.txt"
        with open(linkFilename, "w") as f:
            for link in listOfLink:
                f.write(link)
                f.write("\n")

        # Write to the "debug" console of "scrapy"
        self.log("Saved File {} ".format(linkFilename))
After we execute the code, we could see a file called “listOfLink.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)

File “listOfLink.txt”
http://home.cse.ust.hk/~raywong/
Table.html
#ImageHeading
#ListHeading
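When writing such XPath expressions, it is often convenient to test them interactively before putting them into a spider. Scrapy ships with an interactive shell for this purpose; a session might look like the following (the exact prompt output will vary).

Command Line
scrapy shell "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"

python
>>> response.xpath("//a[@href]/@href").extract()
['http://home.cse.ust.hk/~raywong/', 'Table.html', '#ImageHeading', '#ListHeading']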
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
We want to do the following.
• Operation 1 (this operation is what we have seen before)
• access a webpage (“SimpleWebpage.html”)
• store the content of this webpage in the working directory of our computer
• Operation 2
• obtain a list of all links found in "href" of the "a" HTML tags
• perform data crawling on the webpage of each of the links, where each link contains the keyword "Table.html"
python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        crawlFilename = response.url.split("/")[-1]
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        self.log("Saved File {} ".format(crawlFilename))
We want to do the following.
• Operation 1
• access a webpage (“SimpleWebpage.html”)
• store the content of this webpage in the working directory of our computer
• Operation 2 (we will illustrate this operation next)
• obtain a list of all links found in "href" of the "a" HTML tags
• perform data crawling on the webpage of each of the links, where each link contains the keyword "Table.html"
python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        ...

        # Operation 2
        # Find a list of all links found in "href" of the "a" HTML tags
        listOfLink = response.xpath("//a[@href]/@href").extract()

        # Perform data crawling on the webpage of each of the links,
        # where each link contains the keyword "Table.html"
        for link in listOfLink:
            if ("Table.html" in link):
                yield response.follow(link, callback=self.parse)
After we execute this Python file, we could see a file called “SimpleWebpage.html” and another file called “Table.html” stored under our working directory (i.e., “C:\Project\testCrawl”)
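Note that “response.follow” resolves a relative link such as “Table.html” against the URL of the current response. On Scrapy versions that predate “response.follow” (it was added in Scrapy 1.4), a similar effect can be obtained with “response.urljoin” and an explicit “scrapy.Request”; the following sketch only illustrates the idea.

python
for link in listOfLink:
    if ("Table.html" in link):
        # Convert the relative link into an absolute URL, then issue a new request
        absoluteURL = response.urljoin(link)
        yield scrapy.Request(absoluteURL, callback=self.parse)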
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
We want to do the following.
• perform data crawling on one single webpage (Table.html)
• use two “different” XPath methods
• obtain the data information
• save the information in two files (one is called “record1.txt” and the other is called “record2.txt”)
The first method calls XPath once per column and pairs up the resulting lists. The second method first selects each table row and then calls XPath again on each row to extract its fields.
HTML
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Table Title</title>
</head>
<body>
<table width="800" border="1">
  <tr>
    <th scope="col">Student ID</th>
    <th scope="col">Student Name</th>
    <th scope="col">Birth Year</th>
  </tr>
  <tr>
    <td>12345678</td>
    <td>Raymond</td>
    <td>1998</td>
  </tr>
  <tr>
    <td>87654321</td>
    <td>Peter Chan</td>
    <td>1997</td>
  </tr>
  <tr>
    <td>12341234</td>
    <td>Mary Lau</td>
    <td>1999</td>
  </tr>
  <tr>
    <td>56785678</td>
    <td>David Lee</td>
    <td>1998</td>
  </tr>
  <tr>
    <td>88888888</td>
    <td>Test Test</td>
    <td>1998</td>
  </tr>
</table>
</body>
</html>
python
import scrapy

class TableWebpageSpider(scrapy.Spider):
    name = "TableWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # We perform the following operations
        yield self.parseMethod1(response)
        yield self.parseMethod2(response)

    def parseMethod1(self, response):
        ...

    def parseMethod2(self, response):
        ...
python
def parseMethod1(self, response):
    # Obtain the list of records (ID, name, byear), one list per column
    listOfID = response.xpath("//tr/td[1]/text()").extract()
    listOfName = response.xpath("//tr/td[2]/text()").extract()
    listOfByear = response.xpath("//tr/td[3]/text()").extract()

    # Save the list of records in a file called "record1.txt"
    recordFilename = "record1.txt"
    no = 0
    with open(recordFilename, "w") as f:
        for id in listOfID:
            id = listOfID[no]
            name = listOfName[no]
            byear = listOfByear[no]
            no = no + 1
            f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
        f.write("\n")

    # Write to the "debug" console of "scrapy"
    self.log("Saved File {} ".format(recordFilename))
After we execute the code, we could see a file called “record1.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)
File “record1.txt”
(1: 12345678, Raymond, 1998)
(2: 87654321, Peter Chan, 1997)
(3: 12341234, Mary Lau, 1999)
(4: 56785678, David Lee, 1998)
(5: 88888888, Test Test, 1998)
python
def parseMethod2(self, response):
    # Obtain the list of records (one selector per <tr> row that contains <td> cells)
    listOfRecord = response.xpath("//tr[td]")

    # Save the list of records (ID, name, byear) in a file called "record2.txt"
    recordFilename = "record2.txt"
    no = 0
    with open(recordFilename, "w") as f:
        for record in listOfRecord:
            # "." means the current node (i.e., the current row)
            id = record.xpath("./td[1]/text()").extract_first()
            name = record.xpath("./td[2]/text()").extract_first()
            byear = record.xpath("./td[3]/text()").extract_first()
            no = no + 1
            f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
        f.write("\n")

    # Write to the "debug" console of "scrapy"
    self.log("Saved File {} ".format(recordFilename))
After we execute the code, we could see a file called “record2.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)
• The file content of “record2.txt” is exactly the same as the file content of “record1.txt”
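As a side note, instead of writing the records to a text file by hand, a spider can also yield each record as a dictionary and let Scrapy's feed export write the output. The following sketch is not part of the original example (the spider name “TableItem” is made up for illustration); its items could then be exported with a command such as “scrapy crawl TableItem -o record.json”.

python
import scrapy

class TableItemSpider(scrapy.Spider):
    name = "TableItem"   # hypothetical spider name used only for this sketch
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # Yield one dictionary per table row; Scrapy collects these as items
        for record in response.xpath("//tr[td]"):
            yield {
                "id": record.xpath("./td[1]/text()").extract_first(),
                "name": record.xpath("./td[2]/text()").extract_first(),
                "byear": record.xpath("./td[3]/text()").extract_first(),
            }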
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
5. Data Crawler with Constructor and “Closed”
• When we execute the data crawler, we want to call a function “once” only at the beginning of the execution. This function is called “__init__”.
• We also want to call a function “once” only at the end of the execution. This function is called “closed”.
One example is that
1. We want to connect to the “database” server at the beginning
2. We also want to disconnect from the “database” server at the end
(A concrete sketch of this database pattern is given after the code below.)
python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # Constructor (which is called at the beginning)
    # Remember that each "instance" variable should have a prefix "self."
    # e.g., we should write "self.tempVariable"
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print("This is called at the beginning.")

    # Closed (which is called at the end)
    def closed(self, reason):
        print("This is called at the end.")

    def parse(self, response):
        ...
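To make the database example above concrete, the following sketch opens a SQLite database in the constructor and closes it when the spider finishes. SQLite is used here only because it ships with Python; the file name “crawl.db”, the table layout, and the spider name are assumptions made for illustration.

python
import sqlite3
import scrapy

class DatabaseSpider(scrapy.Spider):
    name = "DatabaseExample"   # hypothetical spider name for this sketch
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Connect to the database once at the beginning
        self.conn = sqlite3.connect("crawl.db")   # assumed file name
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS student (id TEXT, name TEXT, byear TEXT)")

    def closed(self, reason):
        # Commit and disconnect from the database once at the end
        self.conn.commit()
        self.conn.close()

    def parse(self, response):
        # Insert one row per record found on the table webpage
        for record in response.xpath("//tr[td]"):
            row = (
                record.xpath("./td[1]/text()").extract_first(),
                record.xpath("./td[2]/text()").extract_first(),
                record.xpath("./td[3]/text()").extract_first(),
            )
            self.conn.execute("INSERT INTO student VALUES (?, ?, ?)", row)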
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
Consider the following command under “C:\Project\testCrawl”.

Command Line
scrapy crawl OneWebpage

• We have the following output created by “scrapy”, which is very “noisy”
Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
 'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)
We could have a “concise” output after we update “settings.py”
• Note that we have the following files/folders.
Project/
    testCrawl/
        scrapy.cfg
        testCrawl/
            __init__.py
            items.py
            pipelines.py
            middlewares.py
            settings.py
            spiders/
                __init__.py
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for testCrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'testCrawl'

SPIDER_MODULES = ['testCrawl.spiders']
NEWSPIDER_MODULE = 'testCrawl.spiders'

# Add the following settings

# log to a file instead of the console
LOG_FILE = 'log.txt'

# only log error messages, ignore irrelevant messages
# to know more about log levels, see
# https://doc.scrapy.org/en/latest/topics/logging.html#log-levels
LOG_LEVEL = 'ERROR'

# do not redirect standard output to the log file
# i.e., we want the output from the print() method to be shown in the console
LOG_STDOUT = False
After that, we re-type the following command under “C:\Project\testCrawl”.

Command Line
scrapy crawl OneWebpage
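As an alternative to editing “settings.py”, most settings can also be overridden for a single run from the command line with the “-s” option; for example, only the log level could be changed as follows.

Command Line
scrapy crawl OneWebpage -s LOG_LEVEL=ERROR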