COMP4332/RMBI4310 Crawling
Prepared by Raymond Wong
Presented by Raymond Wong
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
1. Overview
• There are many webpages containing a lot of information
• Suppose that we could obtain this information and store it in a database on our own computer
• Then, we could perform data analysis easily
• This could benefit a company a lot
Crawling is the process of extracting data from websites.
• Other terms for the same process:
• Web spidering
• Web scraping
• Web harvesting
• Web data extraction
The program used for crawling is called a
• Web crawler
• Spider
In Python, we could use a package called “scrapy” for data crawling
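If Scrapy is not installed yet, it can typically be installed with pip (the exact command may differ depending on how Python is set up on your machine).

Command Line
pip install scrapy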
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
When we want to start “crawling”, we need to create a “Scrapy” project
• Go to the folder (e.g., “C:\Project”) in which you want to create the project
• Type the following command

Command Line
scrapy startproject testCrawl
Output
New Scrapy project 'testCrawl', using template directory 'c:\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Project\testCrawl

You can start your first spider with:
    cd testCrawl
    scrapy genspider example example.com
Project/
    testCrawl/              A new sub-folder called “testCrawl”
        scrapy.cfg          A configuration file of “scrapy”
        testCrawl/          Another new sub-folder called “testCrawl”
            __init__.py
            items.py        A “project item definition” file
            pipelines.py    A “project pipeline” file
            middlewares.py  A “middleware” file
            settings.py     A “project setting” file
            spiders/        A sub-folder called “spiders”
                __init__.py
Let us create a Python file called “OneWebpageSpider.py” under the sub-folder “spiders”
• This Python file accesses a webpage (“SimpleWebpage.html”) and then stores the content of this webpage in the working directory of our computer
The webpage (SimpleWebpage.html) to be crawled is shown as follows.
HTML
<!DOCTYPE html>
<html>
...
<body>
<h1>Webpage Heading (H1)</h1>
...
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
...
<a href="Table.html">link (No. 2)</a>
...
<a href="#ImageHeading">link (No. 3)</a>
...
<a href="#ListHeading">link (No. 4)</a>
...
<h2><a name="ImageHeading">Image Heading (H2)</a></h2>
...
<h3><a name="ListHeading">Ordered/Underordered List Heading (H3)</a></h3>
...
</body>
</html>
python
# We need to import "scrapy"
import scrapy

# We need to define a sub-class of "scrapy.Spider"
class OneWebpageSpider(scrapy.Spider):
    # We need to specify the "unique" name within a project
    name = "OneWebpage"
    # We can give a list of URLs where crawling starts
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # We need to define this "parse" function.
    # This function will be called by "scrapy" during crawling.
    def parse(self, response):
        # Obtain the filename, which is the part of the URL just after the last "/"
        crawlFilename = response.url.split("/")[-1]
        # Write the content of the HTML file to the file with this filename
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        # Write to the "debug" console of "scrapy"
        self.log("Saved File {} ".format(crawlFilename))
After we create this Python file, we should type the following command under “C:\Project\testCrawl” to execute this Python file.

Command Line
scrapy crawl OneWebpage

Here, “OneWebpage” is the name specified in the Python script.
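Before running the crawl, it may also help to confirm that Scrapy can see the new spider. The “scrapy list” command prints the names of all spiders found in the current project, so “OneWebpage” should appear in its output.

Command Line
scrapy list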
After running this command, the following output is created by “scrapy”.
Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
 'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)
We could see a file called “SimpleWebpage.html” stored under our working directory (i.e., “C:\Project\testCrawl”)
Let us modify the previous Python script to do the following.
• access a webpage
• obtain a list of all links found in "href" of the "a" HTML tags and save the list in a file called "listOfLink.txt"
Here, we need to find elements that satisfy some “required” conditions. In Scrapy, this task is done by a selector. We will use the technology of “XPath” for this purpose. There is another technology, “CSS” (Cascading Style Sheets) selectors, for the same purpose. However, “XPath” is more powerful for writing expressions that describe the required conditions. In practice, we could use the two technologies interchangeably; a short comparison is shown below.
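For illustration, the same link extraction can be written with either kind of selector; the sketch below is not part of the original spider, but both lines should return the same list of "href" values.

python
# XPath form (used in the rest of these notes)
listOfLink = response.xpath("//a[@href]/@href").extract()

# Equivalent CSS selector form supported by Scrapy
listOfLink = response.css("a[href]::attr(href)").extract()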
python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Find a list of all links found in "href" of the "a" HTML tags.
        # If we want to obtain the first one only (not a list), we could use
        # "extract_first()" instead of "extract()".
        listOfLink = response.xpath("//a[@href]/@href").extract()

        # Save the list in a file called "listOfLink.txt"
        linkFilename = "listOfLink.txt"
        with open(linkFilename, "w") as f:
            for link in listOfLink:
                f.write(link)
                f.write("\n")

        # Write to the "debug" console of "scrapy"
        self.log("Saved File {} ".format(linkFilename))
After we execute the code, we could see a file called “listOfLink.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)

File “listOfLink.txt”
http://home.cse.ust.hk/~raywong/
Table.html
#ImageHeading
#ListHeading
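When writing such XPath expressions, it is often convenient to test them interactively before putting them into a spider. Scrapy ships with an interactive shell for this purpose; a session might look like the following (the exact prompt output will vary).

Command Line
scrapy shell "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"

python
>>> response.xpath("//a[@href]/@href").extract()
['http://home.cse.ust.hk/~raywong/', 'Table.html', '#ImageHeading', '#ListHeading']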
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
We want to do the following.
• Operation 1 (this operation is what we have seen before)
• access a webpage (“SimpleWebpage.html”)
• store the content of this webpage in the working directory of our computer
• Operation 2
• obtain a list of all links found in "href" of the "a" HTML tags
• perform data crawling on the webpage of each of the links, where each link contains the keyword "Table.html"
python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        crawlFilename = response.url.split("/")[-1]
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        self.log("Saved File {} ".format(crawlFilename))
We want to do the following.
• Operation 1
• access a webpage (“SimpleWebpage.html”)
• store the content of this webpage in the working directory of our computer
• Operation 2 (we will illustrate this operation next)
• obtain a list of all links found in "href" of the "a" HTML tags
• perform data crawling on the webpage of each of the links, where each link contains the keyword "Table.html"
python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        ...

        # Operation 2
        # Find a list of all links found in "href" of the "a" HTML tags
        listOfLink = response.xpath("//a[@href]/@href").extract()

        # Perform data crawling on the webpage of each of the links,
        # where each link contains the keyword "Table.html"
        for link in listOfLink:
            if ("Table.html" in link):
                yield response.follow(link, callback=self.parse)
After we execute this Python file, we could see a file called “SimpleWebpage.html” and another file called “Table.html” stored under our working directory (i.e., “C:\Project\testCrawl”)
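Note that “response.follow” resolves a relative link such as “Table.html” against the URL of the current response. On Scrapy versions that predate “response.follow” (it was added in Scrapy 1.4), a similar effect can be obtained with “response.urljoin” and an explicit “scrapy.Request”; the following sketch only illustrates the idea.

python
for link in listOfLink:
    if ("Table.html" in link):
        # Convert the relative link into an absolute URL, then issue a new request
        absoluteURL = response.urljoin(link)
        yield scrapy.Request(absoluteURL, callback=self.parse)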
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
We want to do the following.
• perform data crawling on one single webpage (Table.html)
• use two “different” XPath methods
• obtain the data information
• save the information in two files (one is called “record1.txt” and the other is called “record2.txt”)
The first method calls XPath once per column and pairs up the resulting lists. The second method first selects each table row and then calls XPath again on each row to extract its fields.
HTML
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Table Title</title>
</head>
<body>
<table width="800" border="1">
  <tr>
    <th scope="col">Student ID</th>
    <th scope="col">Student Name</th>
    <th scope="col">Birth Year</th>
  </tr>
  <tr>
    <td>12345678</td>
    <td>Raymond</td>
    <td>1998</td>
  </tr>
  <tr>
    <td>87654321</td>
    <td>Peter Chan</td>
    <td>1997</td>
  </tr>
  <tr>
    <td>12341234</td>
    <td>Mary Lau</td>
    <td>1999</td>
  </tr>
  <tr>
    <td>56785678</td>
    <td>David Lee</td>
    <td>1998</td>
  </tr>
  <tr>
    <td>88888888</td>
    <td>Test Test</td>
    <td>1998</td>
  </tr>
</table>
</body>
</html>
python
import scrapy

class TableWebpageSpider(scrapy.Spider):
    name = "TableWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # We perform the following operations
        yield self.parseMethod1(response)
        yield self.parseMethod2(response)

    def parseMethod1(self, response):
        ...

    def parseMethod2(self, response):
        ...
python
def parseMethod1(self, response):
    # Obtain the list of records (ID, name, byear), one list per column
    listOfID = response.xpath("//tr/td[1]/text()").extract()
    listOfName = response.xpath("//tr/td[2]/text()").extract()
    listOfByear = response.xpath("//tr/td[3]/text()").extract()

    # Save the list of records in a file called "record1.txt"
    recordFilename = "record1.txt"
    no = 0
    with open(recordFilename, "w") as f:
        for id in listOfID:
            id = listOfID[no]
            name = listOfName[no]
            byear = listOfByear[no]
            no = no + 1
            f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
        f.write("\n")

    # Write to the "debug" console of "scrapy"
    self.log("Saved File {} ".format(recordFilename))
After we execute the code, we could see a file called “record1.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)
File “record1.txt”
(1: 12345678, Raymond, 1998)
(2: 87654321, Peter Chan, 1997)
(3: 12341234, Mary Lau, 1999)
(4: 56785678, David Lee, 1998)
(5: 88888888, Test Test, 1998)
python
def parseMethod2(self, response):
    # Obtain the list of records (one selector per <tr> row that contains <td> cells)
    listOfRecord = response.xpath("//tr[td]")

    # Save the list of records (ID, name, byear) in a file called "record2.txt"
    recordFilename = "record2.txt"
    no = 0
    with open(recordFilename, "w") as f:
        for record in listOfRecord:
            # "." means the current node (i.e., the current row)
            id = record.xpath("./td[1]/text()").extract_first()
            name = record.xpath("./td[2]/text()").extract_first()
            byear = record.xpath("./td[3]/text()").extract_first()
            no = no + 1
            f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
        f.write("\n")

    # Write to the "debug" console of "scrapy"
    self.log("Saved File {} ".format(recordFilename))
After we execute the code, we could see a file called “record2.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)
• The file content of “record2.txt” is exactly the same as the file content of “record1.txt”
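As a side note, instead of writing the records to a text file by hand, a spider can also yield each record as a dictionary and let Scrapy's feed export write the output. The following sketch is not part of the original example (the spider name “TableItem” is made up for illustration); its items could then be exported with a command such as “scrapy crawl TableItem -o record.json”.

python
import scrapy

class TableItemSpider(scrapy.Spider):
    name = "TableItem"   # hypothetical spider name used only for this sketch
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # Yield one dictionary per table row; Scrapy collects these as items
        for record in response.xpath("//tr[td]"):
            yield {
                "id": record.xpath("./td[1]/text()").extract_first(),
                "name": record.xpath("./td[2]/text()").extract_first(),
                "byear": record.xpath("./td[3]/text()").extract_first(),
            }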
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
5. Data Crawler with Constructor and “Closed”
• When we execute the data crawler, we want to call a function “once” only at the beginning of the execution. This function is called “__init__”.
• We also want to call a function “once” only at the end of the execution. This function is called “closed”.
One example is that
1. We want to connect to the “database” server at the beginning
2. We also want to disconnect from the “database” server at the end
(A concrete sketch of this database pattern is given after the code below.)
python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # Constructor (which is called at the beginning)
    # Remember that each "instance" variable should have a prefix "self."
    # e.g., we should write "self.tempVariable"
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print("This is called at the beginning.")

    # Closed (which is called at the end)
    def closed(self, reason):
        print("This is called at the end.")

    def parse(self, response):
        ...
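To make the database example above concrete, the following sketch opens a SQLite database in the constructor and closes it when the spider finishes. SQLite is used here only because it ships with Python; the file name “crawl.db”, the table layout, and the spider name are assumptions made for illustration.

python
import sqlite3
import scrapy

class DatabaseSpider(scrapy.Spider):
    name = "DatabaseExample"   # hypothetical spider name for this sketch
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Connect to the database once at the beginning
        self.conn = sqlite3.connect("crawl.db")   # assumed file name
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS student (id TEXT, name TEXT, byear TEXT)")

    def closed(self, reason):
        # Commit and disconnect from the database once at the end
        self.conn.commit()
        self.conn.close()

    def parse(self, response):
        # Insert one row per record found on the table webpage
        for record in response.xpath("//tr[td]"):
            row = (
                record.xpath("./td[1]/text()").extract_first(),
                record.xpath("./td[2]/text()").extract_first(),
                record.xpath("./td[3]/text()").extract_first(),
            )
            self.conn.execute("INSERT INTO student VALUES (?, ?, ?)", row)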
Outline
• Overview
• Data Crawler which Accesses One Simple Webpage
• Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages
• Data Crawler which Accesses One Table Webpage
• Data Crawler with Constructor and “Closed”
• Data Crawler with a Concise Output
Consider the following command under “C:\Project\testCrawl”.

Command Line
scrapy crawl OneWebpage

• We have the following output created by “scrapy”, which is very “noisy”
Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
 'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)
We could have a “concise” output after we update “settings.py”
• Note that we have the following files/folders.
Project/
    testCrawl/
        scrapy.cfg
        testCrawl/
            __init__.py
            items.py
            pipelines.py
            middlewares.py
            settings.py
            spiders/
                __init__.py
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for testCrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'testCrawl'

SPIDER_MODULES = ['testCrawl.spiders']
NEWSPIDER_MODULE = 'testCrawl.spiders'

# Add the following settings

# log to a file instead of the console
LOG_FILE = 'log.txt'

# only log error messages, ignore irrelevant messages
# to know more about log levels, see
# https://doc.scrapy.org/en/latest/topics/logging.html#log-levels
LOG_LEVEL = 'ERROR'

# do not redirect standard output to the log file
# i.e., we want the output from the print() method to be shown in the console
LOG_STDOUT = False
After that, we re-type the following command under “C:\Project\testCrawl”.

Command Line
scrapy crawl OneWebpage
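As an alternative to editing “settings.py”, most settings can also be overridden for a single run from the command line with the “-s” option; for example, only the log level could be changed as follows.

Command Line
scrapy crawl OneWebpage -s LOG_LEVEL=ERROR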