
Web Data Crawling and Analysis with Scrapy

Learn the process of web crawling using Scrapy and how to extract data from webpages for easy data analysis. Benefit your company by automating data collection.


Presentation Transcript


  1. COMP4332/RMBI4310 Crawling Prepared by Raymond Wong Presented by Raymond Wong

  2. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  3. 1. Overview • There are many webpages containing a lot of information • Suppose that we could obtain this information and store it in a database on our own computer • Then, we could perform data analysis easily • This could benefit the company a lot

  4. Crawling is the process of extracting data from websites. • Other terms for it include: • Web spidering • Web scraping • Web harvesting • Web data extraction

  5. The program used for crawling is called a • Web crawler • Spider

  6. In Python, we could use a package called “scrapy” for data crawling
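
Before we can use it, the “scrapy” package has to be installed. A minimal sketch (assuming a standard Python setup with pip available; the exact command may differ on your machine):

Command Line
pip install scrapy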

  7. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  8. When we want to start “crawling”, we need to create a “Scrapy” project • Go to the folder (e.g., “C:\Project”) in which you want to create the project • Type the following command
Command Line
scrapy startproject testCrawl

  9. Output
New Scrapy project 'testCrawl', using template directory 'c:\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Project\testCrawl

You can start your first spider with:
    cd testCrawl
    scrapy genspider example example.com
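
As the output hints, Scrapy can also generate a spider skeleton file under the “spiders” sub-folder with the “genspider” command. This is an optional shortcut (in these slides the spider files are written by hand instead); the name “example” and domain “example.com” below are just the placeholders suggested by Scrapy itself:

Command Line
cd testCrawl
scrapy genspider example example.com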

  10. We could see the following files/folders generated

  11. Project/
    testCrawl/                A new sub-folder called “testCrawl”
        scrapy.cfg            A configuration file of “scrapy”
        testCrawl/            Another new sub-folder called “testCrawl”
            __init__.py
            items.py          A “project item definition” file
            pipelines.py      A “project pipeline” file
            middlewares.py    A “middleware” file
            settings.py       A “project setting” file
            spiders/          A sub-folder called “spiders”
                __init__.py

  12. Let us create a Python file called “OneWebpageSpider.py” under the sub-folder “spiders” • This Python file accesses a webpage (“SimpleWebpage.html”) and then stores the content of this webpage in the working directory of our computer

  13. The webpage (SimpleWebpage.html) to be crawled is shown as follows.

  14. HTML
<!DOCTYPE html>
<html>
...
<body>
<h1>Webpage Heading (H1)</h1>
...
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
...
<a href="Table.html">link (No. 2)</a>
...
<a href="#ImageHeading">link (No. 3)</a>
...
<a href="#ListHeading">link (No. 4)</a>
...
<h2><a name="ImageHeading">Image Heading (H2)</a></h2>
...
<h3><a name="ListHeading">Ordered/Underordered List Heading (H3)</a></h3>
...
</body>
</html>

  15. The Python file is shown in the next slide.

  16. python
# We need to import “scrapy”
import scrapy

# We need to define a sub-class of “scrapy.Spider”
class OneWebpageSpider(scrapy.Spider):
    # We need to specify the “unique” name within a project
    name = "OneWebpage"
    # We can give a list of URLs at which crawling starts
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # We need to define this “parse” function.
    # This function will be called by “scrapy” during crawling.
    def parse(self, response):
        # Obtain the filename, which is the part of the URL just after the last “/”
        crawlFilename = response.url.split("/")[-1]
        # Write the content of the HTML file to the file with this filename
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        # Write to the “debug” console of “scrapy”
        self.log("Saved File {}".format(crawlFilename))

  17. After we create this Python file, we should type the following command under “C:\Project\testCrawl” to execute it
Command Line
scrapy crawl OneWebpage
Here, “OneWebpage” is the name specified in the Python script

  18. After that, the following output is created by “scrapy”

  19. Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)

  20. We could see a file called “SimpleWebpage.html” stored under our working directory (i.e., “C:\Project\testCrawl”)

  21. Let us modify the previous Python script to do the following. • to access a webpage • to obtain a list of all links found in "href" of the "a" HTML tags and save the list in a file called "listOfLink.txt" Here, we need to find content that satisfies some “required” conditions. In Scrapy, this task is done by a selector. We will use the “XPath” technology for this purpose. There is another technology, “CSS” (Cascading Style Sheets) selectors, for the same purpose. However, “XPath” is more powerful for writing expressions that describe the required conditions. In practice, we could mix the two technologies, using whichever is more convenient, as sketched below.
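
Before looking at the full spider, here is a tiny standalone comparison of the two selector styles. This is an illustrative sketch (the inline HTML string is hypothetical, not the course webpage); the XPath expression is the one used in the script that follows, and the CSS expression is an assumed equivalent:

python
from scrapy import Selector

# a tiny hypothetical page, just for illustration
html = '<body><a href="A.html">first</a> <a href="B.html">second</a></body>'
sel = Selector(text=html)

# XPath: all values of the "href" attribute of "a" tags
print(sel.xpath("//a[@href]/@href").extract())        # ['A.html', 'B.html']

# CSS: an equivalent query using Scrapy's "::attr()" extension
print(sel.css("a[href]::attr(href)").extract())       # ['A.html', 'B.html']

# extract_first() returns only the first match instead of a list
print(sel.xpath("//a[@href]/@href").extract_first())  # 'A.html'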

  22. python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Find a list of all links found in "href" of the "a" HTML tags.
        # If we want to obtain only the first one (not a list), we could use
        # “extract_first()” instead of “extract()”.
        listOfLink = response.xpath("//a[@href]/@href").extract()

        # Save the list in a file called "listOfLink.txt"
        linkFilename = "listOfLink.txt"
        with open(linkFilename, "w") as f:
            for link in listOfLink:
                f.write(link)
                f.write("\n")

        # Write to the “debug” console of “scrapy”
        self.log("Saved File {}".format(linkFilename))

  23. After we execute the code, we could see a file called “listOfLink.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)
File “listOfLink.txt”
http://home.cse.ust.hk/~raywong/
Table.html
#ImageHeading
#ListHeading

  24. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  25. We want to do the following. • Operation 1 • access a webpage (“SimpleWebpage.html”) • store the content of this webpage in the working directory of our computer • Operation 2 • obtain a list of all links found in "href" of the "a" HTML tags • perform data crawling on the webpage of each of the links that contains the keyword "Table.html" Operation 1 is what we have seen before.

  26. python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        crawlFilename = response.url.split("/")[-1]
        with open(crawlFilename, "wb") as f:
            f.write(response.body)
        self.log("Saved File {}".format(crawlFilename))

  27. We want to do the following. • Operation 1 • access a webpage (“SimpleWebpage.html”) • store the content of this webpage in the working directory of our computer • Operation 2 • obtain a list of all links found in "href" of the "a" HTML tags • perform data crawling on the webpage of each of the links that contains the keyword "Table.html" We will illustrate Operation 2 next.

  28. python
import scrapy

class LinkWebpageSpider(scrapy.Spider):
    name = "LinkWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        # Operation 1
        …

        # Operation 2
        # Find a list of all links found in "href" of the "a" HTML tags
        listOfLink = response.xpath("//a[@href]/@href").extract()
        # Perform data crawling on the webpage of each link that contains
        # the keyword "Table.html" (using the same “parse” function as the callback)
        for link in listOfLink:
            if ("Table.html" in link):
                yield response.follow(link, callback=self.parse)

  29. After we execute this Python file, we could see a file called “SimpleWebpage.html” and another file called “Table.html” stored under our working directory (i.e., “C:\Project\testCrawl”)
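
As an aside (an assumed variant, not part of the original slides), if we wanted to follow every link instead of only those containing "Table.html", we could yield a request for each link and rely on Scrapy's built-in duplicate filter, which normally skips URLs that have already been requested:

python
import scrapy

# A hypothetical spider that follows every link it finds on the starting page
# and on every page it reaches from there.
class FollowAllSpider(scrapy.Spider):
    name = "FollowAll"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def parse(self, response):
        self.log("Visited {}".format(response.url))
        for link in response.xpath("//a[@href]/@href").extract():
            # response.follow() resolves relative links such as "Table.html";
            # Scrapy's duplicate filter avoids re-crawling the same URL
            yield response.follow(link, callback=self.parse)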

  30. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  31. We want to do the following. • perform data crawling on one single webpage (Table.html) • use two “different” XPath methods • obtain the data information • save the information in a file (one is called “record1.txt” and the other is called “record2.txt”) The first method calls XPath once per column and extracts each column as a whole list. The second method first selects each table row and then calls XPath on that row to extract each field.

  32. The webpage (Table.html) to be crawled is shown as follows.

  33. HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Table Title</title>
</head>
<body>
<table width="800" border="1">
  <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
  <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
  <tr> <td>87654321</td> <td>Peter Chan</td> <td>1997</td> </tr>
  <tr> <td>12341234</td> <td>Mary Lau</td> <td>1999</td> </tr>
  <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
  <tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>
</table>
</body>
</html>

  34. The Python file is shown in the next slide.

  35. python
import scrapy

class TableWebpageSpider(scrapy.Spider):
    name = "TableWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # We perform the following operations
        yield self.parseMethod1(response)
        yield self.parseMethod2(response)

    def parseMethod1(self, response):
        ...

    def parseMethod2(self, response):
        ...

  36. python
    # (inside class TableWebpageSpider)
    def parseMethod1(self, response):
        # Obtain the list of records (ID, name, byear):
        # one XPath call per column, each returning a whole list
        listOfID = response.xpath("//tr/td[1]/text()").extract()
        listOfName = response.xpath("//tr/td[2]/text()").extract()
        listOfByear = response.xpath("//tr/td[3]/text()").extract()

        # Save the list of records in a file called "record1.txt"
        recordFilename = "record1.txt"
        no = 0
        with open(recordFilename, "w") as f:
            for id in listOfID:
                id = listOfID[no]
                name = listOfName[no]
                byear = listOfByear[no]
                no = no + 1
                f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
            f.write("\n")

        # Write to the “debug” console of “scrapy”
        self.log("Saved File {}".format(recordFilename))

  37. After we execute the code, we could see a file called “record1.txt” stored under our working directory (i.e., “C:\Project\testCrawl”)

  38. File “record1.txt”
(1: 12345678, Raymond, 1998)
(2: 87654321, Peter Chan, 1997)
(3: 12341234, Mary Lau, 1999)
(4: 56785678, David Lee, 1998)
(5: 88888888, Test Test, 1998)

  39. python
    # (inside class TableWebpageSpider)
    def parseMethod2(self, response):
        # Obtain the list of records (ID, name, byear):
        # first select every table row that has <td> cells
        listOfRecord = response.xpath("//tr[td]")

        # Save the list of records in a file called "record2.txt"
        recordFilename = "record2.txt"
        no = 0
        with open(recordFilename, "w") as f:
            for record in listOfRecord:
                # “.” means the current node (i.e., the current <tr> row)
                id = record.xpath("./td[1]/text()").extract_first()
                name = record.xpath("./td[2]/text()").extract_first()
                byear = record.xpath("./td[3]/text()").extract_first()
                no = no + 1
                f.write("({}: {}, {}, {})\n".format(no, id, name, byear))
            f.write("\n")

        # Write to the “debug” console of “scrapy”
        self.log("Saved File {}".format(recordFilename))

  40. After we execute the code, we could see a file called “record2.txt” stored under our working directory (i.e., “C:\Project\testCrawl”) • The file content of “record2.txt” is exactly the same as the file content of “record1.txt”
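
As a side note (an assumed alternative, not the method shown on the slides), a Scrapy spider can also yield one dictionary per record instead of writing the file by hand, and let Scrapy's feed export produce the output file:

python
import scrapy

# A hypothetical variant of the table spider that yields items instead of writing files.
class TableItemSpider(scrapy.Spider):
    name = "TableItem"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/Table.html"
    ]

    def parse(self, response):
        # "//tr[td]" skips the header row, which contains <th> cells only
        for record in response.xpath("//tr[td]"):
            yield {
                "id": record.xpath("./td[1]/text()").extract_first(),
                "name": record.xpath("./td[2]/text()").extract_first(),
                "byear": record.xpath("./td[3]/text()").extract_first(),
            }

Running it with “scrapy crawl TableItem -o records.json” would then write the records to “records.json”.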

  41. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  42. 5. Data Crawler with Constructor and “Closed” • When we execute the data crawler, we want to call a function “once” only at the beginning of the execution. This function is called “__init__”. • We also want to call a function “once” only at the end of the execution. This function is called “closed”. One example: 1. We want to connect to the “database” server at the beginning 2. We also want to disconnect from the “database” server at the end (a sketch of this example is given after the code below)

  43. python
import scrapy

class OneWebpageSpider(scrapy.Spider):
    name = "OneWebpage"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    # Constructor (which is called at the beginning)
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print("This is called at the beginning.")

    # Closed (which is called at the end)
    def closed(self, reason):
        print("This is called at the end.")

    def parse(self, response):
        …

Remember that each “instance” variable should have the prefix “self.” (e.g., we should write “self.tempVariable”)
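
Returning to the database example mentioned on the previous slide, here is a hedged sketch of how “__init__” and “closed” could be used together. The use of sqlite3 and the file name "crawl.db" are assumptions for illustration only; they are not part of the original slides:

python
import sqlite3

import scrapy

# A hypothetical spider that opens a database connection once at the beginning
# and closes it once at the end of the crawl.
class DatabaseSpider(scrapy.Spider):
    name = "DatabaseExample"
    start_urls = [
        "http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html"
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # connect to the "database" at the beginning (a local SQLite file here);
        # note the "self." prefix, since the connection is an instance variable
        self.conn = sqlite3.connect("crawl.db")

    def closed(self, reason):
        # disconnect from the "database" at the end
        self.conn.close()

    def parse(self, response):
        # the crawling logic would store its results through self.conn here
        pass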

  44. Outline • Overview • Data Crawler which Accesses One Simple Webpage • Data Crawler which Accesses One Simple Webpage and Recursively Accesses Other Webpages • Data Crawler which Accesses One Table Webpage • Data Crawler with Constructor and “Closed” • Data Crawler with a Concise Output

  45. Consider the following command under “C:\Project\testCrawl”
Command Line
scrapy crawl OneWebpage
• We have the following output created by “scrapy”, which is very “noisy”

  46. Output
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: testCrawl)
2017-12-25 22:03:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'testCrawl', 'NEWSPIDER_MODULE': 'testCrawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testCrawl.spiders']}
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
…
2017-12-25 22:03:16 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider opened
2017-12-25 22:03:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 22:03:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/robots.txt> (referer: None)
2017-12-25 22:03:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cse.ust.hk/~raywong/temp/SimpleWebpage.html> (referer: None)
2017-12-25 22:03:16 [OneWebpage] DEBUG: Saved File SimpleWebpage.html
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Closing spider (finished)
…
'start_time': datetime.datetime(2017, 12, 25, 14, 3, 16, 707764)}
2017-12-25 22:03:16 [scrapy.core.engine] INFO: Spider closed (finished)

  47. We could have a “concise” output after we update “settings.py” • Note that we have the following files/folders.

  48. Project/
    testCrawl/
        scrapy.cfg
        testCrawl/
            __init__.py
            items.py
            pipelines.py
            middlewares.py
            settings.py
            spiders/
                __init__.py

  49. settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for testCrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'testCrawl'

SPIDER_MODULES = ['testCrawl.spiders']
NEWSPIDER_MODULE = 'testCrawl.spiders'

# Add the following settings:

# log to file instead of the console
LOG_FILE = 'log.txt'

# only log error messages, ignore irrelevant messages;
# to know more about log levels, see
# https://doc.scrapy.org/en/latest/topics/logging.html#log-levels
LOG_LEVEL = 'ERROR'

# do not redirect standard output to the log file,
# i.e., we want the output from the print() method to be shown in the console
LOG_STDOUT = False

  50. After that, we re-type the following command under “C:\Project\testCrawl” Command Line scrapy crawl OneWebpage
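
As an alternative (an assumed shortcut, not shown in the slides), individual settings can also be overridden for a single run with the “-s” option instead of editing “settings.py”:

Command Line
scrapy crawl OneWebpage -s LOG_LEVEL=ERROR -s LOG_FILE=log.txt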
