590 Scraping – NER shape features. Topics Scrapy – items.py Readings: Scrapy documentation. April 4, 2017. Today. Scrapers from scrapy_documentation loggingSpider.py openAllLinks.py Cleaning NLTK data Removing common words Testing in Python unittest Testing websites. Scrapy notes.
590 Scraping – NER shape features • Topics • Scrapy – items.py • Readings: • Scrapy documentation April 4, 2017
Today • Scrapers from scrapy_documentation • loggingSpider.py • openAllLinks.py • Cleaning NLTK data • Removing common words • Testing in Python • unittest • Testing websites
Scrapy notes • Focused narrow scrape (one domain) • Broad scrapes – better suited to • Dealing with JavaScript in Scrapy
Selenium and Scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        # Load the page in a headless browser so its JavaScript runs
        driver = webdriver.PhantomJS()
        driver.get(request.url)
        body = driver.page_source
        # Returning a response here replaces Scrapy's own download of the page
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
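A downloader middleware such as JSMiddleware only runs once it is listed in the project's DOWNLOADER_MIDDLEWARES setting. A minimal sketch of that fragment follows; the module path 'conifers.middlewares.JSMiddleware' is an assumption about where the class is saved, and 543 is just a conventional order value.

```python
# settings.py (fragment)
# 'conifers.middlewares.JSMiddleware' is an assumed module path; change it
# to wherever the class actually lives in your project.
DOWNLOADER_MIDDLEWARES = {
    'conifers.middlewares.JSMiddleware': 543,  # order value, assumed
}
```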
Cfg file • Populating the settings • Settings can be populated using different mechanisms, each with a different precedence. Here they are in decreasing order of precedence: • 1. Command line options (highest precedence) • 2. Settings per-spider • 3. Project settings module • 4. Default settings per-command • 5. Default global settings (lowest precedence)
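The precedence list above can be pictured with a toy model in plain Python. This is a sketch, not Scrapy's implementation: the class ToySettings, the source names, and the numeric ranks are all illustrative assumptions.

```python
# Toy model of settings precedence (NOT Scrapy's own code).
# Source names follow the list above; the numeric ranks are illustrative.
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class ToySettings:
    def __init__(self):
        self._values = {}  # setting name -> (rank, value)

    def set(self, name, value, source):
        rank = PRIORITIES[source]
        current = self._values.get(name)
        # Keep the new value only if its source ranks at least as high
        if current is None or rank >= current[0]:
            self._values[name] = (rank, value)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[1]


settings = ToySettings()
settings.set("DOWNLOAD_DELAY", 0, "default")
settings.set("DOWNLOAD_DELAY", 3, "project")
settings.set("DOWNLOAD_DELAY", 0.25, "cmdline")
settings.set("DOWNLOAD_DELAY", 1, "spider")  # ignored: cmdline outranks spider
print(settings.get("DOWNLOAD_DELAY"))  # 0.25
```

The order in which the values arrive does not matter; only the rank of the source does.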
Command line settings • scrapy crawl myspider -s LOG_FILE=scrapy.log
DEPTH_LIMIT • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
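To see what a depth limit does, here is a small sketch that walks a made-up link graph instead of a real site. The LINKS dictionary and the crawl function are hypothetical stand-ins; the start page counts as depth 0 and a limit of 0 means no limit, as on the slide.

```python
from collections import deque

# Hypothetical link graph standing in for a website: page -> linked pages
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(start, depth_limit=0):
    """Breadth-first walk; depth_limit=0 means no limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            # Drop links that would go deeper than the limit allows
            if depth_limit and depth + 1 > depth_limit:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", depth_limit=1))  # ['/', '/a', '/b']
print(crawl("/"))                 # ['/', '/a', '/b', '/a/1', '/a/1/x']
```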
DEPTH_PRIORITY • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • An integer that is used to adjust the request priority based on its depth: • if zero (default), no priority adjustment is made from depth • a positive value will decrease the priority, i.e., higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO) • a negative value will increase priority, i.e., higher depth requests will be processed sooner (DFO) • See also: "Does Scrapy crawl in breadth-first or depth-first order?" about tuning Scrapy for BFO or DFO.
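The effect of the sign of DEPTH_PRIORITY can be sketched with a small priority queue. This is a toy under stated assumptions: the adjustment is modeled as priority minus depth times DEPTH_PRIORITY, which matches the description above, and the function names are made up for illustration. Real breadth-first crawling in Scrapy also involves the scheduler queue settings, which this sketch ignores.

```python
import heapq


def adjusted_priority(priority, depth, depth_priority):
    # Positive depth_priority lowers the priority of deeper requests,
    # negative raises it, zero leaves it unchanged.
    return priority - depth * depth_priority


def processing_order(requests, depth_priority):
    """requests: list of (url, depth). Higher priority is processed first."""
    heap = []
    for index, (url, depth) in enumerate(requests):
        prio = adjusted_priority(0, depth, depth_priority)
        # heapq is a min-heap, so negate; index keeps ties in arrival order
        heapq.heappush(heap, (-prio, index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pending = [("/a/1", 2), ("/", 0), ("/a", 1)]
print(processing_order(pending, 1))   # ['/', '/a', '/a/1']  shallow first
print(processing_order(pending, -1))  # ['/a/1', '/a', '/']  deep first
print(processing_order(pending, 0))   # ['/a/1', '/', '/a']  arrival order
```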
DOWNLOAD_DELAY • Default: 0 • The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example: • DOWNLOAD_DELAY = 0.25 # 250 ms of delay • This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
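The randomized wait described above is easy to check numerically. The helper next_delay below is a hypothetical illustration, not a Scrapy function; it just draws a value between 0.5 and 1.5 times the configured delay.

```python
import random


def next_delay(download_delay, randomize=True, rng=random):
    """Return the wait before the next request to the same site."""
    if not randomize:
        return download_delay
    # Random interval between 0.5 * delay and 1.5 * delay
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


rng = random.Random(0)  # seeded so repeated runs give the same draws
delays = [next_delay(0.25, rng=rng) for _ in range(5)]
print(all(0.125 <= d <= 0.375 for d in delays))  # True
print(next_delay(0.25, randomize=False))         # 0.25
```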
Items.py
import scrapy

class ConifersItem(scrapy.Item):
    # Fields collected for each conifer
    name = scrapy.Field()
    genus = scrapy.Field()
    species = scrapy.Field()
Middleware.py
from scrapy import signals

class ConifersSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Settings.py
BOT_NAME = 'conifers'

SPIDER_MODULES = ['conifers.spiders']
NEWSPIDER_MODULE = 'conifers.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'conifers (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
Pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class ConifersPipeline(object):
    def process_item(self, item, spider):
        # Pass every item through unchanged
        return item
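The default pipeline returns items untouched. As a sketch of what a pipeline can do, the hypothetical FlattenFieldsPipeline below turns the lists that .extract() produces into plain strings. The class name and the sample values are made up, and a plain dict stands in for a ConifersItem so the example runs without Scrapy.

```python
class FlattenFieldsPipeline(object):
    """Hypothetical pipeline: turn lists of text fragments into strings."""

    def process_item(self, item, spider):
        for field in list(item.keys()):
            value = item[field]
            if isinstance(value, list):
                # Join the fragments and trim surrounding whitespace
                item[field] = " ".join(part.strip() for part in value).strip()
        return item


# A plain dict stands in for a ConifersItem; the values are made-up samples
raw = {"name": [" Korean fir "], "genus": ["Abies"], "species": ["koreana"]}
clean = FlattenFieldsPipeline().process_item(raw, spider=None)
print(clean)  # {'name': 'Korean fir', 'genus': 'Abies', 'species': 'koreana'}
```

Like any pipeline, it would only run in a project after being added to ITEM_PIPELINES.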
coniferSpider
import scrapy
from conifers.items import ConifersItem

class ConiferSpider(scrapy.Spider):
    name = "conifer"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://greatplantpicks.org/by_plant_type/conifer']

    def parse(self, response):
        # Save the raw page so its structure can be inspected offline
        #filename = response.url.split("/")[-2] + '.html'
        filename = 'conifers' + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
import scrapy
from conifers.items import ConifersItem
#from scrapy.selector import Selector
#from scrapy.http import HtmlResponse

class ConifersextractSpider(scrapy.Spider):
    name = "conifersExtract"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://www.greatplantpicks.org/plantlists/by_plant_type/conifer']

    def parse(self, response):
        # One table row per plant; each XPath below is relative to the row
        for sel in response.xpath('//tbody/tr'):
            item = ConifersItem()
            item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
            item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
            item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
            yield item
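The row-by-row extraction in parse() can be tried offline with the standard library. The sketch below uses xml.etree.ElementTree as a stand-in for Scrapy selectors (it supports only a subset of XPath, so .findtext() replaces text()), and SAMPLE is invented markup that imitates the table layout the spider's XPaths assume, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mirrors the structure the XPaths expect
SAMPLE = """
<table><tbody>
  <tr>
    <td class="common-name"><a href="#">Korean fir</a></td>
    <td class="plantname"><a href="#"><span class="genus">Abies</span> <span class="species">koreana</span></a></td>
  </tr>
</tbody></table>
"""

root = ET.fromstring(SAMPLE.strip())
rows = []
for tr in root.findall(".//tbody/tr"):
    # Paths are relative to the current row, as in the spider
    rows.append({
        "name": tr.findtext("td[@class='common-name']/a"),
        "genus": tr.findtext("td[@class='plantname']/a/span[@class='genus']"),
        "species": tr.findtext("td[@class='plantname']/a/span[@class='species']"),
    })

print(rows)  # [{'name': 'Korean fir', 'genus': 'Abies', 'species': 'koreana'}]
```

Unlike .extract(), which returns a list of strings, findtext() returns a single string or None.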