Learn about web scraping, testing websites, and cleaning natural language data using NLTK. Topics include Scrapy, logging spider, openAllLinks, and removing common words.
590 Web Scraping – Testing • Topics • Chapter 13 – Testing • Readings: • Text – Chapter 13 • April 4, 2017
Today • Scrapers from the Scrapy documentation • loggingSpider.py • openAllLinks.py • Cleaning NLTK data • Removing common words • Testing in Python • unittest • Testing websites
Rest of the semester • Tuesday April 4 • Thursday April 6 • Tuesday April 11 • Thursday April 13 – Test 2 • Tuesday April 18 • Thursday April 20 • Tuesday April 25 – Reading Day • Tuesday May 2 – 9:00 a.m. EXAM
Test 2 • 50% in class • 50% take-home
Exam – Scraping project • Proposal statement (April 11) – one-sentence description • Project description (April 18) • Demo (May 2)
Cleaning Natural Language data • Removing common words • Corpus of Contemporary American English (COCA) • http://corpus.byu.edu/coca • In addition to the online interface, you can download extensive data for offline use – full-text, word-frequency, n-gram, and collocate data. You can also access the data via WordAndPhrase (including the ability to analyze entire texts that you input).
Most common words in English • The first 25 make up 1/3 of English text • The first 100 make up ½ • common = ['the', 'be', …] • if isCommon(word) …
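A minimal sketch of the idea above: filter tokens against a small set of the most common English words before doing any frequency analysis. The short common list and the isCommon helper are illustrative stand-ins, not part of the COCA data or the textbook code.

# Hypothetical sketch: strip very common English words from a token list.
# In practice the set would be loaded from the COCA frequency data
# (or nltk.corpus.stopwords) rather than typed in by hand.
common = {'the', 'be', 'and', 'of', 'a', 'in', 'to', 'have', 'it', 'i',
          'that', 'for', 'you', 'he', 'with', 'on', 'do', 'say', 'this', 'they'}

def isCommon(word):
    return word.lower() in common

def removeCommon(tokens):
    return [w for w in tokens if not isCommon(w)]

print(removeCommon(['The', 'time', 'machine', 'by', 'H', 'G', 'Wells']))
# ['time', 'machine', 'by', 'H', 'G', 'Wells']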
More Scrapy • Logging spider • openAllLinks • LxmlLinkExtractor
loggingSpider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

(scrapy documentation, page 36)
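A standalone spider like this can be tried without creating a full Scrapy project by using the runspider command; the exact log output depends on your Scrapy version and logging settings.

scrapy runspider loggingSpider.py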
openAllLinks.py

# multiple Requests and items from a single callback
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        ...
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

(scrapy documentation, page 36)
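Note that //a/@href often returns relative URLs, which scrapy.Request cannot fetch on its own. In Scrapy 1.4 and later, response.follow resolves the link against the current page first; the variant below is a sketch of that approach, not part of the documentation example above.

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            # response.follow resolves relative links against response.url
            yield response.follow(href, callback=self.parse)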
LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(
    allow=(),
    deny=(),
    allow_domains=(),
    deny_domains=(),
    deny_extensions=None,
    restrict_xpaths=(),
    restrict_css=(),
    tags=('a', 'area'),
    attrs=('href',),
    canonicalize=True,
    unique=True,
    process_value=None)

(scrapy documentation)
allow (a regular expression (or list of)) – a single regular expression (or list) that (absolute) urls must match in order to be extracted • deny (a regular expression (or list of)) – a single regular expression (or list) that (absolute) urls must match in order to be excluded (not extracted) • allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links • deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links • deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package. scrapy documentation
restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. • restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area'). • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract. Defaults to ('href',). scrapy documentation
canonicalize (boolean) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to True. • unique (boolean) – whether duplicate filtering should be applied to extracted links. • process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x. scrapy documentation
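A small sketch of how these parameters are typically used, here inside a CrawlSpider rule. The spider name, domain, URL pattern, and XPath are placeholders for illustration, not values from the slides; LinkExtractor is the documented alias for LxmlLinkExtractor.

from scrapy.linkextractors import LinkExtractor   # default implementation is LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Only follow links under /articles/ that appear inside the main content div
        Rule(LinkExtractor(allow=(r'/articles/',),
                           restrict_xpaths=('//div[@id="content"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url,
               'title': response.xpath('//h1/text()').extract_first()}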
Chapter 13 Testing • 1-wikiUnitTest.py • 2-wikiSeleniumTest • 3-interactiveTest • 4-dragAndDrop • 5-takeScreenshot • 6-combinedTest • ghostdriver
Unit Testing – JUnit

public class MyUnit {

    public String concatenate(String one, String two){
        return one + two;
    }
}

http://tutorials.jenkov.com/java-unit-testing/simple-test.html
import org.junit.Test;
import static org.junit.Assert.*;

public class MyUnitTest {

    @Test
    public void testConcatenate() {
        MyUnit myUnit = new MyUnit();
        String result = myUnit.concatenate("one", "two");
        assertEquals("onetwo", result);
    }
}

http://tutorials.jenkov.com/java-unit-testing/simple-test.html
Python unittest • Comes standard with Python • Import and extend unittest.TestCase • setUp – run before each test to initialize the test case • tearDown – run after each test • Provides several types of asserts • Runs all functions that begin with test_ as unit tests
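A minimal sketch (not from the textbook) showing how setUp and tearDown wrap each test_ method; the Counter class is a made-up stand-in for whatever object the tests exercise.

import unittest

class Counter:
    # Hypothetical class under test, used only to illustrate setUp/tearDown
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1

class TestCounter(unittest.TestCase):
    def setUp(self):
        # Runs before every test_ method
        self.counter = Counter()

    def tearDown(self):
        # Runs after every test_ method
        self.counter = None

    def test_increment(self):
        self.counter.increment()
        self.assertEqual(self.counter.value, 1)

if __name__ == '__main__':
    unittest.main()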
unittest Example

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
1-wikiUnitTest.py

from urllib.request import urlopen
from urllib.parse import unquote
import random
import re
from bs4 import BeautifulSoup
import unittest

class TestWikipedia(unittest.TestCase):
    bsObj = None
    url = None
    def test_PageProperties(self):
        global bsObj
        global url
        url = "http://en.wikipedia.org/wiki/Monty_Python"
        # Test the first 100 pages we encounter
        for i in range(1, 100):
            bsObj = BeautifulSoup(urlopen(url))
            titles = self.titleMatchesURL()
            self.assertEquals(titles[0], titles[1])
            self.assertTrue(self.contentExists())
            url = self.getNextLink()
        print("Done!")
    def titleMatchesURL(self):
        global bsObj
        global url
        pageTitle = bsObj.find("h1").get_text()
        urlTitle = url[(url.index("/wiki/")+6):]
        urlTitle = urlTitle.replace("_", " ")
        urlTitle = unquote(urlTitle)
        return [pageTitle.lower(), urlTitle.lower()]
    def contentExists(self):
        global bsObj
        content = bsObj.find("div", {"id": "mw-content-text"})
        if content is not None:
            return True
        return False

    def getNextLink(self):
        global bsObj
        links = bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
        link = links[random.randint(0, len(links)-1)].attrs['href']
        print("Next link is: " + link)
        return "http://en.wikipedia.org" + link
if __name__ == '__main__':
    unittest.main()
2-wikiSeleniumTest

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
# If the title does not contain "Monty Python", an AssertionError is raised here
assert "Monty Python" in driver.title
print("Monty Python was not in the title")
driver.close()
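PhantomJS has been unmaintained for years and newer Selenium releases have dropped support for it. Assuming chromedriver is installed and on your PATH, the same check can be run with headless Chrome, roughly as follows; option and keyword names may vary slightly with your Selenium version.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on your PATH
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
assert "Monty Python" in driver.title
driver.close()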
3-interactiveTest

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
# driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')

driver.get("http://pythonscraping.com/pages/files/form.html")
firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")

### METHOD 1 ###
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()
### METHOD 2 ###
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan").click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()
################

print(driver.find_element_by_tag_name("body").text)
driver.close()
4-dragAndDrop

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
# driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')

driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)
5-takeScreenshot

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
driver.implicitly_wait(5)
driver.get('http://www.pythonscraping.com/')
driver.get_screenshot_as_file('tmp/pythonscraping.png')
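One practical caveat not on the slide: get_screenshot_as_file typically returns False rather than raising when the file cannot be written, for example because the tmp/ directory does not exist yet. Creating the directory first avoids a silently missing screenshot; a small sketch:

import os

os.makedirs('tmp', exist_ok=True)   # make sure the output directory exists
driver.get_screenshot_as_file('tmp/pythonscraping.png')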
6-combinedTest

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains
import unittest
class TestAddition(unittest.TestCase):
    driver = None

    def setUp(self):
        global driver
        # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
        driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
        # driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
        url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
        driver.get(url)
    def tearDown(self):
        print("Tearing down the test")

    def test_drag(self):
        global driver
        element = driver.find_element_by_id("draggable")
        target = driver.find_element_by_id("div2")
        actions = ActionChains(driver)
        actions.drag_and_drop(element, target).perform()
        self.assertEqual("You are definitely not a bot!", driver.find_element_by_id("message").text)

if __name__ == '__main__':
    unittest.main()