590 Web Scraping – Handling Images
• Topics
  • CAPTCHAs
  • Pillow
  • Tesseract -- OCR
• Readings:
  • Text – chapter 11
April 11, 2017
CAPTCHA
• A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human.
• The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.
https://en.wikipedia.org/wiki/CAPTCHA
Computer Vision Mitchell, Ryan. Web Scraping with Python
Optical Character Recognition
• Extracting information from scanned documents
• Python is a fantastic language for:
  • image processing and reading,
  • image-based machine learning, and
  • even image creation.
• Libraries for image processing:
  • Pillow and Tesseract
  • http://pillow.readthedocs.org/en/3.0.x/ and
  • https://pypi.python.org/pypi/pytesseract
Mitchell, Ryan. Web Scraping with Python
Pillow
• Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations:
Mitchell, Ryan. Web Scraping with Python
Chapter 11 -- 1-basicImage.py

from PIL import Image, ImageFilter

kitten = Image.open("../files/kitten.jpg")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("kitten_blurred.jpg")
blurryKitten.show()

Mitchell, Ryan. Web Scraping with Python
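Pillow's Image objects also support the cropping, resizing, and mode conversions implied by the previous slide, not just filters. The following is a minimal sketch (not from the book) that reuses the same kitten.jpg file; the crop box and target size are arbitrary values chosen for illustration:

from PIL import Image

kitten = Image.open("../files/kitten.jpg")
# Crop to a (left, upper, right, lower) box, then shrink the result
cropped = kitten.crop((0, 0, 200, 200))
small = cropped.resize((100, 100))
# Convert to grayscale ("L" mode), a common first step before OCR preprocessing
gray = small.convert("L")
gray.save("kitten_processed.jpg")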
Tesseract
• Tesseract is an OCR library.
• It is sponsored by Google, a company known for its OCR and machine-learning technologies.
• Tesseract is widely regarded as the best, most accurate open source OCR system available.
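The pytesseract package linked on the earlier slide is a thin Python wrapper around the tesseract binary. As a minimal sketch (assuming Tesseract is already installed and that an image file such as text.png exists), it can stand in for the subprocess calls used in the next example:

from PIL import Image
import pytesseract

# Hand the opened image to Tesseract and get the recognized text back as a string
text = pytesseract.image_to_string(Image.open("text.png"))
print(text)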
Chapter 11 -- 2-cleanImage.py

from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Set a threshold value for the image, and save
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)

    # Call tesseract to do OCR on the newly created image
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open and read the resulting data file
    outputFile = open("output.txt", 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile("text_2.png", "text_2_clean.png")

Mitchell, Ryan. Web Scraping with Python
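A note on the threshold: image.point() maps every pixel through the lambda, so pixels darker than 143 become pure black and everything else pure white. In the book's text_2.png example this strips out the gray background, leaving clean black-on-white text that Tesseract reads far more accurately; the value 143 is tuned to that particular image and would likely need adjusting for other inputs.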
Installing Tesseract
• For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
• Linux users can install Tesseract with apt-get:
  $ sudo apt-get install tesseract-ocr
• Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with a third-party package manager such as Homebrew.
Mitchell, Ryan. Web Scraping with Python
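Once installed, it is worth checking from Python that the tesseract binary is actually on the PATH, since the scripts that follow shell out to it. This is a small sketch (not from the book) using only the standard library:

import shutil
import subprocess

# Look up the tesseract executable on the PATH
tesseractPath = shutil.which("tesseract")
if tesseractPath is None:
    print("tesseract was not found -- check your installation and PATH")
else:
    # Print the installed version (older Tesseract releases report it on stderr)
    result = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
    print(result.stdout or result.stderr)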
NumPy again Mitchell, Ryan. Web Scraping with Python
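The slide gives no detail on why NumPy comes up again here; one plausible reason (an assumption, not something stated in the deck) is that viewing a Pillow image as a NumPy array makes numerical cleanup such as thresholding easy to express. A minimal sketch along those lines, reusing the text_2.png file from the earlier example:

import numpy
from PIL import Image

# Load the image in grayscale and view its pixels as a NumPy array
image = Image.open("text_2.png").convert("L")
pixels = numpy.array(image)

# The same thresholding as cleanFile, expressed as array math
cleaned = numpy.where(pixels < 143, 0, 255).astype(numpy.uint8)
Image.fromarray(cleaned).save("text_2_numpy.png")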
Well-formatted text
• Well-formatted text:
  • Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
  • Has extremely crisp lines if copied or photographed, with no copying artifacts or dark spots
  • Is well-aligned, without slanted letters
  • Does not run off the image, with no cut-off text or margins on the edges of the image
Mitchell, Ryan. Web Scraping with Python
Chapter 11 -- 3-readWebImages.py

import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

#driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
driver = webdriver.Chrome()
driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

# Open the book preview by clicking the cover image
driver.find_element_by_id("img-canvas").click()

# The easiest way to get exactly one of every page
imageList = set()

# Wait for the page to load
time.sleep(10)
print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    # While we can click on the right arrow, move through the pages
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Get any new pages that have loaded (multiple pages can load at once)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)

driver.quit()

# Start processing the images we've collected URLs for with Tesseract
for image in sorted(imageList):
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open("page.txt", "r")
    print(f.read())
    f.close()

Mitchell, Ryan. Web Scraping with Python
Chapter 11 -- 4-solveCaptcha.py

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    # Threshold the image to pure black and white, then pad it with a white border
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)

html = urlopen("http://www.pythonscraping.com/humans-only")
bsObj = BeautifulSoup(html, "html.parser")

# Gather prepopulated form values
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]

captchaUrl = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaUrl, "captcha.jpg")
cleanImage("captcha.jpg")

p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()

# Clean any whitespace characters
f = open("captcha.txt", "r")
captchaResponse = f.read().replace(" ", "").replace("\n", "")
f.close()
print("Captcha solution attempt: " + captchaResponse)

if len(captchaResponse) == 5:
    params = {"captcha_token": captchaToken,
              "captcha_sid": captchaSid,
              "form_id": "comment_node_page_form",
              "form_build_id": formBuildId,
              "captcha_response": captchaResponse,
              "name": "Ryan Mitchell",
              "subject": "I come to seek the Grail",
              "comment_body[und][0][value]": "...and I am definitely not a bot"}
    r = requests.post("http://www.pythonscraping.com/comment/reply/10", data=params)
    responseObj = BeautifulSoup(r.text, "html.parser")
    if responseObj.find("div", {"class": "messages"}) is not None:
        print(responseObj.find("div", {"class": "messages"}).get_text())
else:
    print("There was a problem reading the CAPTCHA correctly!")

Mitchell, Ryan. Web Scraping with Python