Web Scraping Lecture 10 - Selenium. Topics Selenium Webdriver ChromeDriver , PhantomJS Readings: Chapter 10. January 26, 2017. Overview. Last Time: Lecture 8 Slides 1-29 Chapter 9 : the Requests Library – filling out forms 1-simpleForm.py 2-fileSubmission.py 3- cookies.py
Web Scraping Lecture 10 - Selenium • Topics • Selenium Webdriver • ChromeDriver, PhantomJS • Readings: • Chapter 10 January 26, 2017
Overview • Last Time: Lecture 8 Slides 1-29 • Chapter 9: the Requests Library – filling out forms • 1-simpleForm.py • 2-fileSubmission.py • 3-cookies.py • 4-sessionCookies.py • 5-BasicAuth.py • Software Architecture of systems • Today: • Chapter 13: • References: Chapter 13, websites
Selenium Web Driver Big Picture • Big Picture = Software Architecture – how components of the software fit together
References • Windows Installation • YouTube video • https://www.youtube.com/watch?v=V69wc4Tmwjc • Linux Installation • http://blog.likewise.org/2015/01/setting-up-chromedriver-and-the-selenium-webdriver-python-bindings-on-ubuntu-14-dot-04/ • Chrome Driver • https://sites.google.com/a/chromium.org/chromedriver/getting-started • PhantomJS • Selenium Site
JavaScript • <script> alert("This creates a pop-up using JavaScript"); </script> • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3813-3814). O'Reilly Media. Kindle Edition.
jQuery • jQuery is an extremely common library, • used by 70% of the most popular Internet sites and • about 30% of the rest of the Internet. • A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as: • <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script> • jQuery dynamically creates HTML content that appears only after the JavaScript is executed, so a scraper that only downloads the page source will not see that content.
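One way to check for the jQuery import described on this slide is to scan a page's script tags. The following is a minimal sketch using only the Python standard library; the names ScriptSrcCollector, uses_jquery, and sample_page are made up for illustration, and matching on the substring "jquery" is a rough heuristic, not a reliable detector.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def uses_jquery(html):
    """Return True if any script src mentions 'jquery' (heuristic only)."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any("jquery" in src.lower() for src in collector.sources)


# Sample page built around the import shown on the slide
sample_page = """
<html><head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
</head><body><p>Hello</p></body></html>
"""

print(uses_jquery(sample_page))
print(uses_jquery("<html><body><p>No scripts here</p></body></html>"))
```

A positive result suggests that some of the page's content may only exist after JavaScript runs, which is the case where a browser-driven tool such as Selenium becomes useful.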
Google Maps • Embedded in websites
Installation • Not just pip here; there is also a separate ChromeDriver executable that forms the interface between your Python program using Selenium and the browser (in this case Chrome)
ChromeDriver - WebDriver for Chrome • Latest Release: ChromeDriver 2.27 • https://sites.google.com/a/chromium.org/chromedriver/downloads • Pick your OS • Unzip and remember where it is
PhantomJS – headless WebDriver • http://phantomjs.org/download.html
Setting Up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04 • Install Google Chrome for Debian/Ubuntu: • sudo apt-get install libxss1 libappindicator1 libindicator7 • wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb • sudo dpkg -i google-chrome*.deb • sudo apt-get install -f • Install xvfb so we can run Chrome headlessly: • sudo apt-get install xvfb • Source: https://christopher.su/2015/selenium-chromedriver-ubuntu/
ChromeDriver – Ubuntu 14.04 • sudo apt-get install unzip • wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip • unzip chromedriver_linux64.zip • chmod +x chromedriver • sudo mv -f chromedriver /usr/local/share/chromedriver • sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver • sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver • Source: https://christopher.su/2015/selenium-chromedriver-ubuntu/
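After the symlink steps, it can be worth confirming that the chromedriver executable is actually visible on the PATH before starting Selenium. The following is a minimal sketch using the standard library's shutil.which; the helper name find_driver is hypothetical, and the check only confirms that an executable with that name exists, not that its version matches the installed Chrome.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH; re-check the symlink steps")
else:
    print("chromedriver found at", path)
```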
Install Selenium and pyvirtualdisplay • pip install pyvirtualdisplay selenium • Now, we can do stuff like this with Selenium in Python: • from pyvirtualdisplay import Display • from selenium import webdriver • display = Display(visible=0, size=(800, 600)) • display.start() • driver = webdriver.Chrome() • driver.get('http://christopher.su') • print(driver.title)
PhantomJS – headless WebDriver Again • http://phantomjs.org/download.html
XPath Syntax • XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. • Published as a W3C Recommendation in 1999 • Used in languages such as Python, Java, and C# when dealing with XML documents. • Although BeautifulSoup does not support XPath, many of the other libraries in this book do. • It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular. • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 4051-4056). O'Reilly Media. Kindle Edition.
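To make the comparison with CSS selectors concrete, the selector mytag#idname roughly corresponds to the XPath expression //mytag[@id='idname']. The following is a minimal sketch using Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath (hence the .// prefix); the sample document and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small made-up XML document with two <mytag> elements
doc = """
<page>
  <mytag id="other">first</mytag>
  <section>
    <mytag id="idname">second</mytag>
  </section>
</page>
"""

root = ET.fromstring(doc)

# ElementTree's XPath subset: find every <mytag> at any depth
# whose id attribute equals "idname"
matches = root.findall(".//mytag[@id='idname']")
texts = [element.text for element in matches]
print(texts)
```

The same kind of expression can be passed to Selenium through By.XPATH, although the full XPath syntax available in a browser is richer than the subset shown here.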
Selenium Self Service Carolina Demo
if __name__ == "__main__":
    driver = init_driver()
    password = "MyPassword"
    #password = input("Enter MySC password: ")
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def init_driver():
    # Path to the unzipped ChromeDriver executable
    driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
    # Attach a reusable explicit wait with a 5-second timeout
    driver.wait = WebDriverWait(driver, 5)
    return driver
def lookup(driver, query):
    driver.get("https://my.sc.edu/")
    print("SSC opened")
    try:
        # Wait until the sign-in link is present, then click it
        link = driver.wait.until(EC.presence_of_element_located(
            (By.PARTIAL_LINK_TEXT, "Sign in to")))
        #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found link", link)
        link.click()
        print("Clicked link")
        #button = driver.wait.until(EC.element_to_be_clickable(
        #    (By.NAME, "btnK")))
        #box.send_keys(query)
        #button.click()
    except TimeoutException:
        print("Houston we have a problem First Page")
    # Now try to login (continues inside lookup())
    try:
        user_box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "username")))
        #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found box", user_box)
        user_box.send_keys("01069379")
        print("ID entered")
        passwd_box = driver.wait.until(EC.presence_of_element_located(
            (By.ID, "vipid-password")))
        print("Found password box", passwd_box)
        # password is the module-level variable set in the __main__ block
        passwd_box.send_keys(password)
        print("password entered")
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "submit")))
        print("Found submit button", button)
        #box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Houston we have a problem Login Page")
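The demo relies on explicit waits: driver.wait.until(...) keeps checking a condition until it succeeds or the timeout expires, at which point TimeoutException is raised. The polling idea can be illustrated without a browser. The following is a simplified stand-in, not Selenium's actual implementation; wait_until, fake_lookup, and the parameter names are hypothetical, and it raises Python's built-in TimeoutError rather than Selenium's exception.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third check,
# the way JavaScript-generated content shows up after a delay
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(fake_lookup, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Polling like this is generally preferable to a fixed time.sleep() because the script continues as soon as the element is available and fails with a clear error when it never appears.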