CSCE 590 Web Scraping Lecture 6. Topics More Beautiful Soup Crawling Crawling by Hand Readings: Chapter 3. January 26, 2017. Overview. Last Time: Regular Expressions Again BeautifulSoup findall revisited Today: A little more Beautiful Soup Starting to Crawl Finding links
CSCE 590 Web Scraping Lecture 6 • Topics • More Beautiful Soup • Crawling • Crawling by Hand • Readings: • Chapter 3 January 26, 2017
Overview • Last Time: • Regular Expressions Again • BeautifulSoup findAll revisited • Today: • A little more Beautiful Soup • Starting to Crawl • Finding links • References • Chapter 2 • Chapter 3
Example HTML page for examples 3-6 • http://www.pythonscraping.com/pages/page3.html • Structure • html • body • div id=wrapper • h1 Totally Normal • div id=content • table id=giftList • tr • th • … • th • tr id=gift1
Navigation examples • bsObj.body.h1 – finds the first "h1" tag that is a descendant of the body tag • bsObj.div.findAll("img") – finds the first "div" tag, then finds all "img" tags contained in it
HTML Div and Span • In HTML, span and div elements are used to define parts of a document so that they are identifiable when no other HTML element is suitable. • Other HTML elements such as p (paragraph) and em (emphasis) accurately represent the semantics of the content; the additional use of span and div leads to better accessibility for readers and easier maintainability for authors. • Div = document division, but really just a semantics-free container, used for changing attributes such as color or for associating an image with its caption • Span represents an inline portion of a document, for example words within a sentence. https://en.wikipedia.org/wiki/Span_and_div
Some HTTP Status Codes • https://docs.python.org/3/library/http.html
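The page linked above documents the standard library's http.HTTPStatus enumeration. As a minimal sketch, the codes below are ones a scraper is likely to meet; the selection and the helper name is_success are illustrative assumptions, not part of the lecture.

```python
from http import HTTPStatus

# A few status codes a scraper commonly runs into (the selection is illustrative).
for code in (200, 301, 403, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Hypothetical helper: True for any 2xx status code."""
    return 200 <= code < 300
```

A crawler would typically check the status before handing the body to BeautifulSoup.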
#Chapter 2 – findDescendants.py
    # Despite the file name, this code iterates over children
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    for child in bsObj.find("table", {"id": "giftList"}).children:
        print(child)
• Replacing the call to children with a call to descendants catches many more tags
#Chapter 2: 4-findSiblings.py
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
        print(sibling)
• Finds the table, then the first "tr" tag (the first row), then repeatedly moves to the next row and prints it
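The difference between children, descendants, and next_siblings is easier to see on a small inline document than on the live page. The sketch below uses a made-up miniature of the gift table (the rows and prices are hypothetical) and requires the bs4 package.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the gift table, written without whitespace
# between tags so that no stray text nodes appear among the children.
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr id="gift1"><td>Vase</td><td>$15</td></tr>'
        '<tr id="gift2"><td>Lamp</td><td>$20</td></tr>'
        '</table>')
table = BeautifulSoup(html, "html.parser").find("table", {"id": "giftList"})

children = list(table.children)        # direct children only: the three rows
descendants = list(table.descendants)  # rows, cells, and the text inside each cell
later_rows = [row["id"] for row in table.tr.next_siblings]  # rows after the header

print(len(children), len(descendants), later_rows)
```

On a real page the whitespace between tags also shows up as text nodes, so the counts are usually larger than the tag structure alone suggests.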
Code snippets from Text • All of the code from the text is available online (again): • https://github.com/REMitchell/python-scraping • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Location 103). O'Reilly Media. Kindle Edition. • In class we will start omitting essential parts of the code in order to shorten it for presentation (of course it will not run as shown!) • Imports – the usual suspects • Exception handling! • The code in the text is not production code because of things like its exception handling!
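As a reminder of what the omitted exception handling might look like, here is a minimal sketch built on the standard library's urllib. The function name fetch and the 10-second timeout are assumptions for illustration, not the textbook's code.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error", err.code, "for", url)
    except URLError as err:
        # The resource could not be reached (bad host name, refused connection, ...).
        print("Could not reach", url, ":", err.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. Callers then test the result for None before parsing it.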
Chapter 2: 5-findParents.py • print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
Chapter 2: 6-regularExpressions.py
    …
    # Raw string, so the backslashes reach the regex engine unchanged
    images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
    for image in images:
        print(image["src"])
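The pattern can be tried without fetching the page. A small sketch with the standard re module follows; the candidate src values are made up for illustration, and search is used on the assumption that BeautifulSoup matches the pattern anywhere in the attribute value.

```python
import re

# Same pattern as on the slide, written as a raw string; "/" needs no escaping.
gift_image = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values for illustration.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img6.jpg",
    "../img/gifts/logo.jpg",
    "../img/gifts/img1.png",
]

# Keep only the values in which the pattern is found.
matches = [src for src in candidates if gift_image.search(src)]
print(matches)
```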
Chapter 2: 7-lambdaExpressions.py
    …
    tags = bsObj.findAll(lambda tag: len(tag.attrs) == 2)
    for tag in tags:
        print(tag)
Accessing Attributes • myTag.attrs – returns the dictionary of attributes and values for myTag • So • myTag.attrs['href'] would return the value of the href attribute of myTag
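A short sketch of attribute access on an inline fragment (the HTML is hypothetical, and bs4 must be installed). It uses find_all, the current name of the findAll method shown on the slides, and reuses the two-attribute lambda filter from 7-lambdaExpressions.py.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = ('<div>'
        '<a href="/wiki/Kevin_Bacon" title="Kevin Bacon">Kevin Bacon</a>'
        '<img src="photo.jpg"/>'
        '<a href="/wiki/Footloose">Footloose</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")
print(first_link.attrs)              # dictionary of attribute names to values
print(first_link.attrs["href"])      # raises KeyError if the attribute is missing
print(soup.find("img").get("href"))  # .get returns None instead of raising

# Tags with exactly two attributes, as in 7-lambdaExpressions.py.
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
print(two_attr_tags)
```

Checking 'href' in tag.attrs, or using .get, avoids a KeyError on anchors that have no href.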
Chapter 3: Crawling • Each web document is the root node of a tree. • Well, actually a graph • Well, actually a multigraph • Well, actually a multigraph with loops • Six Degrees of Kevin Bacon example (Wikipedia style) • Find the shortest path in Wikipedia pages from a given page to the page for Kevin Bacon
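The shortest-path question can be sketched with breadth-first search over a link graph. The graph below is a made-up stand-in for Wikipedia pages, with a repeated link and a self-link to mirror the "multigraph with loops" remark; the function name shortest_path is an assumption.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search: return a shortest list of pages from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == goal:
            return path
        for nxt in links.get(page, []):
            if nxt not in visited:  # skips repeated links and loops
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical link graph: page -> pages it links to.
links = {
    "Start": ["A", "B", "A"],        # repeated link (multigraph)
    "A": ["A", "C"],                 # self-link (loop)
    "B": ["Start", "Kevin_Bacon"],
    "C": ["Kevin_Bacon"],
    "Kevin_Bacon": [],
}

print(shortest_path(links, "Start", "Kevin_Bacon"))
```

On the real site the dictionary lookup would be replaced by fetching a page and extracting its links, which is what the rest of the chapter builds toward.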
How do you want to crawl? • First we must limit the search somehow. Why? • Depth First • Breadth First • Random
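One way to compare the three options is to keep a single crawl loop and change only how the next page is taken from the frontier. The sketch below is an illustration under assumptions: the function name crawl_order, the depth limit max_depth, and the toy graph are not from the text. Pages are marked as seen when they are queued, so the "dfs" option is a last-in-first-out order in the depth-first style rather than an exact copy of recursive depth-first search.

```python
import random
from collections import deque


def crawl_order(graph, start, strategy="bfs", max_depth=2, rng=None):
    """Return the visit order, never expanding pages more than max_depth links from start."""
    rng = rng or random.Random()
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        if strategy == "bfs":
            page, depth = frontier.popleft()      # oldest entry first
        elif strategy == "dfs":
            page, depth = frontier.pop()          # newest entry first
        else:
            index = rng.randrange(len(frontier))  # random entry
            page, depth = frontier[index]
            del frontier[index]
        order.append(page)
        if depth < max_depth:                     # the limit that keeps the crawl finite
            for nxt in graph[page]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return order


# Toy graph: vertex -> list of linked vertices.
graph = {1: [4, 5], 2: [1, 3], 3: [1, 2], 4: [1, 2, 3], 5: [2, 4, 6], 6: [2, 5]}

print(crawl_order(graph, 1, "bfs", max_depth=1))
print(crawl_order(graph, 1, "bfs"))
print(crawl_order(graph, 1, "dfs"))
print(crawl_order(graph, 1, "random"))
```

Without the depth limit (or some other bound) a crawl of the real web would effectively never finish, which is one answer to the "Why?" above.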
dfs
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Thu Jan 26 06:47:35 2017
    @author: matthews
    """

    def dfs(graph, vertex, path=None):
        """Return the vertices reachable from vertex, in depth-first order."""
        if path is None:  # a default of [] would be shared between calls
            path = []
        path.append(vertex)
        for newv in graph[vertex]:
            if newv not in path:
                path = dfs(graph, newv, path)
        return path

    graph = {1: [4, 5],
             2: [1, 3],
             3: [1, 2],
             4: [1, 2, 3],
             5: [2, 4, 6],
             6: [2, 5]}

    print(dfs(graph, 1))
Chapter 3: findLinks.py
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
    bsObj = BeautifulSoup(html, "html.parser")

    # Print the target of every anchor tag that has an href attribute
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])
Good Links – Bad Links • Bad links • //wikimediafoundation.org/wiki/Privacy_policy • //en.wikipedia.org/wiki/Wikipedia:Contact_us • /wiki/Category:Articles_with_statements_from_April_2014 • /wiki/Talk:Kevin_Bacon • Good links have three things in common: • They reside within the div with the id set to bodyContent • The URLs do not contain colons • The URLs begin with /wiki/
Refined inner loop
    # Requires "import re"
    for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$")):
        if 'href' in link.attrs:
            print(link.attrs['href'])
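The regular expression can be checked on its own against sample hrefs. This sketch covers only the two URL conditions (starts with /wiki/, contains no colon); the bodyContent restriction is handled by the find call, not by the pattern. The Footloose and cite_note hrefs are hypothetical examples.

```python
import re

# Same pattern as the slide: begins with /wiki/ and has no colon anywhere after it.
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

hrefs = [
    "/wiki/Kevin_Bacon",
    "/wiki/Footloose_(1984_film)",
    "/wiki/Talk:Kevin_Bacon",
    "/wiki/Category:Articles_with_statements_from_April_2014",
    "//wikimediafoundation.org/wiki/Privacy_policy",
    "#cite_note-1",
]

good = [h for h in hrefs if article_link.match(h)]
print(good)
```

The group ((?!:).)* consumes one character at a time, but only when that character is not a colon, which is how the pattern rejects namespaced pages such as Talk: and Category:.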
Chapter 3: 1-getWikiLinks.py
    #Web Scraping with Python by Ryan Mitchell
    #Chapter 3: 1-getWikiLinks.py
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import datetime
    import random
    import re
(continued on next slide; henceforth leaving off imports, etc.)
    # Seed from system entropy; passing a datetime raises TypeError on Python 3.11+
    random.seed()

    def getLinks(articleUrl):
        html = urlopen("http://en.wikipedia.org" + articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        # Article links: inside bodyContent, start with /wiki/, contain no colon
        return bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))

    links = getLinks("/wiki/Kevin_Bacon")
    while len(links) > 0:
        # Follow a randomly chosen article link
        newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
#Chapter 3: 2-crawlWikipedia.py
    …
    pages = set()

    def getLinks(pageUrl):
        global pages
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            print(bsObj.find(id="mw-content-text").findAll("p")[0])
            print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
        except AttributeError:
            print("This page is missing something! No worries though!")
        # (still inside getLinks)
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    #We have encountered a new page
                    newPage = link.attrs['href']
                    print("----------------\n" + newPage)
                    pages.add(newPage)
                    getLinks(newPage)

    getLinks("")
Surface Web vs Deep Web • Surface Web – the part Google sees • Deep Web – the part Google doesn't see • Forms etc.? • The Dark Web or Darknet • The Dark Web is a collection of websites that are publicly visible, yet hide the IP addresses of the servers that run them. • That means anyone can visit a Dark Web site, but it can be very difficult to figure out where they're hosted, or by whom. • http://www.wired.com/2014/11/hacker-lexicon-whats-dark-web/
#Chapter 3: 3-crawlSite.py
    pages = set()
    # Seed from system entropy; passing a datetime raises TypeError on Python 3.11+
    random.seed()

    #Retrieves a list of all internal links found on a page
    def getInternalLinks(bsObj, includeUrl):
        internalLinks = []
        #Finds all links that begin with a "/" or contain includeUrl
        for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
            if link.attrs['href'] is not None:
                if link.attrs['href'] not in internalLinks:
                    internalLinks.append(link.attrs['href'])
        return internalLinks
3-crawlSite.py continued
    #Retrieves a list of all external links found on a page
    def getExternalLinks(bsObj, excludeUrl):
        externalLinks = []
        #Finds all links that start with "http" or "www" that do
        #not contain the current URL
        for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
            if link.attrs['href'] is not None:
                if link.attrs['href'] not in externalLinks:
                    externalLinks.append(link.attrs['href'])
        return externalLinks
3-crawlSite.py continued
    def splitAddress(address):
        addressParts = address.replace("http://", "").split("/")
        return addressParts

    def getRandomExternalLink(startingPage):
        html = urlopen(startingPage)
        bsObj = BeautifulSoup(html, "html.parser")
        domain = splitAddress(startingPage)[0]
        externalLinks = getExternalLinks(bsObj, domain)
        if len(externalLinks) == 0:
            # No external links on this page: follow a random internal link and retry
            internalLinks = getInternalLinks(bsObj, domain)
            nextPage = internalLinks[random.randint(0, len(internalLinks)-1)]
            if nextPage.startswith("/"):
                # Relative link: make it absolute so urlopen can fetch it
                nextPage = "http://" + domain + nextPage
            return getRandomExternalLink(nextPage)
        else:
            return externalLinks[random.randint(0, len(externalLinks)-1)]
3-crawlSite.py continued
    def followExternalOnly(startingSite):
        # Use the argument, not a hard-coded URL, so each hop starts from the new site
        externalLink = getRandomExternalLink(startingSite)
        print("Random external link is: " + externalLink)
        followExternalOnly(externalLink)

    followExternalOnly("http://oreilly.com")
# Chapter 3: 4-getExternalLinks.py
    pages = set()
    # Seed from system entropy; passing a datetime raises TypeError on Python 3.11+
    random.seed()

    #Retrieves a list of all internal links found on a page
    def getInternalLinks(bsObj, includeUrl):
        includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
        internalLinks = []
        #Finds all links that begin with a "/" or contain includeUrl
        for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
            if link.attrs['href'] is not None:
                if link.attrs['href'] not in internalLinks:
                    if link.attrs['href'].startswith("/"):
                        # Relative link: prepend scheme and host
                        internalLinks.append(includeUrl + link.attrs['href'])
                    else:
                        internalLinks.append(link.attrs['href'])
        return internalLinks
    #Retrieves a list of all external links found on a page
    def getExternalLinks(bsObj, excludeUrl):
        externalLinks = []
        #Finds all links that start with "http" or "www" that do
        #not contain the current URL
        for link in bsObj.findAll("a", href=re.compile(
                "^(http|www)((?!" + excludeUrl + ").)*$")):
            if link.attrs['href'] is not None:
                if link.attrs['href'] not in externalLinks:
                    externalLinks.append(link.attrs['href'])
        return externalLinks
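The internal/external split can also be made with urllib.parse instead of regular expressions. The following is an alternative sketch, not the textbook's method; the function name classify_link is hypothetical, and it treats "www.oreilly.com" and "oreilly.com" as different hosts.

```python
from urllib.parse import urljoin, urlparse


def classify_link(page_url, href):
    """Return ("internal" or "external", absolute URL) for an href found on page_url."""
    absolute = urljoin(page_url, href)  # resolves relative and //host links
    same_site = urlparse(absolute).netloc == urlparse(page_url).netloc
    return ("internal" if same_site else "external", absolute)


page = "http://oreilly.com/index.html"
for href in ("/about", "http://www.example.com/x", "//cdn.example.com/a.js"):
    print(classify_link(page, href))
```

Because urljoin always returns an absolute URL, the result can be passed straight to urlopen without the startswith("/") special case.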
    def getRandomExternalLink(startingPage):
        html = urlopen(startingPage)
        bsObj = BeautifulSoup(html, "html.parser")
        externalLinks = getExternalLinks(bsObj, urlparse(startingPage).netloc)
        if len(externalLinks) == 0:
            print("No external links, looking around the site for one")
            domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
            internalLinks = getInternalLinks(bsObj, domain)
            return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
        else:
            return externalLinks[random.randint(0, len(externalLinks)-1)]
    def followExternalOnly(startingSite):
        externalLink = getRandomExternalLink(startingSite)
        print("Random external link is: " + externalLink)
        followExternalOnly(externalLink)

    followExternalOnly("http://oreilly.com")
# Chapter 3: 5-getAllExternalLinks.py
    #Collects a list of all external URLs found on the site
    allExtLinks = set()
    allIntLinks = set()

    def getAllExternalLinks(siteUrl):
        html = urlopen(siteUrl)
        domain = urlparse(siteUrl).scheme + "://" + urlparse(siteUrl).netloc
        bsObj = BeautifulSoup(html, "html.parser")
        internalLinks = getInternalLinks(bsObj, domain)
        externalLinks = getExternalLinks(bsObj, domain)
        # (still inside getAllExternalLinks)
        for link in externalLinks:
            if link not in allExtLinks:
                allExtLinks.add(link)
                print(link)
        for link in internalLinks:
            if link not in allIntLinks:
                allIntLinks.add(link)
                getAllExternalLinks(link)

    # followExternalOnly never returns, so it is disabled here to let the calls below run
    # followExternalOnly("http://oreilly.com")
    allIntLinks.add("http://oreilly.com")
    getAllExternalLinks("http://oreilly.com")
CSCE 590 HW 2 - Regular expressions • Due Jan 29, Sunday night, 11:55 PM • 1) Give regular expressions that denote the following languages: • a) Strings x such that x starts with 'aa', followed by any number of b's and c's, and then ends in an 'a'. • b) Phone numbers with optional area codes and optional extensions of the form " ext 432". • c) Email addresses • d) A Python function definition that just has pass as the body • e) A link in a web page • 2) What languages (sets of strings that match the re) are denoted by the following regular expressions: • a) (a|b)[cde]{3,4}b • b) \w+\W+ • c) \d{3}-\d{2}-\d{4}
3) Give a regular expression that extracts the "login" for a USC student from their email address "login@email.sc.edu" • (after the match one could use login=match.group(1)) • 4) Write a Python program that processes a file line by line and cleans it by removing (re.sub) • social security numbers (replacing with "<social-security>") • email addresses (replacing with "<email>") • phone numbers (replacing with "<phone-number>") • For extra credit, handle social security numbers by keeping the number but swapping the first three digits with the last three digits of the original string, e.g. 123-45-6789 becomes 789-45-6123.
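As a reminder of the mechanics behind match.group(1) and re.sub, here is a sketch that deliberately uses dates rather than any of the homework's patterns. The pattern and sample strings are made up for illustration.

```python
import re

# Illustration on ISO-style dates, not one of the homework patterns.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

match = date.search("Lecture 6 was given on 2017-01-26.")
year = match.group(1)  # text captured by the first parenthesized group

# \1, \2, \3 in the replacement refer back to the captured groups.
reordered = date.sub(r"\2/\3/\1", "Due 2017-01-29, posted 2017-01-26")

# A plain replacement string simply overwrites every match.
redacted = date.sub("<date>", "Due 2017-01-29")

print(year, reordered, redacted)
```

Group references in the replacement string are what make it possible to rearrange parts of a match instead of discarding them.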