Web Scraping Lecture 11 - Document Encoding. Topics File extensions Txt, utf-8, pdf, docx Readings: Chapter 6. January 26, 2017. Overview. Last Time: Lecture 10 Selenium Webdriver Software Architecture of systems Today: Chapter 6 : document encodings Test 2 - thoughts
Web Scraping Lecture 11 - Document Encoding • Topics • File extensions • Txt, utf-8, pdf, docx • Readings: • Chapter 6 January 26, 2017
Overview • Last Time: Lecture 10 Selenium Webdriver • Software Architecture of systems • Today: • Chapter 6: document encodings • Test 2 - thoughts • References: Chapter 6
File Extensions • f.jpg • f.txt • f.doc • f.pdf • f.docx • f.html • The Internet Engineering Task Force (IETF) stores all of its published documents as HTML, PDF, and text files (see https://www.ietf.org/rfc/rfc1149.txt)
Text • from urllib.request import urlopen • textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt") • print(textPage.read())
Unicode • In the early 1990s, the Unicode Consortium attempted to bring about a universal text encoder by establishing encodings for every character that needs to be used in any text document, in any language. • The goal was to include everything from the • Latin alphabet this book is written in, to • Cyrillic (кириллица), • Chinese pictograms (象形), • math and logic symbols (⨊, ≥), and even • emoticons and • "miscellaneous" symbols, such as the biohazard sign (☣) and peace symbol (☮). • The resulting encoder is UTF-8, which stands for, confusingly, "Universal Character Set - Transformation Format 8 bit".
0100 0011 – C // a first bit of 0 means the byte is a plain ASCII character • Non-ASCII
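The bit pattern on this slide can be checked directly in Python. The sketch below is only an illustration: `utf8_bits` is a hypothetical helper name, and the three sample characters are chosen here, not taken from the lecture. It shows that an ASCII character such as C occupies one byte whose first bit is 0, while non-ASCII characters are spread over several bytes whose first bit is 1.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated 8-bit strings.

    Hypothetical helper, written for this illustration only.
    """
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


# One ASCII character and two non-ASCII characters (sample choices).
for ch in ["C", "é", "☣"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), "byte(s):", utf8_bits(ch))
```

Each non-ASCII character starts with a lead byte (beginning 110 or 1110) followed by continuation bytes that begin with 10.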
2-getUtf8Text.py • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)") • bsObj = BeautifulSoup(html, "html.parser") • content = bsObj.find("div", {"id":"mw-content-text"}).get_text() • content = bytes(content, "UTF-8") • content = content.decode("UTF-8") • print(content)
Meta tag • Most English sites • <meta charset="utf-8" /> • For international sites ?
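One way to handle pages that are not UTF-8 is to look for the declared charset before decoding. The following is a minimal sketch using only the standard library. The helper name `sniff_charset`, the 2048-byte search window, the UTF-8 fallback, and the sample page are all assumptions made for this example, and a regular expression is a rough stand-in for a real HTML parser.

```python
import re

# Matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
CHARSET_RE = re.compile(rb"""<meta[^>]*?charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def sniff_charset(raw_html, default="utf-8"):
    """Guess the declared encoding from the start of a page (hypothetical helper)."""
    match = CHARSET_RE.search(raw_html[:2048])
    return match.group(1).decode("ascii").lower() if match else default


# Made-up sample page, encoded the way an ISO-8859-1 site would serve it.
sample = '<html><head><meta charset="ISO-8859-1" /></head><body>café</body></html>'.encode("iso-8859-1")

encoding = sniff_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

In practice the HTTP Content-Type header may also name a charset, so the meta tag is only one of the hints worth checking.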
CSV - accessing individual columns • # 3-readingCsv.py • from urllib.request import urlopen • from io import StringIO • import csv • data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore') • dataFile = StringIO(data) • csvReader = csv.reader(dataFile) • for row in csvReader: • print("The album \""+row[0]+"\" was released in "+str(row[1]))
4-readingCsvDict.py • from urllib.request import urlopen • from io import StringIO • import csv • data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore') • dataFile = StringIO(data) • dictReader = csv.DictReader(dataFile) • print(dictReader.fieldnames) • for row in dictReader: • print(row)
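Both CSV scripts decode with ('ascii', 'ignore'), which silently drops any non-ASCII character. The sketch below runs on an in-memory sample instead of the course URL so it needs no network access. The two data rows are invented for illustration and are not the contents of MontyPythonAlbums.csv. It compares the lossy decode with a UTF-8 decode and reads columns by name with csv.DictReader.

```python
import csv
from io import StringIO

# Invented sample data, encoded as UTF-8 bytes like a downloaded file would be.
raw = "Name,Year\nCafé Sessions,2001\nPlain Title,1999\n".encode("utf-8")

lossy = raw.decode("ascii", "ignore")  # non-ASCII bytes are dropped without warning
exact = raw.decode("utf-8")            # keeps every character

rows = list(csv.DictReader(StringIO(exact)))
for row in rows:
    print('The album "' + row["Name"] + '" was released in ' + row["Year"])

# The same first row after the lossy decode has lost its accented letter.
print(next(csv.DictReader(StringIO(lossy)))["Name"])
```

If the file's real encoding is unknown, decoding as UTF-8 and handling UnicodeDecodeError is usually a safer starting point than discarding bytes.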
PDF – Portable Document Format • Adobe 1993 • The pain of dealing with Microsoft “doc” files • PDFMiner3K python library
PDFMiner 5-readPdf.py • from pdfminer.pdfinterp import PDFResourceManager, process_pdf • from pdfminer.converter import TextConverter • from pdfminer.layout import LAParams • from io import StringIO • from io import open • from urllib.request import urlopen
def readPDF(pdfFile): • rsrcmgr = PDFResourceManager() • retstr = StringIO() • laparams = LAParams() • device = TextConverter(rsrcmgr, retstr, laparams=laparams) • process_pdf(rsrcmgr, device, pdfFile) • device.close() • content = retstr.getvalue() • retstr.close() • return content • pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") • outputString = readPDF(pdfFile) • print(outputString) • pdfFile.close()
Microsoft .doc and .docx • Proprietary .doc format: the binary file format was difficult to read and poorly supported by other word processors. • In 2008, in an effort to get with the times and adopt a standard used by many other pieces of software, Microsoft decided to use the Office Open XML-based standard, which made the files compatible with open source and other software.
Reading .docx files • from zipfile import ZipFile • from urllib.request import urlopen • from io import BytesIO • from bs4 import BeautifulSoup • wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read() • wordFile = BytesIO(wordFile) • document = ZipFile(wordFile) • xml_content = document.read('word/document.xml') • wordObj = BeautifulSoup(xml_content.decode('utf-8')) • textStrings = wordObj.findAll("w:t") • for textElem in textStrings: • print(textElem.text)
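The same idea (a .docx file is a zip archive whose text lives in word/document.xml) can be sketched with the standard library alone. The example below builds a tiny stand-in archive in memory so that it runs without downloading anything. That stand-in holds only the one XML part and is not a complete Word file, and `docx_text` is a hypothetical helper name. It uses xml.etree.ElementTree in place of BeautifulSoup, which is a substitution made for this sketch.

```python
from io import BytesIO
from zipfile import ZipFile
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_bytes):
    """Return the text of every w:t element in word/document.xml (hypothetical helper)."""
    with ZipFile(BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [t.text or "" for t in root.iter("{%s}t" % W_NS)]


# Build a minimal stand-in archive; a real .docx contains many more parts.
document_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>A Word Document</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)

print(docx_text(buffer.getvalue()))
```

To try it on the file from the slide, the bytes returned by urlopen(...).read() could be passed to docx_text in place of buffer.getvalue().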
Test 1 Thursday • Test Thursday Feb 16. • You can bring a cheat sheet of notes to test 1, subject to the following restrictions: • 8 ½ x 11 sheet of paper • One side only • Handwritten (nothing electronically generated) • Not in black ink. • This will be signed and turned in with your test.
Lectures • Lecture 1 - Overview • Lecture 2 – Python classes, dictionaries, sets etc. • Lecture 3 – BeautifulSoup • Lecture 4 – Regular Expressions • Lecture 5 – Regular Expressions II • Lecture 6 - Crawling • Lecture 7 - Scrapy • Lecture 8 – Storing data • Lecture 9 – Requests library • Lecture 10 – Selenium Web driver
Homework 2: Regular expressions • CSCE 590 HW 2 - Regular expressions, due Sunday night, Jan 29, 11:55PM • Give regular expressions that denote the languages: • { strings x such that x starts with 'aa', followed by any number of b's and c's, and then ends in an 'a' } • Phone numbers with optional area codes and optional extensions of the form " ext 432" • Email addresses • A Python function definition that just has pass as the body • A link in a web page • What languages (sets of strings that match the re) are denoted by the following regular expressions: • (a|b)[cde]{3,4}b • \w+\W+ • \d{3}-\d{2}-\d{4}
Give a regular expression that extracts the "login" for a USC student from their email address "login@email.sc.edu" (after the match one could use login=match.group(1)) • Write a Python program that processes a file line by line and cleans it by removing (re.sub) social security numbers (replacing with ), email addresses (replacing with ""), and phone numbers (replacing with ). • For extra credit, instead of removing Soc-Sec numbers, keep the number but replace the first three digits with the last three and the last three with the first three in the original string, i.e. 123-45-6789 becomes 789-45-6123.
Homework 3 • Monday Feb 6 at 11:55PM • Write a short program named 3_1.py (less than 10 lines) that imports only urlopen and BeautifulSoup and then builds a list of all links ( tag). Then it should process the list one element at a time and print the link. Use the URL https://cse.sc.edu/ . Finally it should print a count of the number of links. • Modify the previous program to obtain 3_2.py, which a) writes to the file "allLinks.txt" and b) writes only the URL, i.e. the value of the href • Run 5-getAllExternalLinks.py from chapter 3 on the URL https://cse.sc.edu/ . Modify the code to handle the exceptions that occur by logging them, then ignoring them and continuing to handle the other links. • Modify 5-getAllExternalLinks.py to check the website 5-getAllExternalLinks.py for "Bad Links" (404 is sufficient).
Homework 4 • Copy the table from the Master Schedule of CSCE courses online (^A^C to select all, then copy), paste it (^V) into Excel, then save it as a CSV file, sched.csv • Write a program, table.py, to grab this same table. • Use Requests to log in to the CSE site (https://cse.sc.edu/user/login?destination=node) and then use BeautifulSoup to prettify the page returned.