COMPSCI 101 S1 2014 Principles of Programming. 33 Web programming. Learning outcomes. At the end of this lecture, students should be able to: use Python libraries to access and process data from the Web Examples and Exercises: Example 1: Opening a URL Case Study 1: Word Count
COMPSCI 101 S1 2014 Principles of Programming
33 Web programming
Learning outcomes
• At the end of this lecture, students should be able to:
  • use Python libraries to access and process data from the Web
• Examples and Exercises:
  • Example 1: Opening a URL
  • Case Study 1: Word Count
  • Case Study 2: Downloading Files
  • Case Study 3: Working on the headers
COMPSCI101
Internet: A collection of networks
• The Internet is a network of networks.
• If you put a device in your home so that your computers can talk to one another, you have a network.
  • A wireless base station, or an Ethernet router, perhaps.
  • You can probably reach printers on your network, or copy files between computers.
• If you now connect your network (through an Internet Service Provider (ISP)) to the global Internet, your network becomes yet another part of the whole Internet.
The World Wide Web
• Tim Berners-Lee wanted a way to create readable documents that could reference material on the Internet in a hypertext format.
• The Web is a set of agreements, started by Tim Berners-Lee:
  • On how to refer to everything on the Internet: the URL (Uniform Resource Locator)
  • On how documents are requested and transferred between computers: HTTP (HyperText Transfer Protocol)
  • On how those documents are formatted, and how they refer to things all over the Internet: HTML (HyperText Markup Language)
HyperText Transfer Protocol (HTTP)
• HTTP is a set of rules that allows browsers to retrieve web documents from servers over the Internet.
• It defines a very simple protocol for how to exchange information between computers.
• It defines the pieces of the communication:
  • The request: what resource do you want, and where is it?
  • The response: okay, here's the type of thing it is (JPEG, HTML, whatever), and here it is.
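To make the request and response pieces concrete, here is a minimal sketch of what the text of an HTTP exchange can look like. The helper names build_request and split_response are made up for this illustration, and the sample response is hand-written, not captured from a real server. In practice urllib.request does this work for you.

```python
def build_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (illustrative helper)."""
    return (
        "GET {} HTTP/1.1\r\n"      # what resource do you want?
        "Host: {}\r\n"             # where is it?
        "Connection: close\r\n"
        "\r\n"                     # a blank line ends the request headers
    ).format(path, host)


def split_response(raw):
    """Split response text into (status line, headers dictionary, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers, body


# Hand-written sample response, for illustration only.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>Hello</p>\n"
)

print(build_request("www.cs.auckland.ac.nz", "/en.html"))
status, headers, body = split_response(sample)
print(status)                     # the server's answer to the request
print(headers["Content-Type"])    # the type of thing it is
print(body)                       # and here it is
```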
Uniform Resource Locators (URL)
• URLs allow us to reference any material anywhere on the Internet.
• A URL is the address used for any web resource.
• URLs have four parts:
  • The protocol to use to reach this resource: http
  • The domain name of the host computer where the resource is
  • The path on the computer to the resource, e.g. courses/compsci101s1c/
  • And the name of the resource (the filename)
• Example: http://www.cs.auckland.ac.nz/en.html
  • Protocol: http
  • Domain name: www.cs.auckland.ac.nz
  • Filename: en.html
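The four parts can also be pulled out of a URL in Python. The sketch below uses urlsplit from the standard urllib.parse module. The function name url_parts and the dictionary keys are chosen for this illustration only.

```python
import posixpath
from urllib.parse import urlsplit


def url_parts(url):
    """Return the four parts of a URL as a dictionary (illustrative helper)."""
    parts = urlsplit(url)
    # Separate the directory path from the name of the resource.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "path": directory,
        "resource": filename,
    }


parts = url_parts("http://www.cs.auckland.ac.nz/en.html")
print(parts["protocol"])   # http
print(parts["domain"])     # www.cs.auckland.ac.nz
print(parts["path"])       # /
print(parts["resource"])   # en.html
```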
Terms
• Web Site
  • A collection of Web pages related to a single topic or theme, normally designed and maintained by a single individual or organization
• Web Page
  • A hypermedia document designed for the WWW
• Web Browser (the client: Internet Explorer, Firefox, Safari, …)
  • Software used to access information on the World Wide Web
  • Sends requests to a web server
  • Knows how to interpret HTML and display it graphically
• Web Server (the server)
  • Software that makes local files available through the web
  • Fulfils requests from a web browser
Accessing a web page
• The client (web browser) runs on the local machine.
• The user requests a web page.
• The web server runs on the destination machine.
• The request is sent to the destination domain.
• The web server accepts the request and finds the web page.
• The web page is sent from the server to the client.
• The client (web browser) displays the page.
[Diagram: Browser, Web Server, web page requested]
Using urllib in Python
• Python has modules that allow you to use these protocols.
• In Python, we can read any URL as if it were a file.
• The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP).
• Add an import statement to your .py file:
    import urllib.request
Example 1: Opening a URL and reading it
• The urlopen() function:
  • Opens the URL url, which can be either a string or a Request object.
  • Creates a file-like object that allows you to read the identified resource.

    import urllib.request

    def viewpage(url):
        con = urllib.request.urlopen(url)   # file-like response object
        contents = con.read()               # the whole resource, as bytes
        print(len(contents))                # number of bytes read

    viewpage("http://www.cs.auckland.ac.nz/courses/compsci101s1c")

  Output:
    23488
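A slightly more defensive variant of Example 1 is sketched below. It closes the connection with a with statement and reports a failure instead of crashing. The function name page_size is made up for this sketch. The demonstration uses a data: URL, which urlopen() also accepts, only so that the code runs without a network connection.

```python
import urllib.error
import urllib.request


def page_size(url):
    """Return the number of bytes at url, or None if it cannot be opened."""
    try:
        # The with statement closes the connection when the block ends.
        with urllib.request.urlopen(url) as con:
            return len(con.read())
    except urllib.error.URLError as err:
        # URLError also covers HTTP errors such as 404 (HTTPError).
        print("Could not open", url, "-", err.reason)
        return None


# Offline demonstration: the data after the comma is the "page".
print(page_size("data:text/plain,Hello%20web"))
```

With a network connection, the same function can be called on the course URL used in Example 1.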
The info() and geturl() methods
• The info() method
  • returns the meta-information of the page, such as headers
• The geturl() method
  • returns the URL of the resource retrieved (this can differ from the URL requested, for example after a redirect)
• Here con is the object returned by urlopen():

    print(con.info())
    print(con.geturl())

  Output:
    Server: Apache
    …
    Content-Type: text/html; charset=UTF-8
    Content-Length: 23488
    Accept-Ranges: bytes
    Date: Mon, 26 May 2014 00:35:46 GMT
    …
    https://www.cs.auckland.ac.nz/courses/compsci101s1c/
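Individual headers can also be read from the object that info() returns, rather than printing them all. The sketch below is one way to do it. The function name page_info and the dictionary keys are made up for this illustration, and the data: URL stands in for a real page so that the code runs offline.

```python
import urllib.request


def page_info(url):
    """Return the final URL and a few headers of the resource at url."""
    with urllib.request.urlopen(url) as con:
        headers = con.info()            # message object holding the headers
        return {
            "url": con.geturl(),        # URL of the resource actually retrieved
            "content_type": headers.get_content_type(),
            "content_length": headers["Content-Length"],   # lookup ignores case
        }


# Offline demonstration with a data: URL standing in for a web page.
info = page_info("data:text/plain;charset=utf-8,hello")
print(info["content_type"])
print(info["content_length"])
```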
Encoding
• Note that the read() method of the object returned by urlopen() returns a bytes object, not a string. This is because there is no way for urlopen() to automatically determine the encoding of the byte stream it receives from the HTTP server.
• Use 'utf-8' for decoding the bytes object.

    con = urllib.request.urlopen("https://www.cs.auckland.ac.nz/courses/compsci101s1c/lectures/words.txt")

  Byte format (what con.read() returns):
    b'The woods are lovely dark and deep\r\nBut …

  Decoded text:
    print(con.read().decode('utf-8'))
    The woods are lovely dark and deep
    …
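The difference between bytes and text can be seen without a network connection. The sketch below decodes the first line shown above, and also defines a hypothetical helper, read_text, that uses the charset named in the Content-Type header when there is one and falls back to 'utf-8' otherwise. That fallback is an assumption of this sketch, not a rule of urllib.

```python
import urllib.request


def read_text(url, default_encoding="utf-8"):
    """Read the resource at url and return it as a string (illustrative helper)."""
    with urllib.request.urlopen(url) as con:
        raw = con.read()    # bytes
        # Use the charset from the Content-Type header if the server sent one.
        charset = con.info().get_content_charset() or default_encoding
    return raw.decode(charset)


# Bytes versus text, using the first line of the sample file.
raw = b"The woods are lovely dark and deep\r\n"
text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)
print(text.splitlines())
```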
Case Study 1: Word Count Revisit
• Task:
  • Complete the following program, which reads a web page, counts the frequency of each word in the page using a dictionary, and prints the dictionary.

    url = "https://www.cs.auckland.ac.nz/courses/compsci101s1c/lectures/words.txt"
    con = urllib.request.urlopen(url)
    contents = con.read().decode('utf-8')
    ...

  Output:
    {'keep': 1, 'promises': 1, 'And': 2, 'sleep': 2, 'But': 1, 'before': 2, 'have': 1, 'to': 3, 'The': 1, 'and': 1, 'dark': 1, 'I': 3, 'miles': 2, 'go': 2, 'deep': 1, 'are': 1, 'lovely': 1, 'woods': 1}
Case Study 1: Word Count Revisit
• Algorithm:
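One possible way to complete the counting step is sketched below. It is not necessarily the algorithm from the lecture. It assumes that words are separated by whitespace and that upper and lower case count as different words, as in the expected output ('And' and 'and' are separate keys). The function name count_words and the sample sentence are made up for this illustration.

```python
def count_words(contents):
    """Return a dictionary mapping each word in contents to its frequency."""
    counts = {}
    for word in contents.split():      # split on spaces and line breaks
        if word in counts:
            counts[word] += 1          # seen before: add one
        else:
            counts[word] = 1           # first time this word appears
    return counts


# Made-up sample text; in the case study, contents comes from the web page.
sample = "the cat and the hat"
print(count_words(sample))   # {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```

In the program above, the last step would then be print(count_words(contents)).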
Case Study 2: Downloading Files
• Task:
  • Complete the get_files() function, which takes a url and a list of filenames as parameters and downloads the list of files into your current working directory.

    file_list = ["words.txt", "sample.txt"]
    url = "http://www.cs.auckland.ac.nz/courses/compsci101s1c/lectures/"
    get_files(url, file_list)
Case Study 2: Downloading Files
• Algorithm:
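A minimal sketch of get_files() follows, as one possible solution rather than the lecture's own. It assumes that url ends with "/" (as in the task), so that each filename can simply be appended to it, and it writes the files in binary mode so that the bytes are saved exactly as received.

```python
import urllib.request


def get_files(url, file_list):
    """Download each file in file_list from url into the current directory."""
    for filename in file_list:
        # Assumes url ends with "/", so url + filename is the file's address.
        with urllib.request.urlopen(url + filename) as con:
            data = con.read()                  # the file contents, as bytes
        with open(filename, "wb") as out:      # "wb": write bytes unchanged
            out.write(data)


# Example call (needs a network connection, so it is left as a comment):
# get_files("http://www.cs.auckland.ac.nz/courses/compsci101s1c/lectures/",
#           ["words.txt", "sample.txt"])
```

Reading each file completely with read() is fine for small text files; very large downloads would be better copied in chunks.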
Case Study 3: Working on the Headers
• Task:
  • Complete the get_headers() function, which reads the headers (string) of a web page and returns a dictionary containing all headers.

    url = "https://www.cs.auckland.ac.nz/courses/compsci101s1c/"
    con = urllib.request.urlopen(url)
    print(get_headers(con.info()))

  Output:
    {'Strict-Transport-Security': 'max-age=31536000', 'Age': '0', 'Server': 'Apache', 'Vary': 'Accept-Encoding', 'X-Webroute-Cache': 'MISS', 'Date': 'Wed, 28 May 2014 00:45:48 GMT', …}
Case Study 3: Working on the Headers
• Algorithm:
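One possible get_headers() is sketched below, again as an assumption about the intended solution. It converts its argument to a string, treats each "Name: value" line as one header, and splits on the first colon only, because values such as dates contain colons themselves. The sample header text reuses header lines shown earlier in this lecture.

```python
def get_headers(headers):
    """Return a dictionary of header names and values from the header text."""
    result = {}
    for line in str(headers).splitlines():
        # Split on the first colon only; the value may contain more colons.
        name, colon, value = line.partition(":")
        if colon:                               # skip lines without a colon
            result[name.strip()] = value.strip()
    return result


# Sample header text, so the sketch runs without a network connection.
sample = (
    "Server: Apache\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Length: 23488\n"
    "Date: Mon, 26 May 2014 00:35:46 GMT\n"
)
print(get_headers(sample))
```

If a header appears more than once, this sketch keeps only the last value.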