540 likes | 635 Views
The Web. EE 122: Intro to Communication Networks Fall 2010 (MW 4-5:30 in 101 Barker) Scott Shenker TAs: Sameer Agarwal, Sara Alspaugh, Igor Ganichev, Prayag Narula http://inst.eecs.berkeley.edu/~ee122/
E N D
The Web EE 122: Intro to Communication Networks Fall 2010 (MW 4-5:30 in 101 Barker) Scott Shenker TAs: Sameer Agarwal, Sara Alspaugh, Igor Ganichev, Prayag Narula http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxsonand other colleagues at Princeton and UC Berkeley
Announcements • Project #1A and HW #2 due in one week • Bspace and web page were out of synch • Granting you a five day extension on project • But: • Get started on part B soon! • Don’t neglect your homework! • My office hours are on Wednesday • See me after class in classroom • Or walk up the hill to Soda when I do • Will be meeting with TAs when there are no students during OH • But come interrupt me if you want to talk to me…..
Goals of Today’s Lecture • History of the web • Key players, why successful • Main ingredients of the Web • URIs, HTML, HTTP • Key properties of HTTP • Request-response, stateless, and resource meta-data • Performance of HTTP • Parallel connections, persistent connections, pipelining • Web infrastructure • Clients, proxies, and servers • Caching vs. replication
The Web – History (I) • 1945: Vannevar Bush, Memex: • "a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility" Vannevar Bush (1890-1974) (See http://www.iath.virginia.edu/elab/hfl0051.html) Memex
The Web – History (II) • 1967, Ted Nelson, Xanadu: • A world-wide publishing network that would allow information to be stored not as separate files but as connected literature • Owners of documents would be automatically paid via electronic means for the virtual copying of their documents • Coined the term “Hypertext” • Influenced research community • Who then missed the web….. Ted Nelson
The Web – History (III) • Physicist trying to solve real problem • Distributed access to data • World Wide Web (WWW): a distributed database of “pages” linked through Hypertext Transport Protocol (HTTP) • First HTTP implementation - 1990 • Tim Berners-Lee at CERN • HTTP/0.9 – 1991 • Simple GET command for the Web • HTTP/1.0 –1992 • Client/Server information, simple caching • HTTP/1.1 - 1996 Tim Berners-Lee
Why So Successful? • What do the web and youtube have in common? • The ability to self-publish • But the self-publishing mechanisms must be: • Technically easy • Independent (not requiring intricate coordination) • Free • Being part of a grand idealistic and collaborative endeavor isn’t what people want • People aren’t looking for Nirvana (or even Xanadu) • They want to make their mark, and find something neat
Moral of the Story • Timing is everything • Internet made vision not only possible, but necessary • Can you imagine with web without the Internet? • Visions are great, but problem solving is better • The best is the enemy of the good • Particularly when it blocks deployment • Nelson on HTML: • “HTML is precisely what we were trying to PREVENT— ever-breaking links, links going outward only, quotes you can't follow to their origins, no version management, no rights management.”
Components of Web Infrastructure • Content • Objects • Clients • Send requests / Receive responses • Servers • Receive requests / Send responses • Store or generate the responses • Proxies • Placed between clients and servers • Act as a server for the client, and a client to the server • Provide extra functions • Caching, anonymization, logging, transcoding, filtering access • Explicit or transparent (“interception”)
Ingredients of Web Implementation • HTML • URIs, URLs, URNs • HTTP
HTML • A Web page has several components • Base HTML file • Referenced objects (e.g., images) • HyperText Markup Language (HTML) • Representation of hypertext documents in ASCII format • Web browsers interpret HTML when rendering a page • Several functions: • Format text, reference images, embed hyperlinks (HREF) • Straight-forward to learn • Syntax easy to understand • Authoring programs can auto-generate HTML • Source almost always available
Uniform Resource Locator (URL) Provides a means to get the resource http://www.ietf.org/rfc/rfc3986.txt Uniform Resource Name (URN) Names a resource independent of how to get it urn:ietf:rfc:3986 is a standard URN for RFC 3986 URI: Uniform Resource Identifier
Names, locations, etc. • Who has used a URN? A URL? • Which one solves a real problem? • Which one represents an idealistic vision? • The dominance of URLs over URNs reflects the lack of a proper naming structure for objects • Naming was central component of our clean slate design • What properties should a URN have? • Not specify anything about object that can change • Cryptographic information about the associated key • Solves real problems: security and mobility of content
URL Syntax protocol://hostname[:port]/directorypath/resource
HTTP • HyperText Transfer Protocol (HTTP) • Client-server protocol for transferring resources • Important properties: • Request-response protocol • Reliance on a global URI namespace • Resource metadata • Stateless • ASCII format % telnet www.icir.org 80 GET /jdoe/ HTTP/1.0 <blank line, i.e., CRLF>
HTTP and TCP • What functions does HTTP leave to TCP? • Would HTTP be harder without layering?
Steps in HTTP Request • HTTP Client initiates TCP connection to server • HTTP Client sends HTTP request to server • HTTP Server responds to request • HTTP Client receives the request • TCP connection terminates How many RTTs for a single request?
Client-to-Server Communication • HTTP Request Message • Request line: method, resource, and protocol version • Request headers: provide information or modify request • Body: optional data (e.g., to “POST” data to the server) request line GET /somedir/page.html HTTP/1.1 Host: www.someschool.edu User-agent: Mozilla/4.0 Connection: close Accept-language: fr (blank line) header lines Not optional carriage return line feed indicates end of message
Client-to-Server Communication • HTTP Request Message • Request line: method, resource, and protocol version • Request headers: provide information or modify request • Body: optional data (e.g., to “POST” data to the server) • Request methods include: • GET: Return current value of resource, run program, … • HEAD: Return the meta-data associated with a resource • POST: Update resource, provide input to a program, … • Headers include: • Useful info for the server • e.g. desired language
Server-to-Client Communication • HTTP Response Message • Status line: protocol version, status code, status phrase • Response headers: provide information • Body: optional data status line (protocol, status code, status phrase) HTTP/1.1 200 OK Connection close Date: Thu, 06 Aug 2006 12:00:15 GMT Server: Apache/1.3.0 (Unix) Last-Modified: Mon, 22 Jun 2006 ... Content-Length: 6821 Content-Type: text/html (blank line) data data data data data ... header lines data e.g., requested HTML file
Server-to-Client Communication • HTTP Response Message • Status line: protocol version, status code, status phrase • Response headers: provide information • Body: optional data • Response code classes • Similar to other ASCII app. protocols like SMTP
Web Server: Generating a Response • Return a file • URL matches a file (e.g.,/www/index.html) • Server returns file as the response • Server generates appropriate response header • Generate response dynamically • URL triggers a program on the server • Server runs program and sends output to client • Return meta-data with no body
HTTP Resource Meta-Data • Meta-data • Info about a resource, stored as a separate entity • Examples: • Size of a resource, last modification time, etc. • Example: Type of the content • Data format classification (e.g., Content-Type: text/html) • Enables browser to automatically launch an appropriate viewer • From e-mail’s Multipurpose Internet Mail Extensions (MIME) • Usage example: Conditional GET Request • Client requests object “If-modified-since” • If unchanged, “HTTP/1.1 304 Not Modified” • No body in the server’s response, only a header
HTTP is Stateless • Stateless protocol • Each request-response exchange treated independently • Servers not required to retain state • This is good • Improves scalability on the server-side • Don’t have to retain info across requests • Can handle higher rate of requests • Order of requests doesn’t matter • This is also bad • Some applications need persistent state • Need to uniquely identify user or store temporary info • e.g., Shopping cart, user preferences/profiles, usage tracking, …
State in a Stateless Protocol:Cookies • Client-side state maintenance • Client stores small(?) state on behalf of server • Client sends state in future requests to the server • Can provide authentication Request Response Set-Cookie: XYZ Request Cookie: XYZ
State in a Stateless Protocol:HTTP Authentication • Tool to limit access to server documents • Basic HTTP Authentication • Client can add an Authorization header to GET request • Base64-encoded concatenation of username, a colon, & password • If client doesn’t provide header, server responds with a 401 Unauthorized and a WWW-Authenticate header • Server does not honor request until valid authorization received • Stateless: Must happen on each request • Is this secure? Is this security? • No. Authentication is not security, but provides a piece
Security Sneak-Peek: HTTPS • Transport Layer Security (TLS) • Came after Secure Sockets Layer (SSL) • Shim between App layer (e.g. HTTP) and Transport layer (e.g. TCP) • Provides authentication and communication privacy
Putting It All Together • Client-Server • Request-Response • HTTP • Stateless • Get state with cookies • Content • URI/URL • HTML • Meta-data
Web Browser • Is the client • Generates HTTP requests • User types URL, clicks a hyperlink or bookmark, clicks “reload” or “submit” • Automatically downloads embedded images • Submits the requests (fetches content) • Via one or more HTTP connections • Presents the response • Parses HTML and renders the Web page • Invokes helper applications (e.g., Acrobat, RealPlayer) • Maintains cache • Stores recently-viewed objects and ensures freshness
Web Browser History • 1990, WorldWideWeb, Tim Berners-Lee, NeXT computer • 1993, Mosaic, Marc Andreessen and Eric Bina • 1994, Netscape • 1995, Internet Explorer • ….
Web Server • Handle client request: • Accept a TCP connection • Read and parse the HTTP request message • Translate the URI to a resource • Determine whether the request is authorized • Generate and transmit the response • Web site vs. Web server • Web site: one or more Web pages and objects united to provide the user an experience of a coherent collection • Web server:program that satisfies client requests for Web resources
5 Minute Break Questions Before We Proceed?
HTTP Performance Most Web pages have multiple objects (“items”) • e.g., HTML file and a bunch of embedded images How do you retrieve those objects? One item at a time What transport behavior does this remind you of?
Fetch HTTP Items: Stop & Wait Server Client Start fetching page Request item 1 Time Transfer item 1 Request item 2 ≥2 RTTs per object Transfer item 2 Request item 3 Transfer item 3 Finish; display page
R2 R3 R1 T2 T3 T1 Improving HTTP Performance:Concurrent Requests & Responses • Use multiple connections in parallel • Does not necessarily maintain order of responses • Client = Why? • Server = Why? • Network = Why? • Is this fair? • N parallel connections use bandwidth N times more aggressively than just one • What’s a reasonable/fair limit as traffic competes with that of other users?
Improving HTTP Performance:Pipelined Requests & Responses • Batch requests and responses • Reduce connection overhead • Multiple requests sent in a single batch • Small items (common) can also share segments • Maintains order of responses • Item 1 always arrives before item 2 • How is this different from concurrent requests/responses? • What else could we do to speed things up? Server Client Request 1 Request 2 Request 3 Transfer 1 Transfer 2 Transfer 3
Improving HTTP Performance:Persistent Connections • Enables multiple transfers per connection • Maintain TCP connection across multiple requests • Including transfers subsequent to current page • Client or server can tear down connection • Performance advantages: • Avoid overhead of connection set-up and tear-down • Allow TCP to learn more accurate RTT estimate • Allow TCP congestion window to increase • i.e., leverage previously discovered bandwidth • Default in HTTP/1.1
Improving HTTP Performance:Caching • Many clients transfer same information • Generates redundant server and network load • Clients experience unnecessary latency Server Backbone ISP ISP-1 ISP-2 Clients
Improving HTTP Performance:Caching: How • Modifier to GET requests: • If-modified-since – returns “not modified” if resource not modified since specified time • Response header: • Expires – how long it’s safe to cache the resource • No-cache – ignore all caches; always get resource directly from server
Improving HTTP Performance:Caching: Why • Motive for placing content closer to client: • User gets better response time • Content providers get happier users • Time is money, really! • Network gets reduced load • Why does caching work? • Exploits locality of reference • How well does caching work? • Very well, up to a limit • Large overlap in content • But many unique requests
Improving HTTP Performance:Caching on the Client Example: Conditional GET Request • Return resource only if it has changed at the server • Save server resources! • How? • Client specifies “if-modified-since” time in request • Server compares this against “last modified” time of desired resource • Server returns “304 Not Modified” if resource has not changed • …. or a “200 OK” with the latest version otherwise Request from client to server: GET /~ee122/fa07/ HTTP/1.1 Host: inst.eecs.berkeley.edu User-Agent: Mozilla/4.03 If-Modified-Since: Sun, 27 Aug 2006 22:25:50 GMT <CRLF>
Improving HTTP Performance:Caching with Reverse Proxies Cache documents close to server decrease server load • Typically done by content providers • Only works for staticcontent Server Reverse proxies Backbone ISP ISP-1 ISP-2 Clients
Improving HTTP Performance:Caching with Forward Proxies Cache documents close to clients reduce network traffic and decrease latency • Typically done by ISPs or corporate LANs Server Reverse proxies Backbone ISP ISP-1 ISP-2 Forward proxies Clients
Improving HTTP Performance:Caching w/ Content Distribution Networks • Integrate forward and reverse caching functionality • One overlay network (usually) administered by one entity • e.g., Akamai • Provide document caching • Pull: Direct result of clients’ requests • Push: Expectation of high access rate • Also do some processing • Handle dynamic web pages • Transcoding
Improving HTTP Performance:Caching with CDNs (cont.) Server CDN Backbone ISP ISP-1 ISP-2 Forward proxies Clients
Improving HTTP Performance:CDN Example – Akamai • Akamai creates new domain names for each client content provider. • e.g., a128.g.akamai.net • The CDN’s DNS servers are authoritative for the new domains • The client content provider modifies its content so that embedded URLs reference the new domains. • “Akamaize” content • e.g.: http://www.cnn.com/image-of-the-day.gif becomes http://a128.g.akamai.net/image-of-the-day.gif
Improving HTTP Performance:Caching and Replication • Caching (pull) • Replicate content “on demand” after a request • Store the response message locally for future use • Challenges: • May need to verify if the response has changed • … and some responses are not cacheable • Replication (push) • Planned replication of content in multiple locations • Update of resources handled outside of HTTP • Can replicate scripts that create dynamic responses
Hosting: Multiple Sites Per Machine • Multiple Web sites on a single machine • Hosting company runs the Web server on behalf of multiple sites (e.g., www.foo.com and www.bar.com) • Problem: GET /index.html • www.foo.com/index.html or www.bar.com/index.html? • Solutions: • Multiple server processes on the same machine • Have a separate IP address (or port) for each server • Include site name in HTTP request • Single Web server process with a single IP address • Client includes “Host” header (e.g.,Host: www.foo.com) • Required header with HTTP/1.1