COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2012 Kaiser: COMS E6125
Today’s Topic • Basic Web Mechanics • URI • HTTP • Client/Server Intermediaries Kaiser: COMS E6125
What is a “URI”? • Uniform Resource Identifier • Compact string of characters for identifying an abstract or physical resource • Conforms to a simple and extensible format • Example: http://www.psl.cs.columbia.edu/courses/whim/
What is a “Resource”? • Some piece of information that can be identified by a URI • The most common kind of resource is a file • But may also be a dynamically-generated query result, the output of a script, a document available in several languages or formats, etc. Kaiser: COMS E6125
Uniform Resource Identifier • Uniform: aka Universal - same string can be used with same semantic interpretation, even when mechanisms used to access the resource differ • Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity that corresponds to that mapping at any particular instance in time • Identifier: An object that can act as a reference to something that has identity Kaiser: COMS E6125
Key Requirement: Transcribability • May be transcribed from non-network source • Often needs to be remembered by people • Should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales Kaiser: COMS E6125
Why do we usually say URL rather than URI? • A Uniform Resource Locator (URL) refers to the subset of URIs that identify resources via a representation of their primary access mechanism (i.e., their network “location”) • Most popular form of URI Kaiser: COMS E6125
What’s a URI that’s not a URL? • URN = Uniform Resource Name • Subset of URIs that denote a resource independent of its current location, the name by which it is known, or the mechanism by which it is accessed • Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable • Thus not necessarily “retrievable” Kaiser: COMS E6125
URN vs. URL Example • Assume a published book (the resource) • The ISBN (International Standard Book Number) is a 10-digit number (13 digits since 2007) that uniquely identifies books and book-like products published internationally - this is the URN • The entire contents of the book might be placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs • All of these are URIs
URI Syntax • <scheme>:<scheme-specific-part> • For a URL, the scheme indicates the protocol employed for retrieval (http, ftp, file, mailto, etc.) • More generally, a scheme is a specification for defining the syntax and semantics of the rest of the URI • Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:) Kaiser: COMS E6125
URL Notation • <scheme>://<authority><path>?<query> • <authority>: typically, an Internet domain name • <path>: specific to the authority, identifies the resource within the scope of the scheme and authority • <query>: a string of information to be interpreted by the resource
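The notation above can be illustrated with Python's standard `urllib.parse` module. This is only a sketch: the helper name `url_components` is made up for illustration, and the sample URL is the one used on the later GET slide.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URI into the pieces named in the URL notation.

    url_components is a hypothetical helper, not part of any library.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # text before the first ":"
        "authority": parts.netloc,   # typically an Internet domain name
        "path": parts.path,          # identifies the resource within the authority
        "query": parts.query,        # text after "?", interpreted by the resource
    }


http_example = url_components("http://foo.com/run.cgi?name1=val1&name2=val2")
# A URI whose scheme has no "//<authority>" part still splits at the colon.
mailto_example = url_components("mailto:someone@example.com")
```

For the `mailto` case the whole scheme-specific part ends up in `path`, which matches the more general `<scheme>:<scheme-specific-part>` syntax.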
What’s a “domain name”? • Domain Name System (DNS) • Maps domain names to IP addresses and vice versa • Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc.), and so on • Eventually finds IP address for individual host (e.g., bank.cs.columbia.edu) • DNS servers cache responses based on TTL = Time to Live • Originated ~1982, e.g., for email (gk60@CMUA -> gk60@CMUA.arpa -> gk60@a.cs.cmu.edu) Kaiser: COMS E6125
Relative URLs • Allows document trees to be independent of their location and scheme • A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes • Such document trees can be moved, as a whole, without changing any of the relative references • Resolved to full (absolute) URLs using a base URL Kaiser: COMS E6125
Example Relative URLs • http://somehost/absolute/URL/with/absolute/path/to/resource.txt • /relative/URI/with/absolute/path/to/resource.txt • relative/path/to/resource.txt • ../../../resource.txt • resource.txt • /resource.txt#frag01 • #frag01 • [empty string] Kaiser: COMS E6125
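As a rough illustration of how the references listed above resolve against a base URL, the sketch below uses `urllib.parse.urljoin` from the Python standard library. The base URL `http://somehost/a/b/c/d/page.html` is an assumption chosen only for this example.

```python
from urllib.parse import urljoin

# Hypothetical base URL: the document that contains the relative references.
BASE = "http://somehost/a/b/c/d/page.html"

REFERENCES = [
    "/relative/URI/with/absolute/path/to/resource.txt",
    "relative/path/to/resource.txt",
    "../../../resource.txt",
    "resource.txt",
    "/resource.txt#frag01",
    "#frag01",
    "",  # the empty string refers to the base document itself
]

# Map each relative reference to the absolute URL it resolves to.
resolved = {ref: urljoin(BASE, ref) for ref in REFERENCES}

for ref, absolute in resolved.items():
    print(repr(ref), "->", absolute)
```

Only the base changes when the document tree moves, which is why the relative references themselves never need editing.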
URI “Standard” • URI is an Internet protocol element defined currently in RFC 3986 (2005) • Originally RFC 1630 (1994)
What is an “RFC”? • Request for Comments • One of a series, begun in 1969, of numbered informational documents and standards followed by commercial software and freeware in the Internet and Unix communities • All Internet standards are recorded in RFCs Kaiser: COMS E6125
Who keeps track of RFCs? • IETF = Internet Engineering Task Force • Open, all-volunteer organization, with no formal membership or membership requirements • Organized into a large number of working groups, each dealing with a specific topic • April 1st RFCs, see http://www.apps.ietf.org/rfc/apr1list.html Kaiser: COMS E6125
What is “W3C”? • World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to Web • Members pay fees depending on country, revenues and non-profit/for-profit status • Otherwise organized similar to IETF, but writes “Recommendations” instead of “Requests for Comments” • http://www.w3.org/ Kaiser: COMS E6125
Back to URLs • Most Web documents use the “http” scheme (or “https” = http over TLS/SSL) • What is “http” (HyperText Transfer Protocol)? Kaiser: COMS E6125
HTTP = HyperText Transfer Protocol • Most Web documents are accessed using the “http” scheme, the default Internet protocol used to deliver data on WWW • Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol • A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client
What’s “TCP/IP”? • IP = Internet Protocol • Delivers individual packets from one host to another, based on their IP address (in IPv4, four 8-bit octets as in 128.59.11.100) • Network routers direct traffic of IP packets • Analogous to telephone numbers (area code plus exchange plus 4 digits plus extension) and postal address (zip code plus street name plus building number plus apartment number) Kaiser: COMS E6125
What’s “TCP/IP”? • TCP = Transmission Control Protocol • Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address • The so-called well known ports (< 1024) are reserved for specific protocols (telnet, ftp, smtp, pop3, imap, etc.) • By default, HTTP uses port 80; this can be changed in the URL • http://www.example.com:2012/doc.html • Main alternative is UDP = User Datagram Protocol, no connection, no reliable delivery (used by DNS) Kaiser: COMS E6125
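The default-port rule can be sketched in a few lines of Python. The table `DEFAULT_PORTS` and the function `effective_port` are names invented for this example, and the table lists only three well-known schemes.

```python
from urllib.parse import urlsplit

# Well-known default ports for a few schemes (illustrative, not exhaustive).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url: str):
    """Return the port a client would connect to for this URL.

    An explicit ":port" in the URL wins; otherwise fall back to the
    scheme's default, or None if the scheme is not in the table.
    """
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.example.com:2012/doc.html"))
print(effective_port("http://www.example.com/doc.html"))
```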
HTTP History • HTTP/0.9 (1990) - simple protocol for raw data transfer • HTTP/1.0 (1996) - allows MIME-like messages, containing meta-data about the resources transferred and modifiers on the request/response semantics • HTTP/1.1 (1999) – lots of practical improvements, e.g., caching policies, chunked encoding, persistent connections • W3C closed activity but IETF still has a working group to revise Kaiser: COMS E6125
What is “MIME”? • Multipurpose Internet Mail Extensions • Standard representation for “complex” message bodies (numerous RFCs since 1993) • Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages Kaiser: COMS E6125
HTTP Properties • Uses URLs for identifying Web resources • Request-response – always initiated by client to server, the server responds with results • Stateless – each request-response pair independent from every other, so any state information (login credentials, shopping carts, etc.) needs to be encoded somehow Kaiser: COMS E6125
HTTP Request/Response [Figure: the HTTP client sends an HTTP request to port 80 on the server; after processing, the response returns to the client on another port] • Web server processes HTTP requests, generally over TCP Port 80 • The request specifies a resource URL • The server parses the URL and processes the request: • Returns a document with its type information • Invokes a program or script, and returns its output • The output (including metadata) is sent back to the client as a response message
HTTP Requests • Small number of request types (GET, POST, HEAD, etc.) • Request may contain additional information, e.g. client info, parameters for forms, cookies, etc. • Consists of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body Kaiser: COMS E6125
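The message layout described above (start-line, headers, empty line, optional body) can be sketched as a small parser. This is a simplified illustration, not a conforming HTTP implementation: `parse_request` is a hypothetical helper, and it ignores details such as repeated header fields and continuation lines.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP request into start-line fields, headers, and body."""
    # The first empty line (CRLF CRLF) ends the header section.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Start-line: Method SP Request-URI SP HTTP-Version
    method, target, version = lines[0].split(" ")

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header field names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


sample = (
    b"GET / HTTP/1.1\r\n"
    b"Host: psl.cs.columbia.edu\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)
method, target, version, headers, body = parse_request(sample)
```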
HTTP Responses • Larger number of response codes (200 OK, 404 NOT FOUND) • Message body only allowed with certain response status codes • Includes MIME metadata as well as “payload” (data) Kaiser: COMS E6125
Start Line • HTTP Version (0.9, 1.0, 1.1) • URI • Method (request) or Status Code (response) Kaiser: COMS E6125
Sample HTTP Exchange • To retrieve the file at the URL http://psl.cs.columbia.edu • First open a socket to the host psl.cs.columbia.edu, port 80 (use the default port because none is specified in the URL) • Connect to 128.59.19.127 on port 80 ... ok Kaiser: COMS E6125
Sample • Then, send something like the following through the socket:
    GET / HTTP/1.1[CRLF]
    Host: psl.cs.columbia.edu[CRLF]
    Connection: close[CRLF]
    User-Agent: Web-sniffer/1.0.37 (+http://web-sniffer.net/)[CRLF]
    Accept-Encoding: gzip[CRLF]
    Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7[CRLF]
    Cache-Control: no-cache[CRLF]
    Accept-Language: de,en;q=0.7,en-us;q=0.3[CRLF]
    Referer: http://web-sniffer.net/[CRLF]
    [CRLF]
Sample • The server should respond with something like the following (status line, headers, empty line, then body):
    HTTP/1.1 403 Forbidden[CRLF]
    Content-Length: 218[CRLF]
    Content-Type: text/html[CRLF]
    Server: Microsoft-IIS/6.0[CRLF]
    X-Powered-By: ASP.NET[CRLF]
    Date: Sat, 22 Jan 2011 14:24:22 GMT[CRLF]
    Connection: close[CRLF]
    [CRLF]
    <html><head><title>Error</title></head><body><head><title>Directory Listing Denied</title></head>[LF]
    <body><h1>Directory Listing Denied</h1>This Virtual Directory does not allow contents to be listed.</body></body></html>
Some Request Headers • User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser) • User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0 • Referer: the URL of the previous webpage from which a link was followed • Referer: http://web-sniffer.net/ Kaiser: COMS E6125
Some Response Headers • Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx" • Server: Apache/2.2.8 (Ubuntu) • Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching • Always expressed in Greenwich Mean Time (GMT), in the format Last-Modified: Sat, 22 Jan 2011 14:46:32 GMT
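The date format used by Last-Modified can be produced and parsed with Python's standard `email.utils` module. The sketch below assumes the caller passes a timezone-aware `datetime`; `http_date` is a name chosen for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime the way Last-Modified expects."""
    # usegmt=True emits the literal "GMT" suffix instead of "+0000".
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


modified = datetime(2011, 1, 22, 14, 46, 32, tzinfo=timezone.utc)
header_value = http_date(modified)
print("Last-Modified:", header_value)

# Parsing the header value gives back the same instant.
round_trip = parsedate_to_datetime(header_value)
```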
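HTTP URIs • The protocol itself places no limit on URI length (“unbounded”), but a server may impose some bound (often 255) and return status code 414 (Request-URI Too Long) when it is exceeded • Equivalence comparison: scheme and host are case-insensitive, an empty or default port (80) may be omitted, and %7E is the percent-encoding of “~”, so the following three URIs are equivalent:
    http://abc.com:80/~smith/home.html
    http://ABC.com/%7Esmith/home.html
    http://ABC.com:/%7esmith/home.html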
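One way to test such equivalence is to normalize each URI and compare the results. The sketch below is an assumption-laden simplification: `normalize_http_url` is a hypothetical helper, it assumes no userinfo and no IPv6 literal in the authority, and it only applies the case, default-port, and percent-encoding rules.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": "80", "https": "443"}
# Characters that never need percent-encoding in a URI.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _fix_escape(match):
    """Decode %XX if it encodes an unreserved character, else uppercase the hex."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def normalize_http_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Host names are case-insensitive; split off an optional ":port".
    host, _, port = parts.netloc.lower().partition(":")
    if port and port != DEFAULT_PORTS.get(scheme):
        host = host + ":" + port  # keep only a non-default, non-empty port
    path = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, parts.path) or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


candidates = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
normalized = {normalize_http_url(u) for u in candidates}
```

All three candidates collapse to a single normalized string, while a URI with a different port or path does not.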
Request Messages • Request-Line format: Method SP Request-URI SP HTTP-Version CRLF • Example with an absolute URI: GET http://www.gailkaiser.org/ HTTP/1.1 • Equivalent to the client making a TCP connection to bank.cs.columbia.edu on port 80, then sending:
    GET / HTTP/1.1
    Host: www.gailkaiser.org
• Host field allows for virtual hosts
What is a “virtual host”? • Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting) • Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port Kaiser: COMS E6125
GET • Retrieve whatever information (in the form of an entity) is identified by the URL • If the URL refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process) http://foo.com/run.cgi?name1=val1&name2=val2 Kaiser: COMS E6125
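The input parameters after the “?” are a series of name=value pairs separated by “&”. A minimal sketch of extracting and rebuilding them with Python's `urllib.parse` follows; `query_parameters` is a name made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def query_parameters(url: str) -> dict:
    """Return the name/value pairs that follow the "?" in a URL."""
    return dict(parse_qsl(urlsplit(url).query))


params = query_parameters("http://foo.com/run.cgi?name1=val1&name2=val2")
print(params)

# Going the other way: build a query string from name/value pairs.
rebuilt = urlencode({"name1": "val1", "name2": "val2"})
```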
Conditional and Partial GET • Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field • Partial if the request message includes a Range header field • Don’t retrieve data the client doesn’t need (e.g., at least the part already up to date in cache) Kaiser: COMS E6125
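A server-side view of a conditional GET might look like the sketch below, which handles only If-Modified-Since and answers 304 (Not Modified) when the client's cached copy is still current. The function name `conditional_status` and the dictionary-of-headers interface are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(request_headers: dict, last_modified: datetime) -> int:
    """Return 304 if the resource is unchanged since the client's copy, else 200."""
    since = request_headers.get("If-Modified-Since")
    if since is None:
        return 200  # unconditional GET: always send the entity
    try:
        threshold = parsedate_to_datetime(since)
        # HTTP dates have one-second resolution, so drop microseconds.
        unchanged = last_modified.replace(microsecond=0) <= threshold
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the condition
    return 304 if unchanged else 200


resource_time = datetime(2011, 1, 22, 14, 46, 32, tzinfo=timezone.utc)
fresh = conditional_status(
    {"If-Modified-Since": "Sat, 22 Jan 2011 14:46:32 GMT"}, resource_time
)
stale = conditional_status(
    {"If-Modified-Since": "Fri, 21 Jan 2011 00:00:00 GMT"}, resource_time
)
```

With a 304 the server sends no message-body, so the client reuses its cached copy and no unneeded data crosses the network.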
HEAD • Identical to GET except that the server must not return a message-body in the response - only returns headers • Often used for testing hypertext links for validity and modification • Can mark cache entries as stale if certain header information changes (e.g., length, last-modified) Kaiser: COMS E6125
POST • Used to request that the server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line • Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI Kaiser: COMS E6125
POST supports several functions • Annotation of an existing resource • Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles • Providing a block of data, such as the result of submitting a form, to a data-handling process • Extending a database through an append operation Kaiser: COMS E6125
POST vs. GET • GET can only be used to send relatively small amounts of data to a server, with the data following the ? character • The rest of the Request-URI (before the ?) refers to some kind of processing program: GET /run.cgi?name1=val1&name2=val2 HTTP/1.0 • POST instead carries its data in the message-body of the request, so it is not constrained by URI length limits
PUT and DELETE • Often unsupported (501 Not Implemented) • PUT requests that the enclosed entity be stored under the supplied Request-URI • May create a new resource at a new URI, or modify an existing resource already at that URI • DELETE requests that the origin server delete the resource identified by the Request-URI • May be overridden, e.g., by human intervention, even if status code indicates successfully completed • Effectively supplanted by WebDAV Kaiser: COMS E6125
OPTIONS and TRACE • OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval • TRACE used to invoke application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain for testing or diagnostic information Kaiser: COMS E6125
HTTP Responses • Status-Line format: HTTP-Version SP Status-Code SP Reason-Phrase CRLF • Example: HTTP/1.0 404 Not Found • Status code: 3-digit integer result code of the attempt to understand and satisfy the request • Reason phrase: short textual description of the Status-Code
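Splitting a status line into its three fields is straightforward; the sketch below uses a hypothetical helper `parse_status_line`. The reason phrase may itself contain spaces, so the split is limited to the first two.

```python
def parse_status_line(line: str):
    """Split 'HTTP-Version SP Status-Code SP Reason-Phrase CRLF' into fields."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.0 404 Not Found\r\n")
print(version, code, reason)
```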
Response Messages • Larger number of response codes (200 OK, 404 NOT FOUND) • Message body only allowed with certain response status codes • Includes MIME metadata as well as “payload” (data) Kaiser: COMS E6125
Status Codes • Applications need only understand the first digit (the class) of a status code; an unrecognized code is treated as equivalent to the x00 code of its class • 1xx: Informational - Request received, continuing process ("100" : Continue, relevant to persistent connections in HTTP 1.1) • 2xx: Success - The action was successfully received, understood and accepted ("200" : OK) • 3xx: Redirection - Further action must be taken in order to complete the request ("300" : Multiple Choices) • 4xx: Client Error - The request contains bad syntax or cannot be fulfilled ("400" : Bad Request) • 5xx: Server Error - The server failed to fulfill an apparently valid request ("500" : Internal Server Error)
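The first-digit rule can be sketched as follows. The names `status_class` and `fallback_code`, and the small set of codes the example client "knows", are assumptions made for illustration.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Name the class of a status code from its first digit."""
    return STATUS_CLASSES[code // 100]


def fallback_code(code: int, known: set) -> int:
    """Treat a code the client does not recognize as the x00 code of its class."""
    return code if code in known else (code // 100) * 100


# Hypothetical set of codes that some client understands.
KNOWN_CODES = {200, 400, 404, 500}

print(status_class(404))
print(fallback_code(414, KNOWN_CODES))
```

A client built this way still behaves sensibly when a server returns a code defined after the client was written.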
HTTP is “Stateless” • Server doesn’t remember anything about client between connections • Not even between requests during the same persistent connection, except TCP data • So how does HTTP support “remembering” the user during a session or across sessions? • Some state can be encoded in complex URLs or otherwise in the web page itself (e.g., query strings added to links, hidden form fields) • Or saved on client in “cookies” Kaiser: COMS E6125
Cookies • String associated with a name/domain/path, stored at the browser • Series of name-value pairs, interpreted by the web application • Created by the server in an HTTP response with “Set-Cookie:” (or “Set-Cookie2:”) • In all subsequent requests to this site, until the cookie’s expiration, the client sends the HTTP header “Cookie:” (or “Cookie2:”) • Often have an expiration (otherwise expire when the browser is closed) • Various technical, privacy and security issues (e.g., inconsistent state after using “back” button, third-party cookies, cross-site scripting)
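The cookie exchange can be sketched with Python's standard `http.cookies` module. The helper names `make_set_cookie` and `parse_cookie_header`, and the cookie names and values, are invented for this example.

```python
from http.cookies import SimpleCookie


def make_set_cookie(name: str, value: str, path: str = "/", max_age=None) -> str:
    """Build the Set-Cookie header line a server would put in its response."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    if max_age is not None:
        # Without an expiration the cookie lasts only until the browser closes.
        cookie[name]["max-age"] = max_age
    return cookie.output(header="Set-Cookie:")


def parse_cookie_header(header_value: str) -> dict:
    """Read the name/value pairs a client sends back in its Cookie header."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


response_line = make_set_cookie("session", "abc123", max_age=3600)
print(response_line)

sent_back = parse_cookie_header("session=abc123; theme=dark")
print(sent_back)
```

Because the server keeps no memory between requests, everything it needs to recognize the user has to travel in these headers on every request.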