COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2007 Kaiser: COMS E6125
Reminders • Class attendance required! • Preliminary paper proposal January 29th • Preliminary project proposal March 5th • Paper must be individual; projects may be teams of 2-5 students • See advice about team formation at http://york.cs.columbia.edu/classes/cs6125/team_advice.htm
Class Attendance is Required! • Attendance will be taken at every class meeting, starting TODAY • Final grade reduced one notch for first miss (e.g., A- -> B+) • Final grade reduced full letter grade for second miss (e.g., A- -> B-) • Fail (or drop) course for third miss
Today’s Topic: Basic Mechanics of the Web • URI (~URL) • HTTP • Client/Server Intermediaries
What is a “URI”? • Uniform Resource Identifier • Compact string of characters, conforming to a certain syntax, that identifies an abstract or physical resource • Simple and extensible format • Example: http://york.cs.columbia.edu/classes/cs6125
What is a “Resource”? • Some piece of information that can be identified by a URI • The most common kind of resource is a file • But may also be a dynamically-generated query result, the output of a script, a document available in several languages, etc.
Uniform Resource Identifier • Uniform: aka Universal; the same string can be used with the same semantic interpretation, even when the mechanisms used to access the resource differ • Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity which corresponds to that mapping at any particular instant in time, and not always network “retrievable” • Identifier: An object that can act as a reference to something that has identity
Key requirement: Transcribability • Sequence of characters • May be transcribed from a non-network source • Often needs to be remembered by people • Should consist of characters that can most likely be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales
Why do we usually say URL rather than URI? • A Uniform Resource Locator (URL) belongs to the subset of URIs that identify resources via a representation of their primary access mechanism (e.g., their network “location”) • Most popular form of URI
What’s a URI that’s not a URL? • URN = Uniform Resource Name • Subset of URIs that denote a resource independent of its current location or the name by which it is known or the mechanism by which it is accessed • Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable • Thus not necessarily retrievable
URN vs. URL Example • Assume a published book (the resource) • Its ISBN, assigned through an ISBN registration agency, can serve as the URN (e.g., in the urn:isbn: namespace) • Assume the entire contents of the book were placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs
URL Notation • <scheme>://<authority><path>?<query> • <authority>: typically, an Internet domain name • <path>: specific to the authority, identifies the resource within the scope of the scheme and authority • <query>: a string of information to be interpreted by the resource
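As a rough illustration of the notation above, Python's standard urllib.parse module can split a URL into these components. The URL below is a made-up example (not from the slides), chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to exercise every component of the notation.
parts = urlsplit("http://www.example.com:8080/classes/cs6125?week=1&topic=uri")

print(parts.scheme)   # http
print(parts.netloc)   # www.example.com:8080  (the authority)
print(parts.path)     # /classes/cs6125
print(parts.query)    # week=1&topic=uri
```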
What’s a “domain name”? • Domain Name System (DNS) • Maps domain names to IP addresses and vice versa • Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc.), and so on • Eventually finds IP address for individual host (e.g., www.cs.columbia.edu) • Originated ~1982, for email (gk60@CMUA -> gk60@CMUA.arpa -> gk60@a.cs.cmu.edu)
What is a “scheme”? • <scheme>:<scheme-specific-part> • In a URL, the protocol employed for retrieval (http, ftp, file, mailto, etc.) • More generally, a specification for defining the syntax and semantics of the rest of the URI • Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:)
Example URLs • http://www.ietf.org/rfc/rfc3986.txt • gopher://gopher.quux.org/1/Software/Gopher • mailto:kaiser+6125@cs.columbia.edu • news:news.newusers.questions • telnet://cs.columbia.edu/
Example Absolute URIs • http://somehost/absolute/URI/with/absolute/path/to/resource.txt • ftp://somehost/resource.txt • urn:a-rose-by-any-other-name
Example Relative URIs • http://somehost/absolute/URI/with/absolute/path/to/resource.txt (absolute URI, shown for comparison) • /relative/URI/with/absolute/path/to/resource.txt • relative/path/to/resource.txt • ../../../resource.txt • resource.txt • /resource.txt#frag01 • #frag01 • [empty string]
Relative Addresses • Allow document trees to be (partially) independent of their location and scheme • A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes if the documents refer to each other using relative URIs • Such document trees can be moved, as a whole, without changing any of the relative references
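A minimal sketch of how relative references resolve against a base URI, using urljoin from Python's standard library. The base is the absolute URI from the examples; the ftp base at the end is an assumed path, added only to show that the same relative reference works under a different scheme.

```python
from urllib.parse import urljoin

BASE = "http://somehost/absolute/URI/with/absolute/path/to/resource.txt"

references = [
    "/relative/URI/with/absolute/path/to/resource.txt",
    "relative/path/to/resource.txt",
    "../../../resource.txt",
    "resource.txt",
    "#frag01",
]

# Resolve each relative reference against the base URI.
resolved = {ref: urljoin(BASE, ref) for ref in references}
for ref, target in resolved.items():
    print(f"{ref!r:55} -> {target}")

# Assumed ftp base: the same kind of relative reference stays inside the ftp tree.
ftp_target = urljoin("ftp://somehost/docs/index.html", "../resource.txt")
print(ftp_target)
```

Because the documents never name a scheme or host in their links, moving the whole tree only changes the base, not the references.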
URI “Standard” • URI is an Internet protocol element currently defined in RFC 3986 (2005) • Originally RFC 1630 (1994)
What is an “RFC”? • Request for Comments • One of a series, begun in 1969, of numbered Internet informational documents and standards widely followed by commercial software and freeware in the Internet and Unix communities • All Internet standards are recorded in RFCs
Who keeps track of RFCs? • IETF = Internet Engineering Task Force • Open, all-volunteer organization, with no formal membership or membership requirements • Organized into a large number of working groups, each dealing with a specific topic • April 1st RFCs, e.g., http://www.apps.ietf.org/rfc/rfc3514.html
What is “W3C”? • World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to the Web • Members pay fees depending on country, revenues and non-profit/for-profit status (e.g., $953 vs. $63,500) • Otherwise organized similarly to the IETF, but writes “Recommendations” instead of “Requests for Comments” • http://www.w3.org/
Back to URLs • Most (?) Web documents use the “http” scheme • What is “http” (HyperText Transfer Protocol)?
HTTP • The default Internet protocol used to deliver data on the World Wide Web • Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol • A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client
What’s “TCP/IP”? • IP = Internet Protocol • Delivers individual packets from one host to another, based on their IP address (in IPv4, four octets written in dotted decimal, as in 128.59.16.20) • Network routers direct traffic of IP packets
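To make the "four octets" point concrete, here is a small sketch using Python's standard ipaddress module on the address from the slide. It only shows that the dotted form is a human-readable spelling of one 32-bit number.

```python
import ipaddress

addr = ipaddress.IPv4Address("128.59.16.20")

octets = list(addr.packed)   # the four 8-bit values, in network order
as_int = int(addr)           # the same address as a single 32-bit integer

print(octets)                # [128, 59, 16, 20]
print(as_int == (128 << 24) + (59 << 16) + (16 << 8) + 20)   # True
```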
What’s “TCP/IP”? • TCP = Transmission Control Protocol • Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address • The so-called well known ports (< 1024) are reserved for specific protocols • By default, HTTP uses port 80; a different port can be given in the URL • http://www.foo.com:2007/doc.html
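A minimal sketch of how a client might pick the port to connect to: use the port in the URL if one is given, otherwise fall back to the scheme's default. The helper name effective_port and the small defaults table are assumptions made for this illustration.

```python
from urllib.parse import urlsplit

# Well-known default ports for a few schemes (illustrative subset).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def effective_port(url: str) -> int:
    """Return the explicit port from the URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.foo.com:2007/doc.html"))  # 2007
print(effective_port("http://www.foo.com/doc.html"))       # 80
```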
HTTP History • HTTP/0.9 (1990) - simple protocol for raw data transfer • HTTP/1.0 (RFC 1945, 1996) - Allowed MIME-like messages, containing meta-information about the resources transferred and modifiers on the request/response semantics • HTTP/1.1 (RFC 2616, 1999) • HTTP Extension Framework (RFC 2774, 2000)
What is “MIME”? • Multipurpose Internet Mail Extensions • Standard representation for “complex” message bodies (numerous RFCs since 1993) • Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages
MIME Header Fields • MIME-Version, Content-Type, Content-Transfer-Encoding, Content-Description, Content-ID, Content-Location, Content-Disposition, Part Body • Discrete (text, image, audio) and Multipart (mixed, digest) content types
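As a hedged illustration of discrete versus multipart content types, the sketch below builds a message with Python's standard email package. The addresses, subject, file name and the six placeholder bytes are all invented for the example; the bytes are not a complete image.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "someuser@columbia.edu"
msg["Subject"] = "MIME demo"

# A discrete text part ...
msg.set_content("Plain-text body of the message.")
# ... plus an attachment, which turns the message into multipart/mixed.
msg.add_attachment(b"GIF89a", maintype="image", subtype="gif",
                   filename="pixel.gif")

part_types = [part.get_content_type() for part in msg.iter_parts()]
print(msg.get_content_type())   # multipart/mixed
print(part_types)               # ['text/plain', 'image/gif']
```

Printing str(msg) shows the Content-Type, Content-Transfer-Encoding and Content-Disposition fields that the library generates for each part.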
HTTP Request/Response • [Diagram: the HTTP client sends an HTTP request to port 80; after processing, the response goes back to the client at another port]
HTTP Requests and Responses • Consist of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body • Message body only allowed with certain request methods and response status codes (e.g., a 200 OK response to GET carries a body, while 204 No Content and 304 Not Modified responses never do)
Sample HTTP Exchange • To retrieve the file at the URL http://www.somehost.com/path/file.html • First open a socket to the host www.somehost.com, port 80 (use the default port of 80 because none is specified in the URL)
Sample • Then, send something like the following through the socket:
    GET /path/file.html HTTP/1.0
    From: someuser@columbia.edu
    User-Agent: HTTPTool/1.0
    Accept: text/html, image/gif, image/jpeg
    [blank line here]
The server should respond with something like the following:
    HTTP/1.0 200 OK
    Server: Apache/1.3.0 (Linux)
    Date: Sun, 31 Dec 2006 23:59:59 GMT
    Last-Modified: Sun, 31 Dec 2006 23:59:58 GMT
    Content-Type: text/html
    Content-Length: 1354
    [blank line here]
    <html> <body> <h1>Happy New Year!</h1> (more file contents) . . . </body> </html>
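The message layout (start-line, headers, empty line, body) can be sketched as a small parser. This is an illustrative simplification, not a full HTTP implementation: the function name parse_response is made up, repeated headers are not handled, and the sample body is shortened, with Content-Length computed to match it.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into start-line fields, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")   # the empty line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")     # split at the first colon only
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers, body


body_bytes = b"<html><body><h1>Happy New Year!</h1></body></html>"
raw_response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: Apache/1.3.0 (Linux)\r\n"
    b"Date: Sun, 31 Dec 2006 23:59:59 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body_bytes)).encode("ascii") + b"\r\n"
    b"\r\n" + body_bytes
)

version, status, reason, headers, body = parse_response(raw_response)
print(version, status, reason)     # HTTP/1.0 200 OK
print(headers["content-type"])     # text/html
```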
Some Request Headers • From: gives the email address of whoever's making the request, or running the program doing so (for bots) • User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser) • User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
Some Response Headers • Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx" • Server: Apache/1.3.12 (Unix) • Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching • Uses Greenwich Mean Time, in the format Last-Modified: Tue, 23 Jan 2007 00:00:01 GMT
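One way to produce a date in this format is formatdate from Python's standard email.utils, shown here as a small sketch that reproduces the timestamp in the example above.

```python
import calendar
from email.utils import formatdate

# Seconds since the epoch for 23 Jan 2007, 00:00:01 GMT.
timestamp = calendar.timegm((2007, 1, 23, 0, 0, 1))

# usegmt=True gives the "GMT" suffix that HTTP expects instead of "-0000".
last_modified = formatdate(timestamp, usegmt=True)
print("Last-Modified:", last_modified)   # Last-Modified: Tue, 23 Jan 2007 00:00:01 GMT
```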
Start Line • HTTP Version (0.9, 1.0, 1.1) • URI (requests only) • Method (request) or Status Code (response)
HTTP URIs • Servers may bound the URI length (often 255 characters) or leave it “unbounded”; a server that cannot handle a URI that long returns status code 414 (Request-URI Too Long) • Equivalence comparison: the following three URIs are equivalent • http://abc.com:80/~smith/home.html • http://ABC.com/%7Esmith/home.html • http://ABC.com:/%7esmith/home.html
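The equivalence of the three URIs can be checked with a rough normalization sketch: lowercase the scheme and host, treat a missing or empty port as 80, and decode percent-escapes in the path. The function name normalize is an assumption, and the code deliberately ignores cases a real comparison must handle (userinfo, IPv6 literals, escaped reserved characters such as %2F).

```python
from urllib.parse import urlsplit, unquote

def normalize(url: str) -> tuple:
    """Reduce an http URL to a comparable tuple (simplified illustration)."""
    parts = urlsplit(url)
    host, _, port = parts.netloc.partition(":")
    return (
        parts.scheme.lower(),
        host.lower(),                 # host names compare case-insensitively
        int(port) if port else 80,    # missing or empty port means the default
        unquote(parts.path) or "/",   # %7E and %7e both decode to "~"
        parts.query,
    )

urls = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
print(len({normalize(u) for u in urls}))   # 1
```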
Request Messages • Method SP Request-URI SP HTTP-Version CRLF • GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1 • Equivalent to the client making a TCP connection to www.w3.org on port 80, then sending the request line GET /pub/WWW/TheProject.html HTTP/1.1 followed by the header line Host: www.w3.org • Host field allows for virtual hosts
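A minimal sketch of that equivalence: given an absolute URL, build the request a client would send after connecting to the host. The helper name build_get is invented for this example, and only the request line and Host header are produced.

```python
from urllib.parse import urlsplit

def build_get(url: str) -> bytes:
    """Build a bare HTTP/1.1 GET request for an absolute http URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {parts.netloc}",   # lets one server tell virtual hosts apart
        "",                        # empty line: end of headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

request = build_get("http://www.w3.org/pub/WWW/TheProject.html")
print(request.decode("ascii"))
```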
What is a “virtual host”? • Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting) • Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port
GET • Retrieve whatever information (in the form of an entity) is identified by the URI • If the URI refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process) • http://foo.com/run.cgi?name1=val1&name2=val2
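The input parameters after the “?” are name=value pairs joined by “&”. As a sketch, Python's urllib.parse can both build and decode such a query string; the last line uses an invented value to show how special characters are escaped.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build the query string from the slide's example parameters.
query = urlencode({"name1": "val1", "name2": "val2"})
url = "http://foo.com/run.cgi?" + query
print(url)      # http://foo.com/run.cgi?name1=val1&name2=val2

# What a data-producing process would see after decoding the query.
params = parse_qs(urlsplit(url).query)
print(params)   # {'name1': ['val1'], 'name2': ['val2']}

# Hypothetical value with a space and an ampersand, to show the escaping.
escaped = urlencode({"q": "a b&c"})
print(escaped)  # q=a+b%26c
```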
Conditional and Partial GET • Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field • Partial if the request message includes a Range header field • Avoids retrieving data the client doesn’t need (e.g., when part of it is already in the cache, or the cached copy is still up to date)
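A hedged sketch of the server-side decision for one conditional header, If-Modified-Since: reply 304 Not Modified when the resource has not changed since the client's cached copy, otherwise send the full 200 response. The function name is invented, and the other conditional headers and Range are not modeled.

```python
from email.utils import parsedate_to_datetime
from typing import Optional

def status_for_conditional_get(last_modified: str,
                               if_modified_since: Optional[str]) -> int:
    """Return 200 or 304 for a GET, considering only If-Modified-Since."""
    if if_modified_since is None:
        return 200                       # unconditional GET
    resource_time = parsedate_to_datetime(last_modified)
    cached_time = parsedate_to_datetime(if_modified_since)
    return 200 if resource_time > cached_time else 304

modified = "Sun, 31 Dec 2006 23:59:58 GMT"
print(status_for_conditional_get(modified, None))                             # 200
print(status_for_conditional_get(modified, "Tue, 23 Jan 2007 00:00:01 GMT"))  # 304
print(status_for_conditional_get(modified, "Fri, 01 Dec 2006 00:00:00 GMT"))  # 200
```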
HEAD • Identical to GET except that the server must not return a message-body in the response - only returns headers • Often used for testing hypertext links for validity and modification • Can mark cache entries as stale if certain header information changes (e.g., length, last-modified)
POST • Used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line • Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI
POST supports several functions • Annotation of an existing resource • Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles • Providing a block of data, such as the result of submitting a form, to a data-handling process • Extending a database through an append operation
POST vs. GET • GET can be used to send small amounts of data to a server, with the data following the ? character • The rest of the request-URI (before the ?) refers to some kind of processing program • GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.0
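For comparison, the sketch below sends the same two fields both ways: in the request-URI for GET, and in the message body for POST. It assumes the common application/x-www-form-urlencoded form encoding, and it only builds the request text without opening a connection.

```python
from urllib.parse import urlencode

fields = {"field1": "value1", "field2": "value2"}
encoded = urlencode(fields)   # field1=value1&field2=value2

# GET: the data rides in the request-URI after the "?".
get_request = (
    f"GET /path/script.cgi?{encoded} HTTP/1.0\r\n"
    "\r\n"
)

# POST: the same data travels in the message body, described by headers.
post_request = (
    "POST /path/script.cgi HTTP/1.0\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(encoded)}\r\n"
    "\r\n"
    f"{encoded}"
)

print(get_request)
print(post_request)
```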
PUT and DELETE • Often unsupported (501 Not Implemented) • PUT requests that the enclosed entity be stored under the supplied Request-URI • May create a new resource at a new URI, or modify an existing resource already at that URI • DELETE requests that the origin server delete the resource identified by the Request-URI • May be overridden, e.g., by human intervention, even if status code indicates successfully completed
OPTIONS and TRACE • OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval • TRACE is used to invoke an application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain, for testing or diagnostic purposes
HTTP is “Stateless” • Server doesn’t remember anything about client between connections • Not even between requests during the same persistent connection, except TCP data • But some state can be encoded in complex URLs or in forms • Or saved on client in “cookies”
Cookies • Opaque string associated with a website, stored at the browser • Created in an HTTP response with “Set-Cookie:” • In all subsequent requests to this site, until the cookie’s expiration, the client sends the HTTP header “Cookie:” • Name-value pairs, separated by semicolons • Cookie: user="alex"; lastvisit="20070123-11:00" • Interpretation up to the Web application
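A small sketch of both directions using Python's standard http.cookies module: the server side generates a Set-Cookie header, and the Web application later parses the Cookie header the browser sends back. The cookie names and values follow the example above; the Path attribute is an added assumption.

```python
from http.cookies import SimpleCookie

# Server side: create the cookie to send in the HTTP response.
outgoing = SimpleCookie()
outgoing["user"] = "alex"
outgoing["user"]["path"] = "/"        # assumed attribute, for illustration
set_cookie_line = outgoing.output()   # a "Set-Cookie: ..." header line
print(set_cookie_line)

# Later request: parse the Cookie header value sent back by the client.
incoming = SimpleCookie()
incoming.load('user="alex"; lastvisit="20070123-11:00"')
print(incoming["user"].value)         # alex
print(incoming["lastvisit"].value)    # 20070123-11:00
```

What "alex" or the timestamp actually mean is decided by the application; the protocol only carries the name-value pairs.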