Web basics • HTTP • http://www.ietf.org/rfc/rfc2616.txt • http://www2002.org/CDROM/refereed/444/ • URI/L/Ns • http://www.ietf.org/rfc/rfc2396.txt • HTML • http://www.w3.org/TR/html401/
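HTTP operation: Basic (top) vs. with Intermediaries • Basic: User Agent sends Request to Origin Server; Origin Server returns Response • With intermediaries: a Request chain runs from User Agent to Origin Server, and a Response chain runs back • Intermediaries: Proxies, gateways, tunnels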
HTTP Terminology • User Agent (UA): program acting on behalf of user. • Resource: data object or service identified by a URI. • Origin server (OS): server originating a resource • Connection: transport session initiated by UA (but not always direct to OS). Typically TCP or SSL.
HTTP Terminology • Message: formatted sequence of bytes: • Request: from client to server • Response: from server to client • Message = startline + headers + body
Request and response messages
Request:
  GET /index.html HTTP/1.1
  Host: www.hello.ucsc.edu
  User-Agent: Mozilla
  <blank line>
Response:
  HTTP/1.1 200 OK
  Content-Length: 45
  Content-Language: en-us
  Content-Type: text/html
  <blank line>
  <html> <body> Hello world </body> </html>
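The "startline + headers + body" structure can be sketched in a few lines of Python. This is a minimal illustration, not a complete HTTP parser: the function name `parse_message` is hypothetical, and it assumes CRLF line endings and no folded or repeated headers.

```python
def parse_message(raw: bytes):
    """Split an HTTP/1.1 message into (start_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# The request from the slide, written out with explicit CRLFs.
request = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.hello.ucsc.edu\r\n"
    b"User-Agent: Mozilla\r\n"
    b"\r\n"
)
start_line, headers, body = parse_message(request)
method, target, version = start_line.split(" ")
```

A GET request normally carries no body, so `body` is empty here; for a response, the same split yields the status line, the response headers, and the entity bytes.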
Requests • GET, HEAD, POST • PUT, DELETE • OPTIONS, TRACE, CONNECT
Common request headers • Host (required), User-Agent • Referer • Authorization • If-Modified-Since, Cache-Control • Accept[-Language/-Charset/-Encoding]
Common response codes • 200 OK • 301 Moved Permanently, 307 Temporary Redirect • 400 Bad Request • 401 Unauthorized, 403 Forbidden • 404 Not Found • 500 Internal Server Error
Common response headers • Content-Type, Content-Length, Content-Language • Date, Last-Modified, Expires • Location [for 3xx responses] • Server
Response generation: Theory (top) vs. practice • Theory: Resource → Variant → Instance → Entity → Message • Steps, in order: Selection (negotiation, UA optimization), Content encoding (gzip), Instance manipulations (range, delta), Transfer encoding (chunking, encryption) • Practice: Resource → Variant/Instance → Message • Step: Selection (UA optimization) • Understanding the full model is necessary for a good understanding of caching, but we are going to ignore caching
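Two of the steps named above, content encoding (gzip) and transfer encoding (chunking), can be sketched as follows. This is a rough illustration under stated assumptions: the helper names `chunked` and `dechunk` and the 16-byte chunk size are made up for the example, the selection and instance-manipulation steps are skipped, and chunk extensions and trailers are not handled.

```python
import gzip


def chunked(data: bytes, size: int = 16) -> bytes:
    """Apply HTTP/1.1 chunked transfer encoding to data."""
    out = b""
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        # Each chunk is: hex length, CRLF, chunk bytes, CRLF.
        out += f"{len(chunk):x}\r\n".encode("ascii") + chunk + b"\r\n"
    # A zero-length chunk followed by a blank line ends the body.
    return out + b"0\r\n\r\n"


def dechunk(wire: bytes) -> bytes:
    """Reverse chunked(): reassemble the chunks into the original bytes."""
    body = b""
    while True:
        size_line, _, wire = wire.partition(b"\r\n")
        size = int(size_line.decode("ascii"), 16)
        if size == 0:
            return body
        body += wire[:size]
        wire = wire[size + 2:]  # skip the chunk data and its trailing CRLF


selected = b"<html> <body> Hello world </body> </html>"
encoded = gzip.compress(selected)   # content encoding (Content-Encoding: gzip)
message_body = chunked(encoded)     # transfer encoding (Transfer-Encoding: chunked)
```

The receiver undoes the steps in reverse order: it removes the transfer encoding first, then the content encoding, so `gzip.decompress(dechunk(message_body))` gives back the selected bytes.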
Cookies • Not part of official HTTP spec, but see: • http://www.ietf.org/rfc/rfc2109.txt • http://www.ietf.org/rfc/rfc2965.txt • Adding state to “stateless” protocol • OS adds Set-Cookie header to response: • Set-Cookie: sid=113a8fbc;version=1;path=/ • UA adds Cookie header to future requests: • Cookie: sid=113a8fbc;$version=1;$path=/
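The Set-Cookie / Cookie round trip can be sketched with Python's standard `http.cookies` module. This is a simplified illustration: the variable names are arbitrary, and the user agent here echoes only `name=value` rather than the `$version` and `$path` attributes shown on the slide.

```python
from http.cookies import SimpleCookie

# Origin server side: build the value of the Set-Cookie response header.
server_jar = SimpleCookie()
server_jar["sid"] = "113a8fbc"
server_jar["sid"]["path"] = "/"
server_jar["sid"]["version"] = "1"
set_cookie_value = server_jar["sid"].OutputString()

# User agent side: parse the header, remember the cookie,
# and send it back in the Cookie header of later requests.
ua_jar = SimpleCookie()
ua_jar.load(set_cookie_value)
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in ua_jar.items()
)
```

The server can then look up `sid` on each request to find the session it belongs to, which is how state is layered on top of otherwise independent requests.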
URI/L/N • Uniform Resource… • Name: a persistent identifier • (Under development) • Locator: (perhaps transient) locator information • Typically: address plus access method • Identifier: either a URN or URL • RFC2396 provides syntactic rules that all URIs must obey
HTTP URLs • http://host:port/path?query • “Fragments” are not strictly part of URLs • Relative URIs • Canonicalization • Aggressively avoid false distinctions • But always keep a working URL
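A conservative canonicalization pass might look like the sketch below. It is only one possible set of rules, chosen for illustration: the function name `canonicalize` is hypothetical, it assumes `http` URLs only, and it drops any userinfo. It lowercases the scheme and host, removes the default port and the fragment, and leaves the path and query untouched so that the result is still a working URL.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Normalize an http URL without changing which resource it names."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # host names are case-insensitive
    port = parts.port
    # Port 80 is the default for http, so ":80" is a false distinction.
    netloc = host if port in (None, 80) else f"{host}:{port}"
    # An empty path and "/" name the same resource.
    path = parts.path or "/"
    # The path's case is preserved: many servers treat it as significant.
    # The fragment is dropped: it is not sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


a = canonicalize("HTTP://WWW.Example.COM:80/a/b?x=1#top")
b = canonicalize("http://www.example.com/a/b?x=1")
```

Both calls yield the same string, so a crawler that keys its "already seen" set on the canonical form fetches the page once. More aggressive rules (for example, lowercasing the path) remove more duplicates but risk producing a URL that no longer works.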
HTML • Review how frames and JavaScript work
Problems for Archiving • Links obscured by increasing use of Flash, JavaScript, DHTML, PDF, Word, … • Soft 404s and 30x redirects (Big pain!!) • Great example of non-cooperation • Browser-specific content • Servers lie about content • E.g., incorrect or missing Content-Type
Problems for Archiving • Aliasing • Material is copied • Host has multiple names (www.foo.com and foo.com typically the same) • Resource has multiple names (e.g., case-insensitivity)
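One common way to spot aliases after the fact is to group fetched URLs by a digest of their content. The sketch below is an illustration only, not a technique from these slides: the function names and the sample URLs and bodies are made up, and exact-match hashing misses copies that differ by even one byte.

```python
import hashlib


def content_key(body: bytes) -> str:
    """Digest used to decide whether two responses carry the same bytes."""
    return hashlib.sha1(body).hexdigest()


def group_aliases(fetched):
    """Map {url: body} to lists of URLs that returned identical content."""
    by_digest = {}
    for url, body in fetched.items():
        by_digest.setdefault(content_key(body), []).append(url)
    # Only groups with more than one URL are aliases of each other.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Hypothetical crawl results: two host names and a case variant.
fetched = {
    "http://www.foo.com/index.html": b"<html>home</html>",
    "http://foo.com/index.html": b"<html>home</html>",
    "http://foo.com/INDEX.HTML": b"<html>home</html>",
    "http://foo.com/about.html": b"<html>about</html>",
}
aliases = group_aliases(fetched)
```

Here the three home-page URLs fall into one group and the about page stands alone, so an archive could store the content once and record the other names as aliases.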
Problems for archiving • And this ignores spamming!