Web basics

caspar

Presentation Transcript


  1. Web basics
     • HTTP
       • http://www.ietf.org/rfc/rfc2616.txt
       • http://www2002.org/CDROM/refereed/444/
     • URI/L/Ns
       • http://www.ietf.org/rfc/rfc2396.txt
     • HTML
       • http://www.w3.org/TR/html401/

  2. HTTP operation: basic (top) vs. with intermediaries
     [Diagram, top: User Agent and Origin Server, connected by a Request and a Response]
     [Diagram, bottom: User Agent and Origin Server, connected by a Request chain and a Response chain]
     • Intermediaries: proxies, gateways, tunnels

  3. HTTP Terminology
     • User Agent (UA): program acting on behalf of user.
     • Resource: data object or service identified by a URI.
     • Origin server (OS): server originating a resource.
     • Connection: transport session initiated by UA (but not always direct to OS). Typically TCP or SSL.

  4. HTTP Terminology
     • Message: formatted sequence of bytes
       • Request: from client to server
       • Response: from server to client
     • Message = start line + headers + body

  5. Request and response messages

     Request:
         GET /index.html HTTP/1.1
         Host: www.hello.ucsc.edu
         User-Agent: Mozilla
         <blank line>

     Response:
         HTTP/1.1 200 OK
         Content-Length: 45
         Content-Language: en-us
         Content-Type: text/html
         <blank line>
         <html>
         <body>
         Hello world
         </body>
         </html>
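The "start line + headers + body" layout can be illustrated with a small parser. This is only a sketch: `parse_message` is a hypothetical helper (not from the slides), it assumes CRLF line endings and one header per line, and it ignores chunked bodies and repeated headers.

```python
def parse_message(raw: bytes):
    """Split an HTTP/1.x message into (start_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# The request from the slide, written out with explicit CRLFs.
request = (b"GET /index.html HTTP/1.1\r\n"
           b"Host: www.hello.ucsc.edu\r\n"
           b"User-Agent: Mozilla\r\n"
           b"\r\n")
req_line, req_headers, req_body = parse_message(request)
method, target, version = req_line.split(" ")

# A shortened version of the response from the slide.
response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\n"
            b"\r\n"
            b"<html><body>Hello world</body></html>")
status_line, resp_headers, resp_body = parse_message(response)
http_version, status_code, reason = status_line.split(" ", 2)
```

A request without a body, like the GET here, simply ends at the blank line, so its body comes back empty.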

  6. Requests
     • GET, HEAD, POST
     • PUT, DELETE
     • OPTIONS, TRACE, CONNECT

  7. Common request headers
     • Host (required), User-Agent
     • Referer
     • Authorization
     • If-Modified-Since, Cache-Control
     • Accept[-Language/-Charset/-Encoding]

  8. Common response codes
     • 200 OK
     • 301 Moved Permanently, 307 Temporary Redirect
     • 400 Bad Request
     • 401 Unauthorized, 403 Forbidden
     • 404 Not Found
     • 500 Internal Server Error

  9. Common response headers
     • Content-Type, Content-Length, Content-Language
     • Date, Last-Modified, Expires
     • Location [for 3xx responses]
     • Server

  10. Response generation: theory (top) vs. practice
      [Diagram, top: Resource, Variant, Instance, Entity, Message]
      • Selection (negotiation, UA optimization)
      • Content encoding (gzip)
      • Instance manipulations (range, delta)
      • Transfer encoding (chunking, encryption)
      [Diagram, bottom: Resource, Variant/Instance, Message]
      • Selection (UA optimization)
      Understanding the full model is necessary for a good understanding of caching, but we are going to ignore caching.
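Two of the steps named above, content encoding (gzip) and transfer encoding (chunking), can be sketched as follows. `encode_chunked` and `decode_chunked` are hypothetical helpers written for this illustration. The sketch ignores trailers and does no error handling, and the 16-byte chunk size is an arbitrary choice.

```python
import gzip


def encode_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    """Apply chunked transfer encoding: hex size, CRLF, data, CRLF, ..."""
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += (b"%x\r\n" % len(chunk)) + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # a zero-length chunk terminates the body
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reverse encode_chunked (no trailer support)."""
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # Anything after ';' on the size line is a chunk extension; drop it.
        size = int(data[pos:eol].split(b";")[0], 16)
        if size == 0:
            break
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)


entity = b"<html><body>Hello world</body></html>"
content_encoded = gzip.compress(entity)   # Content-Encoding: gzip
wire = encode_chunked(content_encoded)    # Transfer-Encoding: chunked
roundtrip = gzip.decompress(decode_chunked(wire))
```

The receiver undoes the steps in reverse order: it removes the transfer encoding first, then the content encoding.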

  11. Cookies
      • Not part of official HTTP spec, but see:
        • http://www.ietf.org/rfc/rfc2109.txt
        • http://www.ietf.org/rfc/rfc2965.txt
      • Adding state to “stateless” protocol
      • OS adds Set-Cookie header to response:
        • Set-Cookie: sid=113a8fbc;version=1;path=/
      • UA adds Cookie header to future requests:
        • Cookie: sid=113a8fbc;$version=1;$path=/
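The Set-Cookie / Cookie exchange can be sketched with Python's standard `http.cookies` module. `cookie_header` is a hypothetical helper. For simplicity it sends back only the name=value pair, leaving out the `$version` and `$path` attributes shown above, and its path matching is a plain prefix test.

```python
from http.cookies import SimpleCookie

# UA side: remember what the origin server sent in Set-Cookie.
jar = SimpleCookie()
jar.load("sid=113a8fbc;version=1;path=/")


def cookie_header(jar: SimpleCookie, request_path: str) -> str:
    """Build the value of the Cookie header for a later request."""
    parts = []
    for name, morsel in jar.items():
        path = morsel["path"] or "/"
        # Only return cookies whose path covers the requested path.
        if request_path.startswith(path):
            parts.append(f"{name}={morsel.value}")
    return "; ".join(parts)


header_value = cookie_header(jar, "/index.html")
```

The server can then use the returned `sid` value to look up per-user state, which is how cookies add state to an otherwise stateless protocol.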

  12. URI/L/N
      • Uniform Resource…
        • Name: a persistent identifier
          • (Under development)
        • Locator: (perhaps transient) locator information
          • Typically: address plus access method
        • Identifier: either a URN or URL
      • RFC 2396 provides syntactic rules that all URIs must obey

  13. HTTP URLs
      • http://host:port/path?query
      • “Fragments” are not strictly part of URLs
      • Relative URIs
      • Canonicalization
        • Aggressively avoid false distinctions
        • But always keep a working URL
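Relative-URI resolution and a conservative canonicalization can be sketched with `urllib.parse`. `canonicalize` is a hypothetical helper and the rules it applies are assumptions chosen for illustration: lowercase the scheme and host, drop the default port 80 for http, use "/" for an empty path, and strip the fragment. It also drops any user:password part. More aggressive rules risk breaking the "always keep a working URL" requirement.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Normalize an http URL without changing which resource it names."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname excludes userinfo and port
    port = parts.port
    # Port 80 is the default for http, so ":80" is a false distinction.
    if port is None or (scheme == "http" and port == 80):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is handled by the UA and never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


canonical = canonicalize("HTTP://WWW.Example.COM:80/a/b?x=1#top")
bare_host = canonicalize("http://example.com")

# A relative URI is resolved against the URL of the page containing it.
resolved = urljoin("http://example.com/a/b.html", "../c.html")
```

The path and query are left untouched here because their case and ordering can be significant to the server.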

  14. HTML
      • Review how frames and JavaScript work

  15. Problems for Archiving
      • Links obscured by increasing use of Flash, JavaScript, DHTML, PDF, Word, …
      • Soft 404s and 30x responses (big pain!!)
        • Great example of non-cooperation
      • Browser-specific content
      • Servers lie about content
        • E.g., incorrect or missing Content-Type

  16. Problems for Archiving
      • Aliasing
        • Material is copied
        • Host has multiple names (www.foo.com and foo.com typically the same)
        • Resource has multiple names (e.g., case-insensitivity)
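One simple way an archiver might spot aliases is to group fetched URLs by a digest of their content. This is a sketch of that general idea, not a method described in the slides. `group_aliases` is a hypothetical helper, the URLs and bodies are made-up sample data, and exact-match hashing misses copies that differ by even one byte.

```python
import hashlib
from collections import defaultdict


def group_aliases(fetched: dict) -> list:
    """Group URLs whose response bodies are byte-for-byte identical.

    `fetched` maps URL -> response body (bytes).
    """
    by_digest = defaultdict(list)
    for url, body in fetched.items():
        by_digest[hashlib.sha1(body).hexdigest()].append(url)
    # Only digests seen under more than one URL indicate aliasing.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Made-up sample data: two host names serving the same page.
fetched = {
    "http://www.foo.com/": b"<html>home</html>",
    "http://foo.com/": b"<html>home</html>",
    "http://foo.com/other": b"<html>other</html>",
}
aliases = group_aliases(fetched)
```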

  17. Problems for Archiving
      • And this ignores spamming!
