540 likes | 716 Views
Web Server Technologies Part I: HTTP & Getting Started. Joe Lima Director of Product Development Port80 Software, Inc. jlima@port80software.com. Web Server Technologies | Part I: HTTP & Getting Started. Tutorial Content. Introduction to HTTP. TCP/IP and application layer protocols
E N D
Web Server Technologies Part I: HTTP & Getting Started Joe LimaDirector of Product Development Port80 Software, Inc. jlima@port80software.com
Web Server Technologies | Part I: HTTP & Getting Started Tutorial Content Introduction to HTTP • TCP/IP and application layer protocols • URLs, resources and MIME Types • HTTP request/response cycle and proxies Setup and deployment • Planning Web server & site deployments • Site structure and basic server configuration • Managing users and hosts
Web Server Technologies | Part I: HTTP & Getting Started Preliminaries - Recommended Texts Administrating Web Servers, Security & Maintenance Larson and Stephens, Prentice Hall HTTP The Definitive Guide Gourley and Totty, et al., O’Reilly Online resources are plentiful and will be cited along the way.
Web Server Technologies | Part I: HTTP & Getting Started The Role of a Web Server • Web servers serve various resources - As file (document) servers - As application front ends • Other servers also provide services on the Internet, each speaking its own protocol: • - SMTP, POP, IMAP, NNTP, FTP, etc. • Web server = HTTP server • HTTP servers serve HTTP clients (browsers and other user agents) with the help of HTTP intermediaries (proxies) a box or a service?
Web Server Technologies | Part I: HTTP & Getting Started An Introduction to HTTP • Hyper Text Transfer Protocol • One of the application layer protocols that make up the Internet - HTTP over TCP/IP - Like SMTP, POP, IMAP, NNTP, FTP, etc. • The underlying language of the Web • Three versions have been used, two are in common use and have been specified: • - RFC 1945 HTTP 1.0 (1996) - RFC 2616 HTTP 1.1 (1999)
Web Server Technologies | Part I: HTTP & Getting Started A Brief Digression on TCP/IP HTTP sits atop the TCP/IP Protocol Stack HTTP Application Layer TCP Transport Layer IP Network Layer Network Interfaces Data Link Layer
Web Server Technologies | Part I: HTTP & Getting Started A Brief Digression on TCP/IP, cont. • IP provides packets that are routed based on source and destination IP addresses • TCP provides segments that ride inside the IP packets and add connection information based on source and destination ports • The ports let TCP carry multiple protocols that connect services running on default ports: • HTTP on port 80 • HTTP with SSL (HTTPS) on port 443 • FTP on port 21 • SMTP on port 25 • POP on port 110 • SSH on port 22
Web Server Technologies | Part I: HTTP & Getting Started A Brief Digression on TCP/IP, cont. • TCP also provides mechanisms to make the connection a reliable bit pipe • 3-way handshake, sequence numbers, checksums, control flags • A data stream is chopped up into chunks that are reassembled, complete and in correct order on the other endpoint of the connection • TCP segments, riding inside IP packets, carry the chunks of data • When HTTP is the Application Layer protocol on top of the stack, these chunks of data are the contents of the HTTP Message
Web Server Technologies | Part I: HTTP & Getting Started A Brief Digression on TCP/IP, cont. How an HTTP Message is delivered over TCP/IP connection: HTTP Message’s data stream is chopped up into chunks small enough to fit in a TCP segment GET /index.html HTTP/1.1<CRLF> Host: www.hostname.com Con… The chunks ride inside TCP segments used to reassemble them correctly on the other end of the connection The segments are shipped to the right destination inside IP datagrams
Web Server Technologies | Part I: HTTP & Getting Started A Brief Digression on TCP/IP, cont. • HTTPS (HTTP + SSL/TLS)Although a different protocol, service and port, HTTPS is usually integrated with the Web server • FTPOften run on the same box as the HTTP server to provide file transfer capabilities • SMTPSometimes run with Web server (email gateways) • SSHWidely used instead of telnet for remote admin Other application layer protocols use TCP/IP to provide Internet services often found in company with HTTP
Web Server Technologies | Part I: HTTP & Getting Started Introduction to HTTP - continued • HTTP and URLs • URLs used early on by all Internet protocols, including various document retrieval protocols • More specifications (both from 1994): - Uniform Resource Locators - RFC 1738 - Universal Resource Identifiers - RFC 1630 • Hypertext came to predominate as the most efficient way of providing access to resources - Fast, flexible, generic, extensible - Facilitated searching, collaboration, annotation • HTTP now the central mechanism for requesting and serving URL based resources
Web Server Technologies | Part I: HTTP & Getting Started Introduction to HTTP - continued A Digression on MIME Types • URLs point to resources (“content”) • Resources are represented using different Media Types (MIME Types) • Multipurpose Internet Mail Extensions RFC2045,6 • Should be registered with IANA (www.iana.org) • MIME Type tells how content should be handled • File extensions are mapped to certain MIME Types • .html usually means a MIME Type of text/html • .jpg usually means a MIME Type of image/jpeg • But mapping by file extension is dependent on local software’s conventions and might not be shared across applications or machines
Web Server Technologies | Part I: HTTP & Getting Started Introduction to HTTP - continued • HTTP allows MIME Type info to be passed between client and server so both agree about the media type of the resource • primary-type/sub-type • The most common MIME Types used on the Web come from the text, image and application top-level groups • text/html, text/css • image/gif, image/jpeg, image/png • application/pdf, application/octet-stream • application/x-javascript, application/x-shockwave-flash
Web Server Technologies | Part I: HTTP & Getting Started Introduction to HTTP - continued • HTTP servers turn URLs into resources through a request-response cycle • User agent (client) issues an HTTP request to a host (server) for a given resource using its URL • Server “resolves” the URL, acts on the resource - Retrieves, but also launches, modifies etc. • Server sends an HTTP response back to the client - Usually (not always) a representation of the requested resource - Can also be info about the resource, its state, etc. • Each request is discontinuous with all previous requests – HTTP is stateless
Web Server Technologies | Part I: HTTP & Getting Started Basic HTTP Request/Response Cycle HTTP Request HTTP Response Resource /bar HTTP Client Asks for resource by its URL: http://www.foo.com/bar.html HTTP Server www.foo.com
Web Server Technologies | Part I: HTTP & Getting Started An HTTP Request/Response Chain LAN DMZ Internet HTTP Server HTTP Client Egress Proxy Transparent Proxies Reverse Proxy Network at Hosting Provider Local DNS External DNS Servers Root DNS Servers
Web Server Technologies | Part I: HTTP & Getting Started Types and Uses of Proxy Servers • Proxies are HTTP Intermediaries • All act as both clients and servers • Major types of proxies can be distinguished by where they live and how they get traffic • - Explicit (e.g., Egress) • - Transparent/Intercepting • - Reverse/Surrogate • Three primary uses for proxies • - Security • - Performance • - Content Filtering
Web Server Technologies | Part I: HTTP & Getting Started Looking into HTTP To really understand Web servers (and clients), study the grammar, syntax and semantics of HTTP requests and responses: • Look at the parts of the transaction you don’t normally see in a browser • Issue requests manually to understand how a user agent gets resources from a server • Use protocol analyzers to “spy” on the HTTP conversation • Learn to troubleshoot problems by “reading” and “writing” HTTP
Web Server Technologies | Part I: HTTP & Getting Started Looking into HTTP - continued HTTP requests and responses are both types of Internet Messages (RFC 822), and share a general format: • A Start Line, followed by a CRLF • Request Line for requests • Status Line for responses • Zero or more Message Headers • field-name “:” [field-value] CRLF • An empty line • Two CRLFs mark the end of the Headers • An optional Message Body if there is a payload • All or part of the “Entity Body” or “Entity”
Web Server Technologies | Part I: HTTP & Getting Started Making a simple HTTP request • Open a TCP connection to a host • Can borrow telnet protocol to do this, by pointing it at the default HTTP port (80) • C:\>telnet www.google.com 80 • Ask for a resource using a minimal request syntax: • GET / HTTP/1.1 <CRLF> • Host: www.google.com <CRLF><CRLF> • A Host header is required for HTTP 1.1 connections, though not for HTTP 1.0
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at the Request Line Consists of three major parts • The Request Method followed by a SP • GET, POST, HEAD, TRACE, OPTIONS, PUT, DELETE and CONNECT • Extension methods such as those specified by WebDav (RFC 2518) • The Request URI followed by a SP • The URL associated with the resource • By far the most complex part of any Start Line • Defined by intension rather than extension • The HTTP Version followed by the CRLF • 0.9, 1.0, 1.1
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at the Request Methods • GET • By far most common method • Retrieves a resource from the server • Supports passing of query string arguments • HEAD • Retrieves only the Headers associated with a resource but not the entity itself • Highly useful for protocol analysis, diagnostics • POST • Allows passing of data in entity rather than URL • Can transmit of far larger arguments that GET • Arguments not displayed on the URL
Web Server Technologies | Part I: HTTP & Getting Started More Request Methods • OPTIONS • Shows methods available for use on the resource (if given a path) or the host (if given a “*”) • TRACE • Diagnostic method for assessing the impact of proxies along the request-response chain • PUT, DELETE • Used in HTTP publishing (e.g., WebDav) • CONNECT • A common extension method for Tunneling other protocols through HTTP
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at the Request URI • Absolute URI vs. Absolute Path • Explicit Proxies Require Absolute URIs • Client is connected directly to the proxy • Protocol and host name needed to resolve request • Grammar of the Absolute Path • Like Absolute URI minus the “http://hostname” • Initial “/” equivalent of the host’s document root • In HTTP 1.1 with name-based virtual hosting Host header directs request to appropriate document root • Subsequent slashes left-to-right imply less “significant” distinctions • The “*” form used to query entire host
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at the Status Line Consists of three major parts • The HTTP Version followed by a SP • Just like third part of Request Line • Status Code followed by a SP • 5 groups of 3 digit integers indicating the result of the attempt to satisfy the request • 1xx are informational • 2xx are success codes • 3xx are for alternate resource locations (redirects) • 4xx indicate client side errors • 5xx indicate server side errors • The Reason Phrase followed by the CRLF • Short textual description of the status code
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at HTTP Headers Headers come in four major types, some for requests, some for responses, some for both: • General Headers • Provide info about messages of both kinds • Request Headers • Provide request-specific info • Response Headers • Provide response-specific info • Entity Headers • Provide info about request and response entities • Extension headers are also possible
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at General Headers • Connection – lets clients and servers manage connection state • Connection: Keep-Alive (HTTP 1.0) • Connection: close (HTTP 1.1) • Date – when the message was created • Date: Sat, 31-May-03 15:00:00 GMT • Via – shows proxies that handled message • Via: 1.1 www.myproxy.com (Squid/1.4) • Cache-Control – Among the most complex of headers, enables caching directives • Cache-Control: no-cache
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at Request Headers • Host – The hostname (and optionally port) of server to which request is being sent • Required for name-based virtual hosting • Host: www.port80software.com • Referer – The URL of the resource from which the current request URI came • Misspelled in the specification, so [Sic] • Referer: http://www.host.com/login.asp • User-Agent – Name of the requesting application, used in browser sensing • User-Agent: Mozilla/4.0 (Compatible; MSIE 6.0)
Web Server Technologies | Part I: HTTP & Getting Started Some More Request Headers • Accept and its variants – Inform servers of client’s capabilities and preferences • Enables content negotiation • Accept: image/gif, image/jpeg;q=0.5 • Accept- variants for Language, Encoding, Charset • If-Modified-Since and other conditionals • Frequently used by browsers to manage caches • If-Modified-Since: Sat, 31-May-03 15:00:00 GMT • Cookie – How clients pass cookies back to the servers that set them • Cookie: id=23432;level=3
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at Response Headers • Server – The server’s name and version • Server: Microsoft-IIS/5.0 • Can be problematic for security reasons • Vary – Tells client & proxy caches which headers were used for content negotiation • Vary: User-Agent, Accept • Set-Cookie – This is how a server sets a cookie on a client • Set-Cookie: id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT; secure
Web Server Technologies | Part I: HTTP & Getting Started A Closer Look at Entity Headers • Allow – Lists the request methods that can be used on the entity • Allow: GET, HEAD, POST • Location – Gives the alternate or new location of the entity • Used with 3xx response codes (redirects) • Location: http://www.ibm.com/us/ • Content-Encoding – specifies encoding performed on the body of the response • Used with HTTP compression • Corresponds to Accept-Encoding request header • Content-Encoding: gzip
Web Server Technologies | Part I: HTTP & Getting Started More Entity Headers • Content-Length – The size of the entity body in bytes • Value shrinks when compression is applied • Content-Length: 24000 • Content-Location – The actual URL of the resource if different than its request URL • Often used to show the index or default page • Content-Location: http://www.foo.com/home.html • Content-Type – specifies Media (MIME) type of the entity body • Corresponds to Accept header • Content-Type: image/png
Web Server Technologies | Part I: HTTP & Getting Started More Entity Headers • Etag – Uniquely identifies a particular instance of a given resource • Used with conditional request headers to validate cached instances of the resource • If-Match, If-None-Match • Etag: adkskdashjgk07563AF • Expires – Gives expiration for the instance of the resource for use in caching • Expires: Sat, 31-May-03 19:00:00 GMT • Last-Modified – Date/time the entity was last changed (or created) • Last-Modified: Fri 30-May-03 09:00:00 GMT
Web Server Technologies | Part I: HTTP & Getting Started Planning Web Server Deployments • Major issues to consider when planning a Web server or Web site deployment • What is the appropriate form of Web hosting? • What type of server software will be used? • What are the sizing requirements? • How will DNS be handled? • There are no fixed answers to any of these questions • Planning should be guided by the goals of the deployment and should harmonize with the related business processes
Web Server Technologies | Part I: HTTP & Getting Started Choosing Among the Hosting Options • Host your own • Pro: Complete control over the physical box • Con: Expensive and difficult to maintain well • Hosting provider schemes • Dedicated Server • Pro: Control without the hardware purchase • Con: Must manage the box – remotely • Co-located Server • Pro: Admin control of entire box • Con: Must purchase box and manage remotely • Virtual Hosting • Pro: Cheapest and easiest to maintain solution • Con: Server is shared, admin access limited
Web Server Technologies | Part I: HTTP & Getting Started Choosing Server Software Beware of sectarian quarrels, especially over performance and security • Apache has the best reputation historically • OS started out more stable, secure and scalable • Features rapidly extended & refined via modular and open development model • Strong administrator ethos = well managed boxes • IIS formerly favored mainly for ease of use in less demanding environments, but 5.0 on Win2K closed most of the remaining quality gap • Any modern HTTP server is very solid software that is terribly vulnerable when deployed & used naively
Web Server Technologies | Part I: HTTP & Getting Started Choosing Server Software, cont. In real world, usually a conditioned choice if not a forgone conclusion • Biggest single factors are type of deployment and prior commitment to an underlying OS • Apache on UNIX and Linux predominates in universities, research institutes and for virtual hosting setups – has majority of hosted domains • Netscape/iPlanet used to have large enterprise market almost to itself • IIS started with smaller companies, often as part of LAN server, but has now taken over Netscape’s leading role in the enterprise
Web Server Technologies | Part I: HTTP & Getting Started Sizing a Web Server • Sizing is process of determining the physical resources required to meet anticipated demand • Processing power and memory are not typically a problem for the Web server • Basic HTTP server job of fetching files is not processor intensive • Resource constraints on the box probably an effect of other server-side mechanisms • Automated session management by app servers • Manipulation of large database queries • Lots of non-optimized code in Web applications
Web Server Technologies | Part I: HTTP & Getting Started Sizing a Web Server, cont Network bottlenecks • Available bandwidth should accommodate max HTTP operations (“hits”) under peak load • Assuming an average file size of 14,000 bytes • 56K Modem could handle about 0.5 hits/sec • T1 line (1.5Mb) could handle about 13 hits/sec • T3 (45Mb) could handle about 400 hits/sec • OC3 (155Mbps) could handle about 1380 hits/sec • Bandwidth sizing should be adjusted based on your actual request frequency and size • Assume peaks at triple the average loads • Also watch out for collisions and overloading of routers, switches, hubs and NICs on the network
Web Server Technologies | Part I: HTTP & Getting Started Dealing with DNS Making a site available by domain name requires its registration and use of DNS • A domain name can be registered with many different registrars • During registration, a DNS server is designated to maintain the domain’s DNS records • These records propagate to other DNS servers • DNS servers use them to resolve a domain such as www.port80software.com to a four-octet IP address such as 66.45.42.237 • ISP’s offer DNS services; you can also maintain your own or use a 3rd party service that lets you manage the records without running a DNS box
Web Server Technologies | Part I: HTTP & Getting Started A Simplistic Model of the DNS System • Client asks its ISP’s DNS to resolve foo.com • That DNS asks root DNS whom to ask about foo.com • Root DNS points to 2nd ISP’s DNS • 1st ISP’s DNS asks 2nd ISP’s DNS • 2nd ISP’s DNS responds with IP • 1st ISP’s DNS replies and caches Root DNS Server 2 3 1 4 6 5 ISP DNS Server ISP DNS Server
Web Server Technologies | Part I: HTTP & Getting Started Dealing with DNS, cont. • You should learn to use nslookup to verify your DNS lookups are working and troubleshoot DNS problems • Command line utility also built into network analyzers like free ieHTTPHeaders • C:\>nslookup google.com • You can also point nslookup at specific DNS servers to test their ability to resolve • C:\>nslookup • >Server 206.13.30.12 • >google.com
Web Server Technologies | Part I: HTTP & Getting Started Virtual and Physical Site Structure Think of a site as having not one structure but two – virtual and physical • Virtual structure is described by the URLs used to request resources from the site • This is the public view of the site – the site as visitors will see it when they browse to it • Physical structure is the organization of the files and directories in the file system on the host machine’s hard disk • This is the private view of the site seen only by you and those users you choose to give access • It will become obvious why this distinction is necessary to keep things straight
Web Server Technologies | Part I: HTTP & Getting Started Configuring Virtual-Physical Mappings The Document Root • A directory in the file system of the host machine where the Web server looks for the files that constitute the Web site • Also called the rootdirectory • Often given an index or default document that serves as the homepage of the site. • Corresponds to the “/” at the end of hostname portion of the URL: • http://www.foo.com/index.html (virtual) • /var/www/index.html (physical) • C:\inetpub\wwwroot\index.html (physical)
Web Server Technologies | Part I: HTTP & Getting Started Configuring Virtual-Physical Mappings Notice how the hostname portion of the URL maps to the same place pointed to by the physical path that lies to the left of the the “/” representing the document root • The URL is virtual to the left of the document root, but it seems to be physical to the right of the document root • In fact, a URL is purely virtual – there is no guarantee that the path to the right of the document root looks this way on disk • In this simple case, virtual and physical paths happen to coincide from the document root down, but such is not always the case
Web Server Technologies | Part I: HTTP & Getting Started Configuring Virtual-Physical Mappings • A virtual directory or alias in the URL path preempts the lookup in the document root • This extends the virtual structure to the right of (or “below”) the root “/” in the URL path • http://www.foo.com/virtual/index2.html • /htdocs/physical/index2.html • Here a virtual directory virtual points to a physical directory that is outside of the document root altogether • Nested virtual directories are also possible
Web Server Technologies | Part I: HTTP & Getting Started Configuring Virtual-Physical Mappings • You can (and should) take advantage of this virtual/physical distinction to: • Preserve the site’s URL scheme even if the physical structure has to change • Avoids broken links due to site expansion/revision • Manage directory and file locations in ways that minimize security risks and facilitate backup procedures • Reduce redundant physical directories for supporting files • Allow developers to keep relative URLs in source code simple
Web Server Technologies | Part I: HTTP & Getting Started Virtual Hosting • We know the hostname part of the URL is a virtual locator for files that live (physically) in a site’s document root • The idea of virtual hosting takes this a step further by allowing a single server to host many domains, each with its own document root • Two methods of virtual hosting • Old way: multiple IP addresses per server • New way: name-based using host headers
Web Server Technologies | Part I: HTTP & Getting Started Managing Users and Hosts • Users (developers) will need remote access allowing them to transfer files to and from the site’s physical structure • FTP (and other file transfer mechanisms) allow the administrator to restrict this access • to sub-sections of the site • by user account or client IP • These restrictions should be backed up by access control lists on the directories that enforce the “principle of least access”
Web Server Technologies | Part I: HTTP & Getting Started Managing Users and Hosts • Similar rules apply to managing access to the Web site itself by visitors • ACLs in the Web site’s physical file structure should be set to the minimum required by the Web server to serve the resources on the site • This gets tricky with server side programming • If the Web site (or part of it) does not need to be available for anonymous access from everywhere then users, groups, hosts and IPs should be restricted • HTTP Authentication can also be employed to require make all or part of a site private and require login