Web Servers Herng-Yow Chen
Outline • Survey many different types of software and hardware web servers. • Describe how to write a simple diagnostic web server in Perl. • Explain how web servers process HTTP transactions, step by step.
Different types of web servers • General-purpose software web server • Web server appliances • Embedded web servers
Jobs of web servers • Implement HTTP and the related TCP connection handling. • Manage the server-side resources and provide administrative features to configure, control, and enhance the web service.
Jobs of the operating system • Manages the hardware details of the underlying computer system. • Provides TCP/IP network support. • Provides filesystems to hold web resources. • Provides process management to control computing activities.
General-purpose software web server • General-purpose software web servers run on standard, network-enabled computer systems. • Open source software (such as Apache or W3C’s Jigsaw). • Commercial software (such as Microsoft’s and iPlanet’s web servers). • Web server software is available for just about every computer and operating system.
General-Purpose Software Web Servers • Figure: results of the September 2004 Netcraft web server survey (http://news.netcraft.com/archives/web_server_survey.html)
Web server appliances • Web server appliances are prepackaged software/hardware solutions. The vendor preinstalls a software server onto a vendor-chosen computer platform and preconfigures the software. • Sun/Cobalt RaQ web appliance (http://www.cobalt.com) • Toshiba Magnia SG10 (http://www.toshiba.com) • IBM Whistle web server appliance (http://www.whistle.com) • Appliance solutions remove the need to install and configure software and often greatly simplify administration. However, the web server is often less flexible and feature-rich, and the server hardware is not easily upgradable.
Embedded web servers • Embedded servers are tiny web servers intended to be embedded into consumer products (e.g., printers or home appliances). • They allow users to administer their consumer devices using a convenient web browser interface. • iPic match-head sized web server (http://www-ccs.cs.umass.edu/~shri/iPic.html) • NetMedia SitePlayer SP1 Ethernet web server (http://www.siteplayer.com)
A Minimal Perl Web server • Type-o-serve – a minimal Perl web server used for HTTP debugging • http://www.http-guide.com/tools/type-o-serve.pl
A Minimal Perl Web Server
HTTP request message:
  GET /blah.txt HTTP/1.1
  Accept: */*
  Accept-language: en-us
  Accept-encoding: gzip, deflate
  User-agent: Mozilla/4.0
  Host: www.csie.ncnu.edu.tw:8080
  Connection: Keep-alive
Type-o-serve dialog:
  % ./type-o-serve.pl 8080
  <<Request From 'www.csie.ncnu.edu.tw'>>
  GET /blah.txt HTTP/1.1
  Accept: */*
  Accept-language: en-us
  Accept-encoding: gzip, deflate
  User-agent: Mozilla/4.0
  Host: www.csie.ncnu.edu.tw:8080
  Connection: Keep-alive
  <<Type Response followed by '.'>>
  HTTP/1.0 200 OK
  Connection: close
  Content-type: text/plain

  Hi there!
HTTP response message:
  HTTP/1.0 200 OK
  Connection: close
  Content-type: text/plain

  Hi there!
What do web servers do? • Set up connection • Receive request • Process request • Access resource • Construct response • Send response • Log transaction
What Real Web Servers Do • Figure: the HTTP server software process runs in user space, on top of the operating system's TCP/IP network stack and network interface, between the client and object storage. • (1) Set up connection • (2) Receive request • (3) Process request • (4) Access resource • (5) Create response • (6) Send response • (7) Log transaction
Step 1: accepting client connections • Handling new connections • Extracting the client IP from a new TCP connection • Client hostname identification • Using “reverse DNS” • Determining the client user through ident • Some web servers support the IETF ident protocol
Handling new connections • When a client requests a TCP connection to the web server, the web server establishes the connection and determines which client is on the other side of the connection, extracting the IP address from the TCP connection (e.g., using the getpeername call on a UNIX socket). • The server is free to reject and immediately close a connection, for example because the client IP is unauthorized or belongs to a known malicious client. • Once a new connection is established and accepted, the server adds the new connection to its list of existing connections and prepares to watch for data on the connection.
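The connection-handling steps above might look like the following in Python. This is only a minimal sketch: the deny list `BLOCKED_IPS` and the helper names `is_allowed` and `accept_one` are assumptions made for illustration and do not come from the lecture.

```python
import socket

# Hypothetical deny list; a real server would load this from its configuration.
BLOCKED_IPS = {"203.0.113.7"}

def is_allowed(client_ip, blocked=BLOCKED_IPS):
    """Return True if the server should keep a connection from client_ip."""
    return client_ip not in blocked

def accept_one(listener, connections):
    """Accept one pending connection, rejecting unauthorized clients.

    Returns the accepted socket, or None if the connection was rejected.
    """
    conn, _addr = listener.accept()
    client_ip = conn.getpeername()[0]  # Python's wrapper around getpeername()
    if not is_allowed(client_ip):
        conn.close()                   # reject and immediately close
        return None
    connections.append(conn)           # track it and watch for request data
    return conn
```

A real server would call `accept_one` from its main loop each time the listening socket becomes readable.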
Client host identification • Most web servers can be configured to convert client IP addresses into client hostnames, using “reverse DNS.” • The hostname information is used for detailed access control and logging. • Note that hostname lookups can take a long time, slowing down web transactions. Many high-performance web servers either disable hostname resolution or enable it only for particular content. • Ex: Configuring Apache to look up hostnames only for HTML and CGI resources • HostnameLookups off • <Files ~ "\.(html|htm|cgi)$"> • HostnameLookups on • </Files>
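A reverse DNS lookup of this kind can be sketched in Python with the standard `socket.gethostbyaddr` call. The function name `client_hostname` and the injectable `resolver` parameter are assumptions for illustration; falling back to the bare IP address on failure is one reasonable choice, not the only one.

```python
import socket

def client_hostname(client_ip, resolver=socket.gethostbyaddr):
    """Return the hostname for client_ip, or the IP itself if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(client_ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return client_ip
```

Because the lookup blocks until the DNS answer arrives, a busy server would typically cache results or skip the lookup entirely, as the slide notes.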
Determining the client user through ident • The ident protocol lets servers find out what username initiated an HTTP connection. • The username information is particularly useful for logging: the 2nd field of the popular Common Log Format contains the ident username of each HTTP request. (Ident was first specified in RFC 931; the updated specification is RFC 1413.) • If a client supports the ident protocol, the client listens on TCP port 113 for ident requests.
Determining the Client User Through ident • Figure: the ident exchange between Mary's client and the web server. • (a) Mary establishes a new HTTP connection from client port 4236 to server port 80. • (b) The server establishes an ident connection to port 113 on Mary's machine. • (c) The server sends the request "4236, 80". • (d) The client returns the ident response "4236, 80 : USERID : UNIX : MARY".
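The query and reply in this exchange are single lines of text, so they are easy to build and parse. The sketch below assumes the helper names `build_ident_query` and `parse_ident_response` (both hypothetical) and handles only the USERID reply shown in the figure; any other reply is treated as "no username".

```python
def build_ident_query(client_port, server_port):
    """Build the query line the web server sends to port 113 on the client."""
    return f"{client_port}, {server_port}\r\n".encode("ascii")

def parse_ident_response(line):
    """Return the username from an ident reply, or None if none was given."""
    # A successful reply looks like "<ports> : USERID : <opsys> : <user-id>".
    fields = [field.strip() for field in line.strip().split(":", 3)]
    if len(fields) == 4 and fields[1].upper() == "USERID":
        return fields[3]
    return None  # for example an ERROR reply
```

For the exchange in the figure, the server would send `build_ident_query(4236, 80)` and pass the reply line to `parse_ident_response`.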
Ident protocol (cont.) • Ident can work inside organizations, but it does not work well across the public Internet for the following reasons. • Many client PCs don’t run the identd identification protocol daemon software. • The ident protocol significantly delays HTTP transactions. • Many firewalls won’t permit incoming ident traffic. • The ident protocol is insecure and easy to fabricate. • The ident protocol doesn’t support virtual IP addresses well. • There are privacy concerns about exposing client usernames. • Enable ident lookup in Apache • IdentityCheck on • Common Log Format log files typically contain hyphens (-) in the 2nd field if no ident information is available.
Step 2: Receiving request messages • As data arrives on connections, the server reads the data and starts parsing the request message. It: • Parses the request line, looking for the request method, the specified URI, and the version number. • Reads the message headers, each ending in CRLF. • Detects the end-of-headers blank line, ending in CRLF. • Reads the request body, if any (length specified by the Content-Length header). • Internal Representations of Messages • Some web servers also store the request message in internal data structures that make the message easy to manipulate.
Receiving Request Messages • Figure: a request message being read from the network. The server has received "GET /specials/hychen.gif HTTP/1.0<CRLF> Accept: image/gif<CRLF> Host: www.j" so far; the rest of the message ("oes-hardware.com<CRLF><CRLF>") is still in transit across the Internet from the client.
Internal Representations of Messages • Request message: GET /specials/saw-blade.gif HTTP/1.0<CRLF> Accept: image/gif<CRLF> Host: www.joes-hardware.com<CRLF> <CRLF> • Parsed into an internal structure: • method: 1 • version: 1.0 • uri: pointer to "specials/saw-blade.gif" • header count: 2 • headers: pointer to (Name: Accept, Value: pointer to "image/gif") and (Name: Host, Value: pointer to "www.joes-hardware.com") • body: -
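The parsing steps of Step 2 and the internal representation above can be sketched as a small Python function. The name `parse_request` and the dictionary layout are assumptions for illustration; the sketch returns `None` while the end-of-headers blank line has not yet arrived, and it does not handle malformed input.

```python
def parse_request(raw):
    """Parse raw request bytes into a dict, or return None if headers are incomplete."""
    head, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        return None  # the end-of-headers blank line has not arrived yet
    lines = head.decode("iso-8859-1").split("\r\n")
    # Request line: method, URI, and version separated by spaces.
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # The body length, if any, is given by the Content-Length header.
    length = int(headers.get("content-length", 0))
    return {"method": method, "uri": uri, "version": version,
            "headers": headers, "body": rest[:length]}
```

Header names are lowercased because HTTP header names are case-insensitive, which makes later lookups simpler.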
Different web server architectures • Single-threaded web servers • Multi-process and multi-threaded web servers • Multiplexed I/O web servers • Non-blocking network accessing • Multiplexed multi-threaded web servers
Step 3: Processing requests • Once the web server has received a request, it can process the request using the method, resource, headers, and optional body. • Some methods (e.g., POST) require entity body data in the request message. A few methods (e.g., GET) forbid entity body data in the request message.
Step 4: Mapping and Accessing resources • Docroot • Virtually hosted docroots • User home directory docroots • Directory Listings • Dynamic content resource mapping • Server-Side Include (SSI) • Access Control
Docroots • Web servers support different kinds of resource mapping, but the simplest form of mapping uses the request URI to name a file in the web server’s filesystem. • Typically, a special folder in the web server filesystem is reserved for web content. The folder is called the document root, or docroot. • The web server takes the URI from the request message and appends it to the document root. • The docroot setting in Apache servers: • DocumentRoot /usr/local/httpd/files • Servers must be careful not to let relative URLs back up out of a document root and expose other parts of the filesystem, e.g., http://www.csie.ncnu.edu.tw/../
Docroots • Figure: mapping a request URI onto the docroot. • Docroot: /usr/local/httpd/files • Request message: GET /specials/hychen.gif HTTP/1.0, Host: www.csie.ncnu.edu.tw • Request URI: /specials/hychen.gif • Server resource: /usr/local/httpd/files/specials/hychen.gif
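The docroot mapping, including the check against URIs that back up out of the document root, can be sketched as follows. The function name `map_to_docroot` is an assumption for illustration; the sketch only normalizes the path textually and does not resolve symbolic links, which a production server would also need to consider.

```python
import posixpath
from urllib.parse import unquote, urlsplit

DOCROOT = "/usr/local/httpd/files"

def map_to_docroot(request_uri, docroot=DOCROOT):
    """Map a request URI to a path under docroot; return None if it escapes."""
    path = unquote(urlsplit(request_uri).path)  # decode %2e%2e and friends first
    candidate = posixpath.normpath(posixpath.join(docroot, path.lstrip("/")))
    # After normalization the result must still be the docroot or lie inside it.
    if candidate != docroot and not candidate.startswith(docroot + "/"):
        return None
    return candidate
```

Decoding percent-escapes before normalizing matters: otherwise an encoded ".." segment would slip past the check.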
Virtually hosted docroots • Virtually hosted web servers host multiple web sites on the same web server, giving each site its own distinct document root on the server. • A virtually hosted web server identifies the correct document root to use from the IP address or hostname in the URI or the Host header.
Apache’s virtual host configuration • <VirtualHost www.joes-hardware.com> • ServerName www.joes-hardware.com • DocumentRoot /docs/joe • TransferLog /logs/joe.access_log • ErrorLog /logs/joe.error_log • </VirtualHost> • <VirtualHost www.marys-antiques.com> • ServerName www.marys-antiques.com • DocumentRoot /docs/mary • TransferLog /logs/mary.access_log • ErrorLog /logs/mary.error_log • </VirtualHost>
Virtually hosted docroots • Figure: two sites on one server. • Request message A (GET /index.html HTTP/1.0, Host: www.joes-hardware.com) is served from /docs/joe. • Request message B (GET /index.html HTTP/1.0, Host: www.marys-antiques.com) is served from /docs/mary.
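Selecting a document root from the Host header amounts to a table lookup. In this sketch the table `VIRTUAL_HOSTS`, the function name `docroot_for_host`, and the fallback path `/docs/default` are assumptions for illustration; IPv6 literals in the Host header are not handled.

```python
VIRTUAL_HOSTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}

def docroot_for_host(host_header, default="/docs/default"):
    """Pick the document root for a request based on its Host header."""
    # Drop an optional ":port" suffix; hostnames compare case-insensitively.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, default)
```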
User home directory docroots • Figure: per-user docroots. • Request message A (GET /~bob/index.html HTTP/1.0) is served from /home/bob/public_html. • Request message B (GET /~betty/index.html HTTP/1.0) is served from /home/betty/public_html.
User home directory docroots • Another common use of docroots gives people private web sites on a web server. • A typical convention maps URIs whose paths begin with a slash and tilde (/~) followed by a username to a private document root for that user. • The private docroot is often the folder called public_html inside that user’s home directory, but it can be configured differently (e.g., on the NCNU web server, we use WWW as the user’s private document root). • In Apache’s configuration: • UserDir public_html
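The /~username convention can be sketched with a regular expression. The function name `map_user_uri`, the `/home` base directory, and the allowed username characters are assumptions for illustration; the remaining path would still need the same ".." check that applies to any docroot mapping.

```python
import re

# Usernames are restricted to a conservative character set in this sketch.
USER_URI = re.compile(r"^/~([A-Za-z0-9][A-Za-z0-9_-]*)(/.*)?$")

def map_user_uri(uri_path, home_base="/home", user_dir="public_html"):
    """Map /~user/path to the user's private docroot, or return None."""
    match = USER_URI.match(uri_path)
    if match is None:
        return None  # not a user home directory URI
    username, rest = match.group(1), match.group(2) or "/"
    return f"{home_base}/{username}/{user_dir}{rest}"
```

Passing `user_dir="WWW"` reproduces the NCNU convention mentioned above.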
Directory listings • A web server can receive requests for directory URLs, where the path resolves to a directory, not a file. • Most web servers can be configured to take a few different actions when a client requests a directory URL: • Return an error. • Return a special, default, “index file” instead of the directory. • Scan the directory, and return an HTML page containing the contents.
Directory Listings (continued) • Most web servers look for a file named index.html or index.htm inside a directory to represent that directory. • In Apache’s configuration: • DirectoryIndex index.html index.htm home.html home.htm index.cgi • Disable the automatic generation of directory index files with the Apache directive: • Options -Indexes
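The three possible actions for a directory URL can be sketched in one function. The name `resolve_directory`, the tuple return values, and the choice of 403 as the error status are assumptions for illustration; the index file names are taken from the DirectoryIndex example.

```python
import html
import os

INDEX_NAMES = ["index.html", "index.htm", "home.html", "index.cgi"]

def resolve_directory(directory, allow_listing=False, index_names=INDEX_NAMES):
    """Decide how to answer a request for a directory URL.

    Returns ("file", path), ("listing", html_text), or ("error", 403).
    """
    # 1. Prefer a default index file if one exists.
    for name in index_names:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return ("file", candidate)
    # 2. Otherwise return an error unless listings are enabled.
    if not allow_listing:
        return ("error", 403)
    # 3. Scan the directory and build a simple HTML listing.
    items = "".join(f"<li>{html.escape(entry)}</li>"
                    for entry in sorted(os.listdir(directory)))
    return ("listing", f"<ul>{items}</ul>")
```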
Dynamic content resource mapping • Web servers can also map URIs to dynamic resources, that is, to programs that generate content on demand. • In fact, a whole class of web servers called application servers connect web servers to sophisticated backend applications. • The web server needs to be able to tell when a resource is a dynamic resource, where the dynamic content generator program is located, and how to run the program.
Dynamic content … • In Apache’s configuration: • ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/ • AddHandler cgi-script .cgi • CGI is an early, simple, and popular interface for executing server-side applications. Modern application servers have more powerful server-side dynamic content support, including Active Server Pages, Java servlets, and PHP.
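The two Apache directives express two ways of recognizing a dynamic resource: by URI prefix and by file extension. A rough Python sketch of the same decision follows; the names `find_cgi_program` and `is_dynamic` are hypothetical, and the sketch only locates the program without running it.

```python
# Mirrors "ScriptAlias /cgi-bin/ ..." and "AddHandler cgi-script .cgi".
SCRIPT_ALIAS = ("/cgi-bin/", "/usr/local/etc/httpd/cgi-programs/")
CGI_EXTENSIONS = (".cgi",)

def find_cgi_program(uri_path):
    """Return the program path for a URI under the script alias, else None."""
    prefix, directory = SCRIPT_ALIAS
    if uri_path.startswith(prefix):
        return directory + uri_path[len(prefix):]
    return None

def is_dynamic(uri_path):
    """Tell whether a URI names a dynamic resource rather than a static file."""
    return find_cgi_program(uri_path) is not None or uri_path.endswith(CGI_EXTENSIONS)
```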
Server-Side Includes (SSI) • Many web servers also provide support for server-side includes. • If a resource is flagged as containing server-side includes, the server processes the resource contents before sending them to the client. • The contents are scanned for certain special patterns, which can be variable names or embedded scripts. The special patterns are replaced with the values of variables or the output of executable scripts. • This is an easy way to create dynamic content.
Access controls • Web servers can also assign access controls to particular resources. • When a request arrives for an access-controlled resource, the web server can control access based on the IP address of the client, or it can issue a password challenge to get access to the resource. • We will see more details in a later lecture, Chapter 12 (HTTP authentication).
Step 5: Building Responses • Once the web server has identified the resource, it performs the action described in the request method and returns the response message, which contains a status code, response headers, and a response body. • Response Entities • MIME Typing • Redirection
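The scan-and-replace behaviour can be sketched with a regular expression over the Apache-style `<!--#echo var="..." -->` pattern. Only variable substitution is shown, not embedded scripts; the function name `expand_ssi` and the "(none)" placeholder for unknown variables are assumptions of this sketch.

```python
import re

# Matches directives such as <!--#echo var="LAST_MODIFIED" -->
SSI_ECHO = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')

def expand_ssi(content, variables):
    """Replace each echo directive with the value of the named variable."""
    def replace(match):
        return str(variables.get(match.group(1), "(none)"))
    return SSI_ECHO.sub(replace, content)
```

For example, `expand_ssi('<p>Updated <!--#echo var="LAST_MODIFIED" --></p>', {"LAST_MODIFIED": "today"})` yields `<p>Updated today</p>`.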
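Both kinds of access control can be sketched with Python's standard `ipaddress` module. The allowed network, the function names, and the order of the checks (IP address first, then credentials) are assumptions for illustration; 403 stands for a refused client and 401 for a password challenge.

```python
import ipaddress

# Hypothetical allow list; the address block is reserved for documentation.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def ip_is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)

def access_status(client_ip, authenticated, networks=ALLOWED_NETWORKS):
    """Return the HTTP status for a request to an access-controlled resource."""
    if not ip_is_allowed(client_ip, networks):
        return 403  # forbidden based on the client's IP address
    if not authenticated:
        return 401  # issue a password challenge
    return 200
```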
Response entities • If the transaction generated a response body, the content is sent back with the response message, which usually contains: • a Content-Type header, i.e. MIME typing • a Content-Length header, describing body size • The actual message body content
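Assembling such a response is mostly string formatting. In this sketch the function name `build_response` and the use of HTTP/1.0 in the status line are assumptions; the important detail is that Content-Length is computed from the body bytes actually sent.

```python
def build_response(body, content_type, status=200, reason="OK"):
    """Build a complete response message: status line, headers, blank line, body."""
    head = [
        f"HTTP/1.0 {status} {reason}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",  # size of the body in bytes
    ]
    return "\r\n".join(head).encode("ascii") + b"\r\n\r\n" + body
```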
MIME typing • The web server is responsible for determining the MIME type of the response body. • There are many ways to configure servers to associate MIME types with resources: • mime.types: extension-based type association. • Magic typing: content-based association, scanning the content for known patterns. • Explicit typing: force particular files or directory contents to have a given MIME type, regardless of the file extension or contents. • Type negotiation: the server is configured to store a resource in multiple document formats. In a client-server negotiation process the server can determine the “best” format to use (Chapter 17).
MIME Typing • Figure: the client's HTTP request message contains the command and the URI (GET /specials/hychen.gif HTTP/1.1, Host: www.csie.ncnu.edu.tw). The server www.csie.ncnu.edu.tw finds the file hychen.gif and responds with HTTP/1.1 200 OK, Content-type: image/gif, Content-length: 8572.
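Extension-based typing and magic typing can be combined in a few lines with Python's standard `mimetypes` module. The function name `guess_mime_type`, the tiny `MAGIC_PREFIXES` table, and the `application/octet-stream` fallback are assumptions for illustration; real servers use much larger tables.

```python
import mimetypes

# A very small "magic" table: leading bytes that identify a format.
MAGIC_PREFIXES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def guess_mime_type(filename, first_bytes=b"", default="application/octet-stream"):
    """Guess a MIME type from the file extension, then from the content."""
    mime_type, _encoding = mimetypes.guess_type(filename)  # extension-based
    if mime_type:
        return mime_type
    for prefix, magic_type in MAGIC_PREFIXES.items():      # content-based
        if first_bytes.startswith(prefix):
            return magic_type
    return default
```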
Redirection • Web servers sometimes return redirection responses (indicated by a 3XX return code) instead of success messages. The Location response header contains a URI for the new or preferred location of the content. Redirections are useful for: • Permanently moved resources • Temporarily moved resources • URL augmentation • Load balancing • Server affinity • Canonicalizing directory names
300-399: Redirection Status Codes • Status code and reason phrase: • 300 Multiple Choices • 301 Moved Permanently • 302 Found • 303 See Other • 304 Not Modified • 305 Use Proxy • 306 (Unused) • 307 Temporary Redirect
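One of the uses listed earlier, canonicalizing directory names, makes a compact example of a redirection response. The function names below are hypothetical, and the sketch assumes the server already knows that the requested path is a directory and that the site is served over plain HTTP.

```python
def redirect_response(location, status=301, reason="Moved Permanently"):
    """Build a bodiless redirection response pointing at location."""
    head = [
        f"HTTP/1.0 {status} {reason}",
        f"Location: {location}",
        "Content-Length: 0",
    ]
    return "\r\n".join(head).encode("ascii") + b"\r\n\r\n"

def canonical_directory_redirect(host, uri_path):
    """Redirect /dir to /dir/ so relative links inside the directory resolve."""
    if uri_path.endswith("/"):
        return None  # already canonical, no redirect needed
    return redirect_response(f"http://{host}{uri_path}/")
```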
Step 6: Sending Responses • The server may have many connections to many clients, some idle, some sending data to the server, and some carrying response data back to the clients. • The server needs to keep track of connection state and handle persistent connections with special care. • For non-persistent connections, the server is expected to close its side of the connection when the entire message is sent. • For persistent connections, the connection may stay open, in which case the server needs to be extra cautious to compute the Content-Length header correctly, or the client will have no way of knowing when a response ends (cf. Chapter 4).
Step 7: Logging • Finally, when a transaction is complete, the web server writes an entry to a log file, describing the transaction performed. • Most web servers provide several configurable forms of logging (see the later lecture on Chapter 21 for details).
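A log entry in the Common Log Format mentioned earlier can be produced with a single format string. The function name `common_log_line` is an assumption; the ident and user fields default to a hyphen, which is what the format uses when that information is unavailable.

```python
from datetime import datetime, timezone

def common_log_line(host, request_line, status, size,
                    ident="-", user="-", when=None):
    """Format one Common Log Format entry for a completed transaction."""
    when = when or datetime.now(timezone.utc)
    timestamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    return f'{host} {ident} {user} [{timestamp}] "{request_line}" {status} {size}'
```

A server would append the returned line, followed by a newline, to its access log once the response has been sent.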
Reference: Web server • http://www.apache.org: the Apache web site • http://www.w3c.org/Jigsaw: Jigsaw, W3C’s server • http://www.ietf.org/rfc/rfc1413.txt: RFC 1413, “Identification Protocol,” by M. St. Johns