WEB Intelligence • Contents • Basic Web technology, HTML, CGI, HTTP • XML-based standards XSLT, XPATH • Web services, SOAP • Computational Intelligence (as for instance Neural Networks) • Web Crawlers and focused Web crawlers • XML indexing/retrieval • Ranking
The Origins of the WWW • WWW was invented by Tim Berners-Lee at CERN (1989) • Hypertext across the Internet (replacing FTP) • Three constituents: HTML + URL + HTTP • HTML is an SGML language for hypertext • URL is a notation for locating files on servers • HTTP is a high-level protocol for file transfers
Web Servers • A click in the Web client (browser) sends an HTTP request to the Web server • The Web server answers with a response: HTML code • Client-server model • Stateless
Network Layers • Our applications • The application layer: HTTP, FTP, SMTP, DNS • The transport layer: TCP, UDP • The Internet layer: IP • The network interface layer: Ethernet
HTTP • HTTP request • GET http://www.it.lth.se/ • HTTP response • Envelope • A blank line • HTML code
HTTP response example HTTP/1.1 200 OK Date: Fri, 10 Feb 2006 13:50:53 GMT Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3 Content-Length: 170 Content-Type: text/html Last-Modified: Fri, 10 Feb 2006 13:49:58 GMT <html> <head><title>Example HTML file</title></head> <body> <h1>Anders Ardö</h1> He is teacher at Department of Information Technology. </body> </html>
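The three parts of a response (envelope, blank line, HTML code) can be separated mechanically. The following is a minimal sketch, assuming the response arrives as one string with CRLF line endings; `split_response` is a hypothetical helper name, not something from the slides.

```python
# Hypothetical helper (not from the slides): split a raw HTTP response
# into its status line, a dictionary of headers, and the body.
def split_response(raw: str):
    # The envelope and the body are separated by the first blank line.
    envelope, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = envelope.split("\r\n")
    headers = {}
    for line in header_lines:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A shortened, made-up response used only for illustration.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)
status_line, headers, body = split_response(raw)
print(status_line)
print(headers["content-type"])
print(body)
```

Header names are lowercased here because HTTP header names are case-insensitive; a real client library would also handle repeated headers and binary bodies.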
Anatomy of a WebPage • Head • Title • Meta: <meta name="keywords" content="HTML, WebPage"> • Style sheets • Body • Formatting tags: H1, table, B, P, BR, UL, … • Input forms • Links: <a href="http://www.it.lth.se/">IT</a> • Styles
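To make the anatomy concrete, the sketch below pulls the title, the meta keywords, and the link targets out of a small page using Python's standard `html.parser` module. The class name `PageAnatomy` and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageAnatomy(HTMLParser):
    """Collects the title, meta keywords, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [word.strip() for word in content.split(",")]
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only collected while inside <title>...</title>.
        if self._in_title:
            self.title += data


# Made-up sample page combining the elements listed on the slide.
page = PageAnatomy()
page.feed(
    "<html><head><title>Example</title>"
    '<meta name="keywords" content="HTML, WebPage"></head>'
    '<body><h1>Hello</h1><a href="http://www.it.lth.se/">IT</a></body></html>'
)
print(page.title, page.keywords, page.links)
```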
Hypertext • Collections of documents connected by hyperlinks • Paul Otlet, philosophical treatise (1934) • Vannevar Bush, hypothetical Memex system (1945) • Ted Nelson introduced hypertext (1968) • Hypermedia generalizes hypertext beyond text
Markup Languages • Notation for adding formal structure to text • Charles Goldfarb, the INLINE system (1970) • Standard Generalized Markup Language, SGML (1986)
The Design of HTML • Simple, purist design principles • HTML describes the logical structure of a document • Browsers are free to interpret tags differently • HTML is a lightweight file format • Size of file containing just "Hello World!":
Simple Formatting (1/2) <html> <head> <title>Good Advice</title> </head> <body> <h1>Good Advice for Everyday Life</h1> <h2>For UNIX programmers</h2> <b>Never</b> type: <p><tt>rm -rf /*</tt><p> on your computer. <h2>For Nuclear Scientists</h2> <b>Never</b> press the <i>Big <font color="red">Red</font> Button</i>. </body> </html>
Hyperlinks: Source Document <html> <head> <title>Source Document</title> </head> <body> <a href="target.html#danger">Better look here</a>. </body> </html>
Hyperlinks: Target Document <html> <head> <title>Target Document</title> </head> <body> ... <a name="danger"></a> <h2>Chapter 17: Dangerous Shell Commands</h2> Never execute a shell command that inadvertently changes all vowels to the character 'x'. </body> </html>
HTML Validity • HTML has a formal syntax specification • 800 lines of DTD notation • A validator gives syntax errors for invalid documents • Most HTML documents on the Web are invalid: • Valid documents may contain this logo:
Reasons for Invalidity • Ignorance of the HTML standard • Lack of testing • "This page is optimized for the XYZ browser" • "This page is best viewed in 1024x768" • Automatic tools generate invalid HTML output • Forgiving browsers try to interpret invalid input <h2>Lousy HTML</h1> <li><a>This is not very</b> good. <li><i>In fact, it is quite bad</em> </ul> But the browser does <a naem="goof">something.
Problems with Invalidity • There are several different browsers • Each browser has many different implementations • Each implementation must interpret invalid HTML • There are many arbitrary choices to make • The HTML standard has been undermined • HTML renders differently for most clients
HTTP requests • GET: GET /path/to/file/index.html HTTP/1.0 • HEAD: HEAD /path/to/file/index.html HTTP/1.0 • POST: Adds data in the message body • and others …
HTTP example GET /search?q=Introduction+to+XML+and+Web+Technologies HTTP/1.1 Host: www.google.com User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803 Accept: text/xml,application/xml,application/xhtml+xml, text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 Accept-Language: da,en-us;q=0.8,en;q=0.5,sw;q=0.3 Accept-Encoding: gzip,deflate Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Keep-Alive: 300 Connection: keep-alive Referer: http://www.google.com/ • Request line (methods: GET, POST, ...) • Header lines • Request body (empty here)
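How the request line, header lines, and body fit together can be sketched in a few lines. This is an illustration under stated assumptions, not a complete client: `build_request` is a hypothetical helper, and only a subset of the headers from the example is used.

```python
# Hypothetical helper: assemble an HTTP/1.1 request message as text.
def build_request(method, path, host, headers=None, body=""):
    # Request line first, then the Host header, then any extra headers.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line ends the header section; the body (possibly empty) follows.
    return "\r\n".join(lines) + "\r\n\r\n" + body


request = build_request(
    "GET",
    "/search?q=Introduction+to+XML+and+Web+Technologies",
    "www.google.com",
    {"Accept-Encoding": "gzip,deflate", "Connection": "keep-alive"},
)
print(request)
```

For a GET request the body stays empty, so the message ends directly after the blank line.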
HTTP Responses • Status line: HTTP/1.1 200 OK • Header lines: Connection: close Date: Thu, 16 Mar 2006 12:39:12 GMT Accept-Ranges: bytes ETag: "63062-0-41342c03" Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3 Content-Length: 2820 Content-Type: text/html Last-Modified: Tue, 31 Aug 2004 07:42:59 GMT Client-Date: Thu, 16 Mar 2006 12:39:12 GMT Client-Peer: 130.235.4.69:80 Client-Response-Num: 1 • Response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html>...</html>
HTTP return codes • 1xx informational message • 2xx success: 200 OK • 3xx redirect: 301 Moved Permanently • 4xx client error: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found • 5xx server error: 500 Internal Server Error, 503 Service Unavailable
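Since the class of a return code is given by its first digit, a client can decide how to react without knowing every individual code. A minimal sketch, where `status_class` is a hypothetical function name:

```python
# Classes of HTTP return codes, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    # Integer division by 100 yields the first digit of a three-digit code.
    return STATUS_CLASSES.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```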
Static vs Dynamic Pages • Static - just copy a file from server to client • Dynamic - do some data processing • Parameters - CGI, Forms
Dynamic Web Pages • Answers to database queries • Animated Web Pages • User Dialogs • Checking user input • May be handled client side (JavaScript, Java applets, Flash, …) or server side
Dynamic, server side • CGI – Perl, Python, C, … • ASP • PHP • Java Servlets • Java Server Pages - JSP • etc
CGI - Common Gateway Interface • Webserver gets a request for a page with a special URL (/cgi-bin/…) • The CGI script is started as an OS process • Script reads parameters • Script outputs HTML code • Script process terminates
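The steps above can be sketched as a small CGI script in Python. With GET requests, a CGI script receives its parameters in the `QUERY_STRING` environment variable; the function name `handle_request` and the parameter name `query` are assumptions made for this example.

```python
import html
import os
from urllib.parse import parse_qs


def handle_request(query_string: str) -> str:
    # Step 1: read the parameters passed by the web server.
    params = parse_qs(query_string)
    query = params.get("query", [""])[0]
    # Step 2: produce the HTML code, escaping user input before echoing it.
    body = f"<html><body><p>You asked: {html.escape(query)}</p></body></html>"
    # A CGI script writes a header section, a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # Step 3: write the response to standard output; the process then terminates.
    print(handle_request(os.environ.get("QUERY_STRING", "")), end="")
```

Each request starts a fresh process that runs this script from top to bottom, which is why no state survives between invocations.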
CGI problems • OS processes are expensive • State between invocations • Synchronization between processes
Parameters: HTML forms <h3>Search Lund University Departments</h3> <form action="http://www.lu.se/search.phtml" method="get"> Which database? <select name="db"> <option value="LTH">LTH</option> <option selected value="LU">All LU</option> <option value="IT">IT</option> </select><br> Please enter your question: <input type="text" name="query"><br> <input type="submit" name="send" value="Go!"> </form>
Parameters • Encoded in the URL (GET): GET /cgi-bin/search.phtml?db=LU&query=masters+thesis HTTP/1.0 • Encoded in the message body (POST): POST /cgi-bin/search.phtml HTTP/1.0 Content-Type: application/x-www-form-urlencoded Content-Length: 26 db=LU&query=masters+thesis
Encoding of Form Data • Encoding to query string (URL encoding): db=LU&query=masters+thesis&send=Go%21 • GET: place parameter string in request URL http://.../search.phtml?db=LU&query=mast... • POST: place query string in request body
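The encoding can be reproduced with Python's standard `urllib.parse` module. The sketch below encodes the three form fields from the example, shows where the string goes for GET and for POST, and decodes it again; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlencode

# Form fields as a browser would collect them from the search form.
fields = {"db": "LU", "query": "masters thesis", "send": "Go!"}

# URL encoding: spaces become '+', reserved characters become %XX.
query_string = urlencode(fields)
print(query_string)

# GET: the query string is appended to the request URL after '?'.
get_url = "http://www.lu.se/search.phtml?" + query_string

# POST: the same string travels in the request body instead,
# and Content-Length is the length of that body in bytes.
post_body = query_string.encode("ascii")
content_length = len(post_body)

# The server side reverses the encoding.
decoded = parse_qs(query_string)
print(decoded)
```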
Server side scripting: PHP • general-purpose scripting language • suited for Web development • can be embedded into HTML • has a lot of predefined modules and interfaces
PHP example <html> <head> <title>PHP Test</title> </head> <body> <?php echo "<p>Hello World</p>\n"; ?> The time is <?php echo date('H:i:s'); ?> </body> </html>
Uniform Resource Locator • A Web resource is located by a URL: http://www.w3.org/TR/html4/ (scheme: http, server: www.w3.org, path: /TR/html4/) • Relative URL: sgml/dtd.html • Fragment identifier: http://www.w3.org/TR/HTML4/#minitoc
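The parts of a URL, and the way a relative URL is resolved against a base URL, can be inspected with Python's `urllib.parse`. This is a small illustration using the URLs from the slide; the variable names are arbitrary.

```python
from urllib.parse import urljoin, urlsplit

# Split a URL into scheme, server, path, and fragment identifier.
parts = urlsplit("http://www.w3.org/TR/HTML4/#minitoc")
print(parts.scheme, parts.netloc, parts.path, parts.fragment)

# Resolve a relative URL against the URL of the document that contains it.
absolute = urljoin("http://www.w3.org/TR/html4/", "sgml/dtd.html")
print(absolute)
```

The fragment identifier is handled by the client: it names a position inside the retrieved document and is not sent to the server as part of the request.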
URIs, URNs • Uniform Resource Identifier (URI): scheme:scheme-specific-part • Conventions about use of /, #, and ? • Uniform Resource Name (URN): urn:isbn:0-471-94128-X
Sessions • But what if I’d like to implement a hit counter? Stateless => problems
Session Management • Techniques • URL rewriting • Hidden form fields • Cookies • SSL sessions
Cookies • Extension of HTTP that allows servers to store data on the clients • limited size and number • may be disabled by the client • Set-Cookie: sessionid=21A9A8089C305319; path=/ • Cookie: sessionid=21A9A8089C305319
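Cookies give a stateless server a way to recognise a returning client, which is enough to build the hit counter asked about under Sessions. The following is a sketch under stated assumptions: `handle` is a hypothetical request handler, the counter lives in an in-memory dictionary, and the session id is passed in rather than generated.

```python
from http.cookies import SimpleCookie

# Server-side state: session id -> number of requests seen so far.
hits = {}


def handle(cookie_header, new_session_id):
    """Return (hit count, Set-Cookie header line or None) for one request."""
    cookie = SimpleCookie(cookie_header or "")
    if "sessionid" in cookie:
        # Returning client: the browser sent the cookie back.
        session_id = cookie["sessionid"].value
        set_cookie = None
    else:
        # New client: ask the browser to store a session id.
        session_id = new_session_id
        set_cookie = f"Set-Cookie: sessionid={session_id}; path=/"
    hits[session_id] = hits.get(session_id, 0) + 1
    return hits[session_id], set_cookie


first = handle(None, "21A9A8089C305319")
second = handle("sessionid=21A9A8089C305319", "unused")
print(first, second)
```

Only the identifier travels in the cookie; the count itself stays on the server, so the size limits on cookies are not a concern here. If the client disables cookies, every request looks like a new session.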
Regular expressions • A very powerful way of extracting information (pieces of text) from a large document • A regular expression describes a pattern that is matched against the text
Regular expressions • /Heja/ matches the string 'Heja' • /Heja?/ matches the strings 'Hej' and 'Heja' • /^http:/ matches all lines that begin with 'http:' • /\bFred\b/ matches 'Fred' but not 'Fredrick' • /(\d+):(\d+):(\d+)/ matches for example times like 12:30:01 and groups hours into group 1, minutes into group 2, and seconds into group 3. • /http:\/\/([^\/]+)(\/[^\s]+)\s/ matches URLs and places the server in group 1 and the path in group 2.
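The patterns above are written in Perl-style /.../ notation. The sketch below tries some of them with Python's `re` module, where the slashes are dropped and `/` needs no escaping; the sample strings are made up for illustration.

```python
import re

# Optional character: 'a?' makes the final 'a' optional.
matches_hej = re.search(r"Heja?", "Hej") is not None

# Word boundaries: 'Fred' must stand alone as a word.
fred_alone = re.search(r"\bFred\b", "Fred was here") is not None
fred_inside = re.search(r"\bFred\b", "Fredrick was here") is not None

# Anchoring: keep only the lines that begin with 'http:'.
lines = ["http://www.lu.se/", "see http://www.lu.se/", "ftp://ftp.lu.se/"]
http_lines = [line for line in lines if re.search(r"^http:", line)]

# Groups: hours, minutes, and seconds end up in groups 1, 2, and 3.
time_match = re.search(r"(\d+):(\d+):(\d+)", "Lecture starts at 12:30:01 sharp")
hours, minutes, seconds = time_match.groups()

# Groups again: server in group 1, path in group 2.
url_match = re.search(
    r"http://([^/]+)(/[^\s]+)\s",
    "See http://www.it.lth.se/index.html for details",
)
server, path = url_match.groups()

print(matches_hej, fred_alone, fred_inside)
print(http_lines, (hours, minutes, seconds), (server, path))
```

Note that the URL pattern requires a whitespace character after the path, so a URL at the very end of the text would not match.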
Regular expressions • What is an ISBN number? • Format? • How to match and extract ISBN numbers? /isbn:?\s*([\dx-]+)/i
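One possible way to extract ISBN numbers is sketched here in Python. The character class is written with the hyphen last (`[\dx-]`), because Python's `re` module rejects a range that starts at `\d`. The sample text is an assumption; only the first ISBN appears elsewhere in these slides.

```python
import re

# 'isbn', an optional colon, optional whitespace, then digits, hyphens, or x.
# re.IGNORECASE covers both 'ISBN' and a trailing capital 'X'.
ISBN_PATTERN = re.compile(r"isbn:?\s*([\dx-]+)", re.IGNORECASE)

# Made-up sample text for illustration.
text = "See urn:isbn:0-471-94128-X and also ISBN 0-13-110362-8."

isbns = ISBN_PATTERN.findall(text)
print(isbns)
```

The pattern only checks the surface format. It does not verify the check digit, and it would also accept a string of nothing but hyphens, so a stricter validation step would be needed in practice.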