
WEB Intelligence




  1. WEB Intelligence • Contents • Basic Web technology: HTML, CGI, HTTP • XML-based standards: XSLT, XPath • Web services, SOAP • Computational intelligence (for instance, neural networks) • Web crawlers and focused Web crawlers • XML indexing/retrieval • Ranking

  2. The Origins of the WWW • WWW was invented by Tim Berners-Lee at CERN (1989) • Hypertext across the Internet (replacing FTP) • Three constituents: HTML + URL + HTTP • HTML is an SGML language for hypertext • URL is a notation for locating files on servers • HTTP is a high-level protocol for file transfers

  3. Web Servers • Diagram: a click in the Web client (browser) sends an HTTP request to the Web server, which returns a response containing HTML code • Client-server model • Stateless

  4. Network Layers (top to bottom) • Our applications • The application layer: HTTP, FTP, SMTP, DNS • The transport layer: TCP, UDP • The Internet layer: IP • The network interface layer: Ethernet

  5. HTTP • HTTP request • GET http://www.it.lth.se/ • HTTP response • Envelope • A blank line • HTML code

  6. HTTP response example
     HTTP/1.1 200 OK
     Date: Fri, 10 Feb 2006 13:50:53 GMT
     Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3
     Content-Length: 170
     Content-Type: text/html
     Last-Modified: Fri, 10 Feb 2006 13:49:58 GMT

     <html>
     <head><title>Example HTML file</title></head>
     <body>
     <h1>Anders Ardö</h1>
     He is teacher at Department of Information Technology.
     </body>
     </html>
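The three parts named on the HTTP slide (envelope, blank line, HTML code) can be separated mechanically. The following is a minimal sketch, assuming the response is already available as a string with CRLF line endings; the function name `parse_http_response` and the shortened sample response are illustrative, not part of the original slides.

```python
def parse_http_response(raw: str):
    """Split a raw HTTP response into (status, headers, body)."""
    # The envelope and the body are separated by the first blank line.
    envelope, _, body = raw.partition("\r\n\r\n")
    lines = envelope.split("\r\n")
    # Status line: protocol version, numeric code, reason phrase.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return (version, int(code), reason), headers, body


# Shortened version of the response on the slide (illustrative).
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Last-Modified: Fri, 10 Feb 2006 13:49:58 GMT\r\n"
    "\r\n"
    "<html><body><h1>Anders Ardö</h1></body></html>"
)
status, headers, body = parse_http_response(raw_response)
```

Header names are lower-cased here because HTTP header names are case-insensitive; a real client would also have to handle repeated headers, which this sketch overwrites.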

  7. Anatomy of a WebPage • Head • Title • Meta: <meta name="keywords" content="HTML, WebPage"> • Style sheets • Body • Formatting tags: H1, table, B, P, BR, UL, … • Input forms • Links: <a href="http://www.it.lth.se/">IT</a> • Styles

  8. Hypertext • Collections of documents connected by hyperlinks • Paul Otlet, philosophical treatise (1934) • Vannevar Bush, hypothetical Memex system (1945) • Ted Nelson introduced hypertext (1968) • Hypermedia generalizes hypertext beyond text

  9. Markup Languages • Notation for adding formal structure to text • Charles Goldfarb, the INLINE system (1970) • Standard Generalized Markup Language, SGML (1986)

  10. The Design of HTML • Simple, purist design principles • HTML describes the logical structure of a document • Browsers are free to interpret tags differently • HTML is a lightweight file format • Size of file containing just ”Hello World!”:

  11. Simple Formatting (1/2)
     <html>
       <head>
         <title>Good Advice</title>
       </head>
       <body>
         <h1>Good Advice for Everyday Life</h1>
         <h2>For UNIX programmers</h2>
         <b>Never</b> type: <p><tt>rm -rf /*</tt><p> on your computer.
         <h2>For Nuclear Scientists</h2>
         <b>Never</b> press the <i>Big <font color="red">Red</font> Button</i>.
       </body>
     </html>

  12. Simple Formatting (2/2)

  13. Hyperlinks: Source Document
     <html>
       <head>
         <title>Source Document</title>
       </head>
       <body>
         <a href="target.html#danger">Better look here</a>.
       </body>
     </html>

  14. Hyperlinks: Target Document
     <html>
       <head>
         <title>Target Document</title>
       </head>
       <body>
         ...
         <a name="danger"></a>
         <h2>Chapter 17: Dangerous Shell Commands</h2>
         Never execute a shell command that inadvertently changes all vowels to the character 'x'.
       </body>
     </html>

  15. HTML Validity • HTML has a formal syntax specification • 800 lines of DTD notation • A validator gives syntax errors for invalid documents • Most HTML documents on the Web are invalid: • Valid documents may contain this logo:

  16. Reasons for Invalidity • Ignorance of the HTML standard • Lack of testing • "This page is optimized for the XYZ browser" • "This page is best viewed in 1024x768" • Automatic tools generate invalid HTML output • Forgiving browsers try to interpret invalid input:
     <h2>Lousy HTML</h1>
     <li><a>This is not very</b> good.
     <li><i>In fact, it is quite bad</em>
     </ul>
     But the browser does <a naem="goof">something.
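Mismatched tags like those in the lousy example can be spotted with a few lines of code. The sketch below uses Python's standard `html.parser` and only checks that each end tag closes the most recently opened element. It is not a DTD-based validator: HTML permits some end tags (for instance `</li>` and `</p>`) to be omitted, which this simplification would report as problems. The class name `TagBalanceChecker` and its short list of empty elements are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Report end tags that do not close the most recently opened element."""

    # A few elements that never take an end tag (deliberately incomplete).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


checker = TagBalanceChecker()
checker.feed("<h2>Lousy HTML</h1>")
checker.close()
```

After the run, `checker.problems` lists the stray `</h1>` and `checker.open_tags` still holds the unclosed `h2`. A forgiving browser has to guess how to recover from exactly this kind of situation.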

  17. Problems with Invalidity • There are several different browsers • Each browser has many different implementations • Each implementation must interpret invalid HTML • There are many arbitrary choices to make • The HTML standard has been undermined • HTML renders differently for most clients

  18. HTTP requests • GET: GET /path/to/file/index.html HTTP/1.0 • HEAD: HEAD /path/to/file/index.html HTTP/1.0 • POST: Adds data in the message body • and others …

  19. HTTP example
     GET /search?q=Introduction+to+XML+and+Web+Technologies HTTP/1.1
     Host: www.google.com
     User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803
     Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
     Accept-Language: da,en-us;q=0.8,en;q=0.5,sw;q=0.3
     Accept-Encoding: gzip,deflate
     Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
     Keep-Alive: 300
     Connection: keep-alive
     Referer: http://www.google.com/
     • Request line (methods: GET, POST, ...) • Header lines • Request body (empty here)
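Several header lines in the request carry `q=` weights that rank the client's preferences; an entry without a weight counts as 1.0. The sketch below shows one way to turn such a header value into a ranked list. The function name `parse_quality_list` is illustrative, and malformed weights are not handled.

```python
def parse_quality_list(header_value: str):
    """Return (value, quality) pairs sorted from most to least preferred."""
    entries = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        value, quality = parts[0], 1.0  # missing q means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                quality = float(param[2:])
        entries.append((value, quality))
    # sorted() is stable, so entries with equal weight keep header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# The Accept-Language value from the request on the slide.
languages = parse_quality_list("da,en-us;q=0.8,en;q=0.5,sw;q=0.3")
```

For the slide's header this ranks Danish first, followed by US English, generic English, and `sw`.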

  20. HTTP Responses
     Status line:
       HTTP/1.1 200 OK
     Header lines:
       Connection: close
       Date: Thu, 16 Mar 2006 12:39:12 GMT
       Accept-Ranges: bytes
       ETag: "63062-0-41342c03"
       Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3
       Content-Length: 2820
       Content-Type: text/html
       Last-Modified: Tue, 31 Aug 2004 07:42:59 GMT
       Client-Date: Thu, 16 Mar 2006 12:39:12 GMT
       Client-Peer: 130.235.4.69:80
       Client-Response-Num: 1
     Response body:
       <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
       <html>...</html>

  21. HTTP return codes • 1xx informational message • 2xx success: 200 OK • 3xx redirect: 301 Moved Permanently • 4xx client error: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found • 5xx server error: 500 Internal Server Error, 503 Service Unavailable
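Because the first digit determines the class of a return code, a client can decide how to react without knowing every individual code. A minimal sketch, with the hypothetical helper name `status_class`:

```python
# Class names follow the list on the slide.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a numeric HTTP status code to its class via the leading digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")
```

For example, `status_class(404)` yields `"client error"` and `status_class(503)` yields `"server error"`.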

  22. Static vs Dynamic Pages • Static - just copy a file from server to client • Dynamic - do some data processing • Parameters - CGI, Forms

  23. Dynamic Web Pages • Answers to database queries • Animated Web pages • User dialogs • Checking user input • May be handled client side (JavaScript, Java applets, Flash, …) or server side

  24. Dynamic, server side • CGI – Perl, Python, C, … • ASP • PHP • Java Servlets • Java Server Pages - JSP • etc

  25. CGI - Common Gateway Interface • Web server gets a request for a page with a special URL (/cgi-bin/…) • The CGI script is started as an OS process • Script reads parameters • Script outputs HTML code • Script process terminates
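The CGI steps above can be sketched as a small Python script. Under CGI, the Web server passes GET parameters to the script in the `QUERY_STRING` environment variable and relays whatever the script writes to standard output. The function name `handle_request`, the parameter name `query`, and the page text are assumptions made for illustration.

```python
import html
import os
from urllib.parse import parse_qs


def handle_request(environ) -> str:
    """Build the complete CGI output: header, blank line, HTML code."""
    # Read parameters: the server supplies them in QUERY_STRING.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("query", [""])[0]
    # Escape user input before echoing it back into the page.
    body = f"<html><body><p>You asked: {html.escape(query)}</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # Output the HTML code; the process terminates when the script ends.
    print(handle_request(os.environ), end="")
```

Keeping the logic in a function that takes the environment as an argument makes it possible to exercise the script without a Web server, for example `handle_request({"QUERY_STRING": "query=masters+thesis"})`.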

  26. CGI problems • OS processes are expensive • State between invocations • Synchronization between processes

  27. Parameters: HTML form
     <h3>Search Lund University Departments</h3>
     <form action="http://www.lu.se/search.phtml" method="get">
       Which database?
       <select name="db">
         <option value="LTH">LTH</option>
         <option selected value="LU">All LU</option>
         <option value="IT">IT</option>
       </select><br>
       Please enter your question:
       <input type="text" name="query"><br>
       <input type="submit" name="send" value="Go!">
     </form>

  28. Parameters
     • Encoded in the URL (GET):
         GET /cgi-bin/search.phtml?db=LU&query=masters+thesis HTTP/1.0
     • Encoded in the message body (POST):
         POST /cgi-bin/search.phtml HTTP/1.0
         Content-Type: application/x-www-form-urlencoded
         Content-Length: 26

         db=LU&query=masters+thesis

  29. Encoding of Form Data • Encoding to query string (URL encoding): db=LU&query=masters+thesis&send=Go%21 • GET: place query string in request URL: http://.../search.phtml?db=LU&query=mast... • POST: place query string in request body
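The encoding can be reproduced with `urllib.parse.urlencode` from the Python standard library, which turns spaces into `+` and escapes characters such as `!` as `%21`. The sketch below builds both the GET request line and a POST request from the same form fields; the path `/cgi-bin/search.phtml` is taken from the Parameters slide, and the variable names are illustrative.

```python
from urllib.parse import urlencode

# Field names and values as submitted by the search form.
form_fields = {"db": "LU", "query": "masters thesis", "send": "Go!"}
query_string = urlencode(form_fields)

# GET: the query string is appended to the request URL after '?'.
get_request_line = f"GET /cgi-bin/search.phtml?{query_string} HTTP/1.0"

# POST: the same string travels in the body; Content-Length counts its bytes.
body = query_string.encode("ascii")
post_request = (
    "POST /cgi-bin/search.phtml HTTP/1.0\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("ascii") + body
```

With the `send` field included the body is 37 bytes long; the 26 bytes quoted on the Parameters slide correspond to the `db` and `query` fields alone.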

  30. Server-side scripting: PHP • General-purpose scripting language • Suited for Web development • Can be embedded into HTML • Has a lot of predefined modules and interfaces

  31. PHP example
     <html>
      <head>
       <title>PHP Test</title>
      </head>
      <body>
       <?php echo "<p>Hello World</p>\n"; ?>
       The time is <?php echo date('H:i:s'); ?>
      </body>
     </html>

  32. Uniform Resource Locator • A Web resource is located by a URL: http://www.w3.org/TR/html4/ (scheme: http, server: www.w3.org, path: /TR/html4/) • Relative URL: sgml/dtd.html • Fragment identifier: http://www.w3.org/TR/HTML4/#minitoc
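The parts of a URL, and the way a relative URL is resolved against a base, can be inspected with `urllib.parse` from the Python standard library. A short sketch using the URLs from the slide; the variable names are illustrative.

```python
from urllib.parse import urljoin, urlparse

# Split a URL with a fragment identifier into its components.
parts = urlparse("http://www.w3.org/TR/HTML4/#minitoc")
scheme, server, path, fragment = (
    parts.scheme, parts.netloc, parts.path, parts.fragment
)

# Resolve the relative URL against the URL of the page that contains it.
resolved = urljoin("http://www.w3.org/TR/html4/", "sgml/dtd.html")
```

Here `resolved` is `http://www.w3.org/TR/html4/sgml/dtd.html`. The fragment (`minitoc`) is used only by the client to locate a position inside the fetched document.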

  33. URIs, URNs • Uniform Resource Identifier (URI): scheme:scheme-specific-part • Conventions about use of /, #, and ? • Uniform Resource Name (URN): urn:isbn:0-471-94128-X

  34. Sessions • But what if I’d like to implement a hit counter? Stateless => problems

  35. Session Management • Techniques • URL rewriting • Hidden form fields • Cookies • SSL sessions

  36. Cookies • Extension of HTTP that allows servers to store data on the clients • limited size and number • may be disabled by the client • Set-Cookie: sessionid=21A9A8089C305319; path=/ • Cookie: sessionid=21A9A8089C305319
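The cookie exchange can be traced with `http.cookies.SimpleCookie` from the Python standard library. The sketch below parses the value of the `Set-Cookie` header shown on the slide, as a client would, and then builds the `Cookie` header for the next request; the variable names are illustrative.

```python
from http.cookies import SimpleCookie

# Client side: parse the value of the server's Set-Cookie header.
jar = SimpleCookie()
jar.load("sessionid=21A9A8089C305319; path=/")

session_id = jar["sessionid"].value     # the stored data
cookie_path = jar["sessionid"]["path"]  # the attribute limiting where it is sent

# On later requests to a matching path the client sends the pair back.
request_header = f"Cookie: sessionid={session_id}"
```

The server recognizes the returning client by looking up `sessionid` in its own session table, which is how state is layered on top of the stateless protocol.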

  37. Regular expressions • Regular expressions are a very powerful way of extracting information (pieces of text) from a large document • A regular expression describes a pattern that is matched against the text

  38. Regular expressions • /Heja/ matches the string 'Heja' • /Heja?/ matches the strings 'Hej' and 'Heja' • /^http:/ matches all lines that begin with 'http:' • /\bFred\b/ matches 'Fred' but not 'Fredrick' • /(\d+):(\d+):(\d+)/ matches, for example, times like 12:30:01 and puts hours into group 1, minutes into group 2, and seconds into group 3 • /http:\/\/([^\/]+)(\/[^\s]+)\s/ matches URLs and places the server in group 1 and the path in group 2
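The patterns above use Perl-style `/…/` delimiters. In Python's `re` module the same patterns are written as raw strings, and `/` needs no escaping. The sketch below tries each pattern on small sample strings; the sample texts and variable names are invented for illustration.

```python
import re

# /Heja?/ : the trailing 'a' is optional.
heja_matches = [bool(re.fullmatch(r"Heja?", s)) for s in ("Hej", "Heja", "Hejaa")]

# /^http:/ : keep only the lines that begin with 'http:'.
lines = "http://www.it.lth.se/\nftp://ftp.sunet.se/\nhttp://www.lu.se/"
http_lines = [line for line in lines.splitlines() if re.search(r"^http:", line)]

# /\bFred\b/ : word boundaries exclude 'Fredrick'.
fred_alone = re.search(r"\bFred\b", "Fred is here") is not None
fred_in_fredrick = re.search(r"\bFred\b", "Fredrick") is not None

# /(\d+):(\d+):(\d+)/ : three groups for hours, minutes, seconds.
time_match = re.search(r"(\d+):(\d+):(\d+)", "Lecture starts at 12:30:01 sharp")
hours, minutes, seconds = time_match.groups()

# URL pattern: server in group 1, path in group 2.
url_match = re.search(
    r"http://([^/]+)(/[^\s]+)\s",
    "See http://www.it.lth.se/index.html for details",
)
server, path = url_match.groups()
```

Note that the URL pattern requires whitespace after the path, so a URL at the very end of the text would not match as written.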

  39. Regular expressions • What is an ISBN number? • Format? • How do we match and extract ISBN numbers? • /isbn:?\s*([\dx-]+)/i
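A sketch of ISBN extraction in Python follows. The character class is written `[\dx-]`, with the hyphen last, because Python's `re` module rejects `[\d-x]` as a bad character range. The pattern only finds candidate strings made of digits, hyphens, and `x`; it does not verify the ISBN check digit. The function name `extract_isbns` and the second ISBN in the sample text are invented for illustration.

```python
import re

# Case-insensitive: matches 'isbn', 'ISBN', and the check character 'X'.
ISBN_PATTERN = re.compile(r"isbn:?\s*([\dx-]+)", re.IGNORECASE)


def extract_isbns(text: str):
    """Return every ISBN-like string that follows the word 'isbn'."""
    return ISBN_PATTERN.findall(text)


sample = "Cited as urn:isbn:0-471-94128-X and as ISBN 0-201-53082-1."
found = extract_isbns(sample)
```

Both spellings in the sample are picked up, with or without the colon after `isbn`.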
