Web Intelligence By Otto Borchert April 28, 2003
Background • Application Layer / HTTP • Agents • Present - Google / PageRank • Future - Semantic Web / OWL
Hypertext Transfer Protocol (HTTP) • Application-level protocol (World Wide Web) • Runs over TCP, normally port 80 • Information retrieved using a URL (Uniform Resource Locator) of the form protocol://host:port/path • Typical HTTP message format • START_LINE<CRLF> • MESSAGE_HEADER<CRLF> • <CRLF> • MESSAGE_BODY<CRLF>
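As a rough illustration of the URL layout on this slide, the sketch below splits a URL into protocol, host, port, and path using Python's standard `urllib.parse`. The helper name `split_url` is made up for this example, and falling back to port 80 when the URL omits a port is an assumption that only holds for plain HTTP URLs.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return (protocol, host, port, path) for an HTTP URL (illustrative helper)."""
    parts = urlsplit(url)
    # urlsplit reports the port as None when the URL omits it;
    # assume the usual HTTP port 80 in that case.
    port = parts.port if parts.port is not None else 80
    # An empty path means the root document.
    path = parts.path or "/"
    return parts.scheme, parts.hostname, port, path

print(split_url("http://www.cs.ndsu.nodak.edu/index.html"))
```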
Request Messages • The request method is given by the client on the START_LINE • Methods include: • OPTIONS: request information about available options • GET (one of the two most commonly used): retrieve the document identified by the URL • HEAD (the other most commonly used): retrieve metainformation about the document identified by the URL (e.g., find out how old a page is) • POST: give information to the server • PUT: store a document under the specified URL • DELETE: delete the specified URL • TRACE: loop back the request message • CONNECT: for use by proxies
Example request • GET http://www.cs.ndsu.nodak.edu/index.html HTTP/1.1 • Entire URL given in START_LINE • GET /index.html HTTP/1.1 Host: www.cs.ndsu.nodak.edu • Path to the page given in START_LINE, host in MESSAGE_HEADER
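The second request form above can be assembled as plain text. This is a minimal sketch, not a full HTTP client: `build_request` is a hypothetical helper, it emits only the Host header, and it assumes a request with no body.

```python
def build_request(method, host, path="/", version="HTTP/1.1"):
    """Build the text of a body-less HTTP request message (illustrative helper)."""
    start_line = f"{method} {path} {version}"
    headers = [f"Host: {host}"]
    # Every line ends with CRLF; the empty line marks the end of the headers.
    return "\r\n".join([start_line, *headers, "", ""])

request = build_request("GET", "www.cs.ndsu.nodak.edu", "/index.html")
# Show the CRLF terminators explicitly when printing.
print(request.replace("\r\n", "<CRLF>\n"))
```

The resulting string could be written to a TCP socket connected to port 80 of the host; the networking part is left out here to keep the sketch self-contained.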
Server reply • Server replies with a Response Message • Its START_LINE contains the version of HTTP being used, a three-digit code indicating whether or not the request was successful, and a short text giving the reason for that code
Codes • 1xx – Informational (Request received, continuing process) • 2xx – Success (Action successfully received, understood, and accepted) • 3xx – Redirection (further action must be taken to complete the request) • 4xx – Client Error (request contains bad syntax or cannot be fulfilled) • 5xx – Server Error (server failed to fulfill an apparently valid request)
Example Replies • HTTP/1.1 200 OK • Request succeeded; the server returns the page, which the browser displays • HTTP/1.1 404 Not Found • The usual not found error • HTTP/1.1 301 Moved Permanently • The page has moved; the reply includes a MESSAGE_HEADER (Location), as in a request, giving the page's new address
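A reply's first line can be taken apart and matched against the code classes from the previous slide. The sketch below assumes a well-formed status line; the names `STATUS_CLASSES` and `parse_status_line` are invented for this example.

```python
# First digit of the status code -> class name, as listed on the "Codes" slide.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def parse_status_line(line):
    """Split 'HTTP/1.1 404 Not Found' into (version, code, reason, class)."""
    # Split on the first two spaces only, since the reason may contain spaces.
    version, code, reason = line.strip().split(" ", 2)
    code = int(code)
    return version, code, reason, STATUS_CLASSES.get(code // 100, "Unknown")

print(parse_status_line("HTTP/1.1 404 Not Found"))
print(parse_status_line("HTTP/1.1 301 Moved Permanently"))
```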
HTTP extras • Version 1.0 opened a separate TCP connection for each request; version 1.1 allows persistent connections • HTTP was designed with web caching in mind: a client or cache can check the date a page was last modified and keep the newest versions of frequently accessed pages on a local machine
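The caching check can be sketched as a date comparison. This assumes the cache has stored the date of its copy and has obtained the server's last-modified date (for example with a HEAD request) in the standard HTTP date format; the function name and the sample dates are hypothetical.

```python
from email.utils import parsedate_to_datetime

def cache_is_current(cached_last_modified, server_last_modified):
    """True if the cached copy is at least as new as the server's copy."""
    # HTTP dates look like 'Mon, 28 Apr 2003 10:00:00 GMT'.
    cached = parsedate_to_datetime(cached_last_modified)
    server = parsedate_to_datetime(server_last_modified)
    return cached >= server

# Hypothetical dates: the page changed on the server after it was cached.
print(cache_is_current("Sun, 27 Apr 2003 09:00:00 GMT",
                       "Mon, 28 Apr 2003 10:00:00 GMT"))
```

When the function returns False, the cache would fetch the page again; otherwise it can serve the local copy.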
Is the web intelligent? • Intelligence is a poorly defined word anyway. For example, would you consider these intelligent? • Document analysis systems for cataloging and summarizing Web pages • Profiling systems for placing selective Web advertising • Data mining and analysis • Tools for searching databases supported by Web browsers • Translation tools that convert to and from human languages • Statistical software for network caching, routing, and tracking • Knowledge-based systems for automated e-mail reading • Smart agents for Internet-based product and service marketing • Video object recognition and searching
Is the web intelligent? (2) • One of the most important advances in making the web intelligent is the use of agents • These agents take many forms, including many of those listed on the previous slide
What is an agent? • No standard definition • Can be: • Web Crawler • Travel Agent • Secretary • Hard to distinguish between agent and program. Agent normally performs actions based on data it finds, without much human intervention • Agents can be defined as intelligent as well • Act as the glue for many of the following ideas
The Present of Web Intelligence - Google • Presently the most used search engine on the Internet • Provides a unique blend of computer hardware and software to complete millions of user searches each day • Based on a system called PageRank
PageRank • Developed by Larry Page and Sergey Brin (Google’s founders) at Stanford University • Uses a system of link ranking • If there is a link from page A to page B, the link counts as a vote for page B • If page A is a strong page to begin with, its link makes page B stronger as well
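The link-ranking idea can be sketched as a simple iterative computation. This is a textbook-style simplification, not Google's production system: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page toy web are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small base rank regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Toy web: A -> B, B -> C, C -> A, D -> B (nobody links to D).
toy_web = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["B"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, B receives links from both A and D and ends up with the highest rank, while D, which nothing links to, ends up with the lowest.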
Word Association • On top of PageRank, there is also a system of word matching. • Word counts (Do the words exist on the page?) • Proximity checks (Are the words close together?)
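The two checks above can be illustrated with a few lines of text processing. This is only a sketch of the idea, not how Google implements it; the tokenization (lowercasing, splitting on whitespace, stripping basic punctuation) and the function names are assumptions.

```python
def word_positions(text, word):
    """Positions (word indexes) at which `word` occurs in `text`."""
    tokens = text.lower().split()
    return [i for i, tok in enumerate(tokens)
            if tok.strip(".,!?;:") == word.lower()]

def min_distance(text, word_a, word_b):
    """Smallest gap in words between the two words, or None if one is absent."""
    pos_a = word_positions(text, word_a)
    pos_b = word_positions(text, word_b)
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

page = "The semantic web enriches the current web with meaning."
print(len(word_positions(page, "web")))       # word count check
print(min_distance(page, "semantic", "web"))  # proximity check
```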
Can’t you cheat PageRank? • People try every day! • Higher search ranking == More exposure • Link Farms • Sites that simply host millions of links to a web page in the hope that the target will move higher on the list. • Google’s answer: Page importance. Once link farms are discovered, they are given a negative rank, so if you have a page on a link farm, its rank will go down as well
Another way to cheat • Put lots of words related to your page in your page (even if they are not visible) • Google’s answer: PageRank is primary, cheaters are given lower priority
Moral Decisions • Wired article • Computer screen shows location, query pairs for random searches on Google’s engines. • One search during the late hours on the West Coast was “How to stop a friend from committing suicide” • Can’t do much about it but make sure they get the right information the next time
The Future of Web Intelligence • The Semantic Web
What is the Semantic Web? • As the web presently stands, it is complete nonsense to most software applications. • Two completely different statements • The ball is round • The round ball • The semantic web is a series of protocols meant to enrich the current web with meaning
Series of Protocols • RDF – Resource Description Framework • OWL – Web Ontology Language (extension of RDF)
Resource Description Framework • From the World Wide Web Consortium webpage • RDF “defines a mechanism for describing resources that makes no assumptions about a particular application domain, nor defines (a priori) the semantics of any application domain. The definition of the mechanism should be domain neutral, yet the mechanism should be suitable for describing information about any domain”
RDF – Some examples • Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila. • Abstract, conceptual Framework • Concrete syntax using XML
Abstract example • Subject (Resource) • http://www.w3.org/Home/Lassila • Predicate (Property) • Creator • Object (literal) • "Ora Lassila" • Graphic
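The subject, predicate, and object above can be held in a small data structure. The sketch below uses a named tuple as one possible in-memory representation; the `Triple` type and the `describe` helper are illustrative and not part of RDF itself.

```python
from collections import namedtuple

# One RDF statement: (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

statement = Triple(
    subject="http://www.w3.org/Home/Lassila",  # the resource
    predicate="Creator",                       # the property
    object="Ora Lassila",                      # the literal value
)

def describe(triple):
    """Render a triple as an English sentence."""
    return (f"{triple.object} is the {triple.predicate.lower()} "
            f"of the resource {triple.subject}.")

print(describe(statement))
```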
Concrete syntax • Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila.
<rdf:RDF>
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
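To show that software can recover the statement from this syntax, the sketch below reads the XML with Python's standard `xml.etree.ElementTree` and returns (subject, predicate, object) tuples. The slide omits the namespace declarations an XML parser requires, so they are added here: the `rdf` namespace is the standard one, while the `s` schema namespace is a placeholder assumption. The function name `extract_triples` is also made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
S_NS = "http://description.org/schema/"  # placeholder schema namespace (assumption)

# The slide's example, with the namespace declarations added.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:s="{S_NS}">
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>"""

def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples from simple RDF/XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Accept the unqualified 'about' used on the slide as well as 'rdf:about'.
        subject = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # Keep only the local name of the property, dropping its namespace.
            predicate = prop.tag.split("}", 1)[-1]
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples

print(extract_triples(DOC))
```

This handles only literal-valued properties nested directly inside a Description element; a real application would use a dedicated RDF library.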
Web Ontology Language • What is an ontology? • “defines the terms used to describe and represent an area of knowledge” • OWL defines ontologies for use on the web • Actually an extension of RDF
Ontologies • Date and Time • Countries of the World • Wines • Space Shuttle Information
Some example OWL statements
<owl:Class rdf:ID="WineGrape">
  <rdfs:subClassOf rdf:resource="&food;Grape" />
</owl:Class>
<owl:Class rdf:ID="WhiteWine">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine" />
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasColor" />
      <owl:hasValue rdf:resource="#White" />
    </owl:Restriction>
  </owl:intersectionOf>
</owl:Class>
Conclusion • Web intelligence is a broad new field for exploration • Present efforts like Google can be improved upon with more semantic information • Any questions?