Search Engines and Agents in the Web of the Future (TREFpunkt 2006)
Anders Arpteg, Ph.D. in Computer Science
Pre-Google Ranking
• WebCrawler, AltaVista, Evreka
• TF/IDF (a small scoring sketch follows this list)
  • Term Frequency
  • Inverse Document Frequency
• Problems
  • Returns many irrelevant pages
  • Easy to cheat, to manipulate rankings
• Other techniques used
  • Lexical analysis
  • Stop words
  • Stemming
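A minimal sketch of TF/IDF scoring, assuming a toy in-memory collection (the documents and query below are made up for illustration, not part of the slides):

import math
from collections import Counter

# Toy corpus; in a real engine these documents would come from the crawler's index.
docs = {
    "d1": "google search engine ranking",
    "d2": "semantic web search agents",
    "d3": "pizza recipes and cooking",
}

def tf_idf_scores(query, docs):
    """Score each document for the query using term frequency * inverse document frequency."""
    n_docs = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    scores = {}
    for d, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.split():
            tf = counts[term] / len(words)                        # term frequency in this document
            df = sum(1 for w in tokenized.values() if term in w)  # how many documents contain the term
            idf = math.log((1 + n_docs) / (1 + df)) + 1           # smoothed inverse document frequency
            score += tf * idf
        scores[d] = score
    return scores

print(tf_idf_scores("search engine", docs))   # d1 scores highest for this query

Note the weakness the slide points out: a page can stuff itself with query terms to inflate its TF and manipulate the ranking.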
The PageRank Algorithm
• Links instead of terms
• Many IMPORTANT inbound links
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  PR(x) = PageRank value for page x
  d = damping factor (0.85 in the examples below, so 1 - d = 0.15)
  C(x) = number of outbound links from page x
  T1 ... Tn = the pages that link to A
PageRank Examples 1-4 (link-graph diagrams)
• A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
• A small iterative computation is sketched after these examples.
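A minimal sketch of the iterative computation, using a made-up four-page link graph (the pages and links are assumptions for illustration, not taken from the slides):

# Minimal PageRank iteration with d = 0.85, matching the
# "0.15 + 0.85 * share" rule on the example slides.
links = {                     # hypothetical link graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85
ranks = {page: 1.0 for page in links}    # initial guess

for _ in range(50):                      # iterate until the values stabilize
    new_ranks = {}
    for page in links:
        share = sum(
            ranks[src] / len(out)
            for src, out in links.items()
            if page in out               # every page that links to `page` contributes a share
        )
        new_ranks[page] = (1 - d) + d * share
    ranks = new_ranks

print(ranks)   # C ends up highest: it has the most (and best-ranked) inbound links

The point of the iteration is that a page's value depends on the value of the pages linking to it, which is exactly what makes the score hard to manipulate from inside your own page.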
Other ranking factors
• PageRank is not as important any more
• Targeted keyword techniques
  • Choose keywords carefully
  • Font-size identification
  • Keywords in title, headings, ...
  • Keywords in URL, preferably in the domain name
  • Keywords in link text
• Relevant in-bound links
  • Links from sites with related content
  • Links from sites with high PageRank
• Patience, time works in your favor
  • Sandbox effect
  • Trusted and old domains
• Clean code, valid HTML
  • Beware JavaScript links
  • Beware frames
  • Use Google sitemaps, but beware link farming
Google Summary
• Links represent popularity, and we want popular sites highly ranked
• PageRank is more difficult to cheat than TF/IDF
• Revolutionary architecture
  • High coverage
  • High performance
• PageRank is not the only factor
  • Keyword targeting
  • Clean design, valid code
• General rule
  • Google tries to simulate human behavior; keywords that are highlighted for humans are highly valued by Google.
  • Sites with good structure for humans have good structure for Google.
The Semantic Web
• Definition of the Semantic Web
  • "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001.
• Why the Semantic Web topic?
  • Connection to my research work
  • How will the Semantic Web influence you?
History of the Internet
• 1969 ARPANET (Internet)
• 1971 Email
• 1974 TCP introduced
• 1979 USENET
• 1984 DNS introduced
• 1989 First Web proposal
• 1991 WWW introduced
• 1994 Order pizza online
• 1994 WebCrawler
• 1995 Sun launches Java
• 1998 Google
• 1998 XML defined
• 1999 RDF defined
• 2004 Yahoo and MSN Search
• 2004 OWL defined
"We set up a telephone connection between us and the guys at SRI...," Kleinrock ... said in an interview: "We typed the L and we asked on the phone, 'Do you see the L?' 'Yes, we see the L,' came the response. We typed the O, and we asked, 'Do you see the O?' 'Yes, we see the O.' Then we typed the G, and the system crashed... Yet a revolution had begun."
Semantic Web Principles
• Everything can be identified by URIs
• Resources and links can have types
• Partial information is tolerated
• There is no need for absolute truth
• Evolution is supported
• Minimalistic design
  • Make simple things simple, and complex things possible!
Semantic Web Languages
• XML
  • Defines the data language
  • How to encode "words" into a string
• RDF
  • Defines resources and links
  • How "things" are related to each other
• OWL
  • Defines the ontology
  • What things "mean" and their constraints
Semantic Web Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:oiled="http://img.cs.man.ac.uk/oil/oiled#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.w3.org/2000/10/XMLSchema#">
  <daml:Ontology rdf:about="">
    <dc:title>Distribution Company</dc:title>
    <dc:date></dc:date>
    <dc:creator>Anders Arpteg</dc:creator>
    <dc:description></dc:description>
    <dc:subject></dc:subject>
    <daml:versionInfo></daml:versionInfo>
  </daml:Ontology>
  <daml:Class rdf:about="http://www.bugsoft.nu/aa/logics2003/company.daml#item">
    <rdfs:label>item</rdfs:label>
    <rdfs:comment><![CDATA[]]></rdfs:comment>
    <oiled:creationDate><![CDATA[2003-12-17T10:06:20Z]]></oiled:creationDate>
  </daml:Class>
</rdf:RDF>
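A minimal sketch of how a program could consume such a description, assuming the rdflib Python library and that the RDF/XML above is saved locally as company.daml (both the library choice and the file name are assumptions, not part of the slides):

from rdflib import Graph
from rdflib.namespace import RDFS, DC

g = Graph()
g.parse("company.daml", format="xml")   # hypothetical local copy of the RDF/XML above

# List every resource label the ontology declares.
for subject, label in g.subject_objects(RDFS.label):
    print(subject, "is labelled", label)

# Dublin Core metadata about the ontology itself.
for _, title in g.subject_objects(DC.title):
    print("Ontology title:", title)

This is the point of the standardized languages: a generic parser can read the triples without knowing anything in advance about distribution companies or items.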
Summary
• Problems with the current Web
  • Huge amount of information, needs knowledge management (KM)
  • Machines cannot understand the information
• Semantic Web technologies
  • Standardized languages
  • Minimalistic approach
• Good or bad?
  • Nothing really new, we can already do that
  • Amazing, think of all the new possibilities