COMS E6125 Web-enHanced Information Management (WHIM)

COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2008 Kaiser: COMS E6125

Today’s Topics
• Web Search – partially adapted from Alexandros Biliris (adjunct here)
• Semantic Web – partially adapted from York Sure (University of Karlsruhe)

Information Retrieval as a Field
• An “old” field that addresses issues related to
  • Classification and categorization of documents
  • Systems and languages for searching for words
  • User interfaces and visualization of results
• Field was previously seen as of narrow interest – mainly, library search
• The advent of the Web brings IR to the forefront
  • The Web became a huge “library” and everybody has free access to it (with no special training on “search”)
  • No central editorial board

IR: A World of Words
• Typical IR model: The dataset consists of documents, each of which is a bag (multiset) of words (terms)
• IR functionality: map words to documents
• Search for documents that contain
  • a given word
  • word1 AND word2
  • word1 AND word2 AND NOT word3
  • etc.

IR: A World of Words
• Detail 1: Stop Words
  • Certain words are considered irrelevant and not placed in the bag, e.g., “and”, “the”, …
• Detail 2: “Stemming” and other content analysis
  • Using language-specific rules, convert words to their basic form, e.g., “surfing”, “surfed” --> “surf”
  • Deal with synonyms, misspellings, abbreviations
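The two details above can be sketched in a few lines. This is a toy illustration only: the stop list, the suffix rules, and the names `STOP_WORDS`, `SUFFIXES`, `stem`, and `to_terms` are assumptions made for the example, and the suffix stripping is far cruder than a real language-specific stemmer.

```python
import re

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}
SUFFIXES = ("ing", "ed", "s")


def stem(word: str) -> str:
    """Strip one known suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


terms = to_terms("Surfing and surfed the web")
```

Here `terms` holds the bag of words that would be indexed for the sample sentence; both “surfing” and “surfed” collapse to the same term.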

Rankings
• Finding documents that are the most relevant to a user’s query is quite imprecise
• A ranking is an ordering of the documents retrieved that (hopefully) reflects their relevance to the user query

IR vs. DBMS
IR                        | DBMS
Imprecise semantics       | Precise semantics
Keyword search            | SQL
Text, unstructured data   | Structured data
No transactions           | Transactional semantics
Partial results (top k)   | Generate full answer
Relevance is built-in     | Relevance is built on top

Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)

Inverted indexes
• These lists can be used to solve Boolean queries:
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
• Also used for statistical ranking algorithms
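One way to picture the Boolean queries above is as set operations on the posting lists. The sketch below assumes each list is held as a Python set of document IDs; the names `index` and `docs_for` are illustrative, not part of any particular system.

```python
# Posting lists from the slide: each term maps to the documents containing it.
index = {
    "country": {"d1", "d2"},
    "manor": {"d2"},
}


def docs_for(term: str) -> set[str]:
    """Return the posting list for a term (empty if the term is unknown)."""
    return index.get(term, set())


# AND is set intersection, AND NOT is set difference.
country_and_manor = docs_for("country") & docs_for("manor")
country_not_manor = docs_for("country") - docs_for("manor")
```

Production systems typically keep posting lists sorted and merge them rather than building sets, but the logic is the same.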

How Inverted Files Are Created
• Periodically rebuilt, static otherwise
• Documents are parsed to extract tokens
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files Are Created
• After all documents have been parsed, the inverted file is sorted alphabetically

How Inverted Files Are Created
• Multiple term entries for a single document are merged
• Within-document term frequency information is compiled

How Inverted Files Are Created
• Finally, the file can be split into
  • A Dictionary or Lexicon file and
  • A Postings file
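The four construction steps (parse, sort, merge, split) can be sketched on the two example documents. The in-memory layout here is an assumption for illustration: `dictionary` maps each term to an offset and a document count, and `postings` is a flat list of (document ID, frequency) pairs. Real systems write these to disk in compressed form.

```python
import re
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

# Step 1: parse each document into (term, doc_id) pairs.
pairs = [
    (token, doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[a-z]+", text.lower())
]

# Step 2: sort alphabetically by term (ties broken by document ID).
pairs.sort()

# Step 3: merge repeated entries, compiling within-document frequencies.
frequencies = Counter(pairs)  # (term, doc_id) -> frequency

# Step 4: split into a dictionary and a postings list.
dictionary = {}  # term -> (offset into postings, number of documents)
postings = []    # flat list of (doc_id, frequency)
for (term, doc_id), freq in sorted(frequencies.items()):
    offset, count = dictionary.get(term, (len(postings), 0))
    dictionary[term] = (offset, count + 1)
    postings.append((doc_id, freq))


def lookup(term: str) -> list[tuple[int, int]]:
    """Return the (doc_id, frequency) postings for a term."""
    offset, count = dictionary.get(term, (0, 0))
    return postings[offset:offset + count]
```

For example, `lookup("to")` returns the single posting for Doc 1 with frequency 2, since “to” appears twice there and not at all in Doc 2.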

How Inverted Files Are Created
[Figure: the Dictionary/Lexicon file alongside the Postings file]

Search Engine Characteristics
• Unedited data – anyone can enter content
  • Quality issues, spam
• Varied information types
  • Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!

Search Engine Characteristics
• Different kinds of users
  • LexisNexis: Paid professional searchers
  • Online catalogs: Scholars searching scholarly literature
  • Web: Every type of person with every type of goal
• Scale
  • Hundreds of millions of searches/day
  • Billions of static documents
  • Tens of millions of Web servers

Directories vs. Search Engines
Directories                                               | Search Engines
Hand-selected sites                                       | All pages in all sites
Search over the contents of the descriptions of the pages | Search over the contents of the pages themselves
Organized in advance into categories                      | Organized in response to a query by relevance rankings or other scores

Inverted Indexes for Web Search Engines
• Inverted indexes are still used, even though the Web is so huge.
• Some systems partition the indexes across different machines; each machine handles different parts of the data.
• Other systems duplicate the data across many machines; queries are distributed among the machines.
• Most do a combination of these.

Ranking Strategies
• Details proprietary and changing
• Combining subsets of:
  • IR-style relevance: Based on term frequencies, proximities, position (e.g., in title), font, etc.
  • Popularity information: Frequently visited pages
  • Link analysis information: Which sites are linked to by other sites
• A variant of vector space ranking to combine these
  • Make a vector of weights for each feature
  • Multiply this by the counts for each feature
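The last two bullets amount to a dot product between a weight vector and a per-page vector of feature counts. The sketch below uses invented feature names, weights, and counts purely for illustration; as the slide notes, the features and weights real engines use are proprietary.

```python
# Hypothetical weights, one per feature (not taken from any real engine).
WEIGHTS = {"term_freq": 2, "in_title": 5, "visits": 1, "in_links": 3}


def score(features: dict[str, int]) -> int:
    """Dot product of the weight vector and a page's feature counts."""
    return sum(WEIGHTS[name] * features.get(name, 0) for name in WEIGHTS)


# Made-up feature counts for two candidate pages.
pages = {
    "a.html": {"term_freq": 4, "in_title": 1, "visits": 20, "in_links": 2},
    "b.html": {"term_freq": 6, "in_title": 0, "visits": 5, "in_links": 10},
}

# Rank pages from highest to lowest score.
ranking = sorted(pages, key=lambda page: score(pages[page]), reverse=True)
```

With these numbers `b.html` outranks `a.html`: its many in-links outweigh the other page's title match and visit count.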

Link Analysis for Ranking Pages
• Assumption: If the pages pointing to this page are good, then this is also a good page
• Draws upon earlier research in sociology and bibliometrics.
• Kleinberg’s model includes “authorities” (highly referenced pages) and “hubs” (pages containing good reference lists).
• Google model is a version with no hubs

Intuition
[Figure: nodes labeled A and H]
• Authority comes from in-edges.
• Being a good hub comes from out-edges.
• Better authority comes from in-edges from good hubs.
• Being a better hub comes from out-edges to good authorities.
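The mutual reinforcement described above can be sketched as a simple iteration in the spirit of Kleinberg's hubs-and-authorities model. The graph, the function name `hits`, the iteration count, and the normalization choice are assumptions for illustration, not a faithful reproduction of any deployed ranking algorithm.

```python
import math


def hits(links: dict[str, list[str]], iterations: int = 20):
    """links maps each page to the pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores over a page's in-edges.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of authority scores over a page's out-edges.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1 and h2 are link lists, a1 and a2 are cited pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

In this small graph `a1` ends up the strongest authority (two hubs point to it) and `h1` the strongest hub (it points to both authorities).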

Web Crawlers
• How do the Web search engines get all of the items they index?
• Main idea:
  • Start with known sites
  • Record information for these sites
  • Follow the links from each site
  • Record information found at new sites
  • Repeat

Web Crawling Algorithm
• Put a set of known sites on a queue
• Repeat the following until the queue is empty:
  • Take the first page off of the queue
  • If this page has not yet been processed:
    • Record the information found on this page
    • Add each link on the current page to the queue
    • Record that this page has been processed
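The queue-based algorithm above translates almost line for line into code. To keep the sketch self-contained, the “Web” is a hypothetical in-memory dictionary (`WEB`) rather than real HTTP fetches, and `crawl` and `fetch_links` are illustrative names.

```python
from collections import deque

# Hypothetical in-memory "Web": page -> outgoing links.
# A real crawler would fetch each page over HTTP and parse its links.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "a"],
    "c": ["d"],
    "d": [],
}


def crawl(seeds, fetch_links):
    """Visit pages breadth-first, processing each page at most once."""
    queue = deque(seeds)   # put the known sites on a queue
    processed = set()
    recorded = []
    while queue:           # repeat until the queue is empty
        page = queue.popleft()
        if page in processed:
            continue
        recorded.append(page)            # record information for this page
        queue.extend(fetch_links(page))  # add each link to the queue
        processed.add(page)              # mark the page as processed
    return recorded


visited = crawl(["a"], lambda page: WEB.get(page, []))
```

The `processed` set is what keeps the cycle between `a` and `b` from looping forever.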

Robot Exclusion Protocol
• Polite crawlers first attempt to download the file robots.txt
• Created by the Webmaster to indicate which part of the site is off-limits to crawlers
  User-agent: *
  Disallow: /
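Python's standard library includes a robots.txt parser, which makes for a compact illustration of how a polite crawler might check the rules. The crawler name `ExampleBot`, the `example.com` URLs, and the second rule set (`/private/`) are assumptions added for contrast; only the first rule set comes from the slide.

```python
from urllib.robotparser import RobotFileParser

# Rules from the slide: every crawler is barred from the whole site.
closed_site = RobotFileParser()
closed_site.parse(["User-agent: *", "Disallow: /"])

# Hypothetical variant: only one directory is off-limits.
partial_site = RobotFileParser()
partial_site.parse(["User-agent: *", "Disallow: /private/"])

home = "http://www.example.com/index.html"
secret = "http://www.example.com/private/notes.html"

closed_allows_home = closed_site.can_fetch("ExampleBot", home)
partial_allows_home = partial_site.can_fetch("ExampleBot", home)
partial_allows_secret = partial_site.can_fetch("ExampleBot", secret)
```

Note that the protocol is advisory: nothing stops an impolite crawler from ignoring the file.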

Robot Exclusion Protocol
• robots META tag
  <HTML>
  <HEAD>
  <META NAME="robots" CONTENT="noindex,nofollow">
  ...
  </HEAD>
  ...
  </HTML>
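A crawler honoring the META tag has to pull the directives out of the page head. The following sketch uses the standard-library HTML parser; the class name `RobotsMeta` and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Collect the directives of any <META NAME="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives |= {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


page = (
    '<HTML><HEAD><META NAME="robots" CONTENT="noindex,nofollow"></HEAD>'
    "<BODY>Hello</BODY></HTML>"
)
parser = RobotsMeta()
parser.feed(page)
may_index = "noindex" not in parser.directives
may_follow = "nofollow" not in parser.directives
```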

Web Crawling Issues/Challenges
• Politeness: robots “keep out” signs
• Freshness: Figure out which pages change often, and re-crawl these often
• Quantity (> 6B docs on > 60M Web servers)
• Quality: Duplicates, virtual hosts, etc.
  • Convert page contents with a hash function
  • Compare new pages to the hash table
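The duplicate check in the last two bullets can be sketched with a content hash and a dictionary acting as the hash table. The choice of SHA-256, the names `content_key`, `seen`, and `record`, and the sample URLs are assumptions; this version catches only byte-identical pages, not near-duplicates.

```python
import hashlib

seen = {}  # hash table: content digest -> first URL seen with that content


def content_key(page_text: str) -> str:
    """Convert page contents to a fixed-size digest."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def record(url: str, page_text: str):
    """Return the URL of an earlier identical page, or None if the page is new."""
    key = content_key(page_text)
    if key in seen:
        return seen[key]
    seen[key] = url
    return None


first = record("http://www.example.com/a", "<html>same</html>")
mirror = record("http://mirror.example.com/a", "<html>same</html>")
other = record("http://www.example.com/b", "<html>different</html>")
```

Here `mirror` comes back as the original URL, signaling that the second page need not be indexed again.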

Web Crawling Issues/Challenges
• Lots of other problems
  • Server unavailable; incorrect HTML; missing links; attempts to “fool” search engine by giving crawler a version of the page with lots of spurious terms added ...
• Web crawling is difficult to do robustly!

The Deep (Hidden) Web
• Pages that do not actually exist as such: they are created dynamically as a result of a request/query to a specific application that most likely uses a DBMS
• Content in the deep Web is massive
• For a Web page to be discovered by a crawler, it must be static and linked

Perspective on Crawlers/Engines
• Web content is getting more
  • Volatile
    • Frequent updates in content and/or location
    • New Web sites appear and existing ones disappear on a daily basis
  • Dynamic
    • Content produced by database-driven applications
• These are the same challenges faced by
  • Caching proxies
  • Content distribution networks

Today’s Topics
• Web Search – partially adapted from Alexandros Biliris (adjunct here)
• Semantic Web – partially adapted from York Sure (University of Karlsruhe)

Simplicity is Good
• The World Wide Web contains huge amounts of information created by many different organizations, communities and individuals for many different reasons
• Web users can easily access this information by specifying URI (Uniform Resource Identifier) addresses or using a search engine, and following links to find other related resources
• This simplicity is a key aspect that made the Web so popular

Simplicity is Bad
• The simplicity of the current Web has a price
• It is very easy to get lost, or discover irrelevant or unrelated information
• For instance, if we search for courses taught by a person named “Gail Kaiser”, we might find all kinds of other information
  • http://www.google.com/search?q=courses+taught+by+gail+kaiser&sourceid=navclient-ff&ie=UTF-8&rlz=1B3GGGL_enUS253US253
• The problem is that the search engine does not know what “courses” or “taught” means

Machine accessible meaning (What it’s like to be a machine)
[Figure: a Web page with parts labeled name, education, CV, work, private]

So what does this mean?
• What’s a “CV”?
• What’s a “name”?
• Etc.
• Need semantics

Semantic Web
“The Semantic Web is not a separate web but an extension of the current web, in which information is given well-defined meaning, better enabling computers and people to work in co-operation.” [Berners-Lee et al., 2001]

Semantic Web Layers (T. Berners-Lee)

Start with XML, not HTML
HTML:
  <H1>WHIM</H1>
  <UL>
    <LI>Instructor: Gail Kaiser
    <LI>Students: George Bush
  </UL>
XML:
  <course>
    <title>WHIM</title>
    <instructor>Gail Kaiser</instructor>
    <students>George Bush</students>
  </course>
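The practical difference shows up when a program tries to pull a field out of the page. With the XML version, a standard parser can address the instructor by element name, as in this minimal sketch using Python's built-in ElementTree; with the HTML version, a program would have to guess from the text “Instructor:” inside a list item.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the slide.
document = (
    "<course>"
    "<title>WHIM</title>"
    "<instructor>Gail Kaiser</instructor>"
    "<students>George Bush</students>"
    "</course>"
)

course = ET.fromstring(document)
title = course.findtext("title")
instructor = course.findtext("instructor")
```

The parser finds the element named `instructor`, but it still has no idea what an instructor is.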

Why Not Use XML Tags For Semantics?
<title> … </title>
• But what does “title” mean?
• If we ask Google, we get (on the 1st page)
  • title element of an HTML document
  • a prefix or suffix added to a person's name
  • a company that sells boxing gear
  • a gym club for boxers

XML Limitations for Semantic Markup
• XML makes no commitment on:
  (1) Domain-specific vocabulary
  (2) Modeling primitives
• Requires pre-arranged agreement on (1) & (2)

XML Limitations for Semantic Markup
• Only feasible for closed collaboration
  • agents in a small & stable community
  • pages on a small & stable intranet
• Not suited for sharing Web resources

XML machine accessible meaning
[Figure: the same page with its parts wrapped in XML tags <name>, <education>, <CV>, <work>, <private>]

Semantic Web Layers

RDF for Semantic Annotation
• RDF (Resource Description Framework) provides metadata about Web resources
• Triples with Subject (or Resource) / Predicate (or Property) / Object (or Value)
• XML syntax
• Chained triples form a graph
[Figure: RDF graph with nodes http://www.aifb.uni-karlsruhe.de/WBS/ysu, York, 6086592, W3C, and http://www.w3.org/RDF; edges labeled site-owner, tel, and explains]
  <rdf:Description rdf:about="#York">
    <tel>6086592</tel>
  </rdf:Description>
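A set of triples is easy to model directly, which makes the “chained triples form a graph” point concrete. In the sketch below, the `tel` triple comes from the XML snippet on the slide; the `site-owner` triple is an assumed reading of the figure, and the helper name `objects` is invented for the example.

```python
PAGE = "http://www.aifb.uni-karlsruhe.de/WBS/ysu"

# Each triple is (subject, predicate, object).
triples = {
    (PAGE, "site-owner", "#York"),  # assumed reading of the slide's figure
    ("#York", "tel", "6086592"),    # from the rdf:Description snippet
}


def objects(subject: str, predicate: str) -> set[str]:
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


# Follow a chain of two triples: page -> site-owner -> tel.
owners = objects(PAGE, "site-owner")
owner_phones = {phone for owner in owners for phone in objects(owner, "tel")}
```

The object of one triple serves as the subject of the next, which is all that “chaining” means here.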

RDF Schema
• Defines vocabulary for RDF
• Organizes this vocabulary in a typed hierarchy
  • Class, subClassOf, type
  • Property, subPropertyOf
  • domain, range
[Figure: schema graph with classes Person, PhDStud, and Professor (subClassOf edges), property hasSuperVisor (domain and range edges), and instances York and Rudi (type edges, linked by hasSuperVisor)]

RDF Schema Syntax in XML
  <rdf:Description ID="MotorVehicle">
    <rdf:type resource="http://www.w3.org/...#Class"/>
    <rdfs:subClassOf rdf:resource="http://www.w3.org/...#Resource"/>
  </rdf:Description>
  <rdf:Description ID="Truck">
    <rdf:type resource="http://www.w3.org/...#Class"/>
    <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
  </rdf:Description>
  <rdf:Description ID="registeredTo">
    <rdf:type resource="http://www.w3.org/...#Property"/>
    <rdfs:domain rdf:resource="#MotorVehicle"/>
    <rdfs:range rdf:resource="#Person"/>
  </rdf:Description>
  <rdf:Description ID="ownedBy">
    <rdf:type resource="http://www.w3.org/...#Property"/>
    <rdfs:subPropertyOf rdf:resource="#registeredTo"/>
  </rdf:Description>
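Stripped of its XML syntax, the schema above is just another set of triples. The sketch below restates it that way (namespaces dropped for brevity) and shows one way an application might walk the hierarchy. The helper `ancestors` is an illustrative assumption; the schema itself is only data, and any such reasoning is left to the program that reads it.

```python
# The schema from the slide as (subject, predicate, object) triples.
schema = [
    ("MotorVehicle", "subClassOf", "Resource"),
    ("Truck", "subClassOf", "MotorVehicle"),
    ("registeredTo", "domain", "MotorVehicle"),
    ("registeredTo", "range", "Person"),
    ("ownedBy", "subPropertyOf", "registeredTo"),
]


def ancestors(name: str, relation: str) -> set[str]:
    """Everything reachable from name by following relation edges."""
    found = set()
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in schema:
            if subject == current and predicate == relation and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


truck_classes = ancestors("Truck", "subClassOf")
owned_by_parents = ancestors("ownedBy", "subPropertyOf")
```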

Higher-order Statements
• One can make RDF statements about other RDF statements
  • Example: “Cinderella believes that the web contains one billion documents”
• Allow us to express beliefs (and other modalities)
• Important for trust models, digital signatures, etc.
• Constitute metadata about metadata
• Represented by modeling RDF in RDF itself

Reification
• Reification allows a computer to process an abstraction as if it were any other datum
• RDF is not really second-order
• But it does provide a built-in predicate vocabulary for reification
• The dotted box corresponds to the following statements
  • { x, rdf:predicate, “dc:creator” }
  • { x, rdf:subject, “http://www.w3.org/TR/REC-rdf-syntax” }
  • { x, rdf:object, “Eric Miller” }
  • { x, rdf:type, “rdf:Statement” }
[Figure: the statement http://www.w3.org/TR/REC-rdf-syntax dc:Creator “Eric Miller” in a dotted box, with a second dc:Creator edge to “Library of Congress”]
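The four statements above follow a fixed pattern, so they can be generated mechanically from any triple. In this sketch the function name `reify` is invented, and the final triple (the Library of Congress as creator of statement `x`) is an assumed reading of the figure rather than something the slide spells out.

```python
def reify(statement_id: str, subject: str, predicate: str, obj: str):
    """Describe one triple as four triples about a statement node."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


quad = reify(
    "x",
    "http://www.w3.org/TR/REC-rdf-syntax",
    "dc:creator",
    "Eric Miller",
)

# Once the statement has an identifier, other triples can talk about it.
# Assumed reading of the figure: the Library of Congress made statement x.
statement_about_statement = ("x", "dc:creator", "Library of Congress")
```

The original triple becomes ordinary data that further triples can refer to, which is how metadata about metadata is expressed.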

Reification
• Any statement can be an object (graphs can be nested)
[Figure: NYT claims the statement pers05 Author-of ISBN...]
  <rdf:Description rdf:about="#NYT">
    <claims>
      <rdf:Description rdf:about="#pers05">
        <authorOf>ISBN...</authorOf>
      </rdf:Description>
    </claims>
  </rdf:Description>

Conclusions about RDF
• Next step up from plain XML
  • modeling primitives
  • possible to define vocabulary
• However:
  • no precisely described meaning
  • no inference model

Where do we get the precisely defined meaning?
• Two databases may use different identifiers for the same concept, such as zip code
• A program that wants to compare or combine information across the two databases has to know that these two terms mean the same thing
• The program must have a way to discover such common meanings for whatever databases it encounters
• A solution to this problem is provided by collections of information called ontologies
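The zip code problem can be made concrete with a very small stand-in for an ontology: a shared table saying which local identifiers denote the same concept. The field names (`zip`, `zipcode`, `postal_code`), the table `SAME_CONCEPT`, and the sample records are all hypothetical; a real ontology would also carry classes, relations, and rules, which this sketch omits.

```python
# Hypothetical mapping from each database's identifier to a shared concept.
SAME_CONCEPT = {
    "zip": "postal_code",      # identifier used by database A
    "zipcode": "postal_code",  # identifier used by database B
}


def normalize(record: dict) -> dict:
    """Rename fields to their shared concept names where one is known."""
    return {SAME_CONCEPT.get(field, field): value for field, value in record.items()}


record_a = {"name": "Columbia University", "zip": "10027"}
record_b = {"name": "Columbia University", "zipcode": "10027"}

# After normalization the two records can be compared field by field.
records_match = normalize(record_a) == normalize(record_b)
```

The point of publishing such mappings in a shared, machine-readable form is that a program can discover them for databases it has never seen before, rather than having them hard-coded as they are here.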
