To the Internet and Beyond: Database Challenges for New/Advanced Applications

To the Internet and Beyond: Database Challenges for New/Advanced Applications May 21, 2001 Propel Confidential

Agenda • The story of Infoseek • Why Propel? • The problems that arise for us

Scoring Framework • Classification Problem • Separate relevant from non-relevant documents • Bayes’ Decision rule: Relevant if P(x(d)|R)P(R) > P(x(d)|~R)P(~R), where x(d) is the observed representation of d • Independence assumption leads to S(d) = Σ_t log [p(t)(1-q(t)) / ((1-p(t))q(t))], where p(t) = P(t|R) and q(t) = P(t|~R)
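
As a rough illustration of the scoring rule on this slide, the sketch below sums the log-odds weight of each query term that occurs in a document. It assumes the sum runs over query terms present in the document (the usual reading of this kind of model); the function name `bim_score` and the probability values are hypothetical and chosen only for the example.

```python
import math

def bim_score(doc_terms, query_terms, p, q):
    """Sum log-odds term weights over query terms found in the document.

    p[t] stands for P(t|R) and q[t] for P(t|~R); both are assumed to lie
    strictly between 0 and 1 so the logarithm is defined.
    """
    score = 0.0
    for t in set(query_terms) & set(doc_terms):
        score += math.log(p[t] * (1 - q[t]) / ((1 - p[t]) * q[t]))
    return score

# Hypothetical probabilities, for illustration only.
p = {"barney": 0.8, "dinosaur": 0.6}
q = {"barney": 0.1, "dinosaur": 0.2}

doc_a = ["barney", "the", "purple", "dinosaur"]
doc_b = ["dinosaur", "fossils"]
score_a = bim_score(doc_a, ["barney", "dinosaur"], p, q)
score_b = bim_score(doc_b, ["barney", "dinosaur"], p, q)
```

With these made-up numbers, the document containing both query terms scores higher than the one containing only "dinosaur", because each matching term adds a positive weight when p(t) > q(t).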

The original Infoseek vision • Stolen from Bill Gates… “Information at your fingertips” • To find any piece of information on any computer in the world within 1 second

How we got started • Finding information was too expensive and too hard • Our field of dreams • “If you provide useful information at bargain prices, they will come” • In January 1995 we launched Infoseek • Register with a credit card • First month free • 10 cents a transaction

What happened... • “I thought you said it was FREE to try it?” • “You’ve got to be kidding!” • “I already pay $10 a month for my access!” • “I can’t afford it.” • “Go to …” • “Why should I pay when the information is available free elsewhere on the net?” • “I don’t like to be nickeled and dimed.”

Even more advice... • “You should only charge me per query” • “You should only charge per document” • “I’ll only sign up for a flat fee” • “I refuse to pay a flat fee” • “I don’t have a credit card” • “Your legal agreement is too long”

What we did • Dropped the credit card registration for a free trial • Made it very clear you can’t get most of this stuff for free anywhere on the net • Made the pricing easier to understand • Advertised it on our free Net Search

“So…. How would you like to provide a free Net Search?” • First reactions • “Are you joking?” • “How would we make money? By making it up in volume?” • Strategy • It would be free advertising for Pro • Limit the search results to 100 hits • Want more? Refer to Infoseek Pro

Infoseek Guide • 25M hits/day (200 queries/sec at times) • #1 search engine on the Net • 1,000 signups/day for Infoseek Pro • Discovered advertising sponsorship • 1.5 cents per query • Discovered TV math • we make more money giving away information than selling it

Four years later…

How to find Barney pages suitable for your kids • +Barney +dinosaur -bash -kill -maim -destroy -hate
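
A minimal sketch of how a query like this one might be interpreted, assuming the common convention that "+" marks a term a page must contain and "-" marks a term it must not contain. The slides do not spell out Infoseek's exact semantics, so `parse_query`, `matches`, and the sample pages are hypothetical.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional

def matches(page_text, required, excluded):
    """True if the page has every required word and no excluded word."""
    words = set(page_text.lower().split())
    return required <= words and not (excluded & words)

required, excluded, optional = parse_query(
    "+Barney +dinosaur -bash -kill -maim -destroy -hate"
)

# Made-up page texts, for illustration only.
pages = [
    "Barney the purple dinosaur sings songs",
    "Why I hate Barney the dinosaur",
    "Dinosaur fossils for kids",
]
suitable = [page for page in pages if matches(page, required, excluded)]
```

Here only the first page survives: the second contains an excluded word and the third lacks a required one.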

What people ask about (and why)

Unofficial SIGMOD survey question How many people here search the web for “adult sites”?

Top 15 queries on the WWW • sex • Playboy • Penthouse • chat • Hustler • nude • porn • erotica • games • pornography • porno • adult • ESPN • pussy • Pamela Anderson • * I am not making this up! This list is real!

What does that mean? • “Uhh… I was just testing!”

Unofficial SIGMOD trivia question • Q: What famous IR researcher asked in 1995 “Is this because of the Communications Decency Act (CDA)?” • A: Bruce Croft

Why it happens (possible explanations) • Research on CDA • Curious what others are looking at • Many new technologies are driven by sex: • VCR • Hotel movies on demand • People are naturally horny

What it means • Human race in no danger of extinction • Corporate libraries doing a great job in technical areas • Traditional sex education inadequate • Some of you are not telling the truth • Audience surveys are not always accurate • Bill Gates should admit to Congress that Pamela Anderson is more important than he is • If you didn’t raise your hand, you may need professional help!

The secret Infoseek backup bizplan • Selling our list of porn sites

We never pursued it… … But other companies did! • Sinfoseek • Infoseak • Nymfoseek • Infopeek • ...

Relevance ranking Web sites

Facts about Queries • Most queries are short • Average length approx. 2.2 • 10% use query syntax (usually incorrectly) • 1% use advanced search • Noun phrases only • Precision more important than recall • Users expect precision in top results

Relevance ranking objectives Must use several techniques to determine “relevance”: • Page has query term(s) • Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ... • Page quality • Page/site popularity • Spam reduction/elimination • Porn reduction

Relevance ranking factors • Query terms: tf*idf • Usage: Hyperlink text, thesaurus • Quality: site quality, dates, depth, … • Popularity: External link count, proxy stats • Spam: unusual word/phrase statistics (tf limiting) • Porn: site exclusion list, naughty phrase list
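
To make two of these factors concrete, here is a small sketch of tf*idf scoring with a cap on term frequency, which is one simple way to read "tf limiting" as a defense against keyword stuffing. The logarithmic idf variant, the cap value, the function name `tf_idf_score`, and the document frequencies are all assumptions for illustration, not the formula Infoseek used.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs, tf_cap=2):
    """Score a document as the sum of capped-tf * idf over the query terms."""
    counts = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        tf = min(counts[t], tf_cap)  # tf limiting: repeats beyond the cap add nothing
        if tf == 0 or doc_freq.get(t, 0) == 0:
            continue
        idf = math.log(num_docs / doc_freq[t])  # rarer terms weigh more
        score += tf * idf
    return score

# Hypothetical collection statistics.
doc_freq = {"java": 100, "coffee": 10}
num_docs = 1000

honest = ["java", "coffee", "beans"]
stuffed = ["java"] * 50  # a page that repeats one word fifty times

honest_score = tf_idf_score(["java", "coffee"], honest, doc_freq, num_docs)
stuffed_score = tf_idf_score(["java", "coffee"], stuffed, doc_freq, num_docs)
uncapped_score = tf_idf_score(["java", "coffee"], stuffed, doc_freq, num_docs, tf_cap=50)
```

With the cap in place the stuffed page scores below the honest one; without it (`tf_cap=50`) the stuffed page wins easily, which is the behavior the cap is meant to suppress.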

Relative weighting of these factors is tricky and subjective Should “evil empire” return Microsoft as the top hit?

Living in a world of an infinite number of documents

The problem (user view) • Too hard to find things even though only 100M documents indexed • Often precision and relevance, NOT recall • “intel” in the title search gives over 200 hits just like this: Index of /CPAN-local/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/ • Query ambiguity, e.g., “baby Bells”

The problem (vendor view) • Speed • Size • Cost • Freshness • Load on the Internet/bandwidth (both sides) • Quality (Spam/porn) • Will people be able to find what they are looking for as the net grows?

Today’s approach sucks • Suck all content into a centralized search engine [Diagram labels: Infoseek; All the world’s content]

Is there a better way? • We might start by asking the question: “How do people find information today?”

Centralized searching techniques are rarely used in real life... • Ask God (and pray for an answer) • Ask DIALOG …and pray... • WWW search (new!)

What people DO use is decentralized searching [Diagram labels: Question; Source 1, Source 2, ..., Source N; Answers and more sources]

Human distributed searching attributes • Faster than a computer!!! • Complete • Accurate • Can be used to validate an answer • Will always find an answer (eventually) • No specialized hardware • All humans have the same CPU speed/RAM

So can’t we design a computer distributed search network that is as fast and accurate and complete as our human distributed search network?

Our goal • Don’t necessarily mimic the process, but adapt the process to the medium

One approach • User types query • System searches databases of popular pages as well as meta descriptions of other databases • Repeat until all websites have been searched NOTE: This is the fastest way to search an infinite amount of data
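
The loop described on this slide might look roughly like the sketch below. It assumes each database holds its own pages plus keyword-style meta descriptions of other databases it knows about, and that the search follows only descriptions that share a term with the query, which is one possible reading of "repeat". The `DATABASES` structure, the database names, and the example URLs are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "databases": each has its own pages and
# meta descriptions (keyword sets) of other databases it knows about.
DATABASES = {
    "popular": {
        "pages": {"http://a.example/": "intel processor news"},
        "knows": {"chips": {"intel", "processor"}, "pets": {"dog", "cat"}},
    },
    "chips": {
        "pages": {"http://b.example/": "intel x86 datasheets"},
        "knows": {"archive": {"intel", "history"}},
    },
    "pets": {"pages": {"http://c.example/": "cat food reviews"}, "knows": {}},
    "archive": {"pages": {"http://d.example/": "intel history 1971"}, "knows": {}},
}

def distributed_search(query, start="popular"):
    """Search the starting database, then follow matching meta descriptions."""
    terms = set(query.lower().split())
    hits, visited = [], set()
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name in visited:
            continue
        visited.add(name)
        db = DATABASES[name]
        # Search this database's own pages.
        for url, text in db["pages"].items():
            if terms & set(text.split()):
                hits.append(url)
        # Queue other databases whose description mentions a query term.
        for other, description in db["knows"].items():
            if other not in visited and terms & description:
                queue.append(other)
    return hits, visited

hits, visited = distributed_search("intel")
```

In this toy setup a query for "intel" reaches the "chips" and "archive" databases through their descriptions and never contacts "pets", so the work done depends on how many sources look relevant rather than on how many exist.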

What we learned • Relatively weak engines with no proximity got a wide following: people couldn’t see through the hype • Bigger was better

People lie • We have “concept searching” • We’re growing faster than the net • We’ve indexed 95% of the net • We have more URLs than anyone else

What we learned • Competing for the Internet customer is not always a case of who really has: • the best engine • the highest quality content or the most content • the best price, best GUI, or the best product • It’s more about: • brand name • convincing the customer you are the best

What we learned • Ads don’t sell themselves • If you do 1M ads per day, you’re cookin’ • Lots of competition • Switching costs are low • User behavior can be tracked • Seemingly identical pages can have dramatically different click through

Mistakes we made • Not pressing for branding • Slow to recognize ad model • No ! at the end of our name

Ultra required new thinking • The traditional IDF formula breaks down for 1 billion documents • Existing data structures would never work • “Managing Gigabytes” didn’t go far enough • Inktomi approach was too inefficient • No sacred cows

Ultra: Designed for speed • Speed/space tradeoff • Architected from the ground up for 1 billion docs and 1,000 queries/sec • Everything is done in parallel and multi-threaded • Limited disk I/O • Small in-RAM tables • Stable connections

Ultra size • Parallel worms (multi-process, multi-threaded) • Proprietary database required; OODBs too slow • Change frequency monitored • >50M URLs

Feature set • Natural language queries • +, -, “phrases” • Fields (link, url, site, title) • Case sensitivity • Stemming • Approximate matching for phrases • Gets faster the longer the query • 8 shortest term lists • Space invariant, i.e., CD-ROM = CDROM
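
One way to read "Gets faster the longer the query • 8 shortest term lists" is that the engine intersects only the shortest posting lists, so each extra rare term shrinks the candidate set early. The slide does not describe the mechanism, so the sketch below is an assumption: the `search` function, the `max_lists` parameter, and the toy index are hypothetical.

```python
def search(query_terms, index, max_lists=8):
    """Intersect posting lists, shortest first, using at most max_lists lists."""
    lists = [index.get(t, []) for t in query_terms]
    lists.sort(key=len)        # rarest terms first
    lists = lists[:max_lists]  # ignore the longest lists beyond the limit
    if not lists:
        return []
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
        if not result:         # nothing left to match; stop early
            break
    return sorted(result)

# Toy inverted index: term -> list of document ids.
index = {
    "the": [1, 2, 3, 4, 5, 6],
    "purple": [2, 4, 6],
    "dinosaur": [2, 6],
}
results = search(["the", "purple", "dinosaur"], index)
```

Starting from the shortest list bounds the candidate set by the rarest term. Note that under this reading, documents are not checked against terms whose lists were dropped by the limit; a real engine would presumably account for those terms elsewhere, for example in ranking.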
