Midnight in the Garden of Good and Evil Search Engines. Presentation by Richard Wiggins Technical Advisor, NEM Online, Michigan State University www.msu.edu/staff/rww wiggins@msu.edu Columnist, “Internet Buzz,” webreference.com www.webreference.com/outlook wiggins@internet.com
Midnight in the Garden of Good and Evil Search Engines • Presentation by Richard Wiggins • Technical Advisor, NEM Online, Michigan State University • www.msu.edu/staff/rww • wiggins@msu.edu • Columnist, “Internet Buzz,” webreference.com • www.webreference.com/outlook • wiggins@internet.com • Co-host, Nothing But Net television program (produced by Media One)
A Parable: The Encounter Between the USS Nimitz and a Canadian Vessel...
A Frequency Analysis of the Appearance of a Critical Search Term Among Major Search Engines...
Frequency of the Search Term “Slavko” Among Major Search Indexes • AltaVista 5477 • Excite 1160 • Infoseek 1452 • Hotbot 4226
Come Join Our Tour of SearchVannah ...a place millions want to visit... …where a cast of characters stands ready to help you find exactly what you’re looking for...
SearchVannah’s Tour Guides • …a relatively new town • …only existed since 1993 • With so many visitors, lots of tour guides have set up shop • They tend to have funny names • They compete fiercely • They’re all trying to make money helping visitors find their way
The Tour Guides • AltaVista • Fast, lots of memory, knows a lot • But people complain sometimes results are inconsistent • InfoSeek • Claims answers are more relevant • MetaCrawler • Doesn’t know anything at all! Just asks the other tour guides!
HotBot • This tour guide wears the ugliest clothes!
The Tour Guides... • Inktomi: other tour guides hire Inktomi to answer their questions • One guide knows a LOT less than all the others… • But it’s the most popular by far! • The smarter tour guides think of it as just a dumb Yahoo… • But maybe tourists want to know where the B&B is, not a list of all the towels and dishes
Definitions • Crawler: automated tool to discover new and changed pages, feeds data to… • Indexer: builds and maintains an index, concordance-style • Search engine: the actual tool end-users employ when searching • …but in popular usage, all together = “search engine”
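A minimal sketch of how the three pieces in these definitions fit together, assuming a tiny in-memory stand-in for the Web. The page data, URLs, and function names are hypothetical, and the implicit-AND matching rule is an assumption, not something the slide specifies.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the Web: URL -> page text.
PAGES = {
    "http://example.edu/a": "Application for graduation and overseas study",
    "http://example.edu/b": "Climate change and overseas research",
}

def crawl(pages):
    """Crawler stand-in: hands (url, text) pairs to the indexer."""
    for url, text in pages.items():
        yield url, text

def build_index(docs):
    """Indexer: concordance-style map from each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Search engine: URLs that contain every word of the query (assumed AND)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(crawl(PAGES))
```

Here `search(INDEX, "overseas")` finds both pages, while `search(INDEX, "overseas study")` narrows the result to the first one.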
Leveraging 30 Years of Information Retrieval (IR) • Most new ideas we see in Web engines were thought of long ago... • Stemming • Controlled vocabulary • Text analytics • Knowledge Bases • Personalization (by observing user usage patterns) • Natural language
How Do People Search? SearchVannah “Honestly, tourists are the dumbest people” -- anonymous Tour Guide
What Do People Search For? • Major search services say people look for... • Sex sites • One’s own name • Friends, colleagues’ Web sites (also by name) • Items in the news • Company / product information • Etc.
One user view of search.msu.edu: Academics • application for graduation • overseas study • ordering catalog • School of Music • Computer Science • human ecology department • psychology 101
Another user view of search.msu.edu: Virtual Library • DNA sequencing • climate change • beam theory • feline brain tumor • PRL and sequencing
Another user view of search.msu.edu: Extension • livestock pavilion • wildlife fisheries • bathtub removal and installation • Round Bale Storage
Another user view of search.msu.edu: Conversational • I would like to know if you offer a workshop on “International Law”
What Do People Search For? Matt Koll’s Formulation • “finding a needle in a haystack” • a known needle in a known haystack • a known needle in an unknown haystack • to any needle in a haystack • Where are the haystacks? • GenX rendition: Needles? Haystacks? Whatever!
Typical User Search Strategy • Type in a one-word search term • Maybe two words • Seldom exploit advanced options • Capitalization • Quoting phrases (e.g. “climate change”) • Date restrictions • Host:, URL: parameters • Seldom use iterative refinement
Users Make “Wrong” Choices • Picking the right database is confusing • Reference librarians, experienced users learn brand names • Inexperienced users do not • Lycos example: “Small” versus “Large” catalog • “Small” catalog was faster, more precise • Virtually no one used it, thinking “Large” meant “better”
A Route 128 Story • Engineering firm on Route 128 • Engineers new products • Has constant need for specialized information • Uses traditional sources, and the Web • “Joe down the hall” does the Internet searches • Joe is a reference librarian with an engineering degree (and no training in online searching!)
Prospects for Training are Dismal! • We don’t know the users, so we can’t hope to train them • Users won’t read documentation or help notes • If engine doesn’t deliver, users react viscerally • “This engine is useless” or • “The Internet has nothing useful” • “The Internet has too much information!”
How Well Do Today’s Engines Meet Real Users’ Needs? • Most engines cannot yield high precision, high recall hit list with only one search term • But most users don’t compose or refine their searches carefully • Boolean operators virtually unused • Therefore most users probably fail to get desired results • Many sample searches from MSU example would not yield desired information
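Precision and recall, as used on this slide, can be computed from a hit list and a set of documents judged relevant. The sketch below uses the standard definitions; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a hit list against the relevant set.

    precision = relevant hits / all hits returned
    recall    = relevant hits / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical one-word search: 10 hits come back, only 2 of them relevant,
# and 2 other relevant documents are missed entirely.
hit_list = [f"d{i}" for i in range(1, 11)]
relevant_docs = {"d1", "d2", "d11", "d12"}
precision, recall = precision_recall(hit_list, relevant_docs)
```

In this made-up case precision is 0.2 and recall is 0.5: most of the hit list is noise, and half of the useful pages never appear.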
AltaVista “Intelligent” Case Matching Example • Looking for information on “TREC” search engines testing at NIST
Scale Issues SearchVannah “This town is growing so fast, and there’s too many tourists!” -- a 3rd generation resident
The Problem of Scale • No one knows exact size of Web • Databases, intranets complicate issue • “Dark matter” -- Vint Cerf • Probably 250 to 500 million pages publicly accessible • Recent Science article claims most spider coverage is incomplete • AltaVista claims 140 million pages in index
1 Billion URLs -- and Beyond • [Chart: number of URLs by year, 1996 to 1998; labeled values 30M, 140M, 1000M]
Problem of Scale: Transaction Load • AltaVista handles 30 million searches per day • Inktomi is “back-end” for numerous sites • HotBot, N2H2 (Japan), Australian news service • Soon, the “find a Web site” function in Windows 98 • No popular service has melted down yet
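To put the AltaVista figure in per-second terms, a quick back-of-the-envelope calculation helps. The peak factor below is an assumption chosen purely for illustration, not a measured value.

```python
SEARCHES_PER_DAY = 30_000_000      # AltaVista figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

# Average load if searches were spread evenly over the day.
average_qps = SEARCHES_PER_DAY / SECONDS_PER_DAY

# Assumed ratio of busiest-hour load to average load (hypothetical).
ASSUMED_PEAK_FACTOR = 3
peak_qps = average_qps * ASSUMED_PEAK_FACTOR
```

The average works out to roughly 347 searches per second; under the assumed peak factor the busy-hour load would be a little over 1,000 per second.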
Inktomi’s “Network of Workstations” Model • Eric Brewer, CEO, claims centralized high-speed servers cannot scale • Developed new clustering scheme: dozens or hundreds of low-cost servers on high-speed network • But centralized engines have not broken down yet • 64-bit processors @ 300-450 MHz, gigabytes of RAM, fast paths to disk
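The clustering idea can be illustrated with a scatter-gather sketch: each low-cost server holds one slice of the index, a query goes to all of them, and the partial hit lists are merged. This is not Inktomi's actual design. Threads stand in for networked machines, and the shard contents, scores, and function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each "workstation" maps term -> [(score, url), ...].
SHARDS = [
    {"trec": [(0.9, "http://example.gov/trec"), (0.4, "http://example.com/a")]},
    {"trec": [(0.7, "http://example.edu/ir")]},
    {"nist": [(0.8, "http://example.gov/nist")]},
]

def search_shard(shard, term):
    """Look up one term in one shard's slice of the index."""
    return shard.get(term, [])

def search_cluster(shards, term, k=10):
    """Send the term to every shard in parallel and keep the k best hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        all_hits = [hit for part in partials for hit in part]
    return heapq.nlargest(k, all_hits)

top_two = search_cluster(SHARDS, "trec", k=2)
```

Adding capacity in this model means adding another shard to the list rather than buying a larger central server.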
Trends SearchVannah “We have a forward-looking sense of fashion!” -- one of the tour guides
Trends Among Search Engines • Observations of Dr. Susan Feldman, Cornell: • More professional look, feel than a couple years ago • Common syntax evolving: • Plus sign prefix for required term, minus for excluded term • Quotes signify phrases, caps signify case significant • Unique “personalities” evolving
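The emerging common syntax (plus for required, minus for excluded, quotes for phrases) is simple enough to parse in a few lines. This is a sketch, not any engine's real parser. It lowercases everything, so the "caps signify case significant" rule is deliberately left out, and matching is plain substring matching.

```python
import re

# Optional +/- prefix, followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for prefix, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        bucket = {"+": "required", "-": "excluded"}.get(prefix, "optional")
        parsed[bucket].append(term)
    return parsed

def matches(parsed, text):
    """True if text contains every required term and no excluded term."""
    text = text.lower()
    if any(term in text for term in parsed["excluded"]):
        return False
    return all(term in text for term in parsed["required"])

example = parse_query('+search -yahoo "climate change" engine')
```

Here `example` holds `search` as required, `yahoo` as excluded, and both the phrase `climate change` and the word `engine` as optional terms that could be used for ranking.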
The Role of Meta-Crawlers • Experts agree that spider coverage varies across services • No two services cover the same sites for a given search • Therefore searching across multiple indexes yields more results • Therefore metacrawlers can be useful
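A metacrawler's core step is merging hit lists from several services and dropping duplicates. The sketch below assumes each engine is a function that returns a ranked list of URLs. The two engines are canned stand-ins, and the merge rule (more engines first, then best rank) is one plausible choice rather than how MetaCrawler itself ranks.

```python
def metasearch(query, engines):
    """Query every engine, merge the hit lists, and remove duplicate URLs.

    Order: URLs reported by more engines first, then by best rank seen.
    """
    seen = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query)):
            entry = seen.setdefault(url, {"engines": [], "best_rank": rank})
            entry["engines"].append(name)
            entry["best_rank"] = min(entry["best_rank"], rank)
    return sorted(
        seen,
        key=lambda u: (-len(seen[u]["engines"]), seen[u]["best_rank"], u),
    )

# Hypothetical engines with partly overlapping coverage.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2"]

def engine_b(query):
    return ["http://example.com/2", "http://example.com/3"]

merged = metasearch("slavko", {"a": engine_a, "b": engine_b})
```

The merged list has three distinct URLs where either engine alone returned two, which is the coverage argument the slide makes.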
Targeted Spiders • Train the spider to crawl only sites that fit a certain subject domain • InfoSeek News Index • Death of a Princess example • Internet.com’s “vertical” index • LawCrawler • NEM Online • Research project at Michigan State University • Harnessing information of use to manufacturers
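One simple way to keep a spider inside a subject domain is to index and follow links only from pages that look on-topic. The sketch below assumes a keyword test and a toy link graph; both are hypothetical, and real targeted spiders such as those named above may use quite different relevance tests.

```python
from collections import deque

# Hypothetical link graph: url -> (page text, outgoing links).
WEB = {
    "seed": ("manufacturing suppliers directory", ["m1", "sports"]),
    "m1": ("machine tool manufacturing standards", ["m2"]),
    "m2": ("casting and manufacturing process notes", []),
    "sports": ("football scores", ["m3"]),
    "m3": ("manufacturing news", []),
}

def targeted_crawl(web, seed, keywords):
    """Breadth-first crawl that indexes and expands only on-topic pages."""
    queue, visited, on_topic = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if not any(keyword in text for keyword in keywords):
            continue  # off-topic: do not index it, do not follow its links
        on_topic.append(url)
        for link in links:
            if link not in visited and link in web:
                visited.add(link)
                queue.append(link)
    return on_topic

crawled = targeted_crawl(WEB, "seed", ["manufacturing"])
```

Note the trade-off: page `m3` is on-topic but reachable only through an off-topic page, so this strict version never finds it.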
“death of Princess Diana” Search on Infoseek, 8/31/97 1:00 pm
Traditional Model: First, Pick a Database, Then Do Your Search
Why Northern Light is a Breakthrough • Delivering quality sources alongside Web resources • As Web becomes more cluttered, advantage grows • Database search paradigm inverted: First do your search, then pick your source • Automatic categorization yields manageable hit lists • Advantage also grows as Web grows
Beyond Text: Still Images, Digitized Speech, Video • We tend to think of search engines as limited to text • But increasingly we will face digital content • Thanks to scanners, digital cameras, digital sound cards, digital video cameras • These digital collections will be corporate assets • But to use, and re-purpose, these assets, we will need search engines
IBM Almaden’s Image Search Software • Able to index a large collection of still images • Able to find similar images • User selects image, asks for similar shapes • User draws shapes • User filters by color, textual metadata • Samples available online: • Searchable digital postage stamp archive • www.qbic.almaden.ibm.com/cgi-bin/stamps-demo • Searchable archive of trademarks (logos)
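"Find similar images" by color can be approximated with color histograms. The sketch below is an illustration of that general technique, not a description of the IBM software. Images are represented as plain lists of (r, g, b) pixels, and the collection and names are hypothetical.

```python
from collections import Counter

def color_histogram(pixels, levels=4):
    """Normalized histogram over coarse RGB bins (levels per channel)."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}

def similarity(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical color mixes, 0.0 for disjoint."""
    return sum(min(v, hist_b.get(bin_, 0.0)) for bin_, v in hist_a.items())

def most_similar(query_pixels, collection):
    """Name of the image in the collection whose colors best match the query."""
    query = color_histogram(query_pixels)
    return max(
        collection,
        key=lambda name: similarity(query, color_histogram(collection[name])),
    )

# Hypothetical stamp collection: one mostly red image, one mostly blue.
STAMPS = {
    "red_stamp": [(250, 10, 10)] * 4,
    "blue_stamp": [(10, 10, 250)] * 4,
}
query_image = [(240, 20, 20)] * 3 + [(10, 10, 250)]
best = most_similar(query_image, STAMPS)
```

The query image is three-quarters red, so it scores 0.75 against the red stamp and 0.25 against the blue one. Shape matching and textual metadata filters would need separate features.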