Search and Discovery: Searching the Web
Stages of a transaction • Discovery • Find what you’re interested in • Locate sellers • Locate buyers • Compare products • Negotiation • Exchange
Discovery • Encompasses: • Search engines • Recommender systems • Price comparison/shopping agents • Description languages • Data sources • Generic sources: portals, web directories • Domain-specific sources: catalogs, guides, etc. • Advertising
Discovery • More than just finding a resource • Need to be able to estimate value and the likelihood of a successful negotiation • An evaluative infrastructure is required • The least formalized of the e-commerce subareas • Unlikely to have a general-purpose solution soon • Too complex
A Brief History of the Web • Prehistory: • Hypertext as an idea has been around since the 1940s • Vannevar Bush: Memex • Douglas Engelbart: NLS, 1960s • 1987: HyperCard • Graphical tool allowing users to create hyperlinked documents • Late 80s/early 90s: WAIS, Gopher
A Brief History of the Web • 1989/90: Tim Berners-Lee proposes the WWW at CERN • A new global information retrieval system • Develops HTML, a simple markup language • 1993: Mosaic developed at NCSA • Marc Andreessen then founds Netscape • 1993/94: NCSA httpd released • Open-source web server, supported CGI • Precursor to Apache
A Brief History of the Web • 1994: Banner ads appear on HotWired • Beginning of the commercial web • 1994: Yahoo founded • Appearance of the portal, search engine • 1995: NSF backbone privatized • AT&T, Sprint, etc. take over traffic • Network Solutions given a monopoly on domain names • 1995: Microsoft releases Internet Explorer • In 7 years, Netscape goes from 100% market share to 20% (2001).
A Brief History of the Web • 1995: AltaVista started • Full-text Web search • 1995: Andreessen first WWW billionaire • 1995: Sun introduces Java • Able to ship code as well as text across networks • 1995: eBay founded • First online auction site • 1995-98: Explosive growth • Many new formats, applications, companies • 1998: Akamai founded (web caching)
A Brief History of the Web • 1998: ICANN governs names & addresses • 1998: MP3 format popularized • WinAmp released • Small enough to make audio distribution practical • 1998: Google founded. • 2000: Napster appears • Beginnings of peer-to-peer technology, file sharing • 2000(ish): End of the boom • Consolidation, reduction in growth
Lessons from Radio • Radio was popularized in the 1920s • Originally intended as a one-to-one messaging system. • Fee-for-use pay structure. • 1922: Explosive growth begins • RCA’s revenues from sales of receivers doubled each year • Broadcast model becomes prevalent • Thousands of broadcasters emerge
Lessons From Radio • 1922-1924: Transition • How to make money broadcasting? • Support sale of receivers • Goodwill (sponsors) • Public good – supported as a non-profit • Advertising • Tube tax/set tax (a la BBC) • By 1924, stations are failing as quickly as they start.
Lessons From Radio • Affordable content driven by audience size • “Rich-get-richer” for large stations • 1926: RCA launches NBC • First nationwide broadcast • Creates the network system • National content, local broadcasting • Advertising the dominant revenue generator • WWW questions: • Who will be NBC? • What will the revenue model be? • Advertising? Competition with TV, radio for this revenue. • Micropayments? Subscriptions? Content aggregation?
Searching the Web • Web growth estimated at 1000% in late 90s. • Can search engines keep up with this growth? • How to deal with the dynamic nature of the web? • Page contents change • Pages appear, disappear, move • Link structure changes
Search Engines • Most common form of discovery • Crawl the web to collect pages • Stored and indexed for easy retrieval • Query languages simple • Goals: • Fast retrieval (Google gets 150 million queries per day) • Accurate (no dead links) • Precise (pages match user’s needs)
Terminology • Outward link • Object that a page links to • Outdegree: number of outward links • Inward link • Pages that link to an object • Indegree: number of inward links • Path • Series of outward links from A to B
The Web as a Directed Graph • We can represent the web as a directed graph. • Sites are nodes • Links are edges. • Outward link • Object that a page links to • Inward link • Pages that link to an object
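A directed graph of this kind is naturally represented with an adjacency list. A minimal sketch in Python, with hypothetical page names:

```python
# The Web as a directed graph: pages are nodes, links are edges.
# Adjacency list: page -> set of pages it links to.
web = {
    "a.com": {"b.com", "c.com"},   # outdegree 2
    "b.com": {"c.com"},            # outdegree 1
    "c.com": {"a.com"},            # outdegree 1
}

def outward_links(page):
    """Objects that `page` links to."""
    return web.get(page, set())

def inward_links(page):
    """Pages that link to `page`."""
    return {p for p, targets in web.items() if page in targets}

print(outward_links("a.com"))   # {'b.com', 'c.com'}
print(inward_links("c.com"))    # {'a.com', 'b.com'}: indegree 2
```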
Adjacency Matrix • We can also represent the Web as a very large adjacency matrix • This representation exposes structural properties: • Eigenvectors capture how strongly the Web clusters • Row and column sums give the out-degree and in-degree distributions • Connectedness • Some ranking algorithms (e.g., HITS) work directly on this matrix.
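A numeric sketch of the same toy graph as an adjacency matrix (Python/NumPy): degrees fall out of row and column sums, and the principal eigenvector of AᵀA is the authority vector that HITS (covered later) computes.

```python
import numpy as np

# Adjacency matrix for the toy graph above:
# A[i, j] = 1 if page i links to page j (order: a.com, b.com, c.com).
A = np.array([
    [0, 1, 1],   # a.com -> b.com, c.com
    [0, 0, 1],   # b.com -> c.com
    [1, 0, 0],   # c.com -> a.com
])

outdegree = A.sum(axis=1)   # row sums
indegree = A.sum(axis=0)    # column sums

# Principal eigenvector of A^T A: the HITS authority vector.
eigvals, eigvecs = np.linalg.eig(A.T @ A)
authority = eigvecs[:, np.argmax(eigvals.real)].real
print(outdegree, indegree, authority)
```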
Web structure • The Web can be broken into four areas (Kleinberg/Lawrence): • Core: a path exists between any two core pages (strongly connected) • Upstream: can reach the core, but no path back from the core • Downstream: can be reached from the core, but cannot reach it • Tendrils/islands: disconnected from the core • The areas (allegedly) have roughly equal size.
Coverage • Search engines claim they index a large fraction of the web. • How to verify this? • Run queries on many engines and compare number of hits. • May return irrelevant documents • Documents may no longer exist • Documents may have changed
Coverage • NEC (1998) – Estimate size of web, coverage for major search engines. • Query each engine, retrieve and compare all results (only exact matches). • Coverage estimates: • HotBot: 57%, AltaVista: 46% • NorthernLight: 33%, Excite: 23% • Infoseek: 16%, Lycos: 4%
Estimating the size of the indexable web • Overlap in coverage was used to estimate size • Let A and B be the sets of pages indexed by two engines, and U their overlap (pages indexed by both) • If the engines sample the Web roughly independently, |U| / |B| serves as an estimate of |A| / N, where N is the size of the Web, giving N ≈ |A|·|B| / |U| • 1998: AltaVista/HotBot estimate: 320 million pages
Using size to refine coverage estimates (1997) • This value can then be used to determine a coverage estimate for each engine • For each pair of engines, solve for N • Assume the real N is the largest value found • Updated coverage: HotBot: 34%, AltaVista: 28% • NorthernLight: 20%, Excite: 14% • Infoseek: 10%, Lycos: 3%
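A sketch of this capture-recapture arithmetic in Python. The index sizes and overlap counts below are made-up illustrative numbers, not the study's data:

```python
from itertools import combinations

# Hypothetical index sizes (pages) and pairwise overlap counts.
index_size = {"HotBot": 110e6, "AltaVista": 90e6, "Excite": 45e6}
overlap = {("HotBot", "AltaVista"): 31e6,
           ("HotBot", "Excite"): 16e6,
           ("AltaVista", "Excite"): 13e6}

# For each pair (A, B): |U|/|B| estimates |A|/N, so N = |A|*|B|/|U|.
estimates = [index_size[a] * index_size[b] / overlap[(a, b)]
             for a, b in combinations(index_size, 2)]

N = max(estimates)   # assume the real N is the largest found
for engine, size in index_size.items():
    print(f"{engine}: {size / N:.0%} coverage of ~{N:.2e} pages")
```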
Updates (1999) • Web growth stays ahead of indexing • No search engine covers more than 16% of the Web • Union of all engines: ~50% coverage • Estimated size: 800 million pages • Search engines are more likely to index authorities • More likely to index US, commercial sites
Updates (12/2001) • Self-reported number of pages indexed: • Google: 2 billion (3 billion+ today) • FAST (AllTheWeb.com): 625 million • (claimed 2.1 billion in 2002) • AltaVista: 550 million • Inktomi: 500 million • NorthernLight: 390 million
Indexing the web • Spiders are used to crawl the web and collect pages. • A page is downloaded and its outward links are found. • Each outward link is then downloaded. • Exceptions: • Links from CGI interfaces • Robot Exclusion Standard
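A minimal spider sketch in Python (standard library only; the seed URL is a placeholder). A production crawler would also honor robots.txt per the Robot Exclusion Standard, throttle requests, and use a real HTML parser:

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed, max_pages=10):
    """Breadth-first: download a page, find its outward links,
    then download each of those in turn."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # dead or unreachable link; skip it
        # Naive link extraction; real spiders parse the HTML properly.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# crawl("http://example.com/")  # placeholder seed URL
```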
Indexing the Web • “Stop words” are stripped from each page • A forward index is created • Bundles the words occurring in each document • Maps each document to its words (the inverted index, next slide, maps words to documents) • TFIDF can be used to keep only “significant” keywords • Term frequency × inverse document frequency
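A sketch of the TFIDF weighting (Python, toy documents): a word scores highly when it is frequent in one document but rare across the collection, so low-scoring words can be left out of the index.

```python
import math
from collections import Counter

# Toy document collection (illustrative only).
docs = {
    "d1": "cheap used cars cheap deals".split(),
    "d2": "used books and new books".split(),
    "d3": "new cars new deals".split(),
}

def tfidf(word, doc_id):
    """Term frequency in the document times the log inverse of
    the fraction of documents containing the word."""
    tf = Counter(docs[doc_id])[word] / len(docs[doc_id])
    df = sum(1 for words in docs.values() if word in words)
    return tf * math.log(len(docs) / df)

print(tfidf("cheap", "d1"))  # high: frequent in d1, rare elsewhere
print(tfidf("new", "d3"))    # lower: appears in two documents
```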
Indexing the web • An inverted index is created • Forward index sorted according to word • Maps keywords to URLs • Some wrinkles: • Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding • Semantic similarity • Words with similar meanings share an index. • Issue: trading coverage (number of hits) for precision (how closely hits match request)
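A sketch of inverting the forward index (Python, reusing `docs` from the TFIDF sketch). The suffix-stripping here is a crude stand-in for a real stemmer such as Porter's:

```python
from collections import defaultdict

def stem(word):
    """Crude morphology: fold case and strip common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(forward_index):
    """Invert document -> words into stemmed word -> documents."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[stem(word)].add(doc_id)
    return inverted

index = build_inverted_index(docs)
print(index["car"])   # {'d1', 'd3'}: "cars" matched via stemming
```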
Indexing Issues • Indexing techniques were designed for static collections • How to deal with pages that change? • Periodic crawls, rebuilding the index • Crawl frequency can vary by page • Stale records need a way to be “purged” • A hash of each page is stored to detect changes • The text of a link to a page (its anchor text) can help label that page • Helps eliminate the addition of spurious keywords
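A sketch of the page-hash bookkeeping (Python, standard library): store a digest at crawl time and compare on recrawl to decide whether the record is stale.

```python
import hashlib

stored_hashes = {}   # url -> digest recorded at the last crawl

def page_changed(url, content):
    """True if the page differs from the version last crawled
    (or has never been crawled)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = stored_hashes.get(url) != digest
    stored_hashes[url] = digest
    return changed

print(page_changed("http://example.com/", "v1"))  # True: first crawl
print(page_changed("http://example.com/", "v1"))  # False: unchanged
print(page_changed("http://example.com/", "v2"))  # True: must re-index
```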
Indexing Issues • Availability and speed • Most search engines will cache the page being referenced. • Multiple search terms • OR: separate searches concatenated • AND: intersection of searches computed. • Regular expressions not typically handled. • Parsing • Must be able to handle malformed HTML, partial documents
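A sketch of multi-term queries over the inverted index built above (Python): OR unions the per-term result sets, AND intersects them.

```python
def search(index, query, mode="AND"):
    """Evaluate a multi-term query: AND intersects the per-term
    results, OR concatenates (unions) them."""
    postings = [index.get(stem(term), set()) for term in query.split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

# `index` and `stem` from the inverted-index sketch:
print(search(index, "used cars", "AND"))  # {'d1'}: both terms
print(search(index, "used cars", "OR"))   # all three documents
```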
PageRank • Google uses PageRank to determine relevance • Based on the “quality” of a page’s inward links • A page’s rank sums the PageRanks of the pages that point to it, each divided by its outdegree • Let p be a page, with T1 … Tn linking to p, and let C(Ti) be the outdegree of Ti: • PR(p) = (1 − d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • d is a ‘damping’ factor • PR ‘propagates’ through the graph
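A power-iteration sketch of this formula (Python, reusing the toy `web` graph from the directed-graph slide; d = 0.85 is a conventional damping value):

```python
def pagerank(web, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over all
    pages t linking to p, where C(t) is t's outdegree."""
    pr = {page: 1.0 for page in web}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(web[t])
                                   for t in web if p in web[t])
              for p in web}
    return pr

print(pagerank(web))   # c.com ranks highest: two pages point to it
```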
PageRank • Justification: • Imagine a random surfer who keeps clicking through links • 1 − d is the probability she abandons the chain and starts a new search; d is the probability she follows another link • Or … • A page has a high ranking if highly ranked pages point to it • Pros: difficult to game the system • Cons: creates a “rich get richer” web structure where highly popular sites grow in popularity.
HITS • HITS is also commonly used for document ranking. • Gives each page a hub score and an authority score • A good authority is pointed to by many good hubs. • A good hub points to many good authorities. • Users want good authorities.
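A sketch of the mutual reinforcement (Python, toy `web` graph again): authority scores are refreshed from the hub scores of pages linking in, hub scores from the authority scores of pages linked to, normalizing each round.

```python
def hits(web, iterations=50):
    """A good authority is pointed to by good hubs; a good hub
    points to good authorities."""
    hub = {p: 1.0 for p in web}
    auth = {p: 1.0 for p in web}
    for _ in range(iterations):
        auth = {p: sum(hub[t] for t in web if p in web[t]) for p in web}
        hub = {p: sum(auth[t] for t in web[p]) for p in web}
        for scores in (auth, hub):   # normalize so values stay bounded
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(web)
print(max(auth, key=auth.get))   # the best authority in the toy graph
```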
Issues with Ranking Algorithms • Spurious keywords and META tags • Users reinforcing each other • Increases “authority” measure • Topic drift • Many hubs link to more than one topic
Web structure • Structure is important for: • Predicting traffic patterns • Who will visit a site? • Where will visitors arrive from? • How many visitors can you expect? • Estimating coverage • Is a site likely to be indexed?
Core • Compact • Short paths between sites • “Small world” phenomenon • Average path lengths are small relative to the size of the graph • The number of inward and outward links follows a power law • Mechanism: preferential attachment • As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
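A simulation sketch of preferential attachment (Python, standard library). Each new site links to an existing site chosen with probability proportional to its in-degree; the +1 smoothing is an assumption so zero-degree sites can ever be chosen:

```python
import random
from collections import Counter

def preferential_attachment(n_sites=10_000, seed=42):
    """Each arriving site links to an existing site chosen with
    probability proportional to (in-degree + 1)."""
    rng = random.Random(seed)
    indegree = [0, 0]            # start with two sites
    for _ in range(n_sites):
        weights = [d + 1 for d in indegree]
        target = rng.choices(range(len(indegree)), weights)[0]
        indegree[target] += 1
        indegree.append(0)       # newcomers start with no in-links
    return indegree

# A power law shows up as a heavy tail: most sites have few
# in-links, a handful have very many.
hist = Counter(preferential_attachment())
print(sorted(hist.items())[:5], max(hist))
```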
Power laws and small worlds • Power laws occur everywhere in nature • Distribution of site sizes, city sizes, incomes, word frequencies • Growing networks tend to evolve power-law degree distributions • Small-world phenomenon • “Neighborhoods” are joined by common members • Hubs serve to connect neighborhoods • Linkage is closer than one might expect • Six Degrees of Separation, Kevin Bacon
Local structure • Richer than the global power-law picture suggests • Pages with similar topics self-organize into communities • Short average path length • High link density • Webrings • Inverse: does a high link density imply the existence of a community? • Can this be used to study the emergence and growth of web communities?
Hubs and Authorities • Common community structure • Hubs • Many outward links • Lists of resources • Authorities • Many inward links • Provide resources, content
Hubs and Authorities • [Diagram: hubs on one side pointing to authorities on the other] • Link structure estimates suggest over 100,000 Web communities • Often not categorized by portals
Web Communities • Alternate definition: • Each member has more links to other community members than to non-members • An extension of a clique • Such communities can be discovered with network flow algorithms (a simple membership test is sketched below).
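A sketch of testing that definition for a candidate set of pages (Python, toy `web` graph; finding such sets in general is where the network-flow algorithms come in):

```python
def is_community(web, members):
    """True if every member has more links to other members than
    to non-members (counting outward links only, for simplicity)."""
    members = set(members)
    for page in members:
        inside = sum(1 for t in web[page] if t in members)
        outside = len(web[page]) - inside
        if inside <= outside:
            return False
    return True

print(is_community(web, {"a.com", "b.com", "c.com"}))  # True
print(is_community(web, {"a.com", "b.com"}))           # False
```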