Search and Discovery: Searching the Web
Stages of a transaction • Discovery • Find what you’re interested in • Locate sellers • Locate buyers • Compare products • Negotiation • Exchange
Discovery • Encompasses: • Search engines • Recommender systems • Price comparison/shopping agents • Description languages • Data sources • Generic sources: portals, web directories • Domain-specific sources: catalogs, guides, etc. • Advertising
Discovery • More than just finding a resource • Need to be able to estimate value and the likelihood of a successful negotiation • An evaluative infrastructure is required • The least formalized of the e-commerce subareas • Unlikely to have a general-purpose solution soon • Too complex
A Brief History of the Web • Prehistory: • Hypertext as an idea has been around since the 1940s • Vannevar Bush: Memex • Douglas Engelbart: NLS, 1960s • 1987: HyperCard • Graphical tool allowing users to create hyperlinked documents • Late 80s/early 90s: WAIS, Gopher
A Brief History of the Web • 1989/90: Tim Berners-Lee proposes the WWW at CERN • A new global information retrieval system • Develops HTML, a simple markup language • 1993: Mosaic developed at NCSA • Marc Andreessen then founds Netscape • 1993/94: NCSA httpd released • Open-source web server, supported CGI • Precursor to Apache
A Brief History of the Web • 1994: Banner ads appear on HotWired • Beginning of the commercial web • 1994: Yahoo founded • Appearance of the portal, search engine • 1995: NSF backbone privatized • AT&T, Sprint, etc. take over traffic • Network Solutions given a monopoly on domain names • 1995: Microsoft releases Internet Explorer • In 7 years, Netscape goes from 100% market share to 20% (2001).
A Brief History of the Web • 1995: AltaVista started • Full-text Web search • 1995: Andreessen first WWW billionaire • 1995: Sun introduces Java • Able to ship code as well as text across networks • 1995: eBay founded • First online auction site • 1995-98: Explosive growth • Many new formats, applications, companies • 1998: Akamai founded (web caching)
A Brief History of the Web • 1998: ICANN governs names & addresses • 1998: MP3 format popularized • WinAmp released • Small enough to make audio distribution practical • 1998: Google founded. • 2000: Napster appears • Beginnings of peer-to-peer technology, file sharing • 2000(ish): End of the boom • Consolidation, reduction in growth
Lessons from Radio • Radio was popularized in the 1920s • Originally intended as a one-to-one messaging system. • Fee-for-use pay structure. • 1922: Explosive growth begins • RCA’s revenues from sales of receivers doubled each year • Broadcast model becomes prevalent • Thousands of broadcasters emerge
Lessons From Radio • 1922-1924: Transition • How to make money broadcasting? • Support sale of receivers • Goodwill (sponsors) • Public good – supported as a non-profit • Advertising • Tube tax/set tax (a la BBC) • By 1924, stations are failing as quickly as they start.
Lessons From Radio • Affordable content driven by audience size • “Rich-get-richer” for large stations • 1926: RCA launches NBC • First nationwide broadcast • Creates the network system • National content, local broadcasting • Advertising the dominant revenue generator • WWW questions: • Who will be NBC? • What will the revenue model be? • Advertising? Competition with TV, radio for this revenue. • Micropayments? Subscriptions? Content aggregation?
Searching the Web • Web growth estimated at 1000% in late 90s. • Can search engines keep up with this growth? • How to deal with the dynamic nature of the web? • Page contents change • Pages appear, disappear, move • Link structure changes
Search Engines • Most common form of discovery • Crawl the web to collect pages • Stored and indexed for easy retrieval • Query languages simple • Goals: • Fast retrieval (Google gets 150 million queries per day) • Accurate (no dead links) • Precise (pages match user’s needs)
Terminology • Outward link • Object that a page links to • Outdegree: number of outward links • Inward link • Pages that link to an object • Indegree: number of inward links • Path • Series of outward links from A to B
The Web as a Directed Graph • We can represent the web as a directed graph. • Sites are nodes • Links are edges. • Outward link • Object that a page links to • Inward link • Pages that link to an object
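A directed graph of this kind is naturally represented with an adjacency list. A minimal sketch in Python, with hypothetical page names:

```python
# The Web as a directed graph: pages are nodes, links are edges.
# Adjacency list: page -> set of pages it links to.
web = {
    "a.com": {"b.com", "c.com"},   # outdegree 2
    "b.com": {"c.com"},            # outdegree 1
    "c.com": {"a.com"},            # outdegree 1
}

def outward_links(page):
    """Objects that `page` links to."""
    return web.get(page, set())

def inward_links(page):
    """Pages that link to `page`."""
    return {p for p, targets in web.items() if page in targets}

print(outward_links("a.com"))   # {'b.com', 'c.com'}
print(inward_links("c.com"))    # {'a.com', 'b.com'}: indegree 2
```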
Adjacency Matrix • We can also represent the Web as a very large adjacency matrix • This representation exposes structural properties: • Eigenvectors capture how strongly the Web clusters • Row and column sums give the out-degree and in-degree distributions • Connectedness • Some ranking algorithms (e.g., HITS) work directly on this matrix.
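A numeric sketch of the same toy graph as an adjacency matrix (Python/NumPy): degrees fall out of row and column sums, and the principal eigenvector of AᵀA is the authority vector that HITS (covered later) computes.

```python
import numpy as np

# Adjacency matrix for the toy graph above:
# A[i, j] = 1 if page i links to page j (order: a.com, b.com, c.com).
A = np.array([
    [0, 1, 1],   # a.com -> b.com, c.com
    [0, 0, 1],   # b.com -> c.com
    [1, 0, 0],   # c.com -> a.com
])

outdegree = A.sum(axis=1)   # row sums
indegree = A.sum(axis=0)    # column sums

# Principal eigenvector of A^T A: the HITS authority vector.
eigvals, eigvecs = np.linalg.eig(A.T @ A)
authority = eigvecs[:, np.argmax(eigvals.real)].real
print(outdegree, indegree, authority)
```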
Web structure • The Web can be broken into four areas (Kleinberg/Lawrence): • Core: a path exists between any two core pages (strongly connected) • Upstream: can reach the core, but no path back from the core • Downstream: can be reached from the core, but cannot reach it • Tendrils/islands: disconnected from the core • The areas (allegedly) have roughly equal size.
Coverage • Search engines claim they index a large fraction of the web. • How to verify this? • Run queries on many engines and compare number of hits. • May return irrelevant documents • Documents may no longer exist • Documents may have changed
Coverage • NEC (1998) – Estimate size of web, coverage for major search engines. • Query each engine, retrieve and compare all results (only exact matches). • Coverage estimates: • HotBot: 57%, AltaVista: 46% • NorthernLight: 33%, Excite: 23% • Infoseek: 16%, Lycos: 4%
Estimating the size of the indexable web • Overlap in coverage was used to estimate size • Let A and B be the sets of pages indexed by two engines, and U their overlap (pages indexed by both) • If the engines sample the Web roughly independently, |U| / |B| serves as an estimate of |A| / N, where N is the size of the Web, giving N ≈ |A|·|B| / |U| • 1998: AltaVista/HotBot estimate: 320 million pages
Using size to refine coverage estimates (1997) • This value can then be used to determine a coverage estimate for each engine • For each pair of engines, solve for N • Assume the real N is the largest value found • Updated coverage: HotBot: 34%, AltaVista: 28% • NorthernLight: 20%, Excite: 14% • Infoseek: 10%, Lycos: 3%
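A sketch of this capture-recapture arithmetic in Python. The index sizes and overlap counts below are made-up illustrative numbers, not the study's data:

```python
from itertools import combinations

# Hypothetical index sizes (pages) and pairwise overlap counts.
index_size = {"HotBot": 110e6, "AltaVista": 90e6, "Excite": 45e6}
overlap = {("HotBot", "AltaVista"): 31e6,
           ("HotBot", "Excite"): 16e6,
           ("AltaVista", "Excite"): 13e6}

# For each pair (A, B): |U|/|B| estimates |A|/N, so N = |A|*|B|/|U|.
estimates = [index_size[a] * index_size[b] / overlap[(a, b)]
             for a, b in combinations(index_size, 2)]

N = max(estimates)   # assume the real N is the largest found
for engine, size in index_size.items():
    print(f"{engine}: {size / N:.0%} coverage of ~{N:.2e} pages")
```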
Updates (1999) • Web growth stays ahead of indexing • No search engine covers more than 16% of the Web • Union of all engines: ~50% coverage • Estimated size: 800 million pages • Search engines are more likely to index authorities • More likely to index US, commercial sites
Updates (12/2001) • Self-reported number of pages indexed: • Google: 2 billion (3 billion+ today) • FAST (AllTheWeb.com): 625 million • (claimed 2.1 billion in 2002) • AltaVista: 550 million • Inktomi: 500 million • NorthernLight: 390 million
Indexing the web • Spiders are used to crawl the web and collect pages. • A page is downloaded and its outward links are found. • Each outward link is then downloaded. • Exceptions: • Links from CGI interfaces • Robot Exclusion Standard
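A minimal spider sketch in Python (standard library only; the seed URL is a placeholder). A production crawler would also honor robots.txt per the Robot Exclusion Standard, throttle requests, and use a real HTML parser:

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed, max_pages=10):
    """Breadth-first: download a page, find its outward links,
    then download each of those in turn."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # dead or unreachable link; skip it
        # Naive link extraction; real spiders parse the HTML properly.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# crawl("http://example.com/")  # placeholder seed URL
```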
Indexing the Web • “Stop words” are stripped from each page • A forward index is created • Bundles the words occurring in each document • Maps each document to its words (the inverted index, next slide, maps words to documents) • TFIDF can be used to keep only “significant” keywords • Term frequency × inverse document frequency
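A sketch of the TFIDF weighting (Python, toy documents): a word scores highly when it is frequent in one document but rare across the collection, so low-scoring words can be left out of the index.

```python
import math
from collections import Counter

# Toy document collection (illustrative only).
docs = {
    "d1": "cheap used cars cheap deals".split(),
    "d2": "used books and new books".split(),
    "d3": "new cars new deals".split(),
}

def tfidf(word, doc_id):
    """Term frequency in the document times the log inverse of
    the fraction of documents containing the word."""
    tf = Counter(docs[doc_id])[word] / len(docs[doc_id])
    df = sum(1 for words in docs.values() if word in words)
    return tf * math.log(len(docs) / df)

print(tfidf("cheap", "d1"))  # high: frequent in d1, rare elsewhere
print(tfidf("new", "d3"))    # lower: appears in two documents
```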
Indexing the web • An inverted index is created • Forward index sorted according to word • Maps keywords to URLs • Some wrinkles: • Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding • Semantic similarity • Words with similar meanings share an index. • Issue: trading coverage (number of hits) for precision (how closely hits match request)
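A sketch of inverting the forward index (Python, reusing `docs` from the TFIDF sketch). The suffix-stripping here is a crude stand-in for a real stemmer such as Porter's:

```python
from collections import defaultdict

def stem(word):
    """Crude morphology: fold case and strip common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(forward_index):
    """Invert document -> words into stemmed word -> documents."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[stem(word)].add(doc_id)
    return inverted

index = build_inverted_index(docs)
print(index["car"])   # {'d1', 'd3'}: "cars" matched via stemming
```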
Indexing Issues • Indexing techniques were designed for static collections • How to deal with pages that change? • Periodic crawls, rebuilding the index • Crawl frequency can vary by page • Stale records need a way to be “purged” • A hash of each page is stored to detect changes • The text of a link to a page (its anchor text) can help label that page • Helps eliminate the addition of spurious keywords
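A sketch of the page-hash bookkeeping (Python, standard library): store a digest at crawl time and compare on recrawl to decide whether the record is stale.

```python
import hashlib

stored_hashes = {}   # url -> digest recorded at the last crawl

def page_changed(url, content):
    """True if the page differs from the version last crawled
    (or has never been crawled)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = stored_hashes.get(url) != digest
    stored_hashes[url] = digest
    return changed

print(page_changed("http://example.com/", "v1"))  # True: first crawl
print(page_changed("http://example.com/", "v1"))  # False: unchanged
print(page_changed("http://example.com/", "v2"))  # True: must re-index
```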
Indexing Issues • Availability and speed • Most search engines will cache the page being referenced. • Multiple search terms • OR: separate searches concatenated • AND: intersection of searches computed. • Regular expressions not typically handled. • Parsing • Must be able to handle malformed HTML, partial documents
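A sketch of multi-term queries over the inverted index built above (Python): OR unions the per-term result sets, AND intersects them.

```python
def search(index, query, mode="AND"):
    """Evaluate a multi-term query: AND intersects the per-term
    results, OR concatenates (unions) them."""
    postings = [index.get(stem(term), set()) for term in query.split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

# `index` and `stem` from the inverted-index sketch:
print(search(index, "used cars", "AND"))  # {'d1'}: both terms
print(search(index, "used cars", "OR"))   # all three documents
```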
PageRank • Google uses PageRank to determine relevance • Based on the “quality” of a page’s inward links • A page’s rank sums the PageRanks of the pages that point to it, each divided by its outdegree • Let p be a page, with T1 … Tn linking to p, and let C(Ti) be the outdegree of Ti: • PR(p) = (1 − d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • d is a ‘damping’ factor • PR ‘propagates’ through the graph
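A power-iteration sketch of this formula (Python, reusing the toy `web` graph from the directed-graph slide; d = 0.85 is a conventional damping value):

```python
def pagerank(web, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over all
    pages t linking to p, where C(t) is t's outdegree."""
    pr = {page: 1.0 for page in web}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(web[t])
                                   for t in web if p in web[t])
              for p in web}
    return pr

print(pagerank(web))   # c.com ranks highest: two pages point to it
```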
PageRank • Justification: • Imagine a random surfer who keeps clicking through links • 1 − d is the probability she abandons the chain and starts a new search; d is the probability she follows another link • Or … • A page has a high ranking if highly ranked pages point to it • Pros: difficult to game the system • Cons: creates a “rich get richer” web structure where highly popular sites grow in popularity.
HITS • HITS is also commonly used for document ranking. • Gives each page a hub score and an authority score • A good authority is pointed to by many good hubs. • A good hub points to many good authorities. • Users want good authorities.
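A sketch of the mutual reinforcement (Python, toy `web` graph again): authority scores are refreshed from the hub scores of pages linking in, hub scores from the authority scores of pages linked to, normalizing each round.

```python
def hits(web, iterations=50):
    """A good authority is pointed to by good hubs; a good hub
    points to good authorities."""
    hub = {p: 1.0 for p in web}
    auth = {p: 1.0 for p in web}
    for _ in range(iterations):
        auth = {p: sum(hub[t] for t in web if p in web[t]) for p in web}
        hub = {p: sum(auth[t] for t in web[p]) for p in web}
        for scores in (auth, hub):   # normalize so values stay bounded
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(web)
print(max(auth, key=auth.get))   # the best authority in the toy graph
```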
Issues with Ranking Algorithms • Spurious keywords and META tags • Users reinforcing each other • Increases “authority” measure • Topic drift • Many hubs link to more than one topic
Web structure • Structure is important for: • Predicting traffic patterns • Who will visit a site? • Where will visitors arrive from? • How many visitors can you expect? • Estimating coverage • Is a site likely to be indexed?
Core • Compact • Short paths between sites • “Small world” phenomenon • Average path lengths are small relative to the size of the graph • The number of inward and outward links follows a power law • Mechanism: preferential attachment • As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
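A simulation sketch of preferential attachment (Python, standard library). Each new site links to an existing site chosen with probability proportional to its in-degree; the +1 smoothing is an assumption so zero-degree sites can ever be chosen:

```python
import random
from collections import Counter

def preferential_attachment(n_sites=10_000, seed=42):
    """Each arriving site links to an existing site chosen with
    probability proportional to (in-degree + 1)."""
    rng = random.Random(seed)
    indegree = [0, 0]            # start with two sites
    for _ in range(n_sites):
        weights = [d + 1 for d in indegree]
        target = rng.choices(range(len(indegree)), weights)[0]
        indegree[target] += 1
        indegree.append(0)       # newcomers start with no in-links
    return indegree

# A power law shows up as a heavy tail: most sites have few
# in-links, a handful have very many.
hist = Counter(preferential_attachment())
print(sorted(hist.items())[:5], max(hist))
```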
Power laws and small worlds • Power laws occur everywhere in nature • Distribution of site sizes, city sizes, incomes, word frequencies • Growing networks tend to evolve power-law degree distributions • Small-world phenomenon • “Neighborhoods” are joined by common members • Hubs serve to connect neighborhoods • Linkage is closer than one might expect • Six Degrees of Separation, Kevin Bacon
Local structure • Richer than the global power-law picture suggests • Pages with similar topics self-organize into communities • Short average path length • High link density • Webrings • Inverse: does a high link density imply the existence of a community? • Can this be used to study the emergence and growth of web communities?
Hubs and Authorities • Common community structure • Hubs • Many outward links • Lists of resources • Authorities • Many inward links • Provide resources, content
Hubs and Authorities • [Diagram: hubs on one side pointing to authorities on the other] • Link structure estimates suggest over 100,000 Web communities • Often not categorized by portals
Web Communities • Alternate definition: • Each member has more links to other community members than to non-members • An extension of a clique • Such communities can be discovered with network flow algorithms (a simple membership test is sketched below).
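A sketch of testing that definition for a candidate set of pages (Python, toy `web` graph; finding such sets in general is where the network-flow algorithms come in):

```python
def is_community(web, members):
    """True if every member has more links to other members than
    to non-members (counting outward links only, for simplicity)."""
    members = set(members)
    for page in members:
        inside = sum(1 for t in web[page] if t in members)
        outside = len(web[page]) - inside
        if inside <= outside:
            return False
    return True

print(is_community(web, {"a.com", "b.com", "c.com"}))  # True
print(is_community(web, {"a.com", "b.com"}))           # False
```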