Nurturing content-based collaborative communities on the Web • Soumen Chakrabarti • Center for Intelligent Internet Research • Computer Science and Engineering • Indian Institute of Technology Bombay • www.cse.iitb.ernet.in/~soumen • www.cse.iitb.ernet.in/~cfiir
Generic search engines • Struggle to cover the expanding Web • 35% coverage in 1997 (Bharat and Broder) • 18% in 1999 (Lawrence and Lee Giles) • Google rebounds to 50% in 2000 • Moore’s law vs. Web population • Search quality, index freshness • Cannot afford advanced processing • Alta Vista serves >40 million queries / day • Cannot even afford to seek on disk (8ms) • Limits intelligence of search engines
Scale vs. quality [Figure: scale vs. quality trade-off: keyword-based search engines (HotBot, Alta Vista) offer high scale but low quality; link-assisted ranking and topic distillation (Google, Clever) improve quality; resource discovery and focused crawling, up to lexical networks, parsing, and semantic indexing, trade scale for further quality]
The case for vertical portals “Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.” (San Jose Mercury News, 1/1999)
Scaling through specialization • The Web shows content-based locality • Link-based clusters correlated with content • Content-based communities emerge in a spontaneous, decentralized fashion • Can learn and exploit locality patterns • Analyze page visits and bookmarks • Automatically construct a “focused portal” with resources that • Have high relevance and quality • Are up-to-date and collectively comprehensive
Roadmap • Hyperlink mining: a short history • Resource discovery • Content-based locality in hypertext • Taxonomy models, topic distillation • Strategies for focused crawling • Data capture and mining architecture • The Memex collaboration system • Collaborative construction of vertical portals • Link metadata management architecture • Surfing backwards on the Web
Historical background • First generation Web search engines • Delete ‘stopwords’ from queries • Can only do syntactic matching • Users stopped asking good questions! • TREC queries: tens to hundreds of words • Alta Vista: at most 2–3 words • Crisis of abundance • Relevance ranking for very short queries • Quality complements relevance — that’s where hand-made topic directories shine
Hyperlink Induced Topic Search [Figure: a query goes to a keyword search engine; the response is expanded into a graph, over which the iteration a = E h, h = E^T a (E = link matrix of the expanded graph) computes 'hub' (h) and 'authority' (a) scores]
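A minimal sketch of the HITS iteration on a toy link graph may help make the h and a updates concrete; the graph, iteration count, and normalization below are illustrative assumptions, not the original Clever implementation.

```python
# Illustrative HITS sketch on a toy expanded graph (not the original
# Clever code). 'links' maps each page to the pages it points to.
links = {"p1": ["p3", "p4"], "p2": ["p3"], "p3": ["p4"], "p4": []}

def hits(links, iters=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority of p = sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub score of p = sum of authority scores of pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # L2-normalize so the scores converge instead of growing unboundedly
        for vec in (auth, hub):
            norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth

hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))  # pages ranked by authority
```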
PageRank and Google • Prestige of a page is proportional to the sum of the prestige of the pages citing it (e.g., p4 ∝ p1 + p2 + p3), i.e., p = Ep • Standard bibliometric measure of influence • Simulate a random walk on the Web, following a random outlink from each page, to precompute the prestige of all pages • Sort keyword-matched responses by decreasing prestige
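The random-walk prestige computation can be sketched as a short power iteration; the damping factor of 0.85, the dangling-page handling, and the toy graph below are assumptions for illustration, not Google's production details.

```python
# Illustrative PageRank power iteration (damping factor assumed 0.85).
links = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1", "p2", "p3"]}

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}  # uniform starting distribution
    for _ in range(iters):
        nxt = {page: (1 - d) / n for page in pages}  # random-jump mass
        for page, outs in links.items():
            targets = outs or pages  # dangling pages spread mass evenly
            share = p[page] / len(targets)
            for t in targets:
                nxt[t] += d * share  # follow a random outlink
        p = nxt
    return p

print(pagerank(links))  # p4 collects prestige from p1, p2, p3
```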
Observations • HITS • Uses text initially to select Web subgraph • Expands subgraph by radius 1 … magic! • h and a scores independent of content • Iterations required at query time • Google/PageRank • Precomputed query-independent prestige • No iterations needed at query time, faster • Keyword query selects subgraph to rank • No notion of hub or bipartite reinforcement
Limitations • Artificial decoupling of text and links • Connectivity-based topic drift (HITS): “movie awards” drifts to “movies”; expanders at www.web-popularity.com • Feature diffusion (Google): “more evil than evil” returns www.microsoft.com; new threat of anchor-text spamming • Decoupled ranking (Google): “harvard mother” returns Bill Gates's bio page!
Genealogy [Figure: genealogy of hyperlink mining: bibliometry; HITS; Google (exploiting anchor text); outlier elimination; topic distillation @Compaq; Clever @IBM; text classification; hypertext classification; relaxation labeling; focused crawling; crawling context graphs; learning topic paths]
Reducing topic drift: anchor text • Page modeled as a sequence of tokens and outlinks • “Radius of influence” around each token • A query term matching a token increases the weight of nearby links • Favors hubs and authorities near relevant pages • Better answers than HITS • Ad-hoc “spreading activation”, but no formal model as yet [Figure: a query term's radius of influence boosting nearby outlinks]
Reducing topic drift: outlier detection • Search response is usually ‘purer’ than the radius-1 expansion • Compute document term vectors • Compute the centroid of the response vectors • Eliminate expanded vectors far from the centroid • Results improve • Why stop at radius=1? [Figure: vector-space document model: centroid of the keyword search response with a cut-off radius excluding outliers in the expanded graph]
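A small sketch of the centroid-and-cutoff idea, assuming sparse TF-IDF-style term vectors represented as dicts; the cosine cutoff value is an illustrative assumption.

```python
# Sketch: prune expanded pages whose term vectors lie far from the
# centroid of the keyword-search response (cutoff is an assumption).
import math

def centroid(vectors):
    c = {}
    for v in vectors:
        for term, w in v.items():
            c[term] = c.get(term, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_expansion(response_vecs, expanded_vecs, cutoff=0.2):
    # keep only expanded documents near the response centroid
    c = centroid(response_vecs)
    return [v for v in expanded_vecs if cosine(c, v) >= cutoff]
```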
Resource discovery • Given • Yahoo-like topic tree with example URLs • A selection of good topics to explore • Examples, not queries, define topics • Need 2-way decision, not ad-hoc cut-off • Goal • Start from the good / relevant examples • Crawl to collect additional relevant URLs • Fetch as few irrelevant URLs as possible
A model for relevance [Figure: topic taxonomy rooted at All, with children Arts, Bus&Econ, Recreation, …; Bus&Econ leads to Companies, Recreation to Cycling, and Cycling to Bike Shops, Clubs, and Mt.Biking; nodes are marked as good classes, path classes, blocked classes, and subsumed classes]
Pr(c|d) from Pr(d|c) using Bayes rule • Decide the topic: topic c is picked with prior probability π(c), Σc π(c) = 1 • Each c has parameters θ(c,t) for terms t: a coin with face probabilities θ(c,t), Σt θ(c,t) = 1 • Fix the document length n(d) and toss the coin that many times • Given c, the probability of the document is Pr(d|c) = Πt∈d θ(c,t)^n(d,t), where n(d,t) is the number of times t occurs in d • Naïve yet effective; can use other algorithms
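A compact sketch of this multinomial naive Bayes model with Laplace smoothing; the training interface and the smoothing choice are illustrative assumptions.

```python
# Multinomial naive Bayes sketch matching the slide's generative model:
# pi(c) = class prior, theta(c,t) = per-class term probabilities.
import math
from collections import Counter, defaultdict

def train(docs):  # docs: list of (class_label, token_list)
    prior, theta, vocab = Counter(), defaultdict(Counter), set()
    for c, tokens in docs:
        prior[c] += 1            # estimate pi(c) from class frequencies
        theta[c].update(tokens)  # term counts estimate theta(c,t)
        vocab.update(tokens)
    return prior, theta, vocab

def classify(tokens, prior, theta, vocab):
    n = sum(prior.values())
    best, best_lp = None, -math.inf
    for c in prior:
        total = sum(theta[c].values())
        lp = math.log(prior[c] / n)
        for t in tokens:  # Laplace smoothing handles unseen terms
            lp += math.log((theta[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best  # argmax_c Pr(c) * Pr(d|c)
```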
Enhanced models for hypertext • c = class, d = text, N = neighbors • Text-only model: Pr(d | c) • Using neighbors' text to judge my topic: Pr(d, d(N) | c) • Better recursive model: Pr(d, c(N) | c) • Relaxation labeling over Markov random fields • Or, an EM formulation
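One way to picture the recursive model is a relaxation-labeling loop that repeatedly blends a page's own text posterior with its neighbors' current class distributions; the coupling weights, mixing rule, and iteration count below are assumptions for illustration, not the exact published estimator.

```python
# Relaxation-labeling sketch: iteratively re-estimate each page's class
# distribution from its text evidence and its neighbors' current labels.
def relax(pages, neighbors, text_post, coupling, iters=10):
    # pages: ids; neighbors[p]: linked pages; text_post[p][c] ~ Pr(c | text)
    # coupling[c1][c2]: how strongly a class-c1 neighbor suggests class c2
    post = {p: dict(text_post[p]) for p in pages}
    for _ in range(iters):
        nxt = {}
        for p in pages:
            scores = {}
            for c in text_post[p]:
                # combine own text evidence with neighbor label evidence
                nb = sum(post[q][c2] * coupling[c2][c]
                         for q in neighbors[p] for c2 in post[q])
                scores[c] = text_post[p][c] * (1.0 + nb)
            z = sum(scores.values()) or 1.0
            nxt[p] = {c: s / z for c, s in scores.items()}  # renormalize
        post = nxt
    return post
```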
Hyperlink modeling boosts accuracy • 9600 patents from 12 classes marked by USPTO • Patents have text and prior-art links • Expand each test patent to include its neighborhood • ‘Forget’ a fraction of the neighbors' classes and re-estimate them • (Even better for Yahoo)
Resource discovery: basic approach • Topic taxonomy with examples and ‘good’ topics specified • Crawler coupled to a hypertext classifier • Crawl frontier expanded in relevance order (see the crawl-loop sketch below) • Neighbors of good hubs expanded with high priority [Figure: example URLs seed the crawl; radius-1 and radius-2 expansion rules grow the frontier around relevant pages]
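A best-first crawl loop conveys the idea; fetch_page, extract_links, and relevance are placeholders standing in for the real crawler workers and hypertext classifier, and the 0.5 relevance threshold is an assumption.

```python
# Focused-crawl loop sketch: a best-first frontier ordered by the
# classifier's relevance score for the page the link came from.
import heapq

def focused_crawl(seeds, relevance, fetch_page, extract_links, max_fetch=1000):
    frontier = [(-1.0, u) for u in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, harvested, fetched = set(seeds), [], 0
    while frontier and fetched < max_fetch:
        _, url = heapq.heappop(frontier)
        page = fetch_page(url)
        fetched += 1
        score = relevance(page)  # Pr(good topic | page) from the classifier
        if score > 0.5:
            harvested.append((url, score))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # radius-1 rule: neighbors of relevant pages get priority
                heapq.heappush(frontier, (-score, link))
    return harvested
```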
Focused crawler block diagram [Figure: a taxonomy editor, example browser, and topic distiller maintain the taxonomy database, with feedback into the scheduler; crawl workers fill the crawl database; the hypertext classifier learns topic models from the taxonomy database and applies them to newly crawled pages]
Focused crawling evaluation • Harvest rate • What fraction of crawled pages are relevant • Robustness across seed sets • Perform separate crawls with random disjoint samples • Measure overlap in URLs, server IP addresses, and best-rated resources • Evidence of non-trivial work • Path length to the best resources
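As a concrete reading of the first measure, harvest rate can be tracked as a moving average of per-page relevance judgments; the window size here is an arbitrary assumption.

```python
# Sketch: moving-average harvest rate over a crawl trace, where
# relevant_flags[i] is 1 if the i-th fetched page was judged relevant.
def harvest_rate(relevant_flags, window=100):
    rates = []
    for i in range(len(relevant_flags)):
        recent = relevant_flags[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates

print(harvest_rate([1, 1, 0, 1, 0, 0, 1], window=4))
```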
Harvest rate [Figure: harvest rate over time, focused vs. unfocused crawls]
Crawl robustness [Figure: URL overlap and server overlap between crawl 1 and crawl 2, started from disjoint seed sets]
Robustness of resource quality • Sample disjoint sets of starting URLs • Run two separate crawls • Run HITS/Clever and find the best authorities • Order by rank • Find the overlap in the top-rated resources
Distance to best resources [Figure: distance from the seed set to the best resources; cycling: cooperative, mutual funds: competitive]
A top hub on ‘airlines’ after half an hour of focused crawling
A top hub on ‘bicycling’ after one hour of focused crawling
Learning context graphs • Topics form connected cliques • “heart disease” links to ‘swimming’ and ‘hiking’; ‘cycling’ links to “first-aid”! • Radius-1 rule can be myopic: trapped within boundaries of related topics • From short pre-crawled paths, can learn frequent chains of related topics • Use this knowledge to circumvent local “topic traps” (a layer-scoring sketch follows)
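In the spirit of the context-graph idea, a crawler can score a page by which link-distance "layer" it most resembles, so promising off-topic pages still get crawled; the per-layer classifiers and the priority formula are assumptions sketched here, not the published estimator.

```python
# Sketch: prioritize a page by its most likely "layer", i.e. its
# estimated link distance from on-topic pages, using classifiers
# assumed to be trained on short pre-crawled paths.
def crawl_priority(page, layer_classifiers):
    # layer_classifiers[i](page) ~ Pr(page is i links away from the topic)
    probs = [clf(page) for clf in layer_classifiers]
    best_layer = probs.index(max(probs))
    return 1.0 / (1 + best_layer)  # nearer layers get higher priority
```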
Roadmap • Hyperlink mining: a short history • Resource discovery • Content-based locality in hypertext • Taxonomy models, topic distillation • Strategies for focused crawling • Data capture and mining architecture • The Memex collaboration system • Collaborative construction of vertical portals • Link metadata management architecture • Surfing backwards on the Web
Memex project goals • Infrastructure to support spontaneous formation of topic-based communities • Mining algorithms for personal and community level topic management and collaborative resource discovery • Extensible API for plugging in additional hypertext analysis tools
Memex project status • Java applet client • Netscape 4.5+ (Javascript) available • IE4+ (ActiveX) planned • Server code for Unix and Windows • Servlets + IBM Universal Database • Berkeley DB lightweight storage manager • Simple-to-install RPMs for Linux planned • About a dozen alpha testers • First beta available 12/2000
Creating personal topic spaces • Valuable user input and feedback on topics and associated examples [Screenshot: file-manager-like interface; the user cuts and pastes to correct or reinforce the Memex classifier; ‘?’ indicates automatic placement by the Memex classifier; a privacy choice controls sharing]
Replaying topic-based contexts • “Where was I when last surfing around /Software/Programming?” [Screenshot: choice of topic context; replay of the recent browsing context restricted to the chosen topic; active browser monitoring with dynamic layout of the new/incremental context graph; better mobility than the one-dimensional history of popular browsers]
Synthesis of a community taxonomy • Users classify URLs into folders • How to synthesize personal folders into a common taxonomy? • Combine multiple similarity hints: shared documents, shared folders, shared terms (a similarity sketch follows) [Figure: themes ‘Radio’, ‘Television’, ‘Movies’ relating personal folders such as Media (kpfa.org, bbc.co.uk, kron.com), Broadcasting (channel4.com, kcbs.com), Entertainment (foxmovies.com, lucasfilms.com), and Studios (miramax.com)]
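One simple way to combine such hints is a weighted blend of overlap scores between two users' folders; the Jaccard measures and weights below are illustrative assumptions, not the deployed synthesis algorithm.

```python
# Sketch: blend shared-document and shared-term evidence to decide
# whether two personal folders describe the same community topic.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def folder_similarity(f1, f2, w_doc=0.6, w_term=0.4):
    # f1, f2: dicts with 'urls' and 'terms' as sets
    return (w_doc * jaccard(f1["urls"], f2["urls"])
            + w_term * jaccard(f1["terms"], f2["terms"]))

media = {"urls": {"kpfa.org", "bbc.co.uk"}, "terms": {"radio", "news"}}
broadcasting = {"urls": {"bbc.co.uk", "kcbs.com"}, "terms": {"radio", "tv"}}
print(folder_similarity(media, broadcasting))
```

Folders whose similarity clears a threshold can then be merged bottom-up into the shared taxonomy.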
Setting up the focused crawler [Screenshot: taxonomy editor showing current examples; suggested additional examples can be dragged in]
Monitoring harvest rate [Figure: relevance/harvest rate over time; one point per URL with a moving average]
Overview of the Memex system [Figure: a browser running the client applet (JAR) talks to the Memex server via the Memex client-server protocol with workload-sharing negotiations; event-handler servlets capture visit, attach, folder, and download events; mining demons handle taxonomy synthesis, resource discovery, search, recommendation, context classification, clustering, and archiving over relational metadata, a text index, and topic models]
Surfing backwards using contexts • Space-bounded referrer log (a sketch follows) • HTTP extension to query backlink data [Figure: client C fetches http://S2/P2 after http://S1/P1, sending GET /P2 HTTP/1.0 with Referer: http://S1/P1; server S2 records the link in a backlink database, local or on the Memex server, so another client C' can ask “who points to S2/P2?”]
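A minimal sketch of such a space-bounded referrer log, assuming an LRU bound on targets and a cap on referrers per target; the class and its limits are illustrative, not the protocol proposed to the W3C.

```python
# Sketch of a space-bounded referrer log answering backlink queries.
from collections import OrderedDict

class BacklinkLog:
    def __init__(self, max_targets=10000, per_target=50):
        self.log = OrderedDict()  # target URL -> recent referrer URLs
        self.max_targets, self.per_target = max_targets, per_target

    def record(self, target, referer):
        # called when serving GET <target> with a Referer header
        refs = self.log.setdefault(target, [])
        if referer not in refs:
            refs.append(referer)
            del refs[:-self.per_target]  # keep only the newest referrers
        self.log.move_to_end(target)
        while len(self.log) > self.max_targets:  # bound total space (LRU)
            self.log.popitem(last=False)

    def backlinks(self, target):
        # answer "who points to target?" from the local log
        return self.log.get(target, [])
```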
User study and analysis • (1999) Significant improvement in finding comprehensive resource lists • Six broad information needs, 25 volunteers • Find good resources within a limited time • Backlinks faked using search engines • Blind-reviewed by three other volunteers • (2000) The average path length of the undirected Web graph is much smaller than that of the directed Web graph • (2000) Better focused crawls using backlinks • Proposal to W3C
Backlinks improve focused crawling • Follow forward HREFs as before • Also expand backlinks using ‘link:’ queries • Classify pages as before [Figure: backlink expansion sometimes distracts in unrewarding work, but pays off in the end]
Surfing backwards: summary • “Life must be lived forwards, but it can only be understood backwards” —Soren Kierkegaard • Hubs are everywhere! • To find them, look backwards • Bidirectional surfing is a valuable means to seed focused resource discovery • Even if one has to depend on search engines initially for link:… queries
Conclusion • Architecture for topic-specific web resource discovery • Driven by examples collected from surfing and bookmarking activity • Reduced dependence on large crawlers • Modest desktop hardware adequate • Variable radius goal-directed crawling • High harvest rate • High quality resources found far from keyword query response nodes