Enhanced topic distillation using text, markup tags, and hyperlinks
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen
Topic distillation
• Given a keyword query or some example URLs
• Collect a relevant subgraph (community) of the Web
• Bipartite reinforcement between hubs and authorities
• Prototypes:
• HITS, Clever, SALSA
• Bharat and Henzinger
[Figure: a keyword query goes to a search engine, which returns a root set; the root set is expanded into the base set]
Two issues
• How to collect the base set
• Radius-1 expansion is arbitrary
• Content relevance must play a role
• How to spread prestige along links
• Instability of HITS (Borodin; Lempel; Zheng)
• Stability of PageRank (Zheng)
• Stochastic variants of HITS (Lempel)
• Need better recall when collecting the base graph
• Need accurate 'boundaries' around it
Challenges and limitations
• Topic distillation results are deteriorating
• Web authoring style has been in flux since 1996
• Complex pages, templates, cloaks
• File or page boundaries are less meaningful
• "Clique attacks": rampant multi-host 'nepotism' via rings, ads, banner exchanges
• Models are too simplistic
• Hub and authority symmetry is illusory
• Coarse-grained hub model 'leaks' authority
• Ad-hoc linear segmentation is not content-aware
Clique attacks!
[Figure: relevant regions of a page lead to its inclusion in the base set, while irrelevant links form a pseudo-community]
Benign drift and generalization
[Figure: one section of a hub page specializes on 'Shakespeare'; the remaining sections generalize and/or drift]
A fine-grained hypertext model
[Figure: Document Object Model (DOM) tree of the page below. The path html → body → table branches into an irrelevant subtree of anchors to art.qaz.com and ski.qaz.com and a relevant subtree of list items linking to www.fromages.com and www.teddingtoncheese.co.uk; a 'frontier of differentiation' separates the two.]
<html>…<body>…
  <table …>
    <tr><td>
      <table …>
        <tr><td><a href="http://art.qaz.com">art</a></td></tr>
        <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
      </table>
    </td></tr>
    <tr><td>
      <ul>
        <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
        <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
        …
      </ul>…
    </td></tr>
  </table>…
</body></html>
Preliminary approaches
• Apply HITS to the fine-grained base graph
• Blocked reinforcement
• Model DOM trees as resistance or flow networks
• Ad-hoc decay factors
• Apply B&H outlier elimination to every DOM node
• Hot absorbs cold, includes drift-enhancing links
[Figure: DOM subtrees scored 'hot', 'warm enough to figure as one hub', and 'cold']
Generative model for hub text
• Global hub text distribution Θ0 relevant to the given query
• Authors use internal DOM nodes to hierarchically specialize Θ0 into per-subtree distributions Θi
• At a certain frontier, local models are 'frozen' and text is generated
[Figure: the global term distribution Θ0 is progressively 'distorted' down the DOM tree to the model frontier Θi; other pages share Θ0]
Examples using the binary model • Binary model: • Code length for document d • Cost for specializing a term distribution
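The formulas on this slide did not survive extraction. A hedged reconstruction of the standard binary (Bernoulli) term model the bullets refer to, under assumed notation (θ_t is the probability that term t appears in a document):

```latex
% Binary model: each term t is either present in document d or absent.
\Pr(d \mid \Theta) \;=\; \prod_{t \in d} \theta_t \,\prod_{t \notin d} (1 - \theta_t)

% Code length for document d (negative log-likelihood under \Theta):
L(d \mid \Theta) \;=\; -\sum_{t \in d} \log \theta_t \;-\; \sum_{t \notin d} \log (1 - \theta_t)

% Cost of specializing a term distribution \Theta_u into \Theta_v:
% per-term KL divergence between the Bernoulli parameters.
KL(\Theta_u \parallel \Theta_v) \;=\; \sum_t \left[ \theta_{u,t} \log \frac{\theta_{u,t}}{\theta_{v,t}}
  \;+\; (1 - \theta_{u,t}) \log \frac{1 - \theta_{u,t}}{1 - \theta_{v,t}} \right]
```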
Discovering the frontier
• Use Θu to directly generate text snippets in the subtree rooted at u
• Expand to children v and use different parameters for each subtree
• Greedily pick the better local choice
• Cumulative distortion cost = KL(Θ0‖Θu) + … + KL(Θu‖Θv)
[Figure: reference distribution Θ0 at the root; an internal node u with child v and its data Dv below the frontier]
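The greedy choice above can be sketched as a recursion over the DOM tree: at each node, either freeze the model (pay the specialization cost plus the code length of the subtree's text) or expand to the children (pay the specialization cost and recurse). This is a minimal sketch under assumed data structures (`DomNode`, smoothed multinomial distributions), not the authors' implementation:

```python
import math
from dataclasses import dataclass, field

@dataclass
class DomNode:
    counts: dict                      # term -> count in text directly at this node
    children: list = field(default_factory=list)

def distribution(counts, vocab, eps=1e-3):
    """Smoothed maximum-likelihood term distribution over a fixed vocabulary."""
    total = sum(counts.values()) + eps * len(vocab)
    return {t: (counts.get(t, 0) + eps) / total for t in vocab}

def kl(p, q):
    """KL divergence KL(p || q) between two term distributions."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

def code_length(counts, theta):
    """Code length (negative log-likelihood) of observed term counts under theta."""
    return -sum(c * math.log(theta[t]) for t, c in counts.items())

def subtree_counts(node):
    """Pooled term counts of the whole subtree rooted at `node`."""
    total = dict(node.counts)
    for child in node.children:
        for t, n in subtree_counts(child).items():
            total[t] = total.get(t, 0) + n
    return total

def find_frontier(node, theta_parent, vocab):
    """Greedy frontier search: either freeze the local model Theta_u here,
    or pay the distortion KL(theta_parent || Theta_u) and recurse, so the
    cumulative cost along a path is KL(Theta_0||Theta_u) + ... + KL(Theta_u||Theta_v)."""
    counts = subtree_counts(node)
    theta_u = distribution(counts, vocab)
    freeze = kl(theta_parent, theta_u) + code_length(counts, theta_u)
    if not node.children:
        return freeze, [node]
    expand, frontier = kl(theta_parent, theta_u), []
    for child in node.children:
        c, f = find_frontier(child, theta_u, vocab)
        expand += c
        frontier.extend(f)
    return (freeze, [node]) if freeze <= expand else (expand, frontier)
```

On a toy tree whose two children have sharply different term distributions (say all-'ski' vs. all-'art'), the recursion places the frontier at the children; a homogeneous subtree is frozen as a single micro-hub.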
Exploiting co-citation in our model
1. Initial values of leaf hub scores = target authority scores ('known' authorities)
2. Segment the tree using hub scores
3. Frontier micro-hubs accumulate the scores of their leaves
4. Aggregate hub scores are copied back to the leaves: a non-linear transform, unlike HITS
• Leaves co-cited with known authorities: we have reason to believe these could be good too
[Figure: leaf hub scores (e.g. 0.10, 0.20, 0.13) are accumulated at frontier micro-hubs and redistributed to all leaves]
Complete algorithm
• Collect the root set and base set
• Pre-segment using text and mark micro-hubs to be pruned
• Assign authority score 1 to root-set nodes only
• Iterate:
• Transfer from authorities to hub leaves
• Re-segment hub DOM trees using link + text evidence
• Smooth and redistribute hub scores
• Transfer from hub leaves to authority roots
• Report top authorities and 'hot' micro-hubs
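The iterate step can be sketched under a deliberately simplified toy representation: a page is a list of micro-hubs, each micro-hub a list of link leaves, each leaf a list of target URLs. In the real system the micro-hub segmentation is recomputed each round from link and text evidence; here it is taken as fixed, so this is illustrative, not the authors' code:

```python
def distill(pages, root_set, iterations=20):
    """Fine-grained distillation loop (sketch).
    pages: url -> list of micro-hubs; micro-hub: list of leaves;
    leaf: list of target URLs. Only root-set nodes start with authority 1."""
    auth = {u: (1.0 if u in root_set else 0.0) for u in pages}
    for _ in range(iterations):
        hub = {}  # (page, hub_index, leaf_index) -> hub score
        # 1. Transfer authority scores to the hub leaves that cite them.
        for u, micro_hubs in pages.items():
            for i, leaves in enumerate(micro_hubs):
                for j, targets in enumerate(leaves):
                    hub[(u, i, j)] = sum(auth.get(t, 0.0) for t in targets)
        # 2. Smooth: each frontier micro-hub accumulates its leaves' scores
        #    and copies the aggregate back (non-linear, unlike HITS).
        for u, micro_hubs in pages.items():
            for i, leaves in enumerate(micro_hubs):
                total = sum(hub[(u, i, j)] for j in range(len(leaves)))
                for j in range(len(leaves)):
                    hub[(u, i, j)] = total
        # 3. Transfer hub-leaf scores back to authority roots, then normalize.
        auth = dict.fromkeys(pages, 0.0)
        for u, micro_hubs in pages.items():
            for i, leaves in enumerate(micro_hubs):
                for j, targets in enumerate(leaves):
                    for t in targets:
                        if t in auth:
                            auth[t] += hub[(u, i, j)]
        norm = max(auth.values()) or 1.0
        auth = {t: s / norm for t, s in auth.items()}
    return auth
```

A page `b` co-cited with a known root-set authority `a` inside the same micro-hub picks up a non-zero score, which is the co-citation effect from the previous slide.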
Experimental setup
• Large data sets
• 28 queries from Clever, >20 topics from Dmoz
• Collect 2000 to 10000 pages per query/topic
• Several million DOM nodes and fine-grained links
• Find top authorities using the various algorithms
• Measurements + anecdotes
• For each ad-hoc query, measure the cosine similarity of the authorities to the root-set centroid in vector space
• Compare HITS, DOM, DOM+Text
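The cosine-to-centroid measurement is standard vector-space machinery and can be sketched as follows (raw term frequencies here; the evaluation may well have used TF-IDF weights, which the slide does not specify):

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Mean term-frequency vector of the root-set documents."""
    c = Counter()
    for v in vectors:
        c.update(v)
    return {t: n / len(vectors) for t, n in c.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A reported score is then the average of `cosine(tf_vector(auth_page), centroid(root_set_vectors))` over the top authorities returned by each algorithm.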
Avoiding topic drift via micro-hubs
[Figure: for the query 'cycling' there is no danger of topic drift; for the query 'affirmative action' there is topic drift from software sites]
Empirical convergence
• Convergence for all queries within 20 iterations
• Faster convergence for drift-free graphs, slower for graphs that posed a danger of topic drift
• Very important not to set all authority scores > 0 (only root-set nodes start at 1)
Results for the Clever benchmark
• Take the top 40 authorities
• Find their average cosine similarity to the root-set centroid
• Similarity: HITS < DOM+Text < DOM
• DOM alone cannot prune well enough: most of its top authorities come from the root set
• HITS drifts often
Dmoz experiments and results
• 223 topics from http://dmoz.org
• Sample root-set URLs from a class c
• Top authorities not in the root set are submitted to the Rainbow classifier
• Σd Pr(c|d) is the expected number of relevant documents
• DOM+Text is best
[Figure: sample a DMoz class (e.g. Music) to form the root set and train the Rainbow classifier; test it on the top authorities of the expanded set]
Anecdotes
• "amusement parks": http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
• The new algorithm reduces drift
• Mixed hubs are accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
• Mixed hubs and clique attacks are rampant
Application: surfing like humans
Focused crawling:
• Train a topic classifier
• Initialize a priority queue with a few sample URLs about a topic; assume they have relevance = 1
• Repeat:
• Fetch the page most relevant to the topic
• Estimate its relevance R using the classifier
• Guess that all outlinks have relevance R
• Add the outlinks to the priority queue
• Problem: the average out-degree is too high (~10)
• Discovering irrelevance after 10X more work
• Can we use DOM and text to bias the 'walk'?
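The baseline loop above can be sketched with a best-first frontier; `fetch`, `outlinks`, and `relevance` are assumed callables standing in for the HTTP fetcher, link extractor, and topic classifier, not a real API:

```python
import heapq

def focused_crawl(seeds, fetch, outlinks, relevance, budget=1000):
    """Baseline focused crawler (sketch): fetch the highest-priority page,
    score it with the topic classifier, and let every outlink inherit that
    score as its priority."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        r = relevance(page)            # classifier estimate Pr(topic | page)
        crawled.append((url, r))
        # Guess that all outlinks are as relevant as their parent.
        for link in outlinks(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-r, link))
    return crawled
```

Because every child inherits the parent's score, an irrelevant link on a highly relevant page is fetched just as eagerly as a relevant one; with ~10 outlinks per page this is the "10X more work" above, and it is exactly where DOM- and text-based biasing of the priorities can help.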
Preliminary results
[Figure: features collected from the source-page DOM feed a meta-learner layered on a standard focused crawler; relevance estimates R1 and R2 provide feedback that separates promising from unpromising 'clicks']
Summary
• Hypertext shows complex idioms that are missed by the coarse-grained graph model
• Enhanced fine-grained distillation:
• Identifies content-bearing 'hot' micro-hubs
• Disaggregates hub scores
• Reduces topic drift from mixed hubs and pseudo-communities
• Application: online reinforcement learning
• Need probabilistic combination of evidence from text and links