Enhanced topic distillation using text, markup tags, and hyperlinks
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen
Topic distillation
• Given a query or some example URLs
• Collect a relevant subgraph (community) of the Web
• Bipartite reinforcement between hubs and authorities
• Prototypes:
  • HITS and Clever
  • Bharat and Henzinger
[Figure labels: keyword query, search engine, root set, expanded set]
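The bipartite reinforcement between hubs and authorities can be sketched as a short power iteration. This is a minimal illustration of the classic HITS update, not the authors' implementation; the function name `hits`, the edge-list input format, and the L2 normalization are assumptions made for the example.

```python
def hits(edges, iterations=50):
    """Classic HITS on a list of (hub_page, authority_page) links.

    Returns (hub_scores, authority_scores) as dicts keyed by page.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to it.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = new_hub
        # L2-normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

Note that every page is treated as a single hub node here; that whole-page granularity is the coarse-grained model the later slides refine.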
Challenges and limitations
• Web authoring style in flux since 1996
  • Complex pages generated from templates
  • File or page boundary less meaningful
  • “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
• Models are too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model ‘leaks’ authority
  • Ad-hoc linear segmentation not content-aware
• Deteriorating results of topic distillation
Clique attacks!
[Figure labels: irrelevant links form a pseudo-community; relevant regions that lead to inclusion of the page in the base set]
Benign drift and generalization
[Figure labels: this section specializes on ‘Shakespeare’; the remaining sections generalize and/or drift]
A new fine-grained model
[Figure: Document Object Model (DOM) tree of the page below (html, head, body, table, tr, td, ul, li, a), with a ‘frontier of differentiation’ separating a relevant subtree from an irrelevant subtree; the leaf links point to art.qaz.com, ski.qaz.com, www.fromages.com and www.teddingtoncheese.co.uk]
<html>…<body>…
  <table …>
    <tr><td>
      <table …>
        <tr><td><a href="http://art.qaz.com">art</a></td></tr>
        <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
      </table>
    </td></tr>
    <tr><td>
      <ul>
        <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
        <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
        …
      </ul>…
    </td></tr>
  </table>…
</body></html>
Generative model for hub text
• Global hub text distribution $\Phi_0$ relevant to given query
• Authors use internal DOM nodes to specialize $\Phi_0$ into $\Phi_I$
• At a certain frontier in the DOM tree, local distribution directly generates text in ‘hot’ and ‘cold’ subtrees
[Figure labels: global term distribution $\Phi_0$; progressive ‘distortion’; model frontier; $\Phi_I$; other pages]
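A balanced cost measure
• Reference distribution $\Phi_0$
• Cumulative distortion cost $= \mathrm{KL}(\Phi_0; \Phi_u) + \dots + \mathrm{KL}(\Phi_u; \Phi_v)$ (for exponential distribution)
• Data encoding cost for the data $D_v$ at node $v$ is roughly …
• Goal: find the minimum-cost frontier
[Figure labels: nodes u and v on a path below the reference distribution; data $D_v$]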
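The cumulative distortion cost can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the models are taken to be smoothed discrete term distributions (dicts with no zero entries), KL is measured in bits with the first argument as the reference, and the names `kl` and `cumulative_distortion` are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence KL(p; q) in bits between two discrete distributions.

    Both are dicts over the same terms; q is assumed to be smoothed
    so that q[t] > 0 wherever p[t] > 0.
    """
    return sum(pt * math.log2(pt / q[t]) for t, pt in p.items() if pt > 0)

def cumulative_distortion(path):
    """Sum of KL costs between consecutive models on a root-to-node path.

    path[0] plays the role of the reference distribution; each later
    entry is the model at the next DOM node down the path.
    """
    return sum(kl(parent, child) for parent, child in zip(path, path[1:]))

# Toy example: a model that specializes step by step towards 'cheese'.
phi_0 = {"cheese": 0.5, "ski": 0.5}
phi_u = {"cheese": 0.75, "ski": 0.25}
phi_v = {"cheese": 0.9, "ski": 0.1}
cost = cumulative_distortion([phi_0, phi_u, phi_v])
```

Each extra level of specialization adds a non-negative term, so pushing the frontier deeper is only worthwhile when the data below is encoded correspondingly more cheaply.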
Marking ‘hot’ subtrees
• Hard to solve exactly (knapsack)
• (1+ε)-approximate dynamic programming solution
  • Too slow for 10 million DOM nodes
• Greedy expansion approach: at each node v, compare the cost of
  • Directly encoding $D_v$ w.r.t. the model $\Phi_v$ at v
  • First distorting $\Phi_v$ to $\Phi_w$ for each child w of v, then encoding all $D_w$ w.r.t. the respective $\Phi_w$
• If the latter is smaller, expand v; else prune
• Mark relevant subtrees as “must-prune”
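The greedy expand-or-prune rule can be sketched as a recursive walk over a toy DOM tree. Everything specific here is an assumption for illustration: the toy vocabulary `VOCAB`, the add-alpha smoothed multinomial models, negative log-likelihood as the data encoding cost, KL as the distortion cost, and the names `Node`, `model`, `encode_cost` and `frontier`. The slides do not spell out these cost formulas.

```python
import math
from collections import Counter

VOCAB = ("cheese", "ski", "art")  # hypothetical toy vocabulary

class Node:
    """Toy DOM node; data holds the term counts of its whole subtree."""
    def __init__(self, terms=(), children=(), must_prune=False):
        self.children = list(children)
        self.must_prune = must_prune
        self.data = Counter(terms)
        for child in self.children:
            self.data.update(child.data)

def model(data, alpha=0.5):
    """Add-alpha smoothed term distribution estimated from term counts."""
    total = sum(data.values()) + alpha * len(VOCAB)
    return {t: (data[t] + alpha) / total for t in VOCAB}

def encode_cost(data, phi):
    """Bits needed to encode the terms in data under model phi."""
    return -sum(n * math.log2(phi[t]) for t, n in data.items())

def kl(p, q):
    """Distortion cost of moving from model p to model q, in bits."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in VOCAB)

def frontier(v):
    """Greedy frontier: expand v only if its children encode more cheaply."""
    if v.must_prune or not v.children:
        return [v]
    phi_v = model(v.data)
    prune_cost = encode_cost(v.data, phi_v)
    expand_cost = 0.0
    for w in v.children:
        phi_w = model(w.data)
        expand_cost += kl(phi_v, phi_w) + encode_cost(w.data, phi_w)
    if expand_cost < prune_cost:
        return [f for w in v.children for f in frontier(w)]
    return [v]
```

On a mixed hub whose two subtrees talk about different things, the children's specialized models pay for their distortion cost and the node is expanded; on a homogeneous hub the page stays a single unit.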
Exploiting co-citation in our model
1. Initial values of leaf hub scores = target auth scores
2. Must-prune nodes are marked
3. Frontier microhubs accumulate scores
4. Aggregate hub scores are copied back to leaves
• Leaves next to ‘known’ authorities: we have reason to believe these could be good too
• Non-linear transform, unlike HITS
[Figure values: initial leaf hub scores 0.10, 0.20, 0.01, 0.06, 0.05, 0.13; one frontier microhub aggregates to 0.12; final leaf scores 0.10, 0.20, 0.12, 0.12, 0.12, 0.13]
Complete algorithm
• Collect root set and base set
• Pre-segment using text and mark relevant micro-hubs to be pruned
• Set authority scores to 1 for root-set pages only
• Iterate
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link + text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
• Report top authorities and ‘hot’ microhubs
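The iteration above can be sketched as follows. This is a simplified illustration, not the authors' code: segmentation is delegated to a caller-supplied `segment` function (hypothetical), each hub leaf is assumed to carry exactly one link, smoothing is assumed to copy the plain sum of a micro-hub's leaf scores back to its leaves, and authority scores are normalized to sum to 1.

```python
def distill(links, root_set, segment, iterations=20):
    """Fine-grained distillation loop over (hub_page, leaf_id, authority) links.

    segment(hub_page, leaf_scores) must return a list of micro-hubs,
    each a list of leaf ids of that hub page (disjoint groups).
    """
    links = list(links)
    # Only root-set pages start with authority score 1.
    auth = {a: (1.0 if a in root_set else 0.0) for _, _, a in links}
    for _ in range(iterations):
        # Transfer from authorities to hub leaves.
        leaf = {(h, l): auth[a] for h, l, a in links}
        # Re-segment each hub page, then smooth within each micro-hub.
        by_hub = {}
        for (h, l), score in leaf.items():
            by_hub.setdefault(h, {})[l] = score
        for h, scores in by_hub.items():
            for group in segment(h, scores):
                total = sum(scores[l] for l in group)
                for l in group:
                    leaf[(h, l)] = total
        # Transfer from hub leaves back to authorities, then normalize.
        auth = {a: 0.0 for a in auth}
        for h, l, a in links:
            auth[a] += leaf[(h, l)]
        norm = sum(auth.values()) or 1.0
        auth = {a: s / norm for a, s in auth.items()}
    return auth

# Toy data: hub H mixes two cheese links with two unrelated links.
links = [("H", 0, "c1"), ("H", 1, "c2"), ("H", 2, "x1"), ("H", 3, "x2"),
         ("G", 0, "c1"), ("G", 1, "c2")]
fine = distill(links, {"c1"}, lambda h, s: {"H": [[0, 1], [2, 3]], "G": [[0, 1]]}[h])
coarse = distill(links, {"c1"}, lambda h, s: [list(s)])
```

With whole-page segmentation (`coarse`) the unrelated targets `x1` and `x2` pick up authority from the mixed hub, while the micro-hub segmentation (`fine`) keeps their scores at zero.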
Experimental setup
• Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2,000 to 10,000 pages per query/topic
  • Several million DOM nodes and fine links
• Find top authorities using various algorithms
• For ad-hoc queries, measure cosine similarity of authorities with the root-set centroid in vector space
• For Dmoz, use an automatic classifier…
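The ad-hoc query measure can be sketched as below. The sparse term-weight dicts, the unweighted mean used for the centroid, and the function names are assumptions for illustration; the slides do not say how documents were vectorized.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def mean_similarity(authorities, root_set):
    """Average cosine similarity of the authorities to the root-set centroid."""
    c = centroid(root_set)
    return sum(cosine(a, c) for a in authorities) / len(authorities)

# Hypothetical toy vectors for the query 'cycling'.
root = [{"cycling": 2, "bike": 1}, {"cycling": 1, "race": 1}]
on_topic = {"cycling": 3, "bike": 1}
off_topic = {"software": 5}
```

A drifting result list contains authorities like `off_topic`, which share no terms with the root set and pull the average down.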
Avoiding topic drift via micro-hubs
[Figure panels: “Query: cycling” (no danger of topic drift); “Query: affirmative action” (topic drift from software sites)]
Results for the Clever benchmark
• Take top 40 authorities
• Find average cosine similarity to root-set centroid
• Similarity: HITS < DOM+Text < DOM
• DOM alone cannot prune well enough: most top authorities come from the root set
• HITS drifts often
Dmoz experiments and results
• 223 topics from http://dmoz.org
• Sample root set URLs from a class c
• Top authorities not in root set submitted to Rainbow classifier
• $\sum_d \Pr(c \mid d)$ is the expected number of relevant documents
• DOM+Text best
[Figure labels: Dmoz; sample; train; test; Rainbow classifier; Music; root set; expanded set; top authority]
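The relevance measure is a sum of classifier posteriors over the returned authorities. A tiny worked example, with made-up posterior values standing in for the classifier's output (the helper name `expected_relevant` is hypothetical):

```python
def expected_relevant(posteriors):
    """Expected number of relevant documents: the sum of Pr(c | d) over d."""
    return sum(posteriors)

# Hypothetical posteriors Pr(c | d) for three top authorities outside the root set.
posteriors = [0.9, 0.8, 0.1]
expected = expected_relevant(posteriors)  # about 1.8 relevant documents out of 3
```

Summing soft posteriors instead of counting thresholded labels lets uncertain documents contribute partial credit.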
Anecdotes
• “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
• New algorithm reduces drift
• Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
• Mixed hubs in top 50 for 13/28 queries
Conclusion and ongoing work
• Hypertext shows complex idioms, missed by coarse-grained graph model
• Enhanced fine-grained distillation
  • Identifies content-bearing ‘hot’ micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift caused by mixed hubs and pseudo-communities
• Application: topic-based focused crawling
• Need probabilistic combination of evidence from text and links