Enhanced topic distillation using text, markup tags, and hyperlinks

Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen

Topic distillation
• Given a query or some example URLs
• Collect a relevant subgraph (community) of the Web
• Bipartite reinforcement between hubs and authorities
• Prototypes:
  • HITS and Clever
  • Bharat and Henzinger
[Figure labels: keyword query, search engine, root set, expanded set]
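The bipartite reinforcement between hubs and authorities can be illustrated with a minimal sketch of the classic HITS iteration. This is the coarse-grained baseline the talk starts from, not the enhanced algorithm; the function name `hits`, the edge-list input format, and the toy page names are assumptions made for illustration.

```python
import math

def hits(edges, iterations=50):
    """Plain HITS on a directed edge list of (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages the node links to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # L2-normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical toy graph: three hubs, two authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph `a1` is cited by all three hubs and ends up with the larger authority score, while `h1`, which cites both authorities, ends up as the strongest hub.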

Challenges and limitations
• Web authoring style in flux since 1996
  • Complex pages generated from templates
  • File or page boundary less meaningful
  • “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
• Models are too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model ‘leaks’ authority
  • Ad-hoc linear segmentation not content-aware
• Deteriorating results of topic distillation

Clique attacks!
[Figure labels: irrelevant links form a pseudo-community; relevant regions that lead to inclusion of the page in the base set]

Benign drift and generalization
[Figure labels: this section specializes on ‘Shakespeare’; the remaining sections generalize and/or drift]

A new fine-grained model
[Figure: Document Object Model (DOM) tree of the page below, with html, head, body, nested table/tr/td and ul/li nodes, and a (link) leaves. A frontier of differentiation separates a relevant subtree from an irrelevant subtree. Leaf links: art.qaz.com, ski.qaz.com, www.fromages.com, www.teddingtoncheese.co.uk]

<html>…<body>…
<table …>
  <tr><td>
    <table …>
      <tr><td><a href="http://art.qaz.com">art</a></td></tr>
      <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
    </table>
  </td></tr>
  <tr><td>
    <ul>
      <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
      <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
      …
    </ul>…
  </td></tr>
</table>…
</body></html>

Generative model for hub text
• Global hub text distribution $\Phi_0$ relevant to the given query
• Authors use internal DOM nodes to specialize $\Phi_0$ into $\Phi_I$
• At a certain frontier in the DOM tree, the local distribution directly generates text in ‘hot’ and ‘cold’ subtrees
[Figure labels: global term distribution $\Phi_0$; progressive ‘distortion’; model frontier; $\Phi_I$; other pages]

A balanced cost measure
• Reference distribution $\Phi_0$
• Cumulative distortion cost $= KL(\Phi_0; \Phi_u) + \dots + KL(\Phi_u; \Phi_v)$ (for exponential distribution)
• Goal: find the minimum-cost frontier
• Data encoding cost is roughly …
[Figure labels: $\Phi_u$, $\Phi_v$, $D_v$]
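The two cost terms can be sketched in a few lines. This sketch assumes each model is a single-parameter exponential distribution identified by its rate, reads KL(a; b) as the divergence from a to b, and takes the data encoding cost to be the negative log-likelihood of the data under the node's model. The slide's own encoding-cost formula is not preserved in this transcript, so that last choice in particular is an assumption, as are all function names.

```python
import math

def kl_exponential(rate_p, rate_q):
    """KL divergence (in nats) from Exp(rate_p) to Exp(rate_q)."""
    return math.log(rate_p / rate_q) + rate_q / rate_p - 1.0

def distortion_cost(rates_on_path):
    """Sum of KL terms between consecutive models on a root-to-node path.

    rates_on_path[0] plays the role of the reference distribution and the
    last entry the model at the node of interest (assumed representation).
    """
    pairs = zip(rates_on_path, rates_on_path[1:])
    return sum(kl_exponential(p, q) for p, q in pairs)

def encoding_cost(data, rate):
    """Assumed data cost: negative log-likelihood (nats) under Exp(rate)."""
    return -sum(math.log(rate) - rate * x for x in data)

# Hypothetical numbers: a path of three models and two observations.
total = distortion_cost([1.0, 2.0, 4.0]) + encoding_cost([0.5, 1.5], 4.0)
```

A frontier pushed deeper into the tree pays more distortion cost but can fit the local data more tightly, which is the balance the slide title refers to.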

Marking ‘hot’ subtrees
• Hard to solve exactly (knapsack)
• $(1+\epsilon)$ dynamic programming solution
  • Too slow for 10 million DOM nodes
• Greedy expansion approach: at each node $v$, compare the cost of
  • Directly encoding $D_v$ w.r.t. model $\Phi_v$ at $v$
  • First distorting $\Phi_v$ to $\Phi_w$ for each child $w$ of $v$, then encoding all $D_w$ w.r.t. the respective $\Phi_w$
• If the latter is smaller, expand $v$; else prune
• Mark relevant subtrees as “must-prune”
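The greedy expansion rule can be sketched as a top-down walk over a DOM-like tree. The `Node` class and the two cost callbacks are hypothetical; the slide does not say how the costs are computed in practice, so they are passed in as plain functions here, and the toy cost tables are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    must_prune: bool = False

def greedy_frontier(root, encode_here, encode_via_children):
    """Return the frontier nodes chosen by greedy top-down expansion.

    encode_here(v): cost of encoding the data under v with v's own model.
    encode_via_children(v): cost of distorting v's model to each child's
    model plus encoding each child's data with that child's model.
    """
    frontier, stack = [], [root]
    while stack:
        v = stack.pop()
        if v.must_prune or not v.children:
            frontier.append(v)        # forced prune, or a leaf: stop here
        elif encode_via_children(v) < encode_here(v):
            stack.extend(v.children)  # expanding is cheaper: descend
        else:
            frontier.append(v)        # pruning is cheaper: v joins the frontier
    return frontier

# Hypothetical tree and cost tables.
a = Node("a", [Node("a1"), Node("a2")])
root = Node("root", [a, Node("b")])
here = {"root": 10.0, "a": 3.0}
via = {"root": 6.0, "a": 5.0}
frontier = greedy_frontier(root, lambda v: here[v.name], lambda v: via[v.name])
```

With these toy costs the root is expanded (6 < 10) but `a` is pruned (5 is not below 3), so the frontier consists of `a` and the leaf `b`. The walk touches each node at most once, which is what makes it attractive next to the dynamic program for millions of nodes.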

Exploiting co-citation in our model
1. Initial values of leaf hub scores = target authority scores
2. Must-prune nodes are marked
3. Frontier microhubs accumulate scores
4. Aggregate hub scores are copied back to leaves
• Non-linear transform, unlike HITS
[Figure: score labels are 0.10, 0.20, 0.01, 0.06, 0.05, 0.13 on the initial leaves; 0.12 on an aggregate; 0.12, 0.12, 0.12, 0.10, 0.20, 0.13 on the final leaves. Annotations: ‘known’ authorities; “have reason to believe these could be good too”]
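The accumulate-and-copy-back step can be sketched as follows. The grouping of leaves into one microhub is an assumption read off the figure's numbers (0.01 + 0.06 + 0.05 = 0.12); the leaf identifiers and the function name are hypothetical.

```python
def smooth_hub_scores(leaf_scores, microhubs):
    """Sum leaf scores inside each frontier microhub and copy the sum back.

    leaf_scores: dict mapping leaf id -> initial hub score.
    microhubs: list of lists of leaf ids, one list per frontier microhub.
    Leaves that belong to no listed microhub keep their own score.
    """
    smoothed = dict(leaf_scores)
    for leaves in microhubs:
        total = sum(leaf_scores[leaf] for leaf in leaves)
        for leaf in leaves:
            smoothed[leaf] = total
    return smoothed

# Scores taken from the figure labels; the grouping is an assumption.
scores = {"l1": 0.10, "l2": 0.20, "l3": 0.01, "l4": 0.06, "l5": 0.05, "l6": 0.13}
smoothed = smooth_hub_scores(scores, microhubs=[["l3", "l4", "l5"]])
```

Leaves that shared a microhub with stronger neighbours are lifted to the group total, which is how a link next to known authorities gets credit it would not earn alone.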

Complete algorithm
• Collect root set and base set
• Pre-segment using text and mark relevant micro-hubs to be pruned
• Set authority scores to 1 for root-set pages only
• Iterate
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link + text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
• Report top authorities and ‘hot’ microhubs
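The iteration can be sketched in a deliberately simplified form. This sketch keeps the microhub segmentation fixed across iterations instead of re-segmenting the DOM trees each round, initializes pages outside the root set to 0, and normalizes authority scores each round; all three are assumptions made to keep the example short, and the input format and names are hypothetical.

```python
import math

def distill(links, root_set, microhubs, iterations=20):
    """Simplified score propagation with a fixed segmentation.

    links: list of (leaf_id, target_page), one entry per hyperlink leaf.
    root_set: set of pages whose authority score starts at 1.
    microhubs: list of lists of leaf ids that share a frontier microhub.
    """
    targets = {t for _, t in links}
    auth = {t: (1.0 if t in root_set else 0.0) for t in targets}
    leaf_hub = {}
    for _ in range(iterations):
        # Authority -> hub leaves: each link leaf takes its target's score.
        leaf_hub = {leaf: auth[t] for leaf, t in links}
        # Smooth: every leaf in a microhub receives the microhub's total.
        for leaves in microhubs:
            total = sum(leaf_hub[leaf] for leaf in leaves)
            for leaf in leaves:
                leaf_hub[leaf] = total
        # Hub leaves -> authorities: each page sums the leaves citing it.
        auth = {t: 0.0 for t in targets}
        for leaf, t in links:
            auth[t] += leaf_hub[leaf]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {t: v / norm for t, v in auth.items()}
    return auth, leaf_hub

# Hypothetical hub page p1 with two segments, plus a second hub p2.
links = [("p1:a", "X"), ("p1:b", "Y"), ("p1:c", "Z"), ("p2:a", "X")]
auth, leaf_hub = distill(links, root_set={"X"}, microhubs=[["p1:a", "p1:b"]])
```

In this toy run `Y` gains authority because its link shares a microhub with a link to the root-set page `X`, while `Z`, linked from a different segment of the same page, stays at zero. A page-level hub model would have passed score to `Z` as well.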

Experimental setup
• Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2000…10000 pages per query/topic
  • Several million DOM nodes and fine links
• Find top authorities using various algorithms
  • For ad-hoc queries, measure cosine similarity of authorities with the root-set centroid in vector space
  • For Dmoz, use an automatic classifier…
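The ad-hoc query measure can be sketched with sparse term vectors. The term weighting is not specified on the slide, so the raw weights below are placeholders, and the function names and documents are hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of sparse term vectors (dicts of term -> weight)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical root-set documents and two candidate authorities.
root_docs = [{"cycling": 2.0, "bike": 1.0}, {"cycling": 1.0, "race": 1.0}]
root_centroid = centroid(root_docs)
on_topic = cosine({"cycling": 1.0, "bike": 1.0}, root_centroid)
off_topic = cosine({"software": 1.0}, root_centroid)
```

Averaging this similarity over the top-ranked authorities gives a single number per algorithm; an authority that drifted off topic shares few terms with the centroid and pulls the average down.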

Avoiding topic drift via micro-hubs
[Figure panels: “Query: cycling”, no danger of topic drift; “Query: affirmative action”, topic drift from software sites]

Results for the Clever benchmark
• Take top 40 authorities
• Find average cosine similarity to root-set centroid
• HITS < DOM+Text < DOM similarity
• DOM alone cannot prune well enough: most top authorities come from the root set
• HITS drifts often

Dmoz experiments and results
• 223 topics from http://dmoz.org
• Sample root set URLs from a class $c$
• Top authorities not in root set submitted to Rainbow classifier
• $\sum_d \Pr(c \mid d)$ is the expected number of relevant documents
• DOM+Text best
[Figure labels: Dmoz, Music, sample, train, test, Rainbow classifier, root set, expanded set, top authority]
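The Dmoz measure can be sketched as a sum of classifier posteriors over the reported authorities that were not already in the root set. The posterior values would come from the trained classifier; here they are made-up numbers in a dict, and the cutoff `k`, the URLs, and the function name are hypothetical.

```python
def expected_relevant(ranked_authorities, root_set, posterior, k=10):
    """Expected number of relevant documents among the top-k new authorities.

    ranked_authorities: URLs in decreasing authority order.
    root_set: URLs sampled from class c to seed the distillation.
    posterior: dict mapping URL -> Pr(c | d) from a classifier (assumed given).
    """
    new_authorities = [u for u in ranked_authorities if u not in root_set]
    return sum(posterior[u] for u in new_authorities[:k])

# Made-up ranking and posteriors for a single class c.
ranking = ["seed.example", "a.example", "b.example", "c.example"]
posterior = {"a.example": 0.9, "b.example": 0.5, "c.example": 0.1}
score = expected_relevant(ranking, {"seed.example"}, posterior, k=3)
```

Excluding root-set URLs matters because they are relevant by construction; only the newly discovered authorities say anything about drift.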

Anecdotes
• “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
• New algorithm reduces drift
• Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
• Mixed hubs in top 50 for 13/28 queries

Conclusion and ongoing work
• Hypertext shows complex idioms, missed by the coarse-grained graph model
• Enhanced fine-grained distillation
  • Identifies content-bearing ‘hot’ micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift caused by mixed hubs and pseudo-communities
• Application: topic-based focused crawling
• Need probabilistic combination of evidence from text and links
