
Enhanced topic distillation using text, markup tags, and hyperlinks

Soumen Chakrabarti, Mukul Joshi, Vivek Tawde (www.cse.iitb.ac.in/~soumen)

Topic distillation: given a keyword query or some example URLs, collect a relevant subgraph (community) of the Web.


Presentation Transcript


  1. Enhanced topic distillation using text, markup tags, and hyperlinks
     Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
     www.cse.iitb.ac.in/~soumen

  2. Topic distillation
  • Given a keyword query or some example URLs
  • Collect a relevant subgraph (community) of the Web
  • Bipartite reinforcement between hubs and authorities
  • Prototypes: HITS, Clever, SALSA; Bharat and Henzinger
  [Figure: a search engine returns a root set, which is expanded into a base set]
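The hub/authority reinforcement that all of these prototypes share is made explicit by HITS. A minimal power-iteration sketch (the toy graph is illustrative, not from the talk):

```python
# Minimal HITS power iteration on a toy base graph (illustrative only).
import math

def hits(edges, n, iters=50):
    """edges: list of (u, v) meaning hub u points to authority v."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority score: sum of hub scores of in-neighbours.
        auth = [0.0] * n
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of out-neighbours.
        hub = [0.0] * n
        for u, v in edges:
            hub[u] += auth[v]
        # L2-normalise to keep the iteration bounded.
        for vec in (hub, auth):
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            for i in range(n):
                vec[i] /= norm
    return hub, auth

# Toy graph: nodes 0 and 1 act as hubs pointing at authorities 2 and 3.
hub, auth = hits([(0, 2), (0, 3), (1, 2)], 4)
```

Node 2, cited by both hubs, ends up with the highest authority score; node 0, citing both authorities, becomes the best hub.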

  3. Two issues
  • How to collect the base set
    • Radius-1 expansion is arbitrary
    • Content relevance must play a role
  • How to spread prestige along links
    • Instability of HITS (Borodin, Lempel, Zheng)
    • Stability of PageRank (Zheng)
    • Stochastic variants of HITS (Lempel)
  • Need better recall when collecting the base graph
  • Need accurate 'boundaries' around it

  4. Challenges and limitations
  • Topic distillation results are deteriorating
    • Web authoring style in flux since 1996
    • Complex pages, templates, cloaks
    • File or page boundary less meaningful
    • "Clique attacks": rampant multi-host 'nepotism' via rings, ads, banner exchanges
  • Models too simplistic
    • Hub and authority symmetry is illusory
    • Coarse-grain hub model 'leaks' authority
    • Ad-hoc linear segmentation is not content-aware

  5. Clique attacks!
  [Figure: irrelevant links form a pseudo-community; relevant regions lead to inclusion of the page in the base set]

  6. Benign drift and generalization
  [Figure: one section of the page specializes on 'Shakespeare'; the remaining sections generalize and/or drift]

  7. A fine-grained hypertext model
  [Figure: Document Object Model (DOM) of the page below; a 'frontier of differentiation' separates a relevant subtree (links to www.fromages.com and www.teddingtoncheese.co.uk) from an irrelevant subtree (links to art.qaz.com and ski.qaz.com)]
  <html>…<body>…
    <table …>
      <tr><td>
        <table …>
          <tr><td><a href="http://art.qaz.com">art</a></td></tr>
          <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
        </table>
      </td></tr>
      <tr><td>
        <ul>
          <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
          <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
          …
        </ul>…
      </td></tr>
    </table>…
  </body></html>
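The fine-grained model keys each hyperlink by its position in the DOM tree rather than by the page as a whole. A hedged sketch of that idea (not the paper's implementation): record the DOM path of every anchor, so that links in different subtrees, like the qaz.com table versus the cheese list above, can later be scored as separate micro-hubs.

```python
# Sketch: attach each hyperlink to its DOM path, so links in different
# subtrees can be treated as separate micro-hubs. Illustrative only.
from html.parser import HTMLParser

class FineLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # current DOM path, e.g. ['html', 'body', 'table']
        self.links = []   # (dom_path, href) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

page = """<html><body><table><tr><td>
<a href="http://art.qaz.com">art</a></td></tr></table>
<ul><li><a href="http://www.fromages.com">Fromages.com</a></li></ul>
</body></html>"""

p = FineLinkParser()
p.feed(page)
```

The two links come back with distinct paths (`html/body/table/tr/td/a` versus `html/body/ul/li/a`), which is exactly the signal a per-page hub score throws away.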

  8. Preliminary approaches
  • Apply HITS to the fine-grained base graph
    • Blocked reinforcement
  • Model DOM trees as resistance or flow networks
    • Ad-hoc decay factors
  • Apply B&H outlier elimination to every DOM node
    • Hot absorbs cold, includes drift-enhancing links
  [Figure: DOM nodes rated cold, warm enough to figure as one hub, or hot]

  9. Generative model for hub text
  • Global hub text distribution Θ0 relevant to the given query
  • Authors use internal DOM nodes to hierarchically specialize Θ0 into node models Θu
  • At a certain frontier, local models are 'frozen' and text is generated
  [Figure: progressive 'distortion' of the global term distribution Θ0 down to the model frontier, with other pages hanging below]

  10. Examples using the binary model
  • Binary model
  • Code length for document d
  • Cost for specializing a term distribution
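The formulas behind these bullets were figures in the deck. Under a binary (set-of-words) model each term t is present with probability θt, so encoding a document costs the sum of per-term log losses, and specializing one distribution into another costs a KL-divergence-style penalty. A hedged sketch of both quantities (the paper's exact form, e.g. its smoothing and the direction of the KL term, may differ):

```python
# Binary-model code length and a KL-style specialization cost (sketch;
# smoothing and the exact KL direction are assumptions, not the paper's).
import math

def code_length(doc_terms, theta, vocab):
    """Bits to encode which vocabulary terms appear in the document."""
    bits = 0.0
    for t in vocab:
        p = theta[t]
        bits -= math.log2(p if t in doc_terms else 1.0 - p)
    return bits

def kl_binary(theta_p, theta_c, vocab):
    """Sum of per-term binary KL divergences between two models."""
    kl = 0.0
    for t in vocab:
        p, c = theta_p[t], theta_c[t]
        kl += p * math.log2(p / c) + (1 - p) * math.log2((1 - p) / (1 - c))
    return kl

vocab = ["cheese", "ski"]
theta0 = {"cheese": 0.5, "ski": 0.5}    # global (reference) model
theta_u = {"cheese": 0.9, "ski": 0.1}   # specialized node model
```

A cheese-only document is cheaper to encode under the specialized model than under the global one, but the specialization itself has a positive KL cost; the frontier search on the next slide trades these off.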

  11. Discovering the frontier
  • Use Θu to directly generate text snippets in the subtree rooted at u
  • Expand to children v and use different parameters for each subtree
  • Greedily pick the better local choice
  • Cumulative distortion cost = KL(Θ0‖Θu) + … + KL(Θu‖Θv)
  [Figure: reference distribution Θ0 at the root; node u with child v generating document Dv]
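The greedy choice at each node is: freeze the model here, or pay the specialization cost and push separate models down to the children. A simplified sketch with the cost functions left abstract (tree shape and costs are illustrative, not the paper's):

```python
# Greedy frontier search (sketch): expand a node only when pushing
# specialized models down to its children is locally cheaper than
# freezing the model at the node itself.

def frontier(node, cost_here, cost_expand):
    """node: dict with 'name' and optional 'children' list.
    cost_here(node): code length if the model is frozen at this node.
    cost_expand(node): specialization cost plus children's code length.
    Returns the names of the frontier nodes."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    if cost_here(node) <= cost_expand(node):
        return [node["name"]]          # freeze: this node is on the frontier
    out = []
    for c in children:                 # expand: recurse into each child
        out.extend(frontier(c, cost_here, cost_expand))
    return out

tree = {"name": "body", "children": [{"name": "ul"}, {"name": "table"}]}
# Pretend expanding 'body' saves code length overall:
cut = frontier(tree, lambda n: 10.0, lambda n: 5.0)
```

With the costs reversed (freezing cheaper), the same call returns `["body"]`, i.e. the whole page stays one hub.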

  12. Exploiting co-citation in our model
  1. Initial values of leaf hub scores = target authority scores
  2. Segment the tree using hub scores; frontier micro-hubs accumulate scores
  3. Aggregate hub scores are copied back to the leaves: a non-linear transform, unlike HITS
  4. Leaves co-cited with 'known' authorities give reason to believe their targets could be good too
  [Figure: leaf scores 0.10, 0.20, 0.01, 0.06, 0.05, 0.13 under one micro-hub are smoothed toward its aggregate score 0.12]
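The smoothing step above can be sketched as one iteration: authority scores flow to hub leaves, leaves are grouped under their frontier micro-hub, scores are smoothed within each group, and the smoothed scores flow back to the target pages. Segment ids, page names, and the mean-smoothing rule below are illustrative simplifications of the paper's method:

```python
# One simplified distillation iteration: authority -> hub leaves ->
# per-micro-hub smoothing -> back to authorities (illustrative only).
from collections import defaultdict

def iterate(leaves, auth):
    """leaves: list of (micro_hub_id, target_page). Returns new auth dict."""
    # 1. Authority-to-hub-leaf transfer.
    leaf_score = [auth.get(page, 0.0) for _, page in leaves]
    # 2. Smooth within each micro-hub (here: replace by the segment mean).
    groups = defaultdict(list)
    for i, (hub_id, _) in enumerate(leaves):
        groups[hub_id].append(i)
    for idxs in groups.values():
        mean = sum(leaf_score[i] for i in idxs) / len(idxs)
        for i in idxs:
            leaf_score[i] = mean
    # 3. Hub-leaf-to-authority transfer.
    new_auth = defaultdict(float)
    for i, (_, page) in enumerate(leaves):
        new_auth[page] += leaf_score[i]
    return dict(new_auth)

leaves = [("hubA", "p1"), ("hubA", "p2"), ("hubB", "p3")]
auth = {"p1": 1.0}              # only the root-set page starts at 1
new_auth = iterate(leaves, auth)
```

Page p2, co-cited under hubA with the known authority p1, inherits part of p1's score; p3 under the unrelated hubB gets nothing. That is the co-citation effect the slide describes.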

  13. Complete algorithm
  • Collect the root set and base set
  • Pre-segment using text and mark micro-hubs to be pruned
  • Assign authority score 1 only to root-set pages
  • Iterate:
    • Transfer from authorities to hub leaves
    • Re-segment hub DOM trees using links + text
    • Smooth and redistribute hub scores
    • Transfer from hub leaves to authority roots
  • Report top authorities and 'hot' micro-hubs

  14. Experimental setup
  • Large data sets
    • 28 queries from Clever, >20 topics from Dmoz
    • Collect 2000…10000 pages per query/topic
    • Several million DOM nodes and fine-grained links
  • Find top authorities using various algorithms
  • Measurements + anecdotes
    • For each ad-hoc query, measure cosine similarity of authorities with the root-set centroid in vector space
    • Compare HITS, DOM, and DOM+Text

  15. Avoiding topic drift via micro-hubs
  [Figure: for the query 'cycling' there is no danger of topic drift; for 'affirmative action' there is topic drift from software sites]

  16. Empirical convergence
  • Convergence for all queries within 20 iterations
  • Faster convergence for drift-free graphs, slower for graphs that posed a danger of topic drift
  • Very important not to set all authority scores > 0

  17. Results for the Clever benchmark
  • Take the top 40 authorities
  • Find average cosine similarity to the root-set centroid
  • Similarity: HITS < DOM+Text < DOM
  • DOM alone cannot prune well enough: most top authorities come from the root set
  • HITS drifts often
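The similarity measure itself is standard: represent each page as a bag-of-words vector, average the root-set vectors into a centroid, and take the cosine. A toy sketch (vectors and terms are invented for illustration):

```python
# Evaluation sketch: cosine similarity of a result page to the root-set
# centroid in a bag-of-words vector space (toy vectors, not real data).
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    out = {}
    for d in docs:
        for t, v in d.items():
            out[t] = out.get(t, 0.0) + v / len(docs)
    return out

root = [{"cycling": 1.0, "bike": 1.0}, {"cycling": 1.0, "race": 1.0}]
c = centroid(root)
on_topic = cosine({"cycling": 1.0}, c)   # a result that stayed on topic
drifted = cosine({"casino": 1.0}, c)     # a result that drifted away
```

A drifted result shares no terms with the centroid and scores 0; averaging this score over the top 40 authorities gives the benchmark number per algorithm.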

  18. Dmoz experiments and results
  • 223 topics from http://dmoz.org
  • Sample root-set URLs from a class c
  • Top authorities not in the root set are submitted to the Rainbow classifier
  • Σd Pr(c|d) is the expected number of relevant documents
  • DOM+Text is best
  [Figure: a Dmoz class ('Music') trains a Rainbow classifier on a sample; the root set is expanded and the top authorities are tested]

  19. Anecdotes
  • "amusement parks": http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
  • The new algorithm reduces drift
  • Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
  • Mixed hubs and clique attacks are rampant

  20. Application: surfing like humans (focused crawling)
  • Train a topic classifier
  • Initialize a priority queue with a few sample URLs about a topic; assume they have relevance = 1
  • Repeat:
    • Fetch the page most relevant to the topic
    • Estimate its relevance R using the classifier
    • Guess that all outlinks have relevance R
    • Add outlinks to the priority queue
  • Problem: average out-degree is too high (~10)
    • Discovering irrelevance after 10X more work
  • Can we use DOM and text to bias the 'walk'?
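The loop above is a best-first crawl over a priority queue. A minimal sketch with the fetcher and classifier stubbed out (the toy web and relevance function are invented for illustration):

```python
# Focused-crawler loop (sketch): best-first frontier keyed by guessed
# relevance; fetch() and relevance() stand in for the real page fetcher
# and topic classifier.
import heapq

def crawl(seeds, fetch, relevance, budget=10):
    """fetch(url) -> list of outlink URLs; relevance(url) -> score in [0,1]."""
    # heapq is a min-heap, so push negative relevance for best-first order.
    frontier = [(-1.0, u) for u in seeds]    # seeds assumed fully relevant
    heapq.heapify(frontier)
    seen, visited = set(seeds), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        r = relevance(url)                   # classifier's estimate
        for out in fetch(url):
            if out not in seen:              # outlinks inherit the parent's r
                seen.add(out)
                heapq.heappush(frontier, (-r, out))
    return visited

# Toy web: topic pages lead to more topic pages; junk pages lead to junk.
web = {"seed": ["topic1", "junk1"], "topic1": ["topic2"],
       "junk1": ["junk2"], "topic2": [], "junk2": []}
rel = lambda u: 0.9 if "topic" in u or u == "seed" else 0.1
order = crawl(["seed"], lambda u: web.get(u, []), rel, budget=5)
```

Because junk1's outlinks inherit its low score, junk2 is fetched last: the crawler only discovers irrelevance one hop too late, which is exactly the cost the slide's question about DOM- and text-based biasing aims to reduce.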

  21. Preliminary results
  [Figure: a meta-learner combines the relevance R1 of the source page with features collected from its DOM to predict relevance R2, separating promising from unpromising 'clicks'; feedback flows back into a standard focused crawler]

  22. Summary
  • Hypertext shows complex idioms, missed by the coarse-grained graph model
  • Enhanced fine-grained distillation
    • Identifies content-bearing 'hot' micro-hubs
    • Disaggregates hub scores
    • Reduces topic drift via mixed hubs and pseudo-communities
  • Application: online reinforcement learning
  • Need probabilistic combination of evidence from text and links
