
Accelerated Focused Crawling Through Online Relevance Feedback





Presentation Transcript


  1. Accelerated Focused Crawling Through Online Relevance Feedback Soumen Chakrabarti, IIT Bombay; Kunal Punera, IIT Bombay; Mallela Subramanyam, UT Austin

  2. Baseline learner: first-generation focused crawling • Components: Dmoz topic taxonomy; class models consisting of term stats; frontier URL priority queue; seed URLs; crawl database • Loop: pick the best URL from the frontier, the crawler fetches page u, the newly fetched page u is submitted for classification • If Pr(c*|u) is large enough, then enqueue all outlinks v of u with priority Pr(c*|u) • Goal: crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics • Guess the relevance of an unseen node v based on the relevance of u (u→v) as evaluated by the topic classifier
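
The baseline loop above can be sketched as a minimal Python fragment. Everything here is an illustrative stand-in, not the paper's actual crawler API: `fetch(url)` returns page text and outlinks, and `classify(page)` returns Pr(c*|page).

```python
import heapq

def baseline_crawl(seed_urls, fetch, classify, threshold=0.5, budget=100):
    """First-generation focused crawling, simplified.

    fetch(url) -> (page_text, outlinks); classify(text) -> Pr(c*|page).
    All names are hypothetical stand-ins for the real components.
    """
    # Frontier is a max-priority queue (negated keys) over pending URLs.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, crawled = set(seed_urls), []
    while frontier and len(crawled) < budget:
        _, u = heapq.heappop(frontier)        # pick best
        text, outlinks = fetch(u)
        relevance = classify(text)            # submit page for classification
        crawled.append((u, relevance))
        if relevance >= threshold:            # if Pr(c*|u) is large enough...
            for v in outlinks:                # ...enqueue all outlinks v of u
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-relevance, v))
    return crawled
```

Note the weakness the talk targets: every outlink of u inherits the same priority Pr(c*|u), with no discrimination between links on the page.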

  3. Baseline crawling results • 20 topics from http://dmoz.org • Half to two-thirds of pages fetched are irrelevant • “Every click on a link is a leap of faith” • Humans leap better than focused crawlers • Adequate clues in text + DOM to leap better

  4. How to seek out distant goals? • Manually collect paths (context graphs) leading to relevant goals • Use a classifier to predict link-distance to the goal from page text • Prioritize link expansion by estimated distance to relevant goals • Limitation: no discrimination among different links on a page [Figure: context-graph layers at link distances 1–4 around the goal, with a classifier guiding the crawler]

  5. The apprentice+critic strategy • Baseline learner (critic): Dmoz topic taxonomy, class models consisting of term stats • Pick the best URL from the frontier URL priority queue; the crawler fetches the newly fetched page u and submits it for classification, storing results in the crawl database • If Pr(c*|u) is large enough, submit each link instance (u,v) to the apprentice, labeled good/bad by the critic's Pr(c*|v) • Apprentice learner: trained online, with Pr(c|u) for all classes c among its features and its own class models • The apprentice assigns a more accurate priority to node v
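
The apprentice+critic wiring can be illustrated with a toy online learner. The per-feature counting scheme below is a simplified naive-Bayes-style stand-in for the real apprentice, and all names are hypothetical; the critic's role is reduced to supplying a good/bad label per instance.

```python
class Apprentice:
    """Toy online apprentice: per-feature good/bad counts.

    A sketch of the apprentice+critic division of labor, not the
    paper's implementation.
    """
    def __init__(self):
        self.counts = {}  # feature -> [good_count, bad_count]

    def update(self, features, good):
        """Online training: the critic has labeled instance (u,v) good/bad
        based on Pr(c*|v); fold its features into the counts."""
        for f in features:
            counts = self.counts.setdefault(f, [1, 1])  # Laplace smoothing
            counts[0 if good else 1] += 1

    def score(self, features):
        """Estimated priority for an unfetched node v: average per-feature
        probability of 'good' (0.5 for features never seen)."""
        pairs = [self.counts.get(f, [1, 1]) for f in features]
        return sum(g / (g + b) for g, b in pairs) / max(len(pairs), 1)
```

The key property is cheap incremental training: each fetched page yields labeled (u,v) instances immediately, so the apprentice improves while the crawl is running.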

  6. Design rationale • Baseline classifier specifies what to crawl • Could be a user-defined black box • Usually depends on large amounts of training data, relatively slow to train (hours) • Apprentice classifier learns how to locate pages approved by the baseline classifier • Uses local regularities in HTML, site structure • Less training data, faster training (minutes) • Guards against high fan-out (~10) • No need to manually collect paths to goals

  7. Apprentice feature design • HREF source page u represented by its DOM tree • Leaf nodes marked with offsets with respect to the HREF (anchor text at offset 0) • Many usable representations for a term at offset d • A ⟨t, d⟩ tuple, e.g., ⟨“download”, −2⟩ • ⟨t, p, d⟩, where p is the path length from t to d through their least common ancestor (LCA) [Figure: DOM tree of ul/li/a/em/font/tt nodes, text leaves labeled with offsets −2 through +3 around the HREF]
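
A simplified sketch of building ⟨t, d⟩ features: here the DOM leaves are already flattened into a plain token list in document order, so the function below only approximates the paper's DOM-based offsets (and ignores the ⟨t, p, d⟩ path-length variant).

```python
def offset_features(tokens, href_index, dmax=5):
    """Build <t, d> features for terms near an HREF.

    tokens: leaf-text terms of page u in document order (a simplification
    of walking the real DOM tree); href_index: position of the anchor
    text, which gets offset d = 0; dmax: how far to look on each side.
    """
    feats = []
    for i, t in enumerate(tokens):
        d = i - href_index          # negative = before the HREF
        if abs(d) <= dmax:
            feats.append((t, d))
    return feats
```

Tying the term to its offset is what lets the learner treat, say, "download" just before a link differently from "download" elsewhere on the page.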

  8. Offsets of good t,d features • Plot information gain at each d averaged over terms t • Max at d=0, falls off on both sides, but… • Clear saw-tooth pattern for most topics—why? • <li><a…><li><a…>… • <a…><br><a…><br>… • Topic-independent authoring idioms, best handled by apprentice

  9. Apprentice learner design • Instance ⟨(u,v), Pr(c*|v)⟩ represented by • ⟨t, d⟩ features for |d| up to some dmax • HREF source topics {⟨c, Pr(c|u)⟩ for all classes c} • Cocitation: if w1 and w2 are siblings and w1 is good, then w2 is likely good • Learning algorithm • Want to map features to a score in [0,1] • Discretize [0,1] into ranges, each a label ν • Class label ν has an associated value qν • Use a naïve Bayes classifier to find Pr(ν|(u,v)) • Score = Σν qν·Pr(ν|(u,v))
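
The discretize-then-score step can be written out directly. The number of bins, the per-label values qν, and the posterior vector in the example are illustrative; the naive Bayes classifier that produces Pr(ν|(u,v)) is omitted.

```python
def discretize(pr, bins):
    """Map a relevance value Pr(c*|v) in [0,1] to a range label 0..bins-1."""
    return min(int(pr * bins), bins - 1)

def expected_score(label_posterior, label_values):
    """Score(u,v) = sum over labels nu of q_nu * Pr(nu|(u,v)).

    label_posterior: Pr(nu|(u,v)) from the naive Bayes apprentice;
    label_values: the value q_nu associated with each range label
    (e.g., the range midpoint).
    """
    return sum(q * p for q, p in zip(label_values, label_posterior))
```

Collapsing the label posterior to its expected value gives a single number in [0,1] that can be used directly as a frontier priority.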

  10. Apprentice accuracy • Better elimination of useless outlinks with increasing dmax • Good accuracy with dmax < 6 • Using DOM offset info improves accuracy • Small additional accuracy gains from • LCA distance in ⟨t, p, d⟩ • Source page topics • Cocitation features

  11. Offline apprentice trials • Run the baseline crawler, then train the apprentice • Start a new crawler at the same seed URLs • Let it fetch any page it schedules • A “recall” variant limits it to pages visited by the baseline crawler • Baseline loss > recall loss > apprentice loss • Small URL overlap between the crawls

  12. Online guidance from apprentice • Run baseline • Train apprentice • Re-evaluate frontier • Apprentice not as optimistic as baseline • Many URLs downgraded • Continue crawling with apprentice guidance • Immediate reduction in loss rate
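
Re-evaluating the frontier amounts to rescoring every pending entry with the trained apprentice and rebuilding the priority queue. The (negated priority, url, features) entry layout below is an assumption for illustration, not taken from the paper's code.

```python
import heapq

def reevaluate_frontier(frontier, apprentice_score):
    """Replace optimistic baseline priorities with apprentice scores.

    frontier: heap entries (-old_priority, url, link_features);
    apprentice_score(features) -> new priority in [0,1].
    Typically many URLs are downgraded, since the apprentice is less
    optimistic than the baseline's uniform Pr(c*|u) priorities.
    """
    rescored = [(-apprentice_score(feats), url, feats)
                for _, url, feats in frontier]
    heapq.heapify(rescored)
    return rescored
```

After this one-time re-prioritization, crawling continues with the apprentice scoring each newly discovered link, which is where the immediate drop in loss rate comes from.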

  13. Summary • New generation of focused crawler • Discriminates between links, learning online • Apprentice easy and fast to train online • Accurate with small dmax, around 4–6 • DOM-derived features better than plain text • Effective division of labor (‘what’ vs. ‘how’) • Loss rate reduced by 30–90% • Apprentice better at guessing the relevance of unvisited nodes than the baseline crawler • Benefits visible after 100–1000 page fetches

  14. Ongoing work • Extending to larger radius and deeper DOM + site structure • Public-domain C++ software • Crawler • Asynchronous DNS, simple callback model • Can saturate a dedicated 4 Mbps link with a Pentium II • HTML cleaner • Simple, customizable, table-driven patch logic • Robust to bad HTML; no crashes or memory leaks • HTML-to-DOM converter • Extensible DOM node class
