
Web Crawling Notes by Aisha Walcott


Presentation Transcript


  1. Web Crawling. Notes by Aisha Walcott. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Authors: Baldi, Frasconi, Smyth

  2. Outline • Basic crawling • Selective crawling • Focused crawling • Distributed crawling • Web dynamics: age/lifetime of documents. Notes: Anchors are very useful in search engines; they are the text "on top" of a link on a webpage, e.g. <a href="URL">anchor text</a>. Many topics presented here include pointers to a number of references.

  3. Basic Crawling • A simple crawler uses a graph algorithm such as BFS • Maintains a queue, Q, that stores URLs • Two repositories: D stores documents, E stores URLs already seen • Given S0 (seeds): an initial collection of URLs • Each iteration: dequeue a URL, fetch it, and parse the document for new URLs; enqueue only URLs not yet visited (the web graph is not acyclic, so visited URLs must be tracked to avoid refetching) • Termination conditions: the time allotted to crawling has expired, or storage resources are full • When crawling stops, Q and D still hold data, so anchor text pointing to the URLs remaining in Q can be used to return query results for unfetched pages (many search engines do this); a minimal sketch of the loop follows
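
A minimal sketch of this loop in Python, using only the standard library; the page budget, timeout, and the crude href regex are illustrative assumptions, not part of the original notes:

    import re
    import urllib.request
    from collections import deque

    def basic_crawl(seeds, max_pages=100):
        q = deque(seeds)          # Q: frontier of URLs to fetch
        d = {}                    # D: repository of fetched documents
        e = set(seeds)            # E: URLs already discovered (graph has cycles)
        while q and len(d) < max_pages:   # termination: resource budget
            url = q.popleft()             # BFS: dequeue from the front
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # fetch failed; skip this URL
            d[url] = html
            # Crude href extraction; a real crawler would use an HTML parser
            # and resolve relative links against the base URL.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in e:         # enqueue only unseen URLs
                    e.add(link)
                    q.append(link)
        return d, list(q)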

  4. Practical Modifications & Issues • Time to download a doc is unknown: DNS lookup may be slow; network congestion and connection delays occur • Exploit bandwidth by running concurrent fetching threads • Crawlers should be respectful of servers and not abuse resources at the target site (robots exclusion protocol) • Multiple threads should not fetch from the same server simultaneously or too often; broaden the crawling fringe (more servers) and increase the time between requests to the same server (a politeness sketch follows this slide) • Storing Q and D on disk requires careful external memory management • Crawlers must avoid aliases and "spider traps", where the same doc is addressed by many different URLs • The web is dynamic and changes in topology and content
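
A sketch of the politeness logic, assuming Python's standard urllib.robotparser; the 2-second per-host delay and the user-agent string are illustrative choices:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    class PoliteFetcher:
        def __init__(self, user_agent="example-crawler", delay=2.0):
            self.user_agent = user_agent
            self.delay = delay        # min seconds between hits to one host
            self.last_hit = {}        # host -> time of last request
            self.robots = {}          # host -> parsed robots.txt

        def allowed(self, url):
            host = urlparse(url).netloc
            if host not in self.robots:   # fetch robots.txt once per host
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url(f"http://{host}/robots.txt")
                try:
                    rp.read()
                except OSError:
                    pass  # network error; can_fetch() then conservatively
                          # returns False until robots.txt is reachable
                self.robots[host] = rp
            return self.robots[host].can_fetch(self.user_agent, url)

        def wait_turn(self, url):
            host = urlparse(url).netloc
            elapsed = time.time() - self.last_hit.get(host, 0.0)
            if elapsed < self.delay:  # spread out requests to the same server
                time.sleep(self.delay - elapsed)
            self.last_hit[host] = time.time()

    f = PoliteFetcher()
    if f.allowed("http://example.org/page"):
        f.wait_turn("http://example.org/page")   # then fetch as usual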

  5. Selective Crawling • Recognize the relevance or importance of sites and limit fetching to the most important subset • Define a scoring function s_ξ^θ(u) for relevance, where u is a URL, ξ is the relevance criterion, and θ is the set of parameters • E.g. best-first search, using the score to order the queue • Measure efficiency as rt/t, where t = #pages fetched and rt = #fetched pages with score above a threshold st (ideally rt = t)
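
A best-first variant of the basic crawler's queue, sketched in Python; the score function shown (counting topic keywords in anchor text) is a stand-in assumption for whatever relevance criterion ξ and parameters θ are chosen:

    import heapq

    def score(anchor_text, keywords=("crawl", "web")):
        # Illustrative s(u): count of topic keywords in the anchor text.
        return sum(anchor_text.lower().count(k) for k in keywords)

    def best_first_order(candidates):
        # candidates: (url, anchor_text) pairs gathered from parsed pages.
        heap = []
        for url, anchor in candidates:
            # heapq is a min-heap, so push negated scores to pop best first.
            heapq.heappush(heap, (-score(anchor), url))
        while heap:
            neg_s, url = heapq.heappop(heap)
            yield -neg_s, url

    for s, url in best_first_order([("http://a", "web crawling intro"),
                                    ("http://b", "cat pictures")]):
        print(s, url)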

  6. Example Scoring Functions (Selective Crawling) • Depth: limit the #docs downloaded from a single site by a) setting a threshold, b) depth in the directory tree, or c) limiting path length; maximizes breadth. E.g. s_depth(u) = 1 if |root(u) ~> u| < δ, 0 otherwise, where root(u) is the root of the site containing u • Popularity: assign importance to the most popular pages, e.g. a relevance function based on backlinks: s_backlink(u) = 1 if indegree(u) > τ, 0 otherwise • PageRank: a measure of popularity that recursively assigns each link a weight proportional to the popularity of the document it originates from
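
A minimal PageRank sketch via power iteration, assuming the standard damping formulation; the damping value d = 0.85, the iteration count, and the toy graph are illustrative:

    def pagerank(graph, d=0.85, iters=50):
        # graph: dict mapping each page to the list of pages it links to.
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}           # start from uniform
        for _ in range(iters):
            new = {p: (1.0 - d) / n for p in pages}  # teleportation term
            for p, links in graph.items():
                if not links:                        # dangling node: spread evenly
                    for q in pages:
                        new[q] += d * rank[p] / n
                else:
                    for q in links:                  # share rank along out-links
                        new[q] += d * rank[p] / len(links)
            rank = new
        return rank

    # Toy example: a 3-page web where everything points at page "a".
    print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))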

  7. Focused Crawling • Searches for info related to a certain topic, not driven by generic quality measures • Approaches: relevance prediction, context graphs, reinforcement learning • Examples: CiteSeer; the Fish search algorithm (agents accumulate energy for relevant docs and consume energy for network resources)

  8. Relevance Prediction (Focused Crawling) • Define a score as the conditional probability that a doc is relevant given the text in the doc: s(u) = P(c | d(u), θ), where c is the topic of interest, θ are the adjustable parameters of the classifier, and d(u) is the contents of the doc at vertex u • Strategies for approximating the topic score: • Parent-based: score a fetched doc and extend the score to all URLs in that doc ("topic locality"); with v the parent of u, s(u) = P(c | d(v), θ) • Anchor-based: use just the text d(v, u) in the anchor(s) where the link to u appears ("semantic linkage") • E.g. a naive Bayes classifier trained on relevant docs
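
A toy naive Bayes relevance score in Python; the word-level model and the tiny hand-made training snippets are assumptions made only to illustrate parent-based scoring:

    import math
    from collections import Counter

    def train(docs):
        # docs: list of (text, label) with label True for on-topic docs.
        counts = {True: Counter(), False: Counter()}
        labels = Counter()
        for text, label in docs:
            labels[label] += 1
            counts[label].update(text.lower().split())
        return counts, labels

    def p_relevant(text, counts, labels):
        # P(c | d) via Bayes rule with Laplace smoothing over the vocabulary.
        vocab = set(counts[True]) | set(counts[False])
        logp = {}
        for c in (True, False):
            total = sum(counts[c].values())
            logp[c] = math.log(labels[c] / sum(labels.values()))  # prior
            for w in text.lower().split():
                logp[c] += math.log((counts[c][w] + 1) / (total + len(vocab)))
        m = max(logp.values())            # normalize into a posterior
        num = math.exp(logp[True] - m)
        return num / (num + math.exp(logp[False] - m))

    counts, labels = train([("web crawler queue", True),
                            ("cooking recipes", False)])
    print(p_relevant("crawler frontier queue", counts, labels))  # parent doc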

  9. Context Graphs (Focused Crawling) • Take advantage of knowledge of internet topology • Train a machine learning system to predict "how far" away relevant info can be expected to be found • E.g. a 2-layer context graph of a node u: layer 1 holds pages one link away from u, layer 2 holds pages two links away • After training, predict the layer a new doc belongs to, indicating the number of links to follow before relevant info is reached [slide figure: node u surrounded by layer 1 and layer 2 rings]
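
A sketch of how layer predictions might drive the frontier, in Python; predict_layer is a hypothetical stand-in for a trained per-layer classifier (e.g. one naive Bayes model per layer):

    import heapq

    def predict_layer(text):
        # Hypothetical classifier: a keyword heuristic stands in for a
        # trained model that assigns each doc to a context-graph layer.
        if "target topic" in text:
            return 0          # looks like the target itself
        if "related" in text:
            return 1          # likely one link away
        return 2              # at least two links away

    def enqueue_by_layer(frontier, url, text):
        # Lower predicted layer = fewer links to relevant info = higher priority.
        heapq.heappush(frontier, (predict_layer(text), url))

    frontier = []
    enqueue_by_layer(frontier, "http://example.org/a", "related resources page")
    enqueue_by_layer(frontier, "http://example.org/b", "unrelated page")
    print(heapq.heappop(frontier))  # fetch the URL closest to the target first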

  10. Reinforcement Learning (Focused Crawling) • Immediate rewards when the crawler downloads a relevant doc • A policy learned by RL can guide the agent toward high long-term cumulative reward • Internal state of the crawler: the sets of fetched and discovered URLs • Actions: fetching a URL from the queue of URLs • The state space is too large to enumerate exactly, so in practice it must be collapsed or approximated (a toy sketch follows)
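
A toy tabular Q-learning sketch over a hand-made link graph, assuming state = current page and action = link followed; real focused crawlers must generalize over features instead, since the true state space is intractable. The graph, rewards, and hyperparameters are invented for illustration:

    import random
    from collections import defaultdict

    # Toy link graph and which pages pay an immediate reward (relevant docs).
    GRAPH = {"home": ["news", "papers"], "news": ["home"],
             "papers": ["crawling-paper", "home"], "crawling-paper": ["home"]}
    REWARD = {"crawling-paper": 1.0}

    Q = defaultdict(float)             # Q[(state, action)] -> expected return
    alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

    for _ in range(2000):              # episodes of short walks from "home"
        state = "home"
        for _ in range(4):
            links = GRAPH[state]
            if random.random() < eps:  # epsilon-greedy action choice
                action = random.choice(links)
            else:
                action = max(links, key=lambda a: Q[(state, a)])
            r = REWARD.get(action, 0.0)
            best_next = max(Q[(action, a2)] for a2 in GRAPH[action])
            Q[(state, action)] += alpha * (r + gamma * best_next
                                           - Q[(state, action)])
            state = action

    # The learned policy should prefer "papers", which leads to the reward.
    print(max(GRAPH["home"], key=lambda a: Q[("home", a)]))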

  11. Distributed Crawling • Build a scalable system by "divide and conquer" • Want to minimize overlap, i.e. the same pages being fetched by more than one crawler • Characterize the interaction between crawlers along three dimensions: • Coordination • Confinement • Partitioning

  12. Coordination (Distributed Crawling) • The way different crawlers agree about the subset of pages each of them is responsible for • If 2 crawlers are completely independent, overlap is controlled only by choosing different seeds (URLs); it is hard to compute the seed partition that minimizes overlap • Alternatively, partition the web into subgraphs, with each crawler responsible for fetching docs from its own subgraph • A partition is static or dynamic based on whether or not it changes during crawling (static crawlers are more autonomous; dynamic partitions are subject to reassignment by an external coordinator)

  13. Confinement (Distributed Crawling) • Assumes static coordination; defines how strictly each crawler should operate within its own partition • What happens when a crawler pops "foreign" URLs from its queue (URLs belonging to another partition)? Three suggested modes (sketched below): • Firewall: never follow interpartition links; poor coverage • Crossover: follow foreign links only when Q has no more local URLs; good coverage, but potentially high overlap • Exchange: never follow interpartition links, but periodically communicate foreign URLs to the correct crawler(s); no overlap and potentially perfect coverage, at the cost of extra bandwidth
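
A sketch of the three modes in Python, assuming a belongs_to(url) partition function like the hash-based one on the next slide; the batching structure and the send() transport call are hypothetical simplifications:

    def handle_discovered(url, my_id, mode, local_q, foreign, belongs_to):
        # Route one discovered URL according to the confinement mode.
        # `foreign` is a list in crossover mode, a {owner: [urls]} dict
        # in exchange mode, and unused in firewall mode.
        owner = belongs_to(url)
        if owner == my_id:
            local_q.append(url)              # local URL: always keep it
        elif mode == "firewall":
            pass                             # drop interpartition links
        elif mode == "crossover":
            foreign.append(url)              # crawled only if local_q runs dry
        elif mode == "exchange":
            foreign.setdefault(owner, []).append(url)  # batch for the owner

    # Exchange mode: periodically flush the batches to the owning crawlers.
    def flush_foreign(foreign, send):
        for owner, urls in list(foreign.items()):
            send(owner, urls)                # send() is a hypothetical transport
        foreign.clear()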

  14. Partitioning (Distributed Crawling) • The strategy used to split URLs into non-overlapping subsets assigned to each crawler • E.g. a hash function of the host's IP address assigning it to a crawler • May also take into account the geographical dislocation of crawlers and servers
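
A minimal hash-partitioning sketch; hashing the hostname (rather than a resolved IP) is an assumption made so the example stays self-contained and deterministic:

    import hashlib
    from urllib.parse import urlparse

    def belongs_to(url, num_crawlers=4):
        # Hash the host so every URL on one site maps to the same crawler,
        # which also keeps politeness bookkeeping local to one process.
        host = urlparse(url).netloc
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_crawlers

    print(belongs_to("http://example.org/page1"))  # same crawler for
    print(belongs_to("http://example.org/page2"))  # all of example.org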

  15. Web Dynamics • How info on the web changes over time • A search engine with a collection of docs is (α, β)-current if the probability that a randomly chosen doc is β-current is at least α (a copy is β-current at time t if the underlying doc has not changed between the last fetch and t − β; β is the "grace period") • E.g. how many docs must be refreshed per day for the index to be (0.9, 1 week)-current? • Assume changes to a web page are random and independent, and model them as a Poisson process • "Dot com" pages are much more dynamic than "dot edu" pages
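
A small numeric sketch of the Poisson model in Python; the change rate λ and the ages below are invented numbers, used only to show how the β-currency probability could be computed:

    import math

    def p_beta_current(lam, age, beta):
        # Under a Poisson change model with rate lam (changes per day), the
        # copy is beta-current if no change occurred between the last fetch
        # and t - beta, i.e. in a window of length max(0, age - beta).
        return math.exp(-lam * max(0.0, age - beta))

    lam = 1.0 / 30.0            # assume: page changes on average once a month
    beta = 7.0                  # grace period of one week
    for age in (3, 7, 14, 30):  # days since we last fetched the page
        print(age, round(p_beta_current(lam, age, beta), 3))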

  16. Lifetime and Aging of Documents • Model based on reliability theory from industrial engineering: the time until a document changes (its "lifetime") is treated as a random variable, analogous to the time to failure of a component
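
Under the Poisson change assumption of slide 15, the lifetime T of a document (time between successive changes) is exponentially distributed; a minimal sketch of the corresponding cdf and pdf, since the table that was on slide 17 did not survive the transcript:

    F(t) = \Pr(T \le t) = 1 - e^{-\lambda t}         % cdf of the lifetime T
    f(t) = \frac{dF}{dt}(t) = \lambda e^{-\lambda t} % pdf of the lifetime
    \mathbb{E}[T] = 1/\lambda                        % mean time between changes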

  17. [Slide contained a table of cdfs and pdfs for the document lifetime model; the table itself is not recoverable from the transcript.]
