Web Crawling
Notes by Aisha Walcott
Modeling the Internet and the Web: Probabilistic Methods and Algorithms
Authors: Baldi, Frasconi, Smyth
Outline
• Basic crawling
• Selective crawling
• Focused crawling
• Distributed crawling
• Web dynamics: age/lifetime of documents

Notes:
- Anchors are very useful in search engines; they are the text "on top of" a link on a webpage, e.g. <a href="URL">anchor text</a>
- Many topics presented here have pointers to a number of references
Basic Crawling
• A simple crawler uses a graph algorithm such as BFS (see the sketch below)
• Maintains a queue, Q, that stores URLs
• Two repositories: D stores documents, E stores discovered URLs
• Given S0 (seeds): an initial collection of URLs
• Each iteration:
  • Dequeue a URL, fetch the document, and parse it for new URLs
  • Enqueue new URLs not yet visited (the web graph is not acyclic, so visited URLs must be tracked to avoid re-fetching)
• Termination conditions:
  • Time allotted to crawling has expired
  • Storage resources are full
• On termination Q and D still hold data, so anchors of the URLs remaining in Q can be used to return query results (many search engines do this)
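A minimal sketch of such a BFS crawler. The helpers fetch(url) and extract_urls(doc, base) are hypothetical stand-ins for the download and parsing machinery, which the slides do not specify:

```python
from collections import deque

def bfs_crawl(seeds, fetch, extract_urls, max_pages=1000):
    """Basic breadth-first crawler: Q holds the frontier of URLs,
    D stores fetched documents, E records URLs already discovered."""
    Q = deque(seeds)                     # frontier queue
    E = set(seeds)                       # URLs discovered so far
    D = {}                               # URL -> document text
    while Q and len(D) < max_pages:      # termination: time/storage budget
        u = Q.popleft()
        doc = fetch(u)                   # download the page (may fail or time out)
        if doc is None:
            continue
        D[u] = doc
        for v in extract_urls(doc, base=u):   # parse the document for new URLs
            if v not in E:               # dedupe: the web graph has cycles
                E.add(v)
                Q.append(v)
    return D, Q                          # anchors of URLs still in Q can answer queries
```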
Practical Modifications & Issues • Time to download a doc is unknown • DNS lookup may be slow • Network congestion, connection delays • Exploit bandwidth- run concurrent fetching threads • Crawlers should be respectful of servers and not abuse resources at target site (robots exclusion protocol) • Multiple threads should not fetch from same server simultaneously or too often • Broaden crawling fringe (more servers) and increase time between requests to same server • Storing Q, and D on disk requires careful external memory management • Crawlers avoid aliases “traps”- same doc is addressed by many different URLs • Web is dynamic and changes in topology and content
Selective Crawling
• Recognize the relevance or importance of sites and limit fetching to the most important subset
• Define a scoring function s_ξ^(θ)(u) for relevance, where u is a URL, ξ is the relevance criterion, and θ is a set of parameters
• E.g. best-first search, using the score to order the queue (see the sketch below)
• Measure efficiency as r_t / t, where t = # pages fetched and r_t = # fetched pages with score above a threshold s_t (ideally r_t = t)
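A minimal best-first variant of the BFS sketch above, assuming a user-supplied score(url) that plays the role of s_ξ^(θ); the fetch and extract_urls helpers are again hypothetical:

```python
import heapq

def best_first_crawl(seeds, fetch, extract_urls, score, max_pages=1000):
    """Selective crawling: always expand the highest-scoring known URL.
    heapq is a min-heap, so scores are negated to pop the best URL first."""
    heap = [(-score(u), u) for u in seeds]
    heapq.heapify(heap)
    seen, D = set(seeds), {}
    while heap and len(D) < max_pages:
        _, u = heapq.heappop(heap)
        doc = fetch(u)
        if doc is None:
            continue
        D[u] = doc
        for v in extract_urls(doc, base=u):
            if v not in seen:
                seen.add(v)
                heapq.heappush(heap, (-score(v), v))
    return D
```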
Ex: Scoring Functions (Selective Crawling)
• Depth: limit the number of docs downloaded from a single site by (a) setting a threshold, (b) using depth in the directory tree, or (c) limiting path length; maximizes breadth
  s_depth^(δ)(u) = 1 if |root(u) ↝ u| < δ, 0 otherwise, where root(u) is the root of the site containing u
• Popularity: assign importance to the most popular pages, e.g. a relevance function based on backlinks
  s_backlinks^(τ)(u) = 1 if indegree(u) > τ, 0 otherwise
• PageRank: a measure of popularity that recursively assigns each link a weight proportional to the popularity of the doc
(Illustrative versions of the first two scores are sketched below.)
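Illustrative versions of the depth and backlink scores. The thresholds delta and tau, and the path_length_from_root and indegree helpers, are assumptions made for the sketch:

```python
def depth_score(u, path_length_from_root, delta=5):
    """1 if u is within delta links of its site's root page, else 0."""
    return 1 if path_length_from_root(u) < delta else 0

def backlink_score(u, indegree, tau=10):
    """1 if u has more than tau known backlinks, else 0."""
    return 1 if indegree(u) > tau else 0
```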
Focused Crawling
• Searches for info related to a certain topic, not driven by generic quality measures
• Relevance prediction
• Context graphs
• Reinforcement learning
• Examples: Citeseer, the Fish algorithm (agents accumulate energy for relevant docs and consume energy for network resources)
Relevance Prediction (Focused Crawling)
• Define the score as the conditional probability that a doc is relevant given the text in the doc
• Strategies for approximating the topic score:
  • Parent-based: score a fetched doc and extend that score to all URLs in it ("topic locality")
  • Anchor-based: use only the anchor text d(v, u) where the link to u appears ("semantic linkage")
• E.g. a naïve Bayes classifier trained on relevant docs: score(u) = P(c | d(u), θ), where c is the topic of interest, θ are the adjustable parameters of the classifier, d(u) is the contents of the doc at vertex u, and v is the parent of u
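A sketch of both scoring strategies with a naïve Bayes text classifier. Using scikit-learn's MultinomialNB here is my choice of stand-in for the classifier with parameters θ; the training data and helpers are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_topic_classifier(docs, labels):
    """docs: raw training texts; labels: 1 if on-topic (class c), else 0."""
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)
    return clf

def parent_based_score(clf, parent_text):
    """Approximate P(c | d(v)): every URL found in parent doc v inherits
    the parent's own relevance ("topic locality")."""
    return clf.predict_proba([parent_text])[0, 1]   # column 1 = on-topic class

def anchor_based_score(clf, anchor_text):
    """Approximate P(c | d(v, u)) using only the anchor text pointing to u."""
    return clf.predict_proba([anchor_text])[0, 1]
```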
Context Graphs (Focused Crawling)
• Take advantage of knowledge of internet topology
• Train a machine learning system to predict "how far" away relevant info can be expected to be found
• E.g. a 2-layer context graph: a layered graph centered on node u, with layer 1 pages one link away and layer 2 pages two links away
• After training, predict the layer a new doc belongs to, indicating the number of links to follow before relevant info is reached
[Figure: two-layer context graph centered on node u, showing layer 1 and layer 2 rings]
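One way to realize the layer predictor, sketched under the assumption that layer-labeled training texts harvested from the context graph are available; the TfidfVectorizer/MultinomialNB combination is an illustrative choice, not the book's specific model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_layer_predictor(texts, layers):
    """texts: pages drawn from the context graph;
    layers: 0 = target topic, 1 = one link away, 2 = two links away, ..."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, layers)
    return model

def predicted_layer(model, page_text):
    """Estimated number of links to follow before relevant info is reached."""
    return int(model.predict([page_text])[0])
```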
Reinforcement Learning (Focused Crawling)
• Immediate reward when the crawler downloads a relevant doc
• A policy learned by RL can guide the agent toward high long-term cumulative reward
• Internal state of the crawler: the sets of fetched and discovered URLs
• Actions: fetching a URL from the queue
• The state space is too large to handle exactly, so approximations are needed
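A crude stand-in for the idea, and explicitly not the book's algorithm: estimate the expected immediate reward of a URL from the words in its anchor text and update that estimate online. This ignores long-term credit assignment, which full RL addresses; everything here is an illustrative assumption:

```python
from collections import defaultdict

class RewardGuidedFrontier:
    """Bandit-style approximation: per-word running estimates of the reward
    obtained when URLs with that anchor word were fetched."""
    def __init__(self):
        self.value = defaultdict(float)     # word -> running reward estimate
        self.count = defaultdict(int)

    def priority(self, anchor_words):
        """Predicted reward for a URL, averaged over its anchor words."""
        if not anchor_words:
            return 0.0
        return sum(self.value[w] for w in anchor_words) / len(anchor_words)

    def update(self, anchor_words, reward):
        """reward = 1.0 if the fetched doc was relevant, else 0.0."""
        for w in anchor_words:
            self.count[w] += 1
            self.value[w] += (reward - self.value[w]) / self.count[w]
```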
Distributed Crawling
• Build a scalable system by "divide and conquer"
• Want to minimize significant overlap between crawlers
• Characterize the interaction between crawlers by:
  • Coordination
  • Confinement
  • Partitioning
Coordination (Distributed Crawling)
• The way different crawlers agree about the subset of pages each of them is responsible for
• If two crawlers are completely independent, overlap is controlled only by giving them different seed URLs
• It is hard to compute the partition that minimizes overlap
• Partition the web into subgraphs; each crawler is responsible for fetching docs from its subgraph
• A partition is static or dynamic depending on whether it changes during crawling (static assignments leave crawlers more autonomous; dynamic assignments are subject to reassignment by an external coordinator)
Confinement (Distributed Crawling)
• Assumes static coordination; defines how strictly each crawler should operate within its own partition
• What happens when a crawler pops "foreign" URLs from its queue (URLs belonging to another partition)?
• Three suggested modes (see the sketch after this list):
  • Firewall: never follow interpartition links
    • Poor coverage
  • Crossover: follow foreign links only when Q has no more local URLs
    • Good coverage, potentially high overlap
  • Exchange: never follow interpartition links, but periodically communicate foreign URLs to the correct crawler(s)
    • No overlap, potentially perfect coverage, but extra bandwidth
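A sketch of how one crawler might route a discovered URL under each mode. The partition_of and send_to_peer helpers, and the separate foreign buffer for crossover mode, are assumptions made for illustration:

```python
def handle_url(url, my_partition, partition_of, mode,
               local_queue, foreign_buffer, send_to_peer):
    """Route a newly discovered URL according to the confinement mode.
    local_queue and foreign_buffer are collections.deque instances."""
    if partition_of(url) == my_partition:
        local_queue.append(url)                   # local URLs are always kept
    elif mode == "firewall":
        pass                                      # never follow interpartition links
    elif mode == "crossover":
        foreign_buffer.append(url)                # followed only if local work runs out
    elif mode == "exchange":
        send_to_peer(partition_of(url), url)      # hand off to the responsible crawler

def next_url(local_queue, foreign_buffer, mode):
    """Pick the next URL to fetch, dipping into foreign URLs only in crossover mode."""
    if local_queue:
        return local_queue.popleft()
    if mode == "crossover" and foreign_buffer:
        return foreign_buffer.popleft()
    return None
```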
Partitioning (Distributed Crawling)
• The strategy used to split URLs into non-overlapping subsets assigned to each crawler
• E.g. hash the host's IP address to assign it to a crawler (sketched below)
• May also take geographical dislocation into account
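A minimal sketch of hash-based assignment. Hashing the resolved IP is one possible realization of the idea on this slide; the SHA-1 choice and the hostname fallback are assumptions:

```python
import hashlib
import socket
from urllib.parse import urlparse

def assign_crawler(url, num_crawlers):
    """Hash the host's IP address to pick a crawler. Every crawler computes
    the same assignment, so the partition is consistent without coordination."""
    host = urlparse(url).netloc
    try:
        ip = socket.gethostbyname(host)      # resolve once; real systems cache this
    except OSError:
        ip = host                            # fall back to hashing the hostname
    digest = hashlib.sha1(ip.encode()).hexdigest()
    return int(digest, 16) % num_crawlers
```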
Web Dynamics
• How info on the web changes over time
• A search engine with a collection of docs is (α, β)-current if the probability that a doc is β-current is at least α (β is the "grace period")
• E.g. how many docs per day must be refreshed to be (0.9, 1 week)-current?
• Assume changes to the web are random and independent
• Model changes to each doc as a Poisson process
• "Dot com" pages are much more dynamic than "dot edu" pages
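Under the Poisson change model above, the probability that a snapshot is still β-current is the probability of zero changes outside the grace period. A small worked sketch, with the numbers chosen only as an example:

```python
import math

def prob_beta_current(age, change_rate, beta):
    """Probability that a copy fetched `age` days ago is beta-current:
    no change may occur outside the grace period of length beta, and changes
    arrive as a Poisson process with the given rate (changes per day)."""
    exposed = max(0.0, age - beta)           # time outside the grace period
    return math.exp(-change_rate * exposed)

# Example: a page changing on average once every 30 days, last fetched 10 days
# ago, with a one-week grace period:
# prob_beta_current(age=10, change_rate=1/30, beta=7)  ->  ~0.905
```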
Lifetime and Aging of Documents
• Model based on reliability theory from industrial engineering