Adaptive Focused Crawling Presented by: Siqing Du Date: 10/19/05
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
Crawling the Web • Simple crawling proceeds by following the urls in the seed pages, retrieving the linked web pages, and adding them to a local repository • Taking the Web as a graph structure (V, E), web crawling is similar to a graph traversal problem • Breadth-first search (see the sketch below)
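A minimal sketch of this breadth-first procedure, assuming hypothetical fetch_page and extract_urls helpers supplied by the caller:

from collections import deque

def bfs_crawl(seed_urls, max_pages, fetch_page, extract_urls):
    frontier = deque(seed_urls)          # FIFO queue: level-by-level traversal
    visited = set(seed_urls)
    repository = {}                      # local repository of downloaded pages
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)           # retrieve the web page
        repository[url] = page           # add it to the local repository
        for out_url in extract_urls(page):
            if out_url not in visited:   # follow each url at most once
                visited.add(out_url)
                frontier.append(out_url)
    return repository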
What is the Problem • The current size of the web (static/crawlable/visible) is 4~10 billion pages, or possibly far more • The average out-degree (number of urls in a page) of a random page on the web is about 7 • Hence the frontier grows roughly by a factor of 7 per level: starting from one page, ten levels of links already reach on the order of 7^10 ≈ 280 million pages • Even a well-known web search engine can only cover a part of the whole web
Adaptive Focused Crawling • Focused crawling: building specialized crawlers able to seek out and collect pages related to a given topic • It is also called topical crawling • A focused crawler is called adaptive if it includes learning methods that adapt its behavior during the crawl to the particular environment and to the given input parameters, e.g., the set of retrieved pages and the user-defined topic • Best-first search (see the sketch below)
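For contrast with breadth-first search, a minimal best-first sketch; the score function, standing in for any topic-relevance estimate, and the fetch/extract helpers are assumptions:

import heapq

def best_first_crawl(seed_urls, max_pages, fetch_page, extract_urls, score):
    frontier = [(0.0, url) for url in seed_urls]  # (negated score, url)
    heapq.heapify(frontier)
    visited = set(seed_urls)
    repository = {}
    while frontier and len(repository) < max_pages:
        _, url = heapq.heappop(frontier)          # most promising url first
        page = fetch_page(url)
        repository[url] = page
        for out_url in extract_urls(page):
            if out_url not in visited:
                visited.add(out_url)
                # negate so the min-heap pops the highest-scoring url
                heapq.heappush(frontier, (-score(page, out_url), out_url))
    return repository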
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
Exploiting the Hypertextual Information • PageRank and HITS are rooted in citation analysis, pioneered by Garfield in the 1950s • In focused crawling systems, precision is defined not only in terms of the number of crawled pages, but also in terms of rank • Short result lists of high-rank documents are far better than long lists of interesting documents that force users to sift through them in order to find the most valuable information
Topical Locality and Anchors • Topical locality occurs each time a page is linked to others with related content (in order to give users the chance to see further related information or services) • Proximal cues, or residues, are the imperfect information at intermediate locations that a user exploits to decide which paths to follow in order to reach a target piece of information • Text snippets, anchor text, or icons are usually the imperfect information related to a certain distant content
HITS • Authorities: have relevant content about a topic • Hubs: contain several links toward relevant authoritative pages (see the iteration sketch below)
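A sketch of the HITS mutual-reinforcement iteration on a small link graph; the adjacency is given as a dict of outgoing links, and L2 normalization keeps the scores bounded:

import math

def hits(out_links, iterations=50):
    # out_links: url -> list of urls the page points to
    pages = set(out_links) | {u for outs in out_links.values() for u in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it...
        auth = {p: sum(hub[q] for q in out_links if p in out_links[q])
                for p in pages}
        # ...and a good hub if it points to good authorities.
        hub = {p: sum(auth[u] for u in out_links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth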
PageRank • Random surfer model: a surfer randomly clicks on one of the links contained in a page p, each with equal probability 1/N_p
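The standard PageRank formula follows from this model. With damping factor d (the surfer follows a link with probability d and teleports to a random page among the N pages otherwise):

PR(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR(q)}{N_q}

where the sum runs over the pages q linking to p, and N_q is the out-degree of q (matching the 1/N_p click probability above).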
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
AI-based Approaches • View crawlers as single autonomous units that live on the Web and keep moving in search of interesting resources • Genetic-based crawlers • Ant paradigm
Genetic-based crawlers • InfoSpiders, also known as ARACHNID (Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery) • Genetic algorithms have been introduced in order to find approximate solutions to hard-to-solve combinatorial optimization problems. • Inspired by evolutionary biology studies.
Basic Idea of GA • A population of candidate solutions • Genetic operators, such as inheritance, mutation, and crossover • Individuals closer to good solutions are given more chances to live and reproduce, while the ones that are ill-suited for the environment die out • The initial population is generated randomly (see the sketch below)
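A minimal sketch of this generic loop; random_individual, fitness, crossover, and mutate are placeholders for problem-specific operators:

import random

def genetic_algorithm(random_individual, fitness, crossover, mutate,
                      pop_size=50, generations=100):
    population = [random_individual() for _ in range(pop_size)]  # random start
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:pop_size // 2]   # ill-suited individuals die out
        offspring = []
        while len(survivors) + len(offspring) < pop_size:
            a, b = random.sample(survivors, 2)        # fitter parents reproduce
            offspring.append(mutate(crossover(a, b)))
        population = survivors + offspring
    return max(population, key=fitness)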
InfoSpider • In InfoSpiders an evolving population of intelligent agents browses the Web driven by the user queries • Each agent is able to discover relevant resources and reason autonomously about the next page to download and analyze • The goal is to mimic the intelligent browsing behavior of human users, with little or no interaction among agents
InfoSpider cont. • Each agent is built on top of a genotype: a parameter representing the degree to which an agent trusts the textual description around outgoing links, a set of keywords initialized with the query terms, and a vector of weights • A feed-forward neural network is used to judge which keywords in that set best discriminate the documents relevant to the user
InfoSpider cont. • The adaptivity is both unsupervised and supervised (with or without users' feedback) • If an error occurs (an uninteresting page) due to the agent's action selection, the weights of the neural network are updated accordingly • Mutation and crossover provide the second kind of adaptivity to the environment • Each agent is assigned an energy value at the beginning, updated according to the relevance of the pages visited • The energy determines which agents survive or die out (a sketch follows)
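A rough sketch of this energy bookkeeping; the download cost and the reproduction/death thresholds are illustrative assumptions, not the paper's exact values:

from dataclasses import dataclass

@dataclass
class Agent:
    energy: float = 1.0

def update_energy(agent, page_relevance, cost=0.05, reproduce_at=2.0):
    # Energy grows with the relevance of the visited page
    # and pays a fixed cost for every download.
    agent.energy += page_relevance - cost
    if agent.energy <= 0.0:
        return "die"            # ill-suited agents are removed
    if agent.energy >= reproduce_at:
        agent.energy /= 2.0     # the offspring takes half of the energy
        return "reproduce"      # with a mutated/crossed-over genotype
    return "continue"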
Itsy Bitsy Spider • Itsy Bitsy spider, an implementation of a genetic-based crawler, was experimented with on the Yahoo database • In the evaluation, the genetic approach does not outperform the best-first search algorithm (recall is high; precision shows no significant difference) • However, Itsy Bitsy is a simplified version of InfoSpiders, with no neural network and some other components missing, and no ability to reason autonomously
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
Ant-based Crawlers • Based on a model of the collective behavior of social insects • Studies of how almost-blind animals, such as ants, are able to find the shortest paths from their nest to feeding sources and back • Ants release a hormonal substance, the pheromone, to mark the ground, leaving a trail • Other ants follow the trail and reinforce it
Mechanism • The first ants returning to the nest from the feeding sources are those which chose the shortest paths • The back-and-forth trip lets them release pheromone twice • Other ants, when they have to choose between different paths, prefer those with more pheromone
Ant-based Crawlers • Each agent corresponds to a virtual ant, moving from url_i to url_j • The system execution is divided into cycles; in each of them, the ants make a sequence of moves • At the end of a cycle, the ants update the pheromone intensity values of the followed paths as a function of the retrieved resource scores
Ant-based Crawlers • The transition probability from url_i to url_j at cycle t is proportional to the pheromone intensity of the corresponding trail: p_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l \notin L} \tau_{il}(t)}, where \tau_{ij}(t) is the pheromone intensity of the trail from url_i to url_j • To prevent circular paths, each ant stores a list L containing the visited urls, which are excluded from the sum
Updating Rule • The pheromone of the trail from url_i to url_j at cycle t+1 combines evaporation of the old value with a new deposit: \tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}, where \rho is the evaporation rate and \Delta\tau_{ij} grows with the scores of the resources retrieved along the trail • Adaptivity: the pheromone intensities are updated according to the visited resource scores (a sketch of one cycle follows)
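A minimal sketch of one crawl cycle built on the two formulas above; the evaporation rate and the score-proportional deposit follow the classic ant-system scheme and are assumptions, not necessarily the authors' exact parameters:

import random

def pick_next_url(tau, current, visited):
    # Move with probability proportional to pheromone, skipping visited urls.
    options = [(u, w) for (i, u), w in tau.items()
               if i == current and u not in visited]
    if not options:
        return None
    r = random.uniform(0.0, sum(w for _, w in options))
    for u, w in options:
        r -= w
        if r <= 0.0:
            return u
    return options[-1][0]

def update_pheromone(tau, paths_with_scores, rho=0.1):
    for edge in tau:                        # evaporation of old trails
        tau[edge] *= (1.0 - rho)
    for path, score in paths_with_scores:   # deposit along each followed path
        for edge in zip(path, path[1:]):
            tau[edge] = tau.get(edge, 0.0) + score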
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
Intelligent Crawling's Statistical Model • Aims at learning statistical characteristics of the linkage structure of the Web while performing the search • Uses the knowledge gathered during the search to compute conditional probabilities and an interest ratio, which determine whether an unseen page is likely to satisfy the user needs • It does not need any collection of topical examples for training • The crawler adapts its behavior by learning the correlations among given features (see the sketch below)
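A sketch of the interest-ratio computation from crawl statistics; the "features" (e.g. words in the parent page or in the url) and the counting scheme are assumptions for illustration:

def interest_ratio(feature, crawl_history):
    # crawl_history: list of (feature_set, is_relevant) pairs seen so far
    total = len(crawl_history)
    relevant = sum(1 for _, rel in crawl_history if rel)
    with_feature = [rel for feats, rel in crawl_history if feature in feats]
    if not with_feature or not relevant:
        return 1.0                      # no evidence yet: treat as neutral
    # P(relevant | feature) / P(relevant): a ratio above 1 means the
    # feature makes an unseen page more likely to satisfy the user needs.
    return (sum(with_feature) / len(with_feature)) / (relevant / total)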
Reinforcement Learning-based Approaches • A classifier evaluates the relevance of a hypertext document with respect to the chosen topics • The interesting documents found are the rewards • The goal is to learn, during the crawl, which text in the neighborhood of a hyperlink most likely points to relevant pages (a rough sketch follows)
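A crude sketch of that idea: average the observed reward per word appearing near a followed hyperlink, then score new links by their surrounding words. This is a stand-in for the full reinforcement-learning machinery, not the published method:

from collections import defaultdict

class AnchorTextValue:
    def __init__(self):
        self.total = defaultdict(float)   # cumulative reward per word
        self.count = defaultdict(int)     # times each word was seen

    def update(self, anchor_words, reward):
        # reward: e.g. 1.0 when the followed link led to a relevant document
        for w in anchor_words:
            self.total[w] += reward
            self.count[w] += 1

    def score(self, anchor_words):
        # Average learned value of the words around a candidate hyperlink.
        vals = [self.total[w] / self.count[w]
                for w in anchor_words if self.count[w]]
        return sum(vals) / len(vals) if vals else 0.0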
Outline • Introduction of web crawling • Exploiting the hypertextual information • Genetic-based crawler • Ant-based crawler • Machine learning-based crawler • Evaluation
Evaluation Methodologies • The goodness of the retrieved documents • The percentage of important pages retrieved over the progress of the crawl is another often-used measure
An Example of Performance Plot • Calculated over 159 topics • One-tailed t-test performed, p < 0.01
Summary • Focused crawling has become an interesting alternative to current Web search tools • Focused crawlers are a particular kind of crawler, able to seek out and collect the subset of Web pages related to a given topic • With learning methods, adaptive focused crawlers can adapt their behavior to the particular environment and input parameters during the search • Evaluation results show how the whole search process can profit from those techniques and improve crawling performance
References • Core paper: Alessandro Micarelli and Fabio Gasparetti, Adaptive Focused Crawling • Additional papers: Gautam Pant, Padmini Srinivasan, and Filippo Menczer, Crawling the Web, Web Dynamics, Springer-Verlag, 2003 • Martin Ester, Matthias Groß, Hans-Peter Kriegel, Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies, VLDB 2001
Questions & Comments? Thanks!