Web Information retrieval (Web IR)

Handout #11: FICA: A Fast Intelligent Crawling Algorithm
Ali Mohammad Zareh Bidoki
ECE Department, Yazd University
alizareh@yaduni.ac.ir

Web Crawling
• Search engines do not index the entire Web
• Therefore, a crawler has to focus on the most valuable and appealing pages
• To do this, a better crawling criterion is required
• Proposed approach: FICA

Breadth-First Crawling
[Figure: example web graph with pages p, q, r, s, t, u, v, w, x, y, z]
• BFS advantages
• Why is it an acceptable algorithm?
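As a minimal sketch of breadth-first crawling, the snippet below visits pages level by level from a seed. The edge set in `GRAPH` is hypothetical: the slide's figure is not recoverable here, so the links were chosen only to connect the page names p to z. The function name `bfs_crawl_order` is likewise an illustration, not part of the handout.

```python
from collections import deque

# Hypothetical link structure for the pages named on the slide (p ... z).
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": ["y"],
    "t": ["y", "z"],
}

def bfs_crawl_order(graph, seed):
    """Return pages in the order a breadth-first crawler would fetch them."""
    seen = {seed}              # pages already queued, to avoid re-crawling
    frontier = deque([seed])   # FIFO queue: oldest discovered page first
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

print(bfs_crawl_order(GRAPH, "p"))
```

The FIFO queue is the whole policy: every page one link away from the seed is fetched before any page two links away, regardless of how important each page is.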

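Logarithmic Distance Crawling
• When i points to j, the link from i to j contributes log O(i) to the distance, where O(i) is the out-degree of i (base-10 logarithms, consistent with the values below)
[Figure: example web graph with pages p, q, r, s, t, u, v, w, x, y, z; the first hop from p is labeled log 4]
• d_pv = log 4 + log 3 ≈ 1.08
• d_pt = log 4 ≈ 0.60
• d_pz = log 4 + log 2 ≈ 0.90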
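The three distances on this slide can be checked with a few lines of arithmetic. The sketch assumes that each hop adds the base-10 logarithm of the out-degree of the page being left (4 for p, then 3 or 2 for the intermediate page), which is what the numbers on the slide suggest; `log_distance` is a hypothetical helper name.

```python
import math

def log_distance(out_degrees):
    """Sum log10(out-degree) over the pages left along a path."""
    return sum(math.log10(k) for k in out_degrees)

# Out-degrees along each path, read off the values shown on the slide.
d_pt = log_distance([4])       # p -> t
d_pv = log_distance([4, 3])    # p -> (page with 3 out-links) -> v
d_pz = log_distance([4, 2])    # p -> (page with 2 out-links) -> z

print(f"d_pt = {d_pt:.3f}")    # 0.602
print(f"d_pv = {d_pv:.3f}")    # 1.079
print(f"d_pz = {d_pz:.3f}")    # 0.903
```

Because log a + log b = log(ab), the distance of a path is the logarithm of the product of the out-degrees along it, so pages reached through pages with few out-links stay close to the seed.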

FICA
• Intelligent surfer model
• It is based on reinforcement learning

FICA (On-line)
• Distance is used as the priority value
[Figure: on-line FICA crawler. Labeled components: Web, Web pages, Downloader, URLs, Text and Metadata, Repository, FICA scheduler, Priority Queue (URL1, URL2, …), Seeds]
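To illustrate what "distance is used as the priority value" can look like in a crawl loop, here is a simplified sketch: the frontier is a min-heap keyed by distance, and each newly discovered URL is queued at its parent's distance plus log10 of the parent's out-degree. This is an assumption-laden toy, not the FICA scheduler itself (it has no reinforcement-learning update), and `GRAPH`, `fetch_links`, and `crawl` are hypothetical names standing in for the Web, the downloader, and the scheduler.

```python
import heapq
import math

# Hypothetical link structure standing in for the Web.
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": ["y"],
    "t": ["y", "z"],
}

def fetch_links(url):
    """Stand-in for the downloader: return the out-links of a page."""
    return GRAPH.get(url, [])

def crawl(seeds, max_pages):
    """Crawl up to max_pages URLs, always taking the smallest distance first."""
    queue = [(0.0, url) for url in seeds]   # (distance, url) pairs
    heapq.heapify(queue)
    crawled = set()
    order = []
    while queue and len(order) < max_pages:
        distance, url = heapq.heappop(queue)
        if url in crawled:
            continue                         # stale duplicate entry
        crawled.add(url)
        order.append(url)
        links = fetch_links(url)
        if not links:
            continue
        step = math.log10(len(links))        # assumed cost of leaving this page
        for target in links:
            if target not in crawled:
                heapq.heappush(queue, (distance + step, target))
    return order

print(crawl(["p"], max_pages=5))
```

A URL may be pushed more than once with different distances; the smallest one is popped first and later copies are skipped, which keeps the loop on-line since priorities are assigned as pages are downloaded.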

Comparison with Others
[Figure: crawler driven by a partial ranking algorithm. Labeled components: Web, Downloader, Repository, URLs and Links, Partial Ranking Algorithm, queue (URL1, URL2, …), Seeds]

Experimental Results
• The experiment was run on a UK web graph of 18 million web pages
• We chose PageRank as the ideal ranking mechanism

FICA Properties
• Its time complexity is O(E log V) for a web graph with V pages and E links
• Complexity of Partial PageRank is …
• FICA outperforms the others in discovering highly important pages
• It requires little memory for computation
• It is online and adaptive

FICA as a Ranking Algorithm
• We used Kendall's metric to measure the correlation between two ranked lists
• PageRank is taken as the ideal ranking
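The slide does not say which variant of Kendall's metric was used. As one plausible reading, the sketch below computes Kendall's tau for two rankings of the same items with no ties: the number of pairs ordered the same way in both lists minus the number ordered differently, divided by the total number of pairs. The function name `kendall_tau` and the sample lists are illustrative only.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    if pos_a.keys() != pos_b.keys() or len(pos_a) < 2:
        raise ValueError("rankings must contain the same two or more items")
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both lists order x and y the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

ideal = ["a", "b", "c"]        # e.g. pages ordered by the reference ranking
candidate = ["a", "c", "b"]    # e.g. pages ordered by the ranking under test
print(kendall_tau(ideal, candidate))
```

A value of 1 means the two lists agree on every pair, and -1 means one list is the reverse of the other, so a ranking closer to the reference scores closer to 1.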

Dynamic Version of FICA
