Web Information Retrieval (Web IR). Handout #11: FICA: A Fast Intelligent Crawling Algorithm. Ali Mohammad Zareh Bidoki, ECE Department, Yazd University, alizareh@yaduni.ac.ir
Web Crawling • Search engines do not index the entire Web • Therefore, a crawler has to focus on the most valuable and appealing pages • To do this, a better crawling criterion is required • FICA
Breadth-First Crawling • BFS advantages • Why is it an acceptable algorithm? [Figure: example web graph with nodes p, q, r, s, t, u, v, w, x, y, z]
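As a point of reference for the breadth-first strategy, here is a minimal Python sketch of the visiting order it produces. The link structure in `example_graph` is an assumption made for illustration (the edges of the slide's figure are not recoverable), and `bfs_crawl_order` is a hypothetical helper name, not part of any crawler library.

```python
from collections import deque


def bfs_crawl_order(graph, seed):
    """Return pages in the order a breadth-first crawler would visit them.

    graph maps each page to the list of pages it links to.
    """
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()          # oldest discovered page first (FIFO)
        order.append(page)
        for link in graph.get(page, []):
            if link not in visited:     # schedule each page at most once
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical link structure; only the node names come from the slide.
example_graph = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "s": ["x", "y"],
    "t": ["y", "z"],
}

print(bfs_crawl_order(example_graph, "p"))
```

Pages are visited level by level from the seed, so pages close to the seed are always fetched before deeper ones, regardless of how many links their parents carry.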
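Logarithmic Distance Crawling • When i points to j, the distance from i to j is the logarithm of i's out-degree $O(i)$: $d_{ij} = \log_{10} O(i)$ • Distances add along a path • $d_{pt} = \log 4 \approx 0.60$ • $d_{pv} = \log 4 + \log 3 \approx 1.08$ • $d_{pz} = \log 4 + \log 2 \approx 0.90$ [Figure: example web graph with nodes p, q, r, s, t, u, v, w, x, y, z]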
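The arithmetic on this slide can be reproduced with a short Python sketch. It assumes that each link out of a page costs the base-10 logarithm of that page's out-degree, which is what the slide's numbers imply. The edges in `example_graph` are hypothetical, chosen only so that the out-degrees match the slide (4 for p, 3 on the way to v, 2 on the way to z), and `path_log_distance` is an illustrative name.

```python
import math


def path_log_distance(graph, path):
    """Sum log10(out-degree) over every page on the path except the last."""
    return sum(math.log10(len(graph[page])) for page in path[:-1])


# Hypothetical edges; only the out-degrees are taken from the slide.
example_graph = {
    "p": ["q", "r", "s", "t"],   # out-degree 4
    "q": ["u", "v", "w"],        # out-degree 3
    "s": ["x", "y"],
    "t": ["y", "z"],             # out-degree 2
}

d_pt = path_log_distance(example_graph, ["p", "t"])        # log 4
d_pv = path_log_distance(example_graph, ["p", "q", "v"])   # log 4 + log 3
d_pz = path_log_distance(example_graph, ["p", "t", "z"])   # log 4 + log 2
print(round(d_pt, 2), round(d_pv, 2), round(d_pz, 2))
```

Under this weighting a page reached through pages with few out-links ends up closer to the start than one reached through link-heavy pages, even at the same link depth.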
FICA • Intelligent surfer model • It is based on reinforcement learning
FICA (On-line) • Distance is used as the priority value [Figure: crawler architecture with the labels Web, Web pages, Downloader, Text and Metadata, Repository, URLs, FICA scheduler, Priority Queue (URL1, URL2, …), Seeds]
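A rough Python sketch of a crawl loop that uses distance as the priority value follows. It is a simplification under stated assumptions: the priority is the accumulated base-10 log of out-degrees from the seeds, the reinforcement-learning part of FICA is left out, and `fetch_links` is a hypothetical stand-in for the downloader plus link extraction. The names `crawl_by_distance`, `max_pages`, and `example_graph` are illustrative as well.

```python
import heapq
import math


def crawl_by_distance(fetch_links, seeds, max_pages):
    """Crawl up to max_pages pages, always taking the smallest distance next."""
    best = {url: 0.0 for url in seeds}        # best known distance per URL
    frontier = [(0.0, url) for url in seeds]  # priority queue of (distance, URL)
    heapq.heapify(frontier)
    done = set()
    crawled = []
    while frontier and len(crawled) < max_pages:
        dist, url = heapq.heappop(frontier)
        if url in done:
            continue                          # stale queue entry, already crawled
        done.add(url)
        crawled.append(url)
        links = fetch_links(url)              # assumption: returns the out-links
        if not links:
            continue
        step = math.log10(len(links))         # cost of each link out of this page
        for link in links:
            new_dist = dist + step
            if link not in done and new_dist < best.get(link, math.inf):
                best[link] = new_dist
                heapq.heappush(frontier, (new_dist, link))
    return crawled


# Hypothetical link structure standing in for the live Web.
example_graph = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "s": ["x", "y"],
    "t": ["y", "z"],
}

order = crawl_by_distance(lambda url: example_graph.get(url, []), ["p"], max_pages=8)
print(order)
```

In this toy graph the pages behind the out-degree-2 parents are fetched before those behind the out-degree-3 parent, which is where the ordering departs from plain breadth-first crawling. Stale heap entries are skipped rather than updated in place, a common way to use `heapq` without a decrease-key operation.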
Comparison with Others [Figure: crawler architecture with the labels Web, Downloader, Repository, URLs and Links, Partial Ranking Algorithm, URL queue (URL1, URL2, …), Seeds]
Experimental Results • The experiment was run on a UK web graph of 18 million web pages • PageRank was chosen as the ideal ranking mechanism
FICA Properties • Its time complexity is O(E log V) • Complexity of Partial PageRank is • FICA outperforms the other algorithms in discovering highly important pages • It requires little memory for computation • It is online and adaptive
FICA as a Ranking Algorithm • We used Kendall's tau metric to measure the correlation between two ranked lists • PageRank serves as the ideal ranking
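To make the rank-correlation comparison concrete, here is a small pure-Python sketch of Kendall's tau for two rankings without ties. This is the textbook tau-a form; the exact variant used in the experiments is not stated on the slide, so treat the choice as an assumption. The two rankings below are invented for illustration and are not experimental data.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties).

    rank_a and rank_b map each item to its position in the list.
    Returns a value in [-1, 1]; 1 means the two orders are identical.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        agreement = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agreement > 0:
            concordant += 1       # both lists order x and y the same way
        elif agreement < 0:
            discordant += 1       # the lists disagree on this pair
    total_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / total_pairs


# Illustrative rankings only.
ideal_rank = {"p": 1, "q": 2, "r": 3, "s": 4}
other_rank = {"p": 1, "q": 3, "r": 2, "s": 4}
tau = kendall_tau(ideal_rank, other_rank)
print(round(tau, 3))
```

The pairwise loop is quadratic in the number of items, which is fine for a sketch but too slow for a graph with millions of pages. `scipy.stats.kendalltau` provides a faster implementation that also handles ties.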