
Web Information retrieval (Web IR)

Web Information retrieval (Web IR). Handout #11: FICA: A Fast Intelligent Crawling Algorithm. Ali Mohammad Zareh Bidoki ECE Department, Yazd University alizareh@yaduni.ac.ir. Web Crawling. Search engines do not index the entire Web


Presentation Transcript


  1. Web Information retrieval (Web IR) • Handout #11: FICA: A Fast Intelligent Crawling Algorithm • Ali Mohammad Zareh Bidoki • ECE Department, Yazd University • alizareh@yaduni.ac.ir

  2. Web Crawling • Search engines do not index the entire Web • Therefore, a crawler has to focus on the most valuable and appealing pages • To do this, a better crawling criterion is required • FICA

  3. Breadth-First Crawling • [Figure: example link graph with nodes p, q, r, s, t, u, v, w, x, y, z] • BFS advantages • Why is it an acceptable algorithm?
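As a rough illustration of breadth-first crawling, the sketch below visits pages level by level from a seed. The node names are taken from the slide, but the edges in `GRAPH` are an assumption (the figure itself is not preserved), and `bfs_crawl_order` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical link graph: node names come from the slide, edges are assumed.
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": ["y", "z"],
}


def bfs_crawl_order(graph, seed):
    """Return pages in the order a breadth-first crawler would fetch them."""
    seen = {seed}
    frontier = deque([seed])  # FIFO queue: oldest discovered URL is fetched first
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        # Reading graph[page] stands in for downloading the page and extracting links.
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


print(bfs_crawl_order(GRAPH, "p"))
```

With this assumed graph, every page one click from the seed is fetched before any page two clicks away.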

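  4. Logarithmic Distance Crawling • When i points to j, the weight of the link is log O(i), where O(i) is the out-degree of i; the distance between two pages is the sum of the link weights along the path (logarithms are base 10) • [Figure: example link graph with nodes p, q, r, s, t, u, v, w, x, y, z] • d_pt = log 4 ≈ 0.60 • d_pz = log 4 + log 2 ≈ 0.90 • d_pv = log 4 + log 3 ≈ 1.08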
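The arithmetic on this slide can be checked with a short sketch. It assumes that each link's weight is the base-10 logarithm of the out-degree of the page it leaves, which is consistent with the slide's numbers. The edges in `GRAPH` are hypothetical, chosen so that p has four out-links, q has three, and s has two; `log_distance` is an illustrative name.

```python
import math

# Hypothetical edges chosen to match the slide's sums; not taken from the figure.
GRAPH = {
    "p": ["q", "r", "s", "t"],  # out-degree 4
    "q": ["u", "v", "w"],       # out-degree 3
    "r": ["x"],                 # out-degree 1
    "s": ["y", "z"],            # out-degree 2
}


def log_distance(graph, path):
    """Sum log10(out-degree of the source page) over each hop of the path."""
    total = 0.0
    for source, target in zip(path, path[1:]):
        if target not in graph.get(source, []):
            raise ValueError(f"no link {source} -> {target}")
        total += math.log10(len(graph[source]))
    return total


for path in (["p", "t"], ["p", "s", "z"], ["p", "q", "v"]):
    print("->".join(path), round(log_distance(GRAPH, path), 2))
```

Under this assumption a page reached through hubs with many out-links ends up farther away than one reached through pages with few out-links.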

  5. FICA • Intelligent surfer model • It is based on reinforcement learning

  6. FICA (On-line) • Distance is used as the priority value in the priority queue • [Figure: crawler architecture with the Web, Downloader (web pages), Repository (text and metadata), FICA scheduler, priority queue of URLs (URL1, URL2, …), and Seeds]
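A distance-ordered frontier can be sketched with a binary heap. This is a simplified illustration, not the FICA scheduler itself: it omits the reinforcement-learning update, assumes the link weight is log10 of the source page's out-degree, and uses an in-memory `GRAPH` with hypothetical edges in place of a real downloader.

```python
import heapq
import math

# Hypothetical link graph standing in for the Web; edges are assumed.
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": ["y", "z"],
}


def distance_priority_crawl(graph, seed):
    """Fetch pages in order of increasing logarithmic distance from the seed."""
    best = {seed: 0.0}    # smallest distance seen so far for each URL
    heap = [(0.0, seed)]  # priority queue keyed by distance
    fetched = set()
    order = []
    while heap:
        dist, page = heapq.heappop(heap)
        if page in fetched:
            continue  # stale queue entry left behind by an earlier, longer path
        fetched.add(page)
        order.append(page)
        links = graph.get(page, [])  # stands in for download plus link extraction
        if not links:
            continue
        step = math.log10(len(links))
        for target in links:
            candidate = dist + step
            if candidate < best.get(target, math.inf):
                best[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return order, best


order, best = distance_priority_crawl(GRAPH, "p")
print(order)
```

Each push and pop costs logarithmic time in the queue size, so handling every link once gives a total cost on the order of E log V for E links and V pages.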

  7. Comparison with Others • [Figure: crawler architecture with the Web, Downloader, Repository (URLs and links), Partial Ranking Algorithm, URL queue (URL1, URL2, …), and Seeds]

  8. Experimental Results • The experiment was run on a UK web graph of 18 million web pages • We chose PageRank as the ideal ranking mechanism

  9. FICA Properties • Its time complexity is O(E log V) • Complexity of Partial PageRank is • FICA outperforms the other algorithms in discovering highly important pages • It requires little memory for computation • It is online and adaptive

  10. FICA as a Ranking Algorithm • We used Kendall's tau metric to measure the correlation between two rank lists • PageRank is the ideal ranking
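For two rankings of the same pages, Kendall's tau can be computed by counting pairs that appear in the same order in both lists (concordant) and pairs that appear in opposite order (discordant). The sketch below is a plain quadratic-time version that assumes no ties; `kendall_tau` and the example rankings are illustrative and are not the data used in the experiments.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau for two tie-free rankings of the same items, in [-1, 1]."""
    n = len(ranking_a)
    if n < 2 or len(set(ranking_a)) != n:
        raise ValueError("need at least two distinct items")
    if len(ranking_b) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both lists order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical rankings: one adjacent pair is swapped in the second list.
ideal = ["p", "q", "r", "s"]
other = ["p", "r", "q", "s"]
print(kendall_tau(ideal, other))
```

A value of 1 means the two lists agree on every pair, and -1 means one list is the reverse of the other.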

  11. Dynamic Version of FICA
