Freshness Policy
Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar
Department of Computer Science, University of Southern California, Los Angeles, CA
Freshness Policy
• A freshness policy, also known as a revisit policy, determines the order and timing in which a crawler re-crawls web pages.
• By the time a web crawler has finished its crawl, many events may have occurred, including creations, updates and deletions, which leave the crawled data out of date.
• To display the latest results to the user, a search engine must have an efficient revisit policy.
• An efficient revisit policy not only saves time and bandwidth but also keeps the search engine's data up to date.
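Metrics for Evaluation of Freshness Policy
• Two metrics for determining how up to date a local copy is can be described as follows:
• Freshness: a binary measure that indicates whether the local copy is accurate or not. The freshness of a page $p$ in the repository at time $t$ is defined as:
$$F_p(t) = \begin{cases} 1 & \text{if } p \text{ is equal to the local copy at time } t \\ 0 & \text{otherwise} \end{cases}$$
• Age: a measure that indicates how outdated the local copy is. The age of a page $p$ in the repository at time $t$ is defined as:
$$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}$$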
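The two metrics can be sketched in a few lines of Python. This is only an illustration of the definitions, not the code used in the study: the function names and the sample timestamps are hypothetical, and it assumes that a local copy counts as accurate whenever it was cached at or after the page's most recent modification.

```python
from datetime import datetime, timedelta


def freshness(last_modified, cached_at):
    """Return 1 if the cached copy is at least as recent as the last modification, else 0.

    Assumption: "accurate" means the copy was taken at or after the latest change.
    """
    return 1 if cached_at >= last_modified else 0


def age(last_modified, cached_at, now):
    """Return zero if the cached copy is up to date, else the time since the page changed."""
    if cached_at >= last_modified:
        return timedelta(0)
    return now - last_modified


# Hypothetical timestamps: the page changed one day after it was cached.
modified = datetime(2011, 4, 10, 12, 0)
cached = datetime(2011, 4, 9, 12, 0)
observed = datetime(2011, 4, 12, 12, 0)

print(freshness(modified, cached))
print(age(modified, cached, observed))
```

Averaging these per-observation values over all crawls of a site gives the per-site figures used later in the deck.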
Methodology
• Tracked over 90 sites over a period of 2 weeks.
• Divided the sites into 4 categories:
  • Movies
  • Technology
  • Education
  • News
• Selected sites based on Alexa traffic rankings – Rohan and Binoy
• Developed a crawler in Java to download the original page as well as the Google and Bing cached versions of each web page twice a day – Binoy and Rohan
• Implemented our own code to extract the date and time from the cache for each web page – Rohan
• Implemented our own diff functionality to detect changes in a web page over time; it ignores HTML tags and scripts and compares only the data between the tags – Madhura
• Data integration – Madhura
• Data analysis – Binoy, Rohan and Madhura
• Study of the Nutch adaptive fetch policy – Binoy
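The diff step described above (ignore HTML tags and scripts, compare only the text between tags) could look roughly like the following. This is a minimal sketch in Python using the standard-library `html.parser`, not the authors' Java implementation; the class and function names are hypothetical, and skipping `<style>` bodies and normalising whitespace are additional assumptions.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}  # assumption: style blocks are ignored as well

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def visible_text(html):
    """Return the page's text content as one whitespace-normalised string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def page_changed(old_html, new_html):
    """Report a change only when the text between the tags differs."""
    return visible_text(old_html) != visible_text(new_html)


old = '<html><body><p class="a">Hello</p><script>var t = 1;</script></body></html>'
new = '<html><body><p class="b">Hello</p><script>var t = 2;</script></body></html>'
print(page_changed(old, new))
```

In this example the two snapshots differ only in an attribute and a script body, so the comparison treats them as unchanged. Comparing text rather than raw markup avoids counting rotating ad scripts or timestamps in attributes as page updates.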
NUTCH 1.2 Setup
• Installed Nutch with Lucene on a local machine for crawling
• Settings used for Nutch crawling:
    <name>db.fetch.interval.default</name><value>172800</value>
    <name>db.fetch.schedule.class</name><value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    <name>db.fetch.schedule.adaptive.inc_rate</name><value>0.4</value>
    <name>db.fetch.schedule.adaptive.dec_rate</name><value>0.2</value>
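To see what these settings imply, the sketch below models how an adaptive schedule of this kind adjusts the re-fetch interval. It reflects our understanding of `AdaptiveFetchSchedule` (shrink the interval by `dec_rate` when a page has changed, grow it by `inc_rate` when it has not) and is a simplification written in Python, not Nutch's actual code; the function name and the `min_interval`/`max_interval` bounds are illustrative assumptions that do not appear in the slides.

```python
SECONDS_PER_DAY = 86400


def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=31536000.0):
    """Return the next re-fetch interval in seconds.

    modified=True  -> the page changed since the last fetch, so revisit sooner.
    modified=False -> the page did not change, so back off.
    The min/max bounds are assumed values used only to clamp the result.
    """
    if modified:
        interval *= (1.0 - dec_rate)
    else:
        interval *= (1.0 + inc_rate)
    return max(min_interval, min(interval, max_interval))


interval = 172800.0  # db.fetch.interval.default from the slide: 2 days in seconds
for changed in (False, True, True):
    interval = next_interval(interval, changed)
    print(f"changed={changed}: next fetch in {interval / SECONDS_PER_DAY:.2f} days")
```

With the slide's values, an unchanged page moves from a 2-day interval to roughly 2.8 days, and each subsequent detected change shortens the interval by 20%.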
Nutch Crawling Snapshot
• Average freshness achieved with the Nutch fetch policy: 0.5
Data Integration and Calculations (Excel)
• Data snippet after integration
• Age and freshness calculations
  • Per site
      Average Age per Site = (Sum of Ages) / (Number of Crawls)
      Average Freshness per Site = (Sum of Freshness) / (Number of Crawls)
  • Per category
      Average Age per Category = (Sum of Average Site Ages) / (Site Count)
      Average Freshness per Category = (Sum of Average Site Freshness) / (Site Count)
Standard Deviation
• Standard Deviation in Age for a Category (days) = sqrt[ (Sum of Squared Differences between each Site's Average Age and the Category Average Age) / (Site Count) ]
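The per-site averages, per-category averages and standard deviation above can be reproduced outside Excel with a short script. The following Python sketch follows the formulas as stated on the slides (the standard deviation divides by the site count, i.e. the population form); the function names and the sample ages are hypothetical and are not data from the study.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def category_stats(ages_by_site):
    """Return (category average age, standard deviation) from per-crawl ages.

    ages_by_site maps a site name to the list of ages (in days) observed
    on each crawl of that site.
    """
    # Per site: (sum of ages) / (number of crawls)
    site_averages = [mean(ages) for ages in ages_by_site.values()]
    # Per category: (sum of average site ages) / (site count)
    category_average = mean(site_averages)
    # Standard deviation: sqrt((sum of squared differences) / site count)
    squared_diffs = [(a - category_average) ** 2 for a in site_averages]
    return category_average, sqrt(sum(squared_diffs) / len(site_averages))


# Hypothetical ages in days for three sites in one category.
sample = {
    "site-a": [1.0, 2.0, 3.0],
    "site-b": [4.0, 4.0],
    "site-c": [6.0],
}
average, std_dev = category_stats(sample)
print(f"category average age: {average:.2f} days, standard deviation: {std_dev:.2f} days")
```

The same function works for freshness if the lists hold the 0/1 freshness value of each crawl instead of ages.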
Data Analysis
• Age comparison between Google and Bing
• Conclusions:
  • Google's database is much more up to date than Bing's
  • Google crawls news sites more than once a day
  • Google's crawling cycle is mostly consistent across categories
  • Google's average crawling cycle is 0.8 days
  • Bing's average crawling cycle is 4.6 days
Data Analysis
• Freshness comparison between Google and Bing
• Conclusions:
  • News sites change frequently, so even though the age for news sites is low, the cached page is usually not fresh
  • Google's average freshness is 0.65
  • Bing's average freshness is 0.28
Data Analysis
• Comparison of standard deviation across domains
• Conclusions:
  • Google's standard deviation is low, which indicates that the category of a site is not a major factor in deciding how frequently it is crawled
  • The same inference does not apply to Bing
Data Analysis
• Alexa rank (x-axis) vs Google cache age (y-axis)
• Conclusion:
  • Google crawls sites with high traffic more frequently
Data Analysis
• Alexa rank (x-axis) vs Bing cache age (y-axis)
• Conclusion:
  • Bing's crawling is uniform across sites with varying traffic volume
Data Analysis
• Date modified vs crawl date
• Conclusion:
  • For high-ranking sites, Google's crawling seems to adapt to changes in the original site, while Bing's crawling is uniform
Data Analysis
• Date modified vs crawl date
• Conclusion:
  • For low-ranking sites, both Google's and Bing's crawling seem to be uniform
Conclusions
• Google freshness policy factors identified
  • Popularity/traffic volume
  • Category is not considered
  • Frequency of change of a page affects the crawling cycle – adaptive!
• Bing freshness policy factors identified
  • Site popularity is not considered
  • Category is considered
  • Frequency of change of a page affects the crawling cycle – adaptive!
Limitations and Future Work
• Limitations
  • Conclusions are drawn from a limited random data sample because of:
    • Crawling restrictions on Google's cached data
    • Changes in Bing's cached links every time Bing's cached repository is updated
  • A larger time frame is required to identify the crawling behavior of each search engine
  • High freshness was observed for Nutch because the crawling interval was short
• Future Work
  • Additional factors, such as the number of incoming and outgoing links, can be recorded and their correlation with crawling frequency observed
  • Factors such as ranking, popularity and number of outgoing links can be incorporated into the Nutch adaptive fetch policy