Freshness Policy

Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar
Department of Computer Science, University of Southern California, Los Angeles, CA

Freshness Policy
• A freshness policy, also known as a revisit policy, determines the order and time at which a crawler re-crawls web pages.
• By the time a web crawler has finished its crawl, many events may have occurred, including creations, updates and deletions, which make the crawled data out of date.
• To display the latest results to the user, a search engine must have an efficient revisit policy.
• An efficient revisit policy not only saves time and bandwidth but also keeps the search engine's data up to date.

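Metrics for Evaluation of Freshness Policy
• Two metrics describe how up to date the local copy of a page is:
• Freshness: a binary measure that indicates whether the local copy is accurate. The freshness of a page p in the repository at time t is defined as:
  $$F_p(t) = \begin{cases} 1 & \text{if } p \text{ is up to date at time } t \\ 0 & \text{otherwise} \end{cases}$$
• Age: a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:
  $$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}$$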
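The two metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names and the assumption that a single "last modified" timestamp is known for the live page are ours.

```python
from datetime import datetime, timedelta


def freshness(last_crawl: datetime, last_modified: datetime) -> int:
    """Return 1 if the local copy was taken at or after the last change, else 0."""
    return 1 if last_crawl >= last_modified else 0


def age(t: datetime, last_crawl: datetime, last_modified: datetime) -> timedelta:
    """Return zero while the copy is up to date, else the time since the page changed."""
    if last_crawl >= last_modified:
        return timedelta(0)
    return t - last_modified


# Hypothetical timestamps: the page changed 12 hours after it was crawled.
crawl = datetime(2011, 1, 1, 12, 0)
modified = datetime(2011, 1, 2, 0, 0)
now = datetime(2011, 1, 3, 0, 0)

print(freshness(crawl, modified))  # stale copy
print(age(now, crawl, modified))   # time elapsed since the page changed
```

Freshness only says whether the copy is stale; age says for how long, which is why the two can disagree on pages that change often.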

Methodology
• Tracked more than 90 sites over a period of 2 weeks.
• Divided them into 4 categories:
  • Movies
  • Technology
  • Education
  • News
• Selected sites based on Alexa traffic rankings – Rohan and Binoy
• Developed a crawler in Java to download the original page as well as the Google and Bing cached versions of each web page twice a day – Binoy and Rohan
• Implemented our own code to extract the date and time from the cache for each web page – Rohan
• Implemented our own diff functionality to detect changes in a web page over time; it ignores HTML tags and scripts and considers only the data between the tags – Madhura
• Data integration – Madhura
• Data analysis – Binoy, Rohan and Madhura
• Study of the Nutch adaptive fetch policy – Binoy
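The diff step described above (compare only the text between tags, ignoring markup and scripts) might look roughly like the following. The study's crawler was written in Java and its diff code is not shown, so this Python sketch using the standard-library `html.parser` is only an illustration; the class and function names are hypothetical, and skipping `<style>` blocks is an extra assumption.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text between tags, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # normalise whitespace
            if text:
                self.parts.append(text)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def content_changed(old_html: str, new_html: str) -> bool:
    """True if the text between the tags differs; markup and scripts are ignored."""
    return visible_text(old_html) != visible_text(new_html)


old = '<html><body><p class="a">Hello</p><script>var t=1;</script></body></html>'
same_text = '<html><body><p class="b">Hello</p><script>var t=2;</script></body></html>'
new_text = '<html><body><p class="a">Hello world</p></body></html>'

print(content_changed(old, same_text))  # only markup and script differ
print(content_changed(old, new_text))   # visible text differs
```

Comparing extracted text rather than raw HTML avoids counting rotating script tokens or attribute changes as content updates.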

NUTCH 1.2 Setup
• Installed Nutch with Lucene on a local machine for crawling
• Settings used for Nutch crawling:
  <name>db.fetch.interval.default</name><value>172800</value>
  <name>db.fetch.schedule.class</name><value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <name>db.fetch.schedule.adaptive.inc_rate</name><value>0.4</value>
  <name>db.fetch.schedule.adaptive.dec_rate</name><value>0.2</value>
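To make the effect of these settings concrete, here is a simplified Python sketch of the multiplicative rule that the two rates suggest: the interval shrinks by `dec_rate` when a page was found modified and grows by `inc_rate` when it was not. It is not Nutch's actual code; the real `AdaptiveFetchSchedule` has additional logic, and the `min_interval` and `max_interval` bounds used here are assumed values.

```python
def next_fetch_interval(
    interval: float,
    modified: bool,
    inc_rate: float = 0.4,
    dec_rate: float = 0.2,
    min_interval: float = 60.0,         # assumed lower bound, in seconds
    max_interval: float = 31536000.0,   # assumed upper bound (365 days), in seconds
) -> float:
    """Return the next re-fetch interval in seconds under a simple adaptive rule."""
    if modified:
        interval *= 1.0 - dec_rate   # page changed: revisit sooner
    else:
        interval *= 1.0 + inc_rate   # page unchanged: revisit later
    return max(min_interval, min(interval, max_interval))


default = 172800.0  # db.fetch.interval.default: 2 days in seconds
print(next_fetch_interval(default, modified=True))
print(next_fetch_interval(default, modified=False))
```

Starting from the 2-day default, one detected change shortens the interval to about 1.6 days, and one unchanged fetch lengthens it to about 2.8 days.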

Nutch Crawling Snapshot
• Average freshness achieved with the Nutch fetch policy: 0.5

Data Integration and Calculations (Excel)
• Data snippet after integration
• Age and freshness calculations
• Per site
  Average Age per Site = (Sum of Ages) / (Number of Crawls)
  Average Freshness per Site = (Sum of Freshness Values) / (Number of Crawls)
• Per category
  Average Age per Category = (Sum of Average Site Ages) / (Site Count)
  Average Freshness per Category = (Sum of Average Site Freshness Values) / (Site Count)
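The two-level averaging above (first per site across crawls, then per category across sites) can be written as a short Python sketch. The study did this in Excel; the site names and observations below are invented purely for illustration.

```python
def average(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


# Hypothetical freshness observations (1 = fresh, 0 = stale), one per crawl.
freshness_by_site = {
    "site-a.example": [1, 0, 1, 1],
    "site-b.example": [0, 0, 1, 0],
}

# Per site: sum of freshness values / number of crawls.
site_avg = {site: average(obs) for site, obs in freshness_by_site.items()}

# Per category: sum of the per-site averages / site count.
category_avg = average(list(site_avg.values()))

print(site_avg)
print(category_avg)
```

Averaging the per-site averages gives every site equal weight in its category, regardless of how many crawls succeeded for it. The same `average` helper applies to ages.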

Standard Deviation
• Standard Deviation in Age for a Category (Days) = sqrt[ (Sum of Squared Age Differences) / (Site Count) ]
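A small Python sketch of this formula follows. The slide does not spell out what "age difference" is measured against; the sketch assumes it is each site's average age minus the category's average age, which makes this the population standard deviation. The input values are hypothetical.

```python
import math


def category_std_dev(site_ages):
    """Population standard deviation of per-site average ages, in days."""
    n = len(site_ages)
    mean_age = sum(site_ages) / n
    squared_diffs = sum((a - mean_age) ** 2 for a in site_ages)
    return math.sqrt(squared_diffs / n)


# Hypothetical per-site average ages (days) for one category.
ages = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(category_std_dev(ages))
```

Dividing by the site count rather than by the site count minus one treats the tracked sites as the whole population, not as a sample from it.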

Data Analysis
• Age comparison between Google and Bing
• Conclusions:
  • Google's database is much more up to date than Bing's
  • Google crawls news sites more than once a day
  • Google's crawling cycle is mostly consistent across categories
  • Google's average crawling cycle is 0.8 days
  • Bing's average crawling cycle is 4.6 days

Data Analysis
• Freshness comparison between Google and Bing
• Conclusions:
  • News sites change frequently, so even though the age for news sites is low, the cached page is usually not fresh
  • Google's average freshness is 0.65
  • Bing's average freshness is 0.28

Data Analysis
• Comparison of standard deviation across domains
• Conclusions:
  • Google's standard deviation is low, which indicates that a site's category is not a major factor in deciding crawl frequency
  • The same inference does not apply to Bing

Data Analysis
• Alexa rank (x-axis) vs Google cache age (y-axis)
• Conclusion:
  • Google crawls high-traffic sites more frequently

Data Analysis
• Alexa rank (x-axis) vs Bing cache age (y-axis)
• Conclusion:
  • Bing crawling is uniform across sites with varying traffic volume

Data Analysis
• Date modified vs crawl date
• Conclusion:
  • For high-ranking sites, Google's crawling appears to adapt to changes on the original site, while Bing's crawling is uniform

Data Analysis
• Date modified vs crawl date
• Conclusion:
  • For low-ranking sites, both Google's and Bing's crawling appear to be uniform

Conclusions
• Google freshness policy factors identified:
  • Popularity/traffic volume is considered
  • Category is not considered
  • Frequency of change of a page affects the crawling cycle (adaptive)
• Bing freshness policy factors identified:
  • Site popularity is not considered
  • Category is considered
  • Frequency of change of a page affects the crawling cycle (adaptive)

Limitations and Future Work
• Limitations
  • Conclusions are drawn from a limited random data sample because of:
    • Crawling restrictions on Google cached data
    • Bing cached links changing every time Bing's cached repository is updated
  • A longer time frame is required to identify the crawling behavior of each search engine
  • High freshness was observed for Nutch because the crawling interval was short
• Future Work
  • Additional factors, such as the number of incoming and outgoing links, can be recorded and their correlation with crawling observed
  • Factors such as ranking, popularity and number of outgoing links can be incorporated into the Nutch adaptive fetch policy
