
Measuring redundancy level on the Web






Presentation Transcript


  1. Measuring redundancy level on the Web

    Alexander Afanasyev, Jiangzhe Wang, Chunyi Peng, Lixia Zhang
    Presented by Lixia Zhang
    AINTEC 2011, November 11, 2011
  2. Redundancy on the Internet
    Potential redundancy sources:
    - the Internet is getting more and more commercialized
    - contextual ads are one of the growing industries
    - people try to game search engines to attract more visitors and earn more revenue from ads
    - duplicating content is one of the ways to accomplish this
    What is wrong with such "malicious" redundancy?
    - keyword search ambiguity
    - PageRank helps, but may fail: how to rank the results? will PageRank put the originals on top? how to find the original source?
  3. Redundancy
    Natural redundancy:
    - Web interfaces for mailing list archives: multiple servers provide different access to the same data
    - news articles reposted from news feeds
    - others
    Malicious redundancy and plagiarism:
    - a substantial portion of the textual content of one page is repeated in exact form on another page
  4. Do we know the degree of malicious redundancy?
  5. What do we want to know?
    - Given a page, how many pages on the Web duplicate (are redundant with) its content?
      Use a search engine to discover such pages: search for random 32-word phrases.
    - Can a page on the Web be uniquely identified in a search engine by a random phrase from the page?
    - Is the distribution of redundancies bi-modal (new pages vs. bots that copy content wholesale to attract more visitors)?
    - What factors may affect page redundancy?
      - search engine (Google, Yahoo, Bing)
      - link source for the sampling sets (DMOZ, delicious)
      - topic (Recreation, Sports, Home, Health, Computer, News, Food, Games, Research, Culture)
  6. Why we chose this methodology
    - It is unrealistic to analyze the whole Internet; a representative random set of pages is enough to see general trends.
    - Existing search engines have already indexed a good portion of the Internet.
    - Only pages indexed by the search engines are "visible" to users: the primary way to discover content is through Google/Bing/Yahoo, etc.
  7. Methodology
    Sampling sets:
    - DMOZ (www.dmoz.org): lists all "good" websites by category.
      Crawl: randomly choose 2% of links in a given category.
    - delicious (www.delicious.com): lists recently bookmarked links, real links that people care about.
      Crawl: randomly choose 80% of links in a given category (bookmarking is already a random process).
    Phrase extraction:
    - Download the page and extract all textual information from it.
    - Split the text into sentences (MorphAdorner Java library); all HTML elements that break flow, such as <br />, force the end of a sentence.
    - Eliminate sentences fewer than 5 words long.
    - Randomly pick a sentence and take up to 32 words for a search phrase from that sentence and the subsequent sentences.
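The phrase-extraction steps above can be sketched in Python. This is an illustrative stand-in, not the authors' code: the original uses the MorphAdorner Java library for sentence splitting, while this sketch substitutes a naive regex splitter, and the list of flow-breaking HTML tags is an assumption.

```python
import random
import re

def extract_phrase(html_text, max_words=32, min_sentence_words=5, rng=None):
    """Pick a random search phrase of up to `max_words` words from a page.

    Simplified sketch of the slide's pipeline: split into sentences, drop
    sentences shorter than `min_sentence_words`, then take words from a
    randomly chosen sentence and the ones following it.
    """
    rng = rng or random.Random()
    # Flow-breaking HTML elements (e.g. <br />) force the end of a sentence.
    text = re.sub(r'<(?:br|p|div|li|h[1-6])[^>]*>', '. ', html_text, flags=re.I)
    text = re.sub(r'<[^>]+>', ' ', text)  # strip any remaining tags
    sentences = [s.split() for s in re.split(r'[.!?]+', text)]
    sentences = [s for s in sentences if len(s) >= min_sentence_words]
    if not sentences:
        return None
    start = rng.randrange(len(sentences))
    # Words from the chosen sentence and all subsequent ones, capped at max_words.
    words = [w for s in sentences[start:] for w in s]
    return ' '.join(words[:max_words])
```

The 5-word minimum avoids querying phrases too short to be discriminative, and capping at 32 words matches the limit the measurement uses for search queries.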
  8. Actual architecture of our measurement
    dmoz / delicious → Crawler (sampler) → Repository → Page downloader → Get phrase → Searcher → PostProc (estimates, URLs) → Analyzer
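A minimal sketch of how the stages of this architecture compose, under stated assumptions: only the sampling rates (2% for DMOZ, 80% for delicious) come from the methodology slide; `download`, `get_phrase`, and `search` are hypothetical stand-ins for the real components.

```python
import random

def sample_links(links, rate, rng=None):
    """Crawler/sampler stage: randomly keep `rate` of the crawled links
    (2% for DMOZ, 80% for delicious)."""
    rng = rng or random.Random()
    return rng.sample(links, int(len(links) * rate))

def run_pipeline(links, rate, download, get_phrase, search, rng=None):
    """Repository -> page downloader -> phrase extraction -> searcher -> post-processing.

    `search` is assumed to return an (estimate, urls) pair, mirroring the
    "estimate, URLs" output of the PostProc stage on the slide.
    """
    results = {}
    for url in sample_links(links, rate, rng):
        phrase = get_phrase(download(url))
        estimate, urls = search(phrase)
        results[url] = (estimate, urls)
    return results  # handed to the analyzer
```

Keeping the stages as plug-in callables makes it easy to swap search engines (Google, Yahoo, Bing) without touching the sampling or extraction code.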
  9. Country TLD coverage in the sampling sets
    [figure: country-TLD distributions; both the DMOZ and delicious sampling sets show good coverage]
  10. Results
    Query estimates:
    - overall histograms for Google, Yahoo, and Bing
    - log-log scale histograms by sampling set: DMOZ and delicious
    - log-log scale cumulative distribution functions, by engine and by engine and category (Recreation, Sports, Home, Health, Computer, News, Food, Games, Research, Culture)
    - histograms for query estimates by engine and category
    Plus one more interesting result, not really related to redundancy.
  11. Histograms for query estimates in the range from 1 to 60
  12. Log-log scale histograms for query estimates
    [figure: histograms annotated with power-law fits e^-2.1, e^-2, e^-1.5 and e^-2, e^-1.95, e^-1.5]
  13. CDF of page redundancies (log-log)
    - Google: 75% of queries returned fewer than 10 results.
    - Yahoo and Bing: 90% of queries returned fewer than 10 results.
    - Similar results across engines.
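Reading a point off such a CDF is simple arithmetic; as a hedged illustration (my own sketch, not the authors' analysis code), the "75% of queries returned fewer than 10 results" figure corresponds to:

```python
def fraction_below(estimates, threshold=10):
    """Empirical CDF value at `threshold`: the fraction of query estimates
    strictly below it (e.g. queries returning fewer than 10 results)."""
    return sum(1 for e in estimates if e < threshold) / len(estimates)
```

For example, `fraction_below([1, 1, 2, 50])` gives 0.75, the shape of the claim made for Google above.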
  14. CDF of page redundancies by category (log-log scale)
    [panels: Google, Yahoo, Bing]
  15. CDF of page redundancies (log-log scale): a closer look at Google
  16. Results by category
  17. Main results
    - Overall, the redundancy level is negligible: most webpages are not replicated at all, and the majority of replicated pages are duplicated only a very limited number of times.
    - Some popular pages are replicated a large number of times.
    - Across all page categories, the replication distribution has similar characteristics: a power-law-like distribution.
  18. Summary
    Small-scale redundancy measurement:
    - 100,000 links sampled from DMOZ and delicious
    - categories: Recreation, Sports, Home, Health, Computer, News, Food, Games, Research, Culture
    - 100K queries each to Google, Yahoo, and Bing
    Findings:
    - Most pages are uniquely identified in the search index by a random 32-word phrase.
    - The pattern is consistent across topics, with slight differences: the Recreation/Sports topics have an increased number of redundant pages compared to other categories.
    - Results are almost consistent across the big search engines: Google tends to return a higher number of links (estimates) for the same queries, while Bing and Yahoo yield practically the same results.
  19. Thank you! Questions?