Evolution of Web from a Search Engine Perspective Saket Singam sks2141@columbia.edu
Introduction
• Larger and more diverse growth of the Web => search engines are becoming the “killer application”
• Search engines typically “crawl” web pages in advance
• Discussion
  1) What’s new on the Web?
     • New pages are created at a rate of 8% per week
     • Only 20% of Web pages are still accessible after 1 year
     • New pages borrow content from existing pages: 62% of the content in these new pages is new; after 1 year, 50% of the Web has new content
  2) How much change?
     • Once a page is created, it is likely to go through either a minor change or no change
  3) Can we predict future changes?
     • Frequency of change
     • Degree of change
Experimental Setup
• Selection of sites
  • “Representative” as well as “interesting” samples of the Web
  • About 5 top-ranked pages from a subset of the topical categories of the Google Directory
• Download of pages (for almost a year)
  • Pages from 154 “popular” Web sites
  • Downloaded weekly in breadth-first order, starting from the root page of each Web site, until all reachable pages or a maximum of 200,000 pages had been fetched
  • Total number of pages per weekly download = 3-5 million (avg. 4.4 million)
  • Size = 65 GB per week before compression
  • Total of 3.3 TB of Web history data and 4 TB of derived data (links, shingles)
• Table: “Fraction of pages included in this Experiment”
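
The weekly download policy described above (breadth-first from the root page, capped at a maximum number of pages) can be sketched as follows. This is a minimal illustration, not the crawler used in the study: the `links` dictionary stands in for fetching a page and extracting its links, and the function name and toy site are hypothetical.

```python
from collections import deque

def bfs_crawl_order(links, root, max_pages=200_000):
    """Return pages in breadth-first order from `root`, stopping at `max_pages`.

    `links` maps a page to the pages it links to (an in-memory stand-in for
    downloading the page and parsing its outgoing links).
    """
    visited = {root}          # pages already queued, so none is fetched twice
    order = []                # pages in the order they would be downloaded
    queue = deque([root])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order

# Hypothetical toy site used only for illustration.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/"],
}

print(bfs_crawl_order(site, "/"))
print(bfs_crawl_order(site, "/", max_pages=3))
```

Breadth-first order means that when the page cap is hit, the pages closest to the root are the ones that have been kept.
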
What’s New on the Web?: Pages, Content and Links
• Weekly birth rate of pages
  • How many new pages are created per week?
  • Identity: a page is identified by its URL
  • Average weekly birth rate is 8%
  • About once every month, the number of new pages is higher than in the previous week
• Birth, death and replacement
  • How many pages are created, disappear, and are replaced?
  • Crawling in slow mode over a period of 39 weeks
  • 20% survival rate of Web pages
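
One way to compute the two quantities on this slide from weekly snapshots is sketched below. The definitions are assumptions made for illustration: the birth rate of a week is taken as the fraction of that week's URLs not seen in any earlier snapshot, and the survival rate as the fraction of first-week URLs still present in the last snapshot. The function names and toy data are hypothetical.

```python
def weekly_birth_rates(snapshots):
    """Fraction of each week's URLs that appeared in no earlier snapshot.

    `snapshots` is a list of sets of URLs, one set per weekly crawl.
    Assumed definition; the first week has no birth rate.
    """
    seen = set(snapshots[0])
    rates = []
    for urls in snapshots[1:]:
        new_urls = urls - seen
        rates.append(len(new_urls) / len(urls) if urls else 0.0)
        seen |= urls
    return rates

def survival_rate(snapshots):
    """Fraction of first-week URLs still present in the last snapshot."""
    first, last = snapshots[0], snapshots[-1]
    return len(first & last) / len(first)

# Hypothetical toy snapshots: three weekly crawls of one small site.
weeks = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "e", "f", "g"},
]

print(weekly_birth_rates(weeks))
print(survival_rate(weeks))
```
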
Creation of New Content
• How much new content is present?
  • Shingling technique used
  • w-shingle: a contiguous ordered subsequence of “w” words
  • New shingles are created at a slower rate than new pages
  • New shingles @ 5% per week vs. new pages @ 8% per week => about 62% (5/8) of the content of new URLs is new
• Link-structure evolution
  • Search engines should efficiently capture the link structure
  • Significantly dynamic structure
  • New links are created @ 25% per week, as compared to 8% for new pages and 5% for new content
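
The w-shingle definition above can be illustrated with a short sketch. Whitespace tokenization, the choice of `w`, and the function names are assumptions made for this example; the slides do not specify the parameters used in the study.

```python
def shingles(text, w=4):
    """Return the set of w-shingles (contiguous runs of w words) in `text`."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def new_shingle_fraction(old_text, new_text, w=4):
    """Fraction of the shingles in `new_text` that do not occur in `old_text`."""
    old = shingles(old_text, w)
    new = shingles(new_text, w)
    if not new:
        return 0.0
    return len(new - old) / len(new)

# Hypothetical example: one word changes between two versions of a page.
old_version = "the quick brown fox jumps over the lazy dog"
new_version = "the quick brown fox leaps over the lazy dog"

print(len(shingles(old_version, w=3)))
print(new_shingle_fraction(old_version, new_version, w=3))
```

With w = 3, a single changed word affects the three shingles that contain it, so 3 of the 7 shingles in the new version are new. A page that mostly copies existing text therefore contributes few new shingles even though its URL is new.
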
Changes in the Existing Pages
• Change frequency distribution (presence of change)
  • How often is the Web page “altered”?
  • Most pages change either very frequently or very infrequently
• Degree of change (SEO)
  • Metric: TF.IDF word distance
  • Exact order of terms is ignored
  • Small updates such as advertisements, counters, etc. are detected as changes, but they alter the page content only slightly
  • Search engines can exploit this by re-downloading only the pages that have actually been revised
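
A rough sketch of a TF.IDF-based word distance between two versions of a page follows. The exact weighting used in the study is not given on the slide, so this example assumes a smoothed IDF and cosine distance; the function names and sample texts are hypothetical. Because it works on word counts, term order is ignored, as the slide states.

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    """Map each term to tf * idf, using a smoothed idf (an assumption)."""
    counts = Counter(words)
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in counts.items()
    }

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0 if norm_u == norm_v else 1.0
    return 1.0 - dot / (norm_u * norm_v)

def tfidf_distance(old_text, new_text, corpus):
    """Distance between two page versions, with idf estimated from `corpus`."""
    docs = [doc.lower().split() for doc in corpus]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    u = tfidf_vector(old_text.lower().split(), doc_freq, len(docs))
    v = tfidf_vector(new_text.lower().split(), doc_freq, len(docs))
    return cosine_distance(u, v)

# Hypothetical page versions: only a visitor counter changes.
old_page = "welcome to our site visitor count 1041 latest news below"
new_page = "welcome to our site visitor count 1042 latest news below"
other_page = "completely different text about cooking pasta"
corpus = [old_page, new_page, other_page]

print(tfidf_distance(old_page, new_page, corpus))    # small: minor change
print(tfidf_distance(old_page, other_page, corpus))  # large: unrelated content
```
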
Predictability
• Overall predictability
  • Metrics:
    • Group A (Red): top 80%
    • Group B (Yellow): top 80-90%
    • Group C (Green): top 90-95%
    • Group D (Blue): remaining pages
  • Why is this degree of predictability required?
• Predictability of an individual site
  • Individual sites www.columbia.edu and www.eonline.com considered for study
Conclusion
• Aspects of the evolving Web that are of particular interest for search engine design have been studied in this research over a period of 1 year
• Existing pages are being removed “rapidly” from the Web and replaced by new ones, whereas the new pages tend to borrow their content from existing ones
• Pages that change significantly over time have a predictable degree of change
• The link structure is evolving at a faster rate than most of the pages themselves
• The goal is to maximize search quality by making effective use of the available resources to incorporate the changes
Thank You