Datamining the Internet: Alexa
Brewster Kahle
President, Alexa Internet
brewster@alexa.com
To Answer Any Question...
• Know a lot
• Know what is important
• Be right enough
Alexa: The web navigation service that learns from people
Know a Lot: Other Repositories
• Library of Alexandria: 800GB (400k scrolls @ 2MB)
• Library of Congress: 20TB (20M books, ASCII)
• Dialog Information Service: 3-5TB
• Video Store: 8TB (5k videos, 1GB/hr)
• Public Branch Library: 3TB (300k scanned books)
• Radio Station: 1TB (15k hrs of music)
• ...
Alexa’s Internet Archive: 10TB
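The repository sizes above are back-of-envelope products of an item count and a per-item size. A minimal sketch of that arithmetic follows, assuming decimal units (1 GB = 10^9 bytes), which the slide's round numbers suggest but do not state. The variable names are illustrative, and the per-item sizes for entries that give only a count and a total are derived here, not quoted from the slide.

```python
# Decimal units are an assumption; the slide does not say which convention it uses.
MB = 10**6
GB = 10**9
TB = 10**12

# Library of Alexandria: 400k scrolls at 2 MB each.
alexandria = 400_000 * 2 * MB

# Implied per-item figures where the slide gives only a count and a total.
loc_per_book = 20 * TB / 20_000_000        # Library of Congress, bytes per ASCII book
branch_per_book = 3 * TB / 300_000         # public branch library, bytes per scanned book
video_hours_each = (8 * TB / GB) / 5_000   # video store, hours per video at 1 GB/hr
radio_per_hour = 1 * TB / 15_000           # radio station, bytes per hour of music

print(f"Alexandria: {alexandria / GB:.0f} GB")
print(f"Library of Congress: {loc_per_book / MB:.0f} MB per book")
print(f"Branch library: {branch_per_book / MB:.0f} MB per scanned book")
print(f"Video store: {video_hours_each:.1f} hours per video")
print(f"Radio station: {radio_per_hour / MB:.1f} MB per hour")
```

Under these assumptions the Alexandria figure reproduces the slide's 800GB, and a scanned book comes out roughly ten times the size of an ASCII one.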
Know a Lot: Gathering

Protocol     Number of sites   Total data   New stuff (GB/month)
WWW (http)   640,000           4,000 GB     1,000
gopher       5,000             100 GB       ?
ftp          10,000            5,000 GB     ?
NetNews      33,000 groups     500 GB       30

• Web snapshot on a T3 in 20 days
• Users' paths essential as well
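As a rough check on the "T3 in 20 days" figure, the sketch below computes the sustained throughput needed to move the table's 4,000 GB of WWW data in 20 days and compares it with the nominal T3 line rate of 44.736 Mbit/s. It assumes the snapshot is the full 4,000 GB, decimal gigabytes, and no protocol overhead; none of these is stated on the slide.

```python
T3_BITS_PER_SEC = 44.736e6        # nominal T3 (DS3) line rate
SECONDS_PER_DAY = 86_400

snapshot_bytes = 4_000 * 10**9    # WWW total from the table (assumed decimal GB)
days = 20

# Average bit rate needed to transfer the whole snapshot in the given time.
required_bps = snapshot_bytes * 8 / (days * SECONDS_PER_DAY)
utilization = required_bps / T3_BITS_PER_SEC

print(f"Required: {required_bps / 1e6:.1f} Mbit/s")
print(f"Share of a T3: {utilization:.0%}")
```

On these assumptions the crawl needs about 18.5 Mbit/s sustained, a bit over 40% of the line, which leaves headroom for overhead and retries.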
Web Stats
• 1 million sites, doubling every 6 months (millions of authors)
• More videos, dynamic pages, Java, etc.
• 15 links on each page
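Doubling every 6 months is exponential growth. A small sketch of what that rate implies is below, assuming it simply continues unchanged. The function name and the choice of horizons are illustrative, and the output is an extrapolation, not a measurement.

```python
def projected_sites(months, initial=1_000_000, doubling_months=6):
    """Site count after `months`, assuming steady doubling every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)

for months in (6, 12, 24):
    print(f"{months:>2} months: {projected_sites(months):,.0f} sites")
```

At this rate the count quadruples each year, so a fixed-size crawl covers a shrinking share of the web unless its capacity grows equally fast.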
Storage
• Snapshot of the Web on a tape jukebox costs $80k
Knowing What Is Important: Mining the WWW for Quality
• Content: 100 million pages
• Link structure: 750 million links
• Usage paths: many 100 million hits
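The slide names link structure and usage paths as signals but does not say how they are mined. The sketch below is one simple possibility, not Alexa's method: count inbound links per page, and count which page users visit next along their paths. The page names and data are hypothetical toy values.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (source, destination) links and per-user page paths.
links = [("a", "b"), ("c", "b"), ("a", "c")]
paths = [["a", "b", "c"], ["a", "b"], ["c", "b", "a"]]

# Link structure: number of inbound links per page.
inlinks = Counter(dst for _, dst in links)

# Usage paths: how often each page is followed by each other page.
next_hops = defaultdict(Counter)
for path in paths:
    for src, dst in zip(path, path[1:]):
        next_hops[src][dst] += 1

def suggest_next(page):
    """Most frequently visited next page after `page`, or None if unseen."""
    hops = next_hops.get(page)
    return hops.most_common(1)[0][0] if hops else None

print(inlinks.most_common())
print(suggest_next("a"))
```

Counting is the crudest way to use either signal; it is shown here only to make concrete what "mining" links and paths could mean.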
Be Right Enough: Being Useful
• Competition
  • Directories:
    • Biggest links to only < 1% of web pages
  • Search engines:
    • Returning 1000’s of hits (sometimes millions)
  • Trends:
    • Move to “channels” of less content, but good
    • Limit crawling (50M pages and holding)
Be Right Enough: Alexa
• Where am I?
• Where do I want to go?
• Alexa:
  • “Can I trust this information?”
  • What should I look at next?