
Datamining the Internet: Alexa

Datamining the Internet: Alexa. Brewster Kahle, President, Alexa Internet (brewster@alexa.com). To answer any question: know a lot, know what is important, be right enough. Alexa: the web navigation service that learns from people.



Presentation Transcript


  1. Datamining the Internet: Alexa Brewster Kahle President, Alexa Internet brewster@alexa.com

  2. To Answer Any Question... • Know a lot • Know what is important • Be right enough Alexa: The web navigation service that learns from people

  3. Know a Lot: Other Repositories • Library of Alexandria: 800 GB (400k scrolls @ 2 MB) • Library of Congress: 20 TB (20M books, ASCII) • Dialog Information Service: 3-5 TB • Video Store: 8 TB (5k videos, 1 GB/hr) • Public Branch Library: 3 TB (300k scanned books) • Radio Station: 1 TB (15k hrs of music) • . . . Alexa’s Internet Archive: 10 TB
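The repository figures on this slide are back-of-envelope multiplications, and a short sketch makes the implied per-item sizes explicit. It assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB), which the slide does not state; the function names are illustrative only.

```python
# Back-of-envelope check of the repository sizes quoted on the slide.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
MB_PER_GB = 1000
GB_PER_TB = 1000


def total_gb(items, mb_per_item):
    """Total size in GB for `items` objects of `mb_per_item` MB each."""
    return items * mb_per_item / MB_PER_GB


def implied_mb_per_item(total_tb, items):
    """Average size in MB per item implied by a total in TB."""
    return total_tb * GB_PER_TB * MB_PER_GB / items


# Library of Alexandria: 400k scrolls at 2 MB each.
alexandria_gb = total_gb(400_000, 2)

# Library of Congress: 20 TB over 20M books.
congress_mb_per_book = implied_mb_per_item(20, 20_000_000)

# Public branch library: 3 TB over 300k scanned books.
branch_mb_per_book = implied_mb_per_item(3, 300_000)

# Video store: 8 TB over 5k videos at 1 GB per hour gives hours per video.
video_hours_each = 8 * GB_PER_TB / 5_000 / 1

# Radio station: 1 TB over 15k hours of music.
radio_mb_per_hour = implied_mb_per_item(1, 15_000)

print(alexandria_gb, congress_mb_per_book, branch_mb_per_book)
print(video_hours_each, round(radio_mb_per_hour, 1))
```

Under these assumptions the Alexandria line reproduces the quoted 800 GB, an ASCII book works out to about 1 MB, and a scanned book to about 10 MB.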

  4. Know a Lot: Gathering

     Protocol     Number of Sites   Total Data   New stuff (GB/month)
     WWW (http)   640,000           4,000 GB     1000
     gopher       5,000             100 GB       ?
     ftp          10,000            5,000 GB     ?
     NetNews      33,000 groups     500 GB       30

     • Web snapshot on T3 in 20 days • Users’ paths essential as well
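The "snapshot on T3 in 20 days" claim can be sanity-checked against the 4,000 GB WWW total. The sketch below assumes the nominal T3 (DS3) line rate of 44.736 Mbit/s and assumes the snapshot is roughly the 4,000 GB figure; neither is stated on the slide, and protocol overhead is ignored.

```python
# Rough feasibility check: how much data fits through a T3 in 20 days?
# Assumptions (not from the slide): nominal T3/DS3 rate of 44.736 Mbit/s,
# no protocol overhead, and a snapshot size equal to the 4,000 GB WWW total.
T3_BITS_PER_SECOND = 44_736_000
SECONDS_PER_DAY = 86_400


def transfer_capacity_gb(bits_per_second, days):
    """Maximum data volume in GB (decimal) moved at full line rate."""
    bytes_total = bits_per_second / 8 * days * SECONDS_PER_DAY
    return bytes_total / 1e9


capacity_gb = transfer_capacity_gb(T3_BITS_PER_SECOND, 20)
web_gb = 4_000
utilization = web_gb / capacity_gb

print(f"capacity over 20 days: {capacity_gb:.0f} GB")
print(f"average utilization needed: {utilization:.0%}")
```

At full line rate the link could carry a little under 10 TB in 20 days, so a 4,000 GB crawl would need the link to run at roughly 40% average utilization.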

  5. 8 Terabytes so far

  6. Web Stats • 1 million sites, doubling every 6 months (millions of authors) • More videos, dynamic pages, Java etc. • 15 links on each page
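"Doubling every 6 months" is exponential growth, and a short sketch shows what it implies if the rate were to hold. The function name and the 12- and 24-month horizons are illustrative assumptions, not projections made in the talk.

```python
# Projection under a constant doubling time.
# Assumption: the 6-month doubling rate from the slide holds unchanged.
def projected_sites(initial_sites, months, doubling_months=6):
    """Number of sites after `months`, doubling every `doubling_months`."""
    return initial_sites * 2 ** (months / doubling_months)


after_one_year = projected_sites(1_000_000, 12)
after_two_years = projected_sites(1_000_000, 24)

print(f"after 12 months: {after_one_year:,.0f} sites")
print(f"after 24 months: {after_two_years:,.0f} sites")
```

Starting from 1 million sites, two doublings per year would give 4 million sites after one year and 16 million after two.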

  7. Storage Snapshot of the Web on Tape Jukebox costs $80k

  8. Knowing what is Important: Mining the WWW for Quality • Content: 100 million pages • Link Structure: 750 million links • Usage paths: many 100 million hits

  9. Be Right Enough: being useful • Competition • Directories: • Biggest only links to < 1% of web pages • Search Engines: • Returning 1000s of hits (sometimes millions) • Trends: • Move to “channels” of less content, but good • Limit crawling (50M pages and holding)

  10. Be Right Enough: Alexa • Where am I? • Where do I want to go? • Alexa: • “Can I trust this information?” • What should I look at next?

  11. Travel Agents

  12. Conde Nast Travel

  13. Ford Vehicles Homepage

  14. Ford’s Mustang Page

  15. Independent Mustang Page

  16. Surrealism Page

  17. Women Surrealists

  18. Archive in action

  19. Alexa Conclusion
