
The Information Avalanche: Reducing Information Overload

The Information Avalanche: Reducing Information Overload. Jim Gray, Microsoft Research. Onassis Foundation Science Lecture Series, http://www.forth.gr/onassis/lectures/2002-07-15/index.html, Heraklion, Crete, Greece, 15-19 July 2002.


Presentation Transcript


  1. The Information Avalanche: Reducing Information Overload. Jim Gray, Microsoft Research. Onassis Foundation Science Lecture Series, http://www.forth.gr/onassis/lectures/2002-07-15/index.html, Heraklion, Crete, Greece, 15-19 July 2002

  2. Thesis • Most new information is digital (and old information is being digitized) • A Computer Science Grand Challenge: • Capture • Organize • Summarize • Visualize this information • Optimize human attention as a resource • Improve information quality

  3. Information Avalanche • The Situation – a census of the data • We can record everything • Everything is a LOT! • The Good news • Changes science, education, medicine, entertainment,…. • Shrinks time and space • Can augment human intelligence • The Bad News • The end of privacy • Cyber Crime / Cyber Terrorism • Monoculture • The Technical Challenges • Amplify human intellect • Organize, summarize and prioritize information • Make programming easy.

  4. How much information is there? • Soon everything can be recorded and indexed • Most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies • Scale (figure): Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded • Small prefixes: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto • See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

  5. Information Census (Lesk; Lyman & Varian) • ~10 Exabytes • ~90% digital • > 55% personal • Print: .003% of bytes, 5 TB/y, but text has lowest entropy • Email is (10 Bmpd) 4 PB/y and is 20% text (estimate by Gray) • WWW is ~50 TB, deep web ~50 PB • Growth: 50%/y

  6. 93%

  7. Storage capacity beating Moore’s law • Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y • 1000 €/TB today • 100 €/TB in 2007

  8. Disk Storage Cheaper than Paper • File cabinet: cabinet (4 drawer) 250$; paper (24,000 sheets) 250$; space (2x3 @ 10€/ft2) 180$; total 700$; 0.03 €/sheet • Disk: disk (160 GB) 200$; ASCII: 500 m pages, 2e-7 €/sheet (10,000x cheaper); image: 1 m photos, 3e-4 €/photo (100x cheaper) • Store everything on disk

  9. Why Put Everything in Cyberspace? • Low rent: min $/byte • Shrinks time: now or later • Shrinks space: here or there • Automate processing: knowbots • Point-to-point OR broadcast • Immediate OR time-delayed • Locate, process, analyze, summarize

  10. Storage trends • Right now, it’s affordable to buy 100 GB/year • In 5 years you can afford to buy 1 TB/year! (assuming storage doubles every 18 months)
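The jump from 100 GB/year to 1 TB/year on slide 10 follows from the stated 18-month doubling period. A minimal sketch of that arithmetic, assuming a fixed yearly budget; the function name `affordable_capacity_gb` is illustrative and not from the talk:

```python
def affordable_capacity_gb(start_gb, years, doubling_months=18.0):
    """Capacity the same budget buys after `years`, assuming the
    capacity per unit of money doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return start_gb * 2 ** doublings


# Slide 10: 100 GB/year today, 5 years out, doubling every 18 months.
in_five_years = affordable_capacity_gb(100, 5)
print(f"{in_five_years:.0f} GB/year")  # roughly 1 TB/year
```

Five years is 3.33 doublings, a factor of about 10, which is where the 1 TB figure comes from.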

  11. Trying to fill a terabyte in a year

  12. Memex: As We May Think, Vannevar Bush, 1945 “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility” “yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

  13. Gordon Bell’s MainBrain™: Digitize Everything. A BIG shoebox? • Scans: 20 k “pages” TIFF @ 300 dpi, 1 GB • Music: 2 k “tracks”, 7 GB • Photos: 13 k images, 2 GB • Video: 10 hrs, 3 GB • Docs: 3 k (ppt, word, ..), 2 GB • Mail: 100 k messages, 3 GB • Total: 18 GB

  14. Gary Starkweather • Scan EVERYTHING • 400 dpi TIFF • 70k “pages” ~ 14 GB • OCR all scans (98% OCR recognition accuracy) • All indexed (5 second access to anything) • All on his laptop.

  15. Access!

  16. 50% personal. What about the other 50%? • Business • Wal-Mart online: 1 PB and growing…. • Paradox: most “transaction” systems have mere PBs. • Have to go to image/data monitoring for big data • Government • Online government is big thrust (cheaper, better,…) • Science

  17. Instruments: CERN LHC, Petabytes per Year • Looking for the Higgs particle • Sensors: 1000 GB/s (1 TB/s) • Events: 75 GB/s • Filtered: 5 GB/s • Reduced: 0.1 GB/s ~ 2 PB/y • Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB • (figure label: CERN Tier 0)

  18. LHC Requirements (2005- ) • 1E9 events pa @ 1MB/ev = 1PB/year/expt • Reconstructed = 100TB/recon/year/expt • Send to Tier1 Regional Centres • => 400TB/year to RAL? • Keep one set + derivatives on disk • …and rest on tape • But UK plans a Tier1 clone • Many data clones • Source: John Gordon, IT Department, CLRC/RAL, CUF Meeting, October 2000
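The per-experiment volumes on slide 18 can be checked with a few lines of arithmetic. This is a sketch using decimal units; the reading of the 400 TB/year figure as four experiments times 100 TB of reconstructed data is an assumption, since the slide does not spell it out:

```python
MB = 10**6
TB = 10**12
PB = 10**15

EVENTS_PER_YEAR = 10**9      # "1E9 events pa"
EVENT_SIZE_BYTES = 1 * MB    # "1MB/ev"

# Raw data per experiment per year.
raw_bytes = EVENTS_PER_YEAR * EVENT_SIZE_BYTES
raw_pb = raw_bytes / PB
print(f"raw: {raw_pb:g} PB/year/experiment")

# Assumption (not stated on the slide): four experiments, each shipping
# one 100 TB reconstruction per year to a Tier 1 centre.
ASSUMED_EXPERIMENTS = 4
RECON_TB_PER_EXPERIMENT = 100
recon_total_tb = ASSUMED_EXPERIMENTS * RECON_TB_PER_EXPERIMENT
print(f"reconstructed: {recon_total_tb} TB/year")
```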

  19. Science Data Volume: ESO/STECF Science Archive • 100 TB archive • Similar at Hubble, Keck, SDSS,… • ~1PB aggregate

  20. Data Pipeline: NASA • Level 0: raw data (data stream) • Level 1: calibrated data (measured values) • Level 1A: calibrated & normalized (flux/magnitude/…) • Level 2: derived data metrics (vegetation index) • Data volume: 0 ~ 1 ~ 1A << 2 • Level 2 >> level 1 because • MANY data products • Must keep all published data editions (versions) • Figure: 4 editions (E1 to E4, over time) of 4 Level 2 products; each is small, but… • EOSDIS Core System Information for Scientists, http://observer.gsfc.nasa.gov/sec3/ProductLevels.html

  21. TerraServer http://TerraService.net/ • 3 x 2 TB databases • 18 TB disk tri-plexed (= 6 TB) • 3 + 1 cluster • 99.96% uptime • 1B page views, 5B DB queries • Now a .NET web service

  22. Image Data • All in the database • 200x200 pixel tiles, compressed • Spatial access: z-transform B-tree • USGS aerial photos (“DOQ”): 1 m resolution, 12 TB, 95% U.S. coverage • USGS topo maps: 2 m resolution, 1 TB, 100% U.S. coverage • Encarta Virtual Globe: 1 km resolution, 100% world coverage
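Slide 22 mentions spatial access through a z-transform on a B-tree without giving details. One common way to get that effect is a Z-order (Morton) key, which interleaves the bits of a tile's x and y coordinates so that nearby tiles tend to sort near each other in an ordinary one-dimensional index. The sketch below assumes that interpretation; it is not taken from the TerraServer implementation, and `z_key` is a hypothetical name:

```python
def z_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1,
    so sorting by the result walks the grid in a Z-shaped curve.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# The four tiles of a 2x2 block get consecutive keys.
block = sorted((z_key(x, y), (x, y)) for x in range(2) for y in range(2))
print(block)
```

Storing `z_key(x, y)` as the primary key of a tile table lets a plain B-tree serve range scans over small rectangular areas with few seeks.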

  23. Hardware • 8 Compaq DL360 “Photon” web servers • 4 Compaq ProLiant 8500 DB servers (figure labels: SQL\Inst1, SQL\Inst2, SQL\Inst3, Spare) • Fiber SAN switches • One SQL database per rack • Each rack contains 4.5 TB • 261 total drives / 13.7 TB total • Metadata stored on 101 GB “fast, small disks” (18 x 18.2 GB) • Imagery data stored on 4 339 GB “slow, big disks” (15 x 73.8 GB) • To add 90 72.8 GB disks in Feb 2001 to create 18 TB SAN

  24. TerraServer Lessons Learned • Hardware is 5 9’s (with clustering) • Software is 5 9’s (with clustering) • Admin is 4 9’s (offline maintenance) • Network is 3 9’s (mistakes, environment) • Simple designs are best • 10 TB DB is management limit; 1 PB = 100 x 10 TB DB; this is 100x better than 5 years ago • Minimize use of tape • Backup to disk (snapshots) • Portable disk TBs
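The "nines" on slide 24 translate directly into allowed downtime. A small sketch of the conversion, assuming a 365-day year; the helper name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines):
    """Yearly downtime implied by an availability of `nines` nines,
    e.g. nines=3 means 99.9% available."""
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR


for label, nines in [("hardware/software", 5), ("admin", 4), ("network", 3)]:
    print(f"{label}: {nines} nines -> "
          f"{downtime_minutes_per_year(nines):.1f} min/year")

# The 99.96% uptime quoted for TerraServer on slide 21, in hours per year.
terraserver_hours_down = (1 - 0.9996) * 365 * 24
print(f"99.96% uptime -> {terraserver_hours_down:.1f} hours/year")
```

Five nines allows about five minutes a year, while three nines allows almost nine hours, which is why the network term dominates the list.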

  25. Sensor Applications • Earth Observation • 15 PB by 2007 • Medical Images & Information + Health Monitoring • Potential 1 GB/patient/y → 1 EB/y • Video Monitoring • ~1E8 video cameras @ 1E5 MBps → 10TB/s → 100 EB/y filtered??? • Airplane Engines • 1 GB sensor data/flight • 100,000 engine hours/day • 30PB/y • Smart Dust: ?? EB/y http://robotics.eecs.berkeley.edu/~pister/SmartDust/ http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html

  26. What do they do with the data? (business, government, science; more later in talk) • Look for anomalies • 1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2 • Look for trends and patterns • 1, 2, 3, 4, 5 • Look for correlations • ln(x) – ln(y) ~ c ln(z) • Look at summaries then drill down to details • LOTS of histograms • (plot axes in figure: x, y, ln(z), ln(x/y))
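As an illustration of "look for anomalies" on slide 26, the sketch below flags the outlier in the slide's own sample sequence. The method, a robust z-score based on the median absolute deviation with a cutoff of 3.5, is a swapped-in textbook technique chosen for this example; the talk does not say which detector is meant:

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return (index, value) pairs whose robust z-score exceeds `threshold`.

    The robust z-score is 0.6745 * |v - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to compare against
    return [(i, v) for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


series = [1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2]  # sequence from the slide
anomalies = find_anomalies(series)
print(anomalies)  # [(8, -5)]
```

The median-based score is used instead of mean and standard deviation because a single large outlier inflates the standard deviation and can partly hide itself.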

  27. Premise: Grid Computing • Store exabytes once or twice (for redundancy) • Access them from anywhere • Implies huge archive/data centers • Supercomputer centers become super data centers • Examples: Google, Yahoo!, Hotmail, CERN, Fermilab, SDSC

  28. Bandwidth: 3x bandwidth/year for 25 more years • Today: • 40 Gbps per channel (λ) • 12 channels per fiber (WDM): 500 Gbps • 32 fibers/bundle = 16 Tbps/bundle • In lab 3 Tbps/fiber (400 x WDM) • In theory 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • Aggregate bandwidth doubles every 8 months!
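The rounded figures on slide 28 can be reproduced from the per-channel rate, and the "3x per year" headline is consistent with doubling every 8 months. A short sketch of both calculations, with constant names chosen for this example:

```python
CHANNEL_GBPS = 40          # per wavelength
CHANNELS_PER_FIBER = 12    # WDM channels
FIBERS_PER_BUNDLE = 32

fiber_gbps = CHANNEL_GBPS * CHANNELS_PER_FIBER          # slide rounds to 500
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000     # slide rounds to 16

# Doubling every 8 months means 12/8 doublings per year.
growth_per_year = 2 ** (12 / 8)

print(f"{fiber_gbps} Gbps per fiber")
print(f"{bundle_tbps:.2f} Tbps per bundle")
print(f"{growth_per_year:.2f}x per year")
```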

  29. Underlying Theme • Digital Everything: from “words and numbers” to “sights and sounds” • New Devices: from isolated to adaptive, synchronized, and connected • Automation: from dumb to Web services; from manual to self-tuning, self-organizing, and self-maintaining; beyond reliability to availability • One inter-connected network: from stand-alone/basic connectivity to always wired (and wireless); everything over IP

  30. Information Avalanche • The Situation – a census of the data • We can record everything • Everything is a LOT! • The Good news • Changes science, education, medicine, entertainment,…. • Shrinks time and space • Can augment human intelligence • The Bad News • The end of privacy • Cyber Crime / Cyber Terrorism • Monoculture • The Technical Challenges • Amplify human intellect • Organize, summarize and prioritize information • Make programming easy.

  31. Online Science • All literature online • All data online • All instruments online • Great analysis tools.

  32. Online Education • All literature online • All lectures online • Interactive and time-shifted education • Just-in-time education • Available to everyone everywhere • Economic model is not understood (who pays?) • One model: “society pays”

  33. Online Business • Frictionless economy • Near-perfect information • Very efficient • Fully customized products • Example: Wal-Mart / Dell: • Traditional business 1-10 inventory turns/y • eBusiness 100-500 turns/y: no inventory • VERY efficient, huge economic advantage • Your customers & suppliers loan you money!

  34. Online Medicine • Traditional medicine: • Can monitor your health continuously • Instant diagnosis • Personalized drugs • New Biology • DNA is software • “solve each disease” • Huge impact on agriculture too

  35. Cyber-Space Shrinks Time and Distance • Everyone is always connected • Can get information they want • Can communicate with friends & family • Everything is online • You never miss a meeting/game/party/movie (you can always watch it) • You never forget anything (it’s there somewhere)

  36. Sustainable Society • Year 2050: 9 B people living at Europe’s standard of living • 100M people in a city? • Environment can’t sustain it • More efficient cities/transportation/… • 20% consume 60% now; if 100% consume 1/3 of current developed-world levels, net consumption is unchanged. • Need to reduce energy/water/metal consumption 3x in developed world.
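The consumption claim on slide 36 can be checked with exact fractions. The sketch reads "1/3 of current levels" as one third of what the top 20% consume per person today, and holds population fixed; both are interpretive assumptions, and the variable names are illustrative:

```python
from fractions import Fraction

rich_share_of_people = Fraction(20, 100)
rich_share_of_consumption = Fraction(60, 100)

# Per-person consumption of the top 20%, in units of the world average.
rich_per_capita = rich_share_of_consumption / rich_share_of_people

# Everyone consumes one third of that level.
target_per_capita = rich_per_capita / 3

# Total consumption relative to today (population held fixed).
relative_total = target_per_capita * 1
print(rich_per_capita, target_per_capita, relative_total)  # 3 1 1
```

The top fifth currently uses three times the world average per person, so a threefold cut for them, extended to everyone, lands exactly on today's total.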

  37. CyberSpace (data) and Tools Can Augment Human Intelligence • See next talk (12 CS challenges) • MyMainBrain is a personal example: improved memory • Data mining tools are promising

  38. Information Avalanche • The Situation – a census of the data • We can record everything • Everything is a LOT! • The Good news • Changes science, education, medicine, entertainment,…. • Shrinks time and space • Can augment human intelligence • The Bad News • The end of privacy • Cyber Crime / Cyber Terrorism • Monoculture • The Technical Challenges • Amplify human intellect • Organize, summarize and prioritize information • Make programming easy.

  39. The End Of Privacy • You can find out all about me. • Organizations can precisely track us • Credit cards, email, cellphone, … • Animals have “tags” in them, I will probably get a tag (eventually) (I already carry a dozen ID & smart cards). • “You have no privacy, get over it” Scott McNealy

  40. The Centralization of Power • Computers enable an Orwellian future (1984) • The government can know everything you ever • Buy • Say • Hear • See/Read/… • Where you are (phone company already knows) • Who you see and talk to • OK now, but what if Nero/Hitler/Stalin/.. comes to power?

  41. Cyber Crime • You can steal my identity • Sell my house • Accumulate huge debts • Make a video of me doing terrible things. • You can steal on a grand scale • Now Trillions of dollars are online. • A LARGE honey-pot for criminals.

  42. Cyber Terrorism • It is easier to attack/destroy than to steal. • Viruses, data corruption, data modification • Denial of Service • Hijacking and then destroying equipment • Utilities (water, energy, transportation) • Production (factories)

  43. Monoculture • Radio & TV & movies & Internet are making the world more homogeneous. • ½ the world has never made a phone call • But this is changing fast (they want to make phone calls!) • The wired world enables communities to form very easily, e.g. Sanskrit scholars. • But the community has to speak a common language.

  44. Information Clutter • Most mail is junk mail • Most eMail will soon be junk mail • 30% of hotmail, 75% of my mail (~130 m/d). • Telemarketing wastes people’s time. • Creates info-glut • You have 50,000 new mail messages • Need systems and interfaces to filter, summarize, prioritize information

  45. Information Avalanche • The Situation – a census of the data • We can record everything • Everything is a LOT! • The Good news • Changes science, education, medicine, entertainment,…. • Shrinks time and space • Can augment human intelligence • The Bad News • The end of privacy • Cyber Crime / Cyber Terrorism • Monoculture • The Technical Challenges • Amplify human intellect • Organize, summarize and prioritize information • Make programming easy.

  46. Technical Challenges • Storing information • Organizing information • Summarizing information • Visualizing information • Make programming easy

  47. The personal Terabyte (all your stuff online): So you’ve got it. Now what do you do with it? • Probably not accessed very often but TREASURED (what’s the one thing you would save in a fire?) • Can you find anything? • Can you organize that many objects? • Once you find it will you know what it is? • Once you’ve found it, could you find it again? • Research Goal: Have GOOD answers for all these questions

  48. Bell, Gemmell, Lueder: MyLifeBits Guiding Principles • Freedom from strict hierarchy • Full text search & Collections • Many visualizations • “don’t metaphor me in” • Annotations add value • So make them easy! • Keep the links when you author • “transclusion” • Everything goes in a database

  49. How will we find it? Put everything in the DB (and index it) • Need DBMS features: consistency, indexing, pivoting, queries, speed/scalability, backup, replication. If you don’t use one, you end up creating one! • Simple logical structure: • Blob and link is all that is inherent • Additional properties (facets == extra tables) and methods on those tables (encapsulation) • More than a file system • Unifies data and meta-data • Simpler to manage • Easier to subset and reorganize • Set-oriented access • Allows online updates • Automatic indexing, replication
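The "blob and link" structure with facets as extra tables, described on slide 49, can be sketched with an in-memory SQLite database. Table names, column names, and the sample rows below are invented for illustration; the talk does not give a schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The only inherent structure: opaque items and links between them.
CREATE TABLE item (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
CREATE TABLE link (src INTEGER NOT NULL REFERENCES item(id),
                   dst INTEGER NOT NULL REFERENCES item(id),
                   PRIMARY KEY (src, dst));
-- A facet is just an extra table keyed by item id (hypothetical example).
CREATE TABLE photo_facet (item_id INTEGER PRIMARY KEY REFERENCES item(id),
                          taken_on TEXT, caption TEXT);
CREATE INDEX photo_by_date ON photo_facet(taken_on);
""")

# Made-up sample data: one photo and one note that links to it.
photo = conn.execute("INSERT INTO item (content) VALUES (?)",
                     (b"...jpeg bytes...",)).lastrowid
note = conn.execute("INSERT INTO item (content) VALUES (?)",
                    (b"Trip notes",)).lastrowid
conn.execute("INSERT INTO photo_facet VALUES (?, ?, ?)",
             (photo, "2002-07-15", "Heraklion harbor"))
conn.execute("INSERT INTO link VALUES (?, ?)", (note, photo))

# Set-oriented access: photos from July 2002 with every item linking to them.
rows = conn.execute("""
    SELECT p.caption, i.content
    FROM photo_facet AS p
    JOIN link AS l ON l.dst = p.item_id
    JOIN item AS i ON i.id = l.src
    WHERE p.taken_on >= '2002-07-01'
""").fetchall()
print(rows)
```

Adding a new kind of metadata means adding another facet table, while the item and link tables stay unchanged.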

  50. How do we represent it to the outside world? • <?xml version="1.0" encoding="utf-8" ?> • <DataSet xmlns="http://WWT.sdss.org/"> • <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"> • <xs:element name="radec" msdata:IsDataSet="true"> • <xs:element name="Table"> • <xs:element name="ra" type="xs:double" minOccurs="0" /> • <xs:element name="dec" type="xs:double" minOccurs="0" /> • … • <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1"> • <radec xmlns=""> • <Table diffgr:id="Table1" msdata:rowOrder="0"> • <ra>184.028935351008</ra> • <dec>-1.12590950121524</dec> • </Table> • … • <Table diffgr:id="Table10" msdata:rowOrder="9"> • <ra>184.025719033547</ra> • <dec>-1.21795827920186</dec> • </Table> • </radec> • </diffgr:diffgram> • </DataSet> • (figure labels: schema; data or diffgram) • File metaphor too primitive: just a blob • Table metaphor too primitive: just records • Need metadata describing data context • Format • Provenance (author/publisher/citations/…) • Rights • History • Related documents • In a standard format • XML and XML schema • DataSet is great example of this • World is now defining standard schemas
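To show how a consumer might read the data portion of the DataSet on slide 50, here is a sketch using Python's standard `xml.etree.ElementTree`. The sample document is a trimmed, well-formed reconstruction of the slide's fragment with the schema section omitted (it is elided on the slide), so treat it as an assumption about the full format rather than a real service response:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<DataSet xmlns="http://WWT.sdss.org/">
  <diffgr:diffgram
      xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1"
      xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>"""

root = ET.fromstring(SAMPLE.encode("utf-8"))

# <radec xmlns=""> resets the default namespace, so the row elements
# are unqualified and can be matched by their plain tag names.
rows = [(float(t.findtext("ra")), float(t.findtext("dec")))
        for t in root.iter("Table")]
print(rows)
```

Because the schema travels with the data, a client can discover that `ra` and `dec` are doubles without out-of-band documentation, which is the point the slide makes about metadata.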
