How much information? Adapted from a presentation by: Jim Gray, Microsoft Research, http://research.microsoft.com/~gray and Alex Szalay, Johns Hopkins University, http://tarkus.pha.jhu.edu/~szalay/
How much information is there in the world? • What can we store? • What is stored? • Why are we interested?
Infinite Storage? • The Terror Bytes are here • 1 TB costs 1 k$ to buy • 1 TB costs 300 k$/y to own • Management & curation are expensive • Searching 1 TB takes minutes or hours • Petrified by Peta Bytes? • But… people can “afford” them, even though the contents can never actually be seen in a lifetime • So: automate the process [Scale figure: Kilo, Mega, Giga, Tera (“we are here”), Peta, Exa, Zetta, Yotta]
How much information is there? • Soon everything can be recorded and indexed • Most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies • See Mike Lesk, “How much information is there?”: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian, “How much information?”: http://www.sims.berkeley.edu/research/projects/how-much-info/ [Scale figure: Kilo through Yotta, with markers for a book, a photo, a movie, all books (words), all books multimedia, and “everything recorded”]
First Disk, 1956 • IBM 305 RAMAC • 4 MB • Fifty 24″ disks • 1,200 rpm • 100 ms access • 35 k$/y rent • Included computer & accounting software (tubes, not transistors)
Storage capacity beating Moore’s law • Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y • 1,000 $/TB today • 100 $/TB in 2007 • Moore’s law: 58.7%/year • TB growth: 112.3%/year since 1993 • Price decline: 50.7%/year since 1993 • Most (80%) data is personal (not enterprise); this will likely remain true.
Disk Storage Cheaper Than Paper • File cabinet (4 drawer): cabinet 250$, paper (24,000 sheets) 250$, space (2×3 @ 10$/ft²) 180$; total 700$ ≈ 0.03 $/sheet — three pennies per page • Disk: 250 GB disk, 250$; ASCII: 100 M pages ≈ 2e-6 $/sheet (10,000x cheaper) — a micro-dollar per page; image: 1 M photos ≈ 3e-4 $/photo (100x cheaper) — a milli-dollar per photo • Store everything on disk • Note: disk is 100x to 1000x cheaper than RAM
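The per-page figures on this slide are easy to sanity-check with a back-of-envelope calculation. This sketch uses only the numbers quoted above (cabinet, paper, and space costs; sheet and page counts):

```python
# Back-of-envelope check of the paper-vs-disk cost figures above.
# All numbers come from the slide; amounts are US dollars.

paper_total_cost = 250 + 250 + 180   # cabinet + paper + floor space
sheets_per_cabinet = 24_000
paper_cost_per_sheet = paper_total_cost / sheets_per_cabinet

disk_cost = 250                       # one 250 GB drive
ascii_pages_per_disk = 100_000_000    # ~100 million pages of plain text
disk_cost_per_page = disk_cost / ascii_pages_per_disk

print(f"paper: ${paper_cost_per_sheet:.3f}/sheet")   # ~3 cents per page
print(f"disk:  ${disk_cost_per_page:.1e}/page")      # a micro-dollar per page
print(f"ratio: {paper_cost_per_sheet / disk_cost_per_page:,.0f}x cheaper")
```

The ratio comes out a little above 10,000x, matching the slide's round figure.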
Portable Computer: 2006? • 100 Gips processor • 1 GB RAM • 1 TB disk • 1 Gbps network • “Some” of your software • Finding things on it is a data-mining challenge
80% of data is personal / individual. But what about the other 20%? • Business • Wal-Mart online: 1 PB and growing… • Paradox: most “transaction” systems are < 1 PB • Have to go to image/data monitoring for big data • Government • Government is the biggest business • Science • LOTS of data
Q: Where will the Data Come From? A: Sensor Applications • Earth observation: 15 PB by 2007 • Medical images & information + health monitoring: potentially 1 GB/patient/y → 1 EB/y • Video monitoring: ~1E8 video cameras @ ~1E5 Bps → 10 TB/s → 100 EB/y filtered??? • Airplane engines: 1 GB sensor data/flight, 100,000 engine-hours/day → 30 PB/y • Smart dust: ?? EB/y http://robotics.eecs.berkeley.edu/~pister/SmartDust/ http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
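The volumes quoted above can be checked with quick order-of-magnitude arithmetic. This is a sketch under two stated assumptions not made explicit in the slide: roughly 1 GB per engine-hour, and roughly a billion monitored patients:

```python
# Rough sanity checks on the sensor-data volumes quoted above.
# Assumptions (not in the slide): ~1 GB per engine-hour, ~1e9 patients.

GB, TB, PB, EB = 1e9, 1e12, 1e15, 1e18  # bytes, decimal prefixes

# Airplane engines: 100,000 engine-hours/day at ~1 GB each
engine_bytes_per_year = 100_000 * 1 * GB * 365
print(f"engines: {engine_bytes_per_year / PB:.0f} PB/y")  # same order as the slide's 30 PB/y

# Medical: 1 GB/patient/y for ~1e9 monitored patients
medical_bytes_per_year = 1e9 * 1 * GB
print(f"medical: {medical_bytes_per_year / EB:.0f} EB/y")  # 1 EB/y

# Video: a sustained 10 TB/s across all cameras
video_bytes_per_year = 10 * TB * 365 * 24 * 3600
print(f"video:   {video_bytes_per_year / EB:.0f} EB/y")  # raw; the slide's 100 EB/y assumes filtering
```

The raw video figure lands near 300 EB/y, which is why the slide's 100 EB/y is tagged "filtered".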
Premise: DataGrid Computing • Store exabytes twice (for redundancy) • Access them from anywhere • Implies huge archive/data centers • Supercomputer centers become super data centers • Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …
Thesis • Most new information is digital (and old information is being digitized) • An information-science grand challenge: capture, organize, summarize, and visualize this information • Optimize human attention as a resource • Improve information quality
The Evolution of Science • Observational science: scientist gathers data by direct observation; scientist analyzes the data • Analytical science: scientist builds an analytical model; makes predictions • Computational science: simulate the analytical model; validate the model and make predictions • Data-exploration science: data captured by instruments or generated by simulators; processed by software; placed in a database / files; scientist analyzes the database / files
Computational Science Evolves • Historically, computational science = simulation • New emphasis on informatics: capturing, organizing, summarizing, analyzing, visualizing • Largely driven by observational science, but also needed by simulations • Too soon to say whether comp-X and X-info will unify or compete [Images: BaBar detector, Stanford; P&E gene sequencer, from http://www.genome.uci.edu/; space telescope]
Next-Generation Data Analysis • Looking for • Needles in haystacks — the Higgs particle • Haystacks: dark matter, dark energy • Needles are easier than haystacks • Global statistics have poor scaling • Correlation functions are N², likelihood techniques N³ • As data and computers grow at the same rate, we can only keep up with N log N • A way out? • Discard the notion of optimal (data is fuzzy, answers are approximate) • Don’t assume infinite computational resources or memory • Requires a combination of statistics & computer science
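The "discard optimality" point above can be illustrated with a toy sketch: an exact pair statistic touches all N(N−1)/2 pairs (O(N²) work, like the correlation functions mentioned), while a randomly subsampled estimate trades exactness for tractability. Function names and the statistic chosen here are illustrative, not from the slide:

```python
import random

def mean_pairwise_distance_exact(xs):
    """Exact mean |xi - xj| over all pairs -- O(N^2) work."""
    n = len(xs)
    total = sum(abs(xs[i] - xs[j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def mean_pairwise_distance_sampled(xs, pairs=10_000, seed=0):
    """Approximate the same statistic from a random sample of pairs -- O(pairs)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        i, j = rng.sample(range(len(xs)), 2)  # two distinct indices
        total += abs(xs[i] - xs[j])
    return total / pairs

rng = random.Random(42)
data = [rng.random() for _ in range(2_000)]

exact = mean_pairwise_distance_exact(data)      # ~2 million pair evaluations
approx = mean_pairwise_distance_sampled(data)   # 10,000 pair evaluations
print(exact, approx)
```

At N = 2,000 the exact version already does ~2 million pair evaluations; at survey scale it is hopeless, while the sampled estimate's cost is fixed and its error shrinks with the sample size, not with N.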
Smart Data (active databases) • If there is too much data to move around, take the analysis to the data! • Do all data manipulations in the database • Build custom procedures and functions in the database • Automatic parallelism guaranteed • Easy to build in custom functionality • Databases & procedures being unified • Examples: temporal and spatial indexing, pixel processing • Easy to reorganize the data • Multiple views, each optimal for certain types of analyses • Building hierarchical summaries is trivial • Scalable to petabyte datasets
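A miniature illustration of "take the analysis to the data": register a custom function inside the database engine so the computation runs next to the rows and only the small aggregated answer crosses the boundary. This sketch uses SQLite as a stand-in for the large scientific databases the slide has in mind; the table, column, and function names are invented for the example:

```python
import math
import sqlite3

# In-memory database standing in for a large archive (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, flux REAL)")
con.executemany("INSERT INTO obs (flux) VALUES (?)",
                [(f,) for f in (10.0, 100.0, 1000.0)])

# A custom function installed *in* the database engine:
# the computation moves to the data, not the data to the computation.
con.create_function("magnitude", 1, lambda flux: -2.5 * math.log10(flux))

# Only the aggregated result leaves the database, not the raw rows.
(avg_mag,) = con.execute("SELECT AVG(magnitude(flux)) FROM obs").fetchone()
print(avg_mag)
```

In a real system the function would be a server-side stored procedure over billions of rows; the point is the same — shipping a scalar back is vastly cheaper than shipping the table out.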
Challenge: Make Data Publication & Access Easy • Augment FTP with data queries: return intelligent data subsets • Make it easy to • Publish: record structured data • Find: locate data anywhere in the network; get the subset you need; explore datasets interactively • Realistic goal: make it as easy as publishing/reading web sites today
Data Federations of Web Services • Massive datasets live near their owners: • Near the instrument’s software pipeline • Near the applications • Near data knowledge and curation • Supercomputer centers become super data centers • Each archive publishes a web service • Schema: documents the data • Methods on objects (queries) • Scientists get “personalized” extracts • Uniform access to multiple archives • A common global schema • Challenge: what is the object model for your science? [Diagram: federation of archives]
Web Services: The Key? • Web SERVER: given a URL + parameters, returns a web page (often dynamic) • Web SERVICE: given an XML document (SOAP message), returns an XML document • Tools make this look like an RPC: F(x,y,z) returns (u,v,w) • Distributed objects for the web, plus naming, discovery, security, … • Internet-scale distributed computing [Diagram: your program → http → web server → web page; your program → soap → web service → data, object in XML, in your address space]
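The "tools make this look like an RPC" point can be demonstrated with XML-RPC, an older and simpler cousin of SOAP from the same era that ships in the Python standard library. The service and its `add` method are invented for the sketch; the mechanics — XML request in, XML response out, wrapped so the caller just writes F(x, y) — are the ones the slide describes:

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# A tiny web service: given an XML document, it returns an XML document.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda x, y: x + y, "add")  # hypothetical method
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client-side proxy hides the XML marshalling: the remote call
# reads exactly like a local function call, F(x, y) -> result.
proxy = xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}/")
result = proxy.add(2, 3)
print(result)  # 5 -- marshalled to XML, sent over HTTP, unmarshalled back
server.shutdown()
```

The arguments travel as an XML document over HTTP and the answer comes back the same way, but neither side ever touches the XML — which is precisely why the slide calls this "distributed objects for the web".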
Emerging technologies • Look at science • High-end computation and storage