The Future of the VO (the Tale of Tails...) • Alex Szalay, The Johns Hopkins University • Jim Gray, Microsoft Research
The VO: Fast (and Furious) • We have been moving forward at a very fast pace • First services in production • Big international collaboration • Strong national projects • But we need to understand • where are we running? • who will use the VO? • why? • Answers at the intersection of Technology, Sociology and Economics
Astronomy in an Exponential World • Astronomers have several hundred TB now • 1 pixel (byte) / sq arc second ~ 4TB • Multi-spectral, temporal, … → 1PB • Data doubles every year • Q: How much disk space do you own? • 0.1TB • 1TB • 10TB • 100TB • Q: How long can this growth continue?
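To make "data doubles every year" concrete, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, and the one-year doubling time is taken from the slide as a working assumption rather than a measured rate.

```python
import math

def growth_factor(years, doubling_time_years=1.0):
    """Factor by which the data volume grows over `years`."""
    return 2.0 ** (years / doubling_time_years)

def years_to_grow(factor, doubling_time_years=1.0):
    """Years needed for the data volume to grow by `factor`."""
    return doubling_time_years * math.log2(factor)

# Ten years of yearly doubling is roughly a thousandfold increase.
print(growth_factor(10))              # 1024.0
# Each 10x step in the disk-space quiz (0.1TB, 1TB, 10TB, 100TB)
# corresponds to a bit more than three doublings.
print(round(years_to_grow(10), 2))    # 3.32
print(round(years_to_grow(1000), 2))  # 9.97
```

Under this assumption, the whole range of the quiz (0.1TB to 100TB) is covered in about ten years.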
Evolving Science • A thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: a theoretical branch, using models and generalizations • Last few decades: a computational branch, simulating complex phenomena • Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics → new algorithms!
Technical Challenges • Data Access: • Move analysis to the data • Locality is the key (e.g.: Image Stacking Service) • If downloaded, keep it • Discovery: • Shannon new dimensions • Federation requires data movement (UDT) • Analysis: • max NlogN algorithms possible
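A rough Python sketch of why N log N is the practical ceiling for analysis algorithms. The catalog size of 400 million objects is the SkyServer figure quoted elsewhere in this talk; the rate of 1e9 operations per second is purely an illustrative assumption, and operation counts ignore constant factors.

```python
import math

def ops_nlogn(n):
    """Operation count for an N log N algorithm (constants ignored)."""
    return n * math.log2(n)

def ops_quadratic(n):
    """Operation count for an N^2 algorithm (constants ignored)."""
    return float(n) * n

N_OBJECTS = 400_000_000      # catalog size quoted for the SkyServer
OPS_PER_SECOND = 1e9         # assumption: illustrative machine speed
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for name, ops in (("N log N", ops_nlogn(N_OBJECTS)),
                  ("N^2", ops_quadratic(N_OBJECTS))):
    seconds = ops / OPS_PER_SECOND
    print(f"{name:8s} {ops:.2e} ops  "
          f"{seconds:.2e} s  ({seconds / SECONDS_PER_YEAR:.2e} years)")
```

With these assumed numbers the N log N pass finishes in seconds, while the quadratic one takes years, which is the sense in which only N log N (or better) scales with the data.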
Sociological Challenges • How to avoid trying to be everything for everybody? • Rapidly changing “outside world” • Make it simple!!! • Publishing: • Exponential linear • Data reliability credits and career paths
Where are we going? • Relatively easy to predict until 2010 • Exponential growth continues • Most ground based observatories join the VO • More and more sky surveys in different wavebands • Simulations will have VO interfaces: can be ‘observed’ • Much harder beyond 2010 • PetaSurveys are coming on line (Pan-STARRS, VISTA, LSST) • Technological predictions much harder • Changing funding climate • Changing sociology
Similarities to HEP • HEP: Van de Graaff → Cyclotrons → National Labs → International Labs → SSC vs LHC • Optical Astronomy: 2.5m telescopes → 4m telescopes → 8-10m class telescopes → Surveys/Time Domain → 30-100m telescopes • Similar trends with a 20 year delay: • fewer and ever bigger projects… • increasing fraction of cost is in software… • more conservative engineering… • Can the exponential trend continue, or will it be logistic? • What can astronomy learn from High Energy Physics?
But: Why Is Astronomy Different? • Especially attractive for the wide public • Data has more dimensions • Spatial, temporal, cross-correlations • Diverse and distributed • Many different instruments from many different places and many different times • A broad distribution of different questions
Future: How long does the data growth continue? • High end always linear • Exponential comes from technology + economics: rapidly changing generations • like CCDs replacing plates, and becoming ever cheaper • How many new generations of instruments do we have left? • Software is also an instrument • hierarchical data replication • virtual data • data cloning
Technology+Sociology+Economics • None of them alone is enough • We have technology changing very rapidly • Google, tags, sensors, Moore's Law • Trend driven by changing generations of technologies • Sociology is changing in unpredictable ways • In general, people will use a new technology if it • Offers something entirely new • Or is substantially cheaper • Or is substantially simpler • Funding is essentially level
Tale of the Tails • Long tailed distributions • Pareto: 20% of population holds 80% of wealth • Zipf: word frequency follows a power law • C. Anderson: everything on the web is a power law • Lognormal vs Gaussian • Multiplicative processes lead to lognormal: log P = log p_1 + log p_2 + … + log p_n • Central limit theorem: log P is a normal random variable • Kapteyn: random fragmentation • Lognormal resembles a 1/f over large dynamic range • Extremely important in web-based economics • Amazon, Time-Warner, blogs, etc
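The multiplicative argument above can be checked numerically. The following is a minimal sketch, not taken from the talk: it assumes each factor p_i is drawn uniformly from [0.5, 1.5] (an arbitrary choice), multiplies 50 of them, and compares the skewness of P with that of log P. The helper names are hypothetical.

```python
import math
import random
import statistics

def multiplicative_sample(n_factors, rng):
    """Product of n_factors independent positive random factors."""
    p = 1.0
    for _ in range(n_factors):
        p *= rng.uniform(0.5, 1.5)   # assumption: arbitrary factor distribution
    return p

def skewness(values):
    """Population skewness: third standardized moment."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(42)              # fixed seed for repeatability
samples = [multiplicative_sample(50, rng) for _ in range(20000)]
logs = [math.log(p) for p in samples]

# P itself is strongly right-skewed (long tail); log P is close to symmetric.
print("skewness of P:    ", skewness(samples))
print("skewness of log P:", skewness(logs))
# In a long-tailed distribution the mean sits well above the median.
print("mean of P:  ", statistics.fmean(samples))
print("median of P:", statistics.median(samples))
```

The near-zero skewness of log P is what the central limit theorem predicts for a sum of many independent terms, and the gap between mean and median of P is the long tail.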
Tale of the Tails #2 • Barabasi: Power laws tend to arise in social systems where people are faced with many choices • The more choices, the more extreme the distribution • Measured by the distance between #1 and the median • Most elements in the power law system are below the average • People’s choices affect one another, they are not random independent events
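Two of the claims above (most elements lie below the average, and the gap between #1 and the median is large) can be illustrated with a small Python sketch. It assumes a Pareto distribution with shape parameter 1.16, the value usually associated with the 80/20 rule; the sample size and seed are arbitrary.

```python
import random
import statistics

ALPHA = 1.16          # assumption: Pareto shape matching the 80/20 rule
N_SAMPLES = 100_000   # assumption: arbitrary sample size

rng = random.Random(7)
values = [rng.paretovariate(ALPHA) for _ in range(N_SAMPLES)]

mean = statistics.fmean(values)
median = statistics.median(values)
largest = max(values)

# Fraction of elements that fall below the average.
below_average = sum(v < mean for v in values) / len(values)
# "Distance between #1 and the median", expressed as a ratio.
top_to_median = largest / median

print("fraction below average:", below_average)
print("top / median ratio:    ", top_to_median)
```

For a Gaussian the fraction below the average would be about one half; for a power law it is far higher, because a few extreme elements pull the mean upward.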
Examples: the Grid • The size of computational problems is multiplicative • Has to have a lognormal distribution • Computers bought for the average job will not be large enough in the tail, but the system is still often idle • Need to borrow CPU for large jobs and loan when idle • M. Ripeanu (UC): Top 500 computers
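A toy Python model of the provisioning argument above. It assumes job sizes are lognormal with sigma = 1.5 (an illustrative value, not from the talk) and that a machine is bought with capacity equal to the average job size; the variable names are hypothetical.

```python
import random
import statistics

SIGMA = 1.5        # assumption: illustrative width of the lognormal
N_JOBS = 50_000    # assumption: arbitrary number of simulated jobs

rng = random.Random(3)
job_sizes = [rng.lognormvariate(0.0, SIGMA) for _ in range(N_JOBS)]

# Machine sized for the average job.
capacity = statistics.fmean(job_sizes)

# Jobs in the tail that do not fit and need borrowed CPU.
too_big = [s for s in job_sizes if s > capacity]
fits = [s for s in job_sizes if s <= capacity]
frac_too_big = len(too_big) / N_JOBS

# Average fraction of the machine used while running a job that fits.
mean_utilization = statistics.fmean(fits) / capacity

print("fraction of jobs too large:", frac_too_big)
print("mean utilization on jobs that fit:", mean_utilization)
```

In this sketch a sizeable minority of jobs exceeds the machine, while the jobs that do fit leave most of it unused, which is the case for borrowing CPU for large jobs and lending it when idle.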
SkyServer tables: Footprints and Cardinalities (S. Lubow, STScI)
Analyzing the SkyServer • Sloan Digital Sky Survey: Pixels + Objects • About 500 attributes per “object”, 400M objects • Currently 2.4TB fully public • Prototype eScience lab (800 users): CasJobs • Moving analysis to the data • Visual tools • Join pixels with objects • Prototype in data publishing • 200 million web hits in 5 years • 1,000,000 distinct users vs 10,000 astronomers • http://skyserver.sdss.org/
Data Sharing in the VO • Users are more willing to part with their data if it was machine obtained • What is the business model? • Three tiers (power law!!!): (a) big surveys (b) value added, refereed products (c) more ad-hoc data, images, outreach info • (a) is largely done • need “Journal for Data” to solve (b) • need Flickr and an integrated environment for virtual excursions for (c)
Data Reliability • [Figure labels: SDSS data releases EDR, DR1, DR2, DR3] • Gilmore: Is new data necessarily better? • Yes: more of it, better calibrations • But: always on the edge, Malmquist bias, etc • Usage of old data: changing into a power law • (CNN, Time-Warner) • Data publishing: once published, must stay • SDSS: DR1 is still used
VO Trends • VO is inevitable, a new way of doing science • Present on every physical scale today, not just astronomy (NEON, Neptune, CERN, MS) • Driven by advances in technology, and economics, mapped onto society • Boundary conditions: funding will be at best level • Computational methods, algorithmic thinking will come just as naturally as mathematics today
VO Technology • We will have Petabytes • We will need to save them, move them • several big archive centers connected • Need Journal for Data • curation is the key • Always will be an open-ended modular system • Archives -- also computational services • driven by economics: cheaper to process than move
VO Economics • The Price of Software • 30% for SDSS, 50% for LSST • should there be full reuse vs no reuse today? • neither: we are not systems integrators • risks and benefits are power law • repurposing for other disciplines is an example • The Price of Data • $100,000 / paper (Norris et al.) • Drives new projects • For SDSS there are 1300 refereed papers for $100M so far • Level budgets
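The SDSS figures above can be turned into a cost per paper with one line of arithmetic. The Python sketch below uses only the two numbers quoted on the slide; the function name is hypothetical.

```python
def cost_per_paper(total_cost_usd, n_papers):
    """Average project cost per refereed paper, in US dollars."""
    return total_cost_usd / n_papers

SDSS_COST_USD = 100e6   # $100M, as quoted on the slide
SDSS_PAPERS = 1300      # refereed papers so far, as quoted on the slide

print(round(cost_per_paper(SDSS_COST_USD, SDSS_PAPERS)))  # 76923
```

That is about $77,000 per paper, the same order of magnitude as the $100,000 per paper figure attributed to Norris et al.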
VO Sociology • Learn from particle physics • do not take for granted that there will be a next one • small is beautiful • What happens to the rest of astronomy after the world's biggest telescope? • The impact of power laws: • we need to look at problems in octaves • the astronomers may be the tail of our users • there is never a natural end or an edge (except for our funding)
The Changing VO • Boundary conditions change, we need to change every year! • We must change at least as fast as the outside world or we will be left behind • We will make mistakes! We need to recognize and recover from them, step back and do it differently • If we do not make mistakes, we are not taking enough risks • But: we need to buffer/dampen these changes to the astronomy community
Summary: The Future of the VO • Does not have much of a past… • We need to keep running forward • We must take risks • Technology driving Sociology - limited by economics • Everything is a power law – do not make assumptions! • Enormous potential • May be the only way to do 'small science' in 2020