Astrophysics with Terabytes of Data
Alex Szalay, The Johns Hopkins University
Living in an Exponential World
• Astronomers have a few hundred TB now
  • 1 pixel (byte) / sq arc second ~ 4TB
  • Multi-spectral, temporal, … → 1PB
• They mine it looking for:
  • new (kinds of) objects, or more of interesting ones (quasars)
  • density variations in 400-D space
  • correlations in 400-D space
• Data doubles every year
• Data is public after 1 year
  • Same access for everyone
• But: how long can this continue?
Evolving Science
• A thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch, using models and generalizations
• Last few decades: a computational branch, simulating complex phenomena
• Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics
The Challenges
• Data Collection — exponential data growth: distributed collections, soon Petabytes
• Discovery and Analysis — new analysis paradigm: data federations, move the analysis to the data
• Publishing — new publishing paradigm: scientists become publishers and curators
Publishing Data
• Roles, traditional: Authors = Scientists, Publishers = Journals, Curators = Libraries, Consumers = Scientists
• Roles, emerging: Authors = Collaborations, Publishers = Project www sites, Curators = Bigger Archives, Consumers = Scientists
• Exponential growth:
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
• More responsibility on projects
  • Becoming Publishers and Curators
• Data will reside with projects
• Analyses must be close to the data
Accessing Data
• If there is too much data to move around, take the analysis to the data! (sketch below)
• Do all data manipulations inside the database
  • Build custom procedures and functions in the database
  • Automatic parallelism guaranteed
• Easy to build in custom functionality
  • Databases & procedures being unified
  • Examples: temporal and spatial indexing, pixel processing
• Easy to reorganize the data
  • Multiple views, each optimal for certain analyses
  • Building hierarchical summaries is trivial
• Scalable to Petabyte datasets → active databases!
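As an illustration of taking the analysis to the data, here is a minimal sketch (not from the talk) that sends an aggregate SQL query to the public SDSS SkyServer via the community astroquery package, so only a small summary table crosses the network. The table and column names (PhotoObj, psfMag_g, psfMag_r, type) follow the SDSS schema but should be treated as assumptions about the current data release.

```python
# A minimal sketch: run the aggregation inside the archive database and
# download only the summary, not the catalogue itself.
# Assumes the astroquery package and the public SDSS SkyServer SQL service;
# table/column names follow the SDSS schema and should be checked against
# the current data release.
from astroquery.sdss import SDSS

sql = """
SELECT TOP 10
    ROUND(psfMag_g - psfMag_r, 1) AS gr_color,   -- 0.1 mag colour bins
    COUNT(*)                      AS n_objects
FROM PhotoObj
WHERE type = 6                                    -- point sources (stars)
  AND psfMag_r BETWEEN 15 AND 20
GROUP BY ROUND(psfMag_g - psfMag_r, 1)
ORDER BY n_objects DESC
"""

result = SDSS.query_sql(sql)    # returns a small astropy Table (~10 rows)
print(result)
```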
Making Discoveries
• Where are discoveries made?
  • At the edges and boundaries
  • Going deeper, collecting more data, using more colors…
• Metcalfe’s law
  • Utility of computer networks grows as the number of possible connections: O(N²)
• Federating data
  • A federation of N archives has utility O(N²)
  • Possibilities for new discoveries grow as O(N²)
• Current sky surveys have proven this
  • Very early discoveries from SDSS, 2MASS, DPOSS
Data Federations
• Massive datasets live near their owners:
  • Near the instrument’s software pipeline
  • Near the applications
  • Near data knowledge and curation
• Super Computer centers become Super Data Centers
• Each Archive publishes (web) services (see the cone-search sketch below)
  • Schema: documents the data
  • Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  • A common “global” schema → Federation
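To make "each archive publishes (web) services" concrete, below is a minimal sketch of a client calling an archive through the IVOA Simple Cone Search convention (an HTTP GET with RA, DEC and SR in degrees, returning a VOTable). The endpoint URL is a placeholder, not a real service from the talk.

```python
# A minimal sketch of calling an archive's published web service using the
# IVOA Simple Cone Search convention (HTTP GET with RA, DEC, SR in degrees,
# VOTable response). The service URL below is a hypothetical placeholder.
import io
import requests
from astropy.io.votable import parse_single_table

CONE_SEARCH_URL = "https://archive.example.org/scs"   # hypothetical endpoint

params = {"RA": 180.0, "DEC": 0.0, "SR": 0.05}        # 3-arcmin radius cone
response = requests.get(CONE_SEARCH_URL, params=params, timeout=60)
response.raise_for_status()

table = parse_single_table(io.BytesIO(response.content)).to_table()
print(f"{len(table)} sources returned, columns: {table.colnames}")
```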
The Virtual Observatory
• Premise: most data is (or could be) online
• Federating the different surveys will provide opportunities for new science
• It’s a smart telescope: links objects and data to the literature on them
• Software became the capital expense
  • Share, standardize, reuse…
• It has to be SIMPLE
• You can form your own small collaborations
Strong International Collaboration
• Similar efforts now in 15 countries:
  • USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO, Spain
• Total awarded funding world-wide is over $60M
• Active collaboration among projects
  • Standards, common demos
  • International VO roadmap being developed
  • Regular telecons over 10 timezones
• Formal collaboration: the International Virtual Observatory Alliance (IVOA)
Boundary Conditions
• Standards driven by evolving new technologies:
  • Exchange of rich and structured data (XML…)
  • DB connectivity, Web Services, Grid computing
• Application to the astronomy domain:
  • Data dictionaries (UCDs)
  • Data models
  • Protocols
  • Registries and resource/service discovery
  • Provenance, data quality, DATA CURATION!!!!
• Dealing with the astronomy legacy:
  • FITS data format
  • Software systems
• External funding climate
Current VO Challenges
• How to avoid trying to be everything for everybody?
• Database connectivity is essential
  • Bring the analysis to the data
  • Core web services, higher-level applications on top
• Use the 90-10 rule:
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90% of the users
  • Let the users build the rest from the components
• Rapidly changing “outside world”
• Make it simple!!!
Where are we going?
• Relatively easy to predict until 2010
  • Exponential growth continues
  • Most ground-based observatories join the VO
  • More and more sky surveys in different wavebands
  • Simulations will have VO interfaces: they can be ‘observed’
• Much harder beyond 2010
  • PetaSurveys are coming on line (Pan-STARRS, VISTA, LSST)
  • Technological predictions much harder
  • Changing funding climate
  • Changing sociology
Similarities to HEP
• HEP: Van de Graaff → cyclotrons → national labs → international (CERN) → SSC vs LHC
• Optical astronomy: 2.5m telescopes → 4m telescopes → 8m-class telescopes → surveys/time domain → 30-100m telescopes
• Similar trends, with a 20-year delay:
  • fewer and ever bigger projects…
  • increasing fraction of cost is in software…
  • more conservative engineering…
• Can the exponential continue, or will it turn logistic?
• What can astronomy learn from High Energy Physics?
Why Is Astronomy Different?
• Especially attractive for the wide public
• It has no commercial value
  • No privacy concerns, freely share results with others
  • Great for experimenting with algorithms
• Data has more dimensions
  • Spatial, temporal, cross-correlations
• Diverse and distributed
  • Many different instruments from many different places and many different times
• Many different interesting questions
Trends
• CMB Surveys: 1990 COBE 1,000; 2000 Boomerang 10,000; 2002 CBI 50,000; 2003 WMAP 1 Million; 2008 Planck 10 Million
• Galaxy Redshift Surveys: 1986 CfA 3,500; 1996 LCRS 23,000; 2003 2dF 250,000; 2005 SDSS 750,000
• Angular Galaxy Surveys: 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2008 VISTA 1000M; 2012 LSST 3000M
• Time Domain: QUEST, SDSS extension survey, Dark Energy Camera, Pan-STARRS, SNAP…, LSST…
• Petabytes/year by the end of the decade…
Challenges
• Real-time detection for 3B objects
• Pixels (exponential growth slowing down)
  • Size projection: 100PB by 2020
• Data transfer (grows slower than data)
• Data access (hierarchical usage: Tier0 → Tier1 → Tier2, serving 100% → 10% → 1% of the data, increasingly fast)
• Fault tolerance and data protection
SkyServer
• Sloan Digital Sky Survey: Pixels + Objects
  • About 500 attributes per “object”, 300M objects
  • Spectra for 1M objects
  • Currently 2TB, fully public
• Prototype eScience lab
  • Moving the analysis to the data
  • Fast searches: color, spatial
  • Visual tools
  • Join pixels with objects
• Prototype in data publishing
  • 70 million web hits in 3.5 years
• http://skyserver.sdss.org/
Public Data Releases: Versions!
• June 2001: EDR (Early Data Release)
• July 2003: DR1
  • Contains 30% of the final data
  • 150 million photo objects
• July 2005: DR4 at 3.5TB
  • 60% of the data
• 4 versions of the data: target, best, runs, spectro
• Total catalog volume 5TB
  • See the Terascale Sneakernet paper…
• Published releases are served forever: EDR, DR1, DR2, …
  • Soon to include email archives, annotations
• O(N²) — only possible because of Moore’s Law!
Spatial Features
• Precomputed Neighbors
  • All objects within 30″
• Boundaries, Masks and Outlines
  • 27,000 spatial objects, stored as spatial polygons
Time Domain
• Precomputed Match
  • All objects within 1″, observed at different times
  • Found duplicates due to telescope tracking errors
  • Manual fix, recorded in the database
• MatchHead (toy sketch below)
  • The first observation of the linked list, used as the unique id for the chain of observations of the same object
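A toy sketch of the MatchHead bookkeeping described above (my illustrative version, not the SDSS implementation): repeated detections of the same object are chained together and keyed by the ID of the earliest detection.

```python
# A toy sketch of the "MatchHead" idea: detections of the same object taken
# at different epochs are chained together, and the ID of the earliest
# detection serves as the unique key for the whole chain. The 1-arcsecond
# tolerance follows the slide; everything else is illustrative.
import math

MATCH_RADIUS_DEG = 1.0 / 3600.0   # 1 arcsecond

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Small-angle separation, adequate at arcsecond scales."""
    dra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
    return math.hypot(dra, dec1 - dec2)

def assign_match_heads(detections):
    """detections: list of (det_id, mjd, ra_deg, dec_deg), any order.
    Returns {det_id: match_head_id}."""
    heads = []            # (head_id, ra, dec) of chains seen so far
    match_head = {}
    for det_id, _, ra, dec in sorted(detections, key=lambda d: d[1]):  # by time
        for head_id, hra, hdec in heads:
            if angular_sep_deg(ra, dec, hra, hdec) < MATCH_RADIUS_DEG:
                match_head[det_id] = head_id     # joins an existing chain
                break
        else:                                    # first observation: new chain
            heads.append((det_id, ra, dec))
            match_head[det_id] = det_id
    return match_head
```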
3 Ways To Do Spatial
• Hierarchical Triangular Mesh (extension to SQL)
  • Uses table-valued stored procedures
  • Acts as a new “spatial access method”
  • Ported to Yukon CLR for a 17x speedup
• Zones: fits SQL well (sketch below)
  • Surprisingly simple & good on a fixed scale
• Constraints: a novel idea
  • Lets us do algebra on regions, implemented in pure SQL
• Paper: “There Goes the Neighborhood: Relational Algebra for Spatial Data Search”
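A minimal sketch of the zone idea (assumed parameters, not the production SQL): the sky is bucketed into fixed-height declination stripes so that a radial neighbor search only has to look at a few adjacent zones.

```python
# A minimal sketch of "zones": bucket the sky into fixed-height declination
# stripes so a radial cross-match becomes a simple range scan over a few
# adjacent zones. The 30-arcsecond zone height is illustrative.
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 30.0 / 3600.0          # one zone = 30 arcsec of declination

def zone_id(dec_deg):
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))

def build_zones(objects):
    """objects: iterable of (obj_id, ra_deg, dec_deg) -> {zone_id: [objects]}"""
    zones = defaultdict(list)
    for obj in objects:
        zones[zone_id(obj[2])].append(obj)
    return zones

def neighbors(obj, zones, radius_deg):
    """All objects within radius_deg of obj, searching only nearby zones."""
    obj_id, ra, dec = obj
    dz = int(math.ceil(radius_deg / ZONE_HEIGHT_DEG))
    dra = radius_deg / max(math.cos(math.radians(dec)), 1e-6)   # RA widening
    found = []
    for z in range(zone_id(dec) - dz, zone_id(dec) + dz + 1):
        for cand_id, cra, cdec in zones.get(z, []):
            if cand_id == obj_id or abs(cra - ra) > dra:
                continue
            sep = math.hypot((cra - ra) * math.cos(math.radians(dec)), cdec - dec)
            if sep <= radius_deg:
                found.append(cand_id)
    return found
```

In the real system the same logic is expressed as a join on a zone column inside the database, which is what makes the approach "fit SQL well".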
Pipeline Parallelism: 2.5 hours
• Or… as fast as we can read USNOB (≈2 hours) + 0.5 hour to merge
• 2MASS ↔ USNOB cross-match pipeline: build index → partition source tables into zones → compare neighboring zones (0:-1, 0:0, 0:+1) → merge answer
• Source tables: 2MASS 471 Mrec (140 GB), USNOB 1.1 Brec (233 GB)
• Zone comparisons: 0:-1 64 Mrec (2 GB), 0:0 260 Mrec (9 GB), 0:+1 26 Mrec (1 GB)
• Matched outputs: 2MASS→USNOB 350 Mrec (12 GB), USNOB→2MASS 350 Mrec (12 GB)
Next-Generation Data Analysis
• Looking for
  • Needles in haystacks – the Higgs particle
  • Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• ‘Optimal’ statistics have poor scaling
  • Correlation functions are N², likelihood techniques N³
  • For large data sets the main errors are not statistical
• As data and computers grow with Moore’s Law, we can only keep up with N log N (worked numbers below)
• A way out?
  • Discard the notion of optimal (data is fuzzy, answers are approximate)
  • Don’t assume infinite computational resources or memory
  • Requires a combination of statistics & computer science
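As a rough illustration of the scaling argument (numbers mine, not from the slide), for a catalogue of N ≈ 10⁸ objects:

$$
N^2 \approx 10^{16} \ \text{pair operations} \qquad \text{vs.} \qquad N\log_2 N \approx 2.7\times 10^{9},
$$

a factor of several million, which is why only roughly N log N methods can keep pace with data growing on a Moore's-Law curve.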
Organization & Algorithms • Use of clever data structures (trees, cubes): • Up-front creation cost, but only N logN access cost • Large speedup during the analysis • Tree-codes for correlations (A. Moore et al 2001) • Data Cubes for OLAP (all vendors) • Fast, approximate heuristic algorithms • No need to be more accurate than cosmic variance • Fast CMB analysis by Szapudi et al (2001) • N logN instead of N3 => 1 day instead of 10 million years • Take cost of computation into account • Controlled level of accuracy • Best result in a given time, given our computing resources
Today’s Questions
• Discoveries
  • Need fast outlier detection
• Spatial statistics
  • Fast correlation and power spectrum codes (CMB + galaxies)
  • Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on the sphere)
• Time domain
  • Transients, supernovae, periodic variables
  • Moving objects, ‘killer’ asteroids, Kuiper-belt objects…
Other Challenges
• Statistical noise is smaller and smaller
  • Error matrix larger and larger (Planck…)
• Systematic errors becoming dominant
  • De-sensitize against known systematic errors
  • Optimal subspace filtering (…SDSS stripes…)
• Comparisons of spectra to models
  • 10⁶ spectra vs 10⁸ models (Charlot…)
• Detection of faint sources in multi-spectral images
  • How to use all information optimally (QUEST…)
• Efficient visualization of ensembles of 100M+ data points
Systematic Errors
• SDSS P(k), main issue:
  • Effects of zero points and flat-field vectors result in large-scale, correlated patterns
• Two tasks:
  • Estimate how large the effect is
  • De-sensitize the statistics
• Monte-Carlo simulations:
  • 100 million random points, assigned to stripes, runs, camcols, fields, x,y positions and redshifts => database
  • Build the MC error matrix due to zero-point errors
• Include the error matrix in the KL basis
  • Some modes are sensitive to zero points (# of free parameters)
  • Eliminate those modes from the analysis => projection (sketch below)
  • Statistics insensitive to zero points afterwards
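A schematic numpy sketch of the "eliminate those modes" projection step (my illustration of the general technique, not the actual SDSS KL pipeline): given modes flagged as sensitive to zero-point errors, project the data vector onto their orthogonal complement before forming statistics.

```python
# A schematic sketch of "eliminate those modes from the analysis": build the
# projector onto the complement of the modes flagged as sensitive to
# zero-point errors, and apply it to the data vector before computing
# statistics. Shapes and names are illustrative.
import numpy as np

def remove_sensitive_modes(data_vector, sensitive_modes):
    """data_vector: shape (n,); sensitive_modes: shape (n, k), columns are the
    modes to be projected out. Returns the cleaned data vector."""
    # Orthonormalise the sensitive modes (QR is enough for a sketch).
    q, _ = np.linalg.qr(sensitive_modes)
    projector = np.eye(len(data_vector)) - q @ q.T     # P = I - Q Q^T
    return projector @ data_vector

# Toy usage: 500-element data vector, 3 modes identified as zero-point sensitive.
rng = np.random.default_rng(1)
data = rng.normal(size=500)
modes = rng.normal(size=(500, 3))
cleaned = remove_sensitive_modes(data, modes)
print(np.allclose(modes.T @ cleaned, 0.0))   # True: no component along sensitive modes
```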
Simulations
• Cosmological simulations have 10⁹ particles and produce over 30TB of data (Millennium)
  • Build up dark matter halos
  • Track the merging history of halos
  • Use it to assign star formation history
  • Combination with spectral synthesis
• Too few realizations
• Hard to analyze the data afterwards
• What is the best way to compare to the real universe?
Summary
• Databases became an essential part of astronomy: most data access will soon be via digital archives
• Data at separate locations, distributed worldwide, evolving in time: move the analysis, not the data!
• Good scaling of statistical algorithms is essential
• Many outstanding problems in astronomy are statistical; current techniques are inadequate, we need help!
• The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!