
Astrophysics with Terabytes of Data


Presentation Transcript


  1. Astrophysics with Terabytes of Data
  Alex Szalay, The Johns Hopkins University

  2. Living in an Exponential World
  • Astronomers have a few hundred TB now
    • 1 pixel (byte) / sq arc second ~ 4 TB
    • Multi-spectral, temporal, … → 1 PB
  • They mine it looking for
    • new (kinds of) objects, or more of interesting ones (quasars)
    • density variations in 400-D space
    • correlations in 400-D space
  • Data doubles every year
    • Data is public after 1 year
    • Same access for everyone
  • But: how long can this continue?

  3. Evolving Science
  • A thousand years ago: science was empirical, describing natural phenomena
  • Last few hundred years: a theoretical branch, using models and generalizations
  • Last few decades: a computational branch, simulating complex phenomena
  • Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics

  4. The Challenges
  • Exponential data growth: distributed collections, soon Petabytes
  • New analysis paradigm: data federations, move analysis to the data
  • New publishing paradigm: scientists are publishers and curators
  (Diagram: Data Collection → Discovery and Analysis → Publishing)

  5. Publishing Data

  Roles:        Authors         Publishers        Curators         Consumers
  Traditional:  Scientists      Journals          Libraries        Scientists
  Emerging:     Collaborations  Project www site  Bigger Archives  Scientists

  • Exponential growth:
    • Projects last at least 3-5 years
    • Data sent upwards only at the end of the project
    • Data will never be centralized
  • More responsibility on projects
    • Becoming publishers and curators
    • Data will reside with projects
    • Analyses must be close to the data

  6. Accessing Data
  • If there is too much data to move around, take the analysis to the data!
  • Do all data manipulations in the database
    • Build custom procedures and functions in the database
    • Automatic parallelism guaranteed
    • Easy to build in custom functionality
    • Databases & procedures being unified
    • Examples: temporal and spatial indexing, pixel processing
  • Easy to reorganize the data
    • Multiple views, each optimal for certain analyses
    • Building hierarchical summaries is trivial
  • Scalable to Petabyte datasets: active databases!
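The "take the analysis to the data" idea can be sketched with SQLite from the Python standard library. This is a hypothetical miniature, not SkyServer's actual schema: the `objects` table, its columns, and the `red_excess` function are all invented for illustration (the real SDSS archive uses SQL Server stored procedures).

```python
import sqlite3

# Illustrative in-memory archive; table and column names are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (ra REAL, dec REAL, g REAL, r REAL)")
con.executemany("INSERT INTO objects VALUES (?,?,?,?)",
                [(10.0, 1.0, 18.2, 17.9),
                 (10.1, 1.1, 21.5, 19.0),
                 (10.2, 0.9, 19.4, 19.3)])

# A custom function registered inside the database engine:
# the analysis runs next to the data, row by row.
def red_excess(g, r):
    return g - r

con.create_function("red_excess", 2, red_excess)

# Only the (small) answer set crosses the wire, not the raw table.
rows = con.execute(
    "SELECT ra, dec FROM objects WHERE red_excess(g, r) > 2.0").fetchall()
print(rows)
```

Because the function executes inside the query engine, only matching rows ever leave the database, which is the whole point when the table is terabytes.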

  7. Making Discoveries
  • Where are discoveries made?
    • At the edges and boundaries
    • Going deeper, collecting more data, using more colors…
  • Metcalfe's law
    • Utility of computer networks grows as the number of possible connections: O(N²)
  • Federating data
    • Federation of N archives has utility O(N²)
    • Possibilities for new discoveries grow as O(N²)
  • Current sky surveys have proven this
    • Very early discoveries from SDSS, 2MASS, DPOSS
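The O(N²) utility claim can be made concrete in a few lines: federating N archives creates N(N-1)/2 distinct pairwise cross-matches. The archive names below are just examples.

```python
from itertools import combinations

# Each unordered pair of archives is a potential cross-match,
# and hence a potential source of new discoveries.
archives = ["SDSS", "2MASS", "DPOSS", "FIRST", "ROSAT"]
pairs = list(combinations(archives, 2))
print(len(pairs))  # 5 archives -> 10 possible cross-matches
assert len(pairs) == len(archives) * (len(archives) - 1) // 2
```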

  8. Data Federations
  • Massive datasets live near their owners:
    • Near the instrument's software pipeline
    • Near the applications
    • Near data knowledge and curation
  • Supercomputer centers become Super Data Centers
  • Each archive publishes (web) services
    • Schema: documents the data
    • Methods on objects (queries)
  • Scientists get "personalized" extracts
  • Uniform access to multiple archives
    • A common "global" schema for the federation

  9. The Virtual Observatory
  • Premise: most data is (or could be) online
  • Federating the different surveys will provide opportunities for new science
  • It's a smart telescope: links objects and data to the literature on them
  • Software became the capital expense
    • Share, standardize, reuse…
  • It has to be SIMPLE
  • You can form your own small collaborations

  10. Strong International Collaboration
  • Similar efforts now in 15 countries:
    • USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, Spain, ESO
  • Total awarded funding worldwide is over $60M
  • Active collaboration among projects
    • Standards, common demos
    • International VO roadmap being developed
    • Regular telecons over 10 time zones
  • Formal collaboration: the International Virtual Observatory Alliance (IVOA)

  11. Boundary Conditions
  • Standards driven by evolving new technologies
    • Exchange of rich and structured data (XML…)
    • DB connectivity, web services, Grid computing
  • Application to the astronomy domain
    • Data dictionaries (UCDs)
    • Data models
    • Protocols
    • Registries and resource/service discovery
    • Provenance, data quality, DATA CURATION!!!
  • Dealing with the astronomy legacy
    • FITS data format
    • Software systems
  • External funding climate

  12. Current VO Challenges
  • How to avoid trying to be everything for everybody?
  • Database connectivity is essential
    • Bring the analysis to the data
    • Core web services, higher-level applications on top
  • Use the 90-10 rule:
    • Define the standards and interfaces
    • Build the framework
    • Build the 10% of services that are used by 90%
    • Let the users build the rest from the components
  • Rapidly changing "outside world"
  • Make it simple!!!

  13. Where are we going?
  • Relatively easy to predict until 2010
    • Exponential growth continues
    • Most ground-based observatories join the VO
    • More and more sky surveys in different wavebands
    • Simulations will have VO interfaces: they can be 'observed'
  • Much harder beyond 2010
    • Peta-surveys are coming online (Pan-STARRS, VISTA, LSST)
    • Technological predictions much harder
    • Changing funding climate
    • Changing sociology

  14. Similarities to HEP

  HEP:                Van de Graaff → cyclotrons → national labs → international (CERN) → SSC vs LHC
  Optical astronomy:  2.5m telescopes → 4m telescopes → 8m-class telescopes → surveys/time domain → 30-100m telescopes

  • Similar trends, with a 20-year delay:
    • fewer and ever bigger projects…
    • increasing fraction of cost is in software…
    • more conservative engineering…
  • Can the exponential continue, or will it be logistic?
  • What can astronomy learn from High Energy Physics?

  15. Why Is Astronomy Different?
  • Especially attractive for the wide public
  • It has no commercial value
    • No privacy concerns; freely share results with others
    • Great for experimenting with algorithms
  • Data has more dimensions
    • Spatial, temporal, cross-correlations
  • Diverse and distributed
    • Many different instruments from many different places and many different times
  • Many different interesting questions

  16. Trends

  CMB surveys:
  • 1990 COBE        1,000
  • 2000 Boomerang   10,000
  • 2002 CBI         50,000
  • 2003 WMAP        1 million
  • 2008 Planck      10 million

  Galaxy redshift surveys:
  • 1986 CfA     3,500
  • 1996 LCRS    23,000
  • 2003 2dF     250,000
  • 2005 SDSS    750,000

  Angular galaxy surveys:
  • 1970 Lick    1M
  • 1990 APM     2M
  • 2005 SDSS    200M
  • 2008 VISTA   1000M
  • 2012 LSST    3000M

  Time domain:
  • QUEST, SDSS extension survey, Dark Energy Camera, Pan-STARRS, SNAP…, LSST…

  Petabytes/year by the end of the decade…

  17. Challenges
  (Diagram: hierarchical tiered access: Tier0, Tier1, Tier2, with 100% / 10% / 1% / fast labels)
  • Real-time detection for 3B objects
  • Pixels (exponential growth slowing down)
    • Size projection: 100 PB by 2020
  • Data transfer (grows slower than data)
  • Data access (hierarchical usage)
  • Fault tolerance and data protection

  18. SkyServer
  • Sloan Digital Sky Survey: pixels + objects
    • About 500 attributes per "object", 300M objects
    • Spectra for 1M objects
    • Currently 2 TB fully public
  • Prototype eScience lab
    • Moving analysis to the data
    • Fast searches: color, spatial
    • Visual tools: join pixels with objects
  • Prototype in data publishing
    • 70 million web hits in 3.5 years
  http://skyserver.sdss.org/
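A "fast color search" of the SkyServer kind can be sketched in miniature with SQLite. The `photoobj` table and its magnitudes are made-up stand-ins for the real SDSS PhotoObj table (SkyServer itself runs on SQL Server); the point is that an index on the color expression turns the cut into a range scan rather than a full table sweep.

```python
import sqlite3

# Toy photometric table; objid, u, g values are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE photoobj (objid INTEGER, u REAL, g REAL)")
con.executemany("INSERT INTO photoobj VALUES (?,?,?)",
                [(1, 19.0, 18.8),   # u-g = 0.2
                 (2, 18.1, 17.2),   # u-g = 0.9
                 (3, 20.5, 18.0)])  # u-g = 2.5

# Expression index on the color (supported by SQLite >= 3.9),
# so the WHERE clause below becomes an index range scan.
con.execute("CREATE INDEX ix_ug ON photoobj (u - g)")

candidates = [r[0] for r in con.execute(
    "SELECT objid FROM photoobj WHERE u - g > 0.8 ORDER BY objid")]
print(candidates)  # the two red objects
```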

  19. Public Data Release: Versions!
  • June 2001: EDR (Early Data Release)
  • July 2003: DR1
    • Contains 30% of the final data, 150 million photo objects
  • July 2005: DR4 at 3.5 TB
    • 60% of the data
    • 4 versions of the data: target, best, runs, spectro
    • Total catalog volume 5 TB
    • See the Terascale Sneakernet paper…
  • Published releases served forever
    • EDR, DR1, DR2, …
    • Soon to include email archives, annotations
    • O(N²): only possible because of Moore's Law!

  20. Spatial Features
  • Precomputed Neighbors
    • All objects within 30"
  • Boundaries, masks and outlines
    • 27,000 spatial objects, stored as spatial polygons
  • Time domain: precomputed Match
    • All objects within 1", observed at different times
    • Found duplicates due to telescope tracking errors; manual fix, recorded in the database
  • MatchHead
    • The first observation of the linked list, used as a unique id for the chain of observations of the same object
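The 1" match criterion boils down to an angular-separation test. A minimal sketch, computing great-circle separation via the haversine formula (coordinates in degrees; the example positions are invented):

```python
import math

def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in arcseconds.
    Inputs in degrees; uses the haversine formula, which is numerically
    stable at the sub-arcsecond separations a Match cares about."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra, ddec = ra2 - ra1, dec2 - dec1
    a = (math.sin(ddec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(dra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0

# Same source observed twice, positions differing by ~0.5"
sep = ang_sep_arcsec(150.00000, 2.20000, 150.00010, 2.20010)
print(round(sep, 2))
assert sep < 1.0  # within the 1" match radius: same object
```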

  21. Things Can Get Complex

  22. 3 Ways To Do Spatial
  • Hierarchical Triangular Mesh (extension to SQL)
    • Uses table-valued stored procedures
    • Acts as a new "spatial access method"
    • Ported to Yukon CLR for a 17x speedup
  • Zones: fits SQL well
    • Surprisingly simple and good at a fixed scale
  • Constraints: a novel idea
    • Lets us do algebra on regions; implemented in pure SQL
  • Paper: "There Goes the Neighborhood: Relational Algebra for Spatial Data Search"
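The zones idea is simple enough to sketch in a few lines of Python. This is a toy, not the SQL implementation from the paper: cut the sky into fixed-height declination stripes, bucket objects by zone, and compare a candidate only against its own stripe and the two adjacent ones. The 30-arcsecond zone height and the object list are illustrative.

```python
from collections import defaultdict

ZONE_HEIGHT = 30.0 / 3600.0  # 30" zones, in degrees

def zone_of(dec):
    """Map a declination (degrees) to an integer zone number."""
    return int((dec + 90.0) / ZONE_HEIGHT)

# (id, ra, dec) triples; objects 0 and 1 are close, object 2 is far away.
objects = [(0, 150.0, 2.2000), (1, 150.0, 2.2001), (2, 150.0, 5.0)]
zones = defaultdict(list)
for oid, ra, dec in objects:
    zones[zone_of(dec)].append((oid, ra, dec))

def neighbors(dec):
    """Candidates for a neighbor search at this declination:
    only three stripes are scanned, never the full table."""
    z = zone_of(dec)
    out = []
    for dz in (-1, 0, 1):
        out.extend(zones.get(z + dz, []))
    return out

print(sorted(o[0] for o in neighbors(2.2000)))  # object 2 is never touched
```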

  23. Pipeline Parallelism: 2.5 hours
  Or… as fast as we can read USNOB + 0.5 hours
  (Diagram: build index from source tables into zones; zone-by-zone comparison against the previous (0:-1), same (0:0) and next (0:+1) zone; merge answer.)
  • Source tables: 2MASS 471 Mrec / 140 GB, USNOB 1.1 Brec / 233 GB
  • Per-zone comparisons: 0:-1 64 Mrec / 2 GB, 0:0 260 Mrec / 9 GB, 0:+1 26 Mrec / 1 GB
  • Outputs: 2MASS→USNOB 350 Mrec / 12 GB, USNOB→2MASS 350 Mrec / 12 GB (stage times: 2 hours + 0.5 hour)

  24. Next-Generation Data Analysis
  • Looking for
    • Needles in haystacks: the Higgs particle
    • Haystacks: dark matter, dark energy
    • Needles are easier than haystacks
  • 'Optimal' statistics have poor scaling
    • Correlation functions are N², likelihood techniques N³
    • For large data sets the main errors are not statistical
  • As data and computers grow with Moore's Law, we can only keep up with N log N
  • A way out?
    • Discard the notion of optimal (data is fuzzy, answers are approximate)
    • Don't assume infinite computational resources or memory
    • Requires a combination of statistics and computer science
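The N² scaling of naive correlation estimators is easy to demonstrate. The toy 1-D "pair count" below stands in for one bin of a two-point correlation function: doubling the data quadruples the work, which is why data doubling every year outruns the computers.

```python
def pair_count_bruteforce(xs, r):
    """Count pairs of 1-D positions closer than r, touching every pair.
    Returns (close_pairs, operations); ops grows as N*(N-1)/2 ~ O(N^2)."""
    ops = 0
    close = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            ops += 1
            if abs(xs[i] - xs[j]) < r:
                close += 1
    return close, ops

xs = [0.1 * k for k in range(200)]
_, ops_n = pair_count_bruteforce(xs, 0.25)
xs2 = xs + [50.0 + 0.1 * k for k in range(200)]
_, ops_2n = pair_count_bruteforce(xs2, 0.25)
print(ops_n, ops_2n)  # 19900 vs 79800: doubling N quadruples the cost
```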

  25. Organization & Algorithms
  • Use of clever data structures (trees, cubes):
    • Up-front creation cost, but only N log N access cost
    • Large speedup during the analysis
    • Tree codes for correlations (A. Moore et al. 2001)
    • Data cubes for OLAP (all vendors)
  • Fast, approximate heuristic algorithms
    • No need to be more accurate than cosmic variance
    • Fast CMB analysis by Szapudi et al. (2001)
    • N log N instead of N³ => 1 day instead of 10 million years
  • Take the cost of computation into account
    • Controlled level of accuracy
    • Best result in a given time, given our computing resources
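The "up-front creation cost, N log N access" pattern can be sketched in one dimension with a sorted list and binary search, a modest stand-in for the tree codes mentioned on the slide: sort once, then answer each neighbor query with an O(log N) search instead of an O(N) sweep.

```python
import bisect

def pair_count_sorted(xs, r):
    """Count pairs of 1-D positions closer than r.
    One O(N log N) sort up front, then each element finds the end of
    its neighborhood with a single binary search."""
    xs = sorted(xs)  # up-front creation cost
    close = 0
    for i, x in enumerate(xs):
        # leftmost index >= x + r, searched only above i: O(log N)
        j = bisect.bisect_left(xs, x + r, lo=i + 1)
        close += j - (i + 1)  # everything in (i, j) is within r of x
    return close

xs = [0.1 * k for k in range(200)]
print(pair_count_sorted(xs, 0.25))  # pairs separated by less than 0.25
```

For narrow separation bins this gives the same answer as the brute-force O(N²) scan, at roughly O(N log N) cost.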

  26. Today's Questions
  • Discoveries
    • Need fast outlier detection
  • Spatial statistics
    • Fast correlation and power spectrum codes (CMB + galaxies)
    • Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on the sphere)
  • Time domain
    • Transients, supernovae, periodic variables
    • Moving objects, 'killer' asteroids, Kuiper-belt objects…

  27. Other Challenges
  • Statistical noise is smaller and smaller
    • Error matrix larger and larger (Planck…)
  • Systematic errors becoming dominant
    • De-sensitize against known systematic errors
    • Optimal subspace filtering (…SDSS stripes…)
  • Comparisons of spectra to models
    • 10⁶ spectra vs 10⁸ models (Charlot…)
  • Detection of faint sources in multi-spectral images
    • How to use all information optimally (QUEST…)
  • Efficient visualization of ensembles of 100M+ data points

  28. Systematic Errors
  • SDSS P(k), main issue:
    • Effects of zero points and flat-field vectors result in large-scale, correlated patterns
  • Two tasks:
    • Estimate how large the effect is
    • De-sensitize the statistics
  • Monte Carlo simulations:
    • 100 million random points, assigned to stripes, runs, camcols, fields, x,y positions and redshifts => database
    • Build the MC error matrix due to zero-point errors
    • Include the error matrix in the KL basis
    • Some modes are sensitive to zero points (# of free parameters)
    • Eliminate those modes from the analysis => projection
    • Statistics insensitive to zero points afterwards
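The de-sensitizing step, eliminating the modes that zero-point errors excite, is at heart a projection. A minimal sketch with plain Python lists: the 3-component vectors here are illustrative stand-ins for KL modes, not real SDSS data.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(d, m):
    """Remove from data vector d its component along mode m,
    leaving d orthogonal to (insensitive to) that mode."""
    coeff = dot(d, m) / dot(m, m)
    return [di - coeff * mi for di, mi in zip(d, m)]

m = [1.0, 1.0, 1.0]   # a mode a uniform zero-point offset would excite
d = [2.0, 3.0, 4.0]   # measured vector, contaminated along m
d_clean = project_out(d, m)
print(d_clean)
assert abs(dot(d_clean, m)) < 1e-12  # statistic now blind to that mode
```

After projecting out every sensitive mode, any remaining statistic built from `d_clean` is, by construction, insensitive to the corresponding zero-point errors.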

  29. Simulations
  • Cosmological simulations have 10⁹ particles and produce over 30 TB of data (Millennium)
    • Build up dark matter halos
    • Track the merging history of halos
    • Use it to assign star formation history
    • Combination with spectral synthesis
  • Too few realizations
  • Hard to analyze the data afterwards
  • What is the best way to compare to the real universe?

  30. Summary
  • Databases became an essential part of astronomy: most data access will soon be via digital archives
  • Data at separate locations, distributed worldwide, evolving in time: move the analysis, not the data!
  • Good scaling of statistical algorithms is essential
  • Many outstanding problems in astronomy are statistical; current techniques are inadequate, and we need help!
  • The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!
