190 likes | 211 Views
Explore the exponential world of astrophysics data, where astronomers handle terabytes of data and mine for new and interesting objects such as quasars. Discover the challenges in managing and analyzing immense datasets, and the importance of databases in conducting parallel data searches. Learn about the groundbreaking research in astronomy and the development of new algorithms for data analysis. Delve into the specialized field of astronomy and its unique characteristics, perfect for experimenting with algorithms and sharing results freely with the public. Join the international collaboration in the National Virtual Observatory and the Sloan Digital Sky Survey, aiming to map the cosmos in unprecedented detail.
E N D
Astrophysics with Terabytes of Data Alex SzalayThe Johns Hopkins University Jim GrayMicrosoft Research
Astronomy in an Exponential World • Astronomers have a few hundred TB now • 1 pixel (byte) / sq arc second ~ 4TB • Multi-spectral, temporal, … → 1PB • They mine it looking fornew (kinds of) objects or more of interesting ones (quasars), density variations in 400-D space correlations in 400-D space • Data doubles every year • Same access for everyone
You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years Oh!, and 1PB ~4,000 disks At some point you need indices to limit searchparallel data search and analysis This is where databases can help If there is too much data to move around, take the analysis to the data! Do all data manipulations at database Build custom procedures and functions in the database You can FTP 1 MB in 1 sec You can FTP 1 GB / min (= 1 $/GB) … 2 days and 1K$ … 3 years and 1M$ Data Access is Hitting a Wall FTP and GREP are not adequate
Next-Generation Data Analysis • Looking for • Needles in haystacks – the Higgs particle • Haystacks: Dark matter, Dark energy • Needles are easier than haystacks • ‘Optimal’ statistics have poor scaling • Correlation functions are N2, likelihood techniques N3 • For large data sets main errors are not statistical • As data and computers grow with Moore’s Law, we can only keep up with N logN • Take cost of computation into account • Controlled level of accuracy • Best result in a given time, given our computing resources • Requires combination of statistics & computer science • New algorithms
Why Is Astronomy Special? • Especially attractive for the wide public • It has no commercial value – “worthless!” (Jim Gray) • No privacy concerns, freely share results with others • Great for experimenting with algorithms • It is real and well documented • High-dimensional (with confidence intervals) • Spatial, temporal • Diverse and distributed • Many different instruments from many different places and many different times Virtual Observatory • The questions are interesting • There is a lot of it (soon Petabytes)
National Virtual Observatory • NSF ITR project, “Building the Framework for the National Virtual Observatory” is a collaboration of 17 funded and 3 unfunded organizations • Astronomy data centers • National observatories • Supercomputer centers • University departments • Computer science/information technology specialists • PI and project director: Alex Szalay (JHU) • CoPI: Roy Williams (Caltech/CACR) • Natural cohesion with Grid Computing • Several widely used applications now up and running • Registry, Datascope, SkyQuery, Unified data access
International Collaboration • Similar grass-roots efforts now in 17 countries: • USA, Canada, UK, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Spain, Hungary, South Korea, ESO, Brazil • Total awarded funding world-wide is over $60M • Active collaboration among projects • Standards, common demos • International VO roadmap being developed • Regular teleconferences over 10 timezones • Formal collaboration International Virtual Observatory Alliance (IVOA)
Sloan Digital Sky Survey Goal Create the most detailed map of the Northern sky “The Cosmic Genome Project” Two surveys in one Photometric survey in 5 bands Spectroscopic redshift survey Automated data reduction 150 man-years of development High data volume 40 TB of raw data 5 TB processed catalogs Data is public The University of Chicago Princeton University The Johns Hopkins University The University of Washington New Mexico State University Fermi National Accelerator Laboratory US Naval Observatory The Japanese Participation Group The Institute for Advanced Study Max Planck Inst, Heidelberg Sloan Foundation, NSF, DOE, NASA
The Imaging Survey Drift scan of 10,000 square degrees 24k x 1M pixel “panoramic” images in 5 colors – broad-band filters (u,g,r,i,z) 2.5 Terapixels of images
The Spectroscopic Survey • Expanding universe • redshift = distance • SDSS Redshift Survey • 1 million galaxies • 100,000 quasars • 100,000 stars • Two high throughput spectrographs • spectral range 3900-9200 Å • 640 spectra simultaneously • R=2000 resolution, 1.3 Å • Features • Automated reduction of spectra • Very high sampling density and completeness
The SkyServer Portal • Sloan Digital Sky Survey: Pixels + Objects • About 500 attributes per “object”, 400M objects • Currently 2.4TB fully public • Prototype eScience lab (800 users) • Moving analysis to the data • Fast searches: color, spatial • Visual tools • Join pixels with objects • Prototype in data publishing • 200 million web hits in 5 years • 930,000 distinct users http://skyserver.sdss.org/
Precision Cosmology • Main questions • Dark Matter, Dark Energy • Over the last few years • Detected that the Universe accelerates • Detected the baryon bumps • Constrained the neutrino mass • Contribute substantially to constraints combined with CMB • Surveys: ‘systems’ astronomy!! • Measuring the parameters of the Universe to a few percent accuracy
SDSS Power Spectrum Main challenge: with so much data the dominant errors are systematic, not statistical! Using large simulations to understand significance of detection
Trends CMB Surveys • 1990 COBE 1000 • 2000 Boomerang 10,000 • 2002 CBI 50,000 • 2003 WMAP 1 Million • 2008 Planck 10 Million Angular Galaxy Surveys • 1970 Lick 1M • 1990 APM 2M • 2005 SDSS 200M • 2008 VISTA 1000M • 2012 LSST 3000M Time Domain • QUEST • SDSS Extension survey • Dark Energy Camera • PanStarrs: 1PB by 2007 • LSST: 100PB by 2020 Galaxy Redshift Surveys • 1986 CfA 3500 • 1996 LCRS 23000 • 2003 2dF 250000 • 2005 SDSS 750000 Petabytes/year by the end of the decade…
Simulations • Cosmological simulations have 109 particles and produce over 30TB of data (Millennium) • Build up dark matter halos • Track merging history of halos • Use it to assign star formation history • Combination with spectral synthesis • Realistic distribution of galaxy types • Need more realizations (now 50) • Hard to analyze the data afterwards -> need DB • What is the best way to compare to real data?
Exploration of Turbulence For the first time, we can now “put it all together” • Large scale range, scale-ratio O(1,000) • Three-dimensional in space • Time-evolution and Lagrangian approach (follow the flow) Unique turbulence database • We are creating a database of O(2,000) consecutive snapshots of a 1,0243 simulation of turbulence: close to 100 Terabytes • Treat it as an experiment
Wireless Sensor Networks • Use 200 wireless (Intel) sensors, monitoring • Air temperature, moisture • Soil temperature, moisture, at least in two depths (5cm, 20 cm) • Light (intensity, composition) • Gases (O2, CO2, CH4, …) • Long-term continuous data • Small (hidden) and affordable (many) • Less disturbance • >200 million measurements/year • Collaboration with Microsoft • Complex database of sensor data and samples, derived from astronomy
Summary • Data growing exponentially • Requires a new model • Having more data makes it harder to extract knowledge • Information at your fingertips • Students see the same data as professionals • More data coming: Petabytes/year by 2010 • Need scalable solutions • Move analysis to the data! • Same thing happening in all sciences • High energy physics, genomics/proteomics, medical imaging, oceanography, environmental science… • Data Exploration: an emerging new branch of science • We need multiple skills in a world of increasing specialization…