Analyzing Large Datasets in Astrophysics

Towards an International Virtual Observatory, Garching, 2002
Analyzing Large Datasets in Astrophysics (Living in an exponential world…)
Alexander Szalay, The Johns Hopkins University

Outline
• Collecting Data
• Exponential Growth
• Making Discoveries
• Publishing Data
• VO: How will it work?
• Web Services
• Atomic vs Composite services
• Distributed queries with SkyQuery
• Cross-Matching Algorithm
• SkyNode Web Services + Portal
• Statistical Analysis of large data sets
Alex Szalay, Garching 2002

The World is Exponential
• Astrophysical data is growing exponentially
  • Doubling every year (Moore’s Law+): both data sizes and number of data sets
• Computational resources scale the same way
  • Constant $$$ will keep up with the data
• Main problem is the software component
  • Currently components are not reused
  • Software costs are an increasingly large fraction
  • Aggregate costs are growing exponentially

Making Discoveries
• When and where are discoveries made?
  • Always at the edges and boundaries
  • Going deeper, using more colors…
• Metcalfe’s law
  • Utility of computer networks grows as the number of possible connections: O(N^2)
• VO: Federation of N archives
  • Possibilities for new discoveries grow as O(N^2)
• Current sky surveys have proven this
  • Very early discoveries from SDSS, 2MASS, DPOSS
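
The O(N^2) claim for a federation of N archives can be made concrete by counting distinct archive pairs, N(N-1)/2. A minimal sketch; the function name is illustrative and not from the talk:

```python
def possible_pairings(n_archives: int) -> int:
    """Number of distinct pairs of archives that could be cross-compared."""
    return n_archives * (n_archives - 1) // 2


# Doubling the number of archives roughly quadruples the pairings.
for n in (2, 5, 10, 20):
    print(n, possible_pairings(n))
```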

Publishing Data

Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists

Changing Roles
• Exponential growth:
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
• More responsibility on projects
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • A lot of development is duplicated and wasted
• More standards are needed
  • Easier data interchange, fewer tools
• More templates are needed
  • Develop less software on your own

Emerging New Concepts
• Standardizing distributed data
  • Web Services, supported on all platforms
  • Custom configure remote data dynamically
  • XML: Extensible Markup Language
  • SOAP: Simple Object Access Protocol
  • WSDL: Web Services Description Language
• Standardizing distributed computing
  • Grid Services
  • Custom configure remote computing dynamically
  • Build your own remote computer, and discard
  • Virtual Data: new data sets on demand

NVO: How Will It Work?
• Define commonly used ‘atomic’ services
• Build higher level toolboxes/portals on top
• We do not build ‘everything for everybody’
• Use the 90-10 rule:
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90%
  • Let the users build the rest from the components

Atomic Services
• Metadata information about resources
  • Waveband
  • Sky coverage
  • Translation of names to universal dictionary (UCD)
• Simple search patterns on the resources
  • Cone Search
  • Image mosaic
  • Unit conversions
• Simple filtering, counting, histogramming
• On-the-fly recalibrations
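
As an illustration of how small an atomic service can be, a cone search reduces to an angular-distance filter around a sky position. This is a minimal in-memory sketch, not the VO service interface; the catalog layout (`objId`, `ra`, `dec` in degrees) and both function names are assumptions made for the example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra0, dec0, radius_deg):
    """Return the rows lying within radius_deg of (ra0, dec0)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra0, dec0) <= radius_deg]


# Toy catalog with made-up positions.
catalog = [
    {"objId": 1, "ra": 181.30, "dec": -0.76},
    {"objId": 2, "ra": 181.35, "dec": -0.70},
    {"objId": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, 181.3, -0.76, 0.1)
print([row["objId"] for row in hits])
```

A production service would answer the same question from a spatial index rather than a linear scan; the linear scan only shows the contract.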

Higher Level Services
• Built on Atomic Services
• Perform more complex tasks
• Examples
  • Automated resource discovery
  • Cross-identifications
  • Photometric redshifts
  • Outlier detections
  • Visualization facilities
• Expectation:
  • Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)

SkyQuery
• Distributed Query tool using a set of services
• Feasibility study, built in 6 weeks from scratch
  • Tanu Malik (JHU CS grad student)
  • Tamas Budavari (JHU astro postdoc)
• Implemented in C# and .NET
• Won 2nd prize of Microsoft XML Contest
• Allows queries like:

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 AND (o.i - t.m_j)>2

Architecture
[Diagram labels: Web Page, Image cutout, SkyQuery, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]

Cross-id Steps

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND (o.i - t.m_j) > 2 AND o.type=3

• Parse query
• Get counts
• Sort by counts
• Make plan
• Cross-match
  • Recursively, from small to large
• Select necessary attributes only
• Return output
• Insert cutout image
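
The planning steps above (get counts, sort by counts, cross-match from small to large, return only the needed attributes) can be sketched in a few lines. This is a toy, in-memory illustration and not SkyQuery's implementation: the match criterion is simplified to a plain angular tolerance in arcseconds, the archives are small Python lists assumed to be already restricted by the AREA and filter clauses, and all names and positions are made up.

```python
import math


def separation_arcsec(a, b):
    """Great-circle separation between two objects, in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a["ra"], a["dec"], b["ra"], b["dec"]))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def plan_and_match(archives, tol_arcsec):
    """archives maps an archive name to its list of candidate objects."""
    # Get counts and sort by counts: the smallest archive goes first.
    order = sorted(archives, key=lambda name: len(archives[name]))
    # Seed the partial result with the smallest set.
    partial = [{order[0]: obj} for obj in archives[order[0]]]
    # Extend the partial tuples one archive at a time, small to large.
    for name in order[1:]:
        partial = [
            {**tup, name: cand}
            for tup in partial
            for cand in archives[name]
            if all(separation_arcsec(member, cand) <= tol_arcsec
                   for member in tup.values())
        ]
    # Select necessary attributes only: here, just the object ids.
    return order, [{name: obj["objId"] for name, obj in tup.items()} for tup in partial]


archives = {
    "SDSS": [
        {"objId": 1, "ra": 181.3003, "dec": -0.7601},
        {"objId": 2, "ra": 181.4000, "dec": -0.8020},
        {"objId": 3, "ra": 181.2000, "dec": -0.7000},
    ],
    "TWOMASS": [
        {"objId": 101, "ra": 181.3000, "dec": -0.7600},
        {"objId": 102, "ra": 181.4000, "dec": -0.8000},
    ],
}
order, matches = plan_and_match(archives, tol_arcsec=3.5)
print(order, matches)
```

Starting from the smallest archive keeps the intermediate result, and therefore the data shipped between nodes, as small as possible.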

Monte-Carlo Simulation
• Comparing different algorithms for 3-way cross-id
  • Transmit all the data
  • Transmit after filtering
  • Recursive cross-match
• Surveys
  • SDSS
  • 2MASS
  • FIRST
• Random variables:
  • Sky Area (0..10 sqdeg)
  • Selectivity of each subselect (0..1)
  • Efficiency of join (0.5..2)
  • Selectivity of common select (0..1)
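
A simulation of this kind can be set up as below. The random variables and their ranges come from the slide; everything else is an assumption made for illustration: the surface densities are invented placeholders, "efficiency of join" is read as the factor by which the partial result grows or shrinks at each join, and cost is counted as rows shipped over the network. It is a toy cost model, not the one behind the talk's results.

```python
import random

# Placeholder surface densities in objects per square degree.
# These values are invented for illustration, not taken from the surveys.
DENSITY = {"SDSS": 10000.0, "2MASS": 1000.0, "FIRST": 100.0}


def one_trial(rng):
    """Rows shipped under each strategy for one random draw."""
    area = rng.uniform(0.0, 10.0)                               # sky area, sq deg
    selectivity = {k: rng.uniform(0.0, 1.0) for k in DENSITY}   # per-archive subselect
    join_eff = rng.uniform(0.5, 2.0)                            # result size factor per join
    common_sel = rng.uniform(0.0, 1.0)                          # common select

    counts = {k: d * area for k, d in DENSITY.items()}
    filtered = {k: counts[k] * selectivity[k] for k in counts}

    transmit_all = sum(counts.values())          # ship every row in the area
    transmit_filtered = sum(filtered.values())   # ship only rows passing each subselect

    # Recursive cross-match: start from the smallest filtered set and
    # pass the running partial result from node to node.
    sizes = sorted(filtered.values())
    running = sizes[0]
    recursive = 0.0
    for _ in sizes[1:]:
        recursive += running        # ship the partial result to the next node
        running *= join_eff         # size after matching at that node
    recursive += running * common_sel   # final rows returned to the portal
    return transmit_all, transmit_filtered, recursive


def simulate(n_trials=10000, seed=0):
    """Mean rows shipped per strategy over n_trials random draws."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(n_trials):
        for i, cost in enumerate(one_trial(rng)):
            totals[i] += cost
    return [t / n_trials for t in totals]


mean_all, mean_filtered, mean_recursive = simulate()
print(f"all={mean_all:.0f} filtered={mean_filtered:.0f} recursive={mean_recursive:.0f}")
```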

SkyNode
• Metadata functions (SOAP)
  • Info, Tables, Columns, Schema, Functions, Keysearch
• Query functions (SOAP)
  • Dataset Query(String sqlCmd)
  • Dataset Xmatch(Dataset input, String sqlCmd, float eps)
• Database
  • MS SQL Server
  • Upload dataset
  • Very fast spatial search engine (HTM-based): crossmatch takes <3 ms/object over 15M in SDSS
  • User defined functions and stored procedures
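
A call to the `Query(String sqlCmd)` function travels as a SOAP message. The sketch below builds a SOAP 1.1 envelope for such a call with the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:skynode` is a hypothetical placeholder, since the real value would come from the node's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:skynode"                      # hypothetical service namespace


def build_query_envelope(sql_cmd: str) -> str:
    """Wrap a Query(sqlCmd) call in a SOAP 1.1 envelope and return it as text."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}Query")
    arg = ET.SubElement(call, f"{{{SERVICE_NS}}}sqlCmd")
    arg.text = sql_cmd  # ElementTree escapes <, > and & on serialization
    return ET.tostring(envelope, encoding="unicode")


message = build_query_envelope("SELECT objId, r FROM PhotoPrimary WHERE r < 20")
print(message)
```

The message would then be sent with an HTTP POST to the node's endpoint; that step is omitted here because the endpoint and SOAPAction values are not given in the slides.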

Data Flow
[Diagram labels: SkyQuery, SkyNode 1, SkyNode 2, SkyNode 3, query]
http://www.skyquery.net

Optimal Statistics
• The examples for optimal statistics have poor scaling
  • Correlation functions N^2, likelihood techniques N^3
• As data sizes grow at Moore’s law, computers can only keep up with at most N log N algorithms
• What goes?
  • Notion of optimal is in the sense of statistical errors
  • Assumes infinite computational resources
  • Assumes that the only source of error is statistical
  • ‘Cosmic Variance’: we can only observe the Universe from one location (finite sample size)
• Solutions require combination of Statistics and CS
• New algorithms: not worse than N log N
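
The scaling argument can be checked with a little arithmetic. Assume, as the talk does, that both the data size N and the available computing power double every year; the wall-clock time of an algorithm relative to today is then cost(N·2^k) / (cost(N)·2^k) after k years. The starting size of 10^6 objects below is an arbitrary choice for the example.

```python
import math


def relative_runtime(cost, n0, years):
    """Run time after `years`, relative to today, if data and compute both double yearly."""
    growth = 2 ** years
    return cost(n0 * growth) / (cost(n0) * growth)


COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n0 = 10 ** 6  # arbitrary starting sample size
for name, cost in COSTS.items():
    print(f"{name:8s} after 10 years: x{relative_runtime(cost, n0, 10):.3g}")
```

Under these assumptions the N^2 and N^3 run times grow by factors of 2^10 and 2^20 over a decade, while the N log N run time grows only by the ratio of the logarithms.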

Clever Data Structures
• Heavy use of tree structures:
  • Up-front cost, but only N log N
  • Large speedup later
  • Tree-codes for correlations (A. Moore et al. 2001)
• Fast, approximate heuristic algorithms
  • No need to be more accurate than cosmic variance
  • Fast CMB analysis by Szapudi et al. (2001)
  • N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
  • Controlled level of accuracy
  • Best result in a given time, given our computing resources
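
The trade of an up-front build cost for cheaper queries can be seen in a small k-d tree pair counter, the basic operation behind a correlation function. This is a generic textbook sketch on flat 2D coordinates, not the tree code of Moore et al.; function names are illustrative, and the build below re-sorts at every level, so it costs about N log^2 N rather than the N log N of a presorted build.

```python
import random


def build(points, depth=0):
    """Build a 2D k-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def count_within(node, q, r):
    """Count tree points within distance r of q, pruning branches that cannot match."""
    if node is None:
        return 0
    point, axis, left, right = node
    dx, dy = point[0] - q[0], point[1] - q[1]
    n = 1 if dx * dx + dy * dy <= r * r else 0
    diff = q[axis] - point[axis]
    if diff - r <= 0:   # the query disc reaches the low side of the split
        n += count_within(left, q, r)
    if diff + r >= 0:   # the query disc reaches the high side of the split
        n += count_within(right, q, r)
    return n


def count_pairs_tree(points, r):
    """Number of distinct pairs closer than r, using the tree."""
    tree = build(points)
    total = sum(count_within(tree, p, r) for p in points)
    return (total - len(points)) // 2   # drop self-matches, undo double counting


def count_pairs_brute(points, r):
    """Same count by checking all N(N-1)/2 pairs."""
    n = 0
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            dx, dy = a[0] - b[0], a[1] - b[1]
            if dx * dx + dy * dy <= r * r:
                n += 1
    return n


rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(500)]
print(count_pairs_tree(pts, 0.05), count_pairs_brute(pts, 0.05))
```

This version is exact; the approximate tree codes mentioned above go further by accepting or rejecting whole branches at once, which is where a controlled level of accuracy enters.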

Angular Clustering with Photo-z
• w(θ) by Peebles and Groth:
  • The first example of publishing and analyzing large data
• Samples based on rest-frame quantities
• Strictly volume limited samples
• Largest angular correlation study to date
• Very clear detection of
  • Luminosity and color dependence
• Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston and the SDSS Collaboration

The Samples
2800 square degrees in 10 stripes, data in custom DB
• All: 50M
• mr < 21: 15M
• 10 stripes: 10M
• 0.1 < z < 0.3, -20 > Mr: 2.2M
• 0.1 < z < 0.5, -21.4 > Mr: 3.1M
• -20 > Mr > -21: 1182k
• -21 > Mr > -23: 931k
• -21 > Mr > -22: 662k
• -22 > Mr > -23: 269k
Unlabeled counts from the slide figure: 343k, 316k, 254k, 185k, 280k, 127k, 326k, 185k

The Stripes
• 10 stripes over the SDSS area, covering about 2800 square degrees
• About 20% lost due to bad seeing
• Masks: seeing, bright stars, etc.
• Images generated from query by web service

The Masks
• Stripe 11 + masks
• Masks are derived from the database
• Search and intersect extended objects with boundaries

The Analysis
• eSpICE: I. Szapudi, S. Colombi and S. Prunet
• Integrated with the database by T. Budavari
• Extremely fast processing (N log N)
  • 1 stripe with about 1 million galaxies is processed in 3 mins
  • Usual figure was 10 min for 10,000 galaxies => 70 days
• Each stripe processed separately for each cut
• 2D angular correlation function computed
• w(θ): average with rejection of pixels along the scan
  • flat field vector causes mock correlations
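
The "70 days" figure follows from scaling the older benchmark to one million galaxies. The slide does not state the scaling law; the sketch assumes quadratic (N^2) scaling, which is the assumption that reproduces the quoted number.

```python
def scaled_minutes(base_minutes, base_n, n, exponent=2):
    """Extrapolate a run time from base_n to n objects, assuming cost ~ N**exponent."""
    return base_minutes * (n / base_n) ** exponent


# 10 minutes for 10,000 galaxies, scaled to 1,000,000 galaxies.
minutes = scaled_minutes(10.0, 10_000, 1_000_000)
days = minutes / (60 * 24)
print(f"{days:.1f} days")                           # about 69.4 days
print(f"speedup over 3 minutes: {minutes / 3:.0f}x")
```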

Angular Correlations I.
• Luminosity dependence: 3 cuts
  • -20 > M > -21
  • -21 > M > -22
  • -22 > M > -23

Angular Correlations II.
• Color dependence: 4 bins by rest-frame SED type

Summary
• Exponential data growth – distributed data
• Web Services – hierarchical architecture
• Use the 90-10 rule (maybe 80-20)
• There are clever ways to federate datasets!
• Statistical analyses do not follow Moore’s law
  • Need to revisit optimal statistics
• Give interesting new tools into the hands of smart young people…
  • They will quickly turn them into cutting edge science

Virtual Observatory
Astronomy with an attitude…
