Roman Olschanowsky Data Applications and Services
Why SDSC Data Central?
• Today's scientists and engineers are increasingly dependent on valued community data collections and comprehensive data cyberinfrastructure (CI)
• Need for large data
  • many users have large-data needs that extend above and beyond what their home environments can provide
  • many are increasingly dependent on valued data collections used community-wide
• A comprehensive data environment is needed that incorporates access to the full spectrum of data-enabling resources
Why SDSC Data Central?
• SDSC has experienced increasing demand from the domain communities for collaborations on data-driven discovery, including
  • hosting, managing and publishing data in digital libraries
  • sharing data through the Web and data grids
  • creating, optimizing and porting large-scale databases
  • data-intensive computing with high-bandwidth data movement
  • analyzing, visualizing, rendering and data mining large-scale data
  • preservation of data in persistent archives
  • building collections, portals, ontologies, etc.
  • providing resources, services and expertise
Data-Driven Discovery
Example collections across domains:
• Life sciences: JCSG/SLAC – 15.7 TB
• Astronomy: NVO – 93 TB
• Arts and humanities: Japanese Art Images – 600 GB
• Engineering: TeraBridge – 4 TB
• Geosciences: SCEC – 270 TB
• Ocean science: SCOOP – 7 TB
Community Data Collections and Databases
• Researchers need a trusted repository for managing, publishing and sharing their data with project and community members
• Many are increasingly dependent on valued community data
• In response to the large number of requests, SDSC has formed DataCentral
What is DataCentral?
• A comprehensive data environment that incorporates access to the full spectrum of data-enabling resources
• Started as the first program of its kind to support research and community data collections and databases
• An umbrella for SDSC production data efforts, enabling "Data to Discovery"
Data to Discovery: DataCentral service areas
• Portals, data grids, WAN file systems
• Data-intensive computing, high-bandwidth data movement
• Consulting, support, SACs, ontologies, education
• Fostering sharing and collaboration, collection management, data services
• Preservation, digital libraries, offsite backup, Chronopolis
• Data analysis, databases, data mining, visualization, rendering
What does SDSC DataCentral offer?
• SDSC has been actively collaborating with many researchers and national-scale projects on their integrated data efforts
• We offer expertise and resources for:
  • Public data collections and database hosting
  • Long-term storage and preservation (tape and disk)
  • Remote data management and access (SRB, portals)
  • Data analysis, visualization and data mining
  • Professional, qualified 24/7 support
Data Resources Available through DataCentral
• Expertise in
  • high-performance management, hosting and publishing of large data
  • data migration, upload and sharing through the grid
  • database application tuning, porting and optimization
  • SQL query tuning and schema design
  • data analysis, visualization and data mining
  • portal creation and collection publication
  • preservation of data in persistent archives, etc.
DataCentral Resources
DataCentral infrastructure:
• HW and SW resources
  • SAM-QFS, HPSS (6 PB silo), disk, DB servers
  • Web farm
  • Accounting system
  • Data management tools and data analysis SW
  • Appropriate space, power, cooling and UPS systems
• Human resources
  • System administrators and collection specialists supporting users and applications
  • 24/7 operators
[Figure: SDSC machine room and web-based portal – servers, staff expertise, storage resources, security, networking, UPS systems, software tools, web services, 24/7 operations, etc.]
Community Self-Selection: SDSC DataCentral
• First program of its kind to support research and community data collections and databases
• Comprehensive resources
  • Disk: 400 TB, accessible via HPC systems, Web, SRB, GridFTP
  • Databases: DB2, Oracle, MySQL
  • SRB: collection management
  • Tape: 25 PB, accessible via file system, HPSS, Web, SRB, GridFTP
• Data collection and database hosting
  • Batch-oriented access
  • Collection management services
• Collaboration opportunities
  • Long-term preservation
  • Data technologies and tools
• Examples of allocated data collections include
  • Bee Behavior (behavioral science)
  • C5 Landscape DB (art)
  • Molecular Recognition Database (pharmaceutical sciences)
  • LIDAR (geoscience)
  • LUSciD (astronomy)
  • NEXRAD-IOWA (earth science)
  • AMANDA (physics)
  • SIO_Explorer (oceanography)
  • Tsunami and Landsat Data (earthquake engineering)
  • UC Merced Library Japanese Art Collection (art)
  • TeraBridge (structural engineering)
DataCentral Services, Tools, and Technologies for Data Management and Synthesis
• Data systems: SAM-QFS, HPSS, GPFS, SRB
• Data services
  • Data migration/upload, usage and support (SRB)
  • Database selection and schema design (Oracle, DB2, MySQL)
  • Database application tuning and optimization
  • Portal creation and collection publication
  • Data analysis (e.g., MATLAB, SAS) and mining (e.g., WEKA)
• Data-oriented toolkits and tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (workflow management)
  • Vista volume renderer (visualization), etc.
Cyberinfrastructure and Data – Integration Infrastructure
A layered view, from applications down to hardware:
• Applications (medical informatics, biosciences, ecoinformatics, …): how do we combine data, knowledge and information management with simulation and modeling?
• Visualization: how do we represent data, information and knowledge to the user?
• Data mining, simulation modeling, analysis, data fusion: how do we detect trends and relationships in data?
• Knowledge-based integration, advanced query processing: how do we obtain usable information from data?
• Grid storage, file systems, database systems: how do we collect, access and organize data?
• Networked storage (SAN), high-speed networking, storage hardware, instruments, sensornets: how do we configure computer architectures to optimally support data-oriented computing?
• Cross-cutting concerns: integration, coordination, interoperability
Cyberinfrastructure and Data – Data Integration
• Complex "multiple-worlds" mediation: middleware federates data across disciplinary vocabularies, while portals and domain-specific APIs provide user access to the disciplinary databases
• Geosciences example: what are the distribution and U/Pb zircon ages of A-type plutons in Virginia? What is their 3-D geometry? How does it relate to host rock structures? (Integrates the Virginia geologic map, geophysical gravity contours, geochronologic Concordia data, geochemical data, and the structure DB foliation map.)
• Life sciences example: integration across scales, from organisms, organs, cells and organelles down to biopolymers and atoms (anatomy, physiology, cell biology, proteomics, genomics, medicinal chemistry)
Tracking the Heavens
"The Universe is now being explored systematically, in a panchromatic way, over a range of spatial and temporal scales that lead to a more complete, and less biased understanding of its constituents, their evolution, their origins, and the physical processes governing them." – Towards a National Virtual Observatory
[Images: Hubble, Palomar, and Sloan telescopes]
The Virtual Observatory
• Premise: most observatory data is (or could be) online
• So, the Internet is the world's best telescope:
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray, radio, …
  • It's as deep as the best instruments
  • It is up when you are up
  • The "seeing" is always great
  • It's a smart telescope: it links objects and data to the literature on them
• Software has become a major expense
  • Share, standardize, reuse, …
Slide modified from Alex Szalay, NVO
Downloading the Night Sky
• The National Virtual Observatory (NVO)
  • The astronomy community came together to set standards for services and data
  • Interoperable, multi-terabyte online databases
  • Technology-enabled, science-driven
• NVO combines over 100 TB of data from 50 ground- and space-based telescopes and instruments to create a comprehensive picture of the heavens
  • Sloan Digital Sky Survey, Hubble Space Telescope, Two Micron All Sky Survey, National Radio Astronomy Observatory, etc.
[Images: Hubble, Palomar, and Sloan telescopes]
Using Technology to Evolve Astronomy
• Looking for
  • needles in haystacks – the Higgs particle
  • haystacks – dark matter, dark energy
• Statistical analysis often deals with
  • creating uniform samples
  • data filtering
  • assembling relevant subsets
  • censoring bad data
  • "likelihood" calculations
  • hypothesis testing, etc.
• Traditionally these tasks are performed on files; most of them are much better done inside a database (see the sketch below)
Slide modified from Alex Szalay, NVO
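To make the files-versus-databases point concrete, here is a minimal sketch, assuming a hypothetical SQLite catalog with invented table and column names (objects, ra, dec, mag_r, flags), of how a uniform, censored sample can be assembled with one declarative query instead of a script that scans millions of files. It is an illustration only, not NVO code.

```python
# Minimal sketch: sample selection inside a database rather than on flat files.
# The database, table, and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("sky_catalog.db")  # hypothetical catalog database
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        obj_id INTEGER PRIMARY KEY,
        ra REAL,        -- right ascension, degrees
        dec REAL,       -- declination, degrees
        mag_r REAL,     -- r-band magnitude
        flags INTEGER   -- quality flags; nonzero marks suspect measurements
    )
""")

# Build a uniform, censored sample declaratively: one query replaces a
# file-scanning pipeline for filtering, censoring, and subset assembly.
cur.execute("""
    SELECT obj_id, ra, dec, mag_r
    FROM objects
    WHERE flags = 0                      -- censor bad data
      AND mag_r BETWEEN 14.0 AND 20.0    -- uniform magnitude cut
      AND ra BETWEEN 150.0 AND 160.0     -- restrict to a sky region
    ORDER BY mag_r
""")
sample = cur.fetchall()
print(f"selected {len(sample)} objects")
conn.close()
```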
How NVO Works
• Raw data comes from large-scale telescopes
• Telescopes provide a daily sweep of the sky; scientists "clean" the data, which is then converted from temporal to spatial organization, allowing indexing over both dimensions (a sketch of this idea follows below)
• All NVO data on the website is available to the public without restriction (by community agreement, all data become public after one year)
• NVO databases are distributed and mirrored at multiple sites
[Image: Crab Nebula, Palomar Telescope]
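The following simplified sketch illustrates the temporal-to-spatial reorganization described above; it is not the actual NVO pipeline, and the zone size, detections, and field names are invented. It bins time-ordered detections into declination zones so the catalog can be searched by sky position as well as by time; production surveys use more sophisticated sky-partitioning schemes.

```python
# Illustrative sketch: turn time-ordered detections into a spatially indexed
# catalog by assigning each object to a declination "zone" and sorting by
# right ascension within the zone. All values here are made up.
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5  # width of each declination stripe (illustrative)

def zone_id(dec_deg: float) -> int:
    """Map a declination to an integer zone number."""
    return int((dec_deg + 90.0) / ZONE_HEIGHT_DEG)

# Hypothetical time-ordered detections: (timestamp, ra, dec, magnitude)
detections = [
    ("2005-06-01T03:12:00", 151.2, 2.3, 17.8),
    ("2005-06-01T03:12:05", 151.9, 2.4, 19.1),
    ("2005-06-02T02:58:41", 210.4, -5.1, 16.2),
]

# Build the spatial index: zone -> detections, each keeping its timestamp
# so the catalog can still be queried over time as well.
catalog = defaultdict(list)
for t, ra, dec, mag in detections:
    catalog[zone_id(dec)].append((ra, t, mag))
for zone in catalog:
    catalog[zone].sort()  # sort by RA within each declination zone

# Spatial query: everything in the zones covering dec 2.0-2.5 degrees.
for zone in range(zone_id(2.0), zone_id(2.5) + 1):
    for ra, t, mag in catalog.get(zone, []):
        print(f"zone {zone}: ra={ra} observed at {t}, mag={mag}")
```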
Making Discoveries Using the NVO
Scientists at Johns Hopkins, Caltech, and other institutions confirmed the discovery of a new brown dwarf. Search time over 5,000,000 files went from months to minutes using NVO database tools and technologies. Brown dwarfs, considered small, cool "failed stars," are often called the "missing link" in the study of star formation.
Cyberinfrastructure and NVO
• Sky surveys from major telescopes are indexed and catalogued in NVO databases by time and spatial location using the Storage Resource Broker and other tools
• NVO collections are archived at multiple sites and accessed via Grid technologies
• Software tools and web portals create an environment for ingestion of new information, mining, discovery, and dissemination
Moving the Earth
The Earth is constantly evolving through the movement of "plates." In plate tectonics, the Earth's outer shell (lithosphere) is posited to consist of seven large and many smaller moving plates. As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes such as earthquakes, volcanoes, and mountain building, typically at plate boundaries.
Earthquake Simulations
How dangerous is the San Andreas Fault?
• Geoscience researchers can now use massive amounts of geological, historical, and environmental data to simulate natural disasters such as earthquakes
• Focus is on understanding big earthquakes and their impact
• Simulations combine large-scale data collections, high-resolution models, and supercomputer runs
• Simulation results provide new scientific information enabling better
  • estimation of seismic risk
  • emergency preparation, response, and planning
  • design of the next generation of earthquake-resistant structures
• Results provide immense societal benefits, helping to save many lives and billions in economic losses
[Map: Major earthquakes on the San Andreas Fault, 1680–present: 1680 M 7.7, 1857 M 7.8, 1906 M 7.8]
TeraShake simulates a magnitude 7.7 earthquake along the southern San Andreas Fault close to LA, using seismic, geophysical, and other data from the Southern California Earthquake Center.
How TeraShake Works
How TeraShake simulates earthquakes (a simplified sketch follows below):
• Divide Southern California up into "blocks"
• For each block, gather all the data on ground surface composition, geological structures, fault information, etc.
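The following is an illustrative sketch only, not TeraShake itself: it shows the block-decomposition idea described above, with invented region bounds, block size, and a placeholder property lookup.

```python
# Illustrative sketch of dividing a simulation region into uniform blocks and
# attaching per-block properties. Bounds, block size, and lookup are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Block:
    lat_min: float
    lon_min: float
    size_deg: float
    properties: dict = field(default_factory=dict)  # e.g., surface composition

def make_blocks(lat_range, lon_range, size_deg):
    """Tile the (lat, lon) bounding box with square blocks of size_deg degrees."""
    blocks = []
    lat = lat_range[0]
    while lat < lat_range[1]:
        lon = lon_range[0]
        while lon < lon_range[1]:
            blocks.append(Block(lat, lon, size_deg))
            lon += size_deg
        lat += size_deg
    return blocks

def lookup_properties(block):
    """Placeholder for pulling ground composition, faults, etc. for one block."""
    return {"surface": "unknown", "faults": []}

# Rough bounding box for Southern California (illustrative values only).
blocks = make_blocks(lat_range=(32.0, 36.0), lon_range=(-121.0, -115.0), size_deg=0.5)
for b in blocks:
    b.properties = lookup_properties(b)
print(f"{len(blocks)} blocks prepared for the simulation")
```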
SCEC Data Requirements
Resources must support a complicated orchestration of computation and data movement:
• 240 processors on SDSC DataStar for 5 days, with 1 TB of main memory
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec to the parallel file system
• 10–20 TB of data archived per day
• Data parking of 100s of TBs for many months
• "Fat nodes" with 256 GB for pre-processing and post-run visualization
The next-generation simulation will require even more resources: researchers plan to double the temporal/spatial resolution of TeraShake.
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." – Bernard Minster, Scripps Institution of Oceanography
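As a rough sanity check of the figures quoted above (47 TB of output, 2 GB/s sustained I/O, 10–20 TB/day to the archive), the back-of-the-envelope calculation below estimates the pure write time and the archiving time; the decimal-terabyte convention and rounding are my own assumptions.

```python
# Back-of-the-envelope check of the data-movement figures quoted above.
TB = 1e12  # decimal terabytes, for a rough estimate
GB = 1e9

total_output_bytes = 47 * TB
io_rate = 2 * GB            # bytes per second of continuous I/O

write_seconds = total_output_bytes / io_rate
print(f"pure write time at 2 GB/s: {write_seconds / 3600:.1f} hours")
# ~6.5 hours of I/O alone, spread across the multi-day run

for archive_rate_tb_per_day in (10, 20):
    days = 47 / archive_rate_tb_per_day
    print(f"archiving 47 TB at {archive_rate_tb_per_day} TB/day: {days:.1f} days")
# roughly 2.4-4.7 days to move the output to the archive,
# comparable to the 5-day compute run itself
```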
SURA Coastal Ocean Observing and Prediction (SCOOP) – scoop.sura.org
"With SDSC's Storage Resource Broker software we can access these data sets in DataCentral through the Grid from anywhere in the world." This will help SCOOP researchers make their data available to the wider coastal modeling community. The data for the 2005 hurricane season are particularly valuable, and the SCOOP collection in DataCentral covers Katrina, Rita, and Wilma, three of the strongest category five hurricanes on record.
[Image: Simulations predicting the approach of Katrina to New Orleans. The wind fields of Katrina are shown as white/grey ribbons, clearly showing the hurricane vortex. The yellow-to-red coloring beneath the eye of the hurricane shows the storm surge moving across the gulf, pushed by the hurricane's wind. W. Benger and S. Venkataraman, CCT/LSU.]
SIOExplorer: Web Exploration of Seagoing Archives
"Bridging the Gap between Libraries and Data Archives"
• Data: 50 years of digital data, growing 200 GB per year
• Images: 99 years of SIO Archives
• Documents: reports, publications, books
• Partners: UCSD Libraries, Scripps Institution of Oceanography, San Diego Supercomputer Center
3,000 cruises online at SIO (SIOExplorer.ucsd.edu)
AMANDA (Antarctic Muon And Neutrino Detector Array) – amanda.uci.edu
The dream of constructing a radically different telescope has been realized by the innovative AMANDA-II project. Instead of sensing light, like all telescopes since the time of Galileo, AMANDA responds to a fundamental particle called the neutrino. Neutrino messengers provide a startlingly new view of the Universe. The 20 TB/yr produced requires manipulation, processing, filtering, and Monte Carlo data analyses in the search for high-energy neutrinos. A full data analysis requires a total of 40 TB/yr of space.
[Diagram: AMANDA-II site at MAPO near the Amundsen-Scott South Pole Station and South Pole airport, with depth markers at 1,500 m and 2,000 m]
City of Hope's Informatics Core Lab
• The future of genomics-enabled medicine depends on the creation of tools that allow scientists to explore the relationships between specific genetic characteristics and specific disease outcomes
• Landscape of tools:
  • LIMS: collecting and organizing the lab data
  • Microarray analysis: quantifying cellular response to stimuli
  • Virtual screening of lead compounds
• Each of these tools has, at its core, the use of high-end computation and high-end data storage and integration
• Together, SDSC and City of Hope push forward the limits of genomics-enabled medicine
WMS Global Mosaic
• High-resolution global mosaic of the Earth
• Greyscale GeoTIFFs with geolocation tags for GIS integration
• Produced from 8,200 individual Landsat 7 scenes of over 500 MB each
• These data sets provide global imagery, elevation data, and formats for NASA World Wind and GEON's GeoFusion browser