Big Data Analytics for Life & Agricultural Sciences

IBM Big Data Session
• IBM Big Data Platform Overview: Bill Zanine, GM, IBM Advanced Analytics for Big Data
• IBM Big Data Life Sciences, Healthcare & Research Use Cases: Bill Zanine

The IBM Big Data Platform
High-Performance Computing
• IBM Blue Gene/Q: High performance for computationally intensive applications
• Platform Computing: High-performance framework for distributed computing
Hadoop
• InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
Stream Computing
• InfoSphere Streams: Low latency analytics for streaming data
MPP Data Warehouse
• IBM Puredata System for Operational Analytics: BI + Ad Hoc Analytics, Structured Data
• IBM Smart Analytics System: Operational Analytics on Structured Data

IBM Big Data Strategy: Optimized Analytic & Compute Platforms
• Big data, business strategy and new analytics require optimized analytic platforms:
• Integrate and manage the full variety, velocity and volume of data
• Variety, velocity and volume further drive the need for optimized compute platforms
• Operate on the data in its native form, in its current location, minimizing movement
• Simplify data management, embed governance, seamlessly leverage tools across platforms
[Diagram: Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics]
[Diagram: IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Analytic Accelerators, Data Warehouse, Stream Computing, High-Performance Computing, Hadoop System, Information Integration & Governance]

IBM Big Data Platforms Span the Breadth of Data Intensive Analytic Computing Needs
Big Data Computing
• No single architecture can fulfill all the compute needs for big data analytics, exploration and reporting
• Specialized architectures and technologies are optimized to solve specific compute profiles, data volumes and data types
• Architectures must be leveraged appropriately for scalable, cost-effective computing
[Diagram labels: Peta-scale Analytics; Real-time Analytics; Petascale Computing; Petascale Data Processing; Petascale Interactive Data Processing; In-Line Analytics; Highly-Planned, Batch; Scalable Batch Unstructured; Batch & Interactive Structured; Highly Planned Autonomous]

Data Intensive Computing for Life Sciences
• Streaming Data: Sensor Data (Manufacturing, Medical, Environmental & Lab); Patient Monitoring; Quality Control
• Peta-Scale Data: Sequencer Data, Assays, Medical Records, Tissue & Cell Imaging; SNP Alignment; Image Classification; Attribute Extraction
• Peta-Scale Computing: Simulations & Models, Publication Graphs; Protein Science; Molecular Dynamics; Complex Graph Analysis
• Data Intensive, Interactive Computing: Image Metadata, Genomic Profiles, Environmental & Patient Attribution; Translational Medicine; Genotype/Phenotype; Predictive Healthcare
[Diagram labels: Complex Simulations; Matrix Acceleration; Compute Acceleration; Data Intensive Compute Acceleration]

Tackling Computational Challenges in Life Sciences
• The Vast Datascape of Life Sciences
• Scalable, Cost Efficient Genomic Processing and Analysis
• Scalable Sensor Processing and Geospatial Analytics
• Health Outcomes and Epidemiology

Combining Big Data and Smarter Analytics to Improve Performance in the Life Sciences
[Diagram labels: Weather Yield SNP Claims Biologic Records Health Records Plant Records Environmental Records Soil Lab Results Human Plant Prescription Utilities Disease Longitudinal Vegetation Tissue Diagnosis Demographics RNA Topography Tissue Proteins]

Scalable Genomic Data Processing

SUNY Buffalo – Center for Computational Research
Data Intensive Discovery Initiative

SUNY Buffalo – Large Gene Interaction Analytics
UB Center for Protein Therapeutics
• Use the new algorithms and add multiple variables that were previously nearly impossible to include
• Reduce the time required to conduct analysis from 27.2 hours without the IBM Puredata data warehouse appliance to 11.7 minutes with it
• Carry out their research with little to no database administration
• Publish multiple articles in scientific journals, with more in process
• Proceed with studies based on "vector phenotypes", a more complex variable that will further push the IBM Puredata data warehouse appliance platform

Revolution R – Genome Wide Association Study
• Genome Wide Association Study (GWAS)
  • An epidemiological study that examines many common genetic variants in different individuals to see if any variant is associated with a trait
• Revolution R allows the bio-statisticians to work with Puredata as if they were simply using R on their desktop
  • Simplicity with performance considered a "game-changer" by end-users
• CRAN Library support allows them to benefit from the aggregate knowledge of the R community
  • Extensive selection of packages for bio-statistics and relevant analytic techniques allows developers to be significantly more productive
"What finished in 2 hours 47 minutes on the 1000-6 was still running on the HPC environment and was not estimated to complete until after 14+ days." (Manager, IT Solution Delivery)
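As a rough illustration of the per-variant association test that a GWAS repeats across many variants, the sketch below computes an allelic chi-square statistic for a 2x2 table of allele counts (cases vs. controls) and applies a Bonferroni-corrected threshold. This is not the Revolution R workflow from the slide, and it is written in Python rather than R; the function names, variant IDs and allele counts are all hypothetical and chosen only to make the example self-contained.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected == 0:
                return 0.0  # degenerate table: no evidence either way
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def p_value_1dof(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def scan_variants(counts_by_variant, alpha=0.05):
    """Test every variant; flag those passing a Bonferroni-corrected threshold."""
    threshold = alpha / len(counts_by_variant)
    results = {}
    for variant, counts in counts_by_variant.items():
        p = p_value_1dof(allelic_chi_square(*counts))
        results[variant] = (p, p < threshold)
    return results

# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref)
toy_counts = {
    "snp_A": (60, 40, 40, 60),  # alternate allele enriched in cases
    "snp_B": (30, 70, 30, 70),  # identical frequencies in both groups
}
scan = scan_variants(toy_counts)
for variant, (p, significant) in scan.items():
    print(f"{variant}: p={p:.4g} significant={significant}")
```

A real study would test hundreds of thousands of variants and adjust for covariates and population structure, which is where the in-database parallelism described on the slide matters.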

EC2 vs Puredata: Bowtie Financial Value Proposition
• What amount of processing on EC2 equates to the cost of a Puredata system?
• Bowtie on EC2
  • Assume the new system will be used for Bowtie, and only Bowtie
  • Today, Bowtie takes 3 hours on 320 EC2 CPUs
  • Cost of each Bowtie run = 3 hours * 320 CPUs * $0.68 per CPU per hour
  • $653* per Bowtie run
• Bowtie on Puredata
  • TF-6 costs $600K*, or $200K per year assuming 3-year deferral
• How many times would a customer have to run Bowtie on EC2 (On-Demand) for the same expenditure?
  • $200K per year / $653 per run = 306 Bowtie runs per year
• Puredata is the better financial value proposition if Bowtie needs to run more than about 300 times a year, for 3 years
• Also, Puredata TF-6 offers 9x the processing capacity for Bowtie of a comparably priced EC2 environment
*Costs are relative, based upon 2010 list pricing
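The break-even arithmetic on this slide can be reproduced in a few lines. The sketch below uses only the figures quoted above (2010 list pricing); the constant and function names are illustrative, not part of any IBM or AWS tool.

```python
# Figures quoted on the slide (2010 list pricing); names are illustrative.
HOURS_PER_RUN = 3
CPUS_PER_RUN = 320
EC2_RATE_PER_CPU_HOUR = 0.68    # USD, On-Demand
PUREDATA_LIST_PRICE = 600_000   # USD for a TF-6
DEFERRAL_YEARS = 3

def ec2_cost_per_run(hours, cpus, rate):
    """Cost in USD of one Bowtie run on EC2."""
    return hours * cpus * rate

def break_even_runs_per_year(system_price, years, cost_per_run):
    """EC2 runs per year that cost as much as the appliance's annualized price."""
    annual_cost = system_price / years
    return annual_cost / cost_per_run

cost = ec2_cost_per_run(HOURS_PER_RUN, CPUS_PER_RUN, EC2_RATE_PER_CPU_HOUR)
runs = break_even_runs_per_year(PUREDATA_LIST_PRICE, DEFERRAL_YEARS, cost)

print(f"EC2 cost per Bowtie run: ${cost:,.2f}")
print(f"Break-even Bowtie runs per year: {runs:.1f}")
```

Changing any one input (for example a different hourly rate) shows how sensitive the roughly 300-runs-per-year break-even point is to pricing.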

Benchmarks on Tanay ZX Series
For all the options traded in the US on a given day as reported by OPRA (500K to 1 million trades), the Tanay ZX Series can calculate implied volatility in less than 500 milliseconds

Applications in Healthcare
• Predictive Modeling
  • Relationship between conditions, genetics, demographics and outcomes
  • Survival modeling
• Monte Carlo Simulations
  • Gibbs Sampling for MCMC studies on patient data
  • Drug design: molecular docking
• Gene Sequencing
  • Parallelized Basic Local Alignment Search Tool (BLAST)

Sensor & Geospatial Data Processing

IBM Puredata Spatial - Precision Farming
High Speed Analytics on Farm Telematics
• Data: Yield data (GPS), Soil Data, Common Land Units (CLUs), Elevation, Farm Plots
Example: Farm Equipment Company
• Intersect: 48 million crop yield records (points) with 30 million Common Land Units
• Result: ~411,000 summary records by CLU (min, max, avg yield)
• Total time: ~45 min
"We would not even attempt to do this large a process on Oracle." (Customer GIS Analyst)
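The intersect-then-summarize pattern in this example can be sketched at toy scale as a point-in-polygon test followed by a per-CLU aggregate. This is a minimal illustration under stated assumptions, not the appliance's spatial implementation: the CLU polygons, coordinates and yield values are hypothetical, and the brute-force loop would not scale to tens of millions of records without a spatial index or in-database execution.

```python
from collections import defaultdict

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def summarize_yield_by_clu(yield_points, clus):
    """yield_points: iterable of (x, y, yield); clus: dict of clu_id -> polygon."""
    buckets = defaultdict(list)
    for x, y, value in yield_points:
        for clu_id, polygon in clus.items():
            if point_in_polygon(x, y, polygon):
                buckets[clu_id].append(value)
                break  # assume CLUs do not overlap
    return {
        clu_id: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for clu_id, v in buckets.items()
    }

# Hypothetical data: two square CLUs and five yield points (one outside both).
clus = {
    "CLU-A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "CLU-B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
points = [(2, 3, 150.0), (5, 5, 170.0), (12, 4, 120.0), (15, 8, 140.0), (30, 30, 999.0)]

summary = summarize_yield_by_clu(points, clus)
for clu_id, stats in sorted(summary.items()):
    print(clu_id, stats)
```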

Vestas
A global wind energy company based in Denmark
• Business Challenge
  • Improve placement of wind turbines: save time, increase output, extend service life
• Project objectives
  • Leverage large volume of weather data (2.8 PB today; ~16 PB by 2015)
  • Reduce modeling time from weeks to hours
  • Optimize ongoing operations
Why IBM?
• Domain expertise
• Reliability, security, scalability, and integrated solution
• Standard enterprise software support
• Single vendor for software, hardware, storage, support
The Solution:
• IBM InfoSphere BigInsights Enterprise Edition
• IBM xSeries hardware

University of Ontario Institute of Technology
Detecting life-threatening conditions in neonatal care units
• Business Challenge
  • Premature births and associated health risks are on the rise.
  • Enormous data loss from patient monitoring equipment: 3600 readings/hr reduced to 1 spot reading/hr.
  • By analyzing physical behaviors (heart rate, respiration, etc.) it is possible to determine when the body is coming under stress.
• Project objectives
  • Analyze ALL the data in real time to detect when a baby is becoming unwell earlier than is possible today.
  • Reduce average length of stay in neonatal intensive care, reducing healthcare costs.
• The benefits
  • Analyze ~90 million points of data per day per patient in real time: every reading taken is analyzed.
  • Able to stream the data into a database, and shown that the process can keep pace with the incoming data.
Solution Components:
• InfoSphere Streams
  • On premises
  • In the cloud
• Warehouse to correlate physical behavior across different populations.
• Models developed in warehouse used to analyze streaming data.
"I could see that there were enormous opportunities to capture, store and utilize this data in real time to improve the quality of care for neonatal babies." Dr. Carolyn McGregor, Canada Research Chair in Health Informatics, University of Ontario Institute of Technology
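To make the contrast between analyzing every reading and keeping one spot reading per hour concrete, here is a minimal sketch of a rolling-window check over a stream of per-second readings. It is not the InfoSphere Streams application or a clinical algorithm: the window size, threshold and synthetic heart-rate values are assumptions chosen only for illustration.

```python
from collections import deque

def rolling_alerts(readings, window=60, threshold=3.0):
    """Yield (index, value) for readings that deviate from the mean of the
    previous `window` readings by more than `threshold` standard deviations."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((r - mean) ** 2 for r in recent) / window
            std = variance ** 0.5
            if std > 0 and abs(value - mean) > threshold * std:
                yield i, value
        recent.append(value)  # the deque drops the oldest reading automatically

# Synthetic hour of data: 3600 readings (one per second) with one brief spike.
values = [140 + (i % 2) for i in range(3600)]
values[1800] = 190  # hypothetical transient event mid-hour

alerts = list(rolling_alerts(values))
spot_readings = values[::3600]  # what survives at 1 spot reading per hour

print("Alerts from full stream:", alerts)
print("Hourly spot readings:", spot_readings)
```

In this toy run the full stream surfaces the transient event, while the hourly spot reading never sees it.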

Pacific Northwest Smart Grid Demonstration Project
• Capabilities:
  • Stream Computing: real-time control system
  • Deep Analytics Appliance: analyze massive data sets
• Demonstrates scalability from 100 to 500K homes while retaining 10 years' historical data
• 60K metered customers in 5 states
• Accommodates ad hoc analysis of price fluctuation, energy consumption profiles, risk, fraud detection, grid health, etc.

Hardcore Research

Computational Biology and Healthcare – Groups and Projects
• Computational Biology Center (Watson Research Lab)
  • Comparative Genomics
  • Protein Folding
  • DNA Transistor (nanopore sequencing)
• Healthcare Informatics (Almaden Research Lab) ***
  • AALIM: Advanced Analytics for Information Management
  • The Spatiotemporal Epidemiological Modeler (STEM)
  • Genome-Wide Association Studies for Predictive Healthcare ***
• Healthcare Solutions (Haifa Research Lab)
  • HIV Therapy Prediction (based on virus DNA markers)
  • HYPERGENES (genetics of hypertension)

Bioinformatics on Hadoop: Alignment and Variant Calling
• This DNA sequence analysis workflow is implemented in the academic software Crossbow (Bowtie aligner + SOAPsnp variant caller)
[Figure: MAP step: DNA Reads pass through parallel Aligner instances to produce Aligned Reads, grouped by chromosomal region (Chr 1 ... Chr 22, Chr Y). REDUCE step: a Variation Caller per chromosomal region produces SNPs (Chr 1 SNPs ... Chr 22 SNPs, Chr Y SNPs).]
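The map/reduce split in this workflow can be sketched in plain Python at toy scale: a map step aligns each read and emits it keyed by chromosome, and a reduce step piles up the aligned bases per chromosome and reports positions that disagree with the reference. This is not Crossbow, Bowtie, SOAPsnp or the Hadoop API. The naive aligner, the majority-vote caller, the tiny reference sequences and the reads are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict

# Hypothetical toy reference; real genomes are billions of bases long.
REFERENCE = {"chr1": "ACGTACGTAC", "chr22": "TTGACCGTAA"}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_align(read, reference=REFERENCE, max_mismatches=1):
    """MAP step: emit (chromosome, (position, read)) for the best placement,
    or None if no placement has at most `max_mismatches` mismatches."""
    best = None
    for chrom, seq in reference.items():
        for pos in range(len(seq) - len(read) + 1):
            d = hamming(read, seq[pos:pos + len(read)])
            if d <= max_mismatches and (best is None or d < best[0]):
                best = (d, chrom, pos)
    if best is None:
        return None
    _, chrom, pos = best
    return chrom, (pos, read)

def reduce_call_variants(chrom, alignments, reference=REFERENCE, min_depth=2):
    """REDUCE step: pile up bases per position and report positions where the
    majority base (seen at least `min_depth` times) differs from the reference."""
    pileup = defaultdict(Counter)
    for pos, read in alignments:
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[chrom][pos]:
            snps.append((pos, reference[chrom][pos], base))
    return snps

def run_workflow(reads):
    """Map every read, shuffle by chromosome, then reduce each chromosome."""
    shuffled = defaultdict(list)
    for read in reads:
        hit = map_align(read)
        if hit is not None:
            chrom, alignment = hit
            shuffled[chrom].append(alignment)
    return {c: reduce_call_variants(c, a) for c, a in sorted(shuffled.items())}

# Two chr1 reads carry an A->G change at position 4; "GGGGGG" aligns nowhere.
reads = ["CGTGCG", "GTGCGT", "TTGACC", "TGACCG", "GGGGGG"]
snps_by_chrom = run_workflow(reads)
print(snps_by_chrom)
```

Because each chromosome is reduced independently, the reduce calls can run in parallel, which is the property the Hadoop implementation exploits.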

Health Outcomes and Epidemiology

Health Analytics – GPRD & OMOP Benchmarking
• General Practice Research Database (GPRD)
  • European Clinical Records Database: 11 million patients, 24 years
  • Pharmacovigilance, Drug Utilization and Health Outcomes Analytics
• Observational Medical Outcomes Partnership (OMOP)
  • Shared Models and Methods for Health Outcomes Analytics
• Migration of GPRD & OMOP data models, ETL and SAS programs to Puredata
  • 266 GB of raw data
  • Compressed to 75 GB
  • Puredata 1000-6
  • SAS & SAS Access 9.2

Improved Clinical Insights
Problem / Effect
• Post-launch monitoring of clinical data required the manual integration of data across several data providers
• Differences in data assets did not facilitate integration over time
• Flat-file oriented processing significantly increased complexity of analysis
• Data inquiries performed overnight via manually developed SAS programs
• Pre-formatted data sets (to simplify integration) did not enable access to unique characteristics of different sources
• Significant data duplication at the risk of good data management
Implementation Scope
• Migration of SAS-based flat files to a relational environment in Puredata
• Optimization of existing SAS code to best leverage Puredata in-database functionality
• Integration of clinical data from Premiere and IMS with sales & marketing data from IMS, WK and SDI
Improvement Metrics / Result
• Reduction in time to perform data inquiries and analytics
• Advanced analytic capabilities for clinical analysis
• Ability of end-users to gain immediate benefit without significant retooling or new technologies
• Immediate 10x performance improvement on existing SAS applications with little to no rework of original code
• Leveraging traditional S&M data assets as an early indicator for more detailed investigations with clinical data
• Company data management strategy focused on centralizing data assets on Puredata to improve analytic capabilities

Revolution R - Medication Response Modeling
• Understand the major causes of morbidity and mortality related to inaccurate dosages of medications
  • Relating individual responses to medication with genetic makeup and environmental factors indicated by biomarker measurement
• The Puredata architecture allowed for the re-use of the existing investment in R
  • 20x+ performance improvement over existing HPC infrastructure
  • Business logic of the R code remained intact
  • Speed at which they could interrogate the data allowed them to experiment with many models
• Explored use of in-database analytics (dbLytix)
  • 1000x performance improvement over existing HPC infrastructure

Optum Insight – Predictive Healthcare with Fuzzy Logix
• Predict who is at risk to develop diabetes 6 months, 1 year & 2 years out
  • Provide more advanced intervention services for health plans
  • Based upon medical observations and patient demographics
• In-memory analytic environment was limited to analyzing small sets of patient cohorts with limited computational capability
• "Optum Insight could not do this work without Netezza or Fuzzy Logix"
  • Leveraged 79x more medical observations, processed 150x faster
  • From 150 variables to 1700, with a capacity for 5000+
  • Fast iterative analysis for continuous model improvement
*Improvement estimates are conservative. Models on Puredata leveraged 10x the amount of data while performing several additional calculations that were not feasible with the in-memory solution

Health Insurance Provider – SAS SQL Pass-Thru
• IBM BCU database used for SAS applications
  • 5 years of policyholder data
  • "Nasty" queries: "Decision Tree across Neural Networks by zipcode"
  • Had just undergone 6 months of tuning
• Puredata POC loaded raw data with no optimization
• Testing used SAS/Access for ODBC
  • Will be even faster with SAS/Access for Puredata
• 15x average performance improvement
  • Over 26x on long-running analytics

Catalina Marketing – In-Database Analytics
• 35x improvement in staff productivity
  • Model development reduced from 2+ months to 2 days
  • 10s of models to 100s per year with the same staff
• Increased depth of data per model
  • 150 to 3.2 million features
  • 1 million to 14.5 trillion records per analysis
• Impressive ROI on IT investment
  • 12 million times more data processed per CPU per day
  • Direct correlation of model development to revenue

Catalina – SAS Scoring Accelerator

Harvard Medical School collaboration
What is the Harvard Computational Pharmaco-Epidemiology Program?
• Harvard Medical School faculty selected Puredata for complex pharmaco-epi analytics and studies on drug effectiveness & safety
• 100% of computation run on Puredata
• Faculty are in the Methods Core of FDA Mini-Sentinel & globally esteemed
Why computational pharmaco-epidemiology?
• FDA will be implementing a system to track safety of drugs and devices through active surveillance on tens to hundreds of terabytes of claims data (FDA Mini-Sentinel)
• Pharma companies want to innovate ahead of Sentinel, find new markets and risks
• Payers want to measure their insured population, providers, outcomes, ACO
• Comparative effectiveness research is a top priority for next-generation healthcare
Why is it special?
• These end users have no IT budget and no DBAs, period!
