Exploring Genetic Causes of Disease with Big Data Technologies

Charles Schmitt
Director, Informatics and Data Sciences
Senior Researcher – Data Mining
Renaissance Computing Institute
Searching for the Genetic Causes of Disease with Hadoop (and other big data technologies…)

Who is involved?
Biomedical Informatics Group: Kirk Wilhelmsen, M.D.; Chris Bizon, Ph.D.; Xiaoshu Wang, Ph.D.; Jason Reilly; Phil Owens; Guifeng Jin; Michael Spiegel, Ph.D.; Joshua Salisbury, Ph.D.
Data Sciences Group: Charles Schmitt, Ph.D.; Erik Scott; Nassib Nasser; Keary Cavin; Michael Shoffner
Collaborators: Jonathan Berg, M.D.; Jim Evans, M.D.; Kari North, Ph.D.; Ethan Lange, Ph.D.; Rob Fowler, Ph.D.; UNC HTSF; UNC LCCC; UNC Center for Bioinformatics; UNC ITS RC; RENCI ACIS; UNC IPIT; multiple remote collaboration sites

Human DNA • Dynamic 3-D structure • 23 pairs of chromosomes • Nearly identical copies

Human Genetic Variations
ATCGATCGATCAGACTA__GGGCTAGACTACGATCGATC – reference genome
ATCGATCGGTCAGACTATCGGGCTA__CTACGAGCGCTC – patient maternal
ATCGATCGGTCAGACTATCGGGCTA__CTACGATCGCTC – patient paternal
• SNPs: low millions
• Indels: low 100k
• Structural variations
• ~5-15% of genome is larger structural variants
• (Nature Biotechnology 29, 723–730 (2011))
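The three aligned sequences on this slide can be compared column by column to list the SNPs and indels they contain. The following is a minimal sketch, assuming the `_` characters mark alignment gaps and that positions are 0-based alignment columns; the function name `classify_variants` is hypothetical and this is not how a production variant caller works.

```python
# Sequences copied from the slide; "_" is assumed to be an alignment gap.
REFERENCE = "ATCGATCGATCAGACTA__GGGCTAGACTACGATCGATC"
MATERNAL = "ATCGATCGGTCAGACTATCGGGCTA__CTACGAGCGCTC"
PATERNAL = "ATCGATCGGTCAGACTATCGGGCTA__CTACGATCGCTC"


def classify_variants(reference, sample, gap="_"):
    """Return (kind, alignment_column, bases) tuples for one aligned pair."""
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    i = 0
    while i < len(reference):
        ref_base, sample_base = reference[i], sample[i]
        if ref_base == gap:
            # Gap in the reference: the sample carries extra bases.
            j = i
            while j < len(reference) and reference[j] == gap:
                j += 1
            variants.append(("insertion", i, sample[i:j]))
            i = j
        elif sample_base == gap:
            # Gap in the sample: reference bases are missing from the sample.
            j = i
            while j < len(sample) and sample[j] == gap:
                j += 1
            variants.append(("deletion", i, reference[i:j]))
            i = j
        else:
            if ref_base != sample_base:
                variants.append(("SNP", i, f"{ref_base}>{sample_base}"))
            i += 1
    return variants


maternal_variants = classify_variants(REFERENCE, MATERNAL)
paternal_variants = classify_variants(REFERENCE, PATERNAL)
```

Under these assumptions the two parental copies share both indels and two of the SNPs, while the maternal copy carries one additional SNP.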

Next-Generation Sequencing
(Figure: reads aligned to a genome region containing two exons, at 4x coverage)
• Low coverage/targeted sequencing: cheaper and faster to sequence, less data to store
• But…
• Greater reliance on making statistical inferences
• Different strategies for research and clinical use

Identifying variations • Likely heterozygous: (6 C, 9 G), (7 T, 9 G) • Likely sequencing error: (2 C, 14 T), (1 C, 15 A)
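The read counts on this slide can be turned into a rough genotype call with a simple binomial model. This is a sketch under stated assumptions, not the caller used in the talk: the 1% per-base error rate, the three-genotype model, and the names `genotype_likelihoods` and `call_genotype` are all illustrative.

```python
from math import comb


def genotype_likelihoods(minor_count, major_count, error_rate=0.01):
    """Binomial likelihood of the observed counts under three genotypes.

    error_rate is an assumed per-base sequencing error probability.
    """
    n = minor_count + major_count

    def binom(p_minor):
        # Probability of seeing minor_count minor-allele reads out of n.
        return comb(n, minor_count) * p_minor**minor_count * (1 - p_minor)**major_count

    return {
        "hom_major": binom(error_rate),      # minor reads are errors
        "het": binom(0.5),                   # both alleles present
        "hom_minor": binom(1 - error_rate),  # major reads are errors
    }


def call_genotype(minor_count, major_count, error_rate=0.01):
    """Return the genotype label with the highest likelihood."""
    likelihoods = genotype_likelihoods(minor_count, major_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


calls = {
    "6 C, 9 G": call_genotype(6, 9),
    "7 T, 9 G": call_genotype(7, 9),
    "2 C, 14 T": call_genotype(2, 14),
    "1 C, 15 A": call_genotype(1, 15),
}
```

With these assumed parameters the balanced counts come out as heterozygous and the lopsided counts as homozygous for the major allele, which matches the slide's "likely sequencing error" labels. Real callers also weigh base quality, mapping quality, and population priors.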

Identifying variations • 2 homozygous SNPs • Unclear: (6 C, 14 T)

Identifying variations • The CTT deletion (deltaF508) is the most common cause of cystic fibrosis

Clinical Binning – the critical information Slide provided by Jim Evans, M.D., Ph.D., Department of Genetics, UNC-CH

The promise of genetics requires a greater understanding of the underlying structure of the data

Computing on the Genome: Imputation
ATCGATCGATCAG – reference
ATCGGTCGATCAG – patient
 TCGGTNNNTCAG
    GTCGGTCAG
ATCGGTCGGTCA
ATCGGTCGGTC
It's unclear if this patient is A/A, A/G, or G/G

Computing on the Genome: Imputation
Population Evidence
ATCGATCGATCAG – reference
ATCGGTCGATCAG – patient
 TCGGTNNNTCAG
    GTCGGTCAG
ATCGGTCGGTCA
ATCGGTCGGTC
ATCGGTCGGTCAG – patient 2
ATCGGTCGGTCAG – patient 3
ATCGGTCGGTCAG – patient 4
ATCGGTCGGTCAG – patient 5
ATCGATCGATCAG – patient 6
ATCGATCGATCAG – patient 7
ATCGATCGATCAG – patient 8
Infer the patient is homozygous G/G
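The population-evidence reasoning on this slide can be sketched as a conditional frequency lookup: among the other patients, which allele at the uncertain position travels with the allele the patient clearly carries nearby? This is a deliberately simplified stand-in for the Hidden Markov Model approach, under the assumptions that each listed patient contributes one haplotype, that the well-covered "tag" site is 0-based column 4, and that the uncertain site is column 8. The name `impute_allele` is hypothetical.

```python
from collections import Counter

# Sequences for patients 2-8, copied from the slide and treated as a panel.
PANEL = ["ATCGGTCGGTCAG"] * 4 + ["ATCGATCGATCAG"] * 3


def impute_allele(panel, tag_pos, tag_allele, target_pos):
    """Allele frequencies at target_pos among haplotypes matching the tag."""
    matching = [hap for hap in panel if hap[tag_pos] == tag_allele]
    if not matching:
        return {}  # no population evidence for this tag allele
    counts = Counter(hap[target_pos] for hap in matching)
    return {allele: n / len(matching) for allele, n in counts.items()}


# The patient's reads consistently show G at column 4.
target_given_g = impute_allele(PANEL, tag_pos=4, tag_allele="G", target_pos=8)
target_given_a = impute_allele(PANEL, tag_pos=4, tag_allele="A", target_pos=8)
```

In this panel every haplotype with G at the tag site also has G at the target site, which is the correlation behind inferring G/G. An HMM generalizes this by modeling many sites and uncertain matches at once.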

Computing on the Genome: Imputation
Population Evidence
ATCGATCGATCAG – reference
ATCGGTCGATCAG – patient
 TCGGTNNNTCAG
    GTCGGTCAG
ATCGGTCGGTCA
ATCGGTCGGTC
ATCGGTCGGTCAG – patient 2
ATCGGTCGGTCAG – patient 3
ATCGGTCGGTCAG – patient 4
ATCGGTCGGTCAG – patient 5
ATCGATCGATCAG – patient 6
ATCGATCGATCAG – patient 7
ATCGATCGATCAG – patient 8
• Hidden Markov Models for cross-genome statistical correlations (Thunder*)
• Imputation on 708 samples takes over 200,000 CPU hours to complete, or 22 CPU years
• How many samples do we need to impute on rare variants?
* Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011 Jun;21(6):940-51.
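As a quick check of the compute figures on this slide, the conversion from CPU hours to CPU years and the average cost per sample can be worked out directly. The per-sample number is only a mean over this one run; it does not imply that the cost scales linearly with sample count.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

cpu_hours = 200_000  # lower bound quoted on the slide
samples = 708

cpu_years = cpu_hours / HOURS_PER_YEAR      # a little under 23 CPU years
cpu_hours_per_sample = cpu_hours / samples  # a little over 280 CPU hours
```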

Convergent Haplotype Association Tagging (CHAT) • Identifying moderately penetrant mutations from cross-population genetic structures • Developed by Kirk Wilhelmsen

Using Graph Theory in CHAT (Figure: graph with nodes 1–5 and groups A, B, C)

The discovered CHAT is 2,800 SNPs in length and spans 26 Mb

The promise of genetics requires better approaches to store and analyze large data

The cost of storing 100,000 genomes
• 10 PB = full human genomes at low coverage (1)
• 2 PB = human exomes at medium coverage (2)
• Or: $5 to $25 million for UNC Health Care System
  • Every patient’s genome once on enterprise data storage
  • Not including archived copies, not including analysis data sets
• Or: $15 to $75 billion for the US to store every patient’s genome once
• Cost of disk space alone, not including refresh of equipment
(1) Empirical data, assuming ~100 GB per sample for compressed FASTQ, BAM, VCF, and ancillary data files at coverage between 3x and 15x
(2) Empirical data, assuming ~20 GB per sample at around 30x, storing only compressed FASTQ and BAM files
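The storage totals on this slide follow from the per-sample sizes, and the US estimate is a straight scale-up of the UNC estimate. The sketch below reproduces that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and assuming the UNC cost range corresponds to the 100,000 genomes in the slide title; the implied US genome count is back-calculated, not stated in the talk.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)


def storage_pb(genomes, gb_per_genome):
    """Total storage in petabytes for a number of genomes."""
    return genomes * gb_per_genome / GB_PER_PB


GENOMES = 100_000
low_coverage_pb = storage_pb(GENOMES, 100)  # full genomes, footnote (1)
exome_pb = storage_pb(GENOMES, 20)          # exomes, footnote (2)

# Cost ranges quoted on the slide, in US dollars.
unc_cost_usd = (5e6, 25e6)
us_cost_usd = (15e9, 75e9)

# Ratio between the national and the UNC estimates at each end of the range.
scale = tuple(us / unc for us, unc in zip(us_cost_usd, unc_cost_usd))
implied_us_genomes = GENOMES * scale[0]
```

Both ends of the range scale by the same factor, so the national figure amounts to storing about 300 million genomes at the same unit cost.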

There is more clinical genetic data … … gene expression (RNA-seq) … per-tissue data … time-series data … the personal microbiome (Image courtesy of NIH via WikiCommons)

An Informatics Ecosystem for Clinical Genomics • At ~8K genomes, will scale to ~10-20K genomes • Need to scale to 100,000-200,000+ genomes

High Performance Computing (HPC)
Leverages: traditional bioinformatics tools, traditional HPC workflow systems
Computing
• KillDevil (ITS RC)
  • 706 traditional, GPU-based, and large-memory compute nodes
• BlueRidge (RENCI)
  • 204 traditional, GPU-based, and large-memory compute nodes
• Croatan (RENCI)
  • 30-node big-data configuration with 1 PB spinning disk
• Topsail (UNC Genomics)
  • 400 traditional compute nodes
• Kure (ITS RC)
  • 220 traditional and large-memory compute nodes
• Open Science Grid
  • Distributed cycle-scavenging grid across research institutions
• TeraGrid
  • National HPC grid
Storage
• PB+ Dell/Isilon system at UNC
• PB+ DDN/NetApp/Dell systems at RENCI

Aggregating genomic knowledge
• Leverages strengths of RDBMS in structured knowledge representation
• VarDB: several-TB database
  • Reference genomes
  • Canonical variants
  • Annotations
  • Indexes
• AnnoBot: automated query system to update VarDB
(Diagram labels: NCBI RefSeq, dbSNP, PolyPhen, HGMD (commercial), annotations of clinical variations, protein effects, other tools…, other databases, VarDB)

HadoopVCF Example: Allele Frequency • Variant data file 1 • Variant data file 2 • HadoopVCF developed by Chris Bizon

HadoopVCF Example: Generalized • Each file holds different data for different samples and locations. (Figure labels: Samples; Genomic Variants, Genomic Loci)

Hadoop: Generalized algorithm
• Mapper
  • Key = subset of sample and loci
  • Value = intermediate sums
• Reducer
  • Calculation over intermediate sums
  • Allele frequencies, % missing, HWE p-values, …
• Hadoop Distributed Cache
  • Context from VCF headers and/or RDBMS for each mapper
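The mapper/reducer split described on this slide can be illustrated with a small in-process simulation. This is a sketch in plain Python, not the HadoopVCF implementation and not the Hadoop API: the record layout `(chrom, pos, genotypes)`, the choice of locus as the key, and the names `mapper`, `reducer`, and `run_job` are all assumptions made for illustration. HWE p-values and the distributed cache are left out.

```python
from collections import defaultdict


def mapper(record):
    """Emit (locus, intermediate sums) for one variant record from one file."""
    chrom, pos, genotypes = record  # genotypes look like "0/1", "1|1", "./."
    alt = called = missing = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1  # genotype not called for this sample
            continue
        called += len(alleles)
        alt += sum(1 for a in alleles if a != "0")
    yield (chrom, pos), (alt, called, missing, len(genotypes))


def reducer(key, values):
    """Combine intermediate sums for one locus into summary statistics."""
    alt = called = missing = samples = 0
    for a, c, m, n in values:
        alt, called, missing, samples = alt + a, called + c, missing + m, samples + n
    return key, {
        "allele_frequency": alt / called if called else None,
        "pct_missing": 100.0 * missing / samples if samples else None,
    }


def run_job(files):
    """Simulate map, shuffle-by-key, and reduce over several variant files."""
    shuffled = defaultdict(list)
    for records in files:
        for record in records:
            for key, value in mapper(record):
                shuffled[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


# Two hypothetical variant files holding different samples at the same locus.
file_1 = [("chr1", 100, ["0/0", "0/1", "./."])]
file_2 = [("chr1", 100, ["1/1", "0/1"])]
stats = run_job([file_1, file_2])
```

Because the mapper emits sums rather than raw genotypes, the reducer only has to add small tuples, which is what lets the calculation spread across files that each hold different samples and loci.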

Why Hadoop?
• Scalability for certain genome analysis patterns
• Challenges:
  • Other analysis patterns: Hidden Markov Models, permutation testing, haplotype blocks, graphs, hierarchical graphs?
  • Shared resources
    • Running on scheduled HPC clusters
    • Running on centralized HP storage system + local disks
    • Moving data to and from the worker nodes

Managing an R&D ecosystem with big data
(Diagram: External Partner Resources, Open Science Grid, TeraGrid, Lab Machines, UNC Storage (Tape, Drives), RENCI Storage (Tape, Drives), Genomics Storage, Genomics HPC, RENCI HPC, IT Machines, UNC HPC, Clouds, RENCI Hadoop, Genomics Hadoop; users: Analysts, Automated Processes, Developers, IT Staff, External Partners, Data Providers; label: Wild West)
• Research versus production use
• Life cycle: control increases over time as
  • Work scope increases
  • Expertise and technology mature
  • Risk increases
  • Number of groups touching data increases

iRODS Data Virtualization
• User client views and manages data
• Data grid user sees a single “virtual collection”
(Diagram: /cuahsi/catalog, /cuahsi/modeling, and /cuahsi/terrain appear as one collection; /cuahsi/terrain is held at SDSC, /cuahsi/modeling at RENCI, and /cuahsi/catalog at Utah State Univ)
The iRODS Data Grid installs in a “layer” over storage systems, so you can view, manage, access, add, and share part or all of your data and metadata in a unified Collection.

Managing an R&D ecosystem with big data
(Diagram: the Integrated Rules-Oriented Data System (iRODS) sits between the resources and the users. Resources: External Partners, Open Science Grid, TeraGrid, Lab Machines, UNC Storage (Tape, Drives), RENCI Storage (Tape, Drives), Genomics Storage, Genomics HPC, RENCI HPC, IT Machines, UNC HPC, Clouds, RENCI Hadoop, Genomics Hadoop. Interfaces: POSIX, DDN WOS, RDBMS, Web services, NFS, Hadoop. Services: Data Services, Programmatic APIs, Data Workflows, iRODS Clients. Users: Analysts, Automated Processes, Developers, IT Staff, External Partners, Data Providers)
• Control over:
  • Data movement and replication
  • Metadata standards
  • Archival, deletion, and retention
  • Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• …, all policy driven
• …, without breaking the in-place systems

Thank You
