How sample size, diversity, and target genes affect implementation and discovery of datasets • Genomics WG Chairs • June 21, 2019
Goals of the Genomics WG • eMERGE has produced a number of GWAS with nearly significant hits or significant hits that require validation/replication. The Genomics workgroup will: • Coordinate further analysis of these datasets utilizing HRC imputation of the eMERGE II data • Coordinate integration of GWAS from two new sites • Identify datasets that can either be bolstered or replicated by existing data at new eMERGE sites and facilitate exchange of data • Interact with the CC and SC to identify and test possible QC and analysis pipelines for rare variant association testing • Determine whether preexisting sequencing standards are appropriate for the genes sequenced in the eMERGE III cohort • In conjunction with the Phenotyping working group, the Genomics workgroup will: • Identify/compile existing phenotype data • Create/maximize a central, highly detailed database of what data exists • Systematically evaluate where data can be enhanced • Prioritize data points that would be most powerful for both eMERGE II and eMERGE III data • Implement processes to procure highest-priority data and hasten experimental progress • Update/overhaul SPHINX to meet the broader needs of eMERGE III • Identify tools that need to be built for or included in DNAnexus • Determine tools/metrics for functional annotation of variants • Include structural variants in the final output
Biggest lesson – genomics is wildly successful with large, rich datasets!
Good things come to those who wait! • Common variables for all data across the Network create rich data! • This dataset is a shining example of what Network-wide data should look like. • We have created a system that is usable across all sites with ease.
Genome-wide genotyping creates a richer discovery dataset • DISCOVERY – • Already know the phenotype – want to find causative genes • GWAS identify new genes for study; findings cannot (yet) be returned to subjects • Require extensive phenotyping algorithms • 19 phenotypes implemented across the Network
Panel-based sequencing creates a richer implementation dataset • IMPLEMENTATION – • Select genes that we know and learn about the phenotype • Penetrance • ROR • CNV analysis • Somatic mosaicism • Not likely to include new gene discovery • 19 phenotypes implemented across the Network
Panel-based sequencing for discovery • We should have been more thoughtful about what our discovery hypotheses were for the eMERGESeq data • Need to be vigilant about goals • Goals fall on a spectrum: find new genes at one end, report variants at the other, with "the middle" in between
Penetrance – network-wide discovery project for eMERGESeq data • Penetrance project in conjunction with the Clinical Annotation group • Cannot calculate penetrance without both sequencing/adjudication and phenotype information on individuals with pathogenic variants • Time and resource consuming • Would it have made sense to start phenotype/outcome data gathering as sequencing came online?
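As a rough sketch of what that calculation needs, assuming a carrier table that already combines adjudicated sequencing results with a phenotype flag (the file layout and column names here are hypothetical):

```python
# Minimal penetrance sketch: penetrance = affected carriers / all carriers.
# Assumes a hypothetical tab-delimited carrier table with columns
# subject_id, gene, adjudication (P/LP/VUS/...), has_phenotype (0/1).
import csv
from collections import defaultdict

def penetrance_by_gene(carrier_table_path):
    carriers = defaultdict(int)   # gene -> number of P/LP carriers
    affected = defaultdict(int)   # gene -> carriers with the phenotype
    with open(carrier_table_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["adjudication"] not in ("P", "LP"):
                continue  # only pathogenic/likely pathogenic variants count
            carriers[row["gene"]] += 1
            affected[row["gene"]] += int(row["has_phenotype"])
    return {gene: affected[gene] / carriers[gene] for gene in carriers}

if __name__ == "__main__":
    print(penetrance_by_gene("emergeseq_carriers.tsv"))
```

The sketch makes the slide's point concrete: without both adjudicated carriers and phenotype/outcome data for those same individuals, neither count in the ratio can be filled in.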
Adding additional data or making changes in analysis is costly! • 2016 – eI and eII legacy imputed array data were merged with pre-eIII array data subjects • ~73K imputed array samples available, IBD and PCA complete • 2017 – Created new imputed dataset using MIS/HRC • ~80K imputed array samples available (no PCA/IBD yet) • 2019 – Published imputed dataset • ~84K • 2019 – Added additional samples • ~105K imputed array samples
Adding additional data or making changes in analysis is costly! • Adding additional samples for the array data led to delays in analysis • For eMERGESeq, we created a data freeze at 15k samples so that work could begin • Data freezes provide a platform to begin work and also to identify any problems with the dataset (production or analysis) early in the process • Allows the group to be more nimble and adjust to changing priorities more quickly
NHGRI – Push for Diversity • A dataset with over 100k genotyped and phenotyped individuals is impressive! • Fine line between waiting for samples to improve diversity or sample size and just running what you have…especially if additional samples are not powered for phenotypes of interest. • Keep trying!! • Select subjects based on phenotype? Would require phenotyping algorithms very early in the process.
PGRNSeq lessons learned • Supplement at end of eII • Excitement over eIII meant that PGRNSeq data was abandoned • Need to stay excited about rich datasets!
A lesson in herding cats – Dealing with big projects • eMERGE's greatest genetic strength and weakness is the size of the data • This project was hampered by delays due to adding more and more data • Data is great – redoing large-scale analysis is not • We need to, as a group, better prioritize when size takes precedence over speed/efficiency • It is hard to be nimble when there is so much of everything: different data types, many people, and shifting priorities
A lesson in herding cats – plant catnip! • Investigators are busy and are distracted by many projects • Be clear about the goals for particular meetings/calls/questionnaires • Limit time for comments/replies if group consensus is required • Make participation easy • Keep surveys short • Set and reset working group short-term priorities as needed
Lessons Learned and Earned: 1. Compute Time with Big Genomic Data 2. Naming Conventions and IDs • Ian Byrell Stanaway, Ph.D., Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA
Lesson 1: Compute Time with Big Genomic Data
Lesson 1: Compute Time with Big Genomic Data 4 main processing scenarios: 1. Initial processing to "Make the Data" 2. People have late-coming samples and want to "Add to the Data" 3. People have withdrawn samples and want to "Remove from the Data" 4. Quality control processing: the bigger the data, the slower the calculations
Initial processing to "Make the Data" 1. Imputed 83,717 samples. Just the array preprocessing and imputation: ~6 months to impute and merge 78 array batches; longest impute time of the biggest batch ~96 hours 2. PGx: 9,010 samples with targeted variants in ~100 genes. Alignment time: ~2 weeks; calling multisample variants: ~1 week 3. eMERGEseq: 24,956 samples with targeted variants in ~100 genes. Alignment time: ~1 month; calling the multisample variants: ~2 weeks
Lesson 1: Compute Time with Big Genomic Data 4 main processing scenarios: 1. Initial processing to "Make the Data" 2. People have late-coming samples and want to "Add to the Data" 3. People have withdrawn samples and want to "Remove from the Data" 4. Quality control processing: the bigger the data, the slower the calculations
2. People have late-coming samples and want to "Add to the Data" 1. ~1 day on our cluster to pre-process the array batch for imputation 2. Upload to the Michigan Imputation Server: wait time of 2-7 days for your data to reach the top of the queue and start processing, then 24 to 96 hours for the imputation to finish; the server only allows 3 imputation batches per login 3. ~9 days to rewrite the new merged VCFs 4. Must now redo all the QC (scenario 4)
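To illustrate the merge step in this scenario, a minimal sketch assuming bcftools and tabix are available and the per-chromosome VCFs are bgzipped and indexed; all file names are placeholders, and proper site/allele harmonization of the imputed batches is elided:

```python
# Sketch: merge a newly imputed batch into the existing per-chromosome VCFs.
# Assumes bcftools/tabix are installed, both inputs are bgzipped and indexed,
# and the new batch contains a disjoint sample set; paths are placeholders.
import subprocess

def merge_new_batch(existing_vcf, new_batch_vcf, out_vcf):
    # bcftools merge combines the two sample sets into one multi-sample VCF
    subprocess.run(
        ["bcftools", "merge", existing_vcf, new_batch_vcf, "-Oz", "-o", out_vcf],
        check=True,
    )
    subprocess.run(["tabix", "-p", "vcf", out_vcf], check=True)

for chrom in [str(c) for c in range(1, 23)] + ["X"]:
    merge_new_batch(
        f"imputed.chr{chrom}.vcf.gz",
        f"new_batch.chr{chrom}.vcf.gz",
        f"imputed_plus_new.chr{chrom}.vcf.gz",
    )
```

The command itself is short; the ~9 days comes from re-writing tens of millions of variant records across ~100k samples for every chromosome.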
3. People have withdrawn samples and want to "Remove from the Data" 1. To remove samples: re-writing the by-chromosome merged VCFs for 105,108 human subjects with ~40 million variants (log: chr1 remove IDs started Mon Apr 22 13:30:45 PDT 2019, finished Tue Apr 30 20:07:49 PDT 2019): ~9 days of compute time! 2. Must now redo all the QC (scenario 4)
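A minimal sketch of the removal step, assuming bcftools and placeholder file names; the cost comes from re-writing every variant record, not from the command itself:

```python
# Sketch: drop withdrawn subjects from each per-chromosome merged VCF.
# Assumes bcftools is installed; withdrawn_ids.txt has one sample ID per line
# and file names are placeholders. The "^" prefix tells bcftools view -S to
# EXCLUDE the listed samples rather than keep them.
import subprocess

def remove_samples(in_vcf, ids_file, out_vcf):
    subprocess.run(
        ["bcftools", "view", "-S", f"^{ids_file}", "--force-samples",
         "-Oz", "-o", out_vcf, in_vcf],
        check=True,
    )

for chrom in [str(c) for c in range(1, 23)] + ["X"]:
    remove_samples(f"merged.chr{chrom}.vcf.gz", "withdrawn_ids.txt",
                   f"merged_removed.chr{chrom}.vcf.gz")
```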
4. Quality Control Processing: the bigger the data, the slower the calculations 1. Frequency and LD pruning: ~5 days 2. PCA (after merging the by-chromosome VCFs): ~1/2 day 3. Must repeat steps 1 and 2 for each ancestry (European, African, Asian, and Hispanic) after the initial PCA with k-means 4. IBD: start time Tue Jun 4 08:30:56 2019, end time Mon Jun 17 08:39:57 2019: 14 days doing the IBD calculation for ~5 billion pairwise IBDs. It took 4 days to plot with the 83,717 samples; we will see with the 105,108.
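A sketch of these QC steps as PLINK 1.9 commands driven from Python, with illustrative thresholds and placeholder file names:

```python
# Sketch of the QC steps above, assuming PLINK 1.9 and a merged binary
# fileset named "merged"; thresholds and file names are illustrative only.
import subprocess

def plink(args):
    subprocess.run(["plink"] + args, check=True)

# 1. Frequency filter and LD pruning
plink(["--bfile", "merged", "--maf", "0.01",
       "--indep-pairwise", "50", "5", "0.2", "--out", "pruned"])
plink(["--bfile", "merged", "--extract", "pruned.prune.in",
       "--make-bed", "--out", "merged_pruned"])

# 2. PCA on the pruned set; ancestry groups would then be split by k-means
#    on the PCs and steps 1-2 repeated within each ancestry
plink(["--bfile", "merged_pruned", "--pca", "10", "--out", "merged_pca"])

# 3. Pairwise IBD estimation: O(n^2) sample pairs, hence ~5 billion
#    comparisons at ~100k samples
plink(["--bfile", "merged_pruned", "--genome", "--min", "0.125",
       "--out", "merged_ibd"])
```

The quadratic growth of the IBD step is why the calculation jumps from days to weeks as the sample count climbs from ~84k toward ~105k.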
Lesson 1: Compute Time with Big Genomic Data • Summary: • If we decide to change the imputed data, it takes at minimum about 1.5 months to get the genetic data files put together again with QC done. • Planning a versioning release schedule with data freezes would minimize re-calculation, re-writing, and redoing of QC. • We cannot finish genetic QC until the consent and demographics files have been submitted to the Coordinating Center, so we can have a final consented sample ID list inspection and then release. • It would be better to have the consent, EHR information, and demographics available when we first get the genetic data, to facilitate quality control.
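One way to make the versioning/data-freeze idea concrete is a simple manifest that pins a checksum to every file in a release, so a new freeze is an explicit, versioned event rather than a silent re-write; the sketch below assumes hypothetical file names and manifest layout:

```python
# Minimal data-freeze manifest sketch: record an md5 per released file.
# File names and the manifest layout are hypothetical.
import hashlib
import json
from pathlib import Path

def md5sum(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(freeze_name, files, out_path):
    manifest = {"freeze": freeze_name,
                "files": {str(p): md5sum(p) for p in files}}
    Path(out_path).write_text(json.dumps(manifest, indent=2))

if __name__ == "__main__":
    vcfs = sorted(Path(".").glob("imputed.chr*.vcf.gz"))
    write_manifest("imputed_freeze_v3_105k", vcfs, "freeze_v3_manifest.json")
```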
Lesson 2: Naming Conventions and IDs
Variability in eMERGE and Center IDs for the Imputation Genotype Files

The ideal convention (genotype-file ID → eMERGE ID):
81001827_81001827 → 81001827
81012488_81012488 → 81012488
81012238_81012238 → 81012238
81012152_81012152 → 81012152
101070300055_R01C01_68105537 → 68105537
101070300055_R02C01_68100287 → 68100287
101070300055_R04C01_68107400 → 68107400
101070300055_R05C01_68104102 → 68104102
R0351_49633471 → 49633471
R0356_49660134 → 49660134
R0357_49668863 → 49668863
R0362_49206654 → 49206654

What we find in the actual files:
0_Axiom_95151054 → 95151054
0_Axiom_95669051 → 95669051
0_Axiom_95786244 → 95786244
0_Axiom_95957950 → 95957950
0_95174361 → 95174361
0_95224407 → 95224407
0_95559547 → 95559547
0_95583106 → 95583106
0_6228971148_R04C02 → 95521829
0_6092196066_R05C02 → 95415145
0_6190757002_R01C01 → 95996617
0_6996021247_R06C01 → 95370108
38275352_8007 → 38275352
38011665_8008 → 38011665
38996507_8009 → 38996507
38406228_8010 → 38406228
3054402_3054402 → 81017514
3071443_3071443 → 81017571
3056293_3056293 → 81017358
3070935_3070935 → 81017406
D27222970-01_D27222970-01 → 27222970
D27293320-01_D27293320-01 → 27293320
D27265530-01_D27265530-01 → 27265530
D27210650-01_D27210650-01 → 27210650
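A sketch of how such IDs might be collapsed back to bare eMERGE IDs; the patterns are guesses from the examples above, and IDs that encode only chip/well positions would still require a site-supplied lookup table:

```python
# Sketch: normalize genotype-file sample IDs to bare eMERGE IDs.
# The regex heuristics are guesses from the examples on this slide; IDs that
# encode only chip/well positions (e.g. 0_6228971148_R04C02) cannot be
# recovered from the string alone and need a site-supplied lookup table.
import re

def normalize_id(raw_id, site_lookup=None):
    # e.g. "D27222970-01_D27222970-01" -> "27222970"
    m = re.search(r"D(\d{8})-\d+", raw_id)
    if m:
        return m.group(1)
    # e.g. "81001827_81001827", "0_Axiom_95151054", "38275352_8007"
    for token in raw_id.split("_"):
        if re.fullmatch(r"\d{8}", token):
            return token
    # fall back to an explicit mapping maintained by the site / CC
    if site_lookup and raw_id in site_lookup:
        return site_lookup[raw_id]
    raise ValueError(f"cannot resolve eMERGE ID for {raw_id!r}")

print(normalize_id("0_Axiom_95151054"))           # 95151054
print(normalize_id("D27222970-01_D27222970-01"))  # 27222970
```

The need for per-site heuristics and lookup tables is exactly why the summary below recommends assigning eMERGE IDs centrally at sample intake.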
Lesson 2: • Naming Conventions and IDs • Summary: • It is best to have systematic naming conventions used by all sites and both sequencing centers. • Going forward, eMERGE may benefit from having the Coordinating Center assign eMERGE IDs to Site IDs at sample intake.
Genomics Working Group Panel eMERGE Commons and Genomic Analysis Human Genome Sequencing Center Baylor College of Medicine eMERGE Centralized Sequencing & Genotyping (CSG) Facility
Agenda • eMERGE Cloud Commons • Structure & Data • Genomic Analysis Project - SV Calling • Lessons Learned • Next Steps
Need for eMERGE Cloud Data Commons • Manage complex ecosystem • Unstructured “ftp-like” file serving not ideal • Security/Compliance is critical • Common Access Storage & Compute required • Analysis platform required • Optimized processing required
Features of eMERGE Data Commons • Common Data Repository • Storage • Portal access • PHI/non-PHI data separation • Security & permissioning • Billing & accounts mgmt • Ease of data transfers • Tracking & Metrics • Analysis Data Platform • Data mining & research • Cohort identification • Workflow builder • API • Storage & Compute
Data in Commons • PHI Partition • Accessioning & Reporting Portal - Clinical Reports/Related Files • HIPAA Compliant • Access by Clinical Site • Data Transfer Methods • Non-PHI Partition • Raw Data, e.g. BAMs, VCFs, De-identified Reports, EXCID • Access for all Clinical Sites • Both BCM & Broad/LMM data • Analysis Tools/Results
Data in Commons - PHI Partition • BCM approved reports delivered to clinical-site-specific projects: CHOP, eDAP, NEPTUNE, Northwestern University, Mayo Clinic, Meharry Medical College, Columbia University, Vanderbilt University, Marshfield • Clinical site reports provided as PDF, XML, JSON, and HTML
Data in Commons - Non-PHI Partition • Broad/LMM data (BAM, VCF, de-identified reports): Columbia, Marshfield, Mayo Clinic, Meharry Medical College, Northwestern University, CHOP, Vanderbilt University • BCM data (BAM, VCF, annotated VCF, coverage, de-identified reports): University of Washington, Geisinger, Group Health, CCH, BWH • BCM raw data and Broad raw data both land in the Commons under BCM-sponsored projects
eMERGE Commons Data Analysis Platform • Analysis Tools Available Include: • eCAP (eMERGE Commons Access Portal) • Multiple Variant Callers • Mosaicism • SV Calling with Parliament 2
Structural Variation with Parliament 2 • Aims • Increase Resolution & Sensitivity of CNV Calls • Include Population Frequency Annotation to CNV Calls • Aid with Interpretation and Pathogenicity Assignment • Identify Novel Copy Number Variants in Clinically Relevant Genes • Share Results with eMERGE Consortium • Parliament 2: https://tinyurl.com/y3lvxfgj • Fritz Sedlazeck, Eric Venner
Structural Variation with Parliament 2 • CPU hours: 115,400 hrs • Storage: 26.3 TB • Reference: GRCh38 • BCM samples (14,745): bz2 FASTQ → BWA (raw BAM) → GATK (recal, realign); 16 cores, ~35 min/sample • Broad samples (10,620): revert GRCh37 FASTQs → BWA (raw BAM) → GATK (recal, realign) • Parliament 2 input: BAM in GRCh38 plus reference; ~30-35 per sample • Outcome files from: Breakdancer, Breakseq2, CNVnator, Delly2, Manta, Lumpy, SURVIVOR
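This is not Parliament 2 or SURVIVOR themselves, but a toy illustration of the consensus step they perform: calls from different callers are clustered when their breakpoints fall within a distance window, and only clusters supported by multiple callers are kept (thresholds illustrative):

```python
# Toy consensus merge across SV callers (a simplified stand-in for what
# SURVIVOR does inside Parliament 2). Calls are (chrom, start, end, svtype,
# caller) tuples; thresholds are illustrative only.
MAX_BREAKPOINT_DIST = 1000
MIN_SUPPORTING_CALLERS = 2

def merge_calls(calls):
    calls = sorted(calls, key=lambda c: (c[0], c[3], c[1]))
    clusters, current = [], []
    for call in calls:
        if current and (call[0] != current[-1][0] or call[3] != current[-1][3]
                        or call[1] - current[-1][1] > MAX_BREAKPOINT_DIST):
            clusters.append(current)
            current = []
        current.append(call)
    if current:
        clusters.append(current)
    # keep clusters supported by enough distinct callers
    return [c for c in clusters
            if len({call[4] for call in c}) >= MIN_SUPPORTING_CALLERS]

calls = [
    ("chr17", 43_050_000, 43_060_000, "DEL", "manta"),
    ("chr17", 43_050_200, 43_060_100, "DEL", "delly2"),
    ("chr17", 10_000_000, 10_001_000, "DUP", "cnvnator"),
]
print(merge_calls(calls))  # only the chr17 DEL cluster has >= 2 callers
```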
SV Calling Results - First Preliminary Freeze • Single-sample calling completed for up to 10,000 samples and still running • First preliminary freeze of 2,000 samples
SV Calling Results - CCDG Dataset Sample Comparison • CNV calls follow the expected allele distribution • AF 0.01: 2,030 • AF 0.05: 349 • Overlap with CCDG F1 (~20k samples) • CNV: 1,067 • DEL: 457 • DUP: 558
SV Calling Results - CNVs Across Genes Of Interest • Genes with CNVs: • BRCA1 • BRCA2 • MLH1 • MSH2 • MSH6 • PMS2
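A sketch of how CNV calls could be flagged against a clinically relevant gene list; the gene coordinates below are rough placeholders and would in practice come from the panel's GRCh38 BED file:

```python
# Sketch: flag CNV calls that overlap genes of interest.
# Gene intervals are approximate placeholders; in practice they would be
# read from the eMERGEseq panel BED file (GRCh38 coordinates).
GENES_OF_INTEREST = {
    "BRCA1": ("chr17", 43_044_000, 43_126_000),
    "BRCA2": ("chr13", 32_315_000, 32_400_000),
    "MLH1":  ("chr3", 36_993_000, 37_050_000),
}

def overlapping_genes(cnv_chrom, cnv_start, cnv_end):
    hits = []
    for gene, (chrom, start, end) in GENES_OF_INTEREST.items():
        # half-open interval overlap test
        if cnv_chrom == chrom and cnv_start < end and cnv_end > start:
            hits.append(gene)
    return hits

# Example: a deletion call spanning part of BRCA1
print(overlapping_genes("chr17", 43_050_000, 43_060_000))  # ['BRCA1']
```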
Lessons Learned • Need for common cloud platform • Security/Compliance • Process Optimization and Change Management • Ease of access for different target audiences • Integration/Interoperability with other cloud providers/local systems
Next Steps • SV Calling Results • eMERGE Data End of Phase Discussion • Integration & Interoperability with other cloud providers & local systems • AnVIL integration
Acknowledgements • Richard Gibbs, Eric Boerwinkle, Fritz Sedlazeck, Donna Muzny, Jianhong Hu, Christie Kovar, Viktoriya Korchina, Eric Venner, Victoria Yi, Tsung-Jung Wu, Liwen Wang, John Didion • FUNDING: NIH/NHGRI