Future directions in genetic epidemiology: impact on IT and data requirements. Nic Timpson, MRC CAiTE Centre, Department of Social Medicine
“Step change” Larsen
Why? • Technology • Paradigm shift • Genomic properties EUCCONET Data Management Workshop
Clinical meaning ??????? Raw data
Two of the driving technologies: • Chip-based genotyping • Next Generation Sequencing (NGS)
Basic flat Illumina output…
Derivation of flat file data from image-based intensity reads:
HAPMAP - Illumina - Affymetrix. Files: CHR16_HAPMAP.recode.ped, CHR16_HAPMAP.recode.map, red_test_run_assoc.txt, genetic_map_chr16.txt
[Figure: NOD2 Crohn’s association; x-axis: Position (Mb)]
Consequent shifting budgets… [Figure: cost per genome, with values ~$5+ billion, ~$70 million, ~$1 million and ~$60,000 (labels: Venter & Watson, HGP, NGS); data volume in bytes, based on n~5000, with values ~2 MB, ~10 GB and ~20 TB across five stages: 1- Candidate, 2- CHIP (designer), 3- Affy 500, 4- Intensity data, 5- NGS data (*LC)]
Based on the storage of re-sequence data, one can estimate the storage requirements for a next generation sequencing effort. Assume a storage cost of about 1.5 bytes per bp of sequence reads at low coverage: ~2000 samples (as per UK10K, for example) x 3 billion bp x 1.5 bytes ≈ 9 x 10^12 bytes, or roughly 10 TB. That does not include any subsequently parsed data. Double this to hold the data in all the formats one might be able to use meaningfully, which yields ~20 TB. “20 Tb is pretty small these days”, but if buying new storage capacity just for this, one may be better off accounting for up to 50-100 TB if buying bespoke. Cost: service costs can be as high as £1500 per TB. An NGS project on some 2000 individuals can cost as much as 40-50k on computing alone.
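The storage arithmetic above can be sketched in a few lines of Python. This is only an illustration of the slide's back-of-envelope estimate: the function name and parameter names are hypothetical, a decimal terabyte (10^12 bytes) is assumed, and the unrounded figures (9 TB raw, 18 TB doubled) are what the slide rounds to ~10 and ~20.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def ngs_storage_estimate(samples=2000, genome_bp=3_000_000_000,
                         bytes_per_bp=1.5, format_factor=2,
                         cost_per_tb_gbp=1500):
    """Back-of-envelope storage estimate using the figures quoted on the slide."""
    # Raw sequence reads: samples x genome length x bytes per base pair.
    raw_tb = samples * genome_bp * bytes_per_bp / BYTES_PER_TB
    # Doubling covers holding the data in all usable formats.
    total_tb = raw_tb * format_factor
    # Service cost at the quoted upper rate per terabyte.
    service_cost_gbp = total_tb * cost_per_tb_gbp
    return {"raw_tb": raw_tb, "total_tb": total_tb,
            "service_cost_gbp": service_cost_gbp}


estimate = ngs_storage_estimate()
print(estimate)
```

Changing `samples` or `bytes_per_bp` shows how quickly the estimate scales with cohort size or coverage, which is the point of planning for 50-100 TB rather than the bare minimum.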
Also receiving data on: • Copy number variation across the genome • Expression data (e.g. records of messenger RNA to track gene activity) • Methylome (markers of the epigenome). Not to mention phenotype data (a retrospective effort and an ever-increasing pool). Raises the issue of linkage and data USE…
Not just storage…
D’ vs r^2. Varying matrix properties and overlaid ribbon plots (here MAF). Male vs Female.
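For readers unfamiliar with the two linkage disequilibrium measures compared on this slide, here is a minimal sketch using the standard textbook definitions for two biallelic loci. The function name and its inputs (haplotype and allele frequencies) are assumptions for illustration; they are not taken from the talk, and D' is returned as an absolute value.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    # Raw disequilibrium: observed haplotype frequency minus expectation.
    d = p_ab - p_a * p_b
    # The largest |D| possible given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    # Squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# A rare allele B that only ever occurs on A-carrying haplotypes.
d, d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.5, p_b=0.1)
print(d_prime, r2)
```

In this example D' is 1 while r^2 is only about 0.11, which is why the two matrices can look so different for the same region when allele frequencies differ.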
Combinations of data processing/visualisation methods, e.g. follow-up of the dissection of the TCF2 locus and the contrasting results for T2D and prostate cancer. Other T2D loci? CDKAL. See: Amundadottir et al., Nature Genetics 2007
Not to mention iterative approaches! • Generation of empirical distributions for the purpose of comparison, e.g. expression data • Gene x gene (and possibly environment) interaction analysis, which may span the genome
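To give a feel for why these iterative approaches are computationally demanding, the sketch below builds an empirical null distribution by permutation and counts the pairwise tests in a genome-wide gene x gene scan. It is an illustration only: the function names are hypothetical, the test statistic (difference in group means) is an assumed stand-in for whatever statistic an analysis would use, and the 500,000-marker figure is an assumption loosely matching a 500K genotyping chip.

```python
import math
import random


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Empirical two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values breaks any link between value and group.
        rng.shuffle(pooled)
        if abs(mean_diff(pooled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_perm + 1)


def pairwise_tests(n_markers):
    """Number of distinct marker pairs in an exhaustive two-way scan."""
    return math.comb(n_markers, 2)


print(permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4]))
print(pairwise_tests(500_000))
```

Each permutation repeats the full analysis, and an exhaustive pairwise scan of 500,000 markers already involves roughly 1.25 x 10^11 tests, so compute time rather than storage quickly becomes the limiting resource.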
Overall: • As one would expect, data requirements are increasing • Genetic epidemiology has been boosted into a realm of real findings and exciting capability by the existence of new technology • Increases may (or may not) be more rapid than once thought • Storage and manipulation of large data sets present new challenges • A new breed of analyst is emerging: the computer scientist with a passion for biology • Perhaps Windows is dead…