The Next Generation Sequencing Revolution in Bioinformatics

BIO-454 Bio Computing. Lecture 19: Next Generation Sequencing (NGS) (Cont’d). Dr. Mohammad Nassef, Computer Science Department, Faculty of Computers and Information, Cairo University

The Human Genome Project Setia Pramana

The Human Genome Project
• First draft of the human genome in 2001, final draft in 2004
• Estimated cost: $3 billion (about one dollar per nucleotide)
• Time: 13 years
• Used Sanger sequencing
• Today: Illumina: 1 week, $9,500; Exome: 6 weeks, $1,000

Next Generation Sequencing (NGS)
• New technologies that allow the massive production of tens of millions of short sequencing fragments.
• These techniques can address problems similar to those handled by microarrays, as well as many others.
• They raised the promise of personalized medicine.

Next Generation Sequencing (NGS)
• Also called:
  - Second Generation Sequencing
  - High-throughput Sequencing
  - Massively-parallel Sequencing

Next Generation Sequencing (NGS)
• NGS is based on sequencing a huge number of short DNA fragments. The resulting short reads can either be:
  - Overlapped to rebuild the original genome from scratch (de novo assembly). This is similar to the Newspaper Problem.
  - Aligned to a previously sequenced reference genome (reference-based assembly). The short reads that align to specific locations in the genome can provide information about the active/genetic regions at these locations.
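
The overlap idea behind de novo assembly can be sketched in a few lines of Python. This is only a toy greedy merger on three made-up reads, not how production assemblers work; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical short reads taken from the toy genome ATGGCGTGCA.
toy_reads = ["ATGGCG", "GCGTGC", "GTGCA"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTGCA']
```

The nested loop compares every read with every other read, so the work grows roughly with the square of the number of reads. That is acceptable for three reads and hopeless for tens of millions, which is one reason real assemblers use indexed data structures instead.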

The Newspaper Problem: Sequencing of Genomes
• Biological genomes are turned into short digital DNA reads.
• These reads must be assembled to form the entire genome.
FCI-CU-EG

The Newspaper Problem as an Overlapping Puzzle

Modern Sequencing
• Researchers take a small tissue or blood sample containing millions of cells with identical DNA.
• They use biochemical methods to break the identical copies of the genome into fragments at random locations.
• They then sequence these fragments to produce short reads.
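
The three steps above can be mimicked with a small simulation. The Python sketch below is a simplified illustration, not a model of any real library-preparation protocol: the function `shotgun_reads`, its parameters, the fixed read length, and the 40-base toy genome are all assumptions.

```python
import random

# A hypothetical 40-base genome standing in for real DNA.
GENOME = "ATGGCGTGCAATGCCTTAGGACCTGAATCGGATTACAGCT"


def shotgun_reads(genome, n_copies=20, read_len=8, seed=0):
    """Break several identical copies of `genome` at random positions and
    'sequence' the first `read_len` bases of every long-enough fragment."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    for _ in range(n_copies):
        # Pick random cut points for this copy of the genome.
        n_cuts = max(1, len(genome) // (2 * read_len))
        cuts = sorted(rng.sample(range(1, len(genome)), n_cuts))
        bounds = [0] + cuts + [len(genome)]
        for start, end in zip(bounds, bounds[1:]):
            fragment = genome[start:end]
            if len(fragment) >= read_len:
                reads.append(fragment[:read_len])
    return reads


reads = shotgun_reads(GENOME)
print(len(reads), "reads, for example:", reads[:3])
```

Because each copy is cut at different random positions, the reads from different copies start at different offsets and therefore overlap, which is what makes reassembly possible.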

Challenges

NGS vs. Microarray Technologies
• The most common reasons researchers prefer microarrays are:
  - They are cheap.
  - They are well-established technologies, in use for around two decades.
  - Datasets are abundant.
  - Many data analysis tools are available.
  - They can handle a large number of samples.
• However, NGS technologies are more accurate.

NGS Technologies/Platforms

NGS
• Significantly reduced sequencing costs, making large-scale and whole-genome sequencing (WGS) studies much more affordable

NGS Technologies/Platforms

Differences between platforms
• Run times vary from hours to days
• Output per run ranges from Mb to Gb
• Read length ranges from <100 bp to >1,500 bp
• Error rate per base ranges from 0.1% to 15%
• Cost per base varies

NGS Applications (diagram): RNA-seq, whole-genome sequencing, gene regulation, exome sequencing, epigenetics, resequencing, metagenomics

NGS Applications
• Whole-genome re-sequencing
• Ancient genomes
• Metagenomics
• Cancer genomics
• Exome sequencing (targeted)
• RNA sequencing
• Chromatin immunoprecipitation sequencing (ChIP-Seq): protein interaction with DNA
• Genomic epidemiology
• Epigenomics
• Human genetic variation: SNP, CNV (diseases)
• Anything with DNA

Sequencing Factory: Beijing Genomics Institute (BGI)
• Purchased 128 HiSeq 2000 sequencers from Illumina in January 2010
• Each sequencer can produce 25 billion base pairs of sequence per day
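
A quick back-of-the-envelope check of what those figures imply, as a Python sketch. The sequencer count and daily output come from the slide; the 3-billion-base genome size follows from the "one dollar per nucleotide" estimate earlier in the lecture, and the 30x coverage is an assumption borrowed from the whole-genome figures used later on.

```python
SEQUENCERS = 128                      # from the slide
BASES_PER_SEQUENCER_PER_DAY = 25_000_000_000  # 25 billion bp/day, from the slide
GENOME_SIZE = 3_000_000_000           # ~3 billion bases (approximate)
COVERAGE = 30                         # assumed 30x coverage per genome

# Combined daily output of the whole facility.
total_bases_per_day = SEQUENCERS * BASES_PER_SEQUENCER_PER_DAY

# How many 30x human genomes that raw output corresponds to.
genomes_per_day = total_bases_per_day / (GENOME_SIZE * COVERAGE)

print(f"{total_bases_per_day:,} bases per day")              # 3,200,000,000,000
print(f"about {genomes_per_day:.1f} human genomes at 30x")   # about 35.6
```

Under these assumptions the facility produces roughly 3.2 trillion bases per day, the raw equivalent of about 35 human genomes at 30x coverage.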

NGS Application: Whole Genome Seq

NGS Application: Exome Seq

NGS Application: RNA Sequencing

Bioinformatics Challenges of NGS

Sequencing Has Gotten Cheaper and Faster: The Race for the $1,000 Genome
Cost of one human genome:
• HGP: $3 billion (13 years)
• 2004: $30,000,000
• 2008: $100,000
• 2010: $30,000
• 2011: $10,000
• 2012-13: $7,000
• 2014: $4,000 (~1 week)
• ???: $1,000

(Sequencing) Cost Is Getting Cheaper
• Significantly reduced sequencing costs make large-scale and whole-genome sequencing (WGS) studies much more affordable

NGS Challenges

Huge Data Storage and HPC Demand

Generalized NGS Analysis

NGS Challenges
• The highest cost is (almost) no longer the sequencing itself but storage and analysis.
• A standard human whole-genome sequencing run (30-40x coverage) creates about 100 Gb of data.
• Extreme data size causes problems:
  - Just transferring and storing the data is hard.
  - Standard all-against-all comparisons fail (N*N).
  - Standard tools cannot be used.
• Think in terms of fast, parallel programs.
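
Two of these numbers are easy to reproduce. The Python sketch below assumes a genome of about 3 billion bases (as implied by the Human Genome Project cost estimate) and uses the hypothetical helper names `sequenced_bases` and `pairwise_comparisons`. It shows where a figure of roughly 100 billion bases comes from, and why all-against-all comparisons stop being feasible.

```python
GENOME_SIZE = 3_000_000_000  # ~3 billion bases (approximate)


def sequenced_bases(coverage: int) -> int:
    """Total number of bases read when the genome is covered `coverage` times."""
    return GENOME_SIZE * coverage


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n items; grows roughly as N*N."""
    return n * (n - 1) // 2


for coverage in (30, 40):
    print(f"{coverage}x coverage: {sequenced_bases(coverage):,} bases")
# 30x coverage: 90,000,000,000 bases
# 40x coverage: 120,000,000,000 bases

print(f"{pairwise_comparisons(10_000_000):,} pairs among 10 million reads")
# 49,999,995,000,000 pairs among 10 million reads
```

Even at a billion comparisons per second, the roughly 50 trillion pairs among 10 million reads would take more than 13 hours, which is why indexing and parallel programs are needed.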

Bioinformatics Challenges of NGS
• Need for a large amount of CPU power
  - Informatics groups must manage compute clusters
  - Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  - Another level of software complexity and challenges to interoperability

Bioinformatics Challenges of NGS
• VERY large text files (~10 million lines long)
  - Can’t do ‘business as usual’ with familiar tools such as Perl/Python
  - Prohibitive memory usage and execution time
  - Impossible to browse for problems
• Need for sequence quality filtering
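
One common way around the memory problem is to stream a file line by line instead of loading it whole. The Python sketch below illustrates that idea only; the function name `stream_stats` is hypothetical, and an in-memory `io.StringIO` object stands in for a real multi-gigabyte file opened with `open(...)`.

```python
import io


def stream_stats(handle):
    """Count lines and sequence characters while holding one line in memory."""
    n_lines = 0
    n_chars = 0
    for line in handle:  # file-like objects yield one line at a time
        n_lines += 1
        n_chars += len(line.rstrip("\n"))
    return n_lines, n_chars


# Tiny stand-in for a very large text file.
demo = io.StringIO("ACGT\nGGCC\nTTAA\n")
print(stream_stats(demo))  # (3, 12)
```

Memory use stays constant no matter how many lines the file has, although execution time still grows with file size.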

Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  - 20 million reads (50 bp) ~ 1 Gb
• More of an issue for a facility: for HiSeq, 32 CPU cores with 4 GB RAM each are recommended
• Certain studies are much more data intensive than others
  - Whole-genome sequencing at 30x coverage: one genome pair (tumor/normal) ~ 500 GB; 50 genome pairs ~ 25 TB
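
The figures on this slide can be verified with simple arithmetic. The Python sketch below uses only the numbers quoted above and assumes decimal units (1 TB = 1,000 GB); the variable names are chosen for illustration.

```python
# 20 million reads of 50 bp each.
n_reads = 20_000_000
read_length_bp = 50
total_bases = n_reads * read_length_bp
print(f"{total_bases:,} bases")  # 1,000,000,000 bases, i.e. about 1 Gb

# 50 tumor/normal genome pairs at roughly 500 GB per pair.
gb_per_pair = 500
n_pairs = 50
total_gb = gb_per_pair * n_pairs
total_tb = total_gb / 1000  # decimal units assumed
print(f"{total_gb:,} GB = {total_tb:g} TB")  # 25,000 GB = 25 TB
```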

Bioinformatics Challenges of NGS
• In NGS we have to process very large amounts of data, which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructures, so not everyone can study everything. Small facilities must choose their projects carefully to match their computing capabilities.

Computational Infrastructure for NGS
We can start with:
- Computing cluster: multiple nodes (servers), each with multiple cores
  • High-performance storage (TB to PB level)
  • Fast networks (10 Gb Ethernet, InfiniBand)
- Enough space and suitable conditions for the equipment (a "server room")
- Skilled people (system administrators, developers)

Big Computing Infrastructure
• Distributed-memory cluster
  - Starting at 20 computing nodes
  - 60 to 240 cores
  - At least 48 GB RAM per node
• Fast networks
  - 10 Gbit
  - InfiniBand
• Optional MPI and GPU environment, depending on project requirements
• Starting at €200,000 (hardware only)

Middle-Size Infrastructure
• "Small" distributed file system (around 50 TB)
• "Small" cluster (around 10 nodes, 80 to 120 cores)
• At least a Gigabit Ethernet network
• Price range: €50,000 to €100,000 (hardware only)

Small Infrastructure
• At least 2 machines recommended
• 8 or 12 cores per machine
• 48 GB RAM minimum per machine
• BIG local disk: at least 4 TB per machine, and as many local disks as we can afford
• Price range: starting at €8,000 to €10,000 (2 machines)

Alternatives
• Cloud computing
• Grid computing

Swedish National Infrastructure for Large Scale DNA sequencing (SNISS)

UPPNEX
• UPPmax NEXt generation sequence Cluster & Storage
• Located at UPPMAX, the Uppsala Multidisciplinary Center for Advanced Computational Science
• Dedicated computer cluster (500 nodes)
• UPPNEX serves over 240 projects and hosts over 800 TB of data

Interpretation Bottleneck

Big Collaboration
• Collaborative expertise (human intelligence and intuition) is required for meaning and interpretation (Bergeron 2002)
• This includes on-demand communication and sharing of protocols, electronic resources, data, and findings among the stakeholders
• Collaboration with other Big Data sources: national registers, BPJS, hospitals, etc.

Next Generation Projects
• 1000 Genomes Project (to provide a comprehensive resource on human genetic variation)
• TCGA (The Cancer Genome Atlas)
• MalariaGEN: sequencing thousands of malaria isolates
• 1001 Genomes Project: Arabidopsis WGS
• UK10K: sequencing 10,000 healthy and disease-affected individuals
• Southeast Asia Mycobacterium tuberculosis complex (MTBC) DB: sequencing MTBC isolates
• Many more...

Collaboration Challenges
• Potential conflict between traditional silo researchers and those embracing Big Collaboration
• Compatible technologies and cloud infrastructures
• IT management of groups with different tools, requirements, and expectations
• Ownership of data
• Government regulations and policies
• Accessible data repositories and lack of transparency in findings
• Resources to support bioinformatics
• Patient privacy

Five Domains of Genomic Research (Green. 2011. Nature 470, 204-213)

Summary
• Unraveling bioinformatics (Big) Data would support the right decisions at the right time for the right patients.
• The problem is not producing data but interpreting them.
• Bioinformatician is one of the hottest jobs.

Summary
• Challenges:
  - Still expensive
  - Lack of infrastructure (in developing countries)
  - Lack of skilled personnel in bioinformatics
  - Need for (large-scale) collaborations
  - Integrating different technologies and systems
  - Making it all clinically relevant

Manipulating RNA-seq Data in R
The processed RNA-seq datasets come in two formats:
• A large dataset that contains all the sequenced RNA reads, along with other information about each read.
  - Use this kind of dataset if you are interested in analyzing the RNA sequences themselves (matches and differences between the RNA sequences of different samples).
• A dataset that reflects the expression level of genes, based on the amount of sequenced RNA that has been aligned to each genetic region in a reference genome.
  - Use this kind of dataset when you are interested in comparing gene expression levels between different samples.
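
To make the second kind of dataset concrete, here is a minimal sketch of an expression table and one simple way to make samples comparable. It is written in Python rather than R purely for illustration. The gene names, sample names, and counts are invented, and the counts-per-million (CPM) scaling is an assumption chosen as a simple example rather than a method prescribed by the lecture.

```python
# Hypothetical read counts per gene for two samples (invented numbers).
counts = {
    "sample_A": {"gene1": 100, "gene2": 300, "gene3": 600},
    "sample_B": {"gene1": 400, "gene2": 600, "gene3": 1000},
}


def cpm(sample_counts):
    """Scale raw counts to counts per million reads in the sample."""
    total = sum(sample_counts.values())
    return {gene: count * 1_000_000 / total
            for gene, count in sample_counts.items()}


normalized = {sample: cpm(genes) for sample, genes in counts.items()}

for gene in counts["sample_A"]:
    a = normalized["sample_A"][gene]
    b = normalized["sample_B"][gene]
    print(f"{gene}: A={a:,.0f} CPM, B={b:,.0f} CPM")
```

Scaling by the total number of reads matters because sample_B was sequenced twice as deeply in this toy example. For gene3 the raw counts differ (600 versus 1,000), yet after scaling sample_A shows the higher relative expression.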

Manipulating RNA-seq Data in R: Kind 1
• This kind of dataset is stored in FASTA/FASTQ files.
• The difference:
  - FASTA files: each RNA read has 2 lines (info + sequence)
  - FASTQ files: each RNA read has 4 lines (info + sequence + a "+" separator line + quality info)
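
The 2-line versus 4-line layout can be shown with a small reader. This sketch is in Python rather than R, and only for illustration. It assumes every read occupies exactly 2 (FASTA) or 4 (FASTQ) lines with no wrapped sequences, and that quality characters use the Phred+33 encoding. The function names and the two example reads are hypothetical.

```python
import io
import itertools


def read_records(handle, lines_per_record):
    """Yield lists of `lines_per_record` consecutive lines from a file."""
    lines = (line.rstrip("\n") for line in handle)
    while True:
        record = list(itertools.islice(lines, lines_per_record))
        if len(record) < lines_per_record:
            return  # end of file (an incomplete trailing record is dropped)
        yield record


def parse_fasta(handle):
    """FASTA: 2 lines per read, '>info' then the sequence."""
    for info, seq in read_records(handle, 2):
        yield info[1:], seq


def parse_fastq(handle):
    """FASTQ: 4 lines per read, '@info', sequence, '+', quality string."""
    for info, seq, _plus, qual in read_records(handle, 4):
        scores = [ord(ch) - 33 for ch in qual]  # Phred+33 (assumed encoding)
        yield info[1:], seq, scores


fasta_text = ">read1\nACGT\n>read2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"

print(list(parse_fasta(io.StringIO(fasta_text))))
for name, seq, scores in parse_fastq(io.StringIO(fastq_text)):
    print(name, seq, "mean quality:", sum(scores) / len(scores))
```

A quality filter then only needs to keep the reads whose mean score exceeds a chosen threshold. For real data, an established parser is a safer choice than a hand-written one, since real files may wrap sequences across several lines.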

Sample FASTA File
