Analysis of Next-Generation Sequencing (NGS) Data

Yun Li • Department of Genetics • Department of Biostatistics • University of North Carolina

Notes up-front • Focus of my part today • Next-Generation DNA Sequencing • vs RNA-seq, ChIP-seq etc • One particular type of genetic variant: SNPs • vs indels, CNVs, SVs • Diploid humans • vs other model organisms • Complex disease genetics • vs Mendelian diseases

Outline • Introduction to Basic Biology and Genetics • Introduction to NGS Technology (Illumina Solexa technology as an Example) • A Typical Workflow for NGS Analysis • Raw NGS Data • Read Alignment and Basic Quality Control • SNP Detection and Genotype Calling • Design of NGS-based Studies

Introduction part 1: Biology & genetics primer

The Human Genome • Genome: an individual's or species' genetic constitution; made up of chromosomes • Chromosome: threadlike body found in the nucleus of the cell and containing the genes; made up of double-stranded DNA and protein • Human Genome: • comprises 46 = 2*23 chromosomes (diploid) • 22 autosomes, numbered 1 to 22, mostly from longest to shortest • Present in two copies • One paternal, one maternal • Sex chromosomes X, Y: males XY; females XX • Total ~3 billion base pairs (bp) of DNA

The Human Genome (cont’d) • Gene: segment of DNA with a detectable function (e.g., coding for a protein); ~20,000 genes in the human genome • Locus: specific gene or DNA segment or region on a chromosome • Allele: a particular form of a gene or DNA segment • Polymorphic • Monomorphic • Figure: karyotype of a male

DNA • DNA (deoxyribonucleic acid): heteropolymer molecule constructed of sugars, phosphates, and bases that carries the genetic information • DNA is the information store: it encodes the information for cells and organisms to reproduce • DNA variation is responsible for many individual differences

DNA (cont’d) • Base pair (bp): DNA is double-stranded; each strand is a series of bases • Nucleotide: one nucleobase (nitrogenous base), a five-carbon sugar, and one phosphate group

Transition vs Transversion Mutation • Two groups of nitrogenous bases • Purines: A and G • Pyrimidines: C and T • Types of mutations • Transitions: within-group (A<->G; C<->T) • Transversions: between-group (A<->C; A<->T; G<->C; G<->T) • Transition-to-transversion (Ti/Tv) ratio is ~2
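The purine/pyrimidine grouping above can be turned into a small classifier. This is a minimal sketch: the function names `classify_substitution` and `ti_tv_ratio` and the toy list of substitutions are made up for illustration and are not part of any particular toolkit.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Within-group change (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_group = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_group else "transversion"


def ti_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    kinds = [classify_substitution(ref, alt) for ref, alt in substitutions]
    n_ti = kinds.count("transition")
    n_tv = kinds.count("transversion")
    if n_tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return n_ti / n_tv


# Toy input (hypothetical): 4 transitions and 2 transversions.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(example))
```

Although there are twice as many possible transversions as transitions, transitions occur more often in practice, which is why the observed ratio sits near 2 rather than 0.5.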

Central Dogma • DNA -> RNA -> Protein (genome, transcriptome, proteome) • Transcription and translation

Genetic Markers • SNP/SNV • Single Nucleotide Polymorphism/Variant • Haplotype: set of alleles together on the same chromosome • Haplotype 1: AAGGGATCCAC • Haplotype 2: AAGGAATCCAC
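A short sketch of how the two example haplotypes differ: it compares them position by position and reports the sites where the alleles disagree. The helper name `variant_sites` is hypothetical, and the comparison assumes the two sequences are already aligned and of equal length.

```python
def variant_sites(hap1: str, hap2: str):
    """Return (1-based position, allele on hap1, allele on hap2) for each mismatch."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must be aligned and of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(hap1, hap2), start=1)
        if a != b
    ]


hap1 = "AAGGGATCCAC"
hap2 = "AAGGAATCCAC"
print(variant_sites(hap1, hap2))  # [(5, 'G', 'A')]
```

The two haplotypes differ at a single site, a G/A SNP at position 5.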

SNPs • Single nucleotide substitution • The most abundant type of genetic variant in the human genome • >30,000,000 cataloged in the human genome • Easy to score cheaply and accurately • Vast majority have two alleles (di-allelic or bi-allelic) • Nomenclature: rs number, e.g., rs10885409 • Basis for genome-wide association studies (GWAS) • Microarrays with 100,000s-1,000,000s of SNPs • 1,000s of disease and trait associations identified • http://www.genome.gov/gwastudies/: 6499 (as of 6/22/2012)

Genetic Markers: Length Polymorphism • Microsatellite • Simple repeat sequence • Often 2-4 bp repeat; e.g., ---CACACACA--- • Common in the genome, often with many different alleles; ~15,000 mapped to specific locations • Nomenclature: D number, e.g., D22S1 • Primary marker for linkage studies • Variable number of tandem repeats (VNTR) • Typical repeat of 10-100 bp • CNV/CNP: Copy Number Variant/Polymorphism • Typically > 200 bp • Indel: Insertion, deletion variant

Genetic Markers: Structural Variant • Structural variant • Generally defined as a region of DNA approximately 1 Kb or larger in size; can include inversions and balanced translocations or genomic imbalances (e.g., indels)

Genotype and Phenotype • Diploid: two copies of each chromosome per cell, as in most human cells; pairs called homologues • Genotype: genetic constitution of the individual, usually referring to the locus or loci under study • Phenotype: • observed characters of individuals; expression of genes and other relevant factors • =trait: what we observe • Complex phenotype: determined by combinations of genes, environment, behavior. E.g., diabetes, hypertension, almost everything • Mendelian=simple phenotype: completely determined by one or a few genes. E.g., cystic fibrosis (CF).

ABO Blood Group • Based on antigenic substances A and B present on the surface of red blood cells • Coded by a gene on chromosome 9 • Alleles: A,B,O • Genotypes • Homozygote: two copies of the same allele. Eg, AA, BB, OO • Heterozygote: two different alleles. Eg, AO, BO, AB. • n alleles => n(n+1)/2 possible genotypes
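The n(n+1)/2 count can be checked by enumerating unordered pairs of alleles. A minimal sketch using the three ABO alleles from the slide; the function name `unordered_genotypes` is chosen here for illustration only.

```python
from itertools import combinations_with_replacement


def unordered_genotypes(alleles):
    """List all unordered allele pairs (homozygotes and heterozygotes)."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


abo_alleles = ["A", "B", "O"]
genotypes = unordered_genotypes(abo_alleles)
n = len(abo_alleles)

print(genotypes)                         # ['AA', 'AB', 'AO', 'BB', 'BO', 'OO']
print(len(genotypes), n * (n + 1) // 2)  # 6 6
```

The n homozygotes plus the n(n-1)/2 heterozygotes give n(n+1)/2 in total; for ABO that is 3 + 3 = 6.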

Genetic code • Universal translation from DNA and RNA to protein; 3 bases code for one amino acid (codon) • Synonymous vs non-synonymous variant
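To make the synonymous vs non-synonymous distinction concrete, the sketch below builds the standard genetic code table (DNA alphabet, `*` for stop codons) and checks whether a single-base change in a codon alters the encoded amino acid. The function name `classify_coding_variant` and its 0-based `pos` argument are assumptions made for this example; it ignores reading-frame determination, strand, and splicing, which a real annotation tool would have to handle.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_variant(codon: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position `pos` within a codon."""
    codon, alt = codon.upper(), alt.upper()
    if len(codon) != 3 or pos not in (0, 1, 2) or codon[pos] == alt:
        raise ValueError("expected a 3-base codon and a real substitution")
    mutated = codon[:pos] + alt + codon[pos + 1:]
    same_amino_acid = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same_amino_acid else "non-synonymous"


print(classify_coding_variant("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_coding_variant("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
```

Third-position changes are often synonymous because of the redundancy of the code, whereas most first- and second-position changes alter the amino acid.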

Introduction part 2: Intro to NGS technology

History of DNA Sequencing

A Road to Discover Human Genome • Human Genome Project: 1990-2003 • HapMap (hapmap.org): 2002 - • 1000 Genomes (www.1000genomes.org): 2008 -

Different Approaches • Deep whole genome sequencing • Expensive; can currently be applied only to a limited number of samples • Most complete ascertainment of all variations • Low coverage whole genome sequencing • Modest cost, typically hundreds to thousands of individuals sequenced • Complete ascertainment of common variations • Less complete ascertainment of rare variants • Exome capture and targeted region sequencing • Modest cost, high coverage • Most interesting part of the genome

With Complete Sequence Data • What is the contribution of each identified locus to a trait? • Multiple variants, common and rare • Effect size • What is the mechanism? What happens if we knock out a gene? • Most often, the causal variant is not examined directly by GWAS • Rare coding variants will provide important insights into mechanisms • What is the contribution of structural variation to disease? • These are hard to interrogate using current genotyping arrays • Are there additional susceptibility loci to be found? • Only a subset of functional elements includes common variants • Rare variants are more numerous and thus will point to additional loci

Mutation Allele Frequency Spectrum (n=100 chromosomes)

Site Frequency Spectrum • Number of variant alleles at each site • (n = 10,422 European Americans) • Total < 200 variant sites discovered in gene HHEX (7.9 Kb) • Sanger sequencing, variants validated by 454 pyrosequencing • Black line: expected from the Wright-Fisher constant population size model with mutation rate estimated by Watterson’s method • Ref: Coventry et al (2010) Nat Commun 1(8):131, Figure 3a.
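As a rough illustration of how an expected curve of this kind is obtained: Watterson's estimator divides the number of segregating sites S by the harmonic sum a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sampled chromosomes, and under the standard neutral constant-size model the expected number of sites carrying i copies of the derived allele is theta/i. The sketch below uses these textbook formulas with made-up toy numbers; it is not the analysis from Coventry et al, and the function names are hypothetical.

```python
def harmonic_sum(n_chrom: int) -> float:
    """a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n_chrom))


def watterson_theta(n_segregating: int, n_chrom: int) -> float:
    """Watterson's estimate of theta for the sequenced region."""
    return n_segregating / harmonic_sum(n_chrom)


def expected_sfs(theta: float, n_chrom: int):
    """Expected count of sites with derived-allele count i = 1..n-1 (neutral, constant size)."""
    return [theta / i for i in range(1, n_chrom)]


# Toy example (hypothetical numbers): 20 segregating sites in 10 chromosomes.
n_chrom, n_seg = 10, 20
theta = watterson_theta(n_seg, n_chrom)
spectrum = expected_sfs(theta, n_chrom)
print([round(x, 2) for x in spectrum])
```

By construction the expected counts sum back to S. An excess of singletons and other very rare variants relative to this 1/i curve is the usual signature of recent population growth.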

Sequencing Technologies • Sanger Capillary Sequencing • Essentially the only viable DNA sequencing technology for almost three decades after 1977 • Cost: ~$0.5 per Kb (~$1.5 million per whole genome) • Time: ~100 min per Kb (>570 years for one genome) • The Human Genome Project took ~13 years at 5 major sites + >30 sites across the globe • This cost and throughput prohibited its application to large-scale sequencing-based studies.

NGS • Next-generation sequencing (NGS) • AKA, massively parallel sequencing (MPS), high throughput sequencing (HTS) • Debut ~2004-2005 • Cost: <$0.00005 per Kb (~$150 for 1X coverage) • the drop in costs is more dramatic than Moore’s Law.

Sequencing Cost Drop Beats Moore’s Law

NGS (cont’d) • Time: • <0.002 min per Kb (~4 days for whole genome) • Illumina HiSeq 2000: 100-300 Gb/8 days!
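The whole-genome figures quoted for Sanger and NGS follow from the per-Kb numbers and a ~3 billion bp genome. A back-of-the-envelope sketch; the variable names are illustrative, and the HiSeq coverage line simply divides raw yield by genome size, ignoring unmapped and duplicate reads.

```python
GENOME_KB = 3_000_000            # ~3 billion bp expressed in Kb
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def scale_to_genome(per_kb: float) -> float:
    """Scale a per-Kb cost or time to one pass over the whole genome."""
    return per_kb * GENOME_KB


sanger_cost_usd = scale_to_genome(0.5)                   # ~$1.5 million
sanger_years = scale_to_genome(100) / MINUTES_PER_YEAR   # a little over 570 years
ngs_cost_usd = scale_to_genome(0.00005)                  # ~$150 at 1X coverage
ngs_days = scale_to_genome(0.002) / MINUTES_PER_DAY      # ~4 days

# Raw fold-coverage of one human genome from a 100-300 Gb run.
hiseq_fold = [gb * 1e9 / (GENOME_KB * 1000) for gb in (100, 300)]

print(f"Sanger: ${sanger_cost_usd:,.0f}, {sanger_years:.0f} years")
print(f"NGS:    ${ngs_cost_usd:,.0f}, {ngs_days:.1f} days")
print(f"HiSeq 2000 run: {hiseq_fold[0]:.0f}X to {hiseq_fold[1]:.0f}X raw coverage")
```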

At a Cost, Though • Shorter reads • Sanger sequencing: up to ~1 Kb • NGS technologies: typically 30-400 bp • Implication: many tasks (e.g., assembly, read alignment, haplotyping) become more challenging • Higher per-base sequencing error rate • Sanger sequencing: < 0.001% • NGS: 0.5-1% • Implication: redundant sequencing of each base is needed to distinguish sequencing errors from true polymorphisms
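Why redundancy helps can be seen with a simple binomial calculation. The sketch below assumes reads are independent, uses a 1% per-base error rate (the upper end of the range above), and treats "at least 5 non-reference reads" as an arbitrary illustrative threshold; none of this is a real genotype caller.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ERROR_RATE = 0.01   # assumed per-base error rate
HET_FRACTION = 0.5  # expected non-reference fraction at a true heterozygous site

# Depth 1: a single mismatching read is all we can observe.
p_error_depth1 = binom_tail(1, ERROR_RATE, 1)
p_het_depth1 = binom_tail(1, HET_FRACTION, 1)

# Depth 30: require at least 5 non-reference reads.
p_error_depth30 = binom_tail(30, ERROR_RATE, 5)
p_het_depth30 = binom_tail(30, HET_FRACTION, 5)

print(f"depth 1:  error {p_error_depth1:.3g}, true het {p_het_depth1:.3g}")
print(f"depth 30: error {p_error_depth30:.3g}, true het {p_het_depth30:.3g}")
```

At depth 1, a mismatch caused by error is not negligible compared with the chance of sampling the alternate allele at a real heterozygous site, especially since true variants are rare. At depth 30, errors alone almost never produce 5 or more non-reference reads, while a real heterozygote almost always does.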

Commonly Used Technologies • Illumina Solexa sequencing-by-synthesis • Roche 454 pyrosequencing • Applied Biosystems SOLiD • Helicos Biosciences • Pacific Biosciences • Ion Torrent • Complete Genomics • Oxford Nanopore • …

Illumina Solexa Technology Mardis (2008), Annual Review of Genomics and Human Genetics 9: 387-402

Illumina Solexa Technology (cont’d) Reversible terminators: F (fluorescent labels) Metzker (2010), Nat Rev Genet 11: 31-46

Wash out unincorporated nucleotides, then take a picture Metzker (2010), Nat Rev Genet 11: 31-46

Paired-ends

Mate Pairs/Paired Ends Medvedev et al (2009) Nat Methods 6: S13-20

Paired-End

Paired-End and Indel • Read pairs are generated by shearing DNA into fragments of approximately the same length (300±80 bases) and then sequencing ~35 bases at each end • Manske & Kwiatkowski (2009) Genome Res 19: 2125-2132

Deletion? Insertion? Deletion? Insertion?

Deletion Insertion
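The logic behind the deletion/insertion examples can be sketched as a comparison between the distance at which a read pair maps on the reference and the expected fragment length. If the sample carries a deletion between the two reads, the pair maps farther apart on the reference than the fragment length; if it carries an insertion, the pair maps closer together. The function name and the use of 300 and 80 as a hard window are assumptions for illustration; practical callers look for clusters of discordant pairs and model the insert-size distribution statistically.

```python
def classify_read_pair(mapped_span: int, expected: int = 300, tolerance: int = 80) -> str:
    """Classify a read pair by its outer mapped distance on the reference (in bases)."""
    if mapped_span > expected + tolerance:
        # Reference contains sequence that is missing from the sample fragment.
        return "possible deletion"
    if mapped_span < expected - tolerance:
        # Sample fragment contains extra sequence that is absent from the reference.
        return "possible insertion"
    return "concordant"


for span in (310, 520, 150):
    print(span, classify_read_pair(span))
```

A single discordant pair is weak evidence on its own, since fragment lengths vary and reads can be mismapped; several pairs pointing to the same location make an indel call more credible.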

What Do NGS Data Look Like?

Real Data

Now what do our data look like? • What do you want them to look like?
