Special Topics in Genomics
Lecture 1: Introduction
Instructor: Hongkai Ji, Department of Biostatistics
Email: hji@jhsph.edu
Outline of today’s lecture • Introduction to genome and genomics • Topics and tools • Relevance of statistics
DNA
DNA (deoxyribonucleic acid) molecules store the genetic information of a living organism. DNA consists of two polymers made from four types of nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). Purines: A, G; pyrimidines: C, T. The two polymers are complementary to each other and form a double-helix structure:
5’-ACCGTTCGACGGTAA-3’
   |||||||||||||||
3’-TGGCAAGCTGCCATT-5’
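The base-pairing rule on this slide (A with T, C with G) can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture; the function names `complement` and `reverse_complement` are chosen here for the example.

```python
# Watson-Crick pairing: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the paired bases, read in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


top = "ACCGTTCGACGGTAA"                    # 5'->3' strand from the slide
bottom_3_to_5 = complement(top)            # as drawn under the top strand
bottom_5_to_3 = reverse_complement(top)    # same strand, conventional direction

print("5'-" + top + "-3'")
print("3'-" + bottom_3_to_5 + "-5'")
print("5'-" + bottom_5_to_3 + "-3'")
```

The second printed line reproduces the lower strand shown on the slide; the third is the same strand written 5' to 3', which is the form sequence databases usually store.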
Genome
TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCTGACATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGACTCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT…
Total amount of DNA in the human genome: 3 × 10^9 base pairs (bp)
[Figure: six labels, each reading "Gene"]
Central Dogma Gene expression
Topic 1: gene expression and microarray
[Figure: items labeled X, Y, Z and A, B, C; panel labels: Expression, No Expression, Spatially, Temporally]
Microarray cDNA sample probe
Topic 2: transcriptional regulation
Transcription factors (TF): TF1, TF2
Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
[Figure: the sequence below shown three times, with TF1 and TF2 labels]
TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
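Locating the two binding sites in the example sequence is a plain substring search. The sketch below is illustrative only: the helper `find_all` is written for this example, and pairing TF1 with CCACCCAC and TF2 with TAATAAAAT is an assumption based on the order in which the slide lists them.

```python
def find_all(sequence: str, site: str) -> list[int]:
    """Return every 0-based start position of `site` in `sequence`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)  # allow overlapping hits
    return positions


# Example sequence from the slide (the trailing "..." is left out).
promoter = ("TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACC"
            "ACTGAATGAAATACAAAATCTATGTATGA")

# Assumed pairing of factor to site, following the slide's listing order.
sites = {"TF1": "CCACCCAC", "TF2": "TAATAAAAT"}

hits = {tf: find_all(promoter, site) for tf, site in sites.items()}
for tf, positions in hits.items():
    print(tf, sites[tf], positions)
```

Exact matching works here because the sites are given as fixed strings; real binding sites vary from copy to copy, which is why the next slide describes them with a motif matrix instead.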
Transcription factor binding motif
Sequences bound by the TF:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
Transcription Factor Binding Sites (TFBS), aligned:
123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
Motif (base counts at each position):
   1  2  3  4  5  6  7  8  9
A  0  0  1  0  1  0  0  0  1
C  0  0  0  0  0  0  0  0  4
G  0  6  5  6  0  6  6  0  1
T  6  0  0  0  5  0  0  6  0
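The count matrix on this slide can be rebuilt by tallying the bases of the six aligned sites column by column. The following is a minimal sketch; `count_matrix` and `frequency_matrix` are names made up for this example.

```python
# The six aligned binding sites from the slide.
sites = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]


def count_matrix(aligned_sites, alphabet="ACGT"):
    """Count how often each base occurs at each aligned position."""
    width = len(aligned_sites[0])
    counts = {base: [0] * width for base in alphabet}
    for site in aligned_sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts


def frequency_matrix(counts, n_sites):
    """Turn counts into per-position base frequencies."""
    return {base: [c / n_sites for c in row] for base, row in counts.items()}


counts = count_matrix(sites)
freqs = frequency_matrix(counts, len(sites))
for base in "ACGT":
    print(base, counts[base])
```

Dividing each count by the number of sites gives the frequency form of the motif, for example 5/6 ≈ 0.83 for G at position 3.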
Motifs are regulatory codes in the genome TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCACCCATGTTTTGTTTATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGTTGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTTTTGTTTTCACCTGTCCCCACCCATAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGACTCCACCCAACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT Transcription Factor Binding Sites (TFBS) Gene
Gene regulatory network
[Figure labels: Transcription factors (TF1, TF2, TF3); Other genes (Gene1, Gene2, Gene3); Activation; Repression; Other Interactions; Misregulation; Diseases]
TACTACCACCCACAACATAATAAAATCTAA
TTAATAAAATACCACCCACAACCTAAGGAT
Motif discovery and decoding regulatory programs in the genome
Genomic Language
GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
step1: Dictionary
step2: GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGGAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
Human Language
guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy
step1: Dictionary: Know, Guess, Be, …
step2: Guess what the story is. As long as you know the language, it should be pretty easy.
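The human-language half of the analogy can be made concrete: once a dictionary is known (step 1), the unspaced text can be split into words (step 2). The sketch below uses a simple dynamic-programming word break. It illustrates the analogy only, not a motif-finding method, and the dictionary contents are filled in here as an assumption, since the slide lists only "Know, Guess, Be, …".

```python
def segment(text, dictionary):
    """Split `text` into dictionary words; return None if impossible.

    best[i] holds one segmentation of text[:i], or None if none exists.
    """
    best = [None] * (len(text) + 1)
    best[0] = []
    longest = max(len(word) for word in dictionary)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - longest), end):
            word = text[start:end]
            if best[start] is not None and word in dictionary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]


# Hypothetical dictionary (step 1); the slide shows only its first entries.
dictionary = {
    "guess", "what", "the", "story", "is", "as", "long", "you", "know",
    "language", "it", "should", "be", "pretty", "easy",
}

text = "guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy"
words = segment(text, dictionary)  # step 2
print(" ".join(words))
```

The genomic problem is harder because the dictionary itself is unknown and the "words" (motifs) are degenerate, so step 1 has to be inferred from the sequence data.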
Finding motifs from co-regulated genes (Roth et al., 1998; Hughes et al., 2000; etc.)
[Figure: expression matrix with rows Gene 1, Gene 2, Gene 3, …, Gene N and columns Condition1, Condition2]
Sequences shown for Gene1, Gene2, Gene3:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio
yeast: Gene1, Gene2, Gene3, each shown with a 100~1000 bp region
human: Gene1, Gene2, Gene3, each shown with a 10k~1000k bp region
Topic 3: ChIP-chip and tiling array
ChIP-chip (Chromatin ImmunoPrecipitation coupled with Microarray)
[Figure labels: 500~2000 bp long; No IP; IP]
ChIP-chip on tiling arrays
Probe: 25~60 bp long, 35~300 bp spacing
Fragments: 500~2000 bp long
Probe intensities (IP = immunoprecipitated sample, CT = control; two replicates each):
IP1  1000  20  32  1120  800  50  12  1700  600  11  20  17  80  780  60
IP2  1200  30  25  1500  730  45  11  1650  700  15  30  23  90  790  70
CT1    80  32  30    21   32  35  22    50   30  24  25  33  12   30  10
CT2    20  25  27    50   29  60  17    45   20  13  15  29  21   45  13
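One simple way to read this table is to compare, probe by probe, the average IP intensity with the average control intensity. The sketch below computes a log2 fold enrichment per probe. It is an illustration only, not the analysis method taught in the course, and the 4-fold cutoff (log2 > 2) is an arbitrary value assumed for the example.

```python
import math

# Probe intensities copied from the slide (15 probes, two replicates each).
ip1 = [1000, 20, 32, 1120, 800, 50, 12, 1700, 600, 11, 20, 17, 80, 780, 60]
ip2 = [1200, 30, 25, 1500, 730, 45, 11, 1650, 700, 15, 30, 23, 90, 790, 70]
ct1 = [80, 32, 30, 21, 32, 35, 22, 50, 30, 24, 25, 33, 12, 30, 10]
ct2 = [20, 25, 27, 50, 29, 60, 17, 45, 20, 13, 15, 29, 21, 45, 13]


def log2_enrichment(ip_replicates, ct_replicates):
    """Per-probe log2(mean IP intensity / mean control intensity)."""
    n_probes = len(ip_replicates[0])
    scores = []
    for j in range(n_probes):
        ip_mean = sum(rep[j] for rep in ip_replicates) / len(ip_replicates)
        ct_mean = sum(rep[j] for rep in ct_replicates) / len(ct_replicates)
        scores.append(math.log2(ip_mean / ct_mean))
    return scores


scores = log2_enrichment([ip1, ip2], [ct1, ct2])

# Hypothetical cutoff: call a probe enriched if IP is more than 4-fold above CT.
enriched = [j for j, score in enumerate(scores) if score > 2]
print([round(s, 2) for s in scores])
print("enriched probes (0-based):", enriched)
```

A plain ratio like this ignores probe-specific noise and the small number of replicates, which is one reason the lecture later turns to statistical tools such as pooled variance estimates.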
A combined approach to study gene regulation
ChIP-chip: 500~2000bp
Sequence Analysis: 6~30bp
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
Topic 4: alternative splicing and exon array
[Figure labels: gene; promoter; transcription start site (TSS); exon, intron, exon, intron, exon; splicing]
Alternative splicing exon 1 exon 2 exon 3 exon 4 exon 5 Isoform 1 Isoform 2 Isoform 3
Topic 5: single nucleotide polymorphism and SNP array
SNPs: • occur every 100 to 1000 bp • make up 90% of genetic variation • have a minor allele frequency >= 1% (otherwise we call them mutations)
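The 1% rule above depends on the minor allele frequency (MAF), which can be computed directly from genotype calls. The sketch below assumes a biallelic site and uses made-up genotype counts (50 CC, 40 CT, 10 TT); neither the data nor the function name comes from the lecture.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele at a biallelic site.

    Each genotype is a two-letter string such as "CT" (one letter per
    chromosome copy). Returns 0.0 if only one allele is observed.
    """
    allele_counts = Counter(allele for genotype in genotypes
                            for allele in genotype)
    if len(allele_counts) < 2:
        return 0.0
    total = sum(allele_counts.values())
    return min(allele_counts.values()) / total


# Hypothetical sample of 100 individuals.
genotypes = ["CC"] * 50 + ["CT"] * 40 + ["TT"] * 10

maf = minor_allele_frequency(genotypes)
is_snp = maf >= 0.01  # the slide's threshold
print(f"MAF = {maf:.2f}, counted as a SNP: {is_snp}")
```

Here the T allele appears on 40 + 2 × 10 = 60 of 200 chromosomes, so the MAF is 0.30 and the site passes the 1% threshold.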
SNP array
ACCGTGGA[C/T]CTGAACCG
||||||||  |  ||||||||
TGGCACCT[G/A]GACTTGGC
Probes:
ACCGTGGA[G]CTGAACCG
ACCGTGGA[C]CTGAACCG
ACCGTGGA[T]CTGAACCG
ACCGTGGA[A]CTGAACCG
What will happen when the genotype is CC? CT? TT?
Applications:
1. Genotyping & genome-wide association study
2. Copy number variations and loss of heterozygosity
3. Allele specific expression
…
Topic 6: next-generation sequencing Traditional sequencing
Next-generation sequencing
1. Prepare genomic DNA
2. Attach DNA to surface
3. Bridge amplification
4. Fragments become double-stranded
5. Denature the double-stranded molecules
6. Complete amplification
7. Determine first base
8. Image first base
9. Determine second base
10. Image second base
11. Sequence reads over multiple cycles
12. Align data
>50 million clusters per flow cell, each with 1000 copies of the same template; 1 billion bases per run; 1% of the cost of the capillary-based method. (From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)
Array vs. next-generation sequencing
Array                      Next-generation sequencing
Microarray, Exon array     RNA-seq
ChIP-chip                  ChIP-seq
SNP array                  SNP/mutation detection by sequencing
…                          …
Other topics • Epigenomics • Transposon • miRNA
Relevance of statistics
Genomics → Statistics: Need new statistical theories and tools
Statistics → Genomics: Guide development of efficient data analysis strategies
Example 1: multiple testing
Gene                   i=1    i=2    i=3    …   i=I
t-statistic            1.2    6.7    5.1    …   -0.5
p-value                0.30   0.001  0.002  …   0.56
Bonferroni adjustment  …
Rejections             …
• Multiplicity needs to be adjusted for in order to determine statistical significance • The Bonferroni adjustment is too stringent • False discovery rate
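The Bonferroni adjustment compares each p-value with α divided by the number of tests. The sketch below applies it to the four p-values visible in the table. Treating those four as the whole family (I = 4), and the alternative I = 10000, are assumptions made only for illustration, since the slide elides the rest of the table.

```python
def bonferroni_reject(p_values, alpha=0.05, n_tests=None):
    """Reject H0 for each test whose p-value is <= alpha / n_tests.

    n_tests defaults to len(p_values); pass a larger number to mimic a
    genome-wide family in which these p-values are only a few of many.
    """
    if n_tests is None:
        n_tests = len(p_values)
    threshold = alpha / n_tests
    return [p <= threshold for p in p_values]


# The p-values shown on the slide for genes i=1, 2, 3 and I.
p_values = [0.30, 0.001, 0.002, 0.56]

small_family = bonferroni_reject(p_values)                 # assumes I = 4
large_family = bonferroni_reject(p_values, n_tests=10000)  # assumes I = 10000
print(small_family)
print(large_family)
```

With only four tests, genes 2 and 3 are rejected. With 10000 tests the threshold drops to 0.000005 and nothing is rejected, which illustrates why the slide calls the adjustment too stringent for genomic data.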
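False discovery rate (FDR; Benjamini & Hochberg, 1995)
Let R be the total number of rejections and V the number of false rejections (true null hypotheses that are rejected).
FDR = E(V/R) = Pr(R > 0) E(V/R | R > 0), with V/R taken to be 0 when R = 0
FWER (family-wise error rate) = Pr(V ≥ 1)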
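The step-up procedure from the cited Benjamini & Hochberg (1995) paper controls the FDR at level q by finding the largest rank k whose sorted p-value satisfies p(k) ≤ kq/m and rejecting the k smallest p-values. The sketch below is a minimal implementation; the five p-values in the example are made up to show a case where it rejects more than Bonferroni would.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (True = reject) in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            largest_k = rank
    rejected = [False] * m
    for index in order[:largest_k]:
        rejected[index] = True
    return rejected


# Hypothetical p-values, not from the slide.
example = [0.001, 0.012, 0.02, 0.03, 0.5]
bh = benjamini_hochberg(example, q=0.05)
bonferroni = [p <= 0.05 / len(example) for p in example]
print("BH:        ", bh)
print("Bonferroni:", bonferroni)
```

On this example Bonferroni (threshold 0.01) rejects one hypothesis, while the FDR procedure rejects four, at the price of controlling the expected proportion of false rejections rather than the chance of any false rejection.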
Pooling information
Multiplicity causes problems in controlling type I errors, but it can also be used to improve statistical power!
[Figure: Tests 1, 2, 3, …, I, each with a sample variance (df); the variances are related through a common distribution, giving variance estimates and modified t-statistics]
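One common way to pool information across tests is to shrink each test's sample variance toward a shared prior value, weighting the two by their degrees of freedom, and to use the shrunken variance in the t-statistic. The sketch below shows that form. The slide does not give a formula, so this is an assumed illustration, and the prior variance 0.25, prior degrees of freedom 4, and the example gene are hypothetical numbers.

```python
import math


def shrunken_variance(sample_var, df, prior_var, prior_df):
    """Degrees-of-freedom weighted average of prior and sample variance."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


def t_statistic(effect, variance, n):
    """One-sample t-statistic: effect divided by its standard error."""
    return effect / math.sqrt(variance / n)


# Hypothetical gene: 3 replicates, mean effect 0.5, and a sample variance
# that happens to be very small by chance.
n, effect, sample_var = 3, 0.5, 0.01
df = n - 1

# Hypothetical prior summarizing the variances of all I tests.
prior_var, prior_df = 0.25, 4

pooled = shrunken_variance(sample_var, df, prior_var, prior_df)
ordinary_t = t_statistic(effect, sample_var, n)
moderated_t = t_statistic(effect, pooled, n)
print(f"ordinary t = {ordinary_t:.2f}, modified t = {moderated_t:.2f}")
```

With few replicates, a gene can get a huge ordinary t-statistic just because its variance estimate is accidentally tiny; borrowing strength from the other tests tempers that, which is the sense in which multiplicity helps.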
Example 2: motif discovery
Background: θ0
    A   C   G   T
A  .3  .2  .2  .3
C  .2  .3  .3  .2
G  .2  .3  .3  .2
T  .3  .2  .2  .3
Motif: Θ
      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
S (sequence) and A (motif start indicator):
S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: 000000000000001000000000000000000000000001000000000000000000000000000000
Inference on f (A, Θ | S) by iterative estimation/sampling (Gibbs sampler)
Marginalization: f (A | S) = ∫ f (A, Θ | S) dΘ
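One half of the iterative scheme is easy to sketch: with the motif matrix Θ held fixed, score every 9-base window of S under Θ to see where a motif could start. The code below does only that step. It is a simplified illustration, not a full Gibbs sampler: the background model is ignored, which is an assumption made here for brevity, and the function name is invented for the example.

```python
# Motif matrix Θ from the slide: theta[base][position].
theta = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

S = ("GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA"
     "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA")


def window_likelihoods(sequence, motif):
    """Probability of each window under the motif model, by start position."""
    width = len(motif["A"])
    likelihoods = []
    for start in range(len(sequence) - width + 1):
        probability = 1.0
        for offset, base in enumerate(sequence[start:start + width]):
            probability *= motif[base][offset]
        likelihoods.append(probability)
    return likelihoods


likelihoods = window_likelihoods(S, theta)
possible_starts = [i for i, p in enumerate(likelihoods) if p > 0]

# Normalized weights: how a sampler could choose one start position given Θ.
total = sum(likelihoods)
weights = {i: likelihoods[i] / total for i in possible_starts}
print(possible_starts)
print({i: round(w, 3) for i, w in weights.items()})
```

The two windows with nonzero probability start at 0-based positions 14 and 41, exactly where the indicator A on the slide has its two 1s. A full sampler would alternate a step like this (draw A given Θ) with re-estimating Θ from the sites selected by A.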