Genomics and High Throughput Sequencing Technologies: Applications Jim Noonan Department of Genetics
Outline
Personal genome sequencing
• Rationale: understanding human disease
• Variant discovery and interpretation
• Genome reduction strategies (exome sequencing)
Functional analysis of biological systems using sequencing
• Transcriptome analysis: RNA-seq
• Regulatory element discovery: ChIP-seq
• Chromatin state profiling and the ‘histone code’
• Large-scale efforts: ENCODE and the NIH Epigenome Roadmap
The genetic architecture of human disease State, MW. Neuron 68:254 (2010)
Challenge: Interpreting genetic variation Cooper and Shendure, Nat Rev Genet 12:628 (2011)
Tools for identifying rare damaging mutations Protein-sequence based DNA-sequence based
All humans have rare damaging mutations Damages protein Conserved Cooper and Shendure, Nat Rev Genet 12:628 (2011)
Genome reduction: Exome sequencing Bamshad et al. Nat Rev Genet 12:745 (2011)
Finding disease-causing rare variants by exome sequencing
Screen unrelated trios for recurrence
De novo mutation
• Likely to have functional effect
• Recurrence in independent affected individuals
• Absence in controls
• Reveal critical pathways in disease
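The trio screen above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes variant calls are already available as sets of (chrom, pos, ref, alt) tuples per individual, and the function names, the `variant_to_gene` mapping, and the `min_families` threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict

def de_novo_variants(child, mother, father):
    """Variants called in the child but in neither parent (sets of tuples)."""
    return child - mother - father

def recurrent_genes(trios, variant_to_gene, controls=frozenset(), min_families=2):
    """Genes carrying de novo variants in at least `min_families` unrelated trios.

    trios: {family_id: (child_variants, mother_variants, father_variants)}
    variant_to_gene: hypothetical lookup from variant tuple to gene symbol
    controls: variants observed in unaffected controls (excluded)
    """
    families_per_gene = defaultdict(set)
    for family_id, (child, mother, father) in trios.items():
        for variant in de_novo_variants(child, mother, father):
            if variant in controls:
                continue  # seen in controls, so unlikely to be causal
            gene = variant_to_gene.get(variant)
            if gene is not None:
                families_per_gene[gene].add(family_id)
    return {gene: sorted(families)
            for gene, families in families_per_gene.items()
            if len(families) >= min_families}
```

Real data need more care than set subtraction: an apparent de novo call can come from a sequencing error in the child or a missed call in a parent, so genotype quality and coverage in all three samples would have to be checked before counting recurrence.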
Outline
Personal genome sequencing
• Rationale: understanding human disease
• Variant discovery and interpretation
• Genome reduction strategies (exome sequencing)
• Challenges to de novo genome assembly using short reads
Functional analysis of biological systems using sequencing
• Transcriptome analysis: RNA-seq
• Regulatory element discovery: ChIP-seq
• Chromatin state profiling and the ‘histone code’
• Large-scale efforts: ENCODE and the NIH Epigenome Roadmap
mRNA-seq workflow Wang et al. Nat Rev Genet 10:57 (2009) Martin and Wang Nat Rev Genet 12:671 (2011)
Gene expression profiling by massively parallel RNA sequencing (RNA-seq)
Quantifying gene expression by RNA-seq
• Use existing gene annotation:
  • Align to genome plus annotated splices
  • Depends on high-quality gene annotation
  • Which annotation to use: RefSeq, GENCODE, UCSC?
  • Isoform quantification?
  • Identifying novel transcripts?
• Reference-guided alignments:
  • Align to genome sequence
  • Infer splice events from reads
  • Allows transcriptome analyses of genomes with poor gene annotation
• De novo transcript assembly:
  • Assemble transcripts directly from reads
  • Allows transcriptome analyses of species without reference genomes
RNA-seq reads mapped to reference
Normalization methods: Reads per kilobase of feature length per million mapped reads (RPKM)
• What is a “feature?”
• What about genomes with poor gene annotation?
• What about species with no sequenced genome?
For a detailed comparison of normalization methods, see Bullard et al. BMC Bioinformatics 11:94 (2010).
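The RPKM definition above reduces to a one-line calculation. The sketch below is illustrative only; the function and parameter names are assumptions, and the choice of "feature" (exon, composite gene model, isoform) is left to the caller.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature length per million mapped reads.

    RPKM = reads / (feature_length / 1e3) / (total_mapped / 1e6)
         = reads * 1e9 / (feature_length * total_mapped)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)

# 500 reads on a 2,000 bp feature in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because the feature length sits in the denominator, the same read count gives a different RPKM depending on which gene model is used, which is why the annotation questions on this slide matter.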
What depth of sequencing is required to characterize a transcriptome? Wang et al. Nat Rev Genet 10:57 (2009)
Considerations
• Gene length:
  • Long genes are detected before short genes
• Expression level:
  • High expressors are detected before low expressors
• Complexity of the transcriptome:
  • Tissues with many cell types require more sequencing
• Feature type
  • Composite gene models
  • Common isoforms
  • Rare isoforms
• Detection vs. quantification
  • Obtaining confident expression level estimates (e.g., “stable” RPKMs) requires greater coverage
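A toy model makes the length and expression effects above concrete. The sketch assumes that reads are sampled in proportion to transcript length times abundance and that per-gene read counts are Poisson distributed; both are simplifying assumptions (real counts are typically overdispersed), and the function names are hypothetical.

```python
import math

def expected_reads(total_reads, length_bp, abundance, transcriptome):
    """Expected read count for one gene.

    transcriptome: list of (length_bp, abundance) for every expressed gene,
    including the gene of interest. Sampling weight = length * abundance.
    """
    total_weight = sum(length * amount for length, amount in transcriptome)
    return total_reads * length_bp * abundance / total_weight

def detection_probability(mean_reads, min_reads=1):
    """Poisson probability of observing at least `min_reads` reads."""
    below = sum(math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
                for i in range(min_reads))
    return 1.0 - below
```

Under this model, a gene that is four times longer, or four times more abundant, reaches a given detection probability at one quarter of the sequencing depth. Raising `min_reads` mimics the step from detection to quantification, where more reads per gene are needed for a stable estimate.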
Pervasive alternative splicing in humans Wang et al. Nature 456:470 (2008)
Composite gene model approach
Map reads to genome
Map remaining reads to known splice junctions
• Requires good gene models
• Isoforms are ignored
• Which annotation to use: RefSeq, GENCODE, UCSC?
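One common reading of a composite gene model is the union of exons across all annotated isoforms of a gene. The sketch below builds that union, assuming exons are given as half-open (start, end) intervals on a single chromosome and strand; the function names and data layout are hypothetical, and the slides do not specify how the composite model was actually constructed.

```python
def composite_exons(isoforms):
    """Merge the exons of all isoforms into sorted, non-overlapping intervals.

    isoforms: list of isoforms, each a list of (start, end) half-open exons.
    """
    exons = sorted(exon for isoform in isoforms for exon in isoform)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def composite_length(isoforms):
    """Total length in bp of the composite model, usable as the RPKM feature length."""
    return sum(end - start for start, end in composite_exons(isoforms))
```

Collapsing isoforms this way gives one expression value per gene, which is exactly why isoform-level differences are invisible under this approach.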
Strategies for transcript assembly Garber et al. Nat Methods 8:469 (2011)
ChIP-seq
• Transcription factors
• General transcription machinery
• Modifications to histone tails
• Methylated DNA
Rationale: identifying regulatory elements in genomes Noonan and McCallion, Ann Rev Genomics Hum Genet 11:1 (2010)
ChIP-seq peak calling
ChIP-seq is an enrichment method
Requires a statistical framework for determining the significance of enrichment
ChIP-seq ‘peaks’ are regions of enriched read density relative to an input control
Input = sonicated chromatin collected prior to immunoprecipitation
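To illustrate what a statistical framework for enrichment can look like, the sketch below tests each genomic window with a one-sided Poisson test, using the library-size-scaled input count as the expected background. This is one simple option among many, not the method of any particular peak caller; the window layout, pseudocount, and threshold `alpha` are assumptions made for the example, and no multiple-testing correction is applied.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed in log space to avoid overflow."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
              for i in range(k))
    return max(0.0, 1.0 - cdf)  # precision is limited for very small p-values

def call_windows(chip_counts, input_counts, chip_total, input_total,
                 alpha=1e-5, pseudocount=1.0):
    """Return (window_index, chip_count, expected, p_value) for enriched windows."""
    scale = chip_total / input_total  # normalize input to ChIP library size
    enriched = []
    for index, (chip, control) in enumerate(zip(chip_counts, input_counts)):
        expected = (control + pseudocount) * scale
        p_value = poisson_sf(chip, expected)
        if p_value < alpha:
            enriched.append((index, chip, expected, p_value))
    return enriched
```

Using the input track as the background matters because sonication and mappability biases produce read pileups in input chromatin too; a window is only called when the ChIP signal exceeds what the input predicts.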
There are many ChIP-seq peak calling methods Wilbanks and Facciotti PLoS ONE 5:e11471 (2010)
The histone code Zhou et al. Nat Rev Genet 12:7 (2011)
Mapping and analysis of chromatin state dynamics in nine human cell types
• Cell types:
  • H1 ESC
  • K562 (erythroleukemia)
  • GM12878 (B-lymphoblastoid)
  • HepG2 (hepatocellular carcinoma)
  • HUVEC (umbilical vein endothelium)
  • HSMM (skeletal muscle myoblasts)
  • NHLF (lung fibroblast)
  • NHEK (epidermal keratinocytes)
  • HMEC (mammary epithelium)
• Marks:
  • H3K4me3 (promoter/enhancer)
  • H3K4me2 (promoter/enhancer)
  • H3K4me1 (enhancer)
  • H3K9ac (promoter/enhancer)
  • H3K27ac (promoter/enhancer)
  • H3K36me3 (transcribed regions)
  • H4K20me1 (transcribed regions)
  • H3K27me3 (Polycomb repression)
  • CTCF
Ernst et al., Nature 473:43 (2011)
Mapping and analysis of chromatin state dynamics in nine human cell types Ernst et al., Nature 473:43 (2011)
Chromatin state dynamics at WLS Ernst et al., Nature 473:43 (2011)
Functions associated with putative promoter and enhancer states • Annotation based on nearest TSS
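Annotating a putative regulatory element by its nearest TSS amounts to a nearest-neighbor lookup on sorted coordinates. The sketch below assumes TSS positions for a single chromosome are supplied in sorted order with a parallel list of gene names; the function name and data layout are hypothetical.

```python
import bisect

def nearest_tss(position, tss_positions, tss_genes):
    """Return (gene, distance_bp) for the TSS closest to `position`.

    tss_positions: sorted TSS coordinates on one chromosome
    tss_genes: gene names in the same order as tss_positions
    """
    if not tss_positions:
        raise ValueError("no TSS positions supplied")
    i = bisect.bisect_left(tss_positions, position)
    # Only the neighbors on either side of the insertion point can be closest
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - position))
    return tss_genes[best], abs(tss_positions[best] - position)
```

Nearest-TSS assignment is a heuristic: enhancers can act on genes that are not their closest neighbor, so functional enrichments derived this way are approximate.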
ChIP-seq: enhancer identification in vivo
Visel et al. Nature 457:854 (2009)
• p300 = enhancer-associated factor
• p300 binding = ~90% predictive of enhancer activity
Systematic experimental annotation of regulatory functions Myers, PLoS Biol 9:e1001046 (2011)
The ENCODE Project http://genome.ucsc.edu/ENCODE/
The NIH Roadmap Epigenomics Project http://www.roadmapepigenomics.org/
ENCODE cell lines Myers, PLoS Biol 9:e1001046 (2011)
ENCODE Project data access http://genome.ucsc.edu/ENCODE/
Genome Browser interface and data types
Genome Viewer
Categories of data: displayed as tracks
Discrete intervals (genes) or continuous (transcription)
• Hyperlinks and pulldown tabs for individual tracks
  • Go to track description page
  • Hide or show data in genome viewer
• Some tracks include multiple datasets (‘subtracks’)
  • Go to track description page to select
ENCODE Transcription track Display options Subtracks
Conclusions
Personal genomics is becoming a reality
• Genome sequencing will be a routine diagnostic tool
• $5,000 to sequence a single genome: the current cost for clinical resequencing of single genes
• Your genome will be sequenced
• Long-read sequencing will solve de novo assembly issues
• Data analysis and interpretation
RNA-seq and ChIP-seq
• Identifying genes and annotating regulatory function within and among genomes
• Computational issues: data normalization, peak calling, differential expression and binding
• Large-scale studies revealing regulatory architecture of human & model genomes