Session -- Using DNA sequence to detect variation related to disease • Richard Wilson – WashU – deep sequencing of cancer tumors (AML) identified variations in 8 genes • Richard Gibbs – Baylor College of Medicine – "Complete Genomics" – genome for < $5,000 • Accurate sequencing by hybridization for DNA diagnostics and individual genomics, Drmanac, et al., Nature Biotechnology ASHG Redux 2008
Session -- Using DNA sequence to detect variation related to disease • Michael Stratton – Wellcome Trust Sanger Institute – genomic sequencing of breast cancer cell lines • Copy number variations ("structural variants") • "genomic shards" – 305 rearrangements in breast cancer cell line • Difficult to assemble with short-read technology
Session – Genomics I • Sharp – whole genome screen for novel imprinting genes • Bisulphite treatment – converts all unmethylated C's to U (uracil); after sequencing, the C's that remain mark the methylated sites • Drawback – harsh, fragments DNA • High density HapMap of Humans, Dogs, and Cattle • Genotyped 900 dogs with Affy 2.0 array at 61,344 SNPs • Dogs have very uniform phylogenetic tree with breed-specific recombination rates
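The bisulphite idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a real analysis pipeline: the function names, the toy reference sequence, and the set of methylated positions are all made up for the example, and it assumes complete conversion and an error-free read aligned to the reference.

```python
def bisulfite_convert(seq, methylated_positions):
    """Turn every unmethylated C into U; methylated C's are protected."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylated_sites(reference, converted):
    """Return positions where the reference has C and the read still shows C."""
    # During amplification and sequencing, U is read out as T.
    read = converted.replace("U", "T")
    return [
        i for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGATCGA"   # hypothetical toy sequence
methylated = {1, 5}          # hypothetical methylated C positions (0-based)

converted = bisulfite_convert(reference, methylated)
sites = call_methylated_sites(reference, converted)
print(converted)
print(sites)
```

Comparing the converted read back to the reference is what identifies the methylated sites: any C that survives the treatment must have been protected.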
Session – Genomics I • Biesecker – ClinSeq – effort to map phenotypic features to genotypes for atherosclerosis • 1000 subjects • [Figure: diagram with labels Penetrance, SNP Freq, Rare Mendelian Variants, Common Mendelian Variants, Common SNPs, Desired Data, Unknown Territory, Clinical data, Subjects, Genome]
Session – Genomics II • BGI (Beijing Genomics Institute) • First Asian genome sequenced • 100 bioinformaticians (-> 300) • 18 Solexas • 5 454s • 4 SOLiDs (?) • Altshuler (1000 Genomes Project) – effort to sequence 1000 genomes to catalogue variations in genome • www.1000genomes.org • Doubled the amount of sequence in GenBank in Sept. • Again in October • Data release – Jan 2009
Reference: "Discovering Genomics, Proteomics, and Bioinformatics." Second Edition. Campbell and Heyer. 2007. ISBN: 0-8053-8219-4. Chapter 2: Genome Sequence
reduction -- for a very long time molecular methods were primarily tools to dissect cells and understand how parts work in isolation • expansion -- genomics, in theory, enables science to begin piecing together how parts work together as a system (systems biology?) genomics
What is Genomics? • How to sequence a genome? • Annotating (annotation) • Protein function • Gene Ontology Overview
"involves large data sets" • human genome -- 3 billion nucleotides • hundreds of genomes have been finished • "high-throughput methods" • sequencing • measuring the expression of all genes • genotyping (1,000,000 SNPs on 1 chip) • other -omes • proteome, transcriptome, metabolome, variome?, exome • http://cancergenome.nih.gov/media/process_textonly.asp Genomics
preliminary sequencing • finishing (not always performed -- coverage) • annotating • The "dideoxy method" • Need (for DNA replication): • DNA, DNA polymerase, primers, deoxyribonucleotide triphosphates (dNTPs: G, T, A, C; one with radioactive atoms), dideoxyribonucleotide triphosphates (ddNTPs) How do we sequence a genome?
Next-generation sequencing technology • Cost per nucleotide down by factor of 100-1000 • Cost per run is still very high • Expen$ive for validation on an individual basis • Dideoxy method is very mature, very well understood Dideoxy Method Obsolete?
Under normal DNA polymerization, dNTPs are added to the end of the elongating strand of DNA. • If a ddNTP is incorporated, the elongation terminates -- the ddNTP also carries a "label" -- radioactive isotope or fluorescent dye • This is performed in 4 different containers (test tubes), with each test tube having one of ddATP, ddGTP, ddCTP, or ddTTP. • Therefore, every fragment in a given tube terminates with the same ddNTP • Run these out on a gel, and smallest migrate fastest. • Expose to x-ray film (or scan with laser), read gel dideoxy method
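The logic of the four-tube method can be sketched as a toy simulation. It is idealized: it assumes every position in the newly synthesized strand is represented by exactly one terminated fragment, whereas in a real reaction termination is random across many molecules. The function names and the example strand are illustrative only.

```python
def dideoxy_fragments(new_strand):
    """For each ddNTP tube, list the lengths of the fragments it terminates."""
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # A fragment of this length ends in `base`, so it is found
        # only in the tube that contains the matching ddNTP.
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Read bands from shortest (fastest migrating) to longest."""
    bands = [(length, base) for base, lengths in tubes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


strand = "GATTACA"  # hypothetical newly synthesized strand
tubes = dideoxy_fragments(strand)
print(tubes)
print(read_gel(tubes))
```

Each tube (lane) only reveals where one base occurs; the sequence comes from ordering the bands of all four lanes by fragment length.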
Note -- this is pretty awful work • The gel material is toxic • Working with radioactive molecules • Slow and tedious • reading bands on glass • capturing/entering data • 500 bases took 24 hours (16,438 years to do the human genome with this method) Comment
Leroy Hood -- developed nonradioactive dideoxy method • Each ddNTP is "labeled" with a different fluorescent dye • 1 lane could be used instead of 4 (why?) • A laser excites the dye, the band can be "read", indicating which ddNTP terminated the sequence • The intensities of these bands are now captured and graphed -- in what is called a chromatogram • Lane in a gel is replaced with a capillary • Can run 96, or 384 capillaries at a time (Applied Biosystems) • A run is approximately 1 hour • 500 bases * 384 cap per hour ==> about 651 days for the human genome Automated sequencing
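The throughput estimates for manual gels and for capillary machines can be checked with simple arithmetic. This sketch assumes a 3-billion-base genome, a single pass with no redundant coverage, a 365-day year, and an instrument that runs around the clock; the function name is made up for the example.

```python
GENOME_BASES = 3_000_000_000  # human genome, about 3 billion nucleotides


def days_to_sequence(bases_per_lane, hours_per_run, lanes=1):
    """Days needed for one pass over the genome at the given throughput."""
    runs = GENOME_BASES / (bases_per_lane * lanes)
    return runs * hours_per_run / 24


# Manual gel: 500 bases per 24 hours.
manual_years = days_to_sequence(500, 24) / 365

# Capillary machine: 384 capillaries, 500 bases each, about 1 hour per run.
capillary_days = days_to_sequence(500, 1, lanes=384)

print(f"manual gel: about {manual_years:,.0f} years")
print(f"384 capillaries: about {capillary_days:,.0f} days")
```

Under these assumptions the manual method needs roughly 16,438 years, while one 384-capillary machine needs roughly 651 days, a bit under two years.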
Big 7 • human, mouse, yeast, E. coli, fly, worm, Arabidopsis • medical applications • Pseudomonas aeruginosa (CF infection), mosquito, trypanosomes, HIV • evolutionary significance • microbes, archaea, chimp, gorilla, fugu fish • environmental impact • microbes • food production • wheat, rice, bovine, pig, yeast Choosing genomes
Automated sequencing almost requires automated base-calling • PHRED • reads chromatograms • quality assessment (for re-sequencing) • peak height and spacing • assemble multiple reads (PHRAP) into a "contig" • What about mutations, variations, SNPs? • Gaps • requires human intervention -- techniques to try and span specific DNA regions • ex) chromosome walking Automated Reads
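Base-calling quality is commonly reported as a Phred score, defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. That definition is not stated on the slide; it is the standard convention. The sketch below converts between Q and P and flags low-quality positions as candidates for re-sequencing. The cutoff of 20 and the example scores are assumptions chosen for illustration.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an estimated error probability p."""
    return -10 * math.log10(p)


def low_quality_positions(qualities, threshold=20):
    """Indices of base calls scoring below the threshold."""
    return [i for i, q in enumerate(qualities) if q < threshold]


quals = [35, 40, 12, 38, 8, 30]  # hypothetical per-base scores for one read
flagged = low_quality_positions(quals)

print(phred_to_error_prob(20))   # 1 error in 100 calls
print(flagged)
```

A score of 20 corresponds to a 1-in-100 chance of error and 30 to 1-in-1000, so even a small drop in score marks a large drop in reliability.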
2001 draft sequence published • 147,821 gaps • pressure to publish a sequence because of Celera and Craig Venter • 2004 • 341 gaps • Usually repeats (but may be epigenetic) • Very expensive to completely finish • many genomes never "finished" Gaps
"functionally" important sections of a genome • exons, introns, promoters, enhancers, splice sites, UTR's, • pseudogenes, SNPs, markers, repeats, Alus, gene duplications, gene families, micro-RNAs, methylation, phosphorylation, tissue specific alternative splicing, copy number variations, (CNVs, also called "structural variations") differential expression, gene function, ???? Annotation
Gene prediction (ORF finding) • was a hot topic • cooled when it became clear that EST sequencing was far superior • EST sequencing in human (and some model organisms -- rat, mouse, others) was very extensive -- millions of sequencing reads • The most effective approach to gene finding was the overlaying of EST sequences to genomic sequence (but note you need both). • Gene prediction was 40-60% accurate at best • Gene prediction has made a bit of a resurgence because of the cost savings of "in silico" gene finding Gene Identification
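A minimal sketch of ORF finding: scan each of the three forward reading frames for an ATG followed, in frame, by a stop codon. This is deliberately naive and is not how any particular gene-prediction program works. It ignores the reverse strand, introns, and splice signals, and the `min_codons` cutoff and example sequence are made up for the illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, orf) tuples for ATG...stop in the 3 forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, seq[start:end]))
                start = None                   # look for the next ATG
    return orfs


seq = "CCATGGCTTGGTAACATGA"  # hypothetical toy sequence
orfs = find_orfs(seq)
print(orfs)
```

Because eukaryotic genes are interrupted by introns, a scan like this only illustrates the idea. Overlaying EST sequences on the genome, as the slide notes, uses direct evidence of what is actually transcribed.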
text -- mammalian genome contains approximately 225 bp per kb of pseudogenes • What are pseudogenes? Pseudogenes