340 likes | 522 Views
the most important insight from Watson & Crick’s structure for DNA was that genetic information is digitally encoded by lucky happenstance the genetics revolution has coincided with a revolution in our ability to process digital information. 1953.
E N D
the most important insight from Watson & Crick’s structure for DNA was that genetic information is digitally encoded by lucky happenstance the genetics revolution has coincided with a revolution in our ability to process digital information 1953
computational algorithms to find and align DNA sequences Query 554 TGGGGCTGGCAACAACTGGGCCAAAGGCCACTACACGGAGGGAGCCGAGCTGATCGAGAA 613 ||| || || || ||||||||||| ||||||||||| ||||||||||||||| |||| Sbjct 3095150 TGGAGCCGGGAATAACTGGGCCAAGGGCCACTACACAGAGGGAGCCGAGCTGGTCGACTC 3095091 Query 614 TGTCCTAGAGGTGGTGAGGCACGAGAGT--GAGAGCTGTGACTGCCTGCAGGGCTTCCAG 671 ||||| || ||||||||| | | |||| |||||||||||||| || |||||||||||| Sbjct 3095090 GGTCCTGGATGTGGTGAGG-AAG-GAGTCAGAGAGCTGTGACTGTCTCCAGGGCTTCCAG 3095033 Query 672 ATCG-TCCACTCCCTGGGCGGG-GGCACAGGCTCCGGGATGGGCACTCTGCTCATGAACA 729 | | |||||| ||||| ||| ||||| || |||||||||||||| |||||||| | || Sbjct 3095032 CT-GACCCACTCTCTGGG-GGGCGGCACGGGGTCCGGGATGGGCACCCTGCTCATCAGCA 3094975 Query 730 AGATTAGAGAGGAGTACCCGGACCGGATCATGAATTCCTTCAGCGTCATGCCTTCTCCCA 789 |||| | || |||||||| ||||| |||||||| |||||||||||||||| || |||| Sbjct 3094974 AGATCCGGGAAGAGTACCCAGACCGCATCATGAACACCTTCAGCGTCATGCCCTCACCCA 3094915 Query 790 AGGTGTCGGACACGGTGGTGGAGCCCTACAACGCGGTTCTGTCTATCCACCAGCTGATTG 849 ||||||| |||||||||||||||||||||||||| || ||| ||||||||||| | | Sbjct 3094914 AGGTGTCAGACACGGTGGTGGAGCCCTACAACGCCACCCTCTCTGTCCACCAGCTGGTGG 3094855 Smith-Waterman is the ideal; BLAST is faster but less sensitive; the compromises continue with nextgen algorithms (e.g. SOAP)
presidential announcement for sequencing of the human genome one year before its publication IHGSC Celera 26 Jun 2000: Craig Venter, Bill Clinton, Francis Collins
open access has been official policy since 1996 Bermuda Rules UCSC’s browser (US) http://genome.ucsc.edu/cgi-bin/hgTracks?org=human Ensembl’s contigView (UK) http://www.ensembl.org/Homo_sapiens/Location/View?r=X:151073054-151383976 NCBI’s mapViewer (US) http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&CHR=X&BEG=151073054&END=151383976 Nature’s human genome http://www.nature.com/nature/supplements/collections/humangenome/index.html
2001: parameters for shotgun sequencing of the human genome 500-bp sequence read << 3-Gb (3109) human genome [NB: read lengths even shorter with nextgen technology] Lander ES, Waterman MS. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2: 231-239 LW: random sampling at 1x coverage will be full of gaps LW: need 4x~8x coverage in order to bridge most of the gaps [NB: required coverage up to 50x with 100-bp nextgen reads] redundancy sounds expensive but it is actually cheaper private investments in technology development determine costs redundancy of coverage lowers substitutional error rate
hierarchical versus whole-genome-shotgun sequencing Celera: whole genome shotgun (WGS) 200 kb 500 bp IHGSC: hierarchical (BAC by BAC)
repeat1 repeat2 repeat2 repeat1 impossible to decide between assemblies 416 = 4.3-Gb suggests that a 16-bp overlap rule will suffice for the 3-Gb human genome, but this assumes the sequence is random and it is not repeated sequences cause the shotgun data to be misassembled
after a transposable element (TE) is inserted into the genome it is under no selectional pressure; hence over a period of a few hundred million years it becomes indistinguishable from random sequence almost half of human genome is identifiably transposable elements
biological definition of “repeat” overstates misassembly problem SIMILARITY of two DNA sequences is statistically significant if 50% of the bases match exactly (in contrast the threshold for similarity of two protein sequences is not well defined but the criterion normally used is that 30% of the amino acids must match) shotgun misassemblies occur when two sequences are more-or-less exactly repeated, in comparison to the mathematical rule programmed into the computer to determine when an overlap should be accepted, which is in general much more stringent than 50% biological nature of the repeated motif (e.g. transposable elements vs gene duplicates) is not directly important
read mates (clone-end pairs) bridge over troublesome repeats AGCATGCTGCAGTCA------------------------------------------------CGTCATGCTTAGGCTA sequence from end 1-------------------------------------------------sequence from end 2 distance between two ends is known and adjustable
factors that affect misassembly age of repeat: two identical sequences will diverge over time, so it is the younger repeats that are more likely to cause problems length of repeat: the longer a repeat is, in comparison to the sequence read, the more likely it will cause problems copy number: we can detect high copy number repeats by counting how often every sequence motif occurs in the data set, so it is only the undetectable low copy number repeats that cause problems inbred vs outbred: the overlap rules must allow for polymorphisms, so if we use an outbred that causes problems Phred quality: one of the most important breakthroughs was the ability to compute an accurate error probability for every base call clone insert sizes: many problems are circumvented if the read mates have the “right” distribution of insert sizes
recent segmental duplicationsin IHGSC and Celera genomes proportion of Celera aligned bases falls rapidly as identity exceeds 97% or length exceeds 15-kb but the total sequence lost is still just 2% to 3%
“Although it is clear that the detailed clone-ordered approach is superior in the resolution of segmental duplications, it would be unrealistic to propose that the sequencing community should abandon whole-genome-shotgun based approaches. These are the most efficient cost-effective means of capturing the bulk of the euchromatic sequence.”Evan Eichler (21 Oct 2004; Nature 431: 927-930)
1000 plant transcriptomes launched at end of 2008 10000 vertebrate genomes launched at end of 2010 back to BACs in future?
revisions to the “centraldogma” of molecular biology mRNA1 no introns protein isoform1 genomic DNA pre-mRNA with introns alternative splicing mRNA2 no introns protein isoform2 non-coding RNA genes since around 2000 scientists have characterized a new category of regulatory non-coding RNA genes whose existence had never even been suspected; they are distinguished by their small sizes relative to traditional RNA genes (i.e. 21 to 25 bp versus 96 bp) and are therefore commonly referred to as “microRNAs”
gene poor regions due to large introns or large intergenic regionshuman genes can be up to 2 Mb; rice genes are never over 10 kb Model 1: observed in animals large intron Model 2: observed in plants large intergenic region animal vs plant comparisons are discussed in rice genome paper Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92
3.6 Mb intron full of microsatellites AM Reugels, et al. 2000. Genetics 154: 759-769. 3.6 Mbp intron in the dynein gene DhDhc7(Y) on the heterochromatic Drosophila hydei Y chromosome
3 June 2003: Ewan Birney of the European Bioinformatics Institute (EBI) at Cambridge (UK) estimated that there are 24,500 protein coding genes, of which 3000 are most likely pseudogenes that do not count. Apparently only a few bettors placed their money on a number that low. Lee Rowen at the Institute for Systems Biology (Seattle) was the closest, betting in 2001 on a gene count of 25,947. She got half of the $1200 pool money. Sharing the other half were Paul Dear of the UK Medical Research Council, and Olivier Jaillon of Genoscope in France, with the next closest bets. human gene count betting pool everybody over-estimated the number of genes
??? many protein sequences are conserved even in the prokaryotes pie chart based on IHGSC’s 2001 paper; conclusions have not fundamentally changed with improvements in gene annotation
many protein “functions” are unknown (from homology searches) pie chart based on Celera’s 2001 paper; conclusions have not fundamentally changed with improvements in gene annotation
why were the gene estimates so much higher than the final answer? pre-genome era count was based on gross under estimate of average gene size, i.e. 3 Gb genome divided by 30 kb average predicted 100,000 genes we do not know how the splicing machinery works in sufficient detail to identify exon-intron boundaries from genome sequence alone, especially for the largest genes; hence in the interim we require additional evidence to argue that an open reading frame encodes a biologically functional protein that evidence might include mRNA sequences or evolutionary signatures of amino-acid-preserving substitutions and reading-frame-preserving deletions; M Clamp, et al. 2007. Proc Natl Acad Sci USA 104:19428
annotation by hybridizationof mRNAs to tiling microarrays Shoemaker DD, et al. 2001.Nature 409: 922-927 nonrepetitive half of human genome requires 150 million probes if tiled at 10 bp steps ALTERNATIVELY, you could sequence mRNAs outright, but the costs can be prohibitively high if you want to sequence the lowest abundance genes
most of the human genome is converted into primary transcripts June 2007 1% of the human genome was experimentally assessed using state-of-the-art methods largely based on microarrays. In particular GENCODE annotations, RACE-array experiments, and PET tags were used to assess the presence of a nucleotide in a primary transcript. Because high throughput methodologies tend to be less reliable the authors show how many different technologies and how many observations (with the same technology) each claim is based on.
gene number estimates as a function of time and methodology genome is sequenced observed transcripts genes dark matter 21k protein coding loci time dark matter is reproducible, but it’s poorly transcribed, poorly conserved, non protein coding, and outnumbers validated microRNAs by ~1000 fold
dark matter raised the possibility that there was an enormous number of non-coding RNA genes that were much larger in size than microRNAs, that were species specific, and that could potentially explain many important evolutionary differences why so few proteins?!? QUESTION: should evolutionary innovation be found in genome elements that are specific to species or specific to phylogenetic/taxonomic lineage?
The Nobel Prize in Physiology or Medicine 2006 Andrew Z. Fire Stanford University School of Medicine b. 1959 Craig C. Mello University of Massachusetts Medical School b. 1960 for their discovery of RNA interference gene silencing by double-strandedRNA readily available reagents to silence almost any gene in any model organism
what most likely started as an antiviral defense mechanism has evolved into an important gene regulatory mechanism S Griffiths-Jones, et al. 2008. Nucleic Acids Res 36: D154-D158.
dark matter likely to be artifact of hybridization based tiling microarrays relation between fraction of detected transcript fragments on tiling arrays (transfrags) or in RNA-Seq data (seqfrags) that overlap known RefSeq exons (i.e. precision) and the total fraction of exons recovered (i.e. recall) taken from figure 1 of H van Bakel, et al. 2010. PLoS Biol 8: e1000371
hybridization is imprecise; microarrays were always the poor man’s solution for what should ultimately be done by sequencing 1 gcacctcgct gctccagcct ctggggcgca ttccaacctt ccagcctgcg acctgcggag 61 aaaaaaaatt acttattttc ttgccccata cataccttga ggcgagcaaa aaaattaaat 21 tttaaccatg agggaaatcg tgcacatcca ggctggtcag tgtggcaacc agatcggtgc 181 caagttctgg gaggtgatca gtgatgaaca tggcatcgac cccaccggca cctaccacgg 241 ggacagcgac ctgcagctgg accgcatctc tgtgtactac aatgaagcca caggtggcaa 301 atatgttcct cgtgccatcc tggtggatct agaacctggg accatggact ctgttcgctc 361 aggtcctttt ggccagatct ttagaccaga caactttgta tttggtcagt ctggggcagg 421 taacaactgg gccaaaggcc actacacaga gggcgccgag ctggttgatt ctgtcctgga 481 tgtggtacgg aaggaggcag agagctgtga ctgcctgcag ggcttccagc tgacccactc 541 actgggcggg ggcacaggct ctggaatggg cactctcctt atcagcaaga tccgagaaga 601 ataccctgat cgcatcatga ataccttcag tgtggtgcct tcacccaaag tgtctgacac 661 cgtggtcgag ccctacaatg ccaccctctc cgtccatcag ttggtagaga atactgatga 721 gacctattgc attgacaacg aggccctcta tgatatctgc ttccgcactc tgaagctgac 781 cacaccaacc tacggggatc tgaaccacct tgtctcagcc accatgagtg gtgtcaccac
three distinct modes of evolution common in plants common in animals SB Carroll. 2005. Evolution at two levels: on genes and form. PLoS Biol 3: e245 protein diversity like computer hardware; gene regulation like computer software
why spend a lot more to sequence a genome when most annotated genes are based on transcriptome sequences? based on genomes & transcriptomes we know 21k protein coding genes identified of which 18k have been cloned for the Mammalian Gene Collection alternative splicing on genome sequence produces multiple isoforms of essentially every human gene BUT only from the genomes did we also learn 1.5% of genome encodes for proteins but at least 6% encodes for conserved non-coding elements that have been repeatedly implicated in regulation of gene expression
PhastCons conserved genome sequences for vertebrate species A Siepel, et al. 2005. Genome Res 15:1034-1050. conserved genome sequences are mostly NOT in protein coding regions; so we call them conserved non-coding elements (CNEs)
Exons 7–11 of the RNA-edited human gene GRIA2 are shown. Peaks in the conservation plot generally correspond to exons and valleys to noncoding regions, but a 158-bp conserved noncoding element can be seen near the 3′ end of exon 11. This region includes the editing complementary sequence (ECS) of the RNA editing site in exon 11.
evolution of species dependsmore on innovation in regulatory CNE sequence than proteins At least 20% of the CNEs conserved among placental mammals are absent in marsupial mammals, as compared to only 1% for protein-coding sequences. Compared to birds about 70% of placental CNEs are absent. By the time we get to fishes everything is lost. TS Mikkelsen, et al. 2007. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447: 167-177. fish bird marsupial placental 90 Mya 180 Mya 310 Mya pairwise comparison of any two genomes does not identify many CNEs; most of the statistical power comes from simultaneous alignment of multiple genomes Q: to identify human specific CNEs, how many human genomes are required? 450 Mya