Why Is Sequence Comparison Useful? Lipman, David (NIH/NLM/NCBI). Almost 100 trillion BLAST comparisons per quarter (10/01). Rapid similarity searches of nucleic acid and protein data banks. Wilbur WJ, Lipman DJ. Proc Natl Acad Sci U S A 1983 Feb;80(3):726-30
Why Is Sequence Comparison Useful? Lipman, David (NIH/NLM/NCBI)
Rapid similarity searches of nucleic acid and protein data banks. Wilbur WJ, Lipman DJ.
• Proc Natl Acad Sci U S A 1983 Feb;80(3):726-30
• With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
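The k-tuple idea in the abstract can be illustrated with a minimal Python sketch. This is not the published implementation: it only indexes every k-tuple of the query, scans one database sequence, and counts k-tuple matches per diagonal (the offset between subject and query positions). The function names, the default k = 2, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict


def index_ktuples(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, subject, k=2):
    """Count k-tuple matches on each diagonal (subject pos - query pos)."""
    table = index_ktuples(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # .get avoids creating empty entries for k-tuples absent from the query
        for i in table.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) with the most matches, or None if no hits."""
    hits = diagonal_hits(query, subject, k)
    if not hits:
        return None
    # Ties are broken in favor of the diagonal closest to zero (an arbitrary choice).
    return max(hits.items(), key=lambda item: (item[1], -abs(item[0])))
```

A diagonal that collects many hits marks a candidate region of similarity; only such candidates would then need a slower, more rigorous alignment, which is where the time saving comes from.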
Cancer Gene Meets Its Match
NY Times, July 3, 1983 “…a serendipitous computer search…”
Waterfield MD et al., Nature 1983 Jul 7;304(5921):35-39
Doolittle RF et al., Science 1983 Jul 15;221(4607):275-277

V-sis and Platelet-Derived Growth Factor (PDGF)

v-sis:   6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
           QGDPIPEELY+MLS HSIRSFDDLQRLL GD G+EDGAELDLNMTRSHSGGELESLARG+
PDGF :  10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69

v-sis:  66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
           RSLGSL++AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF :  70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129

v-sis: 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
           CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF : 130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189

v-sis: 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
           AKT Q+RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF : 190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230
An earlier, more subtle discovery…
• Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase
Barker WC, Dayhoff MO. PNAS 1982 May;79(9):2836-2839

Query: 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
           Y+  +V    +LHS  +++ DLKP N+LI +Q   +++DFG +++   ++GR  +   +
Sbjct: 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184

Query: 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
           GT  + APEI+  +      D ++ G+ +++M     P ++ +P  +   +V+  +R
Sbjct: 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240

• Biology not Algorithms
- compare proteins, not DNA
- must detect similar amino acids, not just identities
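To make the last point concrete, here is a minimal Python sketch of how a match line like the one in the alignment above can be built: a letter for an identity, "+" for a similar pair, and a blank otherwise. The similarity groups below are a coarse, hypothetical grouping chosen for illustration; real searches score pairs with a substitution matrix, so this sketch will not reproduce every "+" on the slide.

```python
# Hypothetical, coarse groups of chemically similar amino acids (an assumption
# for illustration; a real search would use a substitution matrix instead).
SIMILAR_GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG"]
GROUP_OF = {aa: n for n, group in enumerate(SIMILAR_GROUPS) for aa in group}


def midline(a, b):
    """Build a match line: letter for identity, '+' for similar, ' ' otherwise."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            out.append(" ")          # gap in either sequence
        elif x == y:
            out.append(x)            # identity
        elif x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y):
            out.append("+")          # same similarity group
        else:
            out.append(" ")
    return "".join(out)


def count_identities_and_positives(a, b):
    """Return (identities, identities + similar pairs) for an aligned pair."""
    line = midline(a, b)
    identities = sum(1 for c in line if c.isalpha())
    return identities, identities + line.count("+")
```

For the fragment `DLKPENLLI` aligned with `DLKPANILI`, `midline` returns `DLKP N+LI`: the L/I pair counts as similar even though it is not an identity, which is the kind of signal an identity-only comparison would miss.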
In 1983, only a small percentage of the genes from the genomes of a number of evolutionarily distant organisms (e.g., human, fly, yeast, E. coli) were available. How often would one find matches? How many protein families would there be? Unexpected similarities should be extremely rare.
Estimating number of protein families
Earliest Estimates of Number of Protein Families: ~1000
Zuckerkandl, E. (1974) Accomplissement et perspectives de la paleogenetique chimique. In: Ecole de Roscoff – 1974, p. 69. Paris: CNRS.
“The appearance of new structures and functions in proteins during evolution”, J. Mol. Evol. 7, 1-57 (1975).
Dayhoff, M.O. (1974) Federation Proceedings 33, 2314.
“The origin and evolution of protein superfamilies”, Fed. Proc. 35, 2132-2138 (1976).
Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3 (1978), pg. 10:
“It has been estimated that in humans there are approximately 50,000 proteins of functional or medical importance. … A landmark of molecular biology will occur when one member of each superfamily has been elucidated. At the present rate of 25 per year, this will take less than 15 years.”
Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543-544.
Hubris, the Genome Project, and Protein Families
Green P, Lipman D, Hillier L, Waterston R, States D, and Claverie JM (1993). Ancient Conserved Regions in New Gene Sequences and the Protein Databases. Science, 259, 1711-1716.
ACR = similarity detected between sequences from distantly related organisms
1992: What new families do we get from the genome projects?
[Figure: Cumulative growth in number of proteins & number of conserved domains (from Geer, L., Bryant, S., & Ostell, J.). X axis: year, 1960 to 2000. Y axes: Number of Proteins (0 to 1.2*10^6) and % Families Hit (0 to 100). Series: Protein Sequences; Conserved Domain Families. Annotations: Dayhoff, 10% of superfamilies; Green et al., 85% of ACRs.]
Why so few families and why do they evolve slowly?
Structural View
Thermodynamics: Finkelstein AV, “Why are the same protein folds used to perform different functions?” FEBS Lett. 325, pp. 23-28 (1993)
Constraints Due To Biological Function May Be More Important
Compare pairs of sequences from related classes of proteins
• All sequences should at least share structural similarity
• Divergence times for all sequences should be approximately the same
• Sequences within a class share function but sequences between classes have differing function
A degree of within-class similarity greater than between-class similarity indicates the importance of constraints due to biological function.
[Diagram labels: One gene; Gene duplication; Functional divergence; Last universal common ancestor; prokaryotes; eukaryotes]
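The within-class versus between-class comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: similarity is measured as simple percent identity over pre-aligned, equal-length sequences (the slides do not say which measure was used), and the function names and input layout are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)


def within_and_between(classes):
    """Return (mean within-class identity, mean between-class identity).

    classes maps a class name to a list of aligned sequences.
    """
    labeled = [(name, seq) for name, seqs in classes.items() for seq in seqs]
    within, between = [], []
    for (c1, s1), (c2, s2) in combinations(labeled, 2):
        # Pairs from the same class go to 'within', all others to 'between'.
        (within if c1 == c2 else between).append(percent_identity(s1, s2))
    if not within or not between:
        raise ValueError("need at least one within-class and one between-class pair")
    return sum(within) / len(within), sum(between) / len(between)
```

If the first returned value is clearly larger than the second for classes that diverged at about the same time, that is the pattern the slide attributes to functional constraints.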
Example from the Aminoacyl-tRNA synthetases (aaRS) (from E. Koonin & Y. Wolf)
Essential enzymes responsible for incorporation of amino acids into proteins
• Two unrelated classes of aaRS, each includes 10 aaRS related to each other
• The last universal common ancestor (LUCA) of modern life forms already had at least 17 aaRS
• The duplication leading to aaRS of different specificities must have occurred during a relatively short period of early evolution
• The post-LUCA evolution of aaRS took much longer than the early phase when the specificities were established. However, the changes that occurred after the aaRS were locked in their specificities are small compared to the changes traced to the early phase
Example from the Aminoacyl-tRNA Synthetases (aaRS) (from E. Koonin & Y. Wolf)
Exceptions: glutamine/glutamate, asparagine/aspartate & tryptophan/tyrosine
How many human genes?
• 80,000: Antequera F & Bird A, “Number of CpG islands and genes in human and mouse”, PNAS 90, 11995-11999 (1993).
• 120,000: Liang F et al., “Gene Index analysis of the human genome estimates approximately 120,000 genes”, Nat. Gen. 25, 239-240 (2000).
• 35,000: Ewing B & Green P, “Analysis of expressed sequence tags indicates 35,000 human genes”, Nat. Gen. 25, 232-234 (2000).
• 28,000-34,000: Roest Crollius H et al., “Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence”, Nat. Gen. 25, 235-238 (2000).
• 41,000-45,000: Das M et al., “Assessment of the Total Number of Human Transcription Units”, Genomics 77, 71-78 (2001).
How many human genes with ACRs? (from S. Resenchuk, T. Tatusov, L. Wagner, A. Souvorov)
• 12,245 characterized mRNAs from RefSeq
• 78% have ACR, i.e., hit outside vertebrates at E < 10e-6 (9,496/12,245)
• 90% of these have corresponding GenomeScan predictions which also have ACR (8,501/9,496)
• 20,245 GS models for entire human genome have ACR
• 15,573 GS models after correction for splitting (20,245/1.3)
• 17,300 estimated human genes with ACRs (~15,573/.9)
How many human genes?
• 17,303 estimated human genes with ACRs
• Now use comparative genomics…
• 17,303/.55 = ~31,500 Total Human Genes
More complicated than that!
Conservation, expression level, protein length, & exon number
• 23,600 revised est. human genes with ACRs (~15,573/.66)
• 43,000 upper bound on est. total human genes (23,600/.55)
• 35,000 is more reasonable bound with this approach
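The chain of divisions behind these estimates can be checked with a short Python sketch. All numbers are taken from the slides; the variable names and the reading of each divisor (1.3 as a correction for split gene models, .9 or .66 as the fraction of ACR-containing genes that GenomeScan recovers, .55 as the fraction of all genes that have an ACR) are assumptions about what the slides intend.

```python
# Figures quoted on the slides.
REFSEQ_MRNAS = 12_245         # characterized mRNAs from RefSeq
REFSEQ_WITH_ACR = 9_496       # of those, with a hit outside vertebrates
WITH_GS_ACR = 8_501           # of those, with a GenomeScan prediction that also has an ACR
GS_MODELS_WITH_ACR = 20_245   # GenomeScan models with ACR, whole genome
SPLIT_FACTOR = 1.3            # assumed meaning: correction for split gene models
ACR_FRACTION = 0.55           # assumed meaning: fraction of all genes with an ACR


def estimate(detection_rate):
    """Return (genes with ACRs, total genes) for a given detection rate."""
    corrected_models = GS_MODELS_WITH_ACR / SPLIT_FACTOR
    genes_with_acr = corrected_models / detection_rate
    return genes_with_acr, genes_with_acr / ACR_FRACTION


if __name__ == "__main__":
    print(f"RefSeq mRNAs with ACR: {REFSEQ_WITH_ACR / REFSEQ_MRNAS:.0%}")
    print(f"...of which recovered by GenomeScan: {WITH_GS_ACR / REFSEQ_WITH_ACR:.0%}")
    for rate in (0.9, 0.66):
        with_acr, total = estimate(rate)
        print(f"rate {rate}: {with_acr:,.0f} genes with ACRs, {total:,.0f} total")
```

Note that the total is computed here from the unrounded intermediate value, so it can differ slightly from a figure obtained by dividing an already rounded number such as 23,600.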
The relationship of protein conservation and sequence length • Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA • BMC Evol Biol. 2002 2:20
[Figure: Salmonella set, 4279 proteins. Series: conserved, nonconserved, structural domains.]
[Figure: Archaeoglobus fulgidus, 2420 proteins. Number vs. Length (0 to 1000). Series: conserved, nonconserved, structural domains.]
[Figure: Yeast, 6305 proteins. Number vs. Length (0 to 1000). Series: conserved, nonconserved, structural domains.]
[Figure: Drosophila, 2390 proteins. Number vs. Length (0 to 1000). Series: conserved, nonconserved, structural domains.]
[Figure: Human, 14538 proteins. Number vs. Length (0 to 1000). Series: conserved, nonconserved, structural domains.]
[Figure: panels A and B. Series: conserved, nonconserved.]
[Figure: Contact density. Archaeoglobus fulgidus and Escherichia coli.]
Acknowledgements & all my colleagues at NCBI and NIH