
The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number

In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006. Christopher Southan, Molecular Pharmacology, AstraZeneca R&D, Mölndal.


Presentation Transcript


  1. The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number. In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006. Christopher Southan, Molecular Pharmacology, AstraZeneca R&D, Mölndal

  2. Presentation Outline • The importance of gene number • Gene definition and detection • Genome inflation arguments • Post-completion changes in model eukaryotes • Ensembl pipeline numbers • The smORF question • Completed chromosomes • International Protein Index • Novel gene skimming • Updates • Conclusions

  3. So Who Cares About Human Protein Coding Gene Number? • Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications • Mammalian gene totals expected to be similar but clade-specific genes may be important for speciation • Accurate ORF delineation essential for genetic association studies and transcript profiling • MS-based proteomics needs a complete ORFome for the peptide and protein identification search space • For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins • The Swiss-Prot Human Proteomics Initiative (HPI) team

  4. Definitions • The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation” • However, the Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology" • The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g. micro and antisense RNA

  5. Identifying Protein Coding Genes. In silico: • Detection of protein identity in genomic DNA • Gene prediction with protein similarity support • Matches with ESTs that include ORFs and/or splice sites • Cross-species comparisons for orthologous exon detection • Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals • Absence of pseudogene disablements or repeat elements. In vitro: • Cloning of predicted genes • Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation • Loss-of-function approaches • High-throughput transcript sampling by EST, MPSS or SAGE tags • Heterologous expression of cDNAs • Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing

  6. Historical Arguments and Estimates for High Gene Numbers • Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates • Gene prediction programs have a significant false-negative rate • The Ensembl gene annotation pipeline is conservative • Mammalian protein and transcript coverage is incomplete • Chromosome annotation teams find more genes than automated pipelines • Selective transcript skimming experiments have revealed new genes • Extensive mammalian genomic sequence conservation outside known exons • Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”) • EST clustering and commercial “gene inflation” claims Genesweep 2000 Literature estimates

  7. Model Eukaryotes: No Significant Post-Completion Gene Increases • S.pombe: 3% increase since 2002 • S.cerevisiae: 8% decrease since 1997 • C.elegans: 5% increase since 1998 • D.melanogaster: 0.2% increase since 2001 Little increase in spite of global functional genomics focus

  8. Human Transcripts: Post-genomic mRNA Growth in UniGene • Rapid growth in redundant mRNA • But slow growth in clustered set ~ 9,000 over 2 years with plateau ~ 28,000 • Includes splice variants and some spurious ORFs

  9. Ensembl Human Gene Number • Only 22,218 genes, a decrease of 1,826 over 4 years • Knowns: from 90% to 95% • Novel genes: from 12,398 to 2,263 • Exons-per-gene: from 6.5 to 9.6 • Alternative splicing: from 3,669 to 8,078
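The changes on the Ensembl slide can be put on a common footing with a little arithmetic. The sketch below is illustrative only: it assumes each pair of figures on the slide is an earlier value followed by a later one, and the function and variable names are invented for this example.

```python
# Figures quoted on the Ensembl slide; reading each pair as (earlier, later)
# is an assumption made for this sketch.
CURRENT_GENES = 22_218
DECREASE = 1_826
NOVEL_GENES = (12_398, 2_263)
SPLICED_GENES = (3_669, 8_078)


def percent_change(earlier, later):
    """Signed percentage change from the earlier to the later value."""
    return (later - earlier) / earlier * 100


# Starting total implied by "22,218 genes, a decrease of 1,826".
starting_genes = CURRENT_GENES + DECREASE
total_change = percent_change(starting_genes, CURRENT_GENES)
novel_change = percent_change(*NOVEL_GENES)
splicing_change = percent_change(*SPLICED_GENES)

print(f"implied starting total: {starting_genes}")
print(f"total gene change: {total_change:.1f}%")
print(f"novel gene change: {novel_change:.1f}%")
print(f"alternatively spliced change: {splicing_change:.1f}%")
```

On this reading the total shrinks by under a tenth, while the "novel" class shrinks by a far larger proportion, which fits the slide's point that the count stayed low as annotation evidence improved.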

  10. Addressing the smORF Question: Protein Size Distributions in Human SPTr • Pre Oct-01: 6.3% > 100aa • Post Oct-01: 5.5% > 100aa • “Novel” in title: 3.4% > 100aa
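A size-distribution check of this kind amounts to counting sequence lengths on either side of a 100-residue cutoff. The following is a minimal sketch, not the method used for the slide: it parses FASTA text with the standard library only, the three toy entries are hypothetical placeholders rather than SPTr records, and the function names are made up for illustration.

```python
def read_fasta_lengths(fasta_text):
    """Return {identifier: sequence length} for FASTA-formatted text."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first token after '>'.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so lengths accumulate.
            lengths[current] += len(line)
    return lengths


def fraction_below(lengths, cutoff=100):
    """Fraction of entries shorter than `cutoff` residues."""
    if not lengths:
        return 0.0
    small = sum(1 for n in lengths.values() if n < cutoff)
    return small / len(lengths)


# Hypothetical toy entries: one short sequence and two long ones.
TOY_FASTA = "\n".join([
    ">toy1 hypothetical short entry",
    "MKTAYIAKQR",
    ">toy2 hypothetical long entry",
    "A" * 150,
    ">toy3 hypothetical wrapped entry",
    "G" * 60,
    "G" * 60,
])

toy_lengths = read_fasta_lengths(TOY_FASTA)
toy_fraction = fraction_below(toy_lengths, cutoff=100)
print(toy_lengths)
print(f"{toy_fraction:.1%} of toy entries are shorter than 100 aa")
```

Running the same count on database releases split by entry date would give the kind of before-and-after comparison the slide reports; the share at or above the cutoff is simply one minus the returned fraction.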

  11. Summarising the smORF Question • The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely • No database evidence for increased smORF discovery in mammals • The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals • Although small proteins evolve more rapidly there is no precedent for complete loss of ortholog similarity signal • Those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function • No evidence for de-novo gene “invention” in higher eukaryotes

  12. Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets 56537 Entries

  13. Experimental Transcript Skimming as Evidence for High Protein Numbers • Exon arrays (Dunham et al. 1999) • Gene arrays (Penn et al. 2000) • RT-PCR (Das et al. 2001) • SAGE-tags (Saha et al. 2002, Chen et al. 2002) • Oligo tiling from chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004) • A full-length ORF with the features of gene anatomy must be submitted to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any • There is increasing evidence for significant amounts of non-ORF transcription in human and mouse

  14. Gene Numbers for Individual Completed Chromosomes • Averaging the completed chromosomes exceeds Ensembl genes by ~12% • Extrapolates to ~25,000 genes without “novel transcripts” or “putatives” • Extrapolates to ~28,000 genes without “putatives” • Extrapolates to ~31,000 genes with “putatives” • The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7) • Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers • Future status of novel transcripts and putative genes unclear; most will be non-coding
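The first extrapolation on the chromosome slide is consistent with scaling the Ensembl total from slide 9 by the quoted ~12% excess. The slide does not spell out the calculation, so the sketch below is an assumed reconstruction; the function names are invented, and the excesses for the two larger estimates are back-calculated from the rounded totals rather than taken from the talk.

```python
ENSEMBL_GENES = 22_218      # Ensembl total quoted on slide 9
CHROMOSOME_EXCESS = 0.12    # ~12% excess quoted on the chromosome slide


def extrapolate(base, excess):
    """Scale a base gene count by a fractional excess."""
    return round(base * (1 + excess))


def implied_excess(base, estimate):
    """Fractional excess over the base that a given estimate implies."""
    return estimate / base - 1


core_estimate = extrapolate(ENSEMBL_GENES, CHROMOSOME_EXCESS)
print(f"core estimate: {core_estimate}")

# Back-calculated excesses for the two larger rounded totals on the slide.
for label, total in [("with novel transcripts", 28_000),
                     ("with putatives", 31_000)]:
    print(f"{label}: {implied_excess(ENSEMBL_GENES, total):.0%} over Ensembl")
```

The core estimate lands within a few hundred genes of the slide's ~25,000. The larger totals depend entirely on how many "novel transcripts" and "putatives" are counted as protein-coding, which the slide itself flags as uncertain.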

  15. Disappearing Novelty • EMBL hum cds March 2003 = 1491 • Plus “novel” = 159 • Plus PubMed 2003 = 120 • Novel in title = 11 • Previous cds = 8 • Novel genes = 2 • Now both in RefSeq and Ens 18.34

  16. Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes • 3778 from plasma (Muthusamy et al. 2005) • 2486 from liver cells (Yan et al. 2006) • 615 from the human heart mitochondria (Taylor et al. 2003) • 500 from breast cancer cell membranes (Adams et al. 2003) • 491 from microsomal fractions (Han et al. 2001) • 311 from the spliceosome (Rappsilber et al. 2002) • No verifiable data on gene prediction confirmation • One novel gene reported from a genome-only peptide match by Kuster et al. in 2001 but this appeared from a high-throughput project later in the same year (Tr Q96DA0) • While there is no evidence of novel protein discovery there is a caveat on the available search space

  17. Conclusions • The model eukaryotes have shown no significant post-genomic rises in gene number • The Ensembl gene number has been essentially flat since 2001 • A set of ~2,000 predicted genes still eludes experimental verification; these may not be real • Putative genes from curated chromosomes could raise protein numbers but the status of this class of transcripts is in doubt • Early over-estimates explicable by non-ORF transcription • Post-genomic transcript coverage is predominantly re-sampling known genes • Database submissions of novel human genes have slowed to a trickle • No evidence for large numbers of cryptic smORFs • Proteomics has not revealed new proteins. Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?

  18. Updates • October 2004 Nature paper on finished human genome “20-25,000 protein-coding genes” • December 2005 Nature paper “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).” • March 2006 Ensembl 23,701 • June 2006 Swiss-Prot HPI 14,445

  19. Acknowledgments and Reference • Paul Kersey of the EBI for IPI figures • Lucas Wagner of the NCBI for the retrospective UniGene data • Numerous other people at NCBI, EBI, Swiss-Prot and Sanger Centre who graciously answered queries on their data collections • The Oxford Glycosciences Proteome Discovery Team. Reference: Southan C. (2004) Has the Yo-yo stopped? An assessment of human protein-coding gene number. Proteomics 4(6):1712-26. PMID: 15174140
