The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number
In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot
Fortaleza, Brazil, August 2006
Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal
Presentation Outline
• The importance of gene number
• Gene definition and detection
• Genome inflation arguments
• Post-completion changes in model eukaryotes
• Ensembl pipeline numbers
• The smORF question
• Completed chromosomes
• International Protein Index
• Novel gene skimming
• Updates
• Conclusions
So Who Cares About Human Protein-Coding Gene Number?
• Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications
• Mammalian gene totals are expected to be similar, but clade-specific genes may be important for speciation
• Accurate ORF delineation is essential for genetic association studies and transcript profiling
• MS-based proteomics needs a complete ORFome as the search space for peptide and protein identification
• For Pharma and Biotech the numbers set finite limits on potential drug targets and therapeutic proteins
• The Swiss-Prot Human Proteomics Initiative (HPI) team
Definitions
• The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation”
• However, the Guidelines for Human Gene Nomenclature define a gene as: “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology”
• The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g. micro and antisense RNA
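The basal-gene definition above can be sketched as a counting rule: transcripts on the same chromosome and strand whose genomic spans overlap are collapsed into one locus. This is only an approximation under stated assumptions. Genomic overlap stands in for "overlapping sequence identity", and the function name and coordinates below are hypothetical, not taken from the talk.

```python
from collections import defaultdict

def count_basal_genes(transcripts):
    """Count loci after merging overlapping transcripts per chromosome and strand.

    transcripts: iterable of (chrom, strand, start, end) with start <= end.
    Genomic overlap is used as a simple proxy for shared protein sequence.
    """
    by_locus = defaultdict(list)
    for chrom, strand, start, end in transcripts:
        by_locus[(chrom, strand)].append((start, end))

    total = 0
    for spans in by_locus.values():
        spans.sort()
        current_end = None
        for start, end in spans:
            if current_end is None or start > current_end:
                total += 1              # a new, non-overlapping locus begins
                current_end = end
            else:
                current_end = max(current_end, end)  # extend the current locus
    return total

# Hypothetical coordinates, for illustration only.
example = [
    ("chr1", "+", 100, 500),    # splice variant A
    ("chr1", "+", 300, 900),    # splice variant B, overlaps A: same basal gene
    ("chr1", "-", 350, 800),    # opposite strand: counted separately
    ("chr1", "+", 2000, 2600),  # same strand, no overlap: counted separately
]
print(count_basal_genes(example))
```

Under this rule the two overlapping splice variants count once, so splice-variant growth in transcript databases does not inflate the basal number.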
Identifying Protein-Coding Genes
In silico
• Detection of protein identity in genomic DNA
• Gene prediction with protein similarity support
• Matches with ESTs that include ORFs and/or splice sites
• Cross-species comparisons for orthologous exon detection
• Presence of gene anatomy features, e.g. CpG islands, promoters, transcription start sites, polyadenylation signals
• Absence of pseudogene disablements or repeat elements
In vitro
• Cloning of predicted genes
• Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
• Loss-of-function approaches
• High-throughput transcript sampling by EST, MPSS or SAGE tags
• Heterologous expression of cDNAs
• Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
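As a toy illustration of the in silico side, the simplest ORF check on a transcript or EST is a scan for an ATG followed by an uninterrupted reading frame up to a stop codon. The sketch below covers only the three forward frames and ignores splicing, the reverse strand and every other line of evidence listed above. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF in the forward frames.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)      # keep the longest ORF so far
                start = None                   # close the ORF, keep scanning
    return best

# Hypothetical sequence: CC + ATG AAA TTT GGG TAA + CC
orf = longest_orf("CCATGAAATTTGGGTAACC")
print(orf)
```

An in-frame stop inside an otherwise conserved coding region is one kind of pseudogene disablement, which is why a scan like this is a filter and not proof of a gene.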
Historical Arguments and Estimates for High Gene Numbers
• Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
• Gene prediction programs have a significant false-negative rate
• The Ensembl gene annotation pipeline is conservative
• Mammalian protein and transcript coverage is incomplete
• Chromosome annotation teams find more genes than automated pipelines
• Selective transcript skimming experiments have revealed new genes
• Extensive mammalian genomic sequence conservation outside known exons
• Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”)
• EST clustering and commercial “gene inflation” claims
[Figure: Genesweep 2000 and literature estimates]
Model Eukaryotes: No Significant Post-Completion Gene Increases
• S. pombe: 3% increase since 2002
• S. cerevisiae: 8% decrease since 1997
• C. elegans: 5% increase since 1998
• D. melanogaster: 0.2% increase since 2001
Little increase in spite of a global functional genomics focus
Human Transcripts: Post-genomic mRNA Growth in UniGene
• Rapid growth in redundant mRNA
• But slow growth in the clustered set: ~9,000 over 2 years, with a plateau at ~28,000
• Includes splice variants and some spurious ORFs
Ensembl Human Gene Number
• Only 22,218 genes, a decrease of 1,826 over 4 years
• Knowns: from 90% to 95%
• Novel genes: from 12,398 to 2,263
• Exons per gene: from 6.5 to 9.6
• Alternative splicing: from 3,669 to 8,078
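The headline figure can be put in relative terms with a little arithmetic. This is a sketch using only the two numbers stated on the slide (22,218 genes and a decrease of 1,826 over 4 years). The implied earlier count and the variable names are derived here and are not quoted from the talk.

```python
current_genes = 22_218   # Ensembl count quoted on the slide
decrease = 1_826         # drop over the period quoted on the slide
years = 4

# Implied count at the start of the period.
earlier_genes = current_genes + decrease

# Relative change over the whole period and the average change per year.
percent_decrease = 100 * decrease / earlier_genes
genes_lost_per_year = decrease / years

print(earlier_genes)                 # 24044
print(round(percent_decrease, 1))    # 7.6
print(genes_lost_per_year)           # 456.5
```

A fall of under 8% over four years is the opposite of what the gene inflation arguments predicted.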
Addressing the smORF Question: Protein Size Distributions in Human SPTr
• Pre Oct-01: 6.3% < 100 aa
• Post Oct-01: 5.5% < 100 aa
• “Novel” in title: 3.4% < 100 aa
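The size-distribution comparison comes down to one statistic per protein set: the percentage of entries shorter than a length cutoff. A minimal sketch follows, assuming the percentages on the slide refer to proteins under 100 residues. The function name and the example lengths are hypothetical and are not the SPTr data.

```python
def percent_below(lengths, cutoff=100):
    """Percentage of protein lengths (in residues) strictly below the cutoff."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no protein lengths supplied")
    small = sum(1 for n in lengths if n < cutoff)
    return 100 * small / len(lengths)

# Hypothetical lengths, for illustration only.
hypothetical_lengths = [45, 88, 120, 310, 402, 560, 75, 990, 1500, 230]
print(percent_below(hypothetical_lengths))
```

Running the same function on entries split by date would show whether newer submissions are enriched for small proteins, which is the comparison the slide reports.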
Summarising the smORF Question
• The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely
• No database evidence for increased smORF discovery in mammals
• The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
• Although small proteins evolve more rapidly, there is no precedent for complete loss of ortholog similarity signal
• Those much shorter than 100 residues will fall below the threshold needed to fold into the domain structures required for biological function
• No evidence for de novo gene “invention” in higher eukaryotes
Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets
56,537 entries
Experimental Transcript Skimming as Evidence for High Protein Numbers
• Exon arrays (Dunham et al. 1999)
• Gene arrays (Penn et al. 2000)
• RT-PCR (Das et al. 2001)
• SAGE tags (Saha et al. 2002, Chen et al. 2002)
• Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
• A full-length ORF with the features of gene anatomy must be submitted to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any
• There is increasing evidence for significant amounts of non-ORF transcription in human and mouse
Gene Numbers for Individual Completed Chromosomes
• The average over the completed chromosomes exceeds the Ensembl gene count by ~12%
• Extrapolates to ~25,000 genes without “novel transcripts” or “putatives”
• Extrapolates to ~28,000 genes without “putatives”
• Extrapolates to ~31,000 genes with “putatives”
• The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
• Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
• Future status of novel transcripts and putative genes is unclear; most will be non-coding
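The first extrapolation can be reproduced as simple scaling. The sketch below assumes the baseline is the Ensembl count of 22,218 quoted earlier in the talk, which the slide does not state explicitly. The excess percentages for the two higher totals are back-calculated here for illustration and are not figures from the talk.

```python
ensembl_genes = 22_218      # assumed baseline: Ensembl count quoted earlier
curated_excess = 0.12       # completed chromosomes exceed Ensembl by ~12%

# Scale the genome-wide Ensembl count by the average curated excess.
extrapolated = ensembl_genes * (1 + curated_excess)
print(round(extrapolated))            # 24884
print(int(round(extrapolated, -3)))   # 25000

def implied_excess_percent(total, baseline=ensembl_genes):
    """Percentage by which a quoted total exceeds the assumed baseline."""
    return 100 * (total / baseline - 1)

# Back-calculated excess implied by the higher totals on the slide.
print(round(implied_excess_percent(28_000)))   # 26
print(round(implied_excess_percent(31_000)))   # 40
```

The spread between the three totals shows how strongly the answer depends on whether novel transcripts and putative genes are counted as protein-coding.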
Disappearing Novelty
• EMBL hum cds March 2003 = 1491
• Plus “novel” = 159
• Plus PubMed 2003 = 120
• Novel in title = 11
• Previous cds = 8
• Novel genes = 2
• Now both in RefSeq and Ensembl 18.34
Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes
• 3778 from plasma (Muthusamy et al. 2005)
• 2486 from liver cells (Yan et al. 2006)
• 615 from human heart mitochondria (Taylor et al. 2003)
• 500 from breast cancer cell membranes (Adams et al. 2003)
• 491 from microsomal fractions (Han et al. 2001)
• 311 from the spliceosome (Rappsilber et al. 2002)
• No verifiable data on gene prediction confirmation
• One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
• While there is no evidence of novel protein discovery, there is a caveat on the available search space
Conclusions
• The model eukaryotes have shown no significant post-genomic rises in gene number
• The Ensembl gene number has been essentially flat since 2001
• A set of ~2,000 predicted genes still eludes experimental verification, or may not be real
• Putative genes from curated chromosomes could raise protein numbers, but the status of this class of transcripts is in doubt
• Early over-estimates are explicable by non-ORF transcription
• Post-genomic transcript coverage is predominantly re-sampling known genes
• Database submissions of novel human genes have slowed to a trickle
• No evidence for large numbers of cryptic smORFs
• Proteomics has not revealed new proteins
Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?
Updates
• October 2004 Nature paper on the finished human genome: “20-25,000 protein-coding genes”
• December 2005 Nature paper: “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).”
• March 2006: Ensembl 23,701
• June 2006: Swiss-Prot HPI 14,445
Acknowledgments and Reference
• Paul Kersey of the EBI for IPI figures
• Lucas Wagner of the NCBI for the retrospective UniGene data
• Numerous other people at NCBI, EBI, Swiss-Prot and the Sanger Centre who graciously answered queries on their data collections
• The Oxford Glycosciences Proteome Discovery Team
Southan C. Has the yo-yo stopped? An assessment of human protein-coding gene number. Proteomics (2004) 4(6):1712-26. PMID: 15174140