Discovery and Characterization of protein-coding genes in D. melanogaster

Mark Yandell, HHMI, Berkeley Drosophila Genome Project

We have just completed a large-scale, genome-wide search for additional protein-coding genes.
What we found:
• Most fly protein-coding genes have been at least provisionally identified.
• Looking for new protein-coding genes means searching very large collections of predictions for only a few additional genes.
• Doing so in a coordinated and cost-effective manner is essential.
Themes: validation and coordination.

Where would missing genes lie? ~50% of the genome is intergenic (61,971,014 bp).
[Figure: schematic of alternating genic and intergenic regions]

Distribution of intergenic lengths
[Figure: distribution expected if genes were distributed randomly within the genome, compared with the actual distribution; annotated value: 26,346,479 bp]

~62 megabases of DNA: run Genscan on every intergenic region.
• 11,671 Genscan predictions
• 1,167 non-overlapping FgenesH predictions (V. Solovyev)
• 1,266 'new' genes from Hild et al.
• 159 control annotations
Total: 14,263 new gene predictions. How many are real?
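The total on this slide can be checked by summing the four sources. The snippet below is only a bookkeeping sketch; the dictionary keys are labels chosen here for illustration, while the counts are the ones quoted on the slide.

```python
# Counts as quoted on the slide; the key names are illustrative labels.
prediction_sources = {
    "genscan_intergenic": 11_671,
    "fgenesh_non_overlapping": 1_167,
    "hild_et_al_new_genes": 1_266,
    "control_annotations": 159,
}

total_predictions = sum(prediction_sources.values())
print(f"total new gene predictions: {total_predictions:,}")

# Share of the pooled set contributed by each source.
for name, count in prediction_sources.items():
    print(f"{name}: {count / total_predictions:.1%}")
```

The four counts add up to the 14,263 predictions carried into the validation steps.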

~62 megabases of DNA; 14,263 new gene predictions; standardized validation procedure.

Validation procedure
1. Pool mRNA from 6 different stages
2. Reverse transcription (RVT) with a tagged T15 primer
3. PCR with exon-specific primers
4. Sequence the PCR product
5. Realign to the genome
6. Examine in the browser
[Figure: genome browser view showing a gene prediction and the realigned PCR product]

~62 megabases of DNA; 14,263 new gene predictions.
Sub-categorization seemed advisable; homology seemed a logical criterion.

Split the gene models into 5 different sets:
1. 'One or none' set: 293 (9,276)
2. 'Two or more' set: 339 (339)
3. 'Splice junction conserved' set (AG/GT conserved in the D. p. genome): 207 (207)
4. 'Heidelberg' set ('new' genes from Hild et al.): 196 (1,266); 34%
5. 'Control' set (Platinum annotations): 159; 96%
Other percentages shown on the slide: 2%, 9%, 7%.
About 800 protein-coding genes remain to be identified*.
~95%* of all fly protein-coding genes have at least provisional annotations.
Why are there so many predictions and so few genes?

A negative control
1. Random sequence generator with A=T=G=C=0.25, e.g. AATGCGGATTTGCGGGATTAGGCGTTGAAAAAAAAAGATTCG~
2. Run Genscan and a CpG island finder on the random sequence
3. Examine the results
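The random-sequence step of this control could be sketched as follows. This is an illustration only, not the generator used in the talk: `random_dna` and `cpg_obs_exp` are hypothetical helper names, and the observed/expected CpG ratio is a common ingredient of CpG island criteria standing in for the unnamed CpG island finder. Genscan itself is not reproduced here.

```python
import random


def random_dna(length, seed=None):
    """Return a random DNA string with A=T=G=C=0.25 (uniform, independent bases)."""
    rng = random.Random(seed)  # seeded generator so a run can be reproduced
    return "".join(rng.choice("ACGT") for _ in range(length))


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    seq = random_dna(10_000, seed=1)
    print(seq[:40])
    print(f"CpG observed/expected: {cpg_obs_exp(seq):.2f}")
```

A sequence produced this way would then be passed to the gene finder and the CpG island finder exactly as a real intergenic region would be.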

Random DNA contains genes and CpG islands…
[Figure: CpG island and Genscan prediction tracks on random sequence]
Thus we argue that an abundance of predictions is not, in itself, evidence for missed genes.

This fact means that validation methodology is a real issue.
• As far as Genscan is concerned, D. melanogaster intergenic regions look like random DNA.
• It now appears that much of the genome is transcribed.
• We believe that in many cases spurious predictions overlap transcribed regions simply by chance.
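The chance-overlap argument can be made concrete with a small Monte Carlo sketch: place predictions of a fixed length at random positions and count how often they touch a transcribed interval. Everything here is hypothetical (the function name, the toy coordinates, the fixed prediction length); it illustrates the reasoning and does not reproduce any analysis from the talk.

```python
import random


def chance_overlap_rate(genome_len, transcribed, pred_len, trials=10_000, seed=0):
    """Estimate how often a randomly placed prediction overlaps a transcribed region.

    transcribed: list of (start, end) half-open intervals on a linear genome.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(0, genome_len - pred_len + 1)
        end = start + pred_len
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < t_end and t_start < end for t_start, t_end in transcribed):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Toy genome: 100 kb with three transcribed stretches (hypothetical numbers).
    toy_transcribed = [(5_000, 20_000), (40_000, 55_000), (70_000, 90_000)]
    rate = chance_overlap_rate(100_000, toy_transcribed, pred_len=2_000)
    print(f"chance overlap rate: {rate:.2f}")
```

The larger the transcribed fraction, the more often a spurious prediction will appear to be "expressed" without being a gene.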

• Confirmation of expression is not confirmation of existence.
• At the very least, show that it's spliced, or failing that, discrete.
• Determining the true structure of the transcriptome is the next logical step for annotation.
For protein-coding genes:
• accurate annotation of each protein-coding gene's intron-exon structure
• accurate annotation of every alternate transcript
• extend in-situ information to individual alternative transcripts
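One way to read the "show that it's spliced" criterion is as a check on how a sequenced product realigns to the genome: a spliced product aligns as two or more blocks separated by an intron-sized gap. The sketch below assumes a list of aligned blocks is already available; `is_spliced` and the 40 bp `min_intron` threshold are assumptions made for illustration, not values from the talk.

```python
def is_spliced(blocks, min_intron=40):
    """Return True if aligned blocks are separated by at least one intron-sized gap.

    blocks: list of (start, end) half-open genomic intervals for one realigned product.
    min_intron: smallest gap treated as an intron (hypothetical threshold).
    """
    ordered = sorted(blocks)
    return any(
        next_start - prev_end >= min_intron
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    print(is_spliced([(100, 200), (300, 400)]))  # two blocks, 100 bp gap
    print(is_spliced([(100, 400)]))              # single unspliced block
```

A product that aligns as a single block would need some other evidence, such as clear boundaries, before it counts toward validation.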

Conclusions
We have just completed a large-scale, genome-wide search for additional protein-coding genes.
What we conclude:
• Most fly protein-coding genes have been at least provisionally identified.
• Looking for new protein-coding genes means searching very large collections of predictions for only a few additional genes.
-- finding more will require new/retrained gene-finders; casting a wider net.
-- this will make for even larger collections of predictions.
• Regarding protein-coding genes, the real issue is 'finalizing' provisional annotations. This will be a computationally and experimentally complex task!
• Doing so in a coordinated and cost-effective manner is essential.
• Ditto for annotation 'finalization' and non-coding RNA genes.

Why doing this responsibly will require a common software infrastructure.
Coordination and standardization will be key!
• of validation procedures
• of data exchange formats
• some centralized coordination
[Figure: groups A, B and C each run their own gene-finder (1, 2, 3) and exchange primers and validation results with the wet lab as gff3]
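Since the slide names gff3 as the exchange format, a minimal sketch of writing one prediction as a GFF3 record may help. GFF3 records have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) with 1-based inclusive coordinates. The function name `gff3_line` and all field values below are hypothetical examples.

```python
def gff3_line(seqid, source, feature_type, start, end, strand, attributes,
              score=".", phase="."):
    """Format one GFF3 record (nine tab-separated columns, 1-based inclusive coordinates)."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    attr_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, feature_type, str(start), str(end),
               str(score), strand, str(phase), attr_text]
    return "\t".join(columns)


if __name__ == "__main__":
    print("##gff-version 3")
    # Hypothetical prediction from one group's gene-finder.
    print(gff3_line("2L", "gene_finder_1", "gene", 10_500, 12_750, "+",
                    {"ID": "pred0001", "Note": "validated_by_RT-PCR"}))
```

Agreeing on one record layout like this is what lets several groups pool predictions and validation results without per-group converters.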

Acknowledgements
Sima Misra, Adina Bailey, Colin Wiel, ShengQiang Shu, Joe Carlson, Martha Evans-Holm, Pavel Tomancak, Sue Celniker, Suzi Lewis, Gerald M. Rubin
