Building a Unified Gene Catalog for the Mouse Reference Genome
Carol Bult, The Jackson Laboratory
Mouse Genome Annotation Summit, Bethesda, Maryland, March 2008
How similar are the results of different gene prediction pipelines for Build 37 of the reference mouse genome?
Gene Unification
• Compare genome annotations from:
  • NCBI (31,711 annotations)
  • Ensembl (28,167 annotations)
  • VEGA (14,919 annotations)
• Determine:
  • Equivalent gene models
  • Gene models unique to Ensembl
  • Gene models unique to NCBI
  • Gene models unique to VEGA
  • Etc.
Method: Genome feature overlap analysis
• Assess genome coordinate overlaps for annotated exons
• NCBI, Ensembl and Vega provided their annotations in a standardized file format with Build 37 (B37) genome coordinates
• Richardson, J. “fjoin: Simple and Efficient Computation of Feature Overlaps.” J. Comput. Biol. 13:1457-64 (2006).
• Overlap of a single nucleotide between two exons is sufficient to call two gene models “equivalent”
• The overlap parameter is adjustable
• The feature type used to detect overlaps is configurable
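The exon overlap test can be sketched as follows. This is a minimal illustration under stated assumptions, not fjoin itself: it uses a simple sorted scan rather than the published algorithm, it ignores strand (the slides do not say whether strand is considered), and the tuple layout, the function name `overlapping_gene_pairs`, and the `min_overlap` parameter are hypothetical names chosen for the example. Coordinates are assumed to be 1-based and inclusive.

```python
from collections import defaultdict

def overlapping_gene_pairs(exons_a, exons_b, min_overlap=1):
    """Return the set of (gene_a, gene_b) pairs that share an overlapping exon.

    Each exon is a tuple (chrom, start, end, gene_id) with 1-based,
    inclusive coordinates (an assumption of this sketch).
    min_overlap is the number of shared nucleotides required.
    """
    # Group the second source's exons by chromosome, sorted by start.
    by_chrom_b = defaultdict(list)
    for chrom, start, end, gene in exons_b:
        by_chrom_b[chrom].append((start, end, gene))
    for exon_list in by_chrom_b.values():
        exon_list.sort()

    pairs = set()
    for chrom, a_start, a_end, a_gene in exons_a:
        for b_start, b_end, b_gene in by_chrom_b.get(chrom, []):
            if b_start > a_end:
                break  # sorted by start, so no later exon can overlap
            shared = min(a_end, b_end) - max(a_start, b_start) + 1
            if shared >= min_overlap:
                pairs.add((a_gene, b_gene))
    return pairs

# Made-up exons: N1 and E1 share exactly one nucleotide (position 200).
ncbi = [("1", 100, 200, "N1"), ("1", 500, 600, "N2")]
ensembl = [("1", 200, 300, "E1"), ("1", 700, 800, "E2")]
print(overlapping_gene_pairs(ncbi, ensembl))                 # {('N1', 'E1')}
print(overlapping_gene_pairs(ncbi, ensembl, min_overlap=2))  # set()
```

Raising `min_overlap` makes the equivalence call stricter. With the default of 1, the sketch mirrors the single-nucleotide rule stated on the slide.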
Caveats
• Equivalent does not mean identical gene structure
• The analysis does not evaluate which gene model is “best”, only that the annotations from different sources likely represent the same gene or transcriptional unit
• Unique does not mean novel
• Some known genes are present in one annotation file but not the others
Example: Ensembl and NCBI
Input: NCBI (31711), Ensembl (28167)
Unification (Exon Overlap Detection)
Equivalent: 23650
Unique to NCBI: 8678
Unique to Ensembl: 5248
Equivalent, by class: 1:1 = 21528; 1:n = 629; n:1 = 788; n:m = 705
Build 37 Summary
E = Ensembl (28167), V = Vega (14919), N = NCBI (31711)

          0:1     1:0     1:1    1:n   n:1   n:m
E vs V    4764   17923    9322   333   505   433
N vs E    5248    8678   21528   629   788   705
V vs N   20208    3409   10606   405   410   535

E unique = 4707
N unique = 6953
V unique = 2986
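One way to arrive at class counts like those in the summary is to group equivalent gene models into clusters and label each cluster by how many models each source contributes. The sketch below assumes that clusters are connected components of the pairwise equivalence graph, which the slides do not state. It also assumes that the first number in each class refers to the first-named source, which is suggested by the N vs E row, where 0:1 (5248) matches the "Unique to Ensembl" count. The names `classify_clusters` and `cluster_label` are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_label(n_a, n_b):
    """Map per-source gene counts to a class such as 1:0, 1:1, 1:n or n:m."""
    a = "n" if n_a > 1 else str(n_a)
    b = "n" if n_b > 1 else str(n_b)
    return "n:m" if (a, b) == ("n", "n") else f"{a}:{b}"

def classify_clusters(genes_a, genes_b, pairs):
    """Count clusters per class for two annotation sources.

    genes_a, genes_b: gene IDs from each source.
    pairs: (gene_a, gene_b) pairs judged equivalent.
    """
    # Tag each node with its source so that IDs cannot collide.
    adj = defaultdict(set)
    for a, b in pairs:
        adj[("A", a)].add(("B", b))
        adj[("B", b)].add(("A", a))

    nodes = [("A", g) for g in genes_a] + [("B", g) for g in genes_b]
    seen = set()
    counts = Counter()
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk over one connected component.
        seen.add(node)
        stack = [node]
        n_a = n_b = 0
        while stack:
            cur = stack.pop()
            if cur[0] == "A":
                n_a += 1
            else:
                n_b += 1
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        counts[cluster_label(n_a, n_b)] += 1
    return counts

# Made-up example: N2 overlaps two Ensembl models, N3 and E4 overlap nothing.
counts = classify_clusters(
    ["N1", "N2", "N3"],
    ["E1", "E2", "E3", "E4"],
    [("N1", "E1"), ("N2", "E2"), ("N2", "E3")],
)
print(dict(counts))
```

Here the counts are per cluster rather than per gene, so a single 1:n cluster contains one model from the first source and several from the second.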
Equivalent (1:1:1) 11:84331455..84340462 Screenshots from MGI Mouse GBrowse
Equivalent (1:n) 1:58765343..58820514
Equivalent (n:1) Clec2g Clec2f 6:128876095..128986094 Some annotations masked out to improve clarity of example
Equivalent (n:m) 2:155895575..155939706
Unique to Ensembl and Vega Some annotations in this region are masked to enhance clarity of the example. Csmd2 Chr4:136463772..137119871
Common Issues
• Gene duplications/gene families
• Read-through transcripts
• Shared first exons
Gene Duplication/Gene Family Rex2 Reduced expression 2 Zinc finger protein 4:145845084..145895083
Rex2?? 4:146339646..146439645
4:145845084..145895083 Rex2 4:146339646..146439645 Rex2
Read-through Transcripts Raver1 and Fdx1l 9:20862521..20912520
Shared Exons Defb41 and novel defensin gene 1:18240353..18255926
10:21849916..22136785 Raet1a,b,c,d,e Some annotations masked out to improve clarity of example
Importance of Annotation Coordination
• Genome feature identity
• Functional annotation associations
• Experimental genetics
• KOMP
Gene Identity 16:96582252..96792251 Pcp4 and Igsf5
Pcp4 – Purkinje cell protein 4 (MGI:97509)
Igsf5 – Immunoglobulin superfamily, member 5 (MGI:1919308)
There is no Igsf5 in Ensembl, but Igsf5 appears to be used as a synonym for Pcp4
Clec2g Clec2f Clec2f
Functional Annotations Clec2f (MGI:3522133) Clec2g (MGI:1918059)
KOMP 10:51199649..51217200 Gp49a and Lilrb4 www.knockoutmouse.org In Ensembl, this gene model is associated only with Lilrb4. In MGI we associated it with Gp49a.
11:62630999..62696530 Trim16 and Fbxw10 www.knockoutmouse.org
Acknowledgements
• Deanna Church
• Donna Maglott
• Laurens Wilming
• Joel Richardson
• Yunzia “Sophia” Zhu
• Ken Frazer
• TBK Reddy
• Bob Sinclair
• Deb Reed
• Richard Baldarelli
• Paul Flicek
• Steve Searle
NIH HG00330-P1
Smgc and Muc19 15:91663946..91769797