Many to 1 Gene Associations
The following slides show a few examples of gene predictions by one annotation group that overlap one or more genes from another group. Some of the examples also illustrate issues related to
- differences in annotation type (e.g., pseudogene versus gene), and
- confusing nomenclature (e.g., different genes assigned the same official gene name).
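The slides that follow label each example with a region written as chromosome:start..end. As a rough illustration of what "overlap" means here, the sketch below parses that notation and lists which gene models from a second annotation group overlap a query model. It is a minimal sketch under stated assumptions: the names `Region`, `parse_region`, `overlaps`, and `overlap_partners` are hypothetical, coordinates are treated as inclusive, strand is ignored, and the gene model coordinates in the demo are made up (only the region string comes from the slides).

```python
import re
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Matches the "chrom:start..end" notation used on the slides, e.g. "2:110788585..110968584".
_REGION_RE = re.compile(r"^(\w+):(\d+)\.\.(\d+)$")


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start..end' string into a Region (inclusive coordinates assumed)."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start..end region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return Region(chrom, start, end)


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlap_partners(query: Region, others: Dict[str, Region]) -> List[str]:
    """Names of the gene models in `others` that overlap `query`, sorted by name."""
    return sorted(name for name, region in others.items() if overlaps(query, region))


if __name__ == "__main__":
    # Region string taken from a slide; the two gene models below are hypothetical.
    window = parse_region("2:110788585..110968584")
    group_b = {
        "hypothetical_gene_1": Region("2", 110790000, 110850000),
        "hypothetical_gene_2": Region("2", 110900000, 110960000),
    }
    partners = overlap_partners(window, group_b)
    print(f"{len(partners)} overlapping model(s): {partners}")
```

A query model with two or more partners is a candidate "one gene or two?" case. Coordinate overlap alone does not settle the question, which is why the slides go on to look at the transcript evidence.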
One gene or two? 2:110788585..110968584 Orientation issue for OTT15152?
8:4238129..4254528 One gene or two?
One gene or two? 3:105659594..105759593
11:69491920..69516919 One gene or two or three?
5:106920574..107155573 One gene or two or three?
One gene or two? 6:145313224..145563223 The VEGA gene model seems to unite two separate NCBI gene models; see the mRNA.
One gene or two? 9:15109186..15189185
7:127057560..127247315 One gene or two?
7:52670474..52680473 One gene or two? Has 257 aa upstream CDS. EST CX236436. Another joining variant (rat mRNA U25653, mouse EST CF172660), displaying upstream CDS (not actually annotated like that). None of the evidence (mouse or rat) shows a distinct upstream gene at the moment.
4:146600055..146731054 One gene or two? It’s a heavily duplicated region: 10666 is more or less a duplicate of 10670.
n:m ENSMUSG00000050714 and ENSMUSG00000066798 overlap OTTMUSG00000012648 and OTTMUSG00000012652. 2:37243166..37343165 Zbtb26 and Zbtb6 share a 5’ UTR exon but have non-overlapping CDSs.
n:m ENSMUSG00000074643 overlaps ENSMUSG00000038171 OTTMUSG00000016087, OTTMUSG00000016088, and OTTMUSG00000019746. 2:155895575..155939706 Annotator notes: already part of Cpne1 (19746); coding regions don’t overlap; moved to Cpne1 (19746).
6:113326848..113366847 n:m OTTMUSG00000017554 overlaps OTTMUSG00000016376, EG68089, and EG101100. The overlaps are limited to UTR/non-coding regions.
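The n:m cases above arise when overlaps chain together: one model from group A overlaps two models from group B, one of which overlaps a further model from group A, and so on. One way to make such groupings explicit is to treat each overlap as an edge and collect connected components. The sketch below is only an illustration, not the method used by any of the annotation groups. The function names and the gene identifiers in the usage note are hypothetical, and the input is assumed to be a precomputed list of (group A id, group B id) overlap pairs.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def group_associations(pairs: Iterable[Pair]) -> List[Tuple[List[str], List[str]]]:
    """Cluster (a_id, b_id) overlap pairs into connected groups using union-find.

    Returns one (sorted A ids, sorted B ids) tuple per group.
    """
    pairs = list(pairs)
    parent: Dict[Hashable, Hashable] = {}

    def find(node: Hashable) -> Hashable:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Tag ids with their group so the same string in A and B cannot collide.
    for a_id, b_id in pairs:
        root_a, root_b = find(("A", a_id)), find(("B", b_id))
        if root_a != root_b:
            parent[root_a] = root_b

    members: Dict[Hashable, Tuple[Set[str], Set[str]]] = defaultdict(lambda: (set(), set()))
    for a_id, b_id in pairs:
        a_set, b_set = members[find(("A", a_id))]
        a_set.add(a_id)
        b_set.add(b_id)

    return sorted((sorted(a_set), sorted(b_set)) for a_set, b_set in members.values())


def association_label(group: Tuple[List[str], List[str]]) -> str:
    """Label a group as 1:1, 1:m, n:1, or n:m from the number of ids on each side."""
    a_ids, b_ids = group
    left = "1" if len(a_ids) == 1 else "n"
    right = "1" if len(b_ids) == 1 else "m"
    return f"{left}:{right}"
```

For example, the hypothetical pairs `[("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a3", "b3")]` fall into two groups: a1 and a2 with b1 and b2 (labelled n:m), and a3 with b3 (labelled 1:1).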
8:47538900..47638899 Are EG667337 and EG14081 different genes? Can’t see any evidence for that structure.
7:126992968..127042967 Are EG233805 and EG1000043396 different genes?
6:122655579..122665578 Are EG71950 and EG100038891 different genes? Can’t see any evidence for that micro-intron.
16:84828048..84836547 Are EG11957 and EG100039950 different genes?
7:87385985..87410984 Are EG61000042379 and EG269954 different genes?
9:43622106..43900000 Are EG61000042548 and EG21838 different genes?
13:22073239..22080498 Are OTT00466 and OTT13227 different genes? A new mRNA, BC097347, has appeared which extends the 5’ end to include the ATG of 00466 (similar to human POM121). So now they’re variants of the same locus, 00466, even though they don’t share a splice.
4:122937497..122988738 Are OTT08975 and OTT08978 different genes? Yes. They share a splice, so 08975 is now a variant of 08978.
3:94933437..94938148 Are OTT22306 and OTT19657 different genes? Yes. I’ve put them all under 22306 (Scnm1). But there’s more to this picture: in BL/6 this gene is possibly a pseudogene because of a strain-specific premature stop codon about 30 bp from the end of the penultimate exon, supported by mRNA AK013948.
3:107728458..107736457 Are OTT25890 and OTT07101 the same gene? Yes. Made 25890 part of 07101.
1:172123537..172148536 Are OTT21542 and OTT21543 different genes? Yes. But made 21542 an artefact: 7 bp of the mRNA is repeated on the genomic sequence around the ggc and cag “splices”.
1:173164903..173177002 Are OTT21571 and OTT21573 different genes? No. Already fixed
2:90744183..90753182 Are OTT14319 and OTT14315 different genes? I’ve made OTT14315 part of 14319. Normally when the transcripts don’t share a splice, they’re kept separate. 14315 is based on EST AV283747. I’ve found its companion, BY716767. Aligning it against the BAC, it matches the first exon of transcript 33789 and very vaguely the second exon as well. Oddly, the homology is very weak, while AV283747 is a 100% match?
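Several of the slides above use the rule of thumb that transcripts sharing a splice are treated as variants of one locus, while transcripts that don’t are normally kept separate. A minimal sketch of that check follows. It assumes that "sharing a splice" means sharing a whole intron (same donor and acceptor coordinates), that exons are given as (start, end) pairs on one chromosome and strand, and that the function names are hypothetical. A looser reading, where a single shared splice site is enough, would compare donor and acceptor positions separately.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]    # (start, end), inclusive coordinates assumed
Intron = Tuple[int, int]  # (end of upstream exon, start of downstream exon)


def introns(exons: List[Exon]) -> Set[Intron]:
    """Splice junctions implied by a transcript's exons, as (exon end, next exon start)."""
    ordered = sorted(exons)
    return {(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)}


def share_splice(transcript_a: List[Exon], transcript_b: List[Exon]) -> bool:
    """True if the two transcripts have at least one identical intron."""
    return bool(introns(transcript_a) & introns(transcript_b))
```

With the hypothetical exon lists `[(100, 200), (300, 400), (500, 600)]` and `[(250, 400), (500, 700)]`, the check returns True because both contain the junction from 400 to 500. As the slides show, curators still override the rule in either direction when other evidence is stronger.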
4:42236997..42261996 Are ENS78738 and ENS78736 different genes? Are the predicted genes new members of the chemokine (C-C motif) ligand family? In Ensembl, multiple gene predictions are assigned to the same gene symbol/MGI ID.
15:79611961..79691960 One gene or two or three? Case to be made for all three options! The merged transcripts are currently annotated as part of Nptxr, as the proportion of that CDS is bigger. The option to make it three genes is attractive. Are Nptxr and Cbx6 overlapping? Artefact (has two non-splices).
2:120535197..120698446 One gene or two? Are Cdan1 and Ttbk2 overlapping? cDNA AK220258: retained intron (in the Cdan1 portion), and apart from that the CDSs do not join up anyway. Both loci have their own CpG and pA features.
One gene or two? X:9598695..9848694 Srpx and Rpgr Overlapping? cDNA BC036959 and AK046821; last exon is in frame with 2nd coding exon of Srpx, but continues beyond exon to end in pA features.
One gene or two? 2:181092767..181132366 Zgpat and Lime1 overlapping? EST BQ552943: CDSs in-frame. Not shown: cDNA BC034599, which contains all exons of both genes, but because the joining splice is beyond the Zgpat 3’ UTR (in the Lime1 5’ UTR), it is NMD (nonsense-mediated decay). Both loci have pA features. mRNA AK173276: retained intron.
One gene or two? 5:31435474..31485473 Mpv17 and Gtf3c2 overlapping? CDSs are in-frame; based on cDNA AK138760. pA features and CpG island. Mpv17 is very conserved in human, rat, cow, frog, and zebrafish (same length +/- 1 aa; >70% identity), but has no CpG island of its own. CDSs are in-frame, but additional variation in a downstream exon would cause NMD; based on EST AA111369.
16:96582252..96792251 One gene or two? Are Pcp4 and Igsf5 two different genes? Also in rat, human. cDNA AK164699 100%.
In Ensembl currently it looks as though Pcp4 and Igsf5 are considered synonyms for the same gene?
6:87895874..87954921 One gene or two? The NCBI gene is a pseudogene; the Ensembl gene is a protein-coding gene.
13:75781991..75782990 Protein coding gene Pseudogene Protein coding gene
14:3046445..3080444 Pseudogene Protein coding gene
6:128882645..128993644 Retrotransposed vs pseudogene Pseudogene Retrotransposed Pseudogene
Gene Family Challenges
Gene families present many challenges to determining equivalency among gene predictions and for nomenclature. Examples from four gene families are shown in the following slides:
- killer cell lectin-like receptor (Klra) family
- UDP glucuronosyltransferase 1 family
- cysteine-rich perinuclear theca
- C-type lectin domain family 2
6:129837719..130337718 killer cell lectin-like receptor (Klra) family
6:130198815..130298814 killer cell lectin-like receptor (Klra) family. Gene identity crisis! Stop codon supported by 100% cDNA. Figure labels: stop codon; protein coding gene; protein coding gene; pseudogene transcript; pseudogene.
6:130275414..130375413
• Overlapping NCBI annotation
• Overlapping features of different types: 1. Protein coding gene (currently a pseudogene in otter?!) 2. Pseudogene
1:89943192..90125441 UDP glucuronosyltransferase 1 family. Ensembl maintains a single gene ID for all of the members of the family.
9:24428665..24431164 cysteine-rich perinuclear theca Gene identity crisis!
6:128882645..128993644 C-type lectin domain family 2. Ensembl and VEGA predict only a single gene with multiple transcripts rather than two genes, Clec2g and Clec2f.
Vega hasn’t annotated Clec2f, period. In actual fact that gene doesn’t exist as such. The “Clec2f” locus is a partial duplication of the Clec2g locus (last four exons). Though the duplicate exons have diverged from the parent, they are still open. However, there is no trace of the first exon and no locus-specific transcriptional evidence. We would annotate this as an unprocessed pseudogene. The three-exon gene between Clec2g and “Clec2f” actually overlaps another Clec2 pseudogene (in this case a duplication of the last three exons). And just 200 bp further there’s another Clec2 pseudogene, consisting of a duplication of the penultimate exon broken into two fragments plus part of the last exon. This pseudogene overlaps the big terminal exon. (Figure labels: Clec2g; pseudo; pseudo; “Clec2f”; pseudo.)
Clec2g “Clec2f”
Unique to MGI
MGI does not have a high-throughput computational genome annotation pipeline. However, we integrated the results of high-throughput cDNA sequencing projects into the database prior to the availability of the mouse genome. Many of these genes have remained unique to MGI. The following slides illustrate several cases where MGI has a gene that has not been predicted by one of the three major annotation groups. Many (most) of these MGI-unique genes are from the RIKEN cDNA sequencing initiative. Many of them likely represent non-protein coding genes.