ENCODE Pseudogene Annotation Subgroup: Summary of Thurs. 16-Sept Call
summarized by M Gerstein, 16-Sept
Participating groups: Havana, IMIM, UCSC, Yale, GIS, Affy

Overall Goals of Pseudogene Subgroup
• Create consensus ENCODE pseudogene annotation
• Agree on defining elements of a pseudogene
• What is the degree to which pseudogenes confound gene annotation? How many are close or distal to genes?
• Cross-reference this annotation against ENCODE experiments
• How many pseudogenes have some functional "activity"? How many are transcribed?
• How many are associated with TARs & transfrags? CAGE & ditags? ChIP-chip binding sites?
• Cross-hybridization problem

Intersection of Pseudogenes from 3 Groups
[Venn diagram; region counts as extracted: 42, 45, 35, 21, 86, 87, 87, 18, 17, 18, 16, 22]
Havana-Gencode: 167 pseudogenes
Yale: 184 pseudogenes
UCSC retrogenes: 15 expressed (7-8 pseudogenes) + 143 not expressed (all pseudogenes)
86 Havana pseudogenes overlap at least one Yale pseudogene, and 87 Yale pseudogenes overlap at least one Havana pseudogene (likewise for retrogenes). These are global counts: in some loci three Havana pseudogenes may overlap a single Yale pseudogene, while in other loci several Yale pseudogenes overlap one Havana pseudogene.
[Provided by France Denoeud (IMIM)]

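The asymmetry in the counts above (86 vs. 87) follows from counting, for each set, how many of its members overlap at least one member of the other set. A minimal sketch of that counting, assuming inclusive coordinates and a simple (region, start, end) tuple; the toy intervals and function names are hypothetical and not taken from the groups' pipelines:

```python
from typing import List, Tuple

# (ENCODE region, start, end); coordinates assumed inclusive
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals lie in the same region and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query: List[Interval], target: List[Interval]) -> int:
    """Number of query intervals that overlap at least one target interval."""
    return sum(1 for q in query if any(overlaps(q, t) for t in target))


# Hypothetical toy data: three Havana calls fall inside one long Yale call
havana = [("ENm002", 100, 200), ("ENm002", 250, 300),
          ("ENm002", 350, 400), ("ENm007", 10, 50)]
yale = [("ENm002", 90, 420), ("ENm009", 1, 20)]

havana_hits = count_overlapping(havana, yale)  # Havana calls hitting any Yale call
yale_hits = count_overlapping(yale, havana)    # Yale calls hitting any Havana call
print(havana_hits, yale_hits)
```

With this toy data the two directions give different counts, which is the same effect described in the note from IMIM.
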
"Yale-only" Pseudogenes: 5 Examples
[Venn diagram inset; region counts as extracted: 48, 49, 30, 15, 87, 87, 87, 17, 18, 16, 11, 29]
Remove 12, but some tricky issues, i.e. 12, 99, 152, 169, 108

>15 ENm002 831244 831480 237 IPI:IPI00442001 259..337 pexons: 1 235
No disablement, overlap exon
FHALVVLSWPHVLELLPQRNPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPISDLLKKAQV
FHALVVLSWPHVLELLPQRNPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPISDLLKKAQV
FHALVVLSWPHVLELLPQRNPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPISDLLKKAQV
FHALVVLSWPHVLELLPQRNPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPISDLLKKAQV

>70 ENm007 381109 381518 410 IPI:IPI00448927 239..330 frameshift=1
ENm007 381109 381518 pexons: -404 -147
Frameshift
SKKPSLSVQPGPVMAPGESLTLHCVSDVGYDRFVLYKEGERDLRQLPGRQPQAGLSQANFTLGPVSRSYGGQYRCYGAHNLSSECSAPSDP
SPQPSLSAQPGSPVLSGDSLTPQHHSEAGFDSSALTR-----TR!LPARQRLDGQHLLDVPLGHASHPPGGQHRCCGGHNASCPRSVPRRP
PGVSKKPSLSVQPGPVMAPGESLTLHCVSDVGYDRFVL-YKEGERDLRQLPGRQPQAGLSQANFTLGPVSRSYGGQYRCYGAHNLSSECSA
PG-SPQPSLSAQPGSPVLSGDSLTPQHHSEAGFDSSAL/YQD-----KGLPARQRLDGQHLLDVPLGHASHPPGGQHRCCGGHNASCPRSV
PSDPLDILITGQIRGT-----PFISVQPG
PRRPHPTSWL-QVRGPYPDPIPFSALDPG

>122 ENm009 367441 368389 949 IPI:IPI00465221 1..305
ENm009 367441 368389 pexons: -949 -37
No disablement, overlap exon
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI.........WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI.........WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI.........WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI.........WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH

>205 ENr223 185680 201963 16284 IPI:IPI00023543 110..588 2.78 intron=4 stop=2 frameshift=5
pexons: 3383 3620 3826 4462 12565 12865 12917 13099 13459 13551
Disablements, have introns, probably duplicated, overlap exon

>177 ENr122 359278 362468 3191 IPI:IPI00029222 980..1118 0.87 intron=0 stop=0 frameshift=2
ENr122 359278 362468 2768 3191 pexons: 2768 3191
Multiple frameshifts, overlap exon
LGNTIQDIGMGKDFMTKTPKAMATKVKIDRWDLIKLKSFCTAKETTIRVNRQPTKWEKIFAIYSSDKGLISRIYNE---LKQIYKKKTNNPIKKWAKDMNRHPSKEDIYAAKKHMKKCSSSLAIREMQIKTTMRYHLTPVR
LGNNILDTGFGKYFMTKMPKAIATETKIEIWDISKLK!FCRAKETINSVNRQPIEMEKIFANYASDRGLISRIY!KKTNLNLQAKTKQHNSIKKWPKDMDRHFSKDDICVANKPRKTLPTSLIIREIQIKTMMRYHLTPFR
IKTLEKNLGNTIQDIGMGKDFMTKTPKAMATKVKIDRWDLIKLK-SFCTAKETTIRVNRQPTKWEKIFAIYSSDKGLISRIY---NELKQIYKKKT-NNPIKKWAKDMNRHPSKEDIYAAKKHMKKCSSSLAIREMQIKTTMRYHLTPVR
VRLLYALLGNNILDTGFGKYFMTKMPKAIATETKIEIWDISKLK/SFCRAKETINSVNRQPIEMEKIFANYASDRGLISRIY*KKNKLKFTSKNQT\NNSIKKWPKDMDRHFSKDDICVANKPRKTLPTSLIIREIQIKTMMRYHLTPFR

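The example headers above carry disablement counts as key=value fields (intron=, stop=, frameshift=). A minimal sketch of pulling those counts out of a header line; the field layout is inferred only from the lines shown on these slides, and treating a missing field as zero is an assumption rather than a documented convention of the Yale pipeline. The function names are hypothetical.

```python
import re
from typing import Dict

FIELDS = ("intron", "stop", "frameshift")


def parse_disablements(header: str) -> Dict[str, int]:
    """Extract intron/stop/frameshift counts from a header line.

    Assumption: a field that does not appear in the header counts as 0.
    """
    counts = {name: 0 for name in FIELDS}
    for name, value in re.findall(r"\b(intron|stop|frameshift)=(\d+)", header):
        counts[name] = int(value)
    return counts


def describe(counts: Dict[str, int]) -> str:
    """Short label in the spirit of the slide annotations."""
    if counts["stop"] == 0 and counts["frameshift"] == 0:
        return "no disablement"
    parts = []
    if counts["frameshift"]:
        parts.append(f"{counts['frameshift']} frameshift(s)")
    if counts["stop"]:
        parts.append(f"{counts['stop']} stop(s)")
    return ", ".join(parts)


# Header lines copied from the examples above
header_15 = ">15 ENm002 831244 831480 237 IPI:IPI00442001 259..337 pexons: 1 235"
header_177 = (">177 ENr122 359278 362468 3191 IPI:IPI00029222 980..1118 0.87 "
              "intron=0 stop=0 frameshift=2")
header_205 = (">205 ENr223 185680 201963 16284 IPI:IPI00023543 110..588 2.78 "
              "intron=4 stop=2 frameshift=5")

for header in (header_15, header_177, header_205):
    print(header.split()[0], describe(parse_disablements(header)))
```

A summary like this could help sort candidates into "no disablement" cases, which are the ones most likely to need manual review against exon overlap.
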
"Havana-only" Pseudogenes: 5 Examples
[Venn diagram inset; region counts as extracted: 48, 49, 30, 15, 87, 87, 87, 17, 18, 16, 11, 29]
Similar discussion for "UCSC only"

>12 ENm002 242882 243044 163 IPI:IPI00017094 2359..2399 0.26
ENm002 242882 243044
FARASKEQKDKFLKNRGFSLLANQLYLHRGTQELLECFIE
FSRPSKKQKDKFLK-YSFSLLANQLFLHQEIQELTDSFIK
LDAYFARASKEQKDKFLKNRGFSLLANQLYLHRGTQELLECFI-EMFFGRHIGLDE
FEA*FSRPSKKQKDKFLK-YSFSLLANQLFLHQEIQELTDSFI/EMFFG*CTGLDE

>56 ENm006 1293946 1313338 19393 IPI:IPI00384823 1..1276 8.3 intron=0 stop=7 frameshift=9

>125 ENm009 424525 425472 948 IPI:IPI00022766 1..282
MYIVAVAGNIFLIFLIMTERSLHEPLYLFLSMLASANFLLAAAAAPEVLAILWFH.........KQIKDRVILLFSPISVCC
MYIVAVAGNIFLIFLIMTERSLHEPMYLFLSMLASADFLLATAAAPKVLAILWFH.........KQIKDRVILLFSPISVCC
MYIVAVAGNIFLIFLIMTERSLHEPLYLFLSMLASANFLLAAAAAPEVLAILWFH.........KQIKDRVILLFSPISVCC
MYIVAVAGNIFLIFLIMTERSLHEPMYLFLSMLASADFLLATAAAPKVLAILWFH.........KQIKDRVILLFSPISVCC

>103 ENm008 153121 155155 2035 IPI:IPI00217473 8..143 intron=2 stop=0 frameshift=0
pexons: 25 96 1360 1566 1906 2033
TIIVSMWAKISTQADTIGTETLE LFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSI
TIIVSMWAKISTQADTIGTETLE R:R[agg] LFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSI
DDIGGALSKLSELHAYILRVDPVNFK LLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
DDIGGALSKLSELHAYILRVDPVNFK LLSHCLLVTLAARFPADFTAEAHAAWAKFLSVVSSVLTEKYR
RLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKL
RLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKV

>174 ENr121 322430 366341 43912 IPI:IPI00384823 1..1276 intron=0 stop=2 frameshift=2
ENr121 322430 366341 38663 42482 pexons: 38663 42482
MTGSNSHITILTLNINGLNSAIKRHRRASWIKSQDPSVCCIQET...

Pseudogenes Overlapping Gencode Exons
[Venn diagram; region counts as extracted: 122, 28, 30, 124, 13, 12, 20, 2]
Havana-Gencode: 167 pseudogenes
Yale: 184 pseudogenes
Havana-Gencode exons: 17603

49 GIS Pseudogenes, Not Yet Fully Compared
• The 49 non-redundant ENCODE processed pseudogenes were used for comparison with pseudogenes from the Yale, Vega, and Ensembl groups.
• 4 pseudogenes were uniquely found in the two libraries.
[Venn diagram labels as extracted: GIS-PET (4), Yale (12), Vega (5), Ensembl (3); other region counts: 20, 2, 2, 1]
[From GIS]

Browser Tracks [R Baertsch, UCSC]
Pseudogene track
A processed pseudogene at chr21:33775699-33776428
genome-test.cse.ucsc.edu/ENCODE/encode.html

Overall short-term goal for next call:
Come up with a consensus list of pseudogenes suitable for carefully checking for transcription (perhaps by RT-PCR)

Immediate ToDo's for Next Call
• Classify pseudogenes as processed & non-processed (with a third "not sure" category)
• Venn diagrams in each category
• Need to add to our current 87 consensus
• Among duplicated pseudogenes: Determine Yale/Havana consensus, add to 87
• Among processed pseudogenes: Merge in 49 from GIS
• Each group should determine which of its pseudogenes not in the consensus it still wants to keep and repost them to list
• Update list summary and UCSC browser
• List summary web page (maintained by Deyou, http://homes.gersteinlab.org/people/zhengdy/cgi-bin/encode-pgene.cgi )
• Flag truly tricky ones as questionable to be returned to later (e.g. #169, OR ex. truncated at 6TM)

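The per-category Venn diagrams and the consensus count can be sketched with plain set operations once each group's calls have been mapped to shared locus identifiers. That mapping step is assumed here and is not described in these notes; the identifiers and the function name below are hypothetical toy values.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(groups: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Count loci in each Venn region, keyed by the exact set of groups calling them."""
    regions: Dict[FrozenSet[str], int] = {}
    all_loci = set().union(*groups.values())
    for locus in all_loci:
        members = frozenset(name for name, loci in groups.items() if locus in loci)
        regions[members] = regions.get(members, 0) + 1
    return regions


# Hypothetical locus identifiers for one category (e.g. processed pseudogenes)
processed = {
    "Havana": {"p1", "p2", "p3"},
    "Yale": {"p2", "p3", "p4"},
    "UCSC": {"p3", "p5"},
}

regions = venn_regions(processed)
consensus = regions.get(frozenset(processed), 0)  # loci called by all groups
havana_only = regions.get(frozenset({"Havana"}), 0)
print(consensus, havana_only)
```

Running the same function once per category (processed, non-processed, "not sure") would give one set of region counts per Venn diagram.
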
Browser ToDo's for Next Call
• Send alignments to Rob so he can link to browser
• A clear coloring scheme for differentiating processed vs non-processed pgenes
• UCSC will index by names used by the different groups
• Create an additional fourth sub-track for consensus pseudogenes
• Perhaps an additional track for prominent disagreements, i.e. questionable pseudogenes (or another color)
• Small fix on Gencode "pseudogene" track

Remaining Issues
• Of the consensus pseudogenes, determine unique sequences for RT-PCR or matching against probes
• Remaining questions:
• How are we going to arrive at agreed-upon boundaries for pseudogenes (start and stop)?
• What is the best for alignments, cDNA or protein?
• (Given that complete cDNA info is not available for everything, perhaps best to stick to proteins initially.)

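The boundary question is still open. Purely as an illustration of two candidate conventions, and not a decision of the subgroup, the sketch below computes the union (outermost) and intersection (shared core) of overlapping calls for one locus. The first interval uses the ENm002 coordinates from the ">15" example; the second is a hypothetical call from another group.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), coordinates assumed inclusive


def union_bounds(calls: List[Span]) -> Span:
    """Outermost boundaries: leftmost start and rightmost end across all calls."""
    return (min(start for start, _ in calls), max(end for _, end in calls))


def intersection_bounds(calls: List[Span]) -> Optional[Span]:
    """Shared core: the region covered by every call, or None if there is none."""
    start = max(s for s, _ in calls)
    end = min(e for _, e in calls)
    return (start, end) if start <= end else None


calls = [
    (831244, 831480),  # coordinates from the ">15 ENm002" example
    (831200, 831450),  # hypothetical call from a second group
]

outer = union_bounds(calls)
core = intersection_bounds(calls)
print(outer, core)
```

The union keeps everything any group annotated, while the intersection keeps only what all groups agree on; which is preferable may depend on whether the consensus is meant for probe design or for annotation.
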