Comparing the EuGene annotation of the JAZZ & ARACHNE assemblies
Jan WUYTS, jan.wuyts@vib.be
[Figure: EuGene overview diagram]
• Input: genome sequence
• Evidence and databases: Poplar proteins, other At proteins, other plant proteins, SwissProt, PIR, Arabidopsis FLcDNA supported proteins, Poplar RepBase, Poplar cDNA & EST, Arabidopsis genome
• Tools: TBlastx, Blastn, Blastx, RepeatMasker, SpliceMachine
• Signals and content: content potential for coding, intron and intergenic; Poplar IMM; splice sites; start ATG; translation start site prediction
• Module classes: extrinsic modules, intrinsic modules
• Core: EuGene DAG
• Output: gene models, e.g. join(9265..9395,9749..99342) and complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001))
EuGene annotation
• same parameters as for the last version of the previous assembly
• TE masking:
  • 84 TEs in the previous version
  • 290 now (thanks, Hadi Quesneville!)
• annotated genes: 20614 (Jazz), 18578 (Arachne)
[Figure: fraction of nucleotides annotated as coding, per window of 25000 nt; y-axis from 0% to 100%, x-axis from 0 to 6.8M]
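The windowed coding fraction plotted on this slide can be sketched as follows. This is only an illustration: it assumes consecutive, non-overlapping windows and 1-based inclusive CDS coordinates, and the function name `coding_fraction_per_window` is hypothetical, not part of EuGene.

```python
def coding_fraction_per_window(seq_length, cds_intervals, window=25000):
    """Fraction of coding nucleotides in each consecutive window.

    cds_intervals: iterable of (start, end), 1-based inclusive (assumption).
    Windows are non-overlapping (assumption); the last one may be shorter.
    """
    n_windows = (seq_length + window - 1) // window
    coding = [0] * n_windows

    # Merge overlapping or adjacent intervals so no nucleotide is counted twice.
    merged = []
    for start, end in sorted(cds_intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    for start, end in merged:
        # Clip to the sequence, then split the interval across window borders.
        pos = max(start, 1)
        end = min(end, seq_length)
        while pos <= end:
            w = (pos - 1) // window
            w_end = min(end, (w + 1) * window)
            coding[w] += w_end - pos + 1
            pos = w_end + 1

    fractions = []
    for w in range(n_windows):
        w_len = min(window, seq_length - w * window)
        fractions.append(coding[w] / w_len)
    return fractions
```

Called with `window=25000` on a scaffold and its annotated CDS coordinates, the returned list gives one value between 0 and 1 per window, which is the quantity shown on the y-axis.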
Size distribution
• compare the Jazz and Arachne assemblies
• fraction confirmed by ESTs: CDS covered by at least 200 (100) nt with %id >= 95%
• fraction confirmed by BLASTp to UniProt: protein covered for >= 75% of its length by a BLASTp hit (e <= 1e-5)
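The two confirmation criteria above could be checked along these lines. This is a minimal sketch, not the procedure actually used: it assumes hits are given as 1-based inclusive intervals on the predicted CDS or protein, and that coverage is the union of all hits passing the threshold (the slide may mean a single hit). All function names are hypothetical.

```python
def covered_length(intervals):
    """Number of positions covered by the union of 1-based inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def confirmed_by_est(est_hits, min_covered=200, min_identity=95.0):
    """est_hits: list of (cds_start, cds_end, percent_identity)."""
    good = [(s, e) for s, e, pid in est_hits if pid >= min_identity]
    return covered_length(good) >= min_covered


def confirmed_by_protein(protein_length, blastp_hits,
                         min_fraction=0.75, max_evalue=1e-5):
    """blastp_hits: list of (query_start, query_end, evalue)."""
    good = [(s, e) for s, e, ev in blastp_hits if ev <= max_evalue]
    return covered_length(good) >= min_fraction * protein_length
```

Passing `min_covered=100` gives the alternative threshold shown in parentheses on the slide.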
Small peptides
• ~50% have corresponding ESTs, but no homolog in UniProt
• what are they?
  • split genes?
  • pseudogenes?
  • mis-assembled regions (artificial duplications)?
  • TE-derived sequences?
  • non-coding RNAs (anti-sense regulation)? (would these be sequenced as ESTs? poly-A?)
  • other??
Best reciprocal BLAST hits
[Figure: best reciprocal BLAST hits per species; total: 5323]
• Homo sapiens
• Arabidopsis thaliana
• Dictyostelium discoideum
• Schizosaccharomyces pombe
• Coprinus cinereus
• Cryptococcus neoformans
• Yarrowia lipolytica
• Magnaporthe grisea
• Neurospora crassa
• Emericella nidulans
• Aspergillus fumigatus
• Gibberella zeae
• Ustilago maydis
• other
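For readers unfamiliar with the term, a best reciprocal hit can be computed roughly as below. This is a generic sketch, assuming hits are available as (query, subject, bit score) tuples from two opposite BLAST searches; the function names are hypothetical and nothing here reflects the exact pipeline behind the slide.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Counting the resulting pairs per subject species would give a per-species breakdown like the one on this slide.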
Collinearity of the 2 assemblies
• !! the next slides have no biological relevance !!
• how do the two assemblies compare?
  • Jazz (pasting) Arachne
  • Jazz (cutting & pasting) Arachne
• identify collinear regions using ADHoRe
• don’t take orientation into account
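To make the idea of a collinear block concrete, here is a deliberately simple stand-in. It is not the ADHoRe algorithm and does not use its interface. It assumes each matched gene is given as a pair (rank on a Jazz scaffold, rank on an Arachne scaffold), works on one scaffold pair at a time, and reads "don't take orientation into account" as allowing the Arachne rank to run in either direction; the name `collinear_blocks` and the `max_gap` parameter are hypothetical.

```python
def collinear_blocks(anchors, max_gap=1):
    """Group matched genes into collinear blocks.

    anchors: list of (jazz_rank, arachne_rank) for genes matched between
    the two assemblies on a single scaffold pair (assumption).
    Consecutive anchors, sorted by Jazz rank, stay in the same block while
    both ranks move by at most max_gap + 1 positions. The direction of the
    Arachne rank is not checked, so forward and reversed runs both count.
    """
    blocks = []
    for anchor in sorted(anchors):
        if blocks:
            prev = blocks[-1][-1]
            d_jazz = anchor[0] - prev[0]
            d_arachne = abs(anchor[1] - prev[1])
            if 1 <= d_jazz <= max_gap + 1 and 1 <= d_arachne <= max_gap + 1:
                blocks[-1].append(anchor)
                continue
        # Too far from the previous anchor: start a new block.
        blocks.append([anchor])
    return blocks
```

With this representation, `len(block)` is the number of matched genes in a block, which corresponds to the "block length ~ #annotated genes" measure used on the following slides.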
[Figure: pasting, Jazz vs. Arachne; block length ~ number of annotated genes]
[Figure: cutting & pasting, Jazz vs. Arachne; block length ~ number of annotated genes]
[Figure: cutting & pasting, Arachne vs. Jazz and Jazz vs. Arachne; block length ~ number of annotated genes]
Acknowledgements
• Stephane Rombauts
• Pierre Rouzé
• Yves Van de Peer
• everybody in the bioinformatics group in Gent
• Francis Martin
• everybody from the research group in Nancy
• all manual annotators