Analysis of the RNAseq Genome Annotation Assessment Project
by Subhajyoti De
The RNAseq Genome Annotation Assessment Project (RGASP) aims to assess the current progress of automatic gene building using RNAseq as its primary dataset.
The RNAseq Genome Annotation Assessment Project
• The RGASP aims to assess the current progress of automatic gene building using RNAseq as its primary dataset.
• More specifically, we aim to evaluate the status of computational methods to
  • map human RNAseq data,
  • assemble them into transcripts, and
  • quantify the abundance of those transcripts in particular datasets.
• Promising transcript predictions not covered by the Gencode annotation will be validated by experimental methods.
Outline:
• Introduction and a summary of submissions
• Analysis methodology
• Nucleotide level analysis
• Exon level analysis
• Transcript level analysis
• Missing and wrong genes
Introduction and a summary of submissions
• 3 species: human, worm and fly.
• Multiple RNA-seq datasets for each organism.
• 15 submitters.
• 304 submissions.
Analysis methodology
• We carried out independent evaluations for the coding portions of the mRNA transcripts (CDS-focused) and for the mRNA transcripts as a whole (mRNA-focused).
• Analysis was carried out at multiple levels:
  • Nucleotide level
  • Exon level
  • Transcript level
• For each level, we calculated the sensitivity and specificity of the predictions (as discussed later). As a summary measure, we also report the average of the two statistics.
Nucleotide level analysis
Sensitivity = (number of annotated nucleotides correctly predicted) / (number of annotated nucleotides in the annotation set)
Specificity = (number of predicted nucleotides that are also annotated) / (number of predicted nucleotides in the prediction set)
[Figure: overlap between the annotation set and the prediction set, marking true positives, false positives and false negatives.]
Nucleotide level analysis
Points to note:
• Nucleotide predictions had to be on the same strand as the annotations to be considered correct.
• Individual nucleotides present in multiple transcripts, in either the annotation or the predictions, are counted only once.
• As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
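The nucleotide level measures can be sketched in a few lines of Python. This is a minimal illustration, not the RGASP evaluation code: the `(chromosome, strand, start, end)` tuple layout, the 1-based inclusive coordinates, and the function names `covered_bases` and `nucleotide_accuracy` are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical exon layout: (chromosome, strand, start, end),
# with 1-based inclusive coordinates (an assumption for this sketch).
Exon = Tuple[str, str, int, int]


def covered_bases(exons: Iterable[Exon]) -> Set[Tuple[str, str, int]]:
    """Expand exons into a set of (chromosome, strand, position) bases.

    A set is used so that a base shared by several transcripts is
    counted only once, and the strand is part of the key so that
    predictions on the opposite strand never match.
    """
    bases = set()
    for chrom, strand, start, end in exons:
        for pos in range(start, end + 1):
            bases.add((chrom, strand, pos))
    return bases


def nucleotide_accuracy(annotation, prediction):
    """Return (sensitivity, specificity, average) at the nucleotide level."""
    annotated = covered_bases(annotation)
    predicted = covered_bases(prediction)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Two overlapping annotated exons cover bases 101..250 on the + strand.
annotation = [("chr1", "+", 101, 200), ("chr1", "+", 151, 250)]
# One prediction on the + strand and one on the - strand.
prediction = [("chr1", "+", 151, 300), ("chr1", "-", 101, 200)]

# sensitivity = 100/150, specificity = 100/250
sn, sp, avg = nucleotide_accuracy(annotation, prediction)
print(sn, sp, avg)
```

Expanding every base into a set is simple but memory-hungry; for whole genomes an interval-based implementation would be a more realistic choice.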
Nucleotide level analysis (H. sapiens)
[Figure: results plot; the only value that survives in the text is 93.308.]
Nucleotide level analysis (D. melanogaster)
[Figure: results plot.]
Nucleotide level analysis (C. elegans)
[Figure: results plot.]
Exon level analysis
Sensitivity = (number of annotated exons correctly predicted) / (number of annotated exons in the annotation set)
Specificity = (number of predicted exons that are also annotated) / (number of predicted exons in the prediction set)
[Figure: overlap between the annotation set and the prediction set, marking true positives, false positives and false negatives.]
Exon level analysis
Points to note:
• An exon in the prediction must have identical start and end coordinates, and the same strand, as an exon in the annotation to be counted correct.
• If an exon is present in multiple transcripts, in either the annotation or the predictions, it is counted only once.
• As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
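The exon level rules map naturally onto set operations. The sketch below assumes each transcript is a list of hypothetical `(chromosome, strand, start, end)` tuples; the names `unique_exons` and `exon_accuracy` are illustrative and do not come from the RGASP pipeline.

```python
from typing import List, Tuple

# Hypothetical layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]
Transcript = List[Exon]


def unique_exons(transcripts: List[Transcript]) -> set:
    """Collect exons across transcripts; shared exons are counted once."""
    return {exon for transcript in transcripts for exon in transcript}


def exon_accuracy(annotation: List[Transcript], prediction: List[Transcript]):
    """Return (sensitivity, specificity, average) at the exon level.

    Two exons match only if chromosome, strand, start and end are all
    identical, which is what tuple equality checks here.
    """
    annotated = unique_exons(annotation)
    predicted = unique_exons(prediction)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


annotation = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)],
    [("chr1", "+", 100, 200), ("chr1", "+", 500, 600)],  # shares the first exon
]
prediction = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 410)],  # second exon ends 10 bp late
    [("chr1", "+", 500, 600)],
    [("chr1", "-", 500, 600)],                           # wrong strand
]

# 3 unique annotated exons, 4 unique predicted exons, 2 exact matches.
exon_sn, exon_sp, exon_avg = exon_accuracy(annotation, prediction)
print(exon_sn, exon_sp, exon_avg)
```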
Exon level analysis (H. sapiens)
[Figure: results plot.]
Exon level analysis (D. melanogaster)
[Figure: results plot.]
Exon level analysis (C. elegans)
[Figure: results plot.]
Transcript level analysis
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
[Figure: overlap between the annotation set and the prediction set, marking true positives, false positives and false negatives.]
Transcript level analysis
Points to note:
• We consider a transcript accurately predicted if the number of exons in the transcript and their boundaries match exactly between the annotation and the prediction.
• For the CDS-focused evaluation, a transcript is counted correct if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
• For the mRNA-focused evaluation, a transcript is counted correct if all of the exons, from the start of transcription to the end of transcription, match perfectly between the annotation and prediction sets.
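The difference between the mRNA-focused and CDS-focused criteria can be illustrated with a small sketch. Everything here is an assumption for the example: exons are `(chromosome, strand, start, end)` tuples, the CDS is given as a genomic start and end, identical transcripts within a set are collapsed, and `clip_to_cds`, `transcript_key` and `transcript_accuracy` are hypothetical helper names.

```python
def clip_to_cds(exons, cds_start, cds_end):
    """Keep only the coding part of each exon (the CDS-focused view)."""
    clipped = []
    for chrom, strand, start, end in exons:
        new_start, new_end = max(start, cds_start), min(end, cds_end)
        if new_start <= new_end:
            clipped.append((chrom, strand, new_start, new_end))
    return clipped


def transcript_key(exons):
    """Hashable representation: same exon count and identical boundaries."""
    return tuple(sorted(exons))


def transcript_accuracy(annotation, prediction):
    """Return (sensitivity, specificity, average) for exact transcript matches."""
    annotated = {transcript_key(t) for t in annotation}
    predicted = {transcript_key(t) for t in prediction}
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Annotated and predicted transcripts differ only in their UTR ends.
annotated_mrna = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400), ("chr1", "+", 500, 650)]
predicted_mrna = [("chr1", "+", 120, 200), ("chr1", "+", 300, 400), ("chr1", "+", 500, 620)]
cds_start, cds_end = 150, 600  # same translation start and end in both

# mRNA-focused: the terminal exon boundaries differ, so there is no match.
mrna_result = transcript_accuracy([annotated_mrna], [predicted_mrna])

# CDS-focused: after clipping to the coding region the transcripts are identical.
cds_result = transcript_accuracy(
    [clip_to_cds(annotated_mrna, cds_start, cds_end)],
    [clip_to_cds(predicted_mrna, cds_start, cds_end)],
)
print(mrna_result, cds_result)
```

The example shows why the two evaluations are reported separately: a prediction with imprecise UTR ends fails the mRNA-focused test yet can still be a perfect CDS-focused match.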
Transcript level analysis
[Figure: results plot, panel title "Human (CDS-focused)".]
Relaxed transcript level analysis
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
[Figure: overlap between the annotation set and the prediction set, marking true positives, false positives and false negatives.]
Relaxed transcript level analysis
Points to note:
• We consider a transcript 'accurately' predicted if the number of exons in the transcript matches exactly between the annotation and the prediction, and the exon boundaries differ by no more than 5 bp.
• All other criteria remain the same as in the transcript level analysis.
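With a tolerance on exon boundaries, exact set intersection no longer works and transcripts have to be compared pairwise. The following sketch assumes `(chromosome, strand, start, end)` exon tuples and pairs exons by sorted order; the function names and the way matches are counted (an annotated transcript is a hit if any prediction matches it, and vice versa) are assumptions, not details given in the slides.

```python
def exons_match(a, b, tolerance=5):
    """Same chromosome and strand, boundaries within `tolerance` bp."""
    return (
        a[0] == b[0]
        and a[1] == b[1]
        and abs(a[2] - b[2]) <= tolerance
        and abs(a[3] - b[3]) <= tolerance
    )


def transcripts_match_relaxed(annotated, predicted, tolerance=5):
    """Equal exon counts, and each exon pair agrees within the tolerance."""
    if len(annotated) != len(predicted):
        return False
    pairs = zip(sorted(annotated), sorted(predicted))
    return all(exons_match(a, p, tolerance) for a, p in pairs)


def relaxed_transcript_accuracy(annotation, prediction, tolerance=5):
    """Return (sensitivity, specificity, average) under the relaxed rule."""
    found = sum(
        any(transcripts_match_relaxed(a, p, tolerance) for p in prediction)
        for a in annotation
    )
    correct = sum(
        any(transcripts_match_relaxed(a, p, tolerance) for a in annotation)
        for p in prediction
    )
    sensitivity = found / len(annotation) if annotation else 0.0
    specificity = correct / len(prediction) if prediction else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


annotation = [[("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]]
prediction = [
    [("chr1", "+", 103, 200), ("chr1", "+", 300, 405)],  # within 5 bp: accepted
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 410)],  # 10 bp off: rejected
]

relaxed_sn, relaxed_sp, relaxed_avg = relaxed_transcript_accuracy(annotation, prediction)
print(relaxed_sn, relaxed_sp, relaxed_avg)
```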
Very relaxed transcript level analysis
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
[Figure: overlap between the annotation set and the prediction set, marking true positives, false positives and false negatives.]
Very relaxed transcript level analysis
Points to note:
• We consider a transcript 'accurately' predicted if
  • the number of exons in the transcript differs by no more than two (terminal exons only) between the annotation and the prediction, and
  • the boundaries of all equivalent exons differ by no more than 5 bp between the annotation and the prediction.
• All other criteria remain the same as in the transcript level analysis.
[Figure: results plot, panel title "Worm (exon-focused)".]
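One possible reading of the very relaxed rule is sketched below. The slides do not spell out how the "terminal exons only" condition is checked, so this example assumes that the shorter exon list must line up with a contiguous run of the longer one, leaving at most two unmatched exons at the ends. The exon tuple layout `(chromosome, strand, start, end)` and the function names are likewise hypothetical.

```python
def exons_match(a, b, tolerance=5):
    """Same chromosome and strand, boundaries within `tolerance` bp."""
    return (
        a[0] == b[0]
        and a[1] == b[1]
        and abs(a[2] - b[2]) <= tolerance
        and abs(a[3] - b[3]) <= tolerance
    )


def transcripts_match_very_relaxed(annotated, predicted, tolerance=5, max_extra=2):
    """Allow up to `max_extra` unmatched exons, but only at the transcript ends.

    Assumption: the shorter transcript must align with a contiguous window
    of the longer one, so a missing internal exon is never tolerated.
    """
    a, p = sorted(annotated), sorted(predicted)
    shorter, longer = (a, p) if len(a) <= len(p) else (p, a)
    extra = len(longer) - len(shorter)
    if not shorter or extra > max_extra:
        return False
    for offset in range(extra + 1):
        window = longer[offset:offset + len(shorter)]
        if all(exons_match(s, w, tolerance) for s, w in zip(shorter, window)):
            return True
    return False


annotated = [
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 500, 600),
    ("chr1", "+", 700, 800),
]
# Both terminal exons are missing and the boundaries are slightly off.
missing_ends = [("chr1", "+", 302, 400), ("chr1", "+", 500, 603)]
# An internal exon is missing, which this reading does not accept.
missing_internal = [("chr1", "+", 100, 200), ("chr1", "+", 500, 600), ("chr1", "+", 700, 800)]

ends_ok = transcripts_match_very_relaxed(annotated, missing_ends)
internal_ok = transcripts_match_very_relaxed(annotated, missing_internal)
print(ends_ok, internal_ok)
```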
Missing and wrong genes
• 'Missing exons' (MEs): annotated exons that do not overlap any predicted exon by at least 1 bp.
• 'Wrong exons' (WEs): predicted exons that do not overlap any annotated exon by at least 1 bp.
• Wrong exons that are predicted independently by more than two predictors are recorded, and some of them will be tested experimentally.
[Figure: annotation set and prediction set, marking missed exons and wrong exons.]
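The two definitions above reduce to an overlap test. The sketch below assumes `(chromosome, strand, start, end)` exon tuples with 1-based inclusive coordinates, and it also assumes that overlap is only counted on the same strand, which the slides do not state for this analysis. The quadratic scan is for clarity only; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two exons share at least 1 bp on the same chromosome and strand."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def missing_and_wrong_exons(annotated, predicted):
    """Return (missing, wrong) exon lists.

    missing: annotated exons that overlap no predicted exon.
    wrong:   predicted exons that overlap no annotated exon.
    """
    missing = [a for a in annotated if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in annotated)]
    return missing, wrong


annotated = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]
predicted = [("chr1", "+", 200, 250), ("chr1", "+", 500, 600)]

# The first annotated exon shares base 200 with a prediction, so it is not missing.
missing, wrong = missing_and_wrong_exons(annotated, predicted)
print(missing, wrong)
```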
Dubious wrong exons
• 'Dubious wrong exons': wrong exons (WEs) that are predicted independently by more than two predictors are reported.
• 15704 dubious wrong exons in the whole human genome.
• 17678 dubious wrong exons in the whole worm genome.
[Figure: screenshot of the list of dubious wrong exons, with a diagram of the annotation set and prediction set marking dubious wrong exons.]
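Selecting wrong exons supported by more than two predictors amounts to counting distinct predictors per exon. The sketch below assumes that two predictors agree on an exon only when chromosome, strand, start and end are identical; the slides do not say how agreement was determined. The predictor names "A" to "D" and the function name `dubious_wrong_exons` are made up for the example.

```python
from collections import defaultdict


def dubious_wrong_exons(wrong_exons_by_predictor, min_predictors=3):
    """Return wrong exons reported by at least `min_predictors` predictors.

    "More than two predictors" is read here as three or more.
    """
    support = defaultdict(set)
    for predictor, exons in wrong_exons_by_predictor.items():
        for exon in exons:
            support[exon].add(predictor)
    return sorted(
        exon for exon, predictors in support.items()
        if len(predictors) >= min_predictors
    )


# Hypothetical wrong-exon lists from four predictors.
wrong_exons_by_predictor = {
    "A": [("chr1", "+", 500, 600), ("chr2", "-", 50, 90)],
    "B": [("chr1", "+", 500, 600)],
    "C": [("chr1", "+", 500, 600), ("chr2", "-", 50, 90)],
    "D": [("chr3", "+", 10, 40)],
}

dubious = dubious_wrong_exons(wrong_exons_by_predictor)
print(dubious)
```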
Acknowledgement
• Jen Harrow
• Felix Kokocinski
• Tim Hubbard
• The RGASP community