1. Computational Biology: Genome annotation formats, October 2004
Ian Holmes
Department of Bioengineering
University of California, Berkeley
From an original lecture by Irmtraud Meyer
2. Overview: What is genome annotation?
In which format can a genome annotation be saved to files?
Definition of the gff genome annotation format
Other genome annotation formats
Application: evaluating the performance of a gene prediction program
Exercises
3. What is genome annotation? Genome annotation is the localisation of functional elements in a genomic sequence.
For example: the location of
protein coding genes
tRNA and other RNA genes
promoters
...
4. Example 1: protein coding genes
5. Formats for saving annotations: Motivation:
To save information on a gene, a format should be able to record:
the location of the gene in the genome
the position of its exon-intron boundaries
the strand of DNA on which the gene lies
the source of annotation
the completeness of the gene structure
6. The GFF format: GFF = Gene-Finding Format (also expanded as General Feature Format)
a format used to save gene structures
idea: divide gene into its constituents
Exon: transcribed sections of a gene
CDS: translated sections of a gene
Start_Codon
Stop_Codon
7. The GFF format:
8. The GFF format: Format of each gff-line:
name source feature start end score strand frame group
where:
name: the name of the sequence (string)
source: the name of the source of annotation (string)
feature: feature type: Exon, CDS, Start_Codon, Stop_Codon (string)
start: start position of feature (integer)
end: end position of feature (integer)
score: score (rational number) associated with feature, set to . if score not used
strand: strand on which feature lies, possible values are + or -
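frame: 0, 1 or 2 for CDS, Start_Codon and Stop_Codon, . for Exon
group: attributes that tie the line to a gene, e.g. gene_id, transcript_id and exon_number as in the examples (string)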
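The field list above maps directly onto a small record type. The following is a minimal Python sketch, not part of the original lecture: the names GffRecord and parse_gff_line are hypothetical, and it assumes tab-delimited lines in which an unused score or frame is written as a dot.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One gff line, with fields in the order given in the lecture."""
    name: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the file contains "."
    strand: str
    frame: Optional[int]     # None when the file contains "."
    group: str


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited gff line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-delimited fields, got {len(fields)}")
    name, source, feature, start, end, score, strand, frame = fields[:8]
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be + or -, got {strand!r}")
    return GffRecord(
        name=name,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=None if frame == "." else int(frame),
        group=fields[8] if len(fields) > 8 else "",
    )


record = parse_gff_line(
    "Chr1\tsrc\tCDS\t380\t401\t.\t+\t0\tgene_id 1; transcript_id 1; exon_number 2"
)
print(record.feature, record.start, record.end, record.frame)
```

Converting start, end and frame to integers at parse time means that later checks (lengths, frames, overlaps) can work on numbers rather than strings.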
9. The GFF format: remarks
the fields in a gff line are tab delimited
start <= end (important to keep in mind when dealing with genes on the reverse strand !)
the start and end positions are the corresponding positions on the + strand
definition of frame for CDS, Start_Codon and Stop_Codon features:
0: first nucleotide in feature has codon position 0
1: first nucleotide in feature has codon position 2
2: first nucleotide in feature has codon position 1
=> note that the frame of a CDS is NOT its length modulo 3 and that the frame of a Start_Codon and Stop_Codon always has to be 0 (Why ?)
Exons do not have a frame, use . as the value of their frame
if there is no score associated with a feature, use .
10. The GFF format: more remarks
the terminal CDS does not comprise the positions of the Stop_Codon as the Stop_Codon is not translated
the initial CDS does comprise the positions of the Start_Codon as it is translated
the order of lines in a gff file is irrelevant although it makes sense to group them by genes
11. The GFF format: Example 2:
A valid description of this gene in gff format is for example:
Chr1 src Exon 150 200 . + . gene_id 1; transcript_id 1; exon_number 1
Chr1 src Exon 300 401 . + . gene_id 1; transcript_id 1; exon_number 2
Chr1 src CDS 380 401 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Exon 501 650 . + . gene_id 1; transcript_id 1; exon_number 3
Chr1 src CDS 501 650 . + 2 gene_id 1; transcript_id 1; exon_number 3
Chr1 src Exon 700 800 . + . gene_id 1; transcript_id 1; exon_number 4
Chr1 src CDS 700 707 . + 2 gene_id 1; transcript_id 1; exon_number 4
Chr1 src Exon 900 1000 . + . gene_id 1; transcript_id 1; exon_number 5
Chr1 src Start_Codon 380 382 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Stop_Codon 708 710 . + 0 gene_id 1; transcript_id 1; exon_number 4
12. The GFF format: Example 3: a gene on the reverse strand
The valid description of this gene in gff format is for example:
Chr22 src Exon 649 700 . - . gene_id 1; transcript_id 1; exon_number 1
Chr22 src CDS 649 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src Exon 351 500 . - . gene_id 1; transcript_id 1; exon_number 2
Chr22 src CDS 351 500 . - 2 gene_id 1; transcript_id 1; exon_number 2
Chr22 src Exon 150 250 . - . gene_id 1; transcript_id 1; exon_number 3
Chr22 src CDS 153 250 . - 2 gene_id 1; transcript_id 1; exon_number 3
Chr22 src Start_Codon 698 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src Stop_Codon 150 152 . - 0 gene_id 1; transcript_id 1; exon_number 3
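The frames in Examples 2 and 3 can be reproduced mechanically from the CDS coordinates. The sketch below is one possible way to do this in Python; the function name cds_frames is hypothetical, and it assumes a gene whose first CDS begins with the start codon and whose CDS intervals do not overlap.

```python
def cds_frames(cds_intervals, strand):
    """Map each (start, end) CDS interval to its gff frame.

    Intervals are 1-based, inclusive, in + strand coordinates.
    The 5'-most CDS is assumed to begin with the start codon.
    """
    # Order the CDSs 5' to 3' along the gene: ascending on +, descending on -.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    frames = {}
    codon_pos = 0  # codon position (0, 1 or 2) of the first nucleotide of the CDS
    for start, end in ordered:
        # Frame = number of nucleotides to skip to reach the next codon start,
        # so codon position 0 -> frame 0, 1 -> frame 2, 2 -> frame 1.
        frames[(start, end)] = (3 - codon_pos) % 3
        length = end - start + 1
        codon_pos = (codon_pos + length) % 3
    return frames


# CDS coordinates taken from Example 2 (+ strand) and Example 3 (- strand).
print(cds_frames([(380, 401), (501, 650), (700, 707)], "+"))
print(cds_frames([(649, 700), (351, 500), (153, 250)], "-"))
```

The only difference between the two strands is the order in which the intervals are visited, which is why the coordinates themselves can stay in + strand terms.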
13. Other genome annotation formats: DAS = XML version of GFF
uses tags to delimit fields, not whitespace
a little more structured
GAME = Genome Annotation Markup Elements
The format definition can be found at: http://www.bioxml.org/Projects/game
14. Uses of a genome annotation format: exchanging annotation information
checking an annotation
comparing different annotations
visualising an annotation, see for example www.ensembl.org
15. Evaluating the performance of a gene prediction program:
16. Evaluation on different levels:
17. Evaluation on different levels (cont'd):
18. Measures of performance:
19. Exercises: 1.) Check that you can reproduce the frames of the CDS lines in example 3 knowing the positions of the CDSs, the start codon and the stop codon.
2.) What do the terms (# tp + # fp) and (# tp + # fn) stand for ?
3.) Looking at a gff entry of a gene, can you deduce if the annotation of the gene is complete ?
4.) In which interval of numbers do the values of sensitivity and specificity fall ?
20. Exercises: 5.) This exercise prepares you for the practicals following this lecture: You are collaborating with colleagues abroad who send you a gff file with the genes of their genome annotation as well as a fasta file with the corresponding genome sequence.
a) How do you check the gff file for errors ? Which checks can you think of ?
b) Outline the structure of (i.e. write the pseudocode for) a program which checks the gff file for errors.
6.) You are given a gff file with an annotation predicted by a gene prediction program.
a) Which information do you require to evaluate the performance of the gene prediction program ?
b) Outline the structure of a program which evaluates the performance of a gene prediction program by comparing the predicted genes (contained in gff format in file 1) to the known genes (contained in gff format in file 2) (see example 4).
21. Answers to exercises: 1.) look at gff lines with features CDS and start codon in example 3:
- CDS with exon_number 1 is the initial i.e. 5'-most CDS of the gene as it starts with a start codon
- the initial CDS has length 700 - 649 + 1 = 52 = 17 * 3 + 1
=> the next CDS with exon_number 2 starts with codon position 1
=> the next CDS has frame 2
- the second CDS has length 500 - 351 + 1 = 150 = 50 * 3
=> the next CDS with exon_number 3 starts with the same codon position
=> the next CDS has frame 2
2.) (# tp + # fp) is the number of predicted features
(# tp + # fn) is the number of annotated features
3.) A gff entry for a gene only tells you if the protein coding part of the gene is complete. If the gff entry comprises the start and stop codon of the gene, its protein coding part is complete. A gff entry does not show if the information on the untranslated exons is complete.
22. Answers to exercises: 4.) The values for sensitivity ((# tp) / (# tp + # fn)) and specificity ((# tp) / (# tp + # fp))
lie between 0 and 1. The sensitivity is 1 only if (# fn) = 0 and the specificity is 1 only if (# fp) = 0.
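The two measures can be computed from sets of predicted and annotated features. The Python sketch below is illustrative only: the function name sensitivity_specificity is hypothetical, and it assumes that a predicted feature counts as a true positive only when its coordinates match an annotated feature exactly.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (sensitivity, specificity) as defined in the lecture.

    Both arguments are collections of hashable features, for example
    (start, end) tuples. Assumption: exact coordinate match = true positive.
    A measure is None when its denominator is zero.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but not predicted
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, specificity


# Made-up coordinates, purely for illustration.
predicted_exons = {(1, 10), (20, 30), (40, 50)}
annotated_exons = {(1, 10), (20, 30), (60, 70), (80, 90)}
print(sensitivity_specificity(predicted_exons, annotated_exons))
```

Because tp is never larger than either denominator, both values fall between 0 and 1 whenever they are defined.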
23. Answer to exercise 5: Note: This exercise is about checking the annotation given in gff format, NOT the gff
format itself !
a) checking the annotation in the gff file is best done if the corresponding DNA sequences are available as this allows more checks to be performed,
so for the practical you can assume that you are given a gff and the corresponding
fasta file containing the DNA sequences
- possible checks of the annotation are:
- Is the start codon correct (if it exists) ?
- Is the stop codon correct (if it exists) ?
- Are there no in-frame stop codons within the CDS ?
- Do the splice sites look fine ?
- For complete genes: Is the sum of CDS lengths a multiple of 3 ?
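Most of the checks listed above operate on the spliced coding sequence. The following Python sketch shows how they might look; the function name check_coding_sequence is hypothetical, and it assumes the standard genetic code (start codon ATG; stop codons TAA, TAG, TGA) and a coding sequence already written 5' to 3' on the strand of the gene, without the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_coding_sequence(cds_seq, stop_codon):
    """Return a list of error messages; an empty list means all checks passed.

    cds_seq:    concatenated CDS features, 5' to 3' on the gene's strand,
                without the stop codon
    stop_codon: the three nucleotides of the Stop_Codon feature
    """
    errors = []
    seq = cds_seq.upper()
    if len(seq) % 3 != 0:
        errors.append("sum of CDS lengths is not a multiple of 3")
    if seq[:3] != "ATG":
        errors.append("start codon is not ATG")
    if stop_codon.upper() not in STOP_CODONS:
        errors.append("stop codon is not TAA, TAG or TGA")
    # Walk the sequence codon by codon, starting at the start codon.
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            errors.append(f"in-frame stop codon at CDS position {i + 1}")
    return errors


print(check_coding_sequence("ATGAAATTT", "TAA"))
print(check_coding_sequence("ATGTAAAAA", "TAA"))
```

Returning a list of messages rather than stopping at the first failure lets one run report every problem found in a gene.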
24. Answer to exercise 5 (cont'd): b) For the program which checks the annotation you may assume the following which
you do not have to check:
- sequence names in the fasta file are unique
- use of gff format is correct
You may assume in your program, but should check the following:
- DNA sequences consist of A,C,G,T letters only
- all genes are complete, ie comprise a start and stop codon
- splice sites are either GT-AG (consensus) or GC-AG
- there is exactly one gene associated with each fasta file sequence
Some things to keep in mind:
- genes can lie on the forward + or the reverse - strand
- the DNA sequences in the fasta file are the + strand sequences
- the coordinates in the fasta and the gff file are absolute coordinates, but in your program you may prefer to make some calculations in relative coordinates (i.e. the first sequence position being 1 and the last being length_of_sequence)
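The points above about strands and coordinates can be combined in one helper that extracts a spliced sequence from the + strand. This is a Python sketch under stated assumptions: the names reverse_complement and spliced_sequence and the offset parameter are hypothetical, intervals are 1-based and inclusive, and the sequence contains only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(plus_strand, intervals, strand, offset=1):
    """Concatenate features and return them 5' to 3' on the gene's strand.

    plus_strand: + strand DNA sequence from the fasta file
    intervals:   (start, end) pairs, 1-based inclusive, absolute coordinates
    strand:      "+" or "-"
    offset:      absolute coordinate of the first base of plus_strand
                 (1 when absolute and relative coordinates coincide)
    """
    pieces = []
    for start, end in sorted(intervals):
        # Convert absolute 1-based coordinates to a 0-based Python slice.
        pieces.append(plus_strand[start - offset:end - offset + 1])
    seq = "".join(pieces)
    # Reverse-complementing the joined + strand pieces also puts the
    # features of a - strand gene into 5' to 3' order.
    return reverse_complement(seq) if strand == "-" else seq


dna = "AAATGCCCGGGTTT"  # made-up sequence for illustration
print(spliced_sequence(dna, [(3, 5), (9, 11)], "+"))
print(spliced_sequence(dna, [(3, 5), (9, 11)], "-"))
```

Joining first and reverse-complementing once at the end avoids having to reorder the features of a reverse-strand gene by hand.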
25. Pseudocode (outline of the program): 1.) read all of the fasta file and get all DNA sequences and headers
2.) for each entry in the fasta file:
a) check fasta entry:
i) length of DNA sequence equals length indicated in header ?
If not, report error and go to next sequence (=: rerr&gonext)
ii) DNA sequence consists of A, C, G, T letters only ? If not, rerr&gonext.
b) read gff lines for that sequence name:
i) check gff lines exist: if not, rerr&gonext
ii) check there is exactly one gene associated with fasta entry: if not, rerr&gonext
iii) check if gene is complete: if not, rerr&gonext
iv) check if sum of CDS lengths multiple of 3: if not, rerr&gonext
v) check if start codon correct: if not, report error
vi) check if stop codon correct: if not, report error
vii) check that there are no in-frame stop codons: if there are any, report error
viii) if relevant, check if splice sites are ok: if not, report error
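Step viii) of the outline can be sketched as follows in Python. The function names intron_splice_sites and splice_sites_ok are hypothetical; the sketch assumes relative, 1-based inclusive exon coordinates on the + strand, introns of at least four nucleotides, and the two accepted donor/acceptor pairs GT-AG and GC-AG named earlier.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALLOWED_SITES = {("GT", "AG"), ("GC", "AG")}


def intron_splice_sites(plus_strand, exons, strand):
    """Return one (donor, acceptor) dinucleotide pair per intron.

    Introns are reported in ascending + strand order, but each pair is
    read on the strand of the gene.
    """
    ordered = sorted(exons)
    sites = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        # The intron covers 1-based positions left_end + 1 .. right_start - 1.
        first_two = plus_strand[left_end:left_end + 2]
        last_two = plus_strand[right_start - 3:right_start - 1]
        if strand == "+":
            sites.append((first_two, last_two))
        else:
            # On the - strand the donor sits at the + strand end of the intron.
            donor = last_two.translate(_COMPLEMENT)[::-1]
            acceptor = first_two.translate(_COMPLEMENT)[::-1]
            sites.append((donor, acceptor))
    return sites


def splice_sites_ok(sites):
    """True if every intron uses GT-AG or GC-AG."""
    return all(pair in ALLOWED_SITES for pair in sites)


dna = "AAAGTCCCAGTTT"  # made-up: exon 1-3, intron 4-10, exon 11-13
sites = intron_splice_sites(dna, [(1, 3), (11, 13)], "+")
print(sites, splice_sites_ok(sites))
```

A gene with a single exon yields an empty list, for which splice_sites_ok returns True; this matches the "if relevant" wording of the step.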
26. Info on input files and functions:
27. Remark about the fasta header lines:
28. Answer to exercise 5b: