1 / 37

Genome Annotation

Genome Annotation. Rosana O. Babu. Sequence to Annotation. Input1-Variant Annotation. Input2- Structural Annotation. Structural Annotation was conducted using AUGUSTUS (version 2.5.5), Magnaporthe_grisea as genome model

adlai
Download Presentation

Genome Annotation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genome Annotation Rosana O. Babu

  2. Sequence to Annotation

  3. Input1-Variant Annotation

  4. Input2- Structural Annotation • Structural Annotation was conducted using AUGUSTUS (version 2.5.5), Magnaporthe_grisea as genome model • However, we have to develop genome model for Oomycete to obtain accurate result

  5. Input3-Functional Annotation

  6. Genome Annotation • The process of identifying the locations of genes and the coding regions in a genome to determe what those genes do • Finding and attaching the structural elements and its related function to each genome locations

  7. Genome Annotation gene function prediction Attaching biological information to these elements- eg: for which protein exon will code for gene structure prediction Identifying elements (Introns/exons,CDS,stop,start) in the genome

  8. Eukaryote genome annotation Find locus Genome Transcription Primary Transcript RNA processing Find exons using transcripts ATG STOP Processed mRNA m7G AAAn Translation Find exons using peptides Polypeptide Protein folding Folded protein Find function Enzyme activity A B Functional activity

  9. Prokaryote genome annotation Find locus Genome Transcription Primary Transcript RNA processing Find CDS START STOP START STOP Processed RNA Translation Polypeptide Protein folding Folded protein Find function Enzyme activity A B Functional activity

  10. Genome annotation - workflow Genome sequence Masked or un-masked genome sequence Repeats Structural annotation-Gene finding nc-RNAs, Introns Protein-coding genes Functional annotation Viewed & Released in Genome viewer

  11. Genome Repeats & features Polymorphic between individuals/populations • Percentage of repetitive sequences in different organisms • Microsatellite • Minisatellite • Tandem repeat • Short tandem repeat • SSR

  12. Finding repeats as a preliminary to gene prediction • Repeat discovery • Literature and public databanks • Homology based approaches • Automated approaches (e.g. RepeatScout or RECON) • Tandem repeats: Tandem, TRF • Use RepeatMasker to search the genome and mask the sequence

  13. Masked sequence • Repeatmasked sequence is an artificial construction where those regions which are thought to be repetitive are marked with X’s • Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TE’s in the final annotation set >my sequence atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct >my sequence (repeatmasked) atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct Positions/locations are not affected by masking

  14. Types of Masking- Hard or Soft? • Sometimes we want to mark up repetitive sequence but not to exclude it from downstream analyses. This is achieved using a format known as soft-masked >my sequence ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT >my sequence (softmasked) ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT >my sequence (hardmasked) atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

  15. Genome annotation - workflow Genome sequence Masked or un-masked Map repeats Gene finding- structural annotation nc-RNAs, Introns Protein-coding genes Functional annotation Viewed & Released in Genome viewer

  16. Structural annotation Identification of genomic elements • Open reading frame and their localization • Coding regions • Location of regulatory motifs • Start/Stop • Splice Sites • Non coding Regions/RNA’s

  17. Methods • Similarity • Similarity between sequences which does not necessarily infer any evolutionary linkage • Ab- initio prediction • Prediction of gene structure from first principles using only the genome sequence

  18. Genefinding ab initio similarity

  19. Gene_finding resources for Homology based methods • Transcript • cDNA sequences • EST sequences • Peptide • Non-redundant (nr) protein database • Protein sequence data, Mass spectrometry data • Genome • Other genomic sequence

  20. ab initio prediction Genome Coding potential ATG & Stop codons Splice sites ATG & Stop codons Coding potential

  21. Genefinding - ab initio predictions • Use compositional features of the DNA sequence to define coding segments (essentially exons) • ORFs • Coding bias • Splice site consensus sequences • Start and stop codons • Methods • Training sets are required • Each feature is assigned a log likelihood score • Use dynamic programming to find the highest scoring path for accuracy • Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh

  22. Genefinding - similarity • Use known coding sequence to define coding regions • EST sequences • Peptide sequences • Problem to handle fuzzy alignment regions around splice sites • Examples: EST2Genome, exonerate, genewise Gene-finding - comparative • Use two or more genomic sequences to predict genes based on conservation of exon sequences • Examples: Twinscan and SLAM

  23. Genefinding - non-coding RNA genes • Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples • tRNAscan - uses an HMM and co-variance model for prediction of tRNA genes • Rfam - a suite of HMM’s trained against a large number of different RNA genes

  24. Gene-finding omissions • Alternative isoforms • Currently there is no good method for predicting alternative isoforms • Only created where supporting transcript evidence is present • Pseudogenes • Each genome project has a fuzzy definition of pseudogenes • Badly curated/described across the board • Promoters • Rarely a priority for a genome project • Some algorithms exist but usually not integrated into an annotation set

  25. Practical- structural annotation Eukaryotes- AUGUSTUS (gene model) ~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_griseaour_genome.fasta>structural_annotation.gff Prokaryotes – PRODIGAL (Codon Usage table) ~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 –d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt

  26. Structural Annotation- • Structural Annotation was conducted using AUGUSTUS (version 2.5.5), Magnaporthe_grisea as genome model • However, we have to develop genome model for obtaining accurate result

  27. Functionalannotation

  28. Functional annotation Attaching biological information to genomic elements • Biochemical function • Biological function • Involved regulation and interactions • Expression • Utilise known structural information to predicted protein sequence

  29. Genome annotation - workflow Genome sequence Masked or un-masked Map repeats Gene finding- structural annotation nc-RNAs, Introns Protein-coding genes Functional annotation Viewed & Released in Genome viewer

  30. Genome annotation Genome Transcription Primary Transcript RNA processing ATG STOP Processed mRNA m7G AAAn Translation Polypeptide Protein folding Folded protein Find function Enzyme activity A B Functional activity

  31. Functional annotation – Homology Based • Predicted Exons/CDS/ORF are searched against the non-redundant protein database (NCBI, SwissProt) to search for similarities • Visually assess the top 5-10 hits to identify whether these have been assigned a function • Functions are assigned

  32. Functional annotation - Other features • Other features which can be determined • Signal peptides • Transmembrane domains • Low complexity regions • Various binding sites, glycosylation sites etc. • Protein Domain See http://expasy.org/tools/ for a good list of possible prediction algorithms

  33. Functional annotation - Other features (Ontologies) • Use of ontologies to annotate gene products • Gene Ontology (GO) • Cellular component • Molecular function • Biological process

  34. Practical - FUNCTIONAL ANNOTATION • Homology Based Method • setup blast database for nucleotide/protein • Blasting the genome.fasta for annotations (nucleotide/protein) • sorting for blast minimum E-value (>=0.01) for nucleotide/protein • Further filtering for best blast hit (5-15) and assigning functions • Removing Positive strand blast hits • Removing negative strand blast hits

  35. Functional annotation- output Bioinformatics tools for Comparative Genomics of Vectors

  36. Conclusion • Annotation accuracy is only as good as the available supporting data at the time of annotation- update information is necessary • Gene predictions will change over time as new databecomes available (ESTs, related genomes) that are much similar than previous ones • Functional assignments will change over time as new data becomes available (characterization of hypothetical proteins)

  37. Thank You

More Related