Savita Shrivastava, Feb 25th, 2005, Lab Presentation

BASys: A Web Server for Automated Bacterial Annotation • Savita Shrivastava • Feb 25th, 2005 • Lab Presentation

BASys: Introduction • A web server for automated, in-depth annotation of bacterial genomic sequences • BASys uses more than 30 programs to determine ~60 annotation subfields for each gene • BASys also generates colorful, clickable and fully zoomable maps of each query chromosome • Annotations and maps can be generated in ~24 hrs for an average bacterial chromosome (5 Mb, or 3000 genes) • BASys annotations may be viewed or downloaded anonymously or through a password-protected access system • The BASys server and databases can also be downloaded and run locally • BASys is available at: http://wishart.biology.ualberta.ca/basys

Automated genome annotation: why? • Complete published genomes • 21 Archaeal • 205 Bacterial • 32 Eukaryal • Ongoing projects • 655 Prokaryotic genomes • 474 Eukaryotic genomes

Challenges • Sheer volume of data • The heterogeneous and growing types of annotations • The time sensitivity of searches • Computing power can be expensive • The need to present the information in an integrated graphical fashion

Existing automated genome annotation systems • GeneQuiz (from protein sequence to biochemical function using a variety of search and analysis methods) • PEDANT (focuses on protein-based annotation as well as on many DNA-based analyses) • Genotator (gene prediction, and searches for homologs, promoters, splice sites, and ORFs) • MAGPIE and Bluejay (gene description, gene taxonomic information, similarity searches, metabolic pathways, GO, etc.) • GAIA (structural annotation) • TIGR CMR (gene and protein name, GO, M.W., pI and taxonomic information of the organism)

BASys in detail • Data submission and scheduling • The BASys annotation engine • The BASys report generator

Data submission and scheduling • A front-end web interface for: • Submitting the raw genomic data • Scheduling the annotations • Monitoring and reporting the annotation progress

Submitting the raw genomic data • Anonymous access • For anonymous submissions, the user is emailed a secure URL for monitoring progress and retrieving the annotations • For single-chromosome submissions • Login-based access • Register with BASys • Password-protected • Allows users to submit and monitor multiple chromosome and plasmid annotations

Submitting the raw genomic data • BASys provides a web-based form for submitting: • Chromosome data as a FASTA-formatted file • Chromosome topology (circular or linear) • Gram stain subtype • Chromosome identifier

Submitting the raw genomic data • Genes are predicted using "Glimmer", a popular gene prediction program • If gene positions are already known, they can be supplied to BASys in a simple TAB-delimited format or as an NCBI ".ffn"-formatted FASTA file • An ".ffn" file includes the nucleotide coding sequences along with their location and direction on the chromosome
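
The slide does not spell out the columns of the TAB-delimited gene position file. As a minimal sketch, assuming a hypothetical four-column layout (gene name, start, end, strand), such a file could be read like this. The column order, the `GenePosition` class and the `parse_gene_positions` function are illustrative assumptions, not the format BASys actually expects.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GenePosition:
    name: str
    start: int   # position on the chromosome
    end: int     # position on the chromosome
    strand: str  # "+" or "-" (direction of the gene)


def parse_gene_positions(text: str) -> list[GenePosition]:
    """Parse TAB-delimited lines of the assumed form: name, start, end, strand."""
    genes = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        name, start, end, strand = row
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} for {name}")
        genes.append(GenePosition(name, int(start), int(end), strand))
    return genes


# Hypothetical input with two genes, one on each strand.
example = "orf0001\t190\t255\t+\norf0002\t337\t2799\t-\n"
genes = parse_gene_positions(example)
```

Rejecting an unknown strand value early keeps a malformed row from silently producing a gene drawn in the wrong direction later on.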

Overview • BASys is a distributed system operating in a clustered computing environment; it accommodates multiple users simultaneously performing long-running, resource-intensive genome annotations • Master node: hosts the web server and runs the queuing and scheduling system; can also issue directives to suspend, resume, restart, and remove the genome annotation jobs on the slave nodes • Slave nodes: run the BASys annotation engine • [Diagram labels: user, genome data, email, master node, three slave nodes; annotation engine components: Reference DB Similarity Search (SwissProt, CCDB), Similarity Searches (G.D., Model Organisms), Sequence Analysis (predictSPTM, Pfam, PROSITE), Structure Analysis (Homodeller, PDB, VADAR), etc.]

Monitoring and reporting the annotation progress • Each slave node continually communicates its progress to the master node while generating the annotations and reports • Upon completion of the annotation job, the submitter is notified by email that the annotations are ready • The MySQL client-server protocol is used to communicate directives and status • The Apache web server and HTTP are used to transfer the sequence data and reports
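
The master node's directives (suspend, resume, restart, remove) and the status reports from the slave nodes suggest a small job state machine. The sketch below is one possible way to model it. The state names, the transition table and the `apply_directive` function are assumptions made for illustration; the slides do not describe how BASys tracks job state internally.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUSPENDED = "suspended"
    COMPLETE = "complete"
    REMOVED = "removed"


# (current state, directive) -> next state; an assumed transition table.
TRANSITIONS = {
    (JobState.RUNNING, "suspend"): JobState.SUSPENDED,
    (JobState.SUSPENDED, "resume"): JobState.RUNNING,
    (JobState.RUNNING, "restart"): JobState.QUEUED,
    (JobState.SUSPENDED, "restart"): JobState.QUEUED,
    (JobState.COMPLETE, "restart"): JobState.QUEUED,
}


def apply_directive(state: JobState, directive: str) -> JobState:
    """Return the job state that results from a master-node directive."""
    if state is JobState.REMOVED:
        raise ValueError("job has already been removed")
    if directive == "remove":
        return JobState.REMOVED  # assumed: removal is allowed from any live state
    try:
        return TRANSITIONS[(state, directive)]
    except KeyError:
        raise ValueError(f"cannot {directive!r} a job that is {state.value}") from None
```

Keeping the allowed transitions in a single table makes it easy to reject a directive that does not apply, such as resuming a job that was never suspended.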

BASys annotation engine • Function prediction • Comparative annotations • Structural annotations • Secondary structure analysis • Metabolic annotations • General properties prediction

BASys annotation pipeline (flowchart) • Input: genomic sequence data (gene identification, then translation) or proteomic sequence data • Exact homolog search: SwissProt, CCDB • BLAST against the nr database for protein function prediction (BLAST e-10) • No hit: Hypothetical Protein <orf number> • Annotations from other sources: KEGG (metabolic information), PROSITE, predictSPTM, COG information, Pfam, PSORTB • Structural analysis: BLAST against the PDB database, Homodeller, VADAR, PsiPred; modification of secondary structure if transmembrane regions are present; structure class • Other annotations: general properties, operon structure, homologues and paralogues, preceding and following gene, TargetDB status and availability • Check for missing annotations • Annotation collection (annotations + features) • Annotation parser output: CCDB format, evidence cards, HTML format

Annotations from multiple sources Example: subcellular location • SwissProt • If the gene ontology is associated with hydrolase, nuclease, endonuclease or ribonuclease activity, or with nucleic acid or RNA binding, then the subcellular location is "Cytoplasmic" • If the protein name is related to transcriptional activity, then the subcellular location is "Cytoplasmic" • CCDB • If transmembrane regions are present, then the subcellular location is "Membrane" • PSORTB • If none of the above cases apply, then the subcellular location is assigned as "Cytoplasmic"
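
The rules above form a priority cascade, which might be sketched as follows. The keyword matching, the function name and its parameters are assumptions for illustration. The slide lists PSORTB as a source without saying how its prediction is used, so the sketch assumes that a PSORTB prediction, when present, is consulted before the final default.

```python
from typing import Optional, Sequence

# Keywords taken from the SwissProt rule on the slide; "nuclease" also
# matches "endonuclease" and "ribonuclease" as a substring.
CYTOPLASMIC_GO_KEYWORDS = (
    "hydrolase",
    "nuclease",
    "nucleic acid binding",
    "rna binding",
)


def subcellular_location(
    go_terms: Sequence[str],
    protein_name: str,
    has_transmembrane_region: bool,
    psortb_prediction: Optional[str] = None,
) -> str:
    """Assign a subcellular location by checking the sources in priority order."""
    go_text = " ".join(go_terms).lower()
    # 1. SwissProt-derived rules: gene ontology keywords, then protein name.
    if any(keyword in go_text for keyword in CYTOPLASMIC_GO_KEYWORDS):
        return "Cytoplasmic"
    if "transcription" in protein_name.lower():
        return "Cytoplasmic"
    # 2. CCDB-derived rule: transmembrane regions imply a membrane protein.
    if has_transmembrane_region:
        return "Membrane"
    # 3. Assumed: use the PSORTB prediction if one is available.
    if psortb_prediction:
        return psortb_prediction
    # 4. Default when no rule above applies.
    return "Cytoplasmic"
```

For example, `subcellular_location(["transporter activity"], "ABC transporter permease", True)` falls through the SwissProt rules and is assigned "Membrane" by the transmembrane rule.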

Annotations from multiple sources Example: Enzyme Classification (EC) number and its related fields • SwissProt • CCDB • KEGG database • Metabolic information from CCDB is transferred • When the EC number from SwissProt/KEGG matches the EC number from CCDB, or • If an EC number is not available from SwissProt/KEGG
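
The transfer condition for CCDB metabolic information reduces to a small predicate. This is a sketch only: the function name is hypothetical, and the case where CCDB itself has no EC number is not covered by the slide, so it is treated here as "nothing to transfer".

```python
from typing import Optional


def should_transfer_ccdb_metabolism(
    ccdb_ec: Optional[str],
    reference_ec: Optional[str],
) -> bool:
    """Decide whether metabolic information from CCDB should be transferred.

    reference_ec is the EC number from SwissProt/KEGG, or None if unavailable.
    """
    if ccdb_ec is None:
        return False  # assumption: no CCDB EC number means nothing to transfer
    if reference_ec is None:
        return True   # no SwissProt/KEGG EC number available
    return ccdb_ec == reference_ec  # transfer only when the EC numbers match
```

With `ccdb_ec="3.1.26.12"`, the predicate is true when the SwissProt/KEGG value is the same string or is missing, and false when the two sources disagree.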

Annotation parsing • CCDB format (annotations) • Text format (annotations and evidence) • HTML format (annotation table)

CCDB format • Clean view • Annotations are marked with • [S] if exact match to SwissProt • [H] if homology to a SwissProt entry • [C] if homology to a CCDB entry • Annotations are linked to online sources, e.g. Pfam, PROSITE and InterPro accession numbers, and GI numbers from homologues
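
The marker scheme can be illustrated with a short helper. The precedence used here (exact SwissProt match first, then SwissProt homology, then CCDB homology) and the function names are assumptions; the slide only defines what each marker means.

```python
def source_marker(
    swissprot_exact: bool,
    swissprot_homolog: bool,
    ccdb_homolog: bool,
) -> str:
    """Return the evidence marker for an annotation, or "" if none applies."""
    if swissprot_exact:
        return "[S]"  # exact match to SwissProt
    if swissprot_homolog:
        return "[H]"  # homology to a SwissProt entry
    if ccdb_homolog:
        return "[C]"  # homology to a CCDB entry
    return ""


def format_annotation(field: str, value: str, marker: str) -> str:
    """Render one annotation line in a hypothetical 'field: value marker' layout."""
    return f"{field}: {value} {marker}".rstrip()
```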

Text format • Provides evidence • Source of the annotation, i.e. database name and version • Evidence used to support the annotation, e.g. the BLAST report in the case of a similarity search • Quality indicator such as "marginal", "strong" or "clear" • Time at which the annotation was generated
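
The four evidence items listed above map naturally onto a record type. The following is a sketch under assumptions: the `Evidence` class, its field names and the text layout produced by `to_text` are illustrative and are not the BASys text format itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Quality labels quoted on the slide; the slide says "such as", so the
# real set may be larger.
QUALITY_LEVELS = ("marginal", "strong", "clear")


@dataclass
class Evidence:
    annotation_field: str
    value: str
    source_db: str        # database name
    source_version: str   # database version
    support: str          # e.g. a BLAST report for a similarity search
    quality: str
    generated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if self.quality not in QUALITY_LEVELS:
            raise ValueError(f"unknown quality indicator: {self.quality!r}")

    def to_text(self) -> str:
        """Render the record as plain text, one item per line."""
        return "\n".join([
            f"{self.annotation_field}: {self.value}",
            f"  source: {self.source_db} ({self.source_version})",
            f"  evidence: {self.support}",
            f"  quality: {self.quality}",
            f"  generated: {self.generated_at.isoformat()}",
        ])
```

Recording the database version and generation time alongside each annotation lets a reader judge whether an annotation may be out of date.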

Table format • For a quick view of annotations • Shows start and end position and direction of the gene, accession no., gene name, COG id and protein function

BASys annotation pipeline • Each analysis program is written in object-oriented Perl and uses the Bioperl library • The annotation API is fully compatible with the Bioperl project • Currently the BASys system contains nearly 54 Perl modules and many small scripts, with more than 60,000 lines of fully object-oriented code • The code was written with the aim of being fully documented

BASys annotation pipeline • ~8 external tools to analyze the data • Glimmer, HMMER, BLAST, Homodeller, VADAR, predictSPTM, ps_scan, etc. • ~20 databases as a source of annotation • SwissProt, CCDB, nr, COG, KEGG, PROSITE, reference database of model organisms, PDB, PSORTB, TargetDB, gene ontology etc.

BASys and BacMap • The BASys annotation engine is used in BacMap to generate annotations of bacterial genomes • Successfully completed annotation of 200 bacterial and archaeal genomes in NCBI

The BASys report generator • A navigable circular genome map is generated automatically once the annotations are done, for genome visualization and exploration • BASys uses the CGView application to produce the navigable circular genome map • BASys passes annotations to CGView in the form of an XML document • CGView then renders this information as a series of hyperlinked PNG image files • The map shows annotated genes and their COG category classification
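
As a rough illustration of passing annotations to CGView as XML, the sketch below builds a small document with Python's standard library. The element and attribute names (`cgview`, `featureSlot`, `feature`, `featureRange`, `sequenceLength`, `backboneRadius`, `decoration`) are assumed to follow CGView's XML input format and should be checked against the CGView documentation. The function name, the gene tuples and the single fixed color are hypothetical, and this is not the XML that BASys itself emits.

```python
import xml.etree.ElementTree as ET


def build_cgview_xml(sequence_length, genes):
    """Build a minimal CGView-style XML string.

    genes is an iterable of (label, start, stop, strand) tuples,
    where strand is "+" or "-".
    """
    root = ET.Element(
        "cgview",
        sequenceLength=str(sequence_length),
        backboneRadius="160",
    )
    # One slot per strand, so forward and reverse genes sit on separate rings.
    slots = {
        "+": ET.SubElement(root, "featureSlot", strand="direct"),
        "-": ET.SubElement(root, "featureSlot", strand="reverse"),
    }
    for label, start, stop, strand in genes:
        decoration = (
            "clockwise-arrow" if strand == "+" else "counterclockwise-arrow"
        )
        feature = ET.SubElement(
            slots[strand],
            "feature",
            color="blue",  # a real map would vary color by COG category
            decoration=decoration,
            label=label,
        )
        ET.SubElement(feature, "featureRange", start=str(start), stop=str(stop))
    return ET.tostring(root, encoding="unicode")


# Hypothetical genes on a 5000 bp chromosome.
xml_text = build_cgview_xml(
    5000,
    [("geneA", 100, 400, "+"), ("geneB", 900, 1500, "-")],
)
```

Generating the XML from the same gene records used for the annotation tables keeps the map and the gene cards consistent with each other.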

The BASys report generator • Each identified gene is displayed and labeled on the map • Each gene is hyperlinked to a gene card containing the annotations for that gene • Each gene card contains hyperlinks to an evidence card, which describes the source and quality of each annotation in more detail, and to an annotation table, which gives brief annotations

Future work • BLAST and text searching • Manual annotation • TIGRFAMs • BLOCKS • PRINTS

Publications • G. H. Van Domselaar, P. Stothard, S. Shrivastava, J. Cruz, A. Guo, X. Dong, P. Lu, D. Szafron, R. Greiner, and D. S. Wishart (2005) BASys: A web server for automated bacterial genome annotation. Nucleic Acids Research (accepted). • P. Stothard, G. Van Domselaar, S. Shrivastava, A. Guo, B. O'Neill, J. Cruz, M. Ellison, and D. S. Wishart (2005) BacMap: an interactive picture atlas of annotated bacterial genomes. Nucleic Acids Research 33: D317-D320.

Acknowledgements • Prof. David Wishart • Dr. Gary Van Domselaar • Dr. Paul Stothard • Anchi Guo • Joseph Cruz • Xiaoli Dong • Nelson Young • All the lab members and • Dr. Warren Gallin
