Genome analysis and annotation Part II


Modeling a gene
Evidence View (figure). Evidence tracks shown:
• S. mansoni PASA assemblies
• S. japonicum EST alignments
• Genewise alignments (predictions)
• nr Protein Alignments
• Caenorhabditis sp. Protein Alignments
• Brugia malayi Protein Alignments

Attributes of individual annotated genes
Screenshot of a component within Neomorphic’s annotation station (www.neomorphic.com). Panels shown:
• Sequence Database Hits: protein matches (top), EST matches (bottom)
• Gene Predictions
• Annotated Gene: editing panel (top), final curation (bottom)
• Splice site predictions: acceptor sites (red), donor sites (blue)
Not shown graphically: gene name, nucleotide and protein sequence, MW, pI, organellar targeting sequence, membrane spanning regions, other domains.

Assigning function to predicted gene products

The primary tool for assigning function is homology to well-characterized proteins. However, transitive annotation (assigning a function by homology to a protein whose own function was assigned by homology) can lead to errors that propagate.
Figure labels: E. coli, H. influenzae, M. genitalium.

The modular nature of proteins can provide the basis for functional annotation
• Proteins may share features that give clues to their structure and/or function.
• A domain is a region of a protein that can adopt a particular three-dimensional structure. A group of proteins that share a domain is called a family. There are several databases of protein families, such as Pfam (http://www.sanger.ac.uk/Software/Pfam/).
• Motifs are short, conserved regions of proteins, typically consisting of a pattern of amino acids that characterizes a protein family (http://www.expasy.org/prosite/).
  EF-hand: D-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]
• HMM domains can also be defined and used to group proteins into families.
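A PROSITE-style motif such as the EF-hand pattern above can be searched for by translating it into a regular expression. The following is a minimal sketch, not PROSITE's own scanner: the function name `prosite_to_regex` and the test sequence are made up for illustration, the pattern string is copied as printed on the slide, and the N- and C-terminal anchors (`<`, `>`) of the PROSITE syntax are not handled.

```python
import re

# EF-hand pattern exactly as printed on the slide.
EF_HAND = ("D-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-"
           "[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            piece = "."                        # x: any residue
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"    # {..}: any residue except these
        else:
            piece = core                       # [..] or a single residue
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


ef_hand_regex = prosite_to_regex(EF_HAND)
# Hypothetical protein fragment, constructed only to contain a match.
hit = re.search(ef_hand_regex, "MADDGDGTITTKELGT")
print(ef_hand_regex)
print(hit.start(), hit.group())
```

Because curly braces mean "not these residues" in PROSITE but "repeat count" in regular expressions, the translation has to be done element by element rather than by simple character substitution.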

Protein domain frequencies can yield insights into the biology of an organism
Table: top 20 Pfam domains in A. fumigatus, with counts in A. nidulans and A. oryzae (columns: Afu, Ana, Aoa).

Domain-based paralogous families can be generated
The domain content of an entire proteome can be computed:
- Start from all the proteins from a genome
- HMM search against Pfam profiles
- Alignment search against homology-based domain alignments
- The search results are stored in the database in the form of domain-based alignments
- Organize the proteins into domain-based paralogous families
• Related families share one or more domains with other families
• Many putative novel domains are extensions of existing domains
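The last step of the workflow above, organizing proteins into families by domain content, can be sketched as follows. This is only one possible reading, under stated assumptions: a "family" is taken here to be the set of proteins with an identical domain composition, two families are "related" if they share at least one domain, and all protein and domain names are placeholders. The slides do not specify the actual grouping rule.

```python
from collections import defaultdict
from itertools import combinations


def group_by_domains(domain_hits):
    """Group proteins that have the same set of domains (assumed family rule)."""
    families = defaultdict(list)
    for protein, domains in domain_hits.items():
        if domains:  # proteins without any domain hit are left ungrouped
            families[frozenset(domains)].append(protein)
    # A paralogous family needs at least two members from the same genome.
    return {doms: sorted(members)
            for doms, members in families.items() if len(members) >= 2}


def related_families(families):
    """Return pairs of families that share one or more domains."""
    return [(a, b) for a, b in combinations(families, 2) if a & b]


# Placeholder search results: protein id -> domains found by the HMM search.
domain_hits = {
    "prot1": {"domA"},
    "prot2": {"domA"},
    "prot3": {"domA", "domB"},
    "prot4": {"domA", "domB"},
    "prot5": {"domC"},
    "prot6": set(),
}

families = group_by_domains(domain_hits)
related = related_families(families)
for doms, members in families.items():
    print(sorted(doms), members)
```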

Hidden Markov Models (HMMs)
Statistical representations of sequence patterns. A query sequence is scored by how likely it is that the HMM would produce it.
Seed:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Model: (figure)

Procedure for Preparing an HMM Seed
• Inspect and edit a pairwise aligned group of gene products:
  - Eliminate fragments
  - Correct the alignment
  - Remove sequence outside the domain
  - Eliminate redundancy
  - BLAST, annotate and possibly expand the seed

Homology-Based Alignment (figure): HMM seed and trusted hits.

What is Gene Ontology (GO)?
The Gene Ontology is a set of dynamic controlled vocabularies used to describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner (www.geneontology.org).

The Three Ontologies
Molecular function, biological process and cellular component are considered attributes of gene products.
• Biological Process (a)
  - a biological objective
  - has more than one distinct step
• Molecular Function (b)
  - what the gene product does
  - think ‘activity’
• Cellular Component (c)
  - location in the cell (or smaller unit)
  - or part of a complex

Assigning GO IDs
Each GO ID is qualified with an evidence code. Evidence codes are:
• IMP: inferred from mutant phenotype
• IGI: inferred from genetic interaction
• IPI: inferred from physical interaction
• IDA: inferred from direct assay
• IEP: inferred from expression pattern
• ISS: inferred from sequence or structural similarity
• IEA: inferred from electronic annotation
• IC: inferred by curator
• TAS: traceable author statement
• NAS: non-traceable author statement
• ND: no biological data available
• NR: not recorded (no longer used)
Broad categories of evidence:
• Experimental evidence
• Sequence similarity
• Calculated by algorithm
• Author statement
The “with/from” field: ISS, IPI and IGI require the accession of the similarity hit or the interacting entity.
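The rule that some evidence codes need a supporting accession can be checked mechanically. The sketch below is illustrative only: the function `check_annotation`, its arguments and the example accessions are hypothetical, and the only rules it encodes are the code list above, the requirement that ISS, IPI and IGI carry a "with" accession, and the standard `GO:` plus seven digits identifier format.

```python
import re

EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "ISS": "inferred from sequence or structural similarity",
    "IEA": "inferred from electronic annotation",
    "IC": "inferred by curator",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
}

# Codes that must name the similarity hit or the interacting entity.
REQUIRES_WITH = {"ISS", "IPI", "IGI"}


def check_annotation(go_id, evidence, with_field=""):
    """Return a list of problems found in one GO annotation (empty if none)."""
    problems = []
    if not re.fullmatch(r"GO:\d{7}", go_id):
        problems.append(f"malformed GO ID: {go_id}")
    if evidence not in EVIDENCE_CODES:
        problems.append(f"unknown or retired evidence code: {evidence}")
    elif evidence in REQUIRES_WITH and not with_field.strip():
        problems.append(f"{evidence} requires an accession in the with field")
    return problems


# Hypothetical annotations, for illustration only.
print(check_annotation("GO:0006412", "ISS"))
print(check_annotation("GO:0006412", "ISS", "UniProtKB:P12345"))
print(check_annotation("GO:6412", "NR"))
```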

Gene ontologies can help interpret large-scale datasets
Figure: K-means clustering using TIGR Multi-Experiment Viewer (TMEV).
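For readers unfamiliar with K-means, the idea behind the clustering step can be sketched in a few lines. This is a generic textbook version, not TMEV's implementation: the expression profiles are invented, and the initial centroids are simply the first k profiles, which is an assumption made here to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iterations=20):
    """Cluster expression profiles; returns (labels, centroids)."""
    # Naive initialization (assumption): use the first k profiles as centroids.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centroids


# Invented expression profiles (one row per gene, one column per experiment).
profiles = [
    [1.0, 2.0, 3.0],   # rising
    [3.0, 2.0, 1.0],   # falling
    [1.1, 2.0, 2.9],   # rising
    [2.9, 2.1, 1.0],   # falling
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Once genes are grouped this way, the GO terms attached to the members of each cluster can be inspected to suggest what the cluster represents biologically.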

Figure panels: Cluster 4 and Cluster 10. Annotation labels: translation, transcription; methanogenesis.
