Practically Genomic … A hands-on Bioinformatics IAP - Protein Analysis

• Accessing Protein Information
• Sequence Alignment
  • Pairwise and local with BLAST
  • Multiple sequences and global with ClustalX
• Phylogenetic Analysis (ClustalX)
• Protein Domain and Motif Analysis (SMART, Interpro)

Accessing Protein Sequences and Information
The large number of different databases and resources can make this difficult.
• Different resources:
  • contain different data
  • use different identifier schemes
  • use different definitions of redundancy
• Examples: Ensembl (genomes), NCBI protein (GenBank), IPI and UniProt.
• UniProt may be the best place to begin.
  • Useful X_Y ID scheme
    • Identifies at least the species, and often the protein name as well.
  • Widespread usage (SMART, GO)
  • Abundant manual annotation and cross-referencing tools
  • Database is mirrored at multiple locations
UniProt: http://www.pir.uniprot.org/
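The X_Y scheme can be taken apart mechanically. The sketch below assumes entry names of the form seen elsewhere in these slides (for example ITA5_HUMAN), with a protein mnemonic before the underscore and a species mnemonic after it. The function name `split_entry_name` is hypothetical and not part of any UniProt tool.

```python
def split_entry_name(entry_name):
    """Split an X_Y entry name such as 'ITA5_HUMAN' into (protein, species)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not an X_Y entry name: {entry_name!r}")
    return protein, species


# Group a few entry names from the tree example by species mnemonic.
names = ["ITA5_HUMAN", "ITA2_DROME", "ITAV_HUMAN"]
by_species = {}
for name in names:
    protein, species = split_entry_name(name)
    by_species.setdefault(species, []).append(protein)

print(by_species)
```

Grouping by the species part is often enough to separate, say, human from Drosophila sequences before building an alignment.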

Local Sequence Alignment (BLAST)
• Searching is done in a pair-wise fashion and reported alignments are restricted to the best parts of the query-target relationship.
• Multiple BLAST “flavors” allow alignments of protein and DNA in all different combinations.
• Relatively fast and sensitive, making BLAST the standard tool for searching large datasets using sequence similarity.
• Ubiquitous: virtually all online protein resources have some kind of BLAST implementation.
• NCBI may have the best on-line version of the tool.
  • http://www.ncbi.nlm.nih.gov/blast/
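To make "restricted to the best parts" concrete, here is a minimal sketch of local alignment scoring. It is not BLAST's algorithm: BLAST uses a faster seed-and-extend heuristic and a substitution matrix, whereas this is the exact Smith-Waterman dynamic program with simple match, mismatch and gap scores chosen only for illustration. The two sequences are made up.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to 0 is what makes the alignment local: a poor
            # region never drags down a good region that follows it.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share only the six-residue stretch CSVTCG.
print(smith_waterman_score("AAACSVTCGAAA", "WWCSVTCGWW"))  # 12
```

The score of 12 comes entirely from the six matching residues; the unrelated flanks contribute nothing, which is the behaviour the first bullet describes.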

Global Sequence Alignment (MSA)
[Figure: portion of a multiple, global alignment created with ClustalX]
The goal is to stack in columns the amino acids that derive from the same ancestral residue. The quality of pair-wise and group-wise alignments is scored using substitution matrices.

Protein Substitution Matrices
Both local and global alignments use substitution matrices to quantify relationships between proteins.
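As a rough illustration of how a matrix turns an alignment column into a number, the sketch below scores a column by summing the matrix entry for every pair of residues in it (a sum-of-pairs score). The values in `TOY_SCORES` are invented for this example and are not taken from BLOSUM, PAM or any published matrix. Gaps are not handled.

```python
from itertools import combinations

# Hypothetical scores: identical or chemically similar residues score
# positively, dissimilar residues score negatively.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("I", "L"): 2, ("D", "L"): -4, ("D", "I"): -3,
}


def pair_score(x, y):
    """Look up a residue pair regardless of order."""
    return TOY_SCORES[tuple(sorted((x, y)))]


def column_score(column):
    """Sum-of-pairs score for one alignment column (a string of residues)."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


print(column_score("LLI"))  # 8: a conservative column
print(column_score("LDI"))  # -5: a poorly conserved column
```

Summing such column scores along an alignment gives one simple way to compare alternative alignments of the same sequences.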

Phylogenetic Trees
• Clustal uses the Neighbor-Joining Method (NJ)
  • NJ is a distance-based method that repeatedly groups the 2 most closely related sequences.
• The Phylip package is freely available and implements a wide range of different methods. http://evolution.genetics.washington.edu/phylip.html
• Tree Reliability
  • The bootstrap method is used to add confidence levels to the groupings.
• Visualization of the tree
  • NJ Plot
    • Draws unrooted phylogenetic trees in phenogram format
  • Other methods allow more control of format: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
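A sketch of the pair-selection step in neighbor joining may help here. NJ does not simply take the smallest raw distance: it picks the pair minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), which corrects each distance for how far the two sequences are from everything else. The taxa and distances below are toy values, and `nj_pick_pair` is a hypothetical helper covering only this one step, not a full tree builder.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of names that neighbor joining would merge next."""
    n = len(names)
    # Total distance from each taxon to all others.
    total = {a: sum(dist[frozenset((a, b))] for b in names if b != a)
             for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * dist[frozenset(pair)] - total[a] - total[b]

    return min(combinations(names, 2), key=q)


names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): d for pair, d in raw.items()}

print(nj_pick_pair(names, dist))  # ('a', 'b')
```

In this toy matrix d and e have the smallest raw distance (3), yet a and b are joined first because their Q value (-50) is lower than that of d and e (-48).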

Assessing Tree Reliability using Bootstrapping
[Figure: an actual alignment shown beside a bootstrap replicate]
• Positions within the original alignment are randomly resampled to create a “pseudo replicate”.
• Large numbers of pseudo replicates are generated.
• The distances between species within each pseudo replicate are calculated and a tree is drawn for each.
• The stability of clades across the set of trees is calculated to identify clades that are present in most pseudo replicates.
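The resampling step can be sketched as follows, assuming the usual form of the bootstrap in which columns are drawn with replacement until the replicate is as long as the original alignment. The three short sequences are made up, and building a tree from each replicate is left out.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence names to equal-length aligned strings.
    The same column choices are applied to every sequence, so each
    column of the replicate is an intact column of the original.
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


alignment = {
    "seq1": "MKLVCSVT",
    "seq2": "MKIVCSAT",
    "seq3": "MRLVCTVT",
}

rng = random.Random(0)  # seeded only to make the run repeatable
for _ in range(3):
    print(bootstrap_replicate(alignment, rng))
```

Some columns appear several times in a replicate and others not at all, which is what lets the procedure probe how much each clade depends on a few positions.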

Phylogenetic Tree Examples
Tree in Newick format, with bracketed support values:
( ( ( ITA1_DROME:0.67741, ( ITA6_HUMAN:0.42032, ITA7_HUMAN:0.31161) :0.29176[1000]) :0.11947[992], ( ITA2_DROME:0.72000, ( ITA5_HUMAN:0.37147, ITAV_HUMAN:0.43034) :0.25993[1000]) :0.09502[976]) :0.12118[954], ( ITA5_DROME:1.07810, ( ( ITA10_HUMAN:0.70421, ITA2_HUMAN:0.73710) :0.10612[857], ITAL_HUMAN:0.86603) :0.18550[986]) :0.02936[434], ( ITA4_HUMAN:0.49064, ITA9_HUMAN:0.45807) :0.35160[1000]);
[Figure: the same tree drawn with support values at the internal nodes]
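The bracketed numbers in the tree can be pulled out with a regular expression. This sketch assumes they are bootstrap counts out of 1000 replicates (the largest value present is 1000, but the slide does not state the replicate count), and the 70% cutoff used to flag weak groupings is an illustrative choice, not a rule from the slides.

```python
import re

# Tree string copied from the slide.
NEWICK = (
    "( ( ( ITA1_DROME:0.67741, ( ITA6_HUMAN:0.42032, ITA7_HUMAN:0.31161) "
    ":0.29176[1000]) :0.11947[992], ( ITA2_DROME:0.72000, "
    "( ITA5_HUMAN:0.37147, ITAV_HUMAN:0.43034) :0.25993[1000]) "
    ":0.09502[976]) :0.12118[954], ( ITA5_DROME:1.07810, "
    "( ( ITA10_HUMAN:0.70421, ITA2_HUMAN:0.73710) :0.10612[857], "
    "ITAL_HUMAN:0.86603) :0.18550[986]) :0.02936[434], "
    "( ITA4_HUMAN:0.49064, ITA9_HUMAN:0.45807) :0.35160[1000]);"
)

REPLICATES = 1000  # assumption: not stated on the slide
CUTOFF = 0.70      # assumption: illustrative threshold only

counts = [int(n) for n in re.findall(r"\[(\d+)\]", NEWICK)]
weak = [n for n in counts if n / REPLICATES < CUTOFF]

print(counts)  # [1000, 992, 1000, 976, 954, 857, 986, 434, 1000]
print(weak)    # [434]
```

Under these assumptions only one grouping, the one supported in 434 replicates, falls below the cutoff.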

Homolog, Ortholog and Paralog
[Figure: an ancestral organism with gene A; a speciation event gives orthologs xA and yA; a gene duplication gives paralogs yA’ and yA’’; xA, yA’ and yA’’ are all homologs]
• There is no such thing as percent homology.
• When there is any doubt, use the term homolog.
• Domain composition is useful in the identification of homologs?

Protein Domains and Motifs
• Protein domains are modular units of sequence with consistent structure and function.
• Evolution can produce both new domains and novel combinations of domains.
• Protein motifs are short sequence patterns with functional implications.
[Figure: thrombospondins. Subgroup B is pan-bilaterian; Subgroup A is deuterostome-specific; CSVTCG is the CD36-binding motif]

Protein Domain and Motif Analysis
• Models (HMMs) that describe domains are created from alignments. Those models are then used to scan proteins for the presence of domains.
• Domains do not need to be characterized or understood to be detected (DUFs).
• Motifs are analyzed in a similar way or using simpler methods involving text pattern matching.
• Proteins in public databases have already been analyzed for domain content and these data are available from a number of sources.
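The simpler text-pattern approach to motifs can be sketched with a regular expression. The CSVTCG motif comes from the thrombospondin slide. The degenerate pattern `CS.TCG` and the test sequence are hypothetical, included only to show how a wildcard position widens the search.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, sequence)]


# Made-up sequence containing one exact and one near copy of the motif.
seq = "MKACSVTCGGGLQCSATCGW"

print(find_motif(seq, "CSVTCG"))  # [(4, 'CSVTCG')]
print(find_motif(seq, "CS.TCG"))  # [(4, 'CSVTCG'), (14, 'CSATCG')]
```

Pattern matching of this kind is fast but carries no notion of score, so it suits short, well-defined motifs better than whole domains, which is where the HMM-based models come in.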

SMART - http://smart.embl-heidelberg.de/
• SMART is an excellent resource for domain analysis
• Integrates data from multiple sources
  • SMART and Pfam domain models
  • Gene Ontology
  • Taxonomic data
  • Genomic data (Ensembl)
• Powerful Search Tools
• Excellent Graphics

InterProScan - http://www.ebi.ac.uk/InterProScan/
• Includes some of the things found in SMART plus additional models and methods.
• Software and data are freely available, allowing batch analysis of proteins on local computers.

Scansite - http://scansite.mit.edu/
• Search tool designed to identify substrates of a variety of protein kinases.
• Other useful utilities are also available.
