Practically Genomic… A hands-on Bioinformatics IAP - Protein Analysis
• Accessing Protein Information
• Sequence Alignment
  • Pairwise and local with BLAST
  • Multiple sequences and global with ClustalX
• Phylogenetic Analysis (ClustalX)
• Protein Domain and Motif Analysis (SMART, InterPro)
Accessing Protein Sequences and Information
The large number of different databases and resources can make this difficult.
• Different resources:
  • contain different data
  • use different identifier schemes
  • use different definitions of redundancy
• Examples: Ensembl (genomes), NCBI Protein (GenBank), IPI and UniProt.
• UniProt may be the best place to begin.
  • Useful X_Y ID scheme
    • The ID gives at least the species (Y) and often the protein name as well (X), e.g. ITA5_HUMAN.
  • Widespread usage (SMART, GO)
  • Abundant manual annotation and cross-referencing tools
  • Database is mirrored at multiple locations
UniProt: http://www.pir.uniprot.org/
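Because the X_Y scheme is so regular, IDs can be split and grouped with a few lines of code. The following is a minimal sketch, assuming every ID contains exactly the two parts described above; the helper name `split_entry_name` is made up for this example, and the IDs are taken from the tree example later in these slides.

```python
def split_entry_name(entry_name):
    """Split an X_Y style ID into (protein code X, species code Y)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not an X_Y style ID: {entry_name!r}")
    return protein, species


# Group a few IDs by their species code.
names = ["ITA5_HUMAN", "ITA2_DROME", "ITAV_HUMAN"]
by_species = {}
for name in names:
    protein, species = split_entry_name(name)
    by_species.setdefault(species, []).append(protein)

print(by_species)  # {'HUMAN': ['ITA5', 'ITAV'], 'DROME': ['ITA2']}
```

Grouping by the species code is a quick way to separate, say, human and fly sequences before building an alignment.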
Local Sequence Alignment (BLAST)
• Searching is done in a pairwise fashion, and reported alignments are restricted to the best-matching parts of the query-target relationship.
• Multiple BLAST “flavors” allow protein and DNA sequences to be aligned in all combinations.
• BLAST is relatively fast and sensitive, which makes it the standard tool for searching large datasets by sequence similarity.
• Ubiquitous: virtually all online protein resources have some kind of BLAST implementation.
• NCBI may have the best online version of the tool.
  • http://www.ncbi.nlm.nih.gov/blast/
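To make "restricted to the best parts" concrete, here is a small sketch of a local alignment. Note that this is not how BLAST works internally: BLAST uses a fast seed-and-extend heuristic, while the sketch below uses the exact Smith-Waterman dynamic programming method, swapped in because it is short enough to read. The match, mismatch and gap scores are assumed values chosen for illustration (real tools use a substitution matrix), and the target sequence is made up.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere: that is what makes it local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical target sequence containing the query as a substring.
print(smith_waterman("CSVTCG", "MKLCSVTCGAAP"))  # (12, 'CSVTCG', 'CSVTCG')
```

Only the matching stretch is reported; the flanking residues of the target are ignored, which is the behavior the first bullet describes.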
Global Sequence Alignment (MSA)
[Figure: portion of a multiple, global alignment created with ClustalX]
The goal is to stack in columns the amino acids that derive from the same ancestral residue. The quality of pairwise and group-wise alignments is scored using substitution matrices.
Protein Substitution Matrices
Both local and global alignments use substitution matrices to quantify relationships between proteins.
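As a rough illustration of how a substitution matrix turns an alignment into a number, the sketch below scores a pairwise alignment position by position and scores one column of a multiple alignment by summing over all pairs of residues in it. The score table is a made-up toy covering three residues, not BLOSUM or PAM values, and the flat gap penalty is likewise an assumption.

```python
from itertools import combinations

# Hypothetical toy scores for three residues; NOT real BLOSUM or PAM values.
TOY_SCORES = {
    ("D", "D"): 6, ("I", "I"): 4, ("L", "L"): 4,
    ("I", "L"): 2,                    # similar hydrophobic residues: mild reward
    ("D", "I"): -3, ("D", "L"): -4,   # dissimilar residues: penalty
}
GAP = -5  # assumed flat penalty for a residue paired with a gap


def pair_score(x, y):
    """Score one pair of aligned symbols."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return TOY_SCORES[tuple(sorted((x, y)))]  # the table is symmetric


def pairwise_score(s1, s2):
    """Score two aligned sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(s1, s2))


def column_score(column):
    """Sum-of-pairs score for one column of a multiple alignment."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


print(pairwise_score("LID", "LLD"))  # 4 + 2 + 6 = 12
print(pairwise_score("LID", "L-D"))  # 4 - 5 + 6 = 5
print(column_score("LLI"))           # 4 + 2 + 2 = 8
```

Swapping a residue for a similar one costs little, while a dissimilar residue or a gap lowers the score sharply, so the better of two candidate alignments is the one with the higher total.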
Phylogenetic Trees
• Clustal uses the Neighbor-Joining (NJ) method
  • NJ is a distance-based method that repeatedly joins the pair of sequences (or groups) judged to be nearest neighbors, after correcting each distance for how far the pair lies from all the other sequences.
  • The Phylip package is freely available and implements a wide range of different methods. http://evolution.genetics.washington.edu/phylip.html
• Tree reliability
  • The bootstrap method is used to add confidence levels to the groupings.
• Visualization of the tree
  • NJ Plot
    • Draws unrooted phylogenetic trees in phenogram format
  • Other methods allow more control of format: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
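The pair that NJ joins first is not always the pair with the smallest raw distance. The sketch below shows only the pair-selection step of the standard NJ algorithm, using the usual criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to every other sequence. It is not a full tree builder and not Clustal's code; the function name and the five-taxon distance matrix are illustrative assumptions.

```python
def nj_pick_pair(names, d):
    """Return (Q, name_i, name_j) for the pair with the smallest Q value."""
    n = len(names)
    totals = [sum(row) for row in d]  # r(i): summed distance from i to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


# Illustrative, made-up distances between five sequences a..e.
names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_pick_pair(names, d))  # (-50, 'a', 'b')
```

Here d and e have the smallest raw distance (3), yet a and b are joined first because both are far from everything else. A full NJ run would then replace a and b with a new node, recompute distances, and repeat.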
Assessing Tree Reliability using Bootstrapping
[Figure: the actual alignment shown next to a bootstrap replicate]
• Positions (columns) within the original alignment are randomly resampled, with replacement, to create a “pseudo replicate” of the same length.
• Large numbers of pseudo replicates are generated.
• The distances between species within each pseudo replicate are calculated, and a tree is drawn for each.
• The stability of clades across the resulting set of trees is calculated to identify clades that are present in most pseudo replicates.
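The resampling step can be sketched in a few lines. In the example below the four sequences are made up, distances are simple fractions of differing positions, and, instead of building a full tree for every replicate, a crude stand-in test is used: two sequences count as "grouped" when they are closer to each other than either is to any other sequence. All function names are hypothetical.

```python
import random

# Hypothetical aligned sequences (equal length, no gaps).
ALIGNMENT = {
    "seqA": "MKTLLVAG",
    "seqB": "MKTLIVAG",
    "seqC": "MRSLLAAG",
    "seqD": "MRSALAAG",
}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def p_distance(s1, s2):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)


def group_together(alignment, a, b):
    """Crude stand-in for tree building: a and b are mutually closest."""
    d_ab = p_distance(alignment[a], alignment[b])
    others = [n for n in alignment if n not in (a, b)]
    return all(d_ab < p_distance(alignment[x], alignment[o])
               for x in (a, b) for o in others)


def bootstrap_support(alignment, a, b, replicates=1000, seed=0):
    """Count the replicates in which a and b still group together."""
    rng = random.Random(seed)
    return sum(group_together(bootstrap_replicate(alignment, rng), a, b)
               for _ in range(replicates))


print(bootstrap_support(ALIGNMENT, "seqA", "seqB"), "of 1000 replicates")
```

A grouping that survives in nearly all replicates is supported by many columns of the alignment; one that appears in only a fraction of them depends on a few positions and deserves less trust.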
Phylogenetic Tree Examples
Tree description (branch lengths follow the colons; bootstrap values are in square brackets):
( ( ( ITA1_DROME:0.67741, ( ITA6_HUMAN:0.42032, ITA7_HUMAN:0.31161) :0.29176[1000]) :0.11947[992], ( ITA2_DROME:0.72000, ( ITA5_HUMAN:0.37147, ITAV_HUMAN:0.43034) :0.25993[1000]) :0.09502[976]) :0.12118[954], ( ITA5_DROME:1.07810, ( ( ITA10_HUMAN:0.70421, ITA2_HUMAN:0.73710) :0.10612[857], ITAL_HUMAN:0.86603) :0.18550[986]) :0.02936[434], ( ITA4_HUMAN:0.49064, ITA9_HUMAN:0.45807) :0.35160[1000]);
[Figure: the same tree drawn with the sequence names at the tips and the bootstrap values labeled at the internal nodes]
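The bracketed numbers in the tree description can be pulled out programmatically to see which groupings are weakly supported. The sketch below is a minimal reader for this particular bracketed style, not a general Newick parser. It assumes the values are counts out of 1000 replicates (inferred from the maximum value shown), and the cutoff of 700 is an arbitrary choice for illustration.

```python
import re

TREE = """( ( ( ITA1_DROME:0.67741, ( ITA6_HUMAN:0.42032, ITA7_HUMAN:0.31161) :0.29176[1000])
:0.11947[992], ( ITA2_DROME:0.72000, ( ITA5_HUMAN:0.37147, ITAV_HUMAN:0.43034) :0.25993[1000])
:0.09502[976]) :0.12118[954], ( ITA5_DROME:1.07810, ( ( ITA10_HUMAN:0.70421, ITA2_HUMAN:0.73710)
:0.10612[857], ITAL_HUMAN:0.86603) :0.18550[986]) :0.02936[434], ( ITA4_HUMAN:0.49064,
ITA9_HUMAN:0.45807) :0.35160[1000]);"""


def clades_with_support(tree):
    """Return a list of (bootstrap value, sorted leaf names) per labeled group."""
    tokens = re.findall(r"\(|\)|,|;|[A-Za-z0-9_]+|:[0-9.]+|\[\d+\]", tree)
    stack, clades, last_closed = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append([])              # start collecting leaves of a new group
        elif tok == ")":
            last_closed = stack.pop()     # group just finished
            if stack:
                stack[-1].extend(last_closed)  # its leaves also belong to the parent
        elif tok.startswith("["):
            clades.append((int(tok[1:-1]), sorted(last_closed)))
        elif tok[0] in ":,;":
            continue                      # branch lengths and separators are skipped
        else:
            stack[-1].append(tok)         # a leaf name
    return clades


CUTOFF = 700  # assumed threshold, out of an assumed 1000 replicates
for support, leaves in clades_with_support(TREE):
    if support < CUTOFF:
        print(support, leaves)
# 434 ['ITA10_HUMAN', 'ITA2_HUMAN', 'ITA5_DROME', 'ITAL_HUMAN']
```

Only one grouping falls below the cutoff here, so the placement of ITA5_DROME with that set of human sequences is the part of the tree to treat with the most caution.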
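Homolog, Ortholog and Paralog
[Figure: gene A in an ancestral organism; a speciation event produces the orthologs xA and yA; a gene duplication produces the paralogs yA’ and yA’’; all of these genes are homologs]
• There is no such thing as percent homology: homology is a yes-or-no statement about shared ancestry, whereas percent identity or similarity is what an alignment measures.
• When there is any doubt, use the term homolog.
• Domain composition is useful in the identification of homologs?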
Protein Domains and Motifs
• Protein domains are modular units of sequence with consistent structure and function.
• Evolution can produce both new domains and novel combinations of domains.
• Protein motifs are short sequence patterns with functional implications.
[Figure: pan-bilaterian subgroup B thrombospondin and deuterostome-specific subgroup A thrombospondin, with the CSVTCG CD36-binding motif marked]
Protein Domain and Motif Analysis
• Models (HMMs) that describe domains are created from alignments. Those models are then used to scan proteins for the presence of domains.
• Domains do not need to be characterized or understood to be detected (DUFs).
• Motifs are analyzed in a similar way or using simpler methods involving text pattern matching.
• Proteins in public databases have already been analyzed for domain content, and these data are available from a number of sources.
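The "simpler methods involving text pattern matching" can be as small as a regular expression. The sketch below scans a sequence for the CSVTCG motif mentioned on the previous slide. The protein sequence is made up for illustration, and the relaxed pattern `C[ST]VTCG` is a hypothetical variant included only to show how a degenerate position would be written.

```python
import re


def find_motif(sequence, pattern):
    """Return 1-based start positions of every match, including overlapping ones."""
    # The lookahead (?=...) matches without consuming, so overlaps are found too.
    return [m.start() + 1 for m in re.finditer(f"(?=({pattern}))", sequence)]


# Hypothetical protein fragment containing two copies of the motif.
SEQUENCE = "MGACSVTCGDGVQTRSRCSVTCGKL"

print(find_motif(SEQUENCE, "CSVTCG"))     # [4, 18]
print(find_motif(SEQUENCE, "C[ST]VTCG"))  # [4, 18]
```

Pattern matching like this finds exact or lightly degenerate motifs quickly, but it cannot weigh partial matches the way an HMM-based domain model does.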
SMART - http://smart.embl-heidelberg.de/
• SMART is an excellent resource for domain analysis
• Integrates data from multiple sources
  • SMART and Pfam domain models
  • Gene Ontology
  • Taxonomic data
  • Genomic data (Ensembl)
• Powerful search tools
• Excellent graphics
InterProScan - http://www.ebi.ac.uk/InterProScan/
• Includes some of the things found in SMART plus additional models and methods.
• Software and data are freely available, allowing batch analysis of proteins on local computers.
Scansite - http://scansite.mit.edu/
• Search tool designed to identify substrates of a variety of protein kinases.
• Other useful utilities are also available.