Explore a comprehensive guide to bioinformatics tools and methods for proteomics research, including databases, analysis techniques, and predictive modeling. Learn about sequence alignment, pattern searches, and structural predictions in protein analysis.
Tutorial: Bioinformatics Resources BIO-TRAC 25 (Proteomics: Principles and Methods) March 28, 2003 NIH, Bethesda, MD Zhang-Zhi Hu, M.D. Bioinformatics Scientist, Protein Information Resource National Biomedical Research Foundation
What is Bioinformatics? • NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2002) - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. • Bioinformatics is the application of information technology to the analysis, organization and distribution of biological data in order to answer complex biological questions.
Bioinformatics Resources • The Molecular Biology Database Collection: An Online Compilation of Relevant Database Resources • 2003 update: http://www3.oup.co.uk/nar/database/ • Nucleic Acids Research Database Issues (January Annually) (2003 - http://nar.oupjournals.org/content/vol31/issue1/) • DBcat: A Catalog of > 500 Biological Databases • http://www.infobiogen.fr/services/dbcat/
Molecular Biology Database Collection (http://nar.oupjournals.org/cgi/content/full/31/1/1#GKG120TB1)
The Molecular Biology Database Collection: 2003 update (Baxevanis, A.D.): an online resource of 386 key databases in 18 categories • Major sequence repositories • Comparative Genomics • Gene Expression • Gene Identification and Structure • Genetic and Physical Maps • Genomic Databases • Intermolecular Interactions • Metabolic Pathways and Cellular Regulation • Mutation Databases • Pathology • Protein Sequence Motifs • Proteome Resources • Retrieval Systems and Database Structure • RNA Sequences • Structure • Transgenics • Varied Biomedical Content
Overview • Protein Sequence Analysis I. Sequence Similarity Search and Alignment II. Family Classification Methods III. Structure Prediction Methods • Molecular Biology Databases IV. Protein Family Databases V. Databases of Protein Functions VI. Databases of Protein Structures • Proteomic Resources VII. 2D-gel databases VIII. Proteomic analyses
I. Sequence Similarity Search • Find a protein sequence: text search • Based on Pair-Wise Comparisons • BLOSUM scoring matrix • PAM scoring matrix • Dynamic Programming Algorithms • Global Similarity: Needleman-Wunsch (GAP/BestFit) • Local Similarity: Smith-Waterman (SSEARCH) • Heuristic Algorithms (Sequence Database Searching) • FASTA: Based on K-Tuples (2-Amino Acid) • BLAST: Triples of Conserved Amino Acids • Gapped-BLAST: Allow Gaps in Segment Pairs (NREF) • PHI-BLAST: Pattern-Hit Initiated Search (NCBI) • PSI-BLAST: Iterative Search (NCBI)
Sequence Search by Text or Unique ID (http://www.ncbi.nlm.nih.gov/Entrez/) (http://pir.georgetown.edu/pirwww/search/textsearch.html)
Pair-Wise Comparisons • Scoring matrix • Global and local • Similarity: Dynamic Programming • (Needleman-Wunsch, • Smith-Waterman) (http://www.ebi.ac.uk/emboss/align/)
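The two dynamic programming methods named above can be sketched in a few lines. This is a minimal illustration, not the implementation used by GAP, BestFit, or SSEARCH: it assumes a flat match/mismatch score and a linear gap penalty (the default values are arbitrary choices for the example), where a real search would look up each residue pair in a BLOSUM or PAM matrix and typically use affine gap costs. The function names `nw_score` and `sw_score` are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        prev = cur
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all local (substring) alignments."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (local similarity).
            cell = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

The only structural difference between the two is the floor at zero and taking the maximum over the whole table: that is what turns a global comparison into a local one.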
(http://pir.georgetown.edu/pirwww/search/fasta.html) FASTA Search (http://www.ebi.ac.uk/fasta33/)
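The k-tuple idea behind FASTA can be illustrated with a small word-lookup sketch. It covers only the first stage of the heuristic (finding shared words and counting them per diagonal); rescoring with a substitution matrix and joining regions are left out. The function name `ktuple_diagonals` is hypothetical, and k=2 follows the "2-Amino Acid" setting mentioned earlier in this tutorial.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count shared k-letter words on each diagonal (query pos - subject pos)."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject once; each shared word votes for one diagonal.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals
```

A diagonal with many hits marks a region where the two sequences share words at a constant offset, which is where a more careful alignment is worth computing. For example, `ktuple_diagonals("MKVLAAGI", "XXMKVLAAGI").most_common(1)` reports the diagonal at offset -2.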
(http://pir.georgetown.edu/pirwww/search/pirnref.shtml) Gapped-BLAST Search (http://www.ncbi.nlm.nih.gov/BLAST/)
PSI-BLAST Iterative Search (http://www.ncbi.nlm.nih.gov/BLAST/)
II. Family Classification Methods • Multiple Sequence Alignment and Phylogenetic Analysis • ClustalW Multiple Sequence Alignment • Alignment Editor & Phylogenetic Trees • Based on Family Information • PROSITE Pattern Search • Motif and Profile Search • Hidden Markov Model (HMMs)
Multiple Sequence Alignment • ClustalW (http://pir.georgetown.edu/pirwww/search/multaln.html)
Alignment Editor (Jalview) (http://www.ebi.ac.uk/clustalw/)
Alignment Editor (GeneDoc) (http://www.psc.edu/biomed/genedoc/)
Phylogenetic Analysis • Tree Programs: (http://evolution.genetics.washington.edu/phylip.html) • Tree Searches: (http://pauling.mbu.iisc.ernet.in/~pali/index.html)
PROSITE Pattern Search (http://pir.georgetown.edu/pirwww/search/patmatch.html)
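A PROSITE pattern is close enough to a regular expression that a local scan can be sketched with simple text substitutions. The sketch below assumes the common pattern elements only (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) and does not handle rarer cases such as an anchor inside square brackets. The helper names are hypothetical, and this is not how the PIR pattern-match server is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    regex = pattern.rstrip(".").replace("-", "")       # drop separators
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")                     # x = any residue
    return regex.replace("<", "^").replace(">", "$")    # terminal anchors


def scan(pattern, sequence):
    """Return (0-based start, matched text) for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(sequence)]
```

For instance, the N-glycosylation site pattern `N-{P}-[ST]-{P}` becomes `N[^P][ST][^P]`. Note that `re.finditer` skips overlapping matches, so a production scanner would need a lookahead or a manual position loop.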
(http://bmerc-www.bu.edu/bioinformatics/profile_request.html) Profile Search
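To make the idea of a profile concrete, here is a minimal sketch of an ungapped position-specific scoring table built from a small alignment and slid along a target sequence. It is an assumption-laden simplification, not the method used by the server above: it uses a uniform background frequency, a fixed pseudocount of 1, log-odds scores in bits, and no gaps. The names `build_profile` and `best_window` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudo=1.0):
    """Per-column log-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudo for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # ignore gaps and unknown symbols
                counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append({aa: math.log2((c / total) / background)
                        for aa, c in counts.items()})
    return profile


def best_window(profile, sequence):
    """Return (start, score) of the best-scoring ungapped window, or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0.0)
                    for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Unlike a pattern, which either matches or does not, a profile gives every window a graded score, so weakly conserved family members can still rank above background.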
(http://www.sanger.ac.uk/Software/Pfam/search.shtml) Hidden Markov Model Search (http://smart.embl-heidelberg.de)
III. Structural Prediction Methods • Signal Peptide (e.g. http://www.cbs.dtu.dk/services/) • Transmembrane Helix (e.g. http://www.cbs.dtu.dk/services/) • 2D Prediction (e.g. http://cubic.bioc.columbia.edu/predictprotein/, http://www.compbio.dundee.ac.uk/WWW_Servers/JPred/jpred.html) • 3D Modeling (e.g. http://guitar.rockefeller.edu/modeller/modeller.html)
Structure Prediction: A Guide (www.bmm.icnet.uk/people/rob/CCP11BBS/flowchart2.html)
Protein Prediction Server (http://www.cbs.dtu.dk/services/)
(http://www.stepc.gr/~synaptic/sigfind.html) Signal Peptide Prediction (http://www.cbs.dtu.dk/services/SignalP)
Transmembrane Helix (http://www.cbs.dtu.dk/services/TMHMM/)
(http://cmgm.stanford.edu/WWW/www_predict.html) Protein Structure Prediction (http://restools.sdsc.edu/biotools/biotools9.html)
(http://cubic.bioc.columbia.edu/predictprotein/) Structure Prediction Server (http://www.compbio.dundee.ac.uk/WWW_Servers/JPred/jpred.html)
(http://guitar.rockefeller.edu/modeller/modeller.html) 3D-Modelling (http://www.expasy.ch/swissmod/SWISS-MODEL.html)
IV. Protein Family Databases • Whole Proteins • PIR: Superfamilies and Families • COG (Clusters of Orthologous Groups) of Complete Genomes • ProtoNet: Automated Hierarchical Classification of Proteins • Protein Domains • Pfam: Alignments and HMM Models of Protein Domains • SMART: Protein Domain Families • Protein Motifs • PROSITE: Protein Patterns and Profiles • BLOCKS: Protein Sequence Motifs and Alignments • PRINTS: Protein Sequence Motifs and Signatures • Integrated Family Databases • iProClass: Superfamilies/Families, Domains, Motifs, Rich Links • InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART
(http://www.ncbi.nlm.nih.gov/COG/) Protein Clustering
Protein Domains • Pfam (http://www.sanger.ac.uk/Software/Pfam/) • SMART (http://smart.embl-heidelberg.de/smart/show_motifs.pl)
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles. (http://www.expasy.ch/prosite/) Protein Motifs
Integrated Family Classification • InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, and TIGRFAMs. (http://www.ebi.ac.uk/interpro/search.html)
V. Databases of Protein Functions • Metabolic Pathways, Enzymes, and Compounds • Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB) • KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways • LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes • EcoCyc: Encyclopedia of E. coli Genes and Metabolism • MetaCyc: Metabolic Encyclopedia (Metabolic Pathways) • WIT: Functional Curation and Metabolic Models • BRENDA: Enzyme Database • UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways • Klotho: Collection and Categorization of Biological Compounds • Cellular Regulation and Gene Networks • EpoDB: Genes Expressed during Human Erythropoiesis • BIND: Descriptions of interactions, molecular complexes and pathways • DIP: Catalogs experimentally determined interactions between proteins • RegulonDB: Escherichia coli Pathways and Regulation
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html) KEGG Metabolic & Regulatory Pathways (http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00590+874)
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/) BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
Protein-Protein Interactions: DIP (http://dip.doe-mbi.ucla.edu/)
(http://www.bind.ca/) Protein-Protein Interaction: BIND
(http://www.biocarta.com/index.asp) BioCarta Cellular Pathways
VI. Databases of Protein Structures • Protein Structure and Classification • PDB: Structure Determined by X-ray Crystallography and NMR • CATH: Hierarchical Classification of Protein Domain Structures • SCOP: Familial and Structural Protein Relationships • FSSP: Protein Fold Family Database • Protein Sequence-Structure Relationship • PIR-NRL3D: Protein Sequence-Structure Database • PIR-RESID: Protein Structure/Post-Translational Modifications • HSSP: Families and Alignments of Structurally-Conserved Regions
(http://www.rcsb.org/pdb/) PDB Structure Data
PDBsum: Summary and Analysis (http://www.biochem.ucl.ac.uk/bsm/pdbsum)
CATH: Hierarchical domain classification of protein structures (http://www.biochem.ucl.ac.uk/bsm/cath_new/) Protein Structural Classification
The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB. Protein Structural Classification (http://scop.mrc-lmb.cam.ac.uk/scop/)
Proteomic Resources • GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes; SWISS-2DPAGE (http://www.expasy.org/ch2d/) • PEP: Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences • Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes • Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes • Expression Profiling databases: GNF (http://expression.gnf.org/cgi-bin/index.cgi, human and mouse transcriptome), SMD (http://genome-www5.stanford.edu/MicroArray/SMD/, Stanford microarray data analysis), EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html, managing, storing and analyzing microarray data)
(2D-gel of human ventricle proteins) (http://gelbank.anl.gov/2dgels/index.asp) VII. 2D-Gel Image Databases (http://www-lecb.ncifcrf.gov/2dwgDB)
(http://www.ebi.ac.uk/proteome) VIII. Proteome Analysis
Human and Mouse Transcriptome Expression Profiling (http://expression.gnf.org/cgi-bin/index.cgi) (http://genome-www.stanford.edu/serum/)
Lab: • Visit selected websites and analyze some protein sequences of your own choice. • A list of the bioinformatics resources in this tutorial is available at http://pir.georgetown.edu/~huz/bioinfo_resource.html • Try some of the following sequences for analysis: 1) well-characterized proteins: PIR:A26366 (CYP17), JS0747 (Sp1) 2) less-characterized proteins: PIR:A59000 (MATER), TrEMBL:Q9QY16 (GRTH) 3) hypothetical proteins: PIR:T12515, T00338, T47130; SWISS-PROT:Q9BWT7