Analyzing Families of Sequences. MARC: Developing Bioinformatics Programs, July 13, 2009. Alex Ropelewski (ropelews@psc.edu), Hugh Nicholas (nicholas@psc.edu), Ricardo Gonzalez Mendez (ricardo.gonzalez7@upr.edu).
Bioinformatics The interdisciplinary science of using computational approaches to analyze, classify, collect, represent, and store biological data, with the goal of accelerating and enhancing the understanding of DNA, RNA, and protein sequences.
Sequence Analysis Process of applying computational methods to a biological molecule represented as a character string. The goal is to infer information about the structure, function, or evolutionary history of the sequence.
What is a Sequence? A sequence is a way to represent a protein, DNA, or RNA molecule as a character string. Phospholipase A2 - Bos taurus (Bovine). MRLLVLAALLTVGAGQAGLNSRALWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPV DDLDRCCQTHDNCYKQAKKLDSCKVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNC DRNAAICFSKVPYNKEHKNLDKKNC
Representing Proteins One-letter amino acid codes: A - Alanine; R - Arginine; N - Asparagine; D - Aspartic acid; C - Cysteine; E - Glutamic acid; Q - Glutamine; G - Glycine; H - Histidine; I - Isoleucine; L - Leucine; K - Lysine; M - Methionine; F - Phenylalanine; P - Proline; S - Serine; T - Threonine; W - Tryptophan; Y - Tyrosine; V - Valine; B - Asparagine or aspartic acid; Z - Glutamine or glutamic acid; J - Leucine or Isoleucine; X - Any Amino Acid; U - Selenocysteine; O - Pyrrolysine. [Figure: oxytocin structure with residues labeled N, Q, P, G, I, C, L, C, Y] Image from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Oxytocin.jpg
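The character-string representation above can be sketched in Python as follows. This is a minimal illustration that reuses the Phospholipase A2 sequence and the one-letter code table from the slides; the names `AMINO_ACIDS`, `PLA2_BOVINE`, `invalid_residues`, and `spell_out` are assumptions made for this example and are not part of the workshop materials.

```python
# One-letter codes from the table above: 20 standard residues plus
# the ambiguity and rare codes (B, Z, J, X, U, O).
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "E": "Glutamic acid", "Q": "Glutamine", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
    "B": "Asparagine or aspartic acid", "Z": "Glutamine or glutamic acid",
    "J": "Leucine or Isoleucine", "X": "Any amino acid",
    "U": "Selenocysteine", "O": "Pyrrolysine",
}

# Bovine phospholipase A2 as shown on the slide, with line breaks removed.
PLA2_BOVINE = (
    "MRLLVLAALLTVGAGQAGLNSRALWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPV"
    "DDLDRCCQTHDNCYKQAKKLDSCKVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNC"
    "DRNAAICFSKVPYNKEHKNLDKKNC"
)


def invalid_residues(sequence):
    """Return the set of characters that are not recognized one-letter codes."""
    return {ch for ch in sequence.upper() if ch not in AMINO_ACIDS}


def spell_out(sequence):
    """Translate a one-letter sequence into a list of amino acid names."""
    return [AMINO_ACIDS[ch] for ch in sequence.upper()]


print(len(PLA2_BOVINE), "residues")
print(spell_out(PLA2_BOVINE[:3]))
```

Checking a sequence with `invalid_residues` before analysis is one simple way to catch digits, whitespace, or other debris left over from copying a database record.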
Why study families of sequences? Members of a family share a common function and structure, and are related through evolution. [Figure: Aldehyde Dehydrogenase Family Members]
The Goal CURATED FAMILY: • All related sequences sharing a common function (Homologous Sequences) • All substantial motifs • Evolutionary history • Structural information • Experimental information
The Process [Flowchart; components: Initial Query, Sequence Libraries, Structural Libraries, Classification Libraries, Multiple Sequence Alignment, Local Patterns, Profile & PSSM, Hidden Markov Model, Evolutionary Analysis, Homology Modeling, CURATED DATASET]
The Toolkit GenBank, BLAST, ClustalW, MEME, EMBL, FASTA, T-Coffee, MAST, UniProt, Smith-Waterman, MSA, HMMER, Pfam, Needleman-Wunsch, ProbCons, Profile-ss, PDB, FigTree, PHYLIP, Notung, Python, BioPython, GeneDoc
The Project Part I: Submit three candidate families for your course project. Part II: Collect an initial set of sequences. Part III: Generate a multiple sequence alignment, identify patterns and motifs, use them to improve the quality of your alignment, and identify additional distantly related family members. Part IV: Integrate the sequence analysis results with the structure, function, and evolution of the family. Part V: Write a draft paper or research grant, and develop an oral presentation for a conference.
Part I [Figure: flowchart repeated from "The Process"]
Part I – Selecting a Query Learning Objectives: • Teach students to select an appropriate subject for experimentation. • Teach students how to use PubMed to find reviews, background information, and prior work in order to understand what is known about the subject. • Teach students how to concisely summarize and properly cite prior research.
PubMed URL: http://www.pubmed.gov/ • National Library of Medicine’s database of articles published in biomedical journals • Currently contains over 18 million citations, dating from 1948 • About 90% of records are English-language sources or have English abstracts • About 80% of the citations include the published abstract • About 5,200 journals • Some links to full-text articles at participating publishers' web sites
Data in PubMed • Title of the journal article • Names of the authors • Abstract published with the article • MeSH (Medical Subject Headings) tags • Journal source • First author affiliation • Language of the article • Publication type (review, letter, etc.)
Simple PubMed Search 1) Enter Search Term 2) Click Go 3) View Search Results
Basic PubMed Search • PubMed Feature Tabs: • Limits: Limit search to certain dates, languages, etc. • Preview: Allows viewing and selecting of search fields • History: Log of recent searches • Clipboard: Allows items to be temporarily saved • Details: Shows how PubMed ran the search • Screen callouts: search database selection; go to advanced search page; display format; click on tab for all articles; click on tab for review articles; select to sort results; select to save or email results; page through results
PubMed Boolean Logic • Salmonella AND Eggs • Salmonella OR Eggs • Salmonella NOT Eggs • Salmonella AND Eggs AND Hamburger • Salmonella AND Eggs OR Hamburger • Salmonella AND (Eggs OR Hamburger)
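The six searches above can be mimicked with Python sets to show why the parentheses matter. This is only a sketch: the article IDs are made-up data, and the unparenthesized query is modeled as evaluated left to right, which is how PubMed's help pages describe its handling of Boolean operators.

```python
# Hypothetical article IDs indexed under each search term (made-up data).
salmonella = {1, 2, 3, 4}
eggs = {3, 4, 5}
hamburger = {2, 6}

and_result = salmonella & eggs             # Salmonella AND Eggs
or_result = salmonella | eggs              # Salmonella OR Eggs
not_result = salmonella - eggs             # Salmonella NOT Eggs
all_three = salmonella & eggs & hamburger  # Salmonella AND Eggs AND Hamburger

# No parentheses, evaluated left to right: (Salmonella AND Eggs) OR Hamburger
left_to_right = (salmonella & eggs) | hamburger

# Explicit grouping: Salmonella AND (Eggs OR Hamburger)
grouped = salmonella & (eggs | hamburger)

print(sorted(left_to_right))
print(sorted(grouped))
```

In this toy data the ungrouped query also returns article 6, which mentions Hamburger but not Salmonella; the grouped query does not.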
MeSH tags Medical Subject Headings • Controlled vocabulary/keyword system • Used to help locate appropriate articles • Articles in PubMed usually have between 5 and 15 MeSH tags associated with them. • MeSH Tutorial at: http://www.nlm.nih.gov/bsd/disted/mesh
MeSH Search 1) Select MeSH 2) Enter Search Term 3) Click Go 4) Select MeSH Term 5) Select Search Box
MeSH Search Search Box Click Search PubMed
MeSH Search Click tab to see review articles Click tab to see all articles
MeSH Search Multiple MeSH Terms
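A search on multiple MeSH terms can also be written as a single query string. The sketch below builds such a string with the `[MeSH Terms]` field tag and wraps it in a request URL for NCBI's E-utilities `esearch` service. E-utilities is not mentioned in the slides and is brought in here as an assumption; the helper names `mesh_query` and `esearch_url` are hypothetical, and the code only constructs the URL without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service (not part of the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query(terms, operator="AND"):
    """Join MeSH headings into one PubMed query string."""
    tagged = ['"{}"[MeSH Terms]'.format(term) for term in terms]
    return " {} ".format(operator).join(tagged)


def esearch_url(query, retmax=20):
    """Build (but do not send) an esearch request URL for PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query = mesh_query(["Salmonella", "Eggs"])
print(query)
print(esearch_url(query))
```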
Part II [Figure: flowchart repeated from "The Process"]
Part II - Libraries Learning Objective: Be able to search major libraries of biomolecules to collect sequences of interest Understand information contained in the major sequence, structure and classification libraries Understand searching methods and their limitations Understand the effect of search parameters Be able to select appropriate methods and parameters for a variety of sequences
Part III [Figure: flowchart repeated from "The Process"]
Part III – Multiple Alignment Learning Objective: Be able to construct a biologically correct alignment for a family of sequences • Understand what makes an alignment biologically correct • Be able to construct and refine multiple sequence alignments • Be able to create abstract representations of multiple alignments and search databases with them • Be able to tie local patterns (motifs) found back to the biology of the sequences • Understand the methods used to abstract an alignment and the advantages and disadvantages of commonly used methods • Understand the effect of search parameters • Be able to select appropriate methods and parameters for a variety of sequences
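One simple way to abstract a multiple alignment is a per-column frequency profile with a consensus sequence. The sketch below uses a toy alignment of made-up sequences, and the function names are hypothetical. It is not the method of any particular tool: production profiles and PSSMs typically add background frequencies, pseudocounts, and log-odds scoring, all of which are omitted here.

```python
from collections import Counter

# Toy alignment (hypothetical sequences); '-' marks a gap, rows have equal length.
ALIGNMENT = [
    "MKV-LA",
    "MRVGLA",
    "MKIGLS",
    "MKVGIA",
]


def column_counts(alignment):
    """Count the residues in each alignment column, ignoring gaps."""
    return [Counter(ch for ch in column if ch != "-") for column in zip(*alignment)]


def consensus(alignment):
    """Most common residue per column ('-' if a column contains only gaps)."""
    result = []
    for counts in column_counts(alignment):
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


def frequency_profile(alignment):
    """Relative residue frequencies per column: the simplest kind of profile."""
    profile = []
    for counts in column_counts(alignment):
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


print(consensus(ALIGNMENT))
print(frequency_profile(ALIGNMENT)[1])
```

Columns with a single dominant residue are candidates for conserved, possibly functional positions, while mixed columns show where the family tolerates substitution.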
Similarity of Amino Acids Valine – Val – V Leucine – Leu – L Isoleucine – Ile – I
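Valine, leucine, and isoleucine all have aliphatic, hydrophobic side chains, which is why swapping one for another is usually treated as a conservative change. A rough sketch of that idea follows. The grouping is a simplified assumption made for illustration, and the names `GROUPS`, `group_of`, and `is_conservative` are hypothetical; alignment programs generally use substitution matrices such as BLOSUM or PAM rather than hard groups.

```python
# Simplified, hypothetical grouping of the 20 standard residues by side-chain
# chemistry (illustration only; real tools use substitution matrices).
GROUPS = {
    "aliphatic": set("AVLIM"),
    "aromatic": set("FWY"),
    "polar": set("STNQ"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "special": set("CGP"),
}


def group_of(residue):
    """Return the group name for a one-letter code, or None if it is not listed."""
    for name, members in GROUPS.items():
        if residue.upper() in members:
            return name
    return None


def is_conservative(a, b):
    """True when both residues fall in the same group."""
    group_a, group_b = group_of(a), group_of(b)
    return group_a is not None and group_a == group_b


print(is_conservative("V", "L"), is_conservative("L", "I"), is_conservative("V", "D"))
```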
Understanding Motifs Functional Residues
Substrate Binding NAD Binding
Part IV [Figure: flowchart repeated from "The Process"]
Part IV – Structure and Phylogeny Learning Objective: Understand and integrate the sequence analysis results with the structure and function of the protein family • Understand the evolutionary patterns of genes and species • Integrate evolutionary information with structural information to understand how the function has evolved within the protein family • Predict or design experiments to be carried out in vitro: design drugs; mutate the proteins; mutate regulatory areas within the genome to change expression
Integrating Alignment, Motifs & Structure Active Site
Integrating Alignment, Motifs & Structure Conserved Asn Binds Substrate
Integrating Alignment, Motifs & Structure Catalytic Thiol (Cys)
Part V – Prepare Work for Publication Learning Objectives: • Teach students to concisely summarize and properly cite relevant prior research • Teach students to concisely summarize their own research • Teach students to revise papers based on reviewers' comments • Teach students how to write a research grant • Teach students how to prepare and give an oral research presentation
Workshop Projects Biologists: You will work through the same five-step project that your students will work through during your class. By the time you leave here, you should have a good start on a research publication or grant, or have ideas for in vitro experiments.
Workshop Projects Computer Scientists: Take your favorite string matching algorithm and apply it to biological sequence data. Compare your algorithm's performance with some of the algorithms discussed in this workshop in terms of speed, selectivity, or sensitivity. Feel free to use a parallel algorithm.
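As one possible baseline for such a comparison, the sketch below computes a Smith-Waterman local alignment score, one of the algorithms named in the toolkit. The scoring values (match +2, mismatch -1, linear gap -1) are arbitrary assumptions chosen for illustration; protein searches normally use a substitution matrix and affine gap penalties, and this version returns only the score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    best = 0
    previous = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

Keeping only two rows of the table holds memory use proportional to the length of `b`. Wrapping calls in the standard `timeit` module is one way to compare its running time against your own matcher on the same inputs.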