Computational Virology
Lectures in Bioinformatic Studies on the Evolution, Structure and Function of RNA-based Life Forms
Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman MT
mars@parvati.msu.montana.edu
McClure Lab of Bioinformatics
Living on the edge of statistics!
What is Bioinformatics? Bioinformatics is the creation of new knowledge from existing data. This type of research takes place in silico and includes the development and testing of the software tools necessary to analyze the data. McClure, 2000
The Practice of Bioinformatics is an interplay between knowledge of empirically derived data, bioinformatic tools and human decision making. Exactly which information and tools are to be accessed is dependent on the nature of the question of interest. McClure, 2000
Recent and Current Projects
1) Potential Multiple Endonuclease Functions and a Ribonuclease H Encoded in Retroposon Genomes
2) Hypothesis: The Reverse Transcriptase Domain Shares Common Ancestry with the RNA-dependent RNA polymerase of both positive- and negative-stranded RNA viruses: a Test of Protein Motif Finding Methods
3) A Functional Genomics Challenge: the Transcription/Replication Complex of the Order Mononegavirales
Rabies, measles and Ebola viruses, which belong to the order Mononegavirales, are true RNA-based life forms that have no DNA stage. To date, little has been learned about the distribution of functions within the replication/transcription complex (three proteins and the RNA template) or about its actual structure. The goal is to elucidate potential regions and residues of protein:protein interaction in the replication/transcription complex without structural information. The studies will proceed along three paths: prediction of disorder; determination of compensatory mutations; and assessment of evolutionary dynamics. Correlating the results of these methods will provide high-probability candidates for the protein:protein contacts.
Recent and Current Projects, cont.
4) Mapping of All Genomic Retroid Agents: Prototype Human Genome
The Retroid Agents (e.g., HIV, Hepatitis B, retrotransposons) encode the reverse transcriptase, thereby providing the interface for the transfer of genetic information from RNA-based to DNA-based replication systems. The goal of this project is to identify, classify and map all Retroid Agents of a specific genome. The Genome Parsing Suite is the prototype software that not only identifies and classifies these agents, but also determines Retroid genome boundaries, architecture and gene complement, and assesses the host environment of each agent. These data are then used to create a browsable database that will be available for display through the UCSC Genome Browser. This database is necessary for testing hypotheses about the roles that Retroid Agents play in the reproduction, development, evolution and disease processes of Eukaryotes, including humans.
Summary Lecture I
1) Introduction to RNA-based life forms
2) Methods to test the hypothesis
3) Testing the hypothesis
4) Predicting protein contacts
The World of Viruses
[Diagram labels: DNA viruses (ssDNA, dsDNA); RNA viruses (ssRNA, dsRNA, + ssRNA, - ssRNA); polymerases: RdDp, RdRp, host Pol II]
Does the RT domain of the RdDp share common ancestry with the RdRp of negative- and positive-polarity single-stranded RNA viruses?
Rhabdoviridae, Paramyxoviridae, Filoviridae, Retroviridae, Picornaviridae
Retroid Agents
Retroid Agents (retroviruses, retrotransposons, pararetroviruses, retroposons, retroplasmids, retrointrons, and retrons): reverse transcriptase mediated replication or transposition (RNA to DNA)
RNA viruses (e.g., Ebola, rabies, influenza, polio): replication by RNA-dependent RNA polymerase
All cellular systems and most DNA viruses: replication by DNA-dependent DNA polymerase
DNA to RNA: transcription
RNA to protein: translation; snRNAs, ribozymes, tRNA and rRNA take part in PROTEIN SYNTHESIS
McClure, 2000
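The information flows on this slide can be read as template-copying rules: each polymerase class is defined by the nucleic acid it reads and the one it writes. The following is a minimal sketch of that idea only. The function name `copy_template`, the abbreviations `DdDp` and `DdRp`, and the example sequences are assumptions introduced for illustration and do not come from the lecture.

```python
# Watson-Crick complements, chosen by the alphabet of the product strand.
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}
RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A", "U": "A"}

# Polymerase classes as (template type, product type).
# The short names DdDp and DdRp are hypothetical labels used only here.
POLYMERASES = {
    "DdDp": ("DNA", "DNA"),  # DNA-dependent DNA polymerase (cellular replication)
    "DdRp": ("DNA", "RNA"),  # transcription, e.g. host Pol II
    "RdRp": ("RNA", "RNA"),  # RNA-dependent RNA polymerase (RNA viruses)
    "RdDp": ("RNA", "DNA"),  # reverse transcriptase (Retroid Agents)
}


def copy_template(template: str, polymerase: str) -> str:
    """Return the complementary strand, written 5'->3', made by `polymerase`."""
    template_type, product_type = POLYMERASES[polymerase]
    template = template.upper()
    # A DNA template must not contain U; an RNA template must not contain T.
    forbidden = "U" if template_type == "DNA" else "T"
    if forbidden in template:
        raise ValueError(f"{polymerase} expects a {template_type} template")
    table = DNA_COMPLEMENT if product_type == "DNA" else RNA_COMPLEMENT
    # Reverse the template so the product reads 5'->3'.
    return "".join(table[base] for base in reversed(template))


if __name__ == "__main__":
    rna = "AUGC"  # toy RNA template
    print(copy_template(rna, "RdRp"))  # complementary RNA strand
    print(copy_template(rna, "RdDp"))  # complementary DNA strand (cDNA)
```

The sketch deliberately ignores priming, strand displacement and fidelity; it only encodes which template and product types each class pairs.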
Mononegavirales
“OLD” FOES: rabies (Rhabdoviridae); measles, RSV, mumps (Paramyxoviridae)
“EMERGING” THREATS: Ebola, Marburg (Filoviridae); equine morbillivirus, Nipah virus (Paramyxoviridae)
MODEL AGENT: vesicular stomatitis virus (Rhabdoviridae)
Roles of Retroid Agents:
1) Disease:
   a) retroviruses:
      1) exogenous infectious: HIV, HTLV
      2) endogenous associations: breast cancer, testicular tumors, insulin-dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
   b) LINEs insertional mutagenesis:
      1) Hemophilia A
      2) muscular dystrophies: Duchenne and Fukuyama-type congenital
      3) X-linked disorders: Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms
Plus-strand RNA Virus Families and Human Diseases
Togaviridae: Rift Valley Fever
Flaviviridae: Dengue Fever virus, West Nile virus
Coronaviridae: Infectious Bronchitis
Caliciviridae: Hepatitis E virus
Picornaviridae: Human poliovirus, Hepatitis A
Rhabdoviridae Genome: N, P, M, G, RdRp
Paramyxoviridae Genome: N, P/C/V, M, F, HN, RdRp
Filoviridae Genome: N, VP35, VP40, G, VP30, VP24, RdRp
MMLV Genome: 5’LTR, GAG, RdDp (PRO, RT/RH, INT), ENV, 3’LTR
Picornaviridae Genome: VPg, L, P4, P2, P3, P1, 2A, 2B, 2C, 3A, 3B, 3C, 3D, RdRp, Poly(A)
VSV Transcription and VSV Replication
[Figure: panels "VSV Transcription" and "VSV Replication". Labels: leader, N, L, P; 3' and 5' ends; read through; CO-ASSEMBLY of N with P (marked "?").]
Model of a poliovirus polymerase-dsRNA complex
[Figure panels: HIV-1 Reverse Transcriptase; Poliovirus Polymerase; Poliovirus Polymerase Oligomer]
Model of a poliovirus polymerase-dsRNA complex based on the structure of HIV-1 RT complexed to dsDNA (Huang et al., 1998).
Basic Strategy
1) Search Databases
2) Annotation and Preparation of Sequences
3) Multiple Alignment of Sequences
4) Refined Multiple Alignment
5) Analysis of Multiple Alignment
McClure, 2000
Biological Patterns • “Whether randomness can be measured is a difficult problem. One cannot judge the absence of pattern without specifying which pattern, and what is a pattern to you may not be a pattern to me.” McClure, 2000
What is an ordered series of motifs (OSM)?
An OSM, which may span hundreds of residues, is defined as a set of conserved or semi-conserved motifs (1-9 contiguous amino acid residues) found in the same arrangement relative to one another in all sequences of a protein family. The amino acids of these motifs are involved in catalysis or structural integrity. The spacing between motifs, the motif intervening regions (MIRs), can be highly variable, reflecting the regions of a protein that are less restricted by functional or structural constraints. MIRs may evolve more rapidly than the OSM and be more subject to insertion/deletion events and duplications.
Why is OSM identification important?
The OSM of a protein family can be used to predict function. The identification of an OSM common among protein sequences with as little as 8% amino acid identity has led to successful prediction of function. If a multiple alignment method (be it global or local) cannot correctly identify the highly conserved residues of a given sequence that are critical for function and structure, then it is of little value.
McClure, 2002
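The defining property of an OSM, fixed order with free spacing, can be illustrated with a small search routine. This is only a sketch under stated assumptions: the motif patterns and toy sequences are invented for illustration, the helper names `find_osm` and `mir_lengths` are hypothetical, and a simple regular-expression scan stands in for the motif-detection programs evaluated in the lecture.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def find_osm(sequence: str, motifs: List[str]) -> Optional[List[Span]]:
    """Locate each motif in the given order; return (start, end) spans or None.

    Each motif is a regular expression. The search for a motif begins where
    the previous motif ended, so order is enforced while the spacing between
    motifs (the MIR length) is free to vary. The first occurrence of each
    motif is taken, which is adequate for fixed-width patterns like these.
    """
    spans: List[Span] = []
    position = 0
    for motif in motifs:
        match = re.compile(motif).search(sequence, position)
        if match is None:
            return None  # motif missing, or present only out of order
        spans.append((match.start(), match.end()))
        position = match.end()
    return spans


def mir_lengths(spans: List[Span]) -> List[int]:
    """Lengths of the motif intervening regions between consecutive motifs."""
    return [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]


if __name__ == "__main__":
    # Hypothetical two-motif OSM and toy sequences (not real proteins).
    toy_osm = [r"D..[FY]", r"GDD"]
    toy_sequences = {
        "toy1": "MKTAYDLLFPAAGDDWQ",
        "toy2": "MDAAYGGSSGSGSGDDKK",
        "toy3": "MGDDAAAADKKY",  # both motifs present, but in the wrong order
    }
    for name, seq in toy_sequences.items():
        spans = find_osm(seq, toy_osm)
        print(name, spans, mir_lengths(spans) if spans else "no OSM")
```

In the toy data the first two sequences share the OSM despite different MIR lengths, while the third contains both motifs in the wrong arrangement and is rejected.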
Levels of Sequence Comparisons McClure, 2000
Example of local subsequences or OSM McClure, 2000
RdRp of Plus strand viruses: GDD
RdRp of Mononegavirales: GDNQ
RdDp (RT, RH): FADDM
HYPOTHESIS: The Reverse Transcriptase domain of the RNA-dependent DNA Polymerase shares common ancestry with the RNA-dependent RNA Polymerase of the Order Mononegavirales and Plus Strand RNA viruses.
Strategy for Assessing Protein Sequence Homology
Protein Sequence Data
SEQUENCE COMPARISON
   >30% identical = homology (support for homology: statistical tests)
   <30% identical: proceed to MOTIF DETECTION
MOTIF DETECTION
   OSM absent = unlikely homologue
   OSM present = functionally equivalent = likely homologue (support for homology: gene order and size, common function)
Functional identification, Phylogenetic analysis, Structural prediction
McClure, 2000
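The decision flow on this slide can be written out as a short sketch. Several choices here are assumptions rather than part of the lecture: percent identity is computed over alignment columns where neither sequence has a gap, a value of exactly 30% is sent on to motif detection (the slide leaves that case open), and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over columns where neither sequence is gapped.

    Assumption: other denominators (alignment length, shorter sequence) are
    also in common use and would give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b)
                if a != gap and b != gap]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


def assess_homology(identity: float, osm_present: bool,
                    threshold: float = 30.0) -> str:
    """Apply the slide's two-stage strategy to one pair of sequences."""
    if identity > threshold:
        # Sequence comparison alone is taken as sufficient.
        return "homology"
    # At or below the threshold, fall back to motif detection.
    if osm_present:
        return "likely homologue"
    return "unlikely homologue"


if __name__ == "__main__":
    identity = percent_identity("ACDE-F", "ACDQGF")  # toy aligned pair
    print(identity, assess_homology(identity, osm_present=False))
    print(assess_homology(12.0, osm_present=True))
    print(assess_homology(12.0, osm_present=False))
```

The labels returned are the slide's own terms; the supporting evidence it lists (statistical tests, gene order and size, common function) is not modeled.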
Comparison of HMM/SAM to Classical Multiple Alignment Methods McClure, 2000
Experimental Design for Testing Motif Detection Methods
Methods: Appropriateness; Availability; Assumptions; Limitations; User specific parameters
Bench Mark Sequences: Biologically informative markers; Sequence length distribution; Evolutionary distribution; Set size
Parameter Range Tests
Types of Test Data
Evaluate Results for Correct Identification of Biologically Informative Marker
Method(s) that Accurately Identify Biologically Informative Marker
RdRp and RdDp sequences
Test hypothesis: RdRp share common ancestry with RdDp
Search Databases: Sequence, Literature, Structural, Other??
Data: Retrieve, Annotate, Manage
Determine Methodological Limitations
Analyze Data: Multiple Alignment of Sequences; OSM/MIR Determination; 2D and 3D Modeling; Phylogenetic Reconstruction; Gene and Genome Architecture
McClure, 2001
Motif-detection Programs
BLOCKMAKER, MATCHBOX, MEME, PIMA, PRALIGN, SAM
Motif Detection Programs McClure, 2002
Sequence Length, Percent Identity and Distance Values McClure 2002
Summary of Small Data Set Analysis
Program      GLOB(12)  KIN(12)  PRO(12)  RT(20)  RH(12)  AVG
BLOCKMAKER   80        63       53       31      31      52
INTERALIGN   98        94       22       49      23      57
MATCHBOX     38        85       61       67      37      58
MEME         90        96       67       93      73      84
PIMA         98        99       55       71      87      82
PROBE        93        95       81       94      83      89
Scores are reported as the percentage of sequences in which motifs were correctly identified. Values in parentheses are the number of sequences in each data set.
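The AVG column is consistent with an unweighted mean of the five data set scores, rounded to the nearest integer. The sketch below checks that reading against the values reported on the slide; the variable and function names are assumptions introduced for illustration.

```python
# Scores per program in the order GLOB, KIN, PRO, RT, RH (from the slide).
small_set_scores = {
    "BLOCKMAKER": [80, 63, 53, 31, 31],
    "INTERALIGN": [98, 94, 22, 49, 23],
    "MATCHBOX":   [38, 85, 61, 67, 37],
    "MEME":       [90, 96, 67, 93, 73],
    "PIMA":       [98, 99, 55, 71, 87],
    "PROBE":      [93, 95, 81, 94, 83],
}

# AVG column as reported on the slide.
reported_avg = {
    "BLOCKMAKER": 52, "INTERALIGN": 57, "MATCHBOX": 58,
    "MEME": 84, "PIMA": 82, "PROBE": 89,
}


def unweighted_mean(values):
    """Plain arithmetic mean; data set sizes are not used as weights."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for program, values in small_set_scores.items():
        mean = unweighted_mean(values)
        agrees = round(mean) == reported_avg[program]
        print(f"{program:<11} mean={mean:5.1f} reported={reported_avg[program]} "
              f"agrees={agrees}")
```

Because the RT set holds 20 sequences and the others 12, a mean weighted by data set size would differ slightly from the reported column.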
Summary of Large Data Set Analysis
Program          GLOB(174)  KIN(186)  PRO(114)  RT(178)  RH(169)  AVG
PIMA (subset 1)  43         46        69        47       43       50
PIMA (subset 2)  12         35        19        16       22       21
MEME             85         97        87        84       76       86
PROBE            98         98        91        85       93       93
Two sets of scores are reported for the results of testing the PIMA method. In each case this method finds two subsets of alignments with the OSM correctly identified, but fails to merge these two into a final multiple alignment. Scores are reported as percentages of sequences in which the OSM is correctly identified. Values in parentheses are the number of sequences in each data set.
RdRp of Plus strand viruses: GDD
RdRp of Mononegavirales: GDNQ
RdDp (RT, RH): FADDM
HYPOTHESIS: The Reverse Transcriptase domain of the RNA-dependent DNA Polymerase shares common ancestry with the RNA-dependent RNA Polymerase of the Order Mononegavirales and Plus Strand RNA viruses.
New work
A Functional Genomics Approach to Inferring Amino Acid Contacts Among the L, P and N proteins of the Replication/Transcription Complex of the Order Mononegavirales
1) Protein disorder
   • Low hydrophobicity and high mean net charge are good indicators of natively unfolded proteins
   • Predictor of Naturally Disordered Regions (PONDR): utilizes neural networks to distinguish disordered from ordered regions
2) Evolutionary Dynamic Approaches
   A) Intermolecular compensatory mutations (Pazos and Valencia)
      1) predicting interacting partners
      2) detecting correlated mutations between two interacting proteins
      3) extending to three interacting partners
   B) Evolutionary-Structure Function (ESF) (Simon and Sidow): determines numbers of amino acid replacements given a fixed phylogenetic topology, ranking constrained regions
   C) Intramolecular compensatory mutations (Pollock): calculates likelihood estimates allowing for rate variation and robustly discriminates coevolution of intramolecular sites from random effects
3) Use experimental results to model and validate expectations
4) Test the predicted structure for the Ebola
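The first disorder indicator, low hydrophobicity combined with high mean net charge, can be computed directly from sequence. The following is a minimal sketch, not the lab's procedure. Its assumptions: hydropathy uses the Kyte-Doolittle scale rescaled to 0-1, net charge counts only K, R, D and E, and the boundary line (slope 2.785, intercept -1.151) is taken from the charge-hydropathy plot literature (Uversky et al., 2000) rather than from these slides, so it should be checked before any real use. Function names and test sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy, rescaled so that -4.5 -> 0 and 4.5 -> 1."""
    if not sequence:
        raise ValueError("empty sequence")
    # Non-standard residue codes raise KeyError on purpose.
    scaled = [(KYTE_DOOLITTLE[res] + 4.5) / 9.0 for res in sequence.upper()]
    return sum(scaled) / len(scaled)


def mean_net_charge(sequence: str) -> float:
    """Absolute net charge per residue, counting K/R as +1 and D/E as -1.

    Assumption: histidine is treated as neutral.
    """
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return abs(positive - negative) / len(seq)


def looks_natively_unfolded(sequence: str,
                            slope: float = 2.785,
                            intercept: float = -1.151) -> bool:
    """True when mean net charge lies above the assumed boundary line."""
    boundary = slope * mean_hydropathy(sequence) + intercept
    return mean_net_charge(sequence) > boundary


if __name__ == "__main__":
    for toy in ("DEDEDEDEDE", "ILVILVILVI"):  # toy sequences, not real proteins
        print(toy, round(mean_hydropathy(toy), 3), mean_net_charge(toy),
              looks_natively_unfolded(toy))
```

This whole-sequence test is coarse by design; per-residue predictors such as PONDR address the separate question of which regions are disordered.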
VSV Transcription/Replication
[Figure: panels "VSV Transcription" and "VSV Replication". Labels: leader, N, L, P; 3' and 5' ends; read through; CO-ASSEMBLY of N with P (marked "?").]
Rhabdoviridae Genome (VSV): N, P, M, G, RdRp
Paramyxoviridae Genome (Sendai): N, P/C/V, M, F, HN, RdRp
N, P and L Proteins required for replication
[Figure: domain maps of the N, P and L proteins of Sendai virus and VSV.
N protein: Sendai residues 1-524, VSV residues 1-422; labeled sites: RNA-BS, PPBS, PCS.
P protein: Sendai residues 1-568, VSV residues 1-265; labeled sites: NPBS, LPBS, oligomerization domain, RES, RSR, GTP binding.
L protein: Sendai residues 1-2228, VSV residues 1-2109; domains I-VI; labeled sites: PPBS, RSR, MT, RNA-BS.]
Update Mononegavirales Sequence and Literature Database
Annotated N, P, L protein maps with ALL information regarding positions of experimentally determined functions and interactions
N, P and L sequences
Multiple Alignment
Predict regions of disorder: Calculate H/R; PONDR
Evolutionary Dynamics Analysis: Inter-CM analysis; Phylogenetic reconstruction; ESF-analysis; Intra-CM analysis
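The Inter-CM step in this workflow looks for alignment columns in two proteins that change together across species. As a rough illustration only, the sketch below scores column pairs by mutual information, a simpler covariation measure swapped in here; it is not the Pazos and Valencia method named in the lecture. The toy alignments, the assumption that row i in both alignments comes from the same species, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def column(alignment: Sequence[str], index: int) -> List[str]:
    """Residues found at one alignment position, one per sequence."""
    return [seq[index] for seq in alignment]


def mutual_information(col_a: Sequence[str], col_b: Sequence[str]) -> float:
    """Mutual information (bits) between two columns paired row by row."""
    n = len(col_a)
    if n == 0 or n != len(col_b):
        raise ValueError("columns must be non-empty and of equal length")
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def rank_column_pairs(aln_a: Sequence[str],
                      aln_b: Sequence[str]) -> List[Tuple[float, int, int]]:
    """All (score, column in A, column in B) triples, highest score first."""
    scored = []
    for i in range(len(aln_a[0])):
        for j in range(len(aln_b[0])):
            score = mutual_information(column(aln_a, i), column(aln_b, j))
            scored.append((score, i, j))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy alignments of two proteins; row i is the same species in both.
    toy_p = ["AKL", "AEL", "AKL", "AEL"]
    toy_n = ["GDV", "GKV", "GDI", "GKI"]
    for score, i, j in rank_column_pairs(toy_p, toy_n)[:3]:
        print(f"P column {i} vs N column {j}: {score:.2f} bits")
```

Raw mutual information is inflated by shared phylogeny and small samples, which is one reason methods that account for the tree or for rate variation are used in practice.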
The McClure Lab
Dr. Marcella McClure, P.I. (Marcie)
Dr. Ruth Angeletti Hogue, Adjunct Professor (visiting from Albert Einstein School of Medicine)
Dustin Lee, M.S., Bioinformatics Programmer
Brad Crowther, B.S., Bioinformatician I/Lab Manager
Aaron Juntunen, Undergraduate programmer
Kelly Burningham, Undergraduate