Bioinformatics • Outline of Presentation • Overview of Bioinformatics • Introduce methods to access Human Genome Project (HGP) data
Goals of HGP • Sequence Human Genome • Sequence Genomes from other Species • Prokaryotes: Eubacteria & Archaebacteria • Eukaryotes: Yeast-Nematode-Fruit Fly-Plant-Humans
Bioinformatics = Computational Genomics • New discipline to utilize information from HGP • Name of the discipline is not yet settled • Represents the third major shift in Biology • 1st: Systematics; indexing and naming species • 2nd: Molecular Biology (1953); understanding DNA/RNA/Protein
-omics is the “new suffix” • Genomics • Proteomics • Pharmacogenomics • Subpopulations that benefit from a drug • Toxicogenomics • Subpopulations that have adverse reactions to a drug • Vaccinomics • Derive epitopes of virulent antigens from MHC Ag processing
& the Latest • A new speciality: ecogenomics • “Horizontal gene transfer is not a fundamental force in microbial evolution, but it is a fundamental force in the evolution of particular loci” • John Paul, marine ecologist • University of South Florida.
Goal of Bioinformatics • Interpret the language (grammar) of DNA • DNA to RNA to Protein • Similarity, Alignment, & Homology • Predict Protein Structure/Function from DNA • Homology Structures of Proteins (tree of life-evolution)
Strategies • Search for DNA matches • 4 bases (ATCG); code is degenerate • multiple codons per amino acid • Search for Protein (amino acids) • 20 amino acids; less flexibility • Compare Proteins or DNA to each other • Start with DNA or Proteins, then convert
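The degeneracy mentioned above can be illustrated with a minimal sketch that counts how many codons encode each amino acid. It assumes the standard genetic code (NCBI translation table 1); the function names `codons_per_amino_acid` and `codons_for` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons ordered
# using bases T, C, A, G at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_amino_acid():
    """Count how many codons encode each amino acid (or stop signal)."""
    return Counter(CODON_TABLE.values())


def codons_for(amino_acid):
    """List every codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


if __name__ == "__main__":
    counts = codons_per_amino_acid()
    # Leucine has several codons; methionine and tryptophan have one each.
    for aa in "LMW":
        print(aa, counts[aa], codons_for(aa))
```

Because many amino acids map back to several codons, a protein-level search tolerates silent DNA changes that a base-by-base DNA search would count as mismatches.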
Homology Studies • Proteins that share: • significant sequences • functional groups (families-Prosite database) • Have Common Ancestor • Only part of protein important • Signature, Families, Motifs, Conserved Sequence
Pearson, WR., Protein sequence comparison and Protein evolution. 11/99
Major Databases • NCBI: National Center for Biotechnology Information • BLAST, OMIM, PUBMED, TAXONOMY, STRUCTURES • PDB: Protein Data Bank at Brookhaven National Laboratory • 3-d visualization of protein structures • ExPASy: Swiss Institute of Bioinformatics • Protein and Enzyme function and homology • KEGG: Kyoto Encyclopedia of Genes and Genomes • Metabolic maps and functions of enzymes • TIGR: The Institute for Genomic Research Microbial Database • Microbial gene maps
National Center for Biotechnology Information (NCBI) • www.ncbi.nlm.nih.gov • BLAST (Basic Local Alignment Search Tool); sequence similarities • Entrez; Nucleotide or protein retrieval • OMIM (Online Mendelian Inheritance in Man); information on genetic disorders • Mutations, DNA, Protein, & other links • PubMed; 9 million citations in MEDLINE • Taxonomy; names of all organisms that have at least 1 reported sequence • Structures; 3-d structures
BLAST Menu • 5 types of programs; 3 DNA/RNA queries, 2 amino acid queries • Databases-20 (general or specialized) • Filtered; leave checked (removes Alu sequences or highly repeated DNA) • FASTA format; 1st line starts with > followed by descriptive text, following lines hold the bases (or 1-letter amino acid codes), no commas or periods, <80 characters per line • Web or e-mail queries • Cruncher and Muncher: names of the computers
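The FASTA rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full parser: the function names `parse_fasta` and `format_fasta` and the default line width of 70 are hypothetical choices made for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(description, sequence, width=70):
    """Write one record with sequence lines shorter than 80 characters."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    record = format_fasta("example_sequence hypothetical record", "ATGC" * 40)
    print(record)
    print(parse_fasta(record))
```

Wrapping on output and joining on input keeps the sequence itself unchanged, so a record survives a round trip through both functions.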
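Nucleotide Sequence Queries • BLASTN: nucleotide query against nucleotide database • BLASTX: 6-frame translation of nucleotide query against protein database • TBLASTX: 6-frame translation of nucleotide query against 6-frame translation of nucleotide database • uses a lot of CPU time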
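The 6-frame translation used by the translated searches can be sketched as follows. This is an illustration only, assuming the standard genetic code (NCBI translation table 1); it is not how BLAST itself is implemented, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translation(dna):
    """Translate three forward and three reverse-complement reading frames."""
    frames = {}
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCATTGTAATGGGCC").items()):
        print(name, protein)
```

Each nucleotide sequence yields six candidate proteins, which is one reason the translated searches cost more CPU time than a direct comparison.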
FASTA Format for Nucleic Acids • A = adenosine • C = cytidine • G = guanosine • T = thymidine • U = uridine • R = G or A (purine) • Y = T or C (pyrimidine) • K = G or T (keto) • M = A or C (amino) • S = G or C (strong) • W = A or T (weak) • B = G, T, or C • D = G, A, or T • H = A, C, or T • V = G, C, or A • N = A, G, C, or T • - = gap of indeterminate length
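One way to put the ambiguity codes above to work is to expand them into a regular expression and search a sequence with it. The sketch below is an assumption-laden illustration: the names `IUPAC`, `iupac_to_regex`, and `find_matches` are hypothetical, U is treated as T so RNA-style input matches DNA text, and the gap symbol is not handled.

```python
import re

# Ambiguity codes from the table above, each mapped to the bases it allows.
# Assumption for this sketch: U is treated as T.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(pattern):
    """Expand a pattern written in IUPAC codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_matches(pattern, dna):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets matches overlap each other.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(dna.upper().replace("U", "T"))]


if __name__ == "__main__":
    print(iupac_to_regex("GANTC"))
    print(find_matches("GANTC", "GGAATCGACTC"))
```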
Protein Sequence Queries • BLASTP: protein query against protein database • TBLASTN: protein query against 6-frame translation of nucleotide database • requires extensive CPU time • Goal: similarity, alignment, homology
FASTA Format for Proteins • A = alanine • B = aspartate or asparagine • C = cysteine • D = aspartate • E = glutamate • F = phenylalanine • G = glycine • H = histidine • I = isoleucine • K = lysine • L = leucine • M = methionine • N = asparagine • P = proline • Q = glutamine • R = arginine • S = serine • T = threonine • U = selenocysteine • V = valine • W = tryptophan • Y = tyrosine • Z = glutamate or glutamine • X = any • * = translation stop • - = gap of indeterminate length
BLAST Databases • 20 different databases to search • NR (non-redundant), MITO (mitochondria), MONTH (new records), YEAST, E. coli, ALU, VECTOR, etc. • NR: Nucleotide or Protein -non-redundant from all known data sources • default on BLAST
BLAST Output • Graphical overlay of matched sequences • Genome sequence link (gb, emb, dbj) • Name of sequence • Score (computational score of hits) • Statistical Probability (p <0.05 for significance) • See problems for example of output
BLAST Output, Part 1 of 5
Advanced Subjects • Substitution Matrices; allow substitutions of amino acids • Masking; remove low complexity sequences • Filters; remove highly repetitive sequences • Types of searches • DNA to DNA • DNA to Protein (6 frame translation) • Protein to Protein • Homology studies
Substitution Matrices • For alignment of 2 proteins • A score is given to each amino acid substitution in the comparison • Scores calculated from substitutions observed in alignments of related proteins • Examples of Scores • Chemically similar amino acids • Glutamic Acid vs. Aspartic Acid (acid to acid) • Leucine vs. Isoleucine (hydrophobic to hydrophobic) • Large positive value (+4) • Small to large • Glycine to Tryptophan (small neutral to large hydrophobic) • Large negative value (-4) • Some substitutions will affect structure and would be detrimental
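How a substitution matrix turns an alignment into a score can be sketched with a toy table. The +4 and -4 values are the illustrative scores from the slide; the identity score of +5 and the default mismatch score of -1 are assumptions added so the example runs, and none of these numbers are taken from BLOSUM62 or any PAM matrix.

```python
# Toy substitution scores (illustrative only, not a real matrix).
TOY_SCORES = {
    frozenset("ED"): 4,   # glutamic acid vs. aspartic acid (acid to acid)
    frozenset("LI"): 4,   # leucine vs. isoleucine (hydrophobic to hydrophobic)
    frozenset("GW"): -4,  # glycine vs. tryptophan (small to large)
}
MATCH_SCORE = 5        # assumption: score for identical residues
DEFAULT_MISMATCH = -1  # assumption: score for pairs not listed above


def pair_score(a, b):
    """Score one aligned pair of residues; the order of the pair is irrelevant."""
    if a == b:
        return MATCH_SCORE
    return TOY_SCORES.get(frozenset((a, b)), DEFAULT_MISMATCH)


def alignment_score(seq1, seq2):
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1.upper(), seq2.upper()))


if __name__ == "__main__":
    # M/M identity, E/D and L/I conservative changes, G/W disruptive change.
    print(alignment_score("MELG", "MDIW"))
```

A real search would load a published matrix and also handle gaps; the point here is only that conservative substitutions add to the score while disruptive ones subtract from it.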
Substitution Matrices; cont’d • BLOSUM62; default on BLAST Programs • PAM40 and other choices available • Low PAM number; detecting very strong but localized sequence similarities • High PAM number; detecting long but weak alignments between distantly related sequences • Use the higher PAM number for more distant relationships (BLOSUM numbering runs the other way: lower numbers suit more distant relationships)
Masking • Low complexity regions represent locally biased amino acid composition • Sequences deviate from the random model used for statistical significance • Low complexity matches are statistically significant but not biologically important • 25% or more of residues in protein sequence databases
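The idea of masking locally biased composition can be sketched with a sliding-window Shannon entropy test. This is a simplified stand-in and not the SEG or XNU algorithm that BLAST uses; the window size of 12, the entropy threshold of 2.2 bits, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(protein, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    protein = protein.upper()
    masked = list(protein)
    for start in range(len(protein) - window + 1):
        segment = protein[start:start + window]
        if shannon_entropy(segment) <= threshold:
            # Mask the whole window; overlapping windows extend the masked run.
            masked[start:start + window] = "X" * window
    return "".join(masked)


if __name__ == "__main__":
    # A diverse stretch followed by a poly-glutamine run.
    print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 12))
```

Masked residues are replaced with X so the biased stretch cannot generate high-scoring but uninformative hits.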
Filtering • Reduce number of matches due to highly repetitive sequences • Low Complexity sequences • Poly A, ALU sequences (~10% of genome is ALU) • Don’t want these sequences reported; all report will be • Default in BLAST (SEG & XNU)
3-D Visualizations of Proteins • Software allows manipulation of structures; all freeware • plug-ins for Browsers (Netscape and Explorer, need most current versions) • Backbone, wireframe, spacefilling, stereo, rotation, zoom • CHIME; best, tutorials written for biochemistry, organic chemistry, & inorganic chemistry • http://www.mdli.com/download/ • Cn3D; published by NIH-NCBI; allows overlay of proteins • http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
Reverb-DNA Binding Complex (1aby.pdb) • Wireframe • Cartoon • Spacefilling