Bioinfo/Stat 545 Biostat646 Data Analysis in Molecular Biology

Bioinfo/Stat 545 Biostat646 Data Analysis in Molecular Biology Lab 1: Bioinformatics Online Resources Dongxiao Zhu

Overview • Main types of biological data • Sequence data • Interaction data • Microarray and gene expression data • Others: macromolecular structure data, human genes and disease data • Information retrieval strategies

Part I. Online Biological Data Resources • 2004 Nucleic Acids Research database issue http://www3.oup.co.uk/nar/database/cap/ (database list) • 548 databases listed in total, 162 more than the previous year • Main types of biomedical data • Sequence Data (DNA and Protein Sequence) • Gene sequencing, “Whole genome shotgun” and Lander & Waterman Assembly Algorithm • Protein sequencing, de novo sequencing from tandem mass spectra • Gene Prediction, Sequence alignment and BLAST • Gene Annotation and Gene Ontology • Protein/RNA secondary/tertiary structure prediction • Interaction data – Biological pathway and network • Microarray and Gene Expression Data • Others: structure data, human genes and disease

Gene/Protein sequencing – data acquisition and data accuracy • Whole genome shotgun[1] • Double end sequencing • short reads off both ends of large inserts • additional information for assembly • Clone coverage vs. sequence coverage • Scaffolds • ordered and oriented contigs • sequence gaps • De novo protein sequencing from tandem mass spectra[2] • Accuracy issues: • Large scale repeats • Missing and contaminating data • Plasmids and minichromosomes • Signature of tandem repeats • Polymorphism
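The idea of sequence coverage can be made concrete with the Lander & Waterman model mentioned in Part I. The sketch below is a minimal illustration, assuming reads of equal length placed uniformly at random and ignoring the minimum overlap needed to join two reads. The genome size, read length and read count are made-up values, not figures from any real project.

```python
import math

def sequence_coverage(n_reads, read_len, genome_len):
    """Average number of times each base is sequenced: c = N * L / G."""
    return n_reads * read_len / genome_len

def expected_fraction_uncovered(coverage):
    """Probability that a given base is covered by no read: exp(-c)."""
    return math.exp(-coverage)

def expected_contigs(n_reads, coverage):
    """Expected number of contigs, N * exp(-c), if any overlap joins two reads."""
    return n_reads * math.exp(-coverage)

# Hypothetical shotgun project (illustrative numbers only).
genome_len = 4_600_000
read_len = 500
n_reads = 73_600

c = sequence_coverage(n_reads, read_len, genome_len)
print("coverage:", c)
print("fraction of genome left uncovered:", expected_fraction_uncovered(c))
print("expected number of contigs:", expected_contigs(n_reads, c))
```

Under these assumptions, raising the coverage shrinks the uncovered fraction exponentially, which is why gaps persist even in deeply sequenced projects.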

Gene Prediction, Annotation and Gene Ontology • GENSCAN web service[3] • http://genes.mit.edu/GENSCAN.html • Sensitive in recognizing at least one exon • Biochemical Functional Annotation (Biochemical View) • Cloning, expression and functional studies • Database homolog/ortholog search • Sequence alignment (similar seq -> similar function) • Structure alignment (similar structure -> similar function) • Protein sub-cellular location prediction using primary sequence alone (Cellular View) • Codon usage bias in differently localized proteins • Signal peptide • Gene ontology – consistent descriptions of gene products in different databases

Sequence Alignment/BLAST and Literature Search – Bioinformatics approaches to gene annotation • Why BLAST? • The number of novel sequences is increasing explosively. Even in E. coli, arguably the best characterized organism, about half of the ~4200 proteins have not been studied experimentally. Moreover, every newly sequenced genome encodes hundreds to thousands of novel proteins • There is a need to infer the functional roles of these novel proteins • Compare novel sequences with previously characterized genes to annotate function • BLAST algorithm[4] • http://www.bioinformatics.med.umich.edu/Courses/526/lecturenotes.html • BLAST program selection guide • http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml • BLAST tutorial • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html • Literature Search (Part II)

Gene ontology (GO)[5] • Why GO? • Use of GO terms by several collaborating databases facilitates uniform queries across them • Hierarchically structured, so the vocabulary can be queried at different levels • For example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases • Allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product http://www.geneontology.org/index.shtml#downloads
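Querying at different levels of the hierarchy can be sketched with a toy example. Everything below is hypothetical: the term names, the parent links and the gene identifiers are invented for illustration and are not taken from the real GO files. The point is only that a query for a broad term also returns gene products annotated to its more specific descendants.

```python
# Toy hierarchy: child term -> parent terms (illustrative, not real GO data).
PARENTS = {
    "signal transduction": ["biological process"],
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "G-protein coupled receptor signaling": ["signal transduction"],
    "cAMP biosynthesis": ["biological process"],
}

# Hypothetical annotations: gene product -> most specific term known for it.
ANNOTATIONS = {
    "geneA": "receptor tyrosine kinase signaling",
    "geneB": "G-protein coupled receptor signaling",
    "geneC": "signal transduction",
    "geneD": "cAMP biosynthesis",
}

def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def genes_annotated_to(term):
    """Gene products annotated to the term or to any of its descendants."""
    return sorted(g for g, t in ANNOTATIONS.items() if term in ancestors(t))

print(genes_annotated_to("signal transduction"))                 # broad query
print(genes_annotated_to("receptor tyrosine kinase signaling"))  # zoomed in
```

Because a term may list several parents, the same traversal also works when the hierarchy is a graph with multiple inheritance rather than a simple tree.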

What is GO[5]? • GO is designed to be a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO is used to annotate genes and gene products • Three categories of GO • Biological Process: a biological objective to which the gene or gene product contributes. A process is accomplished via one or more ordered assemblies of molecular functions. E.g. “cell growth and maintenance”, “signal transduction”, “cAMP biosynthesis”. • Molecular Function: the biochemical activity of a gene product. E.g. “enzyme”, “ligand”, “Toll receptor ligand”. • Cellular Component: the place in the cell where a gene product is active. E.g. “ribosome” or “proteasome”, “nuclear membrane”.

An interesting analogy for GO • Statistician’s view • A multivariate definition • DB developer’s view • An entity/attributes definition in a DB schema • Biologist’s view • A nomenclature accepted by Biochemists/Molecular Biologists, Cell Biologists, Geneticists, Neuroscientists and Developmental Biologists

What GO is NOT? • GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context. • GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following. • Knowledge changes and updates lag behind. • Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related. • GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO. • GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus

Protein/RNA secondary/tertiary structure prediction • Protein secondary/tertiary structure prediction • Server list, http://www.embl-heidelberg.de/predictprotein/doc/explain_meta.html#list • Prediction methodologies: sliding window based and machine learning based • Easier and feasible at this moment: prediction of 2D topology for some functionally important proteins with simple patterns, e.g. transmembrane proteins [7]. • RNA secondary/tertiary structure prediction • Algorithms: R. Durbin et al., Biological Sequence Analysis, Cambridge University Press, 1998, p. 267 • Michael Zuker’s prediction server [6] • http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi
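A sliding window predictor can be illustrated with a hydropathy scan for candidate transmembrane segments. This is only a sketch of the general idea, not the method of reference [7]: the Kyte-Doolittle scale, the window of 19 residues and the cutoff of 1.6 are assumptions chosen for this example, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy values (the choice of scale is an assumption
# of this sketch; the slide does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_scores(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose mean exceeds the threshold."""
    return [i for i, score in enumerate(window_scores(seq, window))
            if score > threshold]

# Invented sequence: charged flanks around a 20-residue hydrophobic core.
flank = "DEKRDEKRDE"
core = "LLIVLLAVLLIVLLAVLLIV"
sequence = flank + core + flank

print(candidate_tm_windows(sequence))
```

Windows that sit mostly inside the hydrophobic core score above the cutoff, while windows dominated by the charged flanks do not.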

Interaction Data – Biological Pathway and Network • Three main types of interaction data • Signal transduction or transcription regulation • Protein-protein interaction • Metabolic pathway (best in terms of studying network topology) • Interaction databases • KEGG database, metabolic pathways and signal transduction pathways in 107 organisms • http://dip.doe-mbi.ucla.edu/dip/Links.cgi • Network model (random vs. scale free, small world) • Network analysis and visualization software • http://www-personal.umich.edu/~mejn/courses/2004/cscs535/syllabus.pdf • Pajek, AT&T DOT etc.

Metabolic Network in Homo sapiens

Summary statistics of network analysis in 16 organisms

Microarray and Gene Expression Data • Assumptions • Measured signal is proportional to the amount of the corresponding cDNA/mRNA • Amount of mRNA determines amount of protein, i.e. there is no regulation at the translation level • Neither assumption has been proven yet. • DNA microarray databases (useful links) • http://industry.ebi.ac.uk/~alan/MicroArray/ • http://genome-www5.stanford.edu/resources.html • http://www.ebi.ac.uk/microarray/ • There are many more; explore them yourself!

Download Gene Expression Data from SMD – An example • Stanford Microarray Database (SMD) • Retrieving public data from SMD • Retrieving data for an organism • ftp://genome-ftp.stanford.edu/pub/smd/organisms • One directory per organism, named with the two-letter code used by SMD • Under each directory, one file per experiment • Three ways to retrieve • Web client, e.g. IE, Netscape, etc. • Graphical ftp client, e.g. Flashget, etc. • Command line ftp client • ftp -i genome-ftp.stanford.edu (-i turns off per-file prompting, so mget fetches every file) • Name: anonymous Password: XX@ • cd pub/smd/organisms/SC • mget *gz
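The command line session above can also be scripted. The following is a minimal sketch using Python's standard ftplib module; the host, directory and placeholder password are the ones shown on the slide, the helper names select_gz and download_all are invented here, and whether the server is still reachable is not assumed.

```python
from ftplib import FTP

HOST = "genome-ftp.stanford.edu"      # host named on the slide
REMOTE_DIR = "pub/smd/organisms/SC"   # directory used in the slide's example

def select_gz(names):
    """Keep only names ending in 'gz', mirroring `mget *gz`."""
    return [name for name in names if name.endswith("gz")]

def download_all(host=HOST, remote_dir=REMOTE_DIR, password="XX@"):
    """Log in anonymously, change directory and fetch every *gz file."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=password)
        ftp.cwd(remote_dir)
        for name in select_gz(ftp.nlst()):
            # Write each remote file to the current working directory.
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)

# download_all()  # uncomment to run; requires network access to the server
```

Keeping the file filter in its own function makes it easy to check the selection logic without a network connection.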

Continued • Retrieving all public data for a publication • Go to http://genome-www5.stanford.edu/cgi-bin/tools/display/listMicroArrayData.pl?tableName=publication • Click any entry in the column “Data in SMD” • Click “view” to read a brief description of the experiment design • Click “display data” to query experiment by experiment • Click “Data Retrieval and Analysis” to filter and retrieve data

Part II. Information Retrieval in Bioinformatics • Mastering effective information retrieval techniques keeps your research thinking and work up to date • My steps in doing biomedical research • Identify an interesting topic and raise a scientific hypothesis • Start from NCBI Entrez, the life science search engine. http://www.ncbi.nlm.nih.gov/Entrez/ • Enter the keyword or phrase in the query box and click GO • The number of items retrieved from each database is displayed • Briefly go through each kind of resource • NCBI Entrez (Good starting point) • Common retrieval interface to many databases • Controlled links between databases • Maintained at the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM)

PubMed and related IR Strategies - biomedical literature and books • What is PubMed? PubMed is a web-based database of bibliographic information drawn primarily from the life sciences literature • PubMed tutorial: http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html • Search Mechanisms • PubMed uses an Automatic Term Mapping feature • Look first in the MeSH Translation Table (translate keywords into MeSH terms, e.g. from “renal transplant” to “kidney transplant”) • Then look in the journal translation table • Finally look in the author index • As soon as PubMed finds a match, the mapping stops. That is, if a term matches in the MeSH Translation Table, PubMed does not continue looking in the next table. It is therefore absolutely necessary to specify the “Limit” in NCBI when a term is ambiguous. E.g. “cell” is both a MeSH term and a journal name

PubMed - Continued • What if “no match” is found? • This happens when PubMed is unable to match a search term with either of the translation tables or the Author Index • PubMed will then search the individual words in All Fields. Individual terms will be combined (ANDed) together. • Example: TATA Box associated transcription factor • Phrase Searching • Phrase-search formats such as double quotes instruct PubMed to bypass automatic term mapping. Instead PubMed looks for the phrase in its Index of searchable terms. If the phrase is in the Index, PubMed will retrieve citations that contain the phrase. • PubMed may fail to find a phrase because it is not in the Index. • Your phrase may actually appear in citation and abstract data, but may not be in the Index. If this is the case, the double quotes are ignored and the phrase is processed using Automatic Term Mapping.
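The lookup order of Automatic Term Mapping and the fallback to All Fields can be sketched as follows. This is a toy model only: the three tables hold a handful of invented entries rather than PubMed's real tables, and the function name map_term is hypothetical.

```python
# Toy lookup tables (illustrative entries, not PubMed's real data).
MESH_TABLE = {"renal transplant": "kidney transplant", "cell": "cell"}
JOURNAL_TABLE = {"cell": "Cell"}
AUTHOR_INDEX = {"states dj"}

def map_term(term):
    """Return (source, translated query), stopping at the first table that matches."""
    key = term.lower()
    if key in MESH_TABLE:            # 1. MeSH Translation Table
        return "MeSH", f"{MESH_TABLE[key]}[MeSH Terms]"
    if key in JOURNAL_TABLE:         # 2. journal translation table
        return "Journal", f"{JOURNAL_TABLE[key]}[Journal]"
    if key in AUTHOR_INDEX:          # 3. author index
        return "Author", f"{term}[Author]"
    # No match anywhere: search each word in All Fields, ANDed together.
    words = term.split()
    return "All Fields", " AND ".join(f"{w}[All Fields]" for w in words)

for query in ["renal transplant", "cell", "States DJ",
              "TATA Box associated transcription factor"]:
    print(query, "=>", map_term(query))
```

In this model “cell” never reaches the journal table because the MeSH table matches first, which is the ambiguity the previous slide warns about.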

MeSH database (“GO” in literature search) • Database of indexing terms • Entry example: NF-kappa B • Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA. • Year introduced: 1991 • Entrez => MeSH http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh • NLM => MeSH http://www.nlm.nih.gov/mesh/meshhome.html

Structure of MeSH (Combination of EC and GO) • Divisions • Anatomy [A] • Organisms [B] • Diseases [C] • Chemicals and Drugs [D] • Analytical, Diagnostic and Therapeutic Techniques and Equipment [E] • Psychiatry and Psychology [F] • Biological Sciences [G] • Physical Sciences [H] • Anthropology, Education, Sociology and Social Phenomena [I] • Technology and Food and Beverages [J] • Humanities [K] • Information Science [L] • Persons [M] • Health Care [N] • Geographic Locations [Z] • Hierarchy with Multiple Inheritance • Amino Acids, Peptides, and Proteins [D12] > Proteins [D12.776] > DNA-Binding Proteins [D12.776.260] > NF-kappa B [D12.776.260.600] • Amino Acids, Peptides, and Proteins [D12] > Proteins [D12.776] > Nuclear Proteins [D12.776.660] > NF-kappa B [D12.776.260.600] • Amino Acids, Peptides, and Proteins [D12] > Proteins [D12.776] > Transcription Factors [D12.776.930] > NF-kappa B [D12.776.260.600]

NF-kappa B • Definition: Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA. • Year introduced: 1991 • Subheadings: administration and dosage, agonists, analysis, antagonists and inhibitors, biosynthesis, blood, cerebrospinal fluid, chemistry, classification, deficiency, diagnostic use, drug effects, genetics, immunology, isolation and purification, metabolism, pharmacokinetics, pharmacology, physiology, radiation effects, secretion, therapeutic use, toxicity, ultrastructure • Search options: Restrict Search to Major Topic headings only; Do Not Explode this term (i.e., do not include MeSH terms found below this term in the MeSH tree) • Entry Terms: NF-kB; NF kB; Nuclear Factor kappa B; kappa B Enhancer Binding Protein; Immunoglobulin Enhancer-Binding Protein; Enhancer-Binding Protein, Immunoglobulin; Immunoglobulin Enhancer Binding Protein; Transcription Factor NF-kB; Factor NF-kB, Transcription; NF-kB, Transcription Factor; Transcription Factor NF kB; Ig-EBP-1; Ig EBP 1 • Previous Indexing: DNA-Binding Proteins (1987-1990); Transcription Factors (1987-1990) • See Also: I-kappa B • Tree positions • All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > DNA-Binding Proteins > NF-kappa B • All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Nuclear Proteins > NF-kappa B • All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Transcription Factors > NF-kappa B • MeSH Full Listing

Tips for increasing your search sensitivity and specificity • Split the query yourself with the logical operators AND and OR, look each term up in the MeSH database, and use MeSH terms in your query • Use tags to search efficiently • [au], “author”, e.g. States DJ[au] • [dp], “date of publication”, e.g. 2004[dp] • [ad], “address”, e.g. Ann Arbor[ad], etc. • [MeSH], “MeSH term”, e.g. Transcription factor[MeSH] • Select the “Limited to” option to prevent the search from stopping prematurely • Use phrase searching (“”) if you don’t want your phrase to be split into separately searched words.
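Tagged queries like the ones above are easy to assemble programmatically. The sketch below only builds the query string; the helper names tagged, all_of and any_of are invented for this example, and nothing is sent to PubMed.

```python
def tagged(value, tag):
    """Attach a field tag to a value, e.g. tagged('2004', 'dp') -> '2004[dp]'."""
    return f"{value}[{tag}]"

def all_of(*parts):
    """Combine parts with AND."""
    return " AND ".join(parts)

def any_of(*parts):
    """Combine parts with OR, wrapped in parentheses."""
    return "(" + " OR ".join(parts) + ")"

query = all_of(
    tagged("States DJ", "au"),
    tagged("2004", "dp"),
    any_of(tagged("Ann Arbor", "ad"), tagged("48109", "ad")),
)
print(query)
```

Wrapping the OR group in parentheses keeps the grouping explicit instead of leaving it to the search engine's parsing order.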

Entrez Clipboard and Address Issue • Send to “clipboard” • Place to save results collected from multiple searches • Saved for ~1 hr • Task: Find a local expert on NF kappa B • “NF kappa B” AND (48109 [ad] OR “Ann Arbor” [ad] NOT Pfizer [ad]) • (scan results for the most common senior author) • Need to think about all the ways people write addresses • “University of Michigan” fails to pick up “Univ. Mich.” or “UMMS” etc. • Zip codes are very specific, but only get about 70% • Won’t catch co-authored articles with a remote collaborator
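Scanning results for the most common senior author can be automated once the author lists are in hand. The sketch below assumes that the senior author is the last author of each article, which is a convention rather than a rule, and it uses invented author names in place of real search results.

```python
from collections import Counter

def most_common_senior_authors(author_lists, top=3):
    """Count how often each name appears as last author; return the top few."""
    counts = Counter(authors[-1] for authors in author_lists if authors)
    return counts.most_common(top)

# Hypothetical author lists, one per retrieved article.
results = [
    ["Lee A", "Smith B", "Doe J"],
    ["Kim C", "Doe J"],
    ["Doe J", "Roe R"],
    ["Park S", "Roe R"],
    ["Chen L", "Doe J"],
]

print(most_common_senior_authors(results))
```

Note that the third article lists “Doe J” first, so it counts towards “Roe R” under the last-author assumption.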

IR Strategies • Term search • Simple search for term matches (exact or stemmed) • “Find articles containing ‘p53’” • Boolean • Logical combination of term matches • “Find articles containing ‘p53’ AND ‘apoptosis’” • Statistical neighboring • Assume that articles on the same subject will use similar words • Rank articles by similarity of word use • “Find articles using vocabulary similar to the vocabulary in this title/abstract” • Deeper parsing • Natural language processing and deeper understanding • The field is still in its infancy • “Find articles describing the mechanism of p53 activation in apoptosis”

Boolean Searches • Entrez attempts to intelligently parse your query • Query: dna binding transcription factor macrophage • Details => (((("dna"[MeSH Terms] OR dna[Text Word]) AND (("pharmacokinetics"[MeSH Subheading] OR "pharmacokinetics"[MeSH Terms]) OR binding[Text Word])) AND ("transcription factors"[MeSH Terms] OR transcription factor[Text Word])) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word])) • You can force a Boolean search • Query: “dna binding” AND “transcription factor” AND macrophage • Details => (("dna binding"[All Fields] AND "transcription factor"[All Fields]) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
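The mechanics of a Boolean search can be sketched with a small inverted index. This is a generic illustration, not a description of how Entrez is implemented; the article titles are invented and the tokenizer simply splits on whitespace.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_and(index, *terms):
    """Ids of documents containing every term (Boolean AND)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Ids of documents containing at least one term (Boolean OR)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))

# Hypothetical article titles keyed by id.
docs = {
    1: "p53 induces apoptosis in tumor cells",
    2: "p53 mutation spectrum in lung cancer",
    3: "apoptosis signaling in macrophage",
}
index = build_index(docs)

print(search_and(index, "p53", "apoptosis"))
print(search_or(index, "p53", "apoptosis"))
```

AND narrows the result to the intersection of the per-term sets, while OR widens it to their union.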

Phrase Searching • Specify with quotes: “transcription factor” vs. “transcription” “factor” • Precomputed • Fast • Often mapped to synonyms and MeSH terms • Just because you get a “phrase not found” message does not mean it is not present

Text Neighboring • Related articles link (single or multiple articles) • Term usage similarity • Articles talking about the same thing are likely to use the same words • Good recall (sensitivity) • Precomputed and fast • Limitations • Strictly algorithmic, no understanding • “Ras activates PI3K” vs. “PI3K activates Ras” • Historical and author biases in vocabulary • Poor precision (specificity) • Ranking cannot satisfy everyone
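The “no understanding” limitation is easy to demonstrate with a bag-of-words similarity. The sketch below uses plain word counts and cosine similarity as a stand-in for term usage similarity; it is not the actual Related Articles algorithm, and the third sentence is an invented comparison.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts with word order discarded."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)  # Counter gives 0 if missing
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

s1 = bag_of_words("Ras activates PI3K")
s2 = bag_of_words("PI3K activates Ras")
s3 = bag_of_words("p53 induces apoptosis")

print(cosine(s1, s2))  # same words, opposite meaning
print(cosine(s1, s3))  # no words in common
```

The two Ras/PI3K sentences contain exactly the same words, so a method based purely on word usage cannot tell them apart even though they make opposite claims.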

Computational Issues in Statistical Text Retrieval • Stop words • Simple words like “the” and “and” are not worth scoring • Term weights • Should weight matches of rare words more heavily than matches of common words • Stemming and synonyms • Need to stem verbs and plural forms • May or may not be able to reduce to a normalized set of synonyms • Normalizing for length • Don’t want to exclude short articles or articles without an abstract • All vs. all comparison is not feasible • 10^7 articles => 10^14 comparisons, not feasible • Compute demands of the task are growing faster than Moore’s law
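Several of these issues can be shown in a few lines. The sketch below drops stop words, weights terms by inverse document frequency so that rare words count more, and scales each vector to unit length to normalize for document length. The stop word list and the three documents are invented, stemming and synonym handling are left out, and log(N / document frequency) is just one common choice of weight.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "a"}  # tiny illustrative list

def tokens(text):
    """Lower-cased words with stop words removed."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_weights(docs):
    """Inverse document frequency: log(N / number of documents with the word)."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(tokens(doc)))
    return {w: math.log(n_docs / count) for w, count in doc_freq.items()}

def weighted_vector(text, idf):
    """Term frequency times IDF, scaled to unit length."""
    tf = Counter(tokens(text))
    vec = {w: count * idf.get(w, 0.0) for w, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else vec

docs = [
    "the role of p53 in apoptosis",
    "the regulation of apoptosis in macrophage",
    "the structure of apoptosis proteins",
]
idf = idf_weights(docs)
print(idf)
print(weighted_vector(docs[0], idf))

# Size of an all-vs-all comparison for the article count on the slide.
n_articles = 10**7
print(n_articles ** 2)
```

A word that occurs in every document, such as “apoptosis” here, gets weight zero, while a word found in a single document gets the largest weight.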

Acknowledgements • Some slides in Part II are taken from Dr. States’ Bioinfo 526 class http://www.bioinformatics.med.umich.edu/Courses/526 • Dr. Zhaohui (Steve) Qin for helpful discussions • All authors of the references that I have cited
