Introduction to Bioinformatics

Junhui Wang, May 2004

Outline
• What is bioinformatics?
• Introduction to biological databases
• Sequence alignment

Why use bioinformatics?
• The explosive growth in the amount of biological information necessitates the use of computers for cataloguing and retrieval.
• It is impossible to analyze data at this scale by manual inspection.
• Data mining: functional and structural information is important for studying the molecular basis of diseases (and evolutionary patterns).

What is bioinformatics?
• A mixture of computer science, mathematics, and biology.
• Development of new algorithms and statistics to assess relationships among members of large data sets.
• Analysis and interpretation of various types of data.
• Development and implementation of tools to efficiently access and manage different types of information.

Databases for bioinformatics
• Nucleotide databases and protein databases
• Primary databases and secondary databases

DNA → RNA → protein

DNA → RNA → protein
• DNA: genomic DNA databases
• RNA: cDNA, ESTs
• protein: protein sequence databases
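
The DNA → RNA → protein flow on this slide can be illustrated with a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes the input is a coding-strand DNA sequence read in frame from its first base, uses the standard genetic code, and the function names `transcribe` and `translate` are chosen here for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        # The lookup table is keyed by DNA letters, so map U back to T.
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```

Each of the three molecule types is stored in its own kind of database, which is why sequence searches usually start by deciding whether the query is DNA, RNA-derived (cDNA, EST), or protein.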

There are three major public DNA databases:
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: housed in Japan

www.ncbi.nlm.nih.gov

PubMed is…
• National Library of Medicine's search service
• 11 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on the sidebar)

Entrez is…
• a search and retrieval system that integrates NCBI databases:
• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data

Entrez

BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day

OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick

Books is…
• a searchable resource of online books

TaxBrowser is…
• browser for the major divisions of living organisms (bacteria, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• a 3D-structure viewer

Four questions we can answer at NCBI (and elsewhere):
[1] How can I do a literature search using PubMed?
[2] How can WelchWeb help?
[3] How can I use Entrez to find information about a particular gene or protein?
[4] How can I find information about a particular disease?

Question #1: How can I use PubMed at NCBI to find literature information?

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966.

MeSH is the acronym for "Medical Subject Headings." MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM. The MeSH vocabulary is used for indexing journal articles for MEDLINE. This controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

PubMed search strategies
• Try the tutorial (“Education” on the left sidebar)
• Use Boolean queries (AND, OR, NOT)
• Try using “limits”
• Try “LinkOut” to find external resources
• Obtain articles online via Welch Medical Library (and download PDF files): http://www.welch.jhu.edu/
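
The Boolean search strategy above can also be scripted. The following is a minimal sketch, not part of the original slides: it composes an AND / OR / NOT query string and wraps it in a URL for NCBI's E-utilities ESearch service. The helper names (`quote_term`, `boolean_query`, `esearch_url`) and the example search terms are assumptions made for illustration, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def quote_term(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def boolean_query(all_of, any_of=(), none_of=()):
    """Combine terms with AND, then an optional OR group, then NOT exclusions."""
    query = " AND ".join(quote_term(t) for t in all_of)
    if any_of:
        alternatives = " OR ".join(quote_term(t) for t in any_of)
        query = f"({query}) AND ({alternatives})"
    for term in none_of:
        query = f"({query}) NOT {quote_term(term)}"
    return query


def esearch_url(query, retmax=20):
    """Build (but do not send) an ESearch request against the PubMed database."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical example terms, chosen only to show the query shape.
    q = boolean_query(["globin", "anemia"], any_of=["human", "mouse"], none_of=["review"])
    print(q)
    print(esearch_url(q))
```

The explicit parentheses make the grouping visible instead of relying on the search engine's evaluation order, which matters once OR and NOT are mixed in one query.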

Question #2: How can I use WelchWeb (from the Welch Medical Library) to do literature searches? WelchWeb is available at http://www.welch.jhu.edu

E-mail gateway

PubMed gateway

Library catalog

Remote access to Welch services

Request literature

Browse journals

Browse databases

Question #3: How can I use NCBI (or other sites) to find information about a protein or gene?

Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy Sequence Retrieval System (this is separate from NCBI)

Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
    LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635).
[2] Entrez
[3] UniGene
[4] ExPASy SRS
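
The RefSeq accession numbers above follow a fixed pattern: a two-letter prefix that encodes the molecule type, an underscore, and a number. A short sketch of how that pattern could be checked in Python is given below. It is not part of the original slides; the function name `classify_refseq` and the small prefix table are assumptions for illustration, and the table covers only a handful of common prefixes.

```python
import re

# A few common RefSeq prefixes and the molecule type each denotes
# (illustrative subset, not a complete list).
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NR": "non-coding RNA",
    "NC": "genomic",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}

# Two capital letters, underscore, digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return (molecule type, number, version or None) for a RefSeq accession."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix, number, version = match.groups()
    molecule = REFSEQ_PREFIXES.get(prefix, "unknown")
    return molecule, number, int(version) if version else None


if __name__ == "__main__":
    # The two accessions quoted on the slide.
    for acc in ("NM_006744", "NP_007635"):
        print(acc, classify_refseq(acc))
```

Checking the prefix first is a cheap way to confirm that a DNA tool is being given an NM_ record and a protein tool an NP_ record before any database query is sent.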

Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
    Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search.
[3] UniGene
[4] ExPASy SRS

The GenBank flatfile:
• the elementary unit of information
• one of the most commonly used formats
• LOCUS: locus name / sequence length / molecule type / GenBank division code / date
• DEFINITION: summarizes the biology of the record (genus species / product name / ….)
• ACCESSION: an accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
• VERSION: accession version
• GID: the GI (GenInfo Identifier)

The GenBank flatfile (cont.):
• KEYWORDS: terms that identify the particular entry; not very useful
• SOURCE: gives either the common name of the organism or its scientific name
• REFERENCE: at least one reference or citation, which can be published or unpublished; the MEDLINE and PubMed identifiers provide links to the MEDLINE and PubMed databases
• COMMENT: remarks that refer to the whole record
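
The header fields listed above can be pulled out of a flatfile with a small parser. The sketch below is not part of the original slides and makes several assumptions: top-level keywords start in the first column, indented lines continue the preceding field, and the header ends at FEATURES or ORIGIN. The record in `SAMPLE` is an invented placeholder, not a real GenBank entry, and `parse_flatfile_header` is a name chosen for illustration.

```python
# Invented placeholder record used only to exercise the parser.
SAMPLE = """\
LOCUS       EXAMPLE01     60 bp    mRNA    PRI 01-JAN-2000
DEFINITION  Hypothetical example record, not a real
            GenBank entry.
ACCESSION   X00000
VERSION     X00000.1  GI:0000000
KEYWORDS    .
SOURCE      human
  ORGANISM  Homo sapiens
REFERENCE   1  (bases 1 to 60)
  AUTHORS   Doe,J.
COMMENT     Placeholder comment.
FEATURES             Location/Qualifiers
ORIGIN
"""


def parse_flatfile_header(text):
    """Map each top-level keyword to a list of its values (one per occurrence)."""
    fields = {}
    current = None  # list of values for the keyword seen most recently
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # end of the header section
        if line[0].isspace():
            # Indented line: continuation or sub-keyword of the previous field.
            if current is not None:
                current[-1] += " " + line.strip()
            continue
        keyword, _, value = line.partition(" ")
        current = fields.setdefault(keyword, [])
        current.append(value.strip())
    return fields


if __name__ == "__main__":
    header = parse_flatfile_header(SAMPLE)
    name, length, unit, molecule, division, date = header["LOCUS"][0].split()
    print(name, length, molecule, division, date)
    print(header["ACCESSION"], header["VERSION"])
```

Values are kept as lists because a record may carry more than one REFERENCE block. For real work, an established parser such as the one in Biopython is likely a better choice than this hand-rolled sketch.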

Graphics format

Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
    UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and how abundant it is.
[4] ExPASy SRS
