Bioinformatics

Bioinformatics Jack Min Office 3012 Office hours: TR 12:15 – 4

How would you define "bioinformatics"? Can you distinguish between a gene and a genome? Do you know what a Blast search is? Do you know what any of the letters in "blast" represent? What is Genbank? What is NCBI? What is meant by "codon usage"?
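As a rough illustration of what "codon usage" measures, the sketch below counts how often each codon appears in a coding sequence and reports it as a fraction of all codons. The function name `codon_usage` and the example sequence are made up for illustration; real analyses usually compare synonymous codons across many genes.

```python
from collections import Counter


def codon_usage(cds):
    """Return the fraction of all codons accounted for by each codon.

    `cds` is assumed to be an in-frame coding sequence; trailing bases
    that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical sequence made only of leucine codons (CTG, CTA, TTG).
usage = codon_usage("CTGCTGCTATTGCTG")
for codon, fraction in sorted(usage.items()):
    print(codon, round(fraction, 2))
```

Here CTG is used for three of the five leucines, which is the kind of bias the term refers to.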

Here are two nucleotide sequences:
(i) AGTCGGTAACCTAAG
(ii) GGCAAAUACUAAGGA
Which of these is a DNA sequence? Explain your answer.

Here is a DNA sequence: 5' ATTGGRTCCAATA 3'
(i) What do the 5' and 3' mean?
(ii) What does the R stand for?
(iii) Write the sequence of the complementary DNA strand, using the same notation.
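One way to check answers to these two questions is with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the DNA/RNA test simply looks for T versus U, and the complement table uses the standard IUPAC ambiguity codes (for example R = A or G, whose complement is Y = C or T).

```python
# Complement of each IUPAC nucleotide code (DNA alphabet).
COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "R": "Y", "Y": "R",   # R = A or G (purine), Y = C or T (pyrimidine)
    "S": "S", "W": "W",   # S = C or G, W = A or T
    "K": "M", "M": "K",   # K = G or T, M = A or C
    "B": "V", "V": "B",   # B = not A, V = not T
    "D": "H", "H": "D",   # D = not C, H = not G
    "N": "N",             # N = any base
}


def nucleic_acid_type(seq):
    """Guess DNA vs RNA from the presence of T or U."""
    seq = seq.upper()
    if "U" in seq and "T" not in seq:
        return "RNA"
    if "T" in seq and "U" not in seq:
        return "DNA"
    return "ambiguous"


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(nucleic_acid_type("AGTCGGTAACCTAAG"))   # DNA
print(nucleic_acid_type("GGCAAAUACUAAGGA"))   # RNA
print(reverse_complement("ATTGGRTCCAATA"))    # TATTGGAYCCAAT
```

The complementary strand is reversed so that it is also read 5' to 3', which is what "using the same notation" asks for.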

What is meant by the "template" and "coding" strands of DNA? Name three different kinds of RNA. Which amino acids are represented by the following symbols? A L Q K G W
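The single-letter amino acid symbols can be looked up in a small table. The sketch below lists the 20 standard amino acids with their one-letter codes; the names `AMINO_ACID_NAMES` and `spell_out` are made up for this example.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACID_NAMES = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def spell_out(symbols):
    """Translate a string of one-letter codes into amino acid names."""
    return [AMINO_ACID_NAMES[s] for s in symbols.upper() if not s.isspace()]


print(spell_out("A L Q K G W"))
```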

International Union of Pure and Applied Chemistry

What is a PAM matrix? Can both nucleotide and amino acid sequences be used to build molecular phylogenies? Explain the method of parsimony for building sequence-based phylogenies. What is the program CLUSTAL used for? What do you know about microarrays? For instance, are there different types of microarrays, and what are they used for?
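To make the parsimony question concrete, here is a small sketch of Fitch's algorithm, one common way to count the minimum number of substitutions a fixed tree requires. The slides do not name a specific method, so treat this as one possible illustration; the tree shapes, species labels and sequences are invented for the example.

```python
def fitch(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: one substitution is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch costs over all columns of an alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical aligned sequences and two candidate trees.
alignment = {"human": "AAG", "chimp": "AAG", "mouse": "ACG", "rat": "ACT"}
tree_1 = (("human", "chimp"), ("mouse", "rat"))
tree_2 = (("human", "mouse"), ("chimp", "rat"))

print(parsimony_score(tree_1, alignment))  # 2
print(parsimony_score(tree_2, alignment))  # 3
```

Under parsimony the tree needing fewer changes (here `tree_1`) is preferred.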

This question is to test your general knowledge of genetics and molecular biology. Can you define the following terms?
(a) Gene
(b) Locus
(c) Allele
(d) Linkage
(e) Linkage disequilibrium
(f) Synonymous substitution
(g) Intron
(h) Concerted evolution
(i) Pleiotropy
(j) PCR
(k) RFLP
(l) Haplotype

What are the goals of the course?
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems

Themes throughout the course Textbooks Web sites Literature references Gene/protein families Computer labs

Textbook: Bioinformatics and Functional Genomics, second edition (Wiley, 2009).
Reference: Bioinformatics, third edition (Wiley, 2005), Baxevanis and Ouellette.

Web sites
The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google “moodle bioinformatics”)
--This site contains the PowerPoint slides for each lecture, including color and black & white versions
--The weekly quizzes are here
The textbook website is: http://www.bioinfbook.org
This has PowerPoint slides, URLs, etc. organized by chapter

Literature references You are encouraged to read original source articles (posted on moodle). They will enhance your understanding of the material. Readings are optional but recommended. http://ghr.nlm.nih.gov/handbook.pdf http://www.ncbi.nlm.nih.gov/books/NBK21101/

Themes throughout the course: gene/protein families
We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course. Globins including hemoglobin and myoglobin carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species

Computer labs
You can use any computer you can find: the computer lab in the department, a computer in your own lab, or the computers in my lab (3013).

Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment: introduction

What is Bioinformatics?
Broad definition: the use of computational tools for acquiring, storing and analyzing biological information.
Narrow definition: the use of computational tools for acquiring, storing and analyzing molecular sequence information.

What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

bioinformatics medical informatics public health informatics algorithms databases infrastructure Tool-users Tool-makers

Three perspectives on bioinformatics The cell The organism The tree of life Page 4

DNA → RNA → protein → phenotype Page 5

Time of development Body region, physiology, pharmacology, pathology Page 5

After Pace NR (1997) Science 276:734 Page 6

Fig. 2.1 Page 17

Growth of GenBank + Whole Genome Shotgun (1982-November 2008)
[Figure: number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions), and base pairs in GenBank + WGS (billions), plotted by year from 1982 to 2007]

Central dogma of molecular biology: DNA → RNA → protein
Central dogma of bioinformatics and genomics: genome → transcriptome → proteome
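The flow from DNA to RNA to protein can be mimicked in a few lines of code. The sketch below assumes the input is the coding (sense) strand, uses the standard genetic code, and stops at the first stop codon; the function names and the short example sequence are invented for illustration.

```python
# Standard genetic code, laid out with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """DNA coding strand -> mRNA (same sequence with U in place of T)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein in one-letter code, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCAAGTGGTAA"   # made-up example sequence
mrna = transcribe(coding_dna)
print(mrna)                      # AUGGCCAAGUGGUAA
print(translate(mrna))           # MAKW
```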

DNA RNA protein phenotype protein sequence databases cDNA ESTs UniGene genomic DNA databases Fig. 2.2 Page 20

There are three major public DNA databases EMBL GenBank DDBJ The underlying raw DNA sequences are identical Page 16

There are three major public DNA databases EMBL GenBank DDBJ Housed at EBI European Bioinformatics Institute Housed at NCBI National Center for Biotechnology Information Housed in Japan Page 16

The Trace Archive at NCBI contains over 2 billion traces 11/08

Taxonomy at NCBI: ~200,000 species are represented in GenBank 11/08 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

The most sequenced organisms in GenBank (billions of bases)
Homo sapiens 13.1
Mus musculus 8.4
Rattus norvegicus 6.1
Bos taurus 5.2
Zea mays 4.6
Sus scrofa (wild boar) 3.6
Danio rerio (zebrafish) 3.0
Oryza sativa (japonica) 1.5
Strongylocentrotus purpuratus (sea urchin) 1.4
Nicotiana tabacum 1.1
Updated 11-6-08, GenBank release 168.0. Excluding WGS, organelles, metagenomics. Table 2-2 Page 18

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24

Fig. 2.5 Page 25 www.ncbi.nlm.nih.gov

Fig. 2.5 Page 25

PubMed is… • National Library of Medicine's search service • 16 million citations in MEDLINE • links to participating online journals • PubMed tutorial (via “Education” on side bar) Page 24

Entrez integrates… • the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes Page 24

Entrez is a search and retrieval system that integrates NCBI databases Page 24

BLAST is… • Basic Local Alignment Search Tool • NCBI's sequence similarity search tool • supports analysis of DNA and protein databases • 100,000 searches per day Page 25

OMIM is… • Online Mendelian Inheritance in Man • catalog of human genes and genetic disorders • created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI Page 25

Books is… • searchable resource of on-line books Page 26

TaxBrowser is… • browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) • taxonomy information such as genetic codes • molecular data on extinct organisms Page 26

Structure site includes… • Molecular Modelling Database (MMDB) • biopolymer structures obtained from the Protein Data Bank (PDB) • Cn3D (a 3D-structure viewer) • vector alignment search tool (VAST) Page 26

Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment: introduction

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)
RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)
protein
NP_007635 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
Page 27
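Because each database uses its own accession format, the format alone often hints at where a record comes from. The sketch below is a simplified heuristic, not an official specification: the patterns cover only the example formats on this slide, some formats overlap in practice, and the names `PATTERNS` and `classify_accession` are hypothetical.

```python
import re

# (label, regular expression) pairs, checked in order. Simplified patterns
# that are assumptions for illustration; real formats have more variants.
PATTERNS = [
    ("RefSeq mRNA", r"NM_\d+"),
    ("RefSeq protein", r"NP_\d+"),
    ("RefSeq genomic contig", r"NT_\d+"),
    ("dbSNP", r"rs\d+"),
    ("PDB structure", r"\d[A-Za-z0-9]{3}"),
    ("UniProt/Swiss-Prot protein", r"[OPQ]\d[A-Z0-9]{3}\d"),
    ("GenBank protein", r"[A-Z]{3}\d{5}"),
    ("GenBank nucleotide", r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
]


def classify_accession(accession):
    """Guess the source database from the shape of an accession number."""
    core = accession.strip().split(".")[0]   # drop a version suffix like ".1"
    for label, pattern in PATTERNS:
        flags = re.IGNORECASE if label == "dbSNP" else 0
        if re.fullmatch(pattern, core, flags):
            return label
    return "unknown"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744",
            "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify_accession(acc))
```

The UniProt pattern is checked before the GenBank nucleotide pattern because an accession such as Q28369 would otherwise also fit the "one letter plus five digits" shape.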

Five ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) [5] UCSC Genome Browser Page 27

Five ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635). Page 27
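Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds the request URLs and does not contact NCBI; the helper names `esearch_url` and `efetch_url` are hypothetical, and the parameters shown (`db`, `term`, `id`, `rettype`, `retmode`) are the basic ones for the `esearch` and `efetch` endpoints.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(database, term):
    """URL that searches an Entrez database for a text query."""
    query = urlencode({"db": database, "term": term, "retmode": "json"})
    return EUTILS_BASE + "esearch.fcgi?" + query


def efetch_url(database, accession, rettype="fasta"):
    """URL that retrieves one record, by default in FASTA format."""
    query = urlencode({"db": database, "id": accession,
                       "rettype": rettype, "retmode": "text"})
    return EUTILS_BASE + "efetch.fcgi?" + query


# Search Entrez Gene for beta globin, then fetch the RBP4 RefSeq records.
print(esearch_url("gene", "beta globin"))
print(efetch_url("nuccore", "NM_006744"))
print(efetch_url("protein", "NP_007635"))
```

Opening the printed URLs in a browser (or with a tool such as `urllib.request`) returns the search results or sequence records, subject to NCBI's usage guidelines.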

From the NCBI home page, type “beta globin” and hit “Go” revised 11/08 Fig. 2.7 Page 29

revised Fig. 2.7 Page 29
