Review of Biological Database Utilization


Biological Databases We will discuss: • Usefulness to the bioinformaticist • Database types • Search methods and tools

Importance of the Public Databases • The data provide the basis for sequence-based biology • Open access is key • Supported by Human Genome Project, International Nucleotide Sequence Database Collaboration and others • The amount of biological data is enormous • Biologists are dependent on computers for storing, organizing, searching, manipulating, and retrieving the data/information

Why Search Biological Databases? • Generate new sequence • Is it already in bank? • Homologous sequences? • Find out about the gene • Annotation • Literature

Why Search Biological Databases? • Similar non-coding sequences • Repetitive elements • Regulatory regions • Homologous proteins and protein families • Identify and verify PCR priming sites

Biological Databases Types of Databases • Generalized databases (DNA, proteins and carbohydrates, 3D-structures) • Specialized databases (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)

Generalized Databases • 2 Main Classes • DNA (nucleotide) The large databases are: • GenBank at NCBI (US), • EMBL at EBI (Europe - UK), • DDBJ (Japan). • Protein • SWISS-PROT/TrEMBL (high level of annotation), PIR (protein identification resource).

Specialized Databases • ESTs (Expressed Sequence Tags) • STSs (Sequence-Tagged Sites) • SNPs (Single Nucleotide Polymorphisms) • Organismal genomic databases: human (GDB), mouse (MGD), yeast (SGD), fly • HTGS (High Throughput Genomic Sequences) • RNA • tRNAs, rRNAs, small RNAs & others

Specialized Databases • Protein families • PROSITE, PRINTS, BLOCKS • Pathways: metabolic, regulatory etc. • EMP, PathDB • Microarray data: expression data • 4 major: GeneX, ArrayExpress, Stanford, Gene Expression Omnibus (GEO) To find specialized databases: http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm#

Types of Database • Primary: archival • experimental data with some annotation (interpretation) • Secondary: curated

What is annotation? • Extraction, definition and interpretation of features on the genome sequence • Derived by integrating computational tools and biological knowledge • for example, known and predicted genes • Some databases are referred to as “annotated databases” • means that they contain sequence, comments, literature references, notes on experiments…

Curated Databases • Records are added only after they have been through a curation process • checked for accuracy, additional information (annotation) • scientific judgments are made as data are cleaned up and merged • Examples of curated databases: • SWISS-PROT, OMIM, RefSeq, LocusLink

SWISS-PROT http://www.expasy.ch/sprot/ • SWISS-PROT is a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

Organismal Databases These databases often serve a specific research community • Human • Mouse • Drosophila • C. elegans • Yeast • Livestock • Arabidopsis • Maize • Plasmodium • Other http://tolweb.org/tree/home.pages/linksdb.html#organismal

Multi-Organism Resources www.ncbi.nlm.nih.gov www.tigr.org www.expasy.org

Biological Databases Types of Database Search • Text-based database search (SRS, Entrez) • Sequence-based database search (sequence similarity search) (BLAST, FASTA...) • Motif-based database search (ScanProsite, eMOTIF) • Structure-based database search (structure similarity search) (VAST, DALI...)

Database Search Tools Text-based: querying the annotation • SRS6 at http://srs6.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top • ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez/ • DBGET/LinkDB at http://www.genome.ad.jp/dbget-bin/www_bfind?linkdb
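
A text-based search can also be issued programmatically. The sketch below only builds a query URL and does not contact any server. It assumes NCBI's E-utilities `esearch` endpoint, which is not named in the slides; the gene and organism values are purely illustrative, and `build_entrez_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_entrez_query(db, terms, retmax=20):
    """Return an esearch URL; `terms` maps an Entrez field name to a value."""
    # Entrez field tags follow the value in square brackets, e.g. BRCA1[Gene Name].
    term = " AND ".join(f"{value}[{field}]" for field, value in terms.items())
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Illustrative query: restrict a nucleotide search by gene name and organism.
url = build_entrez_query(
    "nucleotide", {"Gene Name": "BRCA1", "Organism": "Homo sapiens"}
)
print(url)
```

Restricting terms to annotation fields is what makes this a query of the annotation rather than of the sequence itself.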

Sequence-based Searches • Considerations: • Should I compare DNA or protein sequences? • More random matches with DNA • Protein “matches” may be more relevant • DNA databases are larger • Sensitivity vs. Selectivity • Sensitivity: the ability to find true positive matches, even at the cost of admitting some false positives • Selectivity: the ability to reject false positives • Trade-off when choosing an algorithm
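
The sensitivity/selectivity trade-off can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: sensitivity is taken as TP / (TP + FN) and selectivity as TN / (TN + FP), which is one common reading of the slide's definitions. The sequence identifiers are made up, `search_quality` is a hypothetical helper, and the function assumes the database contains at least one true homolog and one non-homolog.

```python
def search_quality(hits, true_homologs, database):
    """Return (sensitivity, selectivity) for one database search result."""
    hits, true_homologs, database = set(hits), set(true_homologs), set(database)
    tp = len(hits & true_homologs)             # homologs that were reported
    fn = len(true_homologs - hits)             # homologs that were missed
    fp = len(hits - true_homologs)             # non-homologs that were reported
    tn = len(database - hits - true_homologs)  # non-homologs correctly left out
    sensitivity = tp / (tp + fn)
    selectivity = tn / (tn + fp)
    return sensitivity, selectivity

# Made-up database of ten sequences, four of which are true homologs.
database = [f"s{i}" for i in range(1, 11)]
homologs = ["s1", "s2", "s3", "s4"]

loose = search_quality(["s1", "s2", "s3", "s4", "s5", "s6"], homologs, database)
strict = search_quality(["s1", "s2"], homologs, database)
print("loose cutoff :", loose)
print("strict cutoff:", strict)
```

In this toy case the loose cutoff finds every homolog but admits two false positives, while the strict cutoff admits none but misses half of the homologs.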

Database Search Tools Sequence-Based • FASTA (FASTA at EBI, UK) • BLAST (Basic local alignment search tool at NCBI, USA) • MPsrch (Smith-Waterman algorithm-based search at EBI, UK)
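
For reference, the Smith-Waterman recurrence behind rigorous local-alignment searches can be sketched in a few lines. This is a teaching sketch, not how MPsrch is implemented: it assumes a simple match/mismatch score and a linear gap penalty (the default values are arbitrary), and it returns only the best score rather than the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACGTXX", "ACGT"))
print(smith_waterman_score("AAAA", "TTTT"))
```

The full matrix fill costs time proportional to the product of the two lengths, which is why heuristic tools such as FASTA and BLAST are used for routine searches of large databases.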

More Sequence-based Tools • BLAST Microbial Genomes at http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html • (Search finished and unfinished genomic sequences at NCBI) • Genome and proteome FASTA (at EBI, UK) at http://www2.ebi.ac.uk/fasta3/genomes.html

More Sequence-based Tools • Protein search in genomes at http://searchlauncher.bcm.tmc.edu/seq-search/protein-search-genomes.html (BLAST and FASTA species-specific protein sequence searches at Baylor College of Medicine, USA) • SectionSearch (FASTA or TFASTA search against predefined sections of sequence databanks at IUBIO Indiana, USA) • NRL-3D at http://pir.georgetown.edu/pirwww/search/searchseq.html (Sequence-structure database search at Johns Hopkins University, USA)

Tools to Search Special Databases for Sequences with Similar Motifs or Patterns • ProfileScan • uses pfscan to find similarities between a query sequence and a profile library • PROSITE is one such library • an ExPASy database (Expert Protein Analysis System, http://www.expasy.ch/) • similarities are based on fingerprints or common patterns
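
Pattern-based matching can be illustrated by translating a PROSITE-style pattern into a regular expression. This is a simplified sketch, not the pfscan profile method: it handles the basic pattern syntax (`x` for any residue, `[..]` allowed residues, `{..}` excluded residues, `x(n,m)` repeats, `<` and `>` terminus anchors) and ignores rarer cases. The protein sequence is made up, and `prosite_to_regex` is a hypothetical helper name.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)

# N-glycosylation site pattern: N, then not P, then S or T, then not P.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # made-up sequence for illustration
matches = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, matches)
```

The second asparagine in the example is not reported because it is followed by a proline, which the pattern excludes.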

BLOCKS Database • a block is a motif or region of similar structure • no gaps are introduced • a block refers to the alignment, not the individual sequences • BLOCKS database is derived from PROSITE • searches can be done at Fred Hutchinson Cancer Center in Seattle
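
Because a block is an ungapped alignment, it can be summarized column by column and slid along a query without introducing gaps. The sketch below is a deliberate simplification: it scores with raw column frequencies, whereas real block searches use weighted position-specific scoring matrices. The block and query are made up, the function names are hypothetical, and the query is assumed to be at least as long as the block.

```python
from collections import Counter

def block_profile(block):
    """Per-column residue frequencies for an ungapped block."""
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("a block has no gaps, so all rows must be equally long")
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        profile.append({res: n / len(block) for res, n in counts.items()})
    return profile

def best_window(sequence, profile):
    """Return (offset, score) of the best-scoring ungapped window."""
    width = len(profile)
    scored = [
        (sum(profile[k].get(sequence[i + k], 0.0) for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, offset = max(scored)
    return offset, score

# Made-up block of four aligned segments and a made-up query sequence.
block = ["GDSGG", "GDSGG", "GDAGG", "GNSGG"]
offset, score = best_window("MKTGDSGGPLV", block_profile(block))
print(offset, score)
```

The profile describes the alignment as a whole, which matches the slide's point that a block refers to the alignment and not to any individual sequence.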

3 Major Portals into the Genome Data • UCSC Genome Browser at Univ. of California Santa Cruz • http://genome.ucsc.edu/ • Ensembl at European Bioinformatics Inst (EBI) • http://www.ensembl.org • Entrez at NCBI • http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Databases • PubMed: The biomedical literature • The PubMed database contains Medline abstracts as well as links to full-text articles on sites maintained by journal publishers • Nucleotide sequence database (GenBank) • Protein sequence database • Structure: three-dimensional macromolecular structures • Genome: complete genome assemblies • PopSet: population study data sets

Entrez Databases • OMIM: Online Mendelian Inheritance in Man • Taxonomy: organisms in GenBank • Books: online books • ProbeSet: Gene Expression Omnibus (GEO) • 3D Domains: domains from Entrez Structure

Entrez sequence searching • can find sequences for a given gene or protein • can download copy of sequence
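
Downloading a copy of a sequence usually means retrieving it in FASTA format and reading it into a program. This sketch builds a retrieval URL without contacting any server, and parses FASTA text. It assumes NCBI's E-utilities `efetch` endpoint, which is not named in the slides; `EXAMPLE_ACC` is a placeholder rather than a real accession, and the FASTA record is made up.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "efetch" endpoint (not mentioned in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_fetch_url(accession, db="nucleotide"):
    """Return an efetch URL asking for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

fetch_url = build_fetch_url("EXAMPLE_ACC")  # placeholder accession
sample = ">EXAMPLE_ACC made-up record\nACGTACGT\nTTGA\n"
records = parse_fasta(sample)
print(fetch_url)
print(records)
```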

NCBI BLAST NCBI offers several “flavors” of BLAST

The Take Home Lessons • Search often, search with multiple parameters • Use specialized DBs where possible, use protein sequence if appropriate • There are many tools available. • You must know what tools are relevant. • You must know how to use available tools. • Look for sites that have multiple resources • e.g. Bio-Mirror at http://www.bio-mirror.net/ • Google is your best friend.
