
Review of Biological Database Utilization


julie

Presentation Transcript


  1. Review of Biological Database Utilization

  2. Biological Databases We will discuss: • Usefulness to the bioinformaticist • Database types • Search methods and tools

  3. Importance of the Public Databases • The data provide the basis for sequence-based biology • Open access is key • Supported by Human Genome Project, International Nucleotide Sequence Database Collaboration and others • The amount of biological data is enormous • Biologists are dependent on computers for storing, organizing, searching, manipulating, and retrieving the data/information

  4. Why Search Biological Databases? • Generate new sequence • Is it already in bank? • Homologous sequences? • Find out about the gene • Annotation • Literature

  5. Why Search Biological Databases? • Similar non-coding sequences • Repetitive elements • Regulatory regions • Homologous proteins; protein families • Identify and verify PCR priming sites

  6. Biological Databases Types of Databases • Generalized databases (DNA, proteins and carbohydrates, 3D-structures) • Specialized databases (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)

  7. Generalized Databases • 2 Main Classes • DNA (nucleotide) The large databases are: • GenBank at NCBI (US), • EMBL at EBI (Europe - UK), • DDBJ (Japan). • Protein • SWISS-PROT/TrEMBL (high level of annotation), PIR (protein identification resource).

  8. Specialized Databases • ESTs (Expressed Sequence Tags) • STSs (Sequence-Tagged Sites) • SNPs (Single Nucleotide Polymorphisms) • Organismal genomic databases: human (GDB), mouse (MGD), yeast (SGD), fly • HTGS (High Throughput Genomic Sequences) • RNA • tRNAs, rRNAs, small RNAs & others

  9. Specialized Databases • Protein families • PROSITE, PRINTS, BLOCKS • Pathways: metabolic, regulatory, etc. • EMP, PathDB • Microarray data: expression data • 4 major: GeneX, ArrayExpress, Stanford, Gene Expression Omnibus (GEO) To find specialized databases: http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm#

  10. Types of Database • Primary: archival • experimental data with some annotation (interpretation) • Secondary: curated

  11. What is annotation? • Extraction, definition and interpretation of features on the genome sequence • Derived by integrating computational tools and biological knowledge • for example, known and predicted genes • Some databases are referred to as “annotated databases” • means that they contain sequence, comments, literature references, notes on experiments…

  12. Curated Databases • Records are added only after they have been through a curation process • checked for accuracy, additional information (annotation) • scientific judgments are made as data are cleaned up and merged • Examples of curated databases: • SWISS-PROT, OMIM, RefSeq, LocusLink

  13. SWISS-PROT http://www.expasy.ch/sprot/ • SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

  14. Organismal Databases • These databases often serve a specific research community • Human • Mouse • Drosophila • C. elegans • Yeast • Livestock • Arabidopsis • Maize • Plasmodium • Other http://tolweb.org/tree/home.pages/linksdb.html#organismal

  15. Multi-Organism Resources www.ncbi.nlm.nih.gov www.tigr.org www.expasy.org

  16. Biological Databases Types of Database Search • Text-based database search (SRS, Entrez) • Sequence-based database search (sequence similarity search) (BLAST, FASTA...) • Motif-based database search (ScanProsite, eMOTIF) • Structure-based database search (structure similarity search) (VAST, DALI...)

  17. Database Search Tools Text-based :querying the annotation • SRS6 at http://srs6.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top • ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez/ • DBGET/LinkDB at http://www.genome.ad.jp/dbget-bin/www_bfind?linkdb
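A text-based query can also be sent programmatically. The sketch below is illustrative only and is not taken from the slides: it assumes NCBI's E-utilities ESearch endpoint as the programmatic counterpart of the Entrez web form, and the helper name `build_esearch_url` and the example query are hypothetical. It only builds the query URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities base address (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that queries the annotation text of one Entrez database.

    db     -- Entrez database name, e.g. "nucleotide" or "protein"
    term   -- text query; field tags such as [Organism] narrow the search
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical example query: human BRCA1 records in the nucleotide database.
url = build_esearch_url("nucleotide", "BRCA1[Gene Name] AND human[Organism]", retmax=5)
print(url)
```

The point of the sketch is that a text-based search matches words in the annotation fields of each record, not the sequence itself, so the quality of the hits depends on how well the records are annotated.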

  18. Sequence-based Searches • Considerations: • Should I compare DNA or protein sequences? • More random matches with DNA • Protein “matches” may be more relevant • DNA databases are larger • Sensitivity vs. selectivity • Sensitivity: the ability to find true positive matches, even at the cost of admitting some false positives • Selectivity: the ability to reject false positives • There is a trade-off between the two when choosing an algorithm
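The trade-off between sensitivity and selectivity can be made concrete with a small calculation. This is a sketch under stated assumptions: sensitivity is taken as TP / (TP + FN), and selectivity is taken as TP / (TP + FP), which is one common formalization (other texts define it differently). The hit counts are invented for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of the true homologs in the database that the search reported."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos: int, false_pos: int) -> float:
    """Fraction of the reported hits that are true homologs (assumed definition)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical results for the same query run with two parameter settings.
# Permissive settings: most true homologs are found, along with many spurious hits.
permissive = {"tp": 90, "fn": 10, "fp": 60}
# Strict settings: fewer spurious hits, but more true homologs are missed.
strict = {"tp": 70, "fn": 30, "fp": 5}

for name, r in (("permissive", permissive), ("strict", strict)):
    print(name,
          "sensitivity=%.2f" % sensitivity(r["tp"], r["fn"]),
          "selectivity=%.2f" % selectivity(r["tp"], r["fp"]))
```

With these made-up counts, loosening the search raises sensitivity and lowers selectivity, and tightening it does the opposite.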

  19. Database Search Tools Sequence-Based • FASTA (FASTA at EBI, UK) • BLAST (Basic local alignment search tool at NCBI, USA) • MPsrch (Smith-Waterman algorithm-based search at EBI, UK)

  20. More Sequence-based Tools • BLAST Microbial Genomes at http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html • (Search finished and unfinished genomic sequences at NCBI) • Genome and proteome FASTA (at EBI, UK) at http://www2.ebi.ac.uk/fasta3/genomes.html

  21. More Sequence-based Tools • Protein search in genomes at http://searchlauncher.bcm.tmc.edu/seq-search/protein-search-genomes.html (BLAST and FASTA species-specific protein sequence searches at Baylor College of Medicine, USA) • SectionSearch (FASTA or TFASTA search against predefined sections of sequence databanks at IUBIO Indiana, USA) • NRL-3D at http://pir.georgetown.edu/pirwww/search/searchseq.html (Sequence-structure database search at Johns Hopkins University, USA)

  22. Tools to Search Special Databases for Sequences with Similar Motifs or Patterns • ProfileScan • uses pfscan to find similarities between a query sequence and a profile library • PROSITE is one such database • an ExPASy database (Expert Protein Analysis System, http://www.expasy.ch/) • similarities are based on fingerprints or common patterns
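To illustrate what a pattern-based search does, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It is a simplified illustration, not the pfscan or ScanProsite implementation: it handles only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts, and terminal anchors), and the function names and the example sequence are hypothetical. The pattern used is the N-glycosylation site pattern `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                 # patterns may end with a period
    regex = regex.replace("-", "")              # '-' only separates positions
    regex = regex.replace("x", ".")             # 'x' matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything except P
    # (n) or (n,m) are repeat counts; rewrite them as regex quantifiers.
    regex = re.sub(
        r"\((\d+)(?:,(\d+))?\)",
        lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
        regex,
    )
    # '<' and '>' anchor the pattern to the N- and C-terminus.
    return regex.replace("<", "^").replace(">", "$")


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


# Hypothetical protein fragment; "NPS" does not match because P follows N.
hits = find_motif("MKNGSAANPSLNATQ", "N-{P}-[ST]-{P}")
print(hits)
```

Short patterns like this one match by chance quite often, which is one reason profile-based methods such as ProfileScan score similarity across a whole region instead of requiring an exact pattern match.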

  23. BLOCKS Database • a block is a motif or region of similar structure • no gaps are introduced • a block refers to the alignment, not the individual sequences • BLOCKS database is derived from PROSITE • searches can be done at Fred Hutchinson Cancer Center in Seattle
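Because a block is an ungapped alignment, a query can be compared against it by sliding a window of the block's width along the query and scoring each position by column. The sketch below is a deliberately simplified illustration of that idea: it scores with raw per-column residue frequencies, whereas real block searches use weighted position-specific scoring matrices. The block, the query, and the function names are all hypothetical.

```python
from collections import Counter


def column_frequencies(block):
    """Return, for each column of an ungapped block, a dict of residue frequencies."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in a block must have the same width (no gaps)")
    freqs = []
    for i in range(width):
        counts = Counter(segment[i] for segment in block)
        freqs.append({res: n / len(block) for res, n in counts.items()})
    return freqs


def best_window(query, freqs):
    """Slide an ungapped window along the query; return (0-based start, score) of the best one."""
    width = len(freqs)
    best_start, best_score = None, float("-inf")
    for start in range(len(query) - width + 1):
        # Sum the block frequency of the query residue at each column.
        score = sum(freqs[i].get(query[start + i], 0.0) for i in range(width))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical block of four aligned segments, each four residues wide.
block = ["GKST", "GKTT", "GRST", "AKST"]
freqs = column_frequencies(block)
result = best_window("MAGKSTLL", freqs)
print(result)
```

Note that the scores belong to the alignment columns, not to any single sequence in the block, which matches the slide's point that a block refers to the alignment.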

  24. 3 Major Portals into the Genome Data • UCSC Genome Browser at Univ. of California Santa Cruz • http://www.sequenceanalysis.com/ • Ensembl at European Bioinformatics Institute (EBI) • http://www.ensembl.org • Entrez at NCBI • http://www.ncbi.nlm.nih.gov/Entrez/

  25. Entrez Databases • PubMed: the biomedical literature • The PubMed database contains MEDLINE abstracts as well as links to full-text articles on sites maintained by journal publishers • Nucleotide sequence database (GenBank) • Protein sequence database • Structure: three-dimensional macromolecular structures • Genome: complete genome assemblies • PopSet: population study data sets

  26. Entrez Databases • OMIM: Online Mendelian Inheritance in Man • Taxonomy: organisms in GenBank • Books: online books • ProbeSet: Gene Expression Omnibus (GEO) • 3D Domains: domains from Entrez Structure

  27. Entrez sequence searching • can find sequences for a given gene or protein • can download copy of sequence

  28. NCBI BLAST NCBI offers several “flavors” of BLAST

  29. NCBI BLAST NCBI offers several “flavors” of BLAST

  30. The Take Home Lessons • Search often, search with multiple parameters • Use specialized DBs where possible, use protein sequence if appropriate • There are many tools available. • You must know what tools are relevant. • You must know how to use available tools. • Look for sites that have multiple resources • e.g. Bio-Mirror at http://www.bio-mirror.net/ • Google is your best friend.
