1 / 33

The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information

Database Resources . The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information. www.ncbi.nih.gov. NCBI Mission.

hera
Download Presentation

The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Database Resources The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information www.ncbi.nih.gov

  2. NCBI Mission • …. to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. • What does this involve ? • creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; • facilitating the use of such databases and software by the research andmedical community; • coordinating efforts to gather biotechnology information both nationallyand internationally; • performing research into advanced methods ofcomputer-based information processing for analyzing the structure and function of biologically important molecules.

  3. What is a Database ? • A model or representation of some aspect of the real world • An organized collection of data. May contain many different types of data • Coherent, consistent and designed for a specific purpose • A computational system for managing and querying the data.

  4. What is a Database ? • A collection of information organized in such a way that a computer program can quickly select desired pieces of data. • An electronic filing system • Traditional databases are organized by fields, records, and files. • A field is a single piece of information; • a record is one complete set of fields • a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.

  5. What is a Database ? • To access information from a database, you need a database management system (DBMS). This is a collection of programs that enables you to enter, organize, and select data in a database. • Most molecular biology databases primarily use relational database management systems (RDBMS).

  6. Relational Database • A relational database is like a large spreadsheet. Each field is a column, each row is an entry. Relational databases use a set of tables to organize data. • Each entry must be unambiguously identified • Names are not reliable e.g. incorrectly assigned gene function • Unique IDs (UID)s are used, e.g. in GenBank these are called accession numbers

  7. Relational Database • Achieving consistency • Repeated information is stored in a single place. • Only one copy needs to be updated TaxonomyTaxonomy ID*GenusSpecies *May be referred to by a secondary ID SequenceUIDDefinitionLocusAccessionTaxonomy ID*Sequence Ref IndexMedline IDAuthorsTitleJournal Ref Index*UIDMedline ID * May be referred to indirectly via an index

  8. Relational Database • Language used is SQL or structured query language • Easy to understand (essentially English?) • Relatively consistent across RBDMS • Supplies a set of commands to define tables, insert data and make queries • Queries • SELECT some fields FROM some table WHERE some condition is met • E.g. select accession, sequence FROM sequence WHERE Accession = BU039022BU039022 CATACAAATACTGCTACHTAAATC …. • More complex queries require two or more tables be joined to produce a result

  9. Relational Database • Most RDBMS do not allow users to directly query the database by SQL. • An ill formed query can overload or crash the system • SQL still too complex for biologists? • Provide a search interface for the user instead • E.g. user enters a phrase and the database identifies what part of the database should be searched. • The queries that make it through the web interface have to be translated to SQL

  10. Relational database : Example GenBank Query

  11. What Constitutes a Good Database ? • Broad coverage of the chosen topic • Up to date information gathering • Curated • Support staff • Commitment to the future • Good query interface Issues for Molecular Biological Databases ? • Annotation • Archives • Updates • Redundancy

  12. Issues for Molecular Biological Databases ? • Annotation • Adding biological information to genome sequence. Textual descriptive information • Correctness • Many genes are incorrectly annotated. May assign a function to a novel gene from a similar sequence that may itself be incorrectly annotated so the error is propagated throughout the database. • Routine error • Quality • Expert or non expert curation? Who provided the curation? • Is there any biological verification? • What vocabulary is used • Has their been any peer review ?

  13. Issues for Molecular Biological Databases ? • Archival Quality • Is the database archival or curated • Can the same data be recovered later • Don’t overwrite primary key (each accession numbers) • The best databases note any changes to the data. • Updates • How often is the database updated? • Major databases take direct submissions • Only the direct submitter can make changes, even if you can prove its wrong. • When is a sequence finished ? • How is annotation updated as more knowledge is available • Redundancy • This is a major issue, how do we deal with it without losing potentially valuable information. • Also relates to archival quality

  14. NCBI and GenBank • Genbank is the genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information in them. • GenBank is hosted within NCBI • Researchers submit their sequences to GenBank • NCBI provides analysis and retrieval resources for the data in GenBank (and many other NCBI hosted databases).

  15. NCBI Databases (http://www.ncbi.nlm.nih.gov/guide/all/#Databases_) • SNP (dbSNP) • dbVAR – large scale genomic variation • dbGAP– integration of genotype & phenotype • PopSetDatabase • Taxonomy Database • GEO Profiles • GEO Datasets • Cancer Chromosomes • Epigenomics • PubMedCentral • Journals • MeSH • Bookshelf • OMIM Database • Nucleotide Database • EST (dbEST) • GSS (dbGSS) • Protein Database • Structure Database • Genome • 3D Domains • Conserved Domains • UniSTS • Gene • UniGene • HomoloGene • Reference Sequence (refseq)

  16. Retrieving Data from NCBIusing Entrez • Entrez is a text based retrieval system that integrates all the information resources available at the NCBI such as; • Scientific literature • DNA and protein sequence databases • 3D protein structure and protein domain data • Population study datasets • Expression data • Assemblies of complete genomes • Taxonomic information

  17. http://www.ncbi.nlm.nih.gov/guide/all/#howto_

  18. Create/login to the myNCBI portal

  19. Understanding GenBank records Go to http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#ModificationsDateB Click on the links on the left to get a description of what the term means, Copy the description into a word document and after completed, save the document on your drupalweb site

  20. Entrez Sequences Help http://www.ncbi.nlm.nih.gov/books/NBK44864/

More Related