1 / 16

Sajid Khan

Sajid Khan . Chapter 2. Databases. Bioinformatics databases. Database Nomenclature. . WHAT IS DATABASE ? . Data repositories, data marts, and data warehouses differ primarily in the diversity of data sources that contribute to their contents. Database Models.

denna
Download Presentation

Sajid Khan

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sajid Khan

  2. Chapter 2. Databases

  3. Bioinformatics databases

  4. Database Nomenclature. WHAT IS DATABASE ? Data repositories, data marts, and data warehouses differ primarily in the diversity of data sources that contribute to their contents.

  5. Database Models Defines data organization (schema) Relational Entities and relationships stored in tables Predefined schema Examples: Oracle, DB2, MySQL, PostgreSQL Object-oriented Stores data as objects (i.e., structures with predefined type) Examples: Versant, Jasmine, Objectivity Semi-structured Schema dynamically defined within data (self-describing) Flexible description of data with complex relationships Example: XML databases Bioinformatic Databases Useful information 0 DNA sequences 0 Conserved DNA domains 0 Genomes 0 Gene expression (ESTs, microarrays) 0 Protein sequences 0 Protein 3D structure 0 Protein families 0 Mutations / polymorphisms / SNPs 0 Metabolic pathways 0 Chemical compounds (ligands) 0 Biomedical literature (journal papers, online books…)

  6. Classification schemes • 0 Database design – relational, object-oriented… • 0 Data type – DNA, RNA, EST, protein… • 0 Organism – bacteria, virus, human… • 0 Accessibility – public, academic, commercial • 0 Data source – primary, derived • 0 Data entry – manually curated, computational derived • 0 Focus – sequence-oriented, gene-oriented • Resulting in many bioinformatics databases… • Bioinformatics Databases Issues • Naming • 0 Multiple names for same chemical • Arising from multiple • biological disciplines, conventions • 0 Example

  7. Bioinformatics Databases Issues • Redundancy • 0Multiple entries for same DNA / protein sequence • 0Arising from multiple experiments & biological disciplines • Example – redundant GenBank entries for E. coli dUTPase • 4 separate publications (X01714, V01578, L10328, AE000441) • Data annotation & formats • 0Multiple data for single geneSequence, location, expression, structure, function… • 0Resulting in multiple data annotations & formats • Data integration • 0Combining data from multiple bioinformaticdatabases • Redundancy • 0 Multiple entries for same DNA / protein sequence • 0 Arising from multiple experiments & biological disciplines • 0 Example – redundant GenBank entries for E. coli dUTPase • 4 separate publications (X01714, V01578, L10328, AE000441) • Data annotation & formats • 0 Multiple data for single gene • Sequence, location, expression, structure, function… • 0 Resulting in multiple data annotations & formats • Data integration • 0 Combining data from multiple bioinformatic databases

  8. Major Bioinformatics Databases DNA sequences 0 GenBank, RefSeq, UniGene 􀂋 Protein sequences 0 Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq 􀂋 Protein structure 0 Protein Data Bank (PDB) 􀂋 Gene expression 0 Gene Expression Omnibus (GEO) 􀂋 Biomedical publications 0 PubMed / MedLine Derived databases 0 Compiled from data in primary databases 0 Manually curated (human selection & correction) 􀁺 Advantages – high quality 􀁺 Disadvantages – high expense, low volume 􀁺 Examples 􀂋 Swiss-Prot, PIR-PSD, RefSeq 0 Computational derivation (automatically generated) 􀁺 Advantages – inexpensive, up-to-date 􀁺 Disadvantages – lower quality 􀁺 Examples 􀂋 GenPept, TrEMBL, UniGene, COGs Primary databases 0 Original submissions by researchers 0 Staff organizes information only 0 Generally sequence oriented 0 Examples 􀁺 GenBank, PDB

  9. Data Management /Databases Connection Data Management. In this data-management scenario for a pharmacogenomiclaboratory, data of various types are acquired from a variety of sources, incorporated into the data warehouse, used by a variety of applications, and archived for future use. Data created locally may be published electronically, serve as the basis for a paper publication, and may be used in a variety of applications, from drug discovery to genetic engineering.

  10. Pharmacogenomicsand Aggression Typical Electronic Medical Record (EMR) Contents. The EMR contains both objective signs, such as physical examination findings, as well as subjective patient symptoms, including chief complaint and review of systems. Aggression To illustrate the data-management issues associated with a biotech research effort that depends on multiple, disparate systems and accompanying databases, assume that the laboratory depicted focuses on understanding the genetic basis for aggression, with a goal of creating new, more effective medications to control the behavior.

  11. RNA-Protein Codon Transcription Wheel. The 64 possible codons represent the 20 common amino acids, as well as one start (ATG) and three stop (TAG, TAA and TGA) markers. Redundancies normally occur in the last nucleotide of the three-letter alphabet.

  12. Integration of Clinical Data To create an EMR capable of supporting efficient data mining, a data dictionary is used to impose a standard format and vocabulary on data stored in the clinical data mart. Integration of Bioinformatics Data. Like clinical data, bioinformatics data from a variety of sources and in numerous formats are combined in a data mart to enhance data management.

  13. From Data to Knowledge Search / retrieval tool for multiple linked databases 0 Papers biomedical literature (PubMed) 0 Nucleotide sequence database (GenBank) 0 Protein sequence database 0 Structure 3D macromolecular structures 0 Genome complete genome assemblies 0 OMIM Online Mendelian Inheritance in Man 0 Taxonomy organisms in GenBank 0 ProbeSet gene expression and microarray datasets Common identifiers for bioinformatic data 0 Locus name 0 Accession numbers 0 GenInfo ID 0 PubMed ID Original identifiers of GenBank records 0 LOCUS line in GenBank entries Originally 0 First 3 letters of organism followed by code for gene Example 0 HUMBB for human ß-globin region Problems 0 Unmaintainable due to growth of data 0 Homologous genes not named the same Data is stored / presented in a variety of formats 0 FASTA 0 GenBank 0 SwissProt 0 ASN.1 0 XML

  14. Bioinformatic Database Formats Defined by SWISS-PROT database 0 Includes annotation, other info 􀂋 Example ID BRC1_MOUSE STANDARD; PRT; 1812 AA. AC P48754; Q60957; Q60983; DT 01-FEB-1996 (Rel. 33, Created) DT 01-NOV-1997 (Rel. 35, Last sequence update) DT 16-OCT-2001 (Rel. 40, Last annotation update) DE Breast cancer type 1 susceptibility protein homolog. GN BRCA1. OS Musmusculus (Mouse). OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus. OX NCBI_TaxID=10090; RN [1] RP SEQUENCE FROM N.A. RC STRAIN=C57BL/6; TISSUE=Embryo; RX MEDLINE=96177659; PubMed=8634697; RA Abel K.J., Xy J., Yin G.Y., Lyons R.H., Meisler M.H., Weber B.L.; RT "Mouse Brca1: localization sequence analysis and identification of RT evolutionarily conserved domains."; RL Hum. Mol. Genet. 4:2265-2273(1995)… Used by FASTA tools 􀂋 Comment line followed by sequence data 0 No annotation, just sequence 􀂋 Example >gi|1040960|gb|U35641.1|MMU35641 Musmusculus Brca1 mRNA… GGCACGAGGATCCAGCACCTCTCTTGGGGCTTCTCCGTCCTCGGCGCTTGGAAGTAC GATCTTTTTTCTCGGAGAAAAGTTCACTGGAACTGGAAGAAATGGATTTATCTGCC GTCCAAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATCTTAGAGTGT CCGATCTGTTTGGAACTGATCAAAGAACCTGTTTCCACAAAGTGTGACCACATATTT TGCAAATTTTGTATGCTGAAACTTCTTAACCAGAAGAAAGGGCCTTCACAATGTCCT TTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAGGGAAGCACAAGGTTTAGTCAG Flat file format used by GenBank 0 Annotation, author, version, etc… 􀂋 Example (just the top) LOCUS MMU35641 5538 bp mRNA linear ROD 18-OCT-1996 DEFINITION Musmusculus Brca1 mRNA, complete cds. ACCESSION U35641 VERSION U35641.1 GI:1040960 KEYWORDS . SOURCE house mouse strain=C57Bl/6. ORGANISM Musmusculus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus. REFERENCE 1 (bases 1 to 5538) AUTHORS Sharan,S.K., Wims,M. and Bradley,A. TITLE Murine Brca1: sequence and significance for human missense mutations JOURNAL Hum. Mol. Genet. 4 (12), 2275-2278 (1995) MEDLINE 96177660 PUBMED 8634698

  15. Bioinformatic Database Formats International standard 0 Semi-structured format 0 Base format for NCBI data 􀂋 Example Seq-entry ::= set { level 1 , class nuc-prot , descr { title "Musmusculus Brca1 mRNA, and translated products" , source { org { taxname "Musmusculus" , db { { db "taxon" , tag id 10090 } } , orgname { name binomial { genus "Mus" , species "musculus" } , … eXtensibleMarkup Language 0 Open standard for semi-structured data, uses tags like HTML 0 Document split into content (XML), style (XSL), linking (XLL) 􀂋 Example <?xml version="1.0"?> <!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN" “http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd"> <GBSet> <GBSeq> <GBSeq_locus>MMU35641</GBSeq_locus> <GBSeq_length>5538</GBSeq_length> <GBSeq_strandedness value="not-set">0</GBSeq_strandedness> <GBSeq_moltype value="mrna">5</GBSeq_moltype> <GBSeq_topology value="linear">1</GBSeq_topology> <GBSeq_division>ROD</GBSeq_division> <GBSeq_update-date>18-OCT-1996</GBSeq_update-date> <GBSeq_create-date>25-OCT-1995</GBSeq_create-date> <GBSeq_definition>Musmusculus Brca1 mRNA, complete cds</GBSeq_definition> <GBSeq_primary-accession>U35641</GBSeq_primary-accession> <GBSeq_accession-version>U35641.1</GBSeq_accession-version> Format conversion 0 Frequently tools handle only one of the data formats 0 Use software to transform between formats 􀁺 ReadSeq, SeqIO 􀂋 Perl (Practical Extraction and Report Language) 0 Portable C-like interpreted scripting language 0 Powerful pattern matching, string processing operations 0 Frequently used to extract / process bioinformatic data 􀂋 BioPerl 0 Collection of Perl classes designed for bioinformatic tools 0 Sequence analysis, alignment, format conversion, I/O, automate bioinformatic analyses, parse results, create GUIs, manage persistent storage in RDMBS…

  16. Thank You Contact : gpgcm_bc ( Yahoo Group )

More Related