Developing a Guiding System of Biological Database. CSE591 Fall 2003 (AP of AI CMB) By Hiro Takahashi.
Motivation
• Too many bio-information databases are available on the Internet (genes, proteins, enzymes, nomenclatures, taxonomy, etc.).
• You have to know where to find the information you need.
• You have to know how to search each individual database, how to read its results, and how to perform a further search based on those results.
Databases are growing every day
Everyone has a difficult time keeping up with the flow of new information. This is particularly true in biology now as the pace of discovery accelerates. Databases have become an essential tool for accumulating and archiving raw data. They also play a major role in analyzing and presenting information to researchers and the public in an easily accessible form.
By Paul G. Young / NCBI
Goal • To create a more powerful query mechanism and a common interface to get the best answer using existing multiple databases.
Famous bioinformatics databases
• NCBI: National Center for Biotechnology Information.
- LITERATURE DATABASES: PubMed, PubMed Central, Bookshelf, OMIM and PROW.
- ENTREZ DATABASES: Protein sequence database, Nucleotide sequence database, Genomes, Structure, Taxonomy, Population study data sets, Books, ProbeSet, 3D Domains, UniSTS, SNP, CDD, Journals and UniGene.
- NUCLEOTIDE DATABASES: GenBank, EST database, GSS database, HomoloGene, HTG database, SNPs database, RefSeq and STS database.
NCBI cont.
- GENOME-SPECIFIC RESOURCES: Bacteria, Eukaryotic Organelles, Fruit fly, Human, Malaria, Mouse, Nematode, Plant Genomes, Plasmids, Rat, Retroviruses, Viroids, Yeast, Zebrafish.
- TOOLS for DATA MINING: Entrez, LinkOut, Cubby, Citation Matcher (text term); BLAST, Blink (sequence similarity); Taxonomy Browser, TaxTable, ProtTable, TaxPlot (taxonomy).
- TOOLS for Sequence Analysis: COGs, COGnitor, GEO, HomoloGene, CDD, LocusLink, MGC, Clone Registry, Trace Archive, ORF Finder, VecScreen, e-PCR.
NCBI cont.
- Tools for 3D Structure Display and Similarity Searching: CD-Search, Cn3D, Domain Architecture Retrieval Tool, VAST Search, Threading.
- MAPS: Map Viewer, Arabidopsis Map, Fruit Fly Map, GeneMap ’99, Human Map, Human-Mouse Homology Maps, Malaria Map, Model Maker, Mosquito Map, Mouse Map, Nematode Map, OMIM Gene Map, OMIM Morbid Map, Rat Map, Zebrafish Map.
- COLLABORATIVE CANCER RESEARCH
- FTP Download
- Statistics
Recommended tutorial on NCBI: http://bcs.whfreeman.com/mga2e/bioinformatics/ch01/bridging_page.htm
Famous Databases cont.
• EMBL-EBI: European Bioinformatics Institute, http://www.ebi.ac.uk
• GenomeNet: Bioinformatics Center / Institute for Chemical Research, Kyoto University, http://www.genome.ad.jp/
List of databases
• Database for Metabolic Pathways
• Database for Enzymes, Compounds and Reactions
• Database for Regulatory Pathways
• Database for Protein-Protein Interactions
• Database for Transcription Factors
• Database for Gene Expression Pattern
• Database for Nomenclature (General)
• Database for Nomenclature (Organism - specific)
• Database for Nomenclature (Protein – specific)
• Database for Taxonomy
• Database for Complete Genomes and Analysis
Database for Metabolic Pathways
KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/metabolism.html
EMP - Enzymes and Metabolic Pathways: http://emp.mcs.anl.gov/
WIT - Metabolic Reconstruction: http://wit.mcs.anl.gov/WIT2/
UM-BBD - Microbial Biocatalysis/Biodegradation: http://umbbd.ahc.umn.edu/
EcoCyc - E. coli Genes and Metabolism: http://www.ecocyc.org/
Metalgen - Genes and Metabolism: http://indigo.genetique.uvsq.fr/
Boehringer Mannheim - Biochemical Pathways: http://www.expasy.org/cgi-bin/search-biochem-index
IUBMB-Nicholson Minimaps: http://www.tcd.ie/Biochemistry/IUBMB-Nicholson/
Database for Enzymes, Compounds and Reactions
LIGAND - Biochemical Compounds and Reactions: http://www.genome.ad.jp/ligand/
ENZYME - Enzymes: http://www.expasy.ch/enzyme/
BRENDA - Comprehensive Enzyme Information System: http://www.brenda.uni-koeln.de/
Worthington Enzyme Manual: http://www.worthington-biochem.com/index/manual.html
Klotho - Biochemical Compounds: http://www.biocheminfo.org/klotho/
ChemFinder - Searching Chemicals: http://chemfinder.camsoft.com/
ChemIDplus at NLM: http://chem.sis.nlm.nih.gov/chemidplus/
PROMISE - Prosthetic Groups and Metal Ions: http://metallo.scripps.edu/PROMISE/
GlycoSuiteDB - Glycan Structure Database: http://www.glycosuite.com/
CarbBank - Complex Carbohydrate Structure Database: http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm
LIPIDBANK for Web - Lipids: http://lipidbank.jp/
WebElements - Periodic Table: http://www.webelements.com/
Database for Regulatory Pathways
KEGG Regulatory Pathways: http://www.genome.ad.jp/kegg/regulation.html
SPAD - Signal Transduction: http://www.grt.kyushu-u.ac.jp/spad/
Yeast Pathways in MIPS: http://www.mips.biochem.mpg.de/proj/yeast/pathways/index.html
Interactive Fly - Drosophila Genes: http://sdb.bio.purdue.edu/fly/aimain/1aahome.htm
GeNet - Gene Networks Database: http://www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm
HOX-Pro - Homeobox Genes Database: http://www.iephb.nw.ru/labs/lab38/spirov/hox_pro/hox-pro00.html
Wnt Signaling Pathway: http://www.stanford.edu/~rnusse/wntwindow.html
TRANSPATH - Gene Regulatory Pathways: http://transpath.gbf.de/
Database for Protein-Protein Interactions
BRITE Database for Biomolecular Relations: http://www.genome.ad.jp/brite/
DIP - Database of Interacting Proteins: http://dip.doe-mbi.ucla.edu/
BIND - Biomolecular Interaction Network Database: http://www.binddb.org/

Database for Transcription Factors
TRANSFAC - Transcription Factor Database: http://transfac.gbf.de/TRANSFAC/index.html
RegulonDB - E. coli Transcriptional Regulation: http://www.cifn.unam.mx/Computational_Genomics/regulondb/
DBTBS - B. subtilis Transcription Factors: http://elmo.ims.u-tokyo.ac.jp/dbtbs/
SCPD - S. cerevisiae Promoter Database: http://cgsigma.cshl.org/jian/
DPInteract - DNA binding proteins: http://arep.med.harvard.edu/dpinteract/
Database for Gene Expression Pattern
Axeldb - Xenopus laevis: http://www.dkfz-heidelberg.de/abt0135/axeldb.htm
NEXTDB - Caenorhabditis elegans: http://nematode.lab.nig.ac.jp/
MAGEST - Halocynthia roretzi: http://www.genome.ad.jp/magest/

Database for Nomenclature (General)
IUBMB Nomenclature: http://www.chem.qmul.ac.uk/iubmb/
IUPAC Nomenclature: http://www.chem.qmul.ac.uk/iupac/
IUPHAR Receptor Nomenclature: http://www.iuphar-db.org/iuphar-rd/
SWISS-PROT Documents: http://www.expasy.ch/sprot/sp-docu.html
Gene Ontology: http://www.geneontology.org/
Database for Nomenclature (Organism - specific)
Human (HUGO): http://www.gene.ucl.ac.uk/nomenclature/
Mouse (MGD): http://www.informatics.jax.org/mgihome/nomen/
Rat (RATMAP): http://rgnc.gen.gu.se/RGNChem.html
D. melanogaster (FlyBase): http://flybase.bio.indiana.edu/docs/nomenclature/lk/nomenclature.html
C. elegans: http://elegans.swmed.edu/Genome/Nomencl2001w.htm
Plants (Mendel): http://www.mendel.ac.uk/
S. cerevisiae (SGD): http://genome-www.stanford.edu/Saccharomyces/registry.html
Database for Nomenclature (Protein – specific)
Alcohol dehydrogenase: http://www.gene.ucl.ac.uk/nomenclature/genefamily/ADH-2.shtml
Protein kinases (PKC): http://pkr.sdsc.edu/html/index.shtml
Phosphodiesterases: http://depts.washington.edu/pde/Nomenclature.html
Glycosyl hydrolases (CAZy) / (ExPASy): http://afmb.cnrs-mrs.fr/CAZY/
Aminoacyl-tRNA synthetases (AARSDB): http://rose.man.poznan.pl/aars/index.html
Cytochrome P450: http://drnelson.utmem.edu/CytochromeP450.html
Metallothionein / (ExPASy): http://www.unizh.ch/~mtpage/classif.html
CD Molecules (PROW) / (ExPASy): http://www.ncbi.nlm.nih.gov/PROW/
Immunoglobulins and T-cell receptors (IMGT): http://imgt.cines.fr/
Cytokines (dbCFC): http://cytokine.medic.kumamoto-u.ac.jp/
Database for Nomenclature (Protein – specific cont.)
Transport proteins: http://www-biology.ucsd.edu/~msaier/transport/
G protein coupled receptors (GPCRDB): http://www.gpcr.org/7tm/
Olfactory receptors (ORDB): http://senselab.med.yale.edu/senselab/ORDB/
Eph and Eph receptors: http://cbweb.med.harvard.edu/eph-nomenclature/
Nuclear receptors: http://www.ens-lyon.fr/LBMC/laudet/nomenc.html
Nuclear receptors (NRR): http://nrr.georgetown.edu/NRR/NRR.html
Mitochondrial proteins: http://mips.gsf.de/proj/medgen/mitop/
Ribosomal proteins (ExPASy): http://www.expasy.ch/cgi-bin/lists?ribosomp.txt
Homeobox proteins / (ExPASy): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8098522&dopt=Abstract
Database for Taxonomy
NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/Taxonomy/
Tree of Life: http://tolweb.org/tree/phylogeny.html
UCMP Phylogeny Exhibit: http://www.ucmp.berkeley.edu/exhibit/phylogeny.html
Ribosomal Database Project II: http://rdp.cme.msu.edu/html/
Database for Complete Genomes and Analysis
NCBI Complete Microbial Genomes: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
TIGR Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
GIB - Genome Information Broker: http://gib.genes.nig.ac.jp/
PIR Complete Genomes: http://www-nbrf.georgetown.edu/pir/genome.html
PEDANT: http://pedant.mips.biochem.mpg.de/
GOLD: http://ergo.integratedgenomics.com/GOLD/
MBGD - Comparative Microbial Genome Database: http://mbgd.genome.ad.jp/
COG - Clusters of Orthologous Groups: http://www.ncbi.nlm.nih.gov/COG/
Example queries that require searching multiple databases
• Homology search
• Prediction of protein secondary structure
Homology Search Example
• A gene called BCL2 (related to the apoptosis process).
• Using its DNA sequence, find similar genes and determine their evolutionary and functional relationships.
Process of homology search
• Find the amino acid sequence of BCL2.
- Go to EMBL-EBI/SwissProt and search for the sequence of BCL2.
- Select the BCL2/Human gene from the results.
- Copy the sequence to the clipboard.
• Use FASTA or BLAST to find similar genes by sequence matching.
- Go to a site that provides FASTA/BLAST and paste the copied sequence.
- Perform the search and go through the results.
Review: FASTA and BLAST
• Both are sequence-matching algorithms.
FASTA: Focuses on a global matching view. Less sensitive to gaps. Can be used with very short sequences.
BLAST: Focuses more on a local matching view. More sensitive to gaps. Cannot be used with very short sequences.
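To make the global versus local distinction concrete, here is a minimal sketch of the two classic dynamic-programming scores: Needleman-Wunsch (global) and Smith-Waterman (local). This is not how FASTA or BLAST are implemented, since both are faster heuristics. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    # Row 0: aligning an empty prefix of a against j characters of b costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: i characters of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over any pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

For two sequences that share only a short common region, `local_score` rewards that region alone, while `global_score` is pulled down by the unrelated flanks.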
Prediction of protein secondary structure • Pick two proteins (one water-soluble protein and one membrane protein) and compare predicted and actual secondary structure.
Process of the query
• Pick two proteins using PDB (http://www.rcsb.org/pdb/index.html).
• Use nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html) to get a prediction of the water-soluble protein’s secondary structure.
• Use SOSUI (http://sosui.proteome.bio.tuat.ac.jp) to get a prediction of the membrane protein’s secondary structure.
• Use PDBsum (http://www.biochem.ucl.ac.uk/bsm/pdbsum/) to get the actual secondary structure of each and compare the results.
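The final comparison step can be sketched as a per-residue agreement count. This assumes both structures have been written as equal-length strings using the same letters as the nnpredict output (H = helix, E = strand) and that "-" means no prediction was made. The function name and the toy strings in the usage note are hypothetical; transcribing the PDBsum result into such a string is left to the reader.

```python
def agreement(predicted, actual, no_call="-"):
    """Return (fraction of matching residues, number of residues compared).

    Positions where the predictor made no call are skipped, so the
    fraction is computed only over residues that received a prediction.
    """
    if len(predicted) != len(actual):
        raise ValueError("structure strings must have the same length")
    compared = 0
    matches = 0
    for p, a in zip(predicted, actual):
        if p == no_call:
            continue  # no prediction at this residue
        compared += 1
        if p == a:
            matches += 1
    fraction = matches / compared if compared else 0.0
    return fraction, compared
```

For example, `agreement("HH--EE", "HHHHEC")` compares four residues and finds three in agreement.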
Result of nnpredict
Tertiary structure class: none
Sequence:
MSQSNRELVVDFLSYKLSQKGYSWSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLADSPAVNGATGHSSSLDAREVIPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPGTAYQSFEQVVNELFRDGVNWGRIVAFFSFGGALCVESVDKEMQVLVSRIASWMATYLNDHLEPWIQENGGWDTFVDLYGNNAAAESRKGQERFNRWFLTGMTVAGVVLLGSLFSRK
Secondary structure prediction (H = helix, E = strand, - = no prediction):
------HEEEHHHHHH------EE------------------------------------------------------HH--HHHHHHHHHHH--HHHHHHHH---HHH--EEE--------HHHHHHHHHHH-----EEEEEE-----EEE----HHHHHHHHHHHHHHHHH--------H-------EEEEH-----HHHHH---HHHHHHHHH---HEEEEEE--H----
Result of PDBsum
One Existing Multiple Database Search Project DBGET - Integrated database retrieval system (A part of GenomeNet – from Kyoto University) http://www.genome.ad.jp/dbget/
About DBGET
• DBGET is a simple database retrieval system for a diverse range of molecular biology databases.
• Most of the existing molecular biology databases can be treated in this simplified manner, that is, as so-called flat-file databases.
• Because each entry of a database is given a unique identifier, i.e., an entry name or an accession number, the molecular biology databases in the world can be retrieved uniformly by specifying the combination of the database name and the identifier:
dbname:identifier
• The KEGG gene catalogs are also treated as flat-file databases, where the combination of the organism name and the gene name is used for identification:
organism:gene
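The dbname:identifier convention is easy to handle in code. The following is a minimal sketch of splitting such a reference into its two parts; `EntryRef` and `parse_entry` are hypothetical names introduced here, not part of DBGET. The example reference `ec:5.3.1.1` is the one used later in these slides.

```python
from typing import NamedTuple


class EntryRef(NamedTuple):
    """One entry reference: a database (or organism) name plus an identifier."""
    dbname: str
    identifier: str

    def __str__(self):
        return f"{self.dbname}:{self.identifier}"


def parse_entry(ref):
    """Split 'dbname:identifier' at the first colon."""
    dbname, sep, identifier = ref.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"expected 'dbname:identifier', got {ref!r}")
    return EntryRef(dbname, identifier)
```

Splitting only at the first colon keeps any later colons inside the identifier, which is an assumption about how identifiers should be treated rather than documented DBGET behavior.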
LinkDB
• LinkDB is a database of links, each of which is represented as a binary relation in the form of:
dbname1:identifier1 --> dbname2:identifier2
• LinkDB contains all cross-reference links, called original links, extracted from all the databases in DBGET. Furthermore, LinkDB dynamically generates additional links by computation, i.e., by combining multiple links and/or using links in reverse directions. Thus, LinkDB is a deductive database, and the links in LinkDB are of the following three types:
original links, represented by: -->
reverse links, represented by: <--
indirect links, consisting of multiple links
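The three link types can be illustrated with a small sketch that stores original links as pairs and derives the other two by computation. This is only an illustration of the idea, not LinkDB's actual implementation; the function names and the entries `dbA:x1`, `dbB:y1`, `dbC:z1` in the usage note are made up.

```python
from collections import defaultdict


def reverse_links(links):
    """Turn each original link a --> b into the reverse link b --> a."""
    return {(dst, src) for src, dst in links}


def reachable(links, start, max_hops=2):
    """Entries reachable from start by following at most max_hops links.

    With max_hops=1 this returns direct links only; larger values also
    return indirect links built by chaining several links together.
    """
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop outward, ignoring entries already visited.
        frontier = {n for node in frontier for n in neighbours[node]} - seen
        seen |= frontier
    return seen - {start}
```

Given `links = {("dbA:x1", "dbB:y1"), ("dbB:y1", "dbC:z1")}`, calling `reachable(links, "dbA:x1")` follows the indirect path to `dbC:z1`, and passing `links | reverse_links(links)` allows the same walk in the opposite direction.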
DBGET access
• DBGET has three basic commands (or three basic modes in the Web version): bfind, bget, and blink.
bget : retrieves database entries specified by the combination dbname:identifier.
bfind : searches entries by keyword.
blink : LinkDB search; retrieves related entries in a given database or in all databases in GenomeNet.
How to get information using bfind/bget
Command version
bfind [option] dbname expression
bget [option] dbname identifier [identifier1...]
bget [option] dbname1:identifier1 [dbname2:identifier2...]
URL version
Retrieve a single entry:
http://www.genome.ad.jp/dbget-bin/www_bget?dbname+identifier
http://www.genome.ad.jp/dbget-bin/www_bget?dbname:identifier
dbname = Database name or organism name
identifier = Entry name (accession number) or gene name
Retrieve multiple entries:
http://www.genome.ad.jp/dbget-bin/www_bget?dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_bget?dbname1:identifier1+dbname2:identifier2+...
The first form is applicable only to multiple entries from a single database.
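The URL forms above are simple enough to build with string joins. The following is a minimal sketch; the function names are hypothetical, the base address is the one given on the slide (it may no longer be live), and no URL escaping is done, on the assumption that names and identifiers contain no special characters.

```python
BGET_URL = "http://www.genome.ad.jp/dbget-bin/www_bget"


def bget_url(dbname, *identifiers):
    """Single-database form: www_bget?dbname+identifier1+identifier2+..."""
    if not identifiers:
        raise ValueError("at least one identifier is required")
    return BGET_URL + "?" + "+".join((dbname,) + identifiers)


def bget_url_mixed(*refs):
    """Multi-database form: www_bget?dbname1:identifier1+dbname2:identifier2+..."""
    if not refs:
        raise ValueError("at least one dbname:identifier reference is required")
    for ref in refs:
        if ":" not in ref:
            raise ValueError(f"expected 'dbname:identifier', got {ref!r}")
    return BGET_URL + "?" + "+".join(refs)
```

Keeping the two forms as separate functions mirrors the restriction that the first form only works for entries from a single database.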
Cont…
Retrieve sequence entries in FASTA format:
http://www.genome.ad.jp/dbget-bin/www_bget?-f+dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_bget?-f+dbname1:identifier1+dbname2:identifier2+...
When the entry contains multiple sequences, specify as follows:
-f+-n+1 first sequence in FASTA format
-f+-n+2 second sequence in FASTA format
-f+-n+a amino acid sequence in FASTA format (GENES database only)
-f+-n+n nucleotide sequence in FASTA format (GENES database only)
Display title description of entries:
http://www.genome.ad.jp/dbget-bin/www_btit?dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_btit?dbname1:identifier1+dbname2:identifier2+...
Mark an object in the KEGG pathway:
http://www.genome.ad.jp/dbget-bin/show_pathway?mapno+dbname:identifier
mapno = pathway entry accession number, such as map00010 and hsa00010
dbname:identifier = KEGG identifier such as ec:5.3.1.1
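The FASTA and pathway URLs can be assembled the same way. The following is a sketch only: the function names are hypothetical, and placing `-n` and its value directly after `-f` and before the entry references is an assumption read off the option strings on the slide.

```python
DBGET_BIN = "http://www.genome.ad.jp/dbget-bin"

# Selector values listed on the slide for entries with multiple sequences.
SEQUENCE_SELECTORS = ("1", "2", "a", "n")


def fasta_url(refs, which=None):
    """Build www_bget?-f[+-n+which]+ref1+ref2+... for FASTA output."""
    if not refs:
        raise ValueError("at least one entry reference is required")
    if which is not None and which not in SEQUENCE_SELECTORS:
        raise ValueError(f"which must be one of {SEQUENCE_SELECTORS}")
    parts = ["-f"]
    if which is not None:
        parts += ["-n", which]  # assumed position of the selector
    parts += list(refs)
    return f"{DBGET_BIN}/www_bget?" + "+".join(parts)


def pathway_url(mapno, ref):
    """Build show_pathway?mapno+dbname:identifier to mark an object on a map."""
    return f"{DBGET_BIN}/show_pathway?{mapno}+{ref}"
```

With the values from the slide, `pathway_url("map00010", "ec:5.3.1.1")` produces the show_pathway address for marking that enzyme on map00010.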
Conclusion
• There are a great many bio-databases around the world.
• DBGET provides a common interface for accessing biological databases around the world.
• We can use the DBGET access method to perform extensive searches online.
• Because it provides a common interface, we can write an intelligent search procedure over DBGET using our own knowledge base.
• Implement an easy, intelligent access method on top of DBGET/LinkDB. I will call it “BioWizard.”
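One very small form such a knowledge base could take is a table that maps question topics to the databases listed earlier in these slides. The sketch below is purely illustrative and is not the BioWizard design, which the slides do not describe. The keyword stems, the matching rule, and the function name are assumptions; the database names come from the category lists above.

```python
# Each rule pairs keyword stems with databases from the category lists.
KNOWLEDGE_BASE = [
    (("metabol",), ["KEGG Metabolic Pathways", "EcoCyc", "UM-BBD"]),
    (("enzyme", "compound", "reaction"), ["LIGAND", "ENZYME", "BRENDA"]),
    (("interact",), ["BRITE", "DIP", "BIND"]),
    (("transcription",), ["TRANSFAC", "RegulonDB", "DBTBS"]),
    (("taxonom",), ["NCBI Taxonomy", "Tree of Life"]),
]


def suggest_databases(question):
    """Return databases whose topic keywords appear in the question text."""
    text = question.lower()
    suggestions = []
    for keywords, databases in KNOWLEDGE_BASE:
        if any(keyword in text for keyword in keywords):
            suggestions.extend(databases)
    return suggestions
```

A real guiding system would need far richer rules, but even this table shows how a common retrieval interface lets the choice of database be made by a program rather than by the user.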
Reference
• Akiyama, Y., Goto, S., Uchiyama, I., and Kanehisa, M.; WebDBGET: an integrated database retrieval system which provides hyper-links among related entries. MIMBD'95: Second Meeting on the Interconnection of Molecular Biology Databases (1995).
• Goto, S., Akiyama, Y., and Kanehisa, M.; LinkDB: a database of cross links between molecular biology databases. MIMBD'95: Second Meeting on the Interconnection of Molecular Biology Databases (1995).
• GenomeNet: http://www.genome.ad.jp/
• NCBI: http://www.ncbi.nih.gov/RefSeq/
• EBI: http://www.ebi.ac.uk/embl/index.html