Bioinformatics – Biological databases Mikael Thollesson Evolutionary Biology Centre and Linnaeus Centre for Bioinformatics, Uppsala University
What is a database?
The data itself (i.e., information)
The organisation of the data - database structure
  Flat-file databases: mark-up and tags
  Relational databases: records and fields
  Object-oriented databases
Database Management System (DBMS)
  Queries and retrievals
Interfaces
  User interfaces, e.g. web pages or dedicated clients
  Computer interfaces
Relational databases
A relational database consists of tables with homogeneous content, where each table contains records (items) and each record has one or several fields (properties).
Relational databases
Records in different tables are related by key fields.
Contents from different tables are brought together using these key values.
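The idea of key fields can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names, column names and sequence data below are made up for the example and do not come from any real biological database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE organism (
            organism_id INTEGER PRIMARY KEY,   -- key field
            name        TEXT NOT NULL
        );
        CREATE TABLE sequence (
            sequence_id INTEGER PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
            residues    TEXT NOT NULL
        );
    """)
    conn.executemany("INSERT INTO organism VALUES (?, ?)",
                     [(1, "Homo sapiens"), (2, "Mus musculus")])
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)",
                     [(10, 1, "ATGGCC"), (11, 2, "ATGTTT"), (12, 1, "GGGCCC")])
    return conn


def sequences_for(conn, organism_name):
    """Bring the two tables together through the shared organism_id key."""
    return conn.execute(
        """SELECT s.sequence_id, s.residues
           FROM sequence AS s
           JOIN organism AS o ON o.organism_id = s.organism_id
           WHERE o.name = ?
           ORDER BY s.sequence_id""",
        (organism_name,),
    ).fetchall()


conn = build_demo_db()
print(sequences_for(conn, "Homo sapiens"))
```

Each sequence record stores only the key of its organism, so the organism name is kept in one place and the join recombines the two tables on demand.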
[Diagram: a client exchanges html over http/https with a WWW server, e.g. Apache. On the server, interface code, e.g. Perl or PHP, sends an SQL query to the DBMS, e.g. MySQL, and receives an SQL reply. The server runs on an OS, e.g. Linux.]
One view of Bioinformatics
[Diagram linking databases to analyses and results. Databases: expression, metabolic, literature, sequence, genome, structure and phylogenetic databases. Analyses and results: expression patterns, contig assembly, regulatory mechanism, BLAST, predictions on DNA, pairwise/multiple alignment, phylogenetic inference, gene, predictions on proteins, phylogenies, function and localisation.]
Sequence databases
Nucleic acid sequence databases
  Contain primary nucleotide sequence data
  Repositories, i.e. the content of these databases is not curated
Protein sequence databases
  Contain secondary and primary protein sequence data
  Some are curated, others are just extracts from other databases
Several kinds of interfaces/search engines are available to retrieve data, e.g. SRS (Sequence Retrieval System) and the Entrez browser
Nucleotide sequence repositories
Three primary centres, which exchange information on a daily basis
  EMBL / European Molecular Biology Laboratory
  DDBJ / DNA Data Bank of Japan
  GenBank
All three adhere to the DDBJ/EMBL/GenBank Feature Table Definition – http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html, i.e. the content of each record is the same in all three databases
EMBL – European Molecular Biology Laboratory
Europe’s primary nucleotide sequence resource
Established 1980 in Heidelberg by EMBL, now maintained by EBI (European Bioinformatics Institute) in Cambridge, UK
Main sources of sequences are direct submissions from individual researchers, genome projects and patent applications
Contains two main parts
  A release section (embl_rel) that is issued every three months
  A new section (embl_new) where new sequences are added daily
Also split into divisions depending on the origin of the sequence
http://www.ebi.ac.uk/embl/Access/index.html
Entries have a format that differs from that of GenBank and DDBJ
DDBJ – DNA Data Bank of Japan
Mainly collects data from Japanese activities (but accepts submissions from any researcher in any country)
Began DNA repository activities in 1986, endorsed by the Ministry of Education, Science, Sports, and Culture
http://www.ddbj.nig.ac.jp/
Entries have the same format as GenBank
GenBank
The US primary nucleotide sequence resource
Established in 1982
Maintained by the National Center for Biotechnology Information (NCBI, itself established in 1988), Bethesda, MD
Contains a release section and a new section, like EMBL
http://www.ncbi.nlm.nih.gov/
Entries have a format that differs from that of EMBL
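Since the three repositories hold the same content in two different flat-file layouts, a reader typically has to recognise the layout before extracting fields. The sketch below is a simplified illustration: it relies only on the two-letter line codes of EMBL-style entries (ID, AC, DE) and the keyword lines of GenBank-style entries (LOCUS, ACCESSION, DEFINITION), both terminated by `//`. The two records and the accession DEMO0001 are hypothetical and heavily shortened, and the function name `parse_flatfile` is invented for this example.

```python
# Hypothetical, heavily shortened records; real entries contain many more line types.
EMBL_STYLE = """ID   DEMO0001; SV 1; linear; mRNA; STD; PLN; 12 BP.
AC   DEMO0001;
DE   Hypothetical demo record
SQ   Sequence 12 BP;
     atggcctaag ct                                                        12
//
"""

GENBANK_STYLE = """LOCUS       DEMO0001       12 bp    mRNA    linear   PLN
DEFINITION  Hypothetical demo record
ACCESSION   DEMO0001
ORIGIN
        1 atggcctaag ct
//
"""


def parse_flatfile(text):
    """Extract format, accession and description from one flat-file record."""
    record = {"format": None, "accession": None, "description": None}
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker in both layouts
            break
        if line.startswith("ID   "):
            record["format"] = "EMBL"
        elif line.startswith("LOCUS"):
            record["format"] = "GenBank"
        elif line.startswith("AC   ") and record["accession"] is None:
            # EMBL style: accessions are separated by semicolons; keep the first
            record["accession"] = line[5:].split(";")[0].strip()
        elif line.startswith("ACCESSION"):
            # GenBank style: accessions are separated by whitespace; keep the first
            record["accession"] = line[len("ACCESSION"):].split()[0]
        elif line.startswith("DE   ") and record["description"] is None:
            record["description"] = line[5:].strip()
        elif line.startswith("DEFINITION"):
            record["description"] = line[len("DEFINITION"):].strip()
    return record


print(parse_flatfile(EMBL_STYLE))
print(parse_flatfile(GENBANK_STYLE))
```

Both calls yield the same accession and description and differ only in the detected format. Multi-line descriptions and the feature table are deliberately ignored here.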
EST databases
Expressed Sequence Tags (ESTs) are short sequences derived from mRNA
ESTs are useful to get a handle on expressed genes
dbEST
  Is a division of GenBank containing ESTs from a number of organisms
UniGene
  A non-redundant set of gene-oriented clusters
  Contains numerous novel ESTs, but also “proper” sequences
  Presently Homo, Rattus, and Mus have been processed
Protein databases
SWISS-PROT and TrEMBL
  Developed by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI)
PIR-PSD
  A collaboration between the National Biomedical Research Foundation (NBRF), the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID)
Protein databases I
SWISS-PROT (86 000 entries, June 2000)
  Is a curated protein sequence database
  Aims to provide a high level of annotation (e.g., function, domain structure, post-translational modifications)
  Divided into Swissprot_rel and Swissprot_new
  Not divided into sections based on species
TrEMBL (ca 300 000 entries, June 2000)
  Contains translated sequences from the EMBL database
  Divided into
    SP-TrEMBL, with sequences that are candidates for incorporation into SWISS-PROT
    REM-TrEMBL, with sequences that will not be incorporated into SWISS-PROT
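The translation step behind a database of translated nucleotide sequences can be sketched as below. This is a minimal illustration, not the actual TrEMBL pipeline: it assumes the standard genetic code, reads only the first reading frame of the given string, and uses made-up input. The names `CODON_TABLE` and `translate` are invented for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, stop_at_stop=True):
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")      # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):      # ignore a trailing partial codon
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*" and stop_at_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCATTTAA"))
```

For real entries the coding region, the reading frame and the applicable genetic code would have to be taken from the annotation of the nucleotide record rather than assumed.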
Protein databases II
Protein Information Resource - Protein Sequence Database (PIR-PSD) is similar to SWISS-PROT in its aims
PIR’s stated goal is “to provide a comprehensive, non-redundant, classified, well-annotated, and freely available, protein sequence database, in which entries are classified into family groups and alignments of each group are available”
Also produces a computer-generated supplemental database of translations, PATCHX, which is similar to TrEMBL and holds sequences not yet incorporated
New entries arrive in batches from genome sequencing projects or from selected GenBank/EMBL entries
The PIR database is in constant flux as the level of annotation on entries increases and new entries with minimal annotation are added
The PIR-PSD database is growing at a higher rate than SWISS-PROT, but has a lower level of annotation per entry
The PIR-PSD consists of four sections:
  PIR1. Fully Classified Entries
  PIR2. Verified and Classified Entries
  PIR3. Unverified Entries
  PIR4. Un-encoded or Un-translated Entries
Interfaces to public databases
Several different databases are usually accessible through the same WWW interface. For example, the databases below are accessible via the National Institutes of Health / National Center for Biotechnology Information (NIH/NCBI) (http://www.ncbi.nlm.nih.gov/Database/)
[Diagram: full-text electronic journals, OMIM, PubMed, nucleotide sequences, 3D structures, protein sequences, maps & genomes, taxonomy]
Genome databases
• Differ from sequence databases by being more heterogeneous and diverse
• A genome database organises all information on an organism's genome, such as
  • Genetic mapping
    • Maps how genes are located relative to each other, with distance measured as percentage recombination
  • Physical mapping
    • Ranges from cytogenetic maps (banding patterns of chromosomes) to the positions of clone contigs
  • Sequence data
    • Nucleotide sequences are (usually) deposited at the nucleotide sequence repositories even before the genome sequencing is finished
• Entry points to genome databases are e.g.
  • Genome Net – http://www.genome.ad.jp/
  • NCBI’s genome section – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
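The distance measure used in genetic mapping on the slide above, percentage recombination, is a simple ratio. The sketch below shows the arithmetic only; the function name and the offspring counts are invented for illustration.

```python
def recombination_percent(recombinants, total_offspring):
    """Genetic map distance between two loci as percentage recombination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring


# Made-up cross: 15 recombinant offspring out of 100 scored
print(recombination_percent(15, 100))
```

For closely linked genes, one percent recombination corresponds to roughly one centimorgan. Over longer distances multiple crossovers make the observed percentage underestimate the true map distance.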
Protein structures
• Primary structure; the order of amino acids (e.g. n-IVTAHAFVMI-c)
• Secondary structure; conformations, mainly alpha helices and beta sheets
• Tertiary structure; the complete three-dimensional folding of the polypeptide
• Quaternary structure; exists if the protein is composed of two or more polypeptide chains
Structural databases
Contain information on the three-dimensional structure of molecules, chiefly proteins
Data are primarily based on X-ray crystallography (>80%), NMR, or theoretical models (<2%)
Examples of such databases
  Protein Data Bank (PDB) - http://www.rcsb.org/pdb/
  Molecular Modelling Database (MMDB) - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
Metabolic databases
All metabolic databases use EC numbers, which are combinations of four numbers that classify the type of reaction the enzyme catalyses
Example: EC 1.2.3.4 is an oxidoreductase (1) that acts on aldehyde or oxo groups (1.2) with oxygen as acceptor (1.2.3). The last number, 4, is an ordinal number within the class
Pros and cons
+ EC provides a unique identifier
+ Enables a synonym dictionary
- Many classes of enzymes are not covered in sufficient detail, especially proteases and nucleases with macromolecules as substrate
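The hierarchical reading of an EC number can be sketched as a small parser. The top-level class names below follow the Enzyme Commission nomenclature (class 7, translocases, was added in 2018, after these slides were written); the type and function names are invented for this example.

```python
from typing import NamedTuple

# Top-level EC classes; class 7 was introduced in 2018.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


class ECNumber(NamedTuple):
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int

    @property
    def class_name(self):
        return EC_CLASSES.get(self.enzyme_class, "unknown")

    def prefixes(self):
        """Return the successively more specific levels, e.g. 1, 1.2, 1.2.3, 1.2.3.4."""
        parts = [str(n) for n in self]
        return [".".join(parts[:i]) for i in range(1, 5)]


def parse_ec(text):
    """Parse strings such as 'EC 1.2.3.4' or '1.2.3.4' into an ECNumber."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete EC number: {text!r}")
    return ECNumber(*(int(p) for p in parts))


ec = parse_ec("EC 1.2.3.4")
print(ec.class_name, ec.prefixes())
```

Because each prefix names a broader group of reactions, a database can retrieve all enzymes under, say, 1.2.3 by matching on the prefix rather than on the full identifier.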
Metabolic databases
Describe enzymes, substrates, products, and biochemical reactions
Contain data specific to different organisms (“type organisms”) as well as general overviews, with links to sequence and structure databases
Example
  Kyoto Encyclopedia of Genes and Genomes (KEGG) – http://www.genome.ad.jp/kegg/
Phylogenetic databases
Primary (repositories) and secondary (data analysis and interpretation) databases
Primary databases contain information on the results of phylogenetic analyses (trees, taxonomic names), and on the data and assumptions on which the analyses are based
Secondary databases contain interpretations and assembled phylogenetic hypotheses for all kinds of taxa
Examples
  TreeBase – http://www.herbaria.harvard.edu/treebase/index.html (Primary)
  Tree of Life – http://phylogeny.arizona.edu/tree/ (Secondary)
Expression databases
• Functional genomics
• DNA arrays (cDNA probes on a chip) are used to assess the RNA levels of different genes (several hundred at a time)
• Measurements are taken at intervals after some treatment is initiated
• Genes are grouped in clusters according to expression profile
• Reverse engineering of the expression levels of these groups is used to propose regulatory genetic networks
• No unified format for DNA-chip data yet, although work is in progress
• Examples of gene expression databases are
  • EBI ArrayExpress database – http://www.ebi.ac.uk/arrayexpress/
  • KEGG Expression Database – http://www.genome.ad.jp/kegg/expression/
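Grouping genes by expression profile, as mentioned on the slide above, can be illustrated with a deliberately simple scheme: a gene joins the first cluster whose founding gene it correlates with strongly, otherwise it starts a new cluster. This is only a sketch. Real analyses normally use established methods such as hierarchical clustering or k-means, and the gene names, measurements and 0.9 threshold below are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0                           # a flat profile correlates with nothing
    return cov / (sd_x * sd_y)


def cluster_profiles(profiles, threshold=0.9):
    """Greedy grouping: compare each gene with the first member of each cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])          # no similar cluster found
    return clusters


# Made-up RNA levels at four time points after a treatment
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 5, 2],
}
print(cluster_profiles(profiles))
```

Correlation compares the shape of the profiles rather than their absolute levels, which is why geneA and geneB end up together despite differing twofold in magnitude.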