310 likes | 464 Views
Databases for Bioinformatics A very partial introduction. Paul Greenfield CSIRO Mathematics, Informatics and Statistics January 2013. Databases. A heavily overloaded term
E N D
Databases for BioinformaticsA very partial introduction Paul GreenfieldCSIRO Mathematics, Informatics and Statistics January 2013
Databases • A heavily overloaded term • Used here to mean a collection of (possibly) structured data that can be accessed and queried (and the related technology) • Structured databases • Data is defined using a schema • Strongly defined data structures, with some defined semantics • Relational databases (accessed through SQL) are structured • Unstructured databases • Collections of data that can be created, read, updated or deleted • No (or little) definition of the data itself • And many other variants and mixtures of the above • Key-value stores are popular... • As are document-centric databases...
Databases and bioinformatics • Various types of databases commonly used • Repositories of structured reference data • Back-end to Web forms for downloading selected data • Publications, reference sequences, proteins, ... • Metadata and results from experiments • Needs to keep the results organised somehow • (spreadsheets are possibly more common though) • Non-computing use of ‘database’ as well • A collection of data rather than a technology • Often using databases to structure and provide access to a larger repository • Is PubMed a ‘database’ or a ‘database’? • Is the SRA (Short Read Archive) a ‘database’?
Metabase example • A database of biological databases • Wide variety of different types of data on different organisms • Holding the core data for a research community • Basis for collaboration • EBI, NCBI, ... • Typical usage model is search then download • APIs for searching • ftp downloads
Relational databases and SQL • Mainstream database technology • Oracle, mySQL, SQL/Server, DB2, ... • Widely used in commerce and science • Data entities are modelled using tables (relations) • Tables are sets of identically-structured records (tuples) • Defined relationships between tables • Lots of theory about how to design relational databases (‘normal forms’) • Structured Query Language • A standard, non-procedural, set-based, declarative query language • Say what results you want, and let the database work out how to get them
A bacterial database • Goal: a queriable bacterial database • Genes, sequences, functions, ... for all available bacteria • Answering some questions about bacteria without writing code... • Finished and draft annotated bacterial genomes from NCBI • ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ • ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/ • Draft genomes are sets of contigs (from an assembly) • One sequence file containing a set of contigs/scaffolds • Finished genomes • One sequence file per genome, plasmid, ... • One directory for each species/strain • DNA sequences, amino acid sequences, annotations • In a variety of formats
Bacterial database design Species Sequences Genes • A few thousand directories of bacterial sequences & annotations... • How do you answer simple questions? • Loading this raw data into a database could make it more useful • What do we want to model as tables? • Organisms (directories) • Genomes, plasmids (for finished organisms), scaffolds/contigs (for drafts) • Genes (from annotations) – sequence + annotations • Other pre-computed metrics (relatedness??) • Each table needs a unique key • Used to connect related entities (foreign keys) • Can be ‘compound’ (speciesNo, sequenceNo)
Species CREATE TABLE Species ( SpeciesNoint NOT NULL AUTO_INCREMENT, -- unique id pecies SpeciesIDvarchar(100) NOT NULL, -- species/strain name SpeciesUidint NOT NULL, -- uid from NCBI directory Finished int NOT NULL -- finished = 1, draft = 0 ) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=10000; ALTER TABLE Species ADD PRIMARY KEY (SpeciesNo), ADD INDEX BySpeciesID (SpeciesID);
Sequences CREATE TABLE Sequences ( SpeciesNoint NOT NULL, -- (SpeciesNo, SequenceNo) is compound key SequenceNoint NOT NULL, -- ...for this sequence/genome/plasmid/... SequenceIDvarchar(20) NOT NULL, -- name of this sequence (e.g. NCnnnnnnn) SequenceDescvarchar(100) NOT NULL, -- description of sequence SequenceLengthint NOT NULL -- length of this sequence string ) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=1000000; ALTER TABLE Sequences ADD PRIMARY KEY (SpeciesNo, SequenceNo), ADD INDEX BySequenceID (SequenceID);
Genes CREATE TABLE Genes ( SpeciesNoint NOT NULL, -- |(SpeciesNo, SequenceNo, GeneNo) is key SequenceNoint NOT NULL, -- |... for this gene record GeneNoint NOT NULL, -- | GenePIDvarchar(20) NOT NULL, -- PID name for gene (or ‘non coding’) GeneNamevarchar(20) NOT NULL, -- short-form gene name GeneSynonymvarchar(20) NOT NULL, -- GeneCodevarchar(20) NOT NULL, -- GeneCOGvarchar(20) NOT NULL, -- COG code for protein product GeneProductvarchar(200) NOT NULL, -- description of protein or RNA GeneRNA char (3) NOT NULL, -- 'RNA' for RNA genes GeneGCint NOT NULL, -- GC content of this gene GeneStrand char (1) NOT NULL, -- '+'=forward; '-'=reverse GeneStartint NOT NULL, -- chromosome-relative location of 'first' base GeneEndint NOT NULL, -- chromosome-relative location of 'last' base GeneLengthint NOT NULL -- total length of gene GeneKeyint NOT NULL, -- key used to get to gene sequence (1-1) ) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=100000000; ALTER TABLE Genes ADD PRIMARY KEY (SpeciesNo, SequenceNo, GeneNo), ADD INDEX ByGenePID (GenePID), ADD INDEX ByGeneLocation (SpeciesNo,SequenceNo,GeneStart,GeneEnd,GeneNo,GeneRNA)
Gene Sequence CREATE TABLE GenesSeqs ( GeneKeyint NOT NULL, -- unique key for gene sequence (1-1) Gene longtext NOT NULL -- gene itself ) ENGINE=MYISAM AVG_ROW_LENGTH=200 MAX_ROWS=100000000; ALTER TABLE GeneSeqs ADD PRIMARY KEY (GeneKey) • Actual sequence for each gene is useful sometimes • But often just an overhead resulting in worse query performance • Current DB design splits out the gene sequence from the gene metadata and annotations • Reduces the size of the heavily-used Gene records
SQL Queries • A query is an operation over tables that returns a table • Tables can be thought of as sets if that helps • Based on relational algebra originally SELECT <result columns> FROM <tables>WHERE <restriction predicate>ORDER BY.... GROUP BY.... • Query returns a table with the columns specified by <result columns> • <tables> specifies the tables that are joined for the query • <restriction predicate> is a Boolean expression over table columns that define the conditions that must be met for a query row to be included in the results • Optional ORDER BY clause defines the order of the result table • Optional GROUP BY clause is used for aggregation • And all this is recursive...
Some simple sample queries select * from Species order by SpeciesID • Returns all columns from all rows in the Species table, in SpeciesID order select * from Species whereSpeciesIDlike ‘Bacillus thuringiensis%’ • Returns all columns for all the Bt organisms in the database select * fromSpecies, Sequences whereSpeciesIDlike 'Bacillus thuringiensis%' and Finished=1 andSpecies.SpeciesNo=Sequences.SpeciesNo • Joins Species and Sequences rows (Cartesian product in theory) on SpeciesNo • Only returns rows for finished Bt organisms • All columns from Species and Sequences returned
Selecting columns selectsp.SpeciesNo, sp.SpeciesID, sq.SequenceNo, sq.SequenceID, sq.SequenceLength, sq.SequenceDescfromSpecies sp, Sequences sq wheresp.SpeciesIDlike 'Bacillus thuringiensis%' andsp.Finished=1 andsp.SpeciesNo=sq.SpeciesNo • sp and sq are aliases for tables • Get all the sequences for all the Bt organisms selectsp.SpeciesNo, sp.SpeciesID, sq.SequenceLengthas 'Genome length'fromSpecies sp, Sequences sq wheresp.SpeciesIDlike 'Bacillus thuringiensis%' andsp.Finished=1 andsp.SpeciesNo=sq.SpeciesNoandsq.SequenceDesclike '% chromosome' • And just the chromosome lengths (if they’re annotated properly)
select 'Anthrax', COUNT(*), AVG(sq.SequenceLength) as 'Average length'fromSpecies sp, Sequences sq wheresp.SpeciesIDlike 'Bacillus anthracis%' andsp.Finished=1 andsp.SpeciesNo=sq.SpeciesNoandsq.SequenceDesclike '% chromosome' How many Anthrax strains are in the database and what is their average genome length selectsp.SpeciesID, COUNT(*) as'sequences'fromSpecies sp, Sequences sq wheresp.SpeciesIDlike 'Bacillus anthracis%' andsp.Finished=1 andsp.SpeciesNo=sq.SpeciesNogroup bysp.SpeciesID How many sequences do we have for each of the Anthrax strains? Aggregations (counts, averages, ....)
More on joins • A complex topic (inner, outer, natural, left, right) • We’ll just be linking rows together using their keys • Either explicitly or implicitly • Differences come from handling no-match cases... selectsq.SequenceNo, g.GeneStart, g.GeneEnd, g.GeneLength, g.GeneProduct, g.GeneCOG, g.GeneGCfromSpecies sp, Sequences sq, Genes g wheresp.SpeciesNo=sq.SpeciesNoandg.SpeciesNo=sp.SpeciesNoandg.SequenceNo=sq.SequenceNoandsp.SpeciesID='Bacillus anthracisAmes'orderbyg.SequenceNo, g.GeneStart • Get the genes in the 'Bacillus anthracisAmes’ strain • Could rewrite using explicit joins (see example)
More joins selectdistincts.SpeciesNo, s.SpeciesID, g.GeneProduct, gs.GenefromSpecies s joinGenes g ong.SpeciesNo = s.SpeciesNojoinGeneSeqsgsong.GeneKey = gs.GeneKeywhereg.GeneRNA='RNA' and(g.GeneProductlike '%16S%' org.GeneProduct= 'small subunit ribosomal RNA' org.GeneProduct= 'Ribosomal RNA small subunit') orderbys.SpeciesID Find all easily identifiable 16S genes from all organisms Explicit ‘join’ syntax being used ‘distinct’ says don’t return duplicate rows
Joins across databases select sp.SpeciesNo, sp.SpeciesID, sq.SequenceLength,sq.SequenceDescfrom Species sp join RDPBacterialTaxonomy.dbo.SpeciesNamessnon sp.SpeciesNo=sn.SpeciesNojoinSequences sq on sp.SpeciesNo=sq.SpeciesNowhere sn.FamilyID='Enterobacteriaceae' and sq.SequenceLength> 1000000 and sp.Finished=1 order by sp.SpeciesID • Fetch the lengths and descriptions of all long sequences for finished organisms in the Enterobacteriaceae family • Not all ‘chromosomes’ are annotated as such... • You could query to find out which ones...
Set operations selectdistinct sp.speciesNo, sp.SpeciesIDfromSpecies sp wheresp.Finished=1 exceptselectspg.speciesNo, spg.SpeciesIDfromSpecies spg, Genes g whereg.SpeciesNo=spg.SpeciesNoandg.GeneRNA='RNA'andspg.Finished=1 and (g.GeneProductlike '%16S%' org.GeneProduct= 'small subunit ribosomal RNA‘ org.GeneProduct= 'Ribosomal RNA small subunit') orderbysp.SpeciesID • What organisms did not have an easily-identifiable 16S sequence? • Union, intersection, ...
Pre-computed similarity metrics • The database you have included three similarity tables • Based on k-mers in common (shared k-mers) • Organism-to-organism (GenomeToGenomeMatches) • Sequence-to-sequence (SequenceToSequenceMatches) • Gene-to-gene (GeneToGeneMatches) • Other such pre-computed similarity metrics are possible • But shared k-mers are fast to compute and useful • Support answering questions about relatedness and conservation
Performance • SQL engines look at your query and the database and decide how to execute it most efficiently • ‘Query optimiser’ • Use available indexes to improve search time • You say what you want to do – not how to do it • Slow queries may be doing linear searches of large tables • (look at bacterial db schema and the defined indexes) • Typical queries take just a few seconds • Acinetobacterbaumannii AB0057 query...
Wrap-up • The bioinformatics world is full of ‘databases’ • Collections of searchable data/references/literature • Often based on some form of database technology • Structured, unstructured, SQL, noSQL, key-value stores, ... • Often hidden behind web pages and scripting • Often accessed through APIs • Use whatever seems to work best • Query interfaces are much rarer • Some Web forms constructing queries on your behalf • Constructing some form of search predicate • Direct SQL queries are powerful but not common...
Prac session • The bacterial database is available on the lab system • mySQL • Run some of the samples in these notes • Other tasks • Find out how many 16S copies there are, on average, in a given family • How would you do this for all families? • Find the toxin region in 'Clostridium botulinum A2 Kyoto‘ • Find out what other organisms share these toxin genes • Find plausible functions for some of the ‘hypothetical’ genes in Methanococcusmaripaludis X1 • Do something else interesting...