BioMart: Federated Database Architecture. Arek Kasprzyk, EBI, 9 June 2005. BioMart is a joint project of the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL). Aim: to develop a simple and scalable data management system capable of integrating distributed data sources.
BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005
BioMart • A joint project • European Bioinformatics Institute (EBI) • Cold Spring Harbor Laboratory (CSHL) • Aim • To develop a simple and scalable data management system capable of integrating distributed data sources.
Challenges • Data sources • Large • Distributed • Different data
Requirements • User • All data accessible through a single set of interfaces • Suitable for power biologists and bioinformaticians • Deployer • ‘Out of the box’ installation • Built-in query optimization • Easy data federation • Architecture • Distributed • Domain agnostic • Platform independent
Federated architecture Query Engine
BioMart User interfaces Data mart Data sources
Dataset Data mart and dataset
Schema Data mart, dataset and schema
XML XML XML Dataset Configuration
BioMart abstractions • Dataset • A subset of data organized into one or more tables • Attribute • A single data point • e.g. gene name • Filter • An operation on an attribute • e.g. ‘Chromosome = 1’
Datasets, Attributes and Filters • Mart • Dataset • Attribute • Filter • Table GENE • gene_id (PK) • gene_stable_id • gene_start • gene_chrom_end • chromosome • gene_display_id • description
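The dataset, attribute and filter abstractions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BioMart's implementation: the class names, the `query` method and the one-table mapping are hypothetical, and the column names are borrowed from the GENE table on the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    """A single data point, e.g. a gene name (one column in this sketch)."""
    name: str


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = 1."""
    attribute: str
    operator: str
    value: object


@dataclass
class Dataset:
    """A subset of data; simplified here to a single (hypothetical) table."""
    name: str
    table: str
    attributes: dict = field(default_factory=dict)

    def add(self, attribute: Attribute) -> None:
        self.attributes[attribute.name] = attribute

    def query(self, get, where):
        """Build a parameterized SQL string from attribute names and filters."""
        allowed_ops = {"=", "<", ">", "<=", ">="}
        for name in get:
            if name not in self.attributes:
                raise KeyError(f"unknown attribute: {name}")
        clauses, params = [], []
        for f in where:
            if f.attribute not in self.attributes or f.operator not in allowed_ops:
                raise ValueError(f"invalid filter: {f}")
            clauses.append(f"{f.attribute} {f.operator} ?")
            params.append(f.value)
        sql = f"SELECT {', '.join(get)} FROM {self.table}"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params


genes = Dataset(name="genes", table="gene")
for column in ("gene_stable_id", "chromosome", "description"):
    genes.add(Attribute(column))

sql, params = genes.query(
    get=["gene_stable_id", "description"],
    where=[Filter("chromosome", "=", "1")],
)
print(sql, params)
```

The point of the sketch is that a user only names attributes and filters; the mapping to tables and columns stays inside the dataset.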
Examples • Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder • Name, chromosome position and description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes
Data model [Figure, shown in three builds: tables linked by primary keys (PK) and foreign keys (FK)]
Data model - ‘reversed star’ [Figure: main tables (main1, …) carrying primary keys PK1 and PK2, surrounded by dimension tables (dm) that refer to them through foreign keys FK1 and FK2]
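A minimal sketch of such a layout, using Python's built-in `sqlite3` module. It assumes that "reversed star" means a central main table whose primary key is referenced by foreign keys in the surrounding dimension tables. All table names, column names and rows below are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One main table (primary key) and two dimension tables (foreign keys).
conn.executescript("""
CREATE TABLE gene_main (
    gene_key   INTEGER PRIMARY KEY,
    name       TEXT,
    chromosome TEXT
);
CREATE TABLE gene_xref_dm (
    gene_key  INTEGER REFERENCES gene_main(gene_key),
    accession TEXT
);
CREATE TABLE gene_tissue_dm (
    gene_key INTEGER REFERENCES gene_main(gene_key),
    tissue   TEXT
);
""")

# Hypothetical rows: one main row can have many rows in each dimension.
conn.executemany("INSERT INTO gene_main VALUES (?, ?, ?)",
                 [(1, "geneA", "1"), (2, "geneB", "2")])
conn.executemany("INSERT INTO gene_xref_dm VALUES (?, ?)",
                 [(1, "P00001"), (1, "P00002"), (2, "P00003")])
conn.executemany("INSERT INTO gene_tissue_dm VALUES (?, ?)",
                 [(1, "lung"), (2, "brain")])

# Every join goes through the key of the main table.
rows = conn.execute("""
SELECT m.name, x.accession
FROM gene_main AS m
JOIN gene_xref_dm   AS x ON x.gene_key = m.gene_key
JOIN gene_tissue_dm AS t ON t.gene_key = m.gene_key
WHERE m.chromosome = '1' AND t.tissue = 'lung'
ORDER BY x.accession
""").fetchall()
print(rows)
```

Because each dimension table joins to the main table on a single key, a query touches only the dimensions its attributes and filters actually need.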
Dataset • Fixed schema transformation [Figure labels: A, B, C, TA, TB]
BioMart abstractions • Link • ‘Common currency’ between two datasets • e.g. accession • Exportable • Potential links to export • Importable • Potential links to import
Exportables, Importables and Links [Figure: Dataset 1 and Dataset 2 connected by Links]
Exportables, Importables and Links • Exportable • name = uniprot_id • attributes = uniprot_ac • Importable • name = uniprot_id • filters = uniprot_ac • Links connect Dataset 1 and Dataset 2
Exportables, Importables and Links • Exportable • name = genomic_region • attributes = chr_name, chr_start, chr_end • Importable • name = genomic_region • filters = chr_name (=), chr_start (>=), chr_end (<=) • Links connect Dataset 1 and Dataset 2
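The linking idea can be sketched as follows: an exportable names the attributes one dataset offers, an importable names the filters another dataset accepts, and a link pairs the two when their names match. This is a simplified illustration, not BioMart code. The classes, the `link` function and the sample rows are hypothetical, and only equality matching (the uniprot_id case) is covered, not the range comparisons of the genomic_region case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exportable:
    name: str
    attributes: tuple  # attribute names whose values are exported


@dataclass(frozen=True)
class Importable:
    name: str
    filters: tuple  # filter names that accept the exported values


def link(source_rows, exportable, target_rows, importable):
    """Return target rows whose filter values match values exported by the source."""
    if exportable.name != importable.name:
        raise ValueError("exportable and importable names must match")
    if len(exportable.attributes) != len(importable.filters):
        raise ValueError("attributes and filters must pair up one to one")
    exported = {
        tuple(row[a] for a in exportable.attributes) for row in source_rows
    }
    return [
        row for row in target_rows
        if tuple(row[f] for f in importable.filters) in exported
    ]


# Hypothetical rows standing in for the two datasets.
dataset1 = [
    {"gene": "geneA", "uniprot_ac": "P00001"},
    {"gene": "geneB", "uniprot_ac": "P00002"},
]
dataset2 = [
    {"uniprot_ac": "P00001", "protein": "kinase X"},
    {"uniprot_ac": "P99999", "protein": "unrelated"},
]

exp = Exportable(name="uniprot_id", attributes=("uniprot_ac",))
imp = Importable(name="uniprot_id", filters=("uniprot_ac",))

linked = link(dataset1, exp, dataset2, imp)
print(linked)
```

Matching on the shared name rather than on column names is what lets two independently configured datasets agree on a common currency.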
Building BioMart databases • Transformation • Source databases to Mart • MartBuilder • Configuration • XML • MartEditor
Table naming convention • Naïve configuration • Tables • Meta tables: meta_content • Data tables: dataset__content__type • Data tables • Main: __main • Dimension: __dm • Columns • Key: _key
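A small Python sketch of how a tool could read this naming convention to classify tables without extra configuration. The function and type names are hypothetical, the example table names are made up, and the parsing rules are an assumption drawn only from the patterns listed on the slide (a `meta_` prefix, and `dataset__content__type` with type `main` or `dm`).

```python
from typing import NamedTuple, Optional


class TableName(NamedTuple):
    kind: str               # "meta", "main" or "dimension"
    dataset: Optional[str]  # None for meta tables
    content: str


def parse_table_name(name: str) -> TableName:
    """Classify a table name according to the naming convention."""
    if "__" in name:
        parts = name.split("__")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected dataset__content__type, got {name!r}")
        dataset, content, suffix = parts
        kinds = {"main": "main", "dm": "dimension"}
        if suffix not in kinds:
            raise ValueError(f"unknown table type {suffix!r} in {name!r}")
        return TableName(kinds[suffix], dataset, content)
    if name.startswith("meta_"):
        return TableName("meta", None, name[len("meta_"):])
    raise ValueError(f"{name!r} does not follow the naming convention")


def is_key_column(column: str) -> bool:
    """Key columns end in _key under the convention."""
    return column.endswith("_key")


# Hypothetical table and column names.
main_table = parse_table_name("mygenes__gene__main")
dim_table = parse_table_name("mygenes__xref__dm")
meta_table = parse_table_name("meta_configuration")
print(main_table, dim_table, meta_table, is_key_column("gene_key"))
```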
BioMart architecture • Retrieval • MartExplorer • MartShell • MartView • BioMart API • Java • Perl • Databases • Public data (local or remote): Ensembl, Vega, SNP, UniProt, MSD • myDatabase, myMart • Schema transformation • MartBuilder • Configuration XML • MartEditor
MartShell • Using = dataset • Get = attribute • Where = filter
Mart Query Language (MQL) • Syntax: using <dataset> get <attributes> where <filters> • Can join datasets together: • using Dataset1 get Attribute1 where Filter1=var1 as q; • using Dataset2 get Attribute2 where Filter2=var2 and Filter3 in q • Can script and pipe: • martshell.sh -E MQLscript.mql > results.txt • martshell.sh -E MQLscript.mql | wc
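To make the `using ... get ... where ...` shape concrete, here is a toy Python parser for the basic form only. It is not the MartShell parser: it ignores the `as q` and `in q` join forms, it assumes attributes are comma-separated and filters are `name=value` pairs joined by `and`, and the dataset, attribute and filter names in the example are hypothetical.

```python
import re

# Basic form only: using <dataset> get <attributes> [where <filters>][;]
MQL_PATTERN = re.compile(
    r"^\s*using\s+(?P<dataset>\S+)"
    r"\s+get\s+(?P<attributes>.+?)"
    r"(?:\s+where\s+(?P<filters>.+?))?"
    r"\s*;?\s*$",
    re.IGNORECASE,
)


def parse_mql(statement: str) -> dict:
    """Split a basic MQL-like statement into dataset, attributes and filters."""
    match = MQL_PATTERN.match(statement)
    if match is None:
        raise ValueError(f"not a basic MQL statement: {statement!r}")

    # Assumption: attributes are separated by commas.
    attributes = [a.strip() for a in match["attributes"].split(",") if a.strip()]

    filters = []
    if match["filters"]:
        for clause in re.split(r"\s+and\s+", match["filters"], flags=re.IGNORECASE):
            name, sep, value = clause.partition("=")
            if not sep:
                raise ValueError(f"unsupported filter clause: {clause!r}")
            filters.append((name.strip(), value.strip()))

    return {
        "dataset": match["dataset"],
        "attributes": attributes,
        "filters": filters,
    }


parsed = parse_mql("using genes get name, description where chromosome=1 and band=p1;")
print(parsed)
```

The three parsed parts correspond one to one with the dataset, attribute and filter abstractions, which is why the language needs so few keywords.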
Third party software • Bioconductor (biomaRt) • BioMart schema • Taverna • BioMart Java library • DAS ProServer • BioMart Perl library
ProServer • No programming • DAS requests and responses defined by Exportables and Importables and configured by MartEditor • DAS1
BioMart deployers • Large scale data federation (EBI) • Optimising access to a large database (Ensembl, WormBase) • Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
Hinxton example [Figure labels: SANGER, EBI, Ensembl, Uniprot, SNP, MSD, Vega, Sequence, WWW]
BioMart deployers • Large scale data federation (Hinxton) • Optimising access to a large database (Ensembl, WormBase, ArrayExpress) • Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
BioMart deployers • Large scale data federation (Hinxton) • Optimising access to a large database (Ensembl, WormBase) • Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
GMIA_SNP_mart_database (Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France) • Sample rows • SNP1 T/A AL13929 963253 1 • SNP2 C/T AL13929 963255 -1 • SNP3 C/G AL13929 963258 1 • Sources • dbSNP • HapMap • Ensembl • AceView • Vega • RefSeq • Example queries • Give me genotype and frequency data from HapMap • Give me SNP locations on gene/transcript • Give me frequency data from dbSNP • Give me frequency, genotype and location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega • Interfaces • Java graphical user interface • WWW web browser
BioMart model • Already applied • Ensembl • Vega • SNP • Uniprot • MSD • ArrayExpress • WormBase • Variety of ‘in house’ projects • In development • HapMap
Summary • BioMart interface • Batch queries • ‘Data mining’ • Large annotation • BioMart software • Set up your own database • Make your database scalable and responsive • Federate with other data
Where are we? • 0.2 released in February • 0.3 to be released in June • Platforms • MySQL • Oracle • Postgres
Acknowledgments • BioMart • Damian Smedley (EBI) • Darin London (EBI) • Will Spooner (CSHL) • Contributors • Arne Stabenau (Ensembl) • Andreas Kahari (Ensembl) • Craig Melsopp (Ensembl) • Katerina Tzouvara (Uniprot) • Paul Donlon (Unilever)