Data Mining with BioMart
Simple and … Complex Queries
• Genes within a candidate region
• Gene products with a particular protein domain
• …
• Genomic location and description of all mouse and rat homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and are associated with non-synonymous SNPs
Ensembl Core Database
Relational database
• Normalised
• Each data point stored only once
Therefore:
• Quick updates
• Minimal storage requirements
But:
• Many tables
• Many joins for complicated queries
• Slow for data mining applications
BioMart Database
Data warehouse
• De-normalised
• Query-optimised
• Tables with apparent “redundancy”
Therefore:
• Fast and flexible
• Ideal for data mining
Produced from the normalised core databases at every new release
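The trade-off between the two layouts can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and an invented toy schema (the table names, column names and identifiers such as GENE_A are made up for illustration and are not the real Ensembl or BioMart schemas). The same question is answered first with joins over normalised tables, then with a single-table lookup on a pre-joined "mart" table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout (toy schema, NOT the real Ensembl core schema):
# every fact is stored exactly once, in its own table.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER, stable_id TEXT);
CREATE TABLE protein_domain (transcript_id INTEGER, domain TEXT);

INSERT INTO gene VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2');
INSERT INTO transcript VALUES (10, 1, 'TRANS_A1'), (11, 1, 'TRANS_A2'), (20, 2, 'TRANS_B1');
INSERT INTO protein_domain VALUES (10, 'transmembrane'), (20, 'kinase');
""")

# Answering the question on the normalised tables needs two joins.
normalised_rows = cur.execute("""
SELECT g.stable_id, t.stable_id
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN protein_domain d ON d.transcript_id = t.transcript_id
WHERE g.chromosome = '1' AND d.domain = 'transmembrane'
""").fetchall()

# De-normalised layout: the joins are done once, when the mart is built.
# Gene-level values are repeated on every transcript row ("redundancy").
cur.execute("""
CREATE TABLE gene_mart AS
SELECT g.stable_id AS gene_stable_id,
       g.chromosome AS chromosome,
       t.stable_id AS transcript_stable_id,
       d.domain AS domain
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
LEFT JOIN protein_domain d ON d.transcript_id = t.transcript_id
""")

# The same question is now a filter on a single table, with no joins.
mart_rows = cur.execute("""
SELECT gene_stable_id, transcript_stable_id
FROM gene_mart
WHERE chromosome = '1' AND domain = 'transmembrane'
""").fetchall()

# Number of mart rows that repeat the gene-level data for GENE_A.
gene_a_copies = cur.execute(
    "SELECT COUNT(*) FROM gene_mart WHERE gene_stable_id = 'GENE_A'"
).fetchone()[0]
```

Both queries return the same rows. The mart table stores the gene-level values once per transcript, which costs storage and makes updates slower, but a query becomes a plain filter on one table.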
BioMart
• Developed jointly by the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL)
• http://www.biomart.org/
Publicly available implementations at:
• Ensembl
• Central Server
• Dictybase
• Wormbase (WormMart)
• Gramene (GrameneMart)
• euGenes
• HapMap (HapMart)
• ZF-Models
Data Sets
Primary
• Ensembl Genes
• Vega Genes
• SNPs
Secondary
• Markers
• “Diseases”
• Gene ontology
• Gene expression information
• Homology predictions
• Protein annotation
Information Flow
[Diagram: three stages, START, FILTER and OUTPUT.
START: Database, Species.
FILTER and OUTPUT pages: Region, Gene, Expression, Homology, Protein, SNP.
Other labels in the diagram: Swiss-Prot, EMBL, RefSeq, GO, InterPro, Affymetrix; FASTA, GTF, HTML, Text, Excel, File.]
BioMart Example
• Find all Ensembl genes on the short arm of human chromosome 1 which are known to be associated with a disease
• Export the 100 bp upstream of the transcripts of the above genes
1. Select “Ensembl 38”
2. Select “Homo sapiens genes (NCBI36)”
3. Click “next”
4. Select “Chromosome 1”
5. Select “Band Start p36.33 – End p11.1”
6. Select “with Disease Association Only”
7. Click “next”
8. Select Attribute Page “Features”
9. Select “Ensembl Gene ID” and “Ensembl Transcript ID”
(Callout on slide: “Summary of actions”)
10. Select “Disease OMIM ID” and “Disease description”
11. Select Output format “MS Excel”
12. Click “export”
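Steps 1 to 12 amount to choosing a dataset, a set of filters and a set of attributes. As a rough sketch, the same selection can be written down as an XML document in the style of the query format used by BioMart web services. The slides only show the MartView web interface, so everything here is an assumption: the helper build_query is hypothetical, and the dataset, filter and attribute names are illustrative placeholders that would have to be checked against the configuration of the server actually used. The code only builds the XML string; it does not contact any server.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes, formatter="TSV"):
    """Build a BioMart-style XML query string (hypothetical helper).

    filters maps a filter name to a string value, or to True for a
    boolean "only" filter, written here as excluded="0" (an assumption
    about how such filters are encoded).
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if value is True:
            ET.SubElement(ds, "Filter", {"name": name, "excluded": "0"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are placeholders mirroring the choices made in steps 1-12.
query_xml = build_query(
    dataset="hsapiens_gene_ensembl",          # step 2: Homo sapiens genes
    filters={
        "chromosome_name": "1",               # step 4
        "band_start": "p36.33",               # step 5
        "band_end": "p11.1",                  # step 5
        "with_disease_association": True,     # step 6 (hypothetical name)
    },
    attributes=[
        "ensembl_gene_id",                    # step 9
        "ensembl_transcript_id",              # step 9
        "disease_omim_id",                    # step 10 (hypothetical name)
        "disease_description",                # step 10 (hypothetical name)
    ],
)
print(query_xml)
```

Writing the query as data makes the START, FILTER and OUTPUT stages explicit: the dataset corresponds to the start page, the filters to the filter page, and the attributes plus formatter to the output page.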
13. Select Attribute Page “Sequences”
14. Select “Flank (Transcript)”
15. Enter “Upstream flank 100”
16. Select Header information
17. Click “export”
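What "100 bp upstream of a transcript" means depends on the strand the transcript lies on. The following sketch shows the coordinate arithmetic only, assuming 1-based inclusive genomic coordinates with start less than or equal to end on both strands. The function upstream_flank is a hypothetical illustration, not part of BioMart, and it does not know the chromosome length, so a reverse-strand flank is not clipped at the chromosome end.

```python
def upstream_flank(transcript_start, transcript_end, strand, flank=100):
    """Return (start, end) of the upstream flank, or None if there is none.

    Coordinates are 1-based and inclusive, with start <= end regardless
    of strand. strand is 1 (forward) or -1 (reverse).
    """
    if strand == 1:
        # Forward strand: upstream lies before the transcript start.
        start = max(1, transcript_start - flank)
        end = transcript_start - 1
    elif strand == -1:
        # Reverse strand: the 5' end of the transcript is its highest
        # genomic coordinate, so upstream lies beyond transcript_end.
        start = transcript_end + 1
        end = transcript_end + flank
    else:
        raise ValueError("strand must be 1 or -1")
    if end < start:
        return None  # transcript begins at position 1: nothing upstream
    return start, end


forward = upstream_flank(1000, 2000, 1)    # (900, 999)
reverse = upstream_flank(1000, 2000, -1)   # (2001, 2100)
clipped = upstream_flank(50, 500, 1)       # (1, 49)
```

For a reverse-strand transcript, the bases in the returned interval would additionally be reverse-complemented so that the flank reads 5' to 3' relative to the transcript.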
There are other ways…
• MartShell
• Command line interface to Mart written in Java
• Mart Query Language
What about queries that are not possible in BioMart?
• MySQL queries on ensembldb.ensembl.org
• MySQL client
• Perl API
• BioPerl and Ensembl modules
• Java API
Q & A
Questions and Answers
Exercises
«The range and complexity of the questions you can address through the Ensembl MartView resource is truly impressive. We really encourage you to spend some time playing with it …»