Data retrieval • BioMart • Export View • Data sets on FTP site • MySQL queries of databases • Perl API access to databases
Data Mining in Ensembl with EnsMart August 2005
Possible queries… • All genes from a candidate region • Genes with a particular protein domain • Members of a protein family • Genes associated with SNPs
More specific queries • Human genes with upstream regions conserved with respect to mouse • Upstream sequence for all Ensembl genes mapped to the U95A chip (similarly, complete genomic annotation of MG_U74) • Genomic location and description of all mouse, rat and fugu homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and have non-synonymous SNPs
Ensembl core database • Normalised • Each data point stored only once • Quick updates • Minimal storage requirements • But: • Many tables • Many joins for complicated queries • Slow for data mining questions
BioMart and EnsMart • Large-scale data retrieval tool • Query builder interface • Databases: Ensembl, SNP, Vega, (MSD, UniProt) • Associated features or sequences • Flexible output formats • http://www.ebi.ac.uk/biomart/ • http://www.ensembl.org/EnsMart/
Mart database • De-normalised • Tables with ‘redundant’ information • Query-optimised • Fast and flexible • Designed for data mining
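The contrast between the normalised core database and the de-normalised mart can be sketched with a toy example. This is a minimal illustration only: the table names, column names and records below are hypothetical and far simpler than the real Ensembl or Mart schemas, and SQLite stands in for MySQL so that the snippet runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout (hypothetical): each data point is stored only once,
# so the information is spread over several tables.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE protein_domain (transcript_id INTEGER, domain TEXT);
INSERT INTO gene VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO transcript VALUES (10, 1), (20, 2);
INSERT INTO protein_domain VALUES (10, 'kinase'), (20, 'zinc_finger');
""")

# "Genes with a particular protein domain" needs two joins here.
normalised_sql = """
SELECT g.name
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN protein_domain d ON d.transcript_id = t.transcript_id
WHERE d.domain = ?
ORDER BY g.name
"""
normalised_result = [row[0] for row in cur.execute(normalised_sql, ("kinase",))]

# De-normalised, mart-style layout (hypothetical): one wide table that
# repeats gene information on every row, built once ahead of query time.
cur.execute("""
CREATE TABLE gene_mart AS
SELECT g.gene_id, g.name, g.chromosome, t.transcript_id, d.domain
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN protein_domain d ON d.transcript_id = t.transcript_id
""")

# The same question is now a single-table lookup with no joins.
mart_sql = "SELECT name FROM gene_mart WHERE domain = ? ORDER BY name"
mart_result = [row[0] for row in cur.execute(mart_sql, ("kinase",))]

print(normalised_result, mart_result)
```

Both queries return the same gene; the difference is that the mart-style table pays the join cost once, when it is built, rather than on every query.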
Primary Data Sets • Ensembl genes • SNP (single nucleotide polymorphisms, deletion-insertion polymorphisms, short tandem repeats) • Vega genes • (MSD protein structures) • (UniProt proteomes)
Secondary Data Sets • Markers • Diseases • Gene ontology • Gene expression information • Homology predictions • Protein annotation
Information flow (diagram): start, filter, output • Start: SPECIES, FOCUS • Filter: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP • Output: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP; REFSEQ, EMBL, AFFY, SWISSPROT, GO, INTERPRO; FASTA, GTF, HTML, TEXT, EXCEL, FILE
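The start, filter, output flow can be sketched as a three-step function. This is only an illustration of the idea, not how EnsMart is implemented: the `DATASETS` dictionary, the `run_query` function and all records in it are hypothetical.

```python
import csv
import io

# Hypothetical in-memory "mart": (species, focus) selects a data set.
DATASETS = {
    ("human", "genes"): [
        {"gene": "geneA", "chromosome": "1", "domain": "kinase"},
        {"gene": "geneB", "chromosome": "2", "domain": "zinc_finger"},
        {"gene": "geneC", "chromosome": "1", "domain": "zinc_finger"},
    ],
}


def run_query(species, focus, filters, attributes):
    """Start (pick data set), filter (keep matching rows), output (format)."""
    # Start: choose the data set from species and focus.
    rows = DATASETS[(species, focus)]
    # Filter: keep only rows that satisfy every filter predicate.
    selected = [row for row in rows if all(check(row) for check in filters)]
    # Output: write the requested attributes as tab-separated text.
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(attributes)
    for row in selected:
        writer.writerow([row[name] for name in attributes])
    return buffer.getvalue()


result = run_query(
    species="human",
    focus="genes",
    filters=[
        lambda row: row["chromosome"] == "1",   # region filter
        lambda row: row["domain"] == "kinase",  # protein filter
    ],
    attributes=["gene", "chromosome"],
)
print(result)
```

Filters combine with a logical AND, which mirrors how the more specific queries earlier in the talk stack several conditions on one gene set.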
BioMart http://www.biomart.org/
Output formats • HTML
What about queries that are not possible in EnsMart? • Direct database access at: • ensembldb.ensembl.org • martdb.ebi.ac.uk • MySQL client • Download MySQL for Windows: http://www.winmysql.com/page4.html (file: wmysr11.zip)
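As a rough sketch of direct access, the snippet below only assembles a command line for the standard `mysql` client and prints it; it does not connect to anything. The `mysql_command` helper is hypothetical, and the `anonymous` user name is an assumption not stated on the slide. A port option may also be required, depending on the server.

```python
import shlex


def mysql_command(host, user="anonymous", sql="SHOW DATABASES"):
    """Build the argument list for the mysql command-line client.

    The default user name is an assumption; check the server's
    documentation for the account and port to use.
    """
    return ["mysql", "-h", host, "-u", user, "-e", sql]


cmd = mysql_command("ensembldb.ensembl.org")
command_line = shlex.join(cmd)  # shell-quoted form, for copy and paste
print(command_line)
```

Listing the databases first is a reasonable starting point, because the slide does not name a specific database to query.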
Access via Perl object API • Based on BioPerl • Ensembl modules • For an introduction, see the tutorial at: • http://www.ensembl.org/info/software/core/
There are other ways… • MartShell • Command-line interface to Mart, written in Java • Works with a Mart Query Language