EnsMart: A Generic System for Fast and Flexible Access to Biological Data. Arek Kasprzyk et al. (2004), Genome Research 14:160-169. EBI, Wellcome Trust.
Objectives • Understand the idea of a “Data Mart” • Understand why this idea is useful to biology • Have an idea of how EnsMart works. • Assess the significance of the EnsMart system. Will it last?
Data Mart defined • A database, potentially derived from many other databases, whose primary purpose is query processing and report generation for non-technical users. • Similar to a “Data Warehouse”. • Marts and warehouses are important components of “decision support systems” in business.
Data Mart in EnsMart • Data collected • Standardized • Query Optimized • Presented to Users
Marts – benefits • Allows good division of labor • Computers for transactions are separate from computers for queries. • Interface development is separate from database development. • Biologists (can be) separated from computer scientists as a result of good interface design. • Produces a faster, more stable system for users.
Costs • Construction of the Mart is a challenging and continuous process. • New sources of data need to be incorporated and validated constantly • Trust
The case for EnsMart: why now? • Growing number of different databases and opportunities: genomes, expression, protein, disease… • Assembled, high-quality genomes are available. • “Finished” genomes can be used as references to link data from different databases consistently. • EnsMart was built to take advantage of these opportunities for cross-database queries.
Inside EnsMart • 9 organisms • At least 17 different primary sources of data, many with multiple databases. • 2 kinds of “Foci” • Genes (Ensembl, EST, Vega) • SNPs
EnsMart schema • [Figure: one central focus table linked one-to-many to several satellite tables]
Schema -> Query Speed • “Central” tables, or foci, contain a binary value for each satellite table indicating whether that satellite holds data for the row. The first step in query generation uses these values to limit the range of satellite tables accessed. • These values are only useful in the query process (they cost extra space and time in transactions). • As a result, many queries may not require access to satellite tables at all.
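The existence-flag idea above can be tried out in a few lines. What follows is a minimal sketch, not the actual EnsMart schema: the table names (`gene_main`, `gene_disease`, `gene_snp`), the flag columns (`has_disease`, `has_snp`), and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the real database server. It assumes the flags are precomputed once when the mart is built, so that a "which genes have disease data?" query can be answered from the central table alone.

```python
import sqlite3

# Hypothetical mapping from a satellite name to its flag column in the focus table.
FLAG_COLUMNS = {"disease": "has_disease", "snp": "has_snp"}


def build_demo_mart():
    """Create a toy mart: one central (focus) table plus two satellite tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene_main (
            gene_id     INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            chromosome  TEXT NOT NULL,
            has_disease INTEGER NOT NULL DEFAULT 0,  -- 1 if gene_disease has rows for this gene
            has_snp     INTEGER NOT NULL DEFAULT 0   -- 1 if gene_snp has rows for this gene
        );
        CREATE TABLE gene_disease (gene_id INTEGER, disease TEXT);
        CREATE TABLE gene_snp     (gene_id INTEGER, snp_id  TEXT);
    """)
    conn.executemany(
        "INSERT INTO gene_main (gene_id, name, chromosome) VALUES (?, ?, ?)",
        [(1, "geneA", "1"), (2, "geneB", "1"), (3, "geneC", "2")],
    )
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", [(1, "disease X")])
    conn.executemany(
        "INSERT INTO gene_snp VALUES (?, ?)",
        [(1, "snp1"), (1, "snp2"), (3, "snp3")],
    )
    # Precompute the existence flags once, at mart-build time.
    conn.execute("""
        UPDATE gene_main SET
            has_disease = EXISTS (SELECT 1 FROM gene_disease d WHERE d.gene_id = gene_main.gene_id),
            has_snp     = EXISTS (SELECT 1 FROM gene_snp s WHERE s.gene_id = gene_main.gene_id)
    """)
    conn.commit()
    return conn


def genes_flagged(conn, satellite, chromosome):
    """Filter on a flag only: answered from the central table, no join needed."""
    column = FLAG_COLUMNS[satellite]  # whitelist lookup, so the column name is safe to interpolate
    rows = conn.execute(
        f"SELECT name FROM gene_main WHERE chromosome = ? AND {column} = 1 ORDER BY name",
        (chromosome,),
    )
    return [row[0] for row in rows]


def diseases_for_chromosome(conn, chromosome):
    """Fetch satellite attributes: the flag narrows the central rows before the join."""
    rows = conn.execute(
        """
        SELECT m.name, d.disease
        FROM gene_main m
        JOIN gene_disease d ON d.gene_id = m.gene_id
        WHERE m.chromosome = ? AND m.has_disease = 1
        ORDER BY m.name, d.disease
        """,
        (chromosome,),
    )
    return rows.fetchall()
```

Calling `genes_flagged(build_demo_mart(), "disease", "1")` touches only `gene_main`; the satellite table is read only when its attributes are actually requested, as in `diseases_for_chromosome`. The trade-off matches the slide: the flag columns take extra space and must be recomputed whenever satellite data changes, which is acceptable for a query-only mart.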
User Interfaces • Supposedly Confucian quote • "What I hear I forget. • What I see I remember. • What I do I understand."
User Interfaces • MartView: website, “wizard” query construction. • MartExplorer: standalone tool, tree-based query construction. • MartShell: text-based application that uses an SQL-like query language. Can be used interactively or in batch processes. • Write your own, using the MartLib Java library.
MartView 1 • Choose organism and focus
MartView 2 • Design query
Conclusions • Powerful query system for biologists. • Useful framework for software engineers. • All open source! • What about other loci such as repetitive elements? • Data validation? • Annotation updates?
EnsMart Discussion • What, if any, are the problems with the foci system? • What alternatives to this system exist? • Describe a task that EnsMart could be used to accomplish. • Describe any personal experiences with EnsMart.