The Gedeon Project: Data, Metadata and Databases
Yves DENNEULIN
LIG laboratory, Grenoble
Laboratoire LIP6
ACI MD

Context and goals
• Heterogeneous metadata management on grids
• Clusters of clusters
• High-level queries using metadata
• Easy and flexible deployment and configuration
• Minimal overhead
• Various interfaces
• Initial target application domains
  • Biocomputing (lots of metadata, little data)
  • Microscopic imaging (lots of data, little metadata)

The Gedeon middleware
• Metadata management on lightweight grids
  • Records of (attribute, value) pairs stored in files
• Flexible requests
  • Can be combined through scripting
• Various interfaces
  • Command line (tools)
  • Libraries
  • Virtual FS (legacy applications support)
• Deployment “à la carte”
  • Composition of various data sources
• Performance
  • Dedicated I/O library
  • Semantic caching

Outline
• General architecture
• Gedeon internal structure
• Composition of various data sources
• Practical use
• « Dual » cache
• Conclusion

Example of a deployment
[Figure: a client with a query interface (API, FS, GUI, ...), a cache and a local proxy; cache servers « close » to the client; storage sites, each with a local proxy and caches; the parts are linked by the interconnect middleware.]

Gedeon components
[Figure: component stack showing application, lowerG, vSGF, fuple and network, plus a local proxy with a cache and lowerG.]
• Gedeon kernel
  • fuple
    • I/O library
    • Evaluates the queries
  • lowerG
    • Operators to compose bases
    • Remote access
• Interface
  • API lowerG
  • Virtual FS
• Cache

What is inside the sources?
• Records of attribute/value pairs
Record:
  Id 457
  classifA Bacteria
  classifB Clostridia
  taille 26
  ref

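A record of this kind maps naturally onto a dictionary. The sketch below assumes a hypothetical text layout (one "attribute value" pair per line, records separated by blank lines); the actual Gedeon file format is not described in the slides, and `parse_records` is an illustrative name.

```python
def parse_records(text):
    """Parse blank-line-separated records made of 'attribute value' lines.

    The layout is an assumption made for illustration only.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record being built, if any.
            if current:
                records.append(current)
                current = {}
            continue
        attr, _, value = line.partition(" ")
        current[attr] = value.strip()
    if current:
        records.append(current)
    return records


SAMPLE = """\
Id 457
classifA Bacteria
classifB Clostridia
taille 26

Id 459
classifB Bacteria
taille 47
"""

records = parse_records(SAMPLE)
print(records[0]["classifB"])
```

Values are kept as strings here; typing is left to the operators that consume the records.
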
Example of composition of sources
[Figure: a client querying sources on sites S1, S2 and S3, composed with the union (+), join (J) and round-robin (RR) operators.]
• Metadata can be local or copies

Union (+)
[Figure: two sources, one holding records A1, A2, A3, A4, ... and the other holding records B1, B2, B3, B4, ..., merged by the + operator into a single source that contains all records.]
• Unifies the storage space
• Allows parallel evaluation

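The union operator can be pictured as follows. This is a minimal sketch, not Gedeon's implementation: sources are plain in-memory lists, `union` is a hypothetical name, and a thread pool stands in for the parallel evaluation on separate sites.

```python
from concurrent.futures import ThreadPoolExecutor


def union(sources, predicate):
    """Evaluate `predicate` on every source in parallel and merge the hits.

    Each source is a plain list of records here; in a real deployment it
    would be a remote base. All names are hypothetical.
    """
    def evaluate(source):
        return [rec for rec in source if predicate(rec)]

    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # map() yields the partial results in source order.
        partial_results = list(pool.map(evaluate, sources))
    return [rec for part in partial_results for rec in part]


source_a = [{"Id": "A1"}, {"Id": "A2"}]
source_b = [{"Id": "B1"}, {"Id": "B2"}]
merged = union([source_a, source_b], lambda rec: True)
print([rec["Id"] for rec in merged])
```
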
Round robin: fault tolerance
[Figure: one client reaches Source 1 and Source 2 through the RR operator.]

Round robin: load balancing
[Figure: two clients reach Source 1 and Source 2 through the RR operator.]

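Both uses of the round-robin operator can be sketched with one small class. The code assumes that sources are interchangeable callables and that a failing source raises `ConnectionError`; the class name and this failure model are illustrative assumptions, not part of Gedeon.

```python
import itertools


class RoundRobin:
    """Rotate requests over equivalent sources and skip sources that fail."""

    def __init__(self, sources):
        self._sources = list(sources)
        self._next = itertools.cycle(range(len(self._sources)))

    def query(self, request):
        last_error = None
        # Try each source at most once, starting from the next in rotation.
        for _ in range(len(self._sources)):
            source = self._sources[next(self._next)]
            try:
                return source(request)
            except ConnectionError as error:
                last_error = error
        raise RuntimeError("all sources failed") from last_error


def source_1(request):
    return ("source 1", request)


def source_2(request):
    raise ConnectionError("source 2 is down")


# Load balancing: healthy sources are used in turn.
balanced = RoundRobin([lambda request: "s1", lambda request: "s2"])
turns = [balanced.query("q") for _ in range(4)]

# Fault tolerance: the failing source is skipped transparently.
tolerant = RoundRobin([source_1, source_2])
answers = [tolerant.query("q")[0] for _ in range(3)]
```
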
Join operator (J)
[Figure: records Id 457 and Id 458 exist in two sources, one holding attributes A1, A2, A3 (values v1 to v6) and the other holding attribute An (values vAn1, vAn2); the J operator matches them on Id and produces records that carry A1, A2, A3 and An.]
• Enriches a source with another

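The join on Id can be sketched as below. This is only an illustration of the idea shown on the slide: the function name, the in-memory lists and the rule that the primary source wins on conflicting attributes are assumptions.

```python
def join(primary, secondary, key="Id"):
    """Enrich each record of `primary` with attributes from `secondary`.

    Records are matched on `key`; attributes already present in the primary
    record are kept (an assumption, the slides do not say how conflicts
    are resolved).
    """
    by_key = {rec[key]: rec for rec in secondary if key in rec}
    joined = []
    for rec in primary:
        merged = dict(rec)
        for attr, value in by_key.get(rec.get(key), {}).items():
            merged.setdefault(attr, value)
        joined.append(merged)
    return joined


primary = [
    {"Id": "457", "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": "458", "A1": "v4", "A2": "v5", "A3": "v6"},
]
secondary = [
    {"Id": "457", "An": "vAn1"},
    {"Id": "458", "An": "vAn2"},
]
enriched = join(primary, secondary)
print(enriched[0])
```
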
Tools 1/2
• Libraries
• CLI
• Operations
  • sort
  • projection
  • select
  • index
  • ...

Tools 2/2
• Examples
  • sort
    $> cat mesmeta.g | fsort 'taille' > trie_taille.g
    sort(attr='taille')
  • index
    create_idx(attr='Id')
    search_idx('Id', 'P0123')
    [Figure: both index calls work on the index file .Id.idx.]

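The sort and index operations can be approximated in a few lines. The sketch borrows the names `create_idx` and `search_idx` from the slide, but the signatures, the in-memory index (instead of a `.Id.idx` file) and the sample records other than the Id 'P0123' are assumptions made for illustration.

```python
def sort_records(records, attr):
    """Sort by `attr`, numerically when every value parses as a number.

    Records without `attr` are dropped, which is an assumption.
    """
    present = [rec for rec in records if attr in rec]
    try:
        return sorted(present, key=lambda rec: float(rec[attr]))
    except ValueError:
        return sorted(present, key=lambda rec: rec[attr])


def create_idx(records, attr):
    """Map each value of `attr` to the positions of the records holding it."""
    index = {}
    for position, rec in enumerate(records):
        if attr in rec:
            index.setdefault(rec[attr], []).append(position)
    return index


def search_idx(records, index, value):
    """Return the records whose indexed attribute equals `value`."""
    return [records[pos] for pos in index.get(value, [])]


records = [
    {"Id": "P0123", "taille": "47"},
    {"Id": "P0456", "taille": "26"},   # hypothetical sample record
    {"Id": "P0789", "taille": "110"},  # hypothetical sample record
]
by_size = sort_records(records, "taille")
id_index = create_idx(records, "Id")
found = search_idx(records, id_index, "P0123")
```

The numeric fallback matters for attributes such as taille: a plain string sort would place "110" before "26".
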
Language for the requests
• Simple: attributes are referenced with $, and the operators control the types
• Regular expressions
• Second-order expressions

Select expression
Input records:
  Id 457, classifA Bacteria, classifB Clostridia, taille 26
  Id 459, classifB Bacteria, taille 47
  Id 460, classifA Fermicutes
Select $Id>459
Result:
  Id 460, classifA Fermicutes

Select using regexp
Input records:
  Id 457, classifA Bacteria, classifB Clostridia, taille 26
  Id 459, classifB Bacteria, taille 47
  Id 460, classifA Fermicutes
Select $classifB==/.*a$/
Result:
  Id 457, classifA Bacteria, classifB Clostridia, taille 26
  Id 459, classifB Bacteria, taille 47

Select using 2nd order logic
Input records:
  Id 457, classifA Bacteria, classifB Clostridia, taille 26
  Id 459, classifB Bacteria, taille 47
  Id 460, classifA Fermicutes
Select $/classif[AB]/==Bacteria && $taille>=36
Result:
  Id 459, classifB Bacteria, taille 47

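The three select forms (typed comparison, regular expression on a value, regular expression on the attribute name) can be imitated with a small evaluator. This is a sketch under stated assumptions: it does not parse the Gedeon request syntax; each condition is passed as a hypothetical `(attr_pattern, op, expected)` tuple, and all conditions are combined with a logical and.

```python
import operator
import re

OPS = {"==": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}


def matches(record, attr_pattern, op, expected):
    """True if an attribute whose name matches `attr_pattern` passes the test."""
    for name, value in record.items():
        if not re.fullmatch(attr_pattern, name):
            continue
        if isinstance(expected, re.Pattern):
            # Regular expression on the value, as in $classifB==/.*a$/
            if op == "==" and expected.search(value):
                return True
        elif isinstance(expected, (int, float)):
            try:
                if OPS[op](float(value), expected):
                    return True
            except ValueError:
                pass  # non-numeric value: the typed comparison fails
        elif OPS[op](value, expected):
            return True
    return False


def select(records, *conditions):
    """Keep the records that satisfy every condition."""
    return [rec for rec in records
            if all(matches(rec, *cond) for cond in conditions)]


records = [
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
]

by_id = select(records, ("Id", ">", 459))
by_regexp = select(records, ("classifB", "==", re.compile(r".*a$")))
second_order = select(records, ("classif[AB]", "==", "Bacteria"), ("taille", ">=", 36))
```

Treating the attribute name as a pattern in every condition keeps the code short: a plain name such as `Id` is simply a pattern that matches only itself.
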
Virtual FS interface
• Just a specific file-oriented interface
• Data and metadata can be anywhere in the grid
• Definition of logical directories
  • Ex: cd '$classifB==|.*a$|'
  • « and » between directories
• 1 file name = value of a metadata attribute: logical view
/fs_virt/$classifB==|.*a$|> ls
457 459
/fs_virt/$classifB==|.*a$|> cat * > /tmp/mater
/fs_virt/$classifB==|.*a$|>

Dual cache (1)
• 2 cooperative caches
  • Cache of requests (R, {id, ...}) -> saves computing power
  • Cache of data (id, {attr, ...}) -> saves bandwidth
• Semantic cache
  • Can evaluate a query using the data in the cache
  • Can generate a remainder to complement the cached data

Example
• Refinement of a request
  • '$OC==/Eukaryota/' -> (R, Lid={id1, id2, ...})
  • '$OC==/Eukaryota/ && $year>=1998' -> Select(*Lid, '$year>=1998')

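The refinement above can be sketched with two dictionaries, one per cache. The class, its method names and the `fetch_ids` / `fetch_record` callbacks are hypothetical stand-ins for the servers; only the refinement case is covered, not the generation of a remainder query.

```python
class DualCache:
    """Request cache (request -> ids) plus data cache (id -> record)."""

    def __init__(self, fetch_ids, fetch_record):
        self._fetch_ids = fetch_ids        # request -> ids, evaluated by a server
        self._fetch_record = fetch_record  # id -> record, transferred from a server
        self.queries = {}
        self.data = {}

    def ids_for(self, request):
        if request not in self.queries:
            self.queries[request] = list(self._fetch_ids(request))
        return self.queries[request]

    def record(self, rec_id):
        if rec_id not in self.data:
            self.data[rec_id] = self._fetch_record(rec_id)
        return self.data[rec_id]

    def refine(self, cached_request, predicate):
        """Evaluate `cached_request && predicate` over the cached id list only."""
        return [rec_id for rec_id in self.ids_for(cached_request)
                if predicate(self.record(rec_id))]


# Hypothetical server content and answers, for illustration only.
BASE = {
    "id1": {"OC": "Eukaryota", "year": 1995},
    "id2": {"OC": "Eukaryota", "year": 2001},
    "id3": {"OC": "Archaea", "year": 1999},
}
SERVER_ANSWERS = {"$OC==/Eukaryota/": ["id1", "id2"]}
server_calls = []


def fetch_ids(request):
    server_calls.append(request)
    return SERVER_ANSWERS[request]


cache = DualCache(fetch_ids, BASE.__getitem__)
first = cache.ids_for("$OC==/Eukaryota/")
refined = cache.refine("$OC==/Eukaryota/", lambda rec: rec["year"] >= 1998)
```

In this sketch the refined request costs no new server-side evaluation: the cached id list is filtered locally, and each record is transferred at most once.
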
Dual cache (2)
• Distributed semantic cache
  • Typically used inside communities
    • Lots of common requests
  • No location constraints
    • Members of the community can be geographically scattered
• Distributed data cache
  • Minimizes time and data transfer
  • Cooperation between topologically close sites

Dual cache (3)
[Figure: servers in Rennes and Grenoble share a dual cache; the query cache follows semantic locality (communities Archaea and Eukaryota) and the object cache follows geographic locality.]

Dual cache (4)
• Work in progress on the notion of distance
  • Find geographical proximity
  • Find common interests between communities
• Create hybrid communities based on their requests
• Could be used to change the cache parameters
  • Manual and/or automatic

Conclusion
• A data integration middleware
  • Handling of metadata
  • Distributed and modular
  • Deployment can be done according to architectural/organisational constraints
• Definition of a dual cache infrastructure
  • Reflect both organisational use
• Prototype in use
  • Packaging and documentation needed