230 likes | 313 Views
The Bookkeeping SQL API. Tim Adye Rutherford Appleton Laboratory Bookkeeping / Data Distribution Parallel BaBar Collaboration Meeting 8 th December 2004. Talk Plan. The problem “Why not just write SQL?” The BaBar SQL API User view: Perl classes and command-line tool Behind the scenes
E N D
The BookkeepingSQL API Tim Adye Rutherford Appleton Laboratory Bookkeeping / Data Distribution Parallel BaBar Collaboration Meeting 8th December 2004 Tim Adye
Talk Plan • The problem • “Why not just write SQL?” • The BaBar SQL API • User view: Perl classes and command-line tool • Behind the scenes • Table schema configuration classes • Summary of features • Could this be generalised to other applications? • Possible improvements • Comparison with other DBIx packages • Summary and references Tim Adye
User Access • Users need to query database to find out what data to process • May also need other information • eg. luminosity, run numbers, file sizes • Mostly select by dataset, but may need to limit further • eg. only data available locally, taken at peak energy, excluding some problem datataking period • Cannot expect users to know which tables to use, how to join them, or even the SQL syntax • Even worse if the schema change • Cannot expect developers to code for all combinations of queries with all possible selections • Previous ad hoc tools (some mine!) tried to do this and it was a nightmare, even for a simpler table structure Tim Adye
BaBar SQL API – user view • Each column that users might want to query or select on is given a unique logical name – regardless of which table it lives in • These names are used to specify query values $query->addValues('collection', 'gbytes'); and selections $query->addSelector('dataset', 'Dilepton-*'); $query->addSelector('run', '10000-19999'); • Different types of data allow for different selection syntax, eg. wildcards for names, or ranges for run numbers. • Can also use SQL expressions (in terms of logical names) $query->addValues('SUM(lumi)/1000'); and sorting, row limits, etc Each of these happens to be in a different table Tim Adye
SQL API – returning results • That’s enough to generate a valid SQL SELECT query. To return the results:- my $sta = $query->execute(); while (my $row = $sta->fetch()) { print $row->gbytes(), $row->collection(), "\n"; } • The $query object collects the user requests • $query->execute() returns a “statement accessor” (like a DBI statement handle). • $sta iterates over row objects, each of which has accessors for each query value, gbytes and collection. • That’s all there is to it! • After the usual DBI connect, and $query object instantiation (see later), these statements form a working program Tim Adye
Command-Line Tools • Standard BaBar tools use this API to create job configuration, create datasets, calculate luminosities, etc. • Standard tasks, but optionally allowing additional selections • Also provide an “expert” tool that allows access to full API functionality from the command line • This has proved very popular, with many “non-experts” making their own unique queries Tim Adye
Examples $ BbkUser --dataset=A0-Run4-OnPeak-R14 \ --is_local=1 --file_status=0 \ dse_lumi events gbytes file \ --style=adye --display DSE_LUMI EVENTS GBYTES FILE ======== ====== ====== ==================================================== 1250.3 526115 1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root 1250.3 526115 0.8 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root 1348.4 576239 1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root 1348.4 576239 1.0 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root ... 156 rows returned $ BbkUser –-collection-file=coll.lis \ tot_gbytes collection Tim Adye
What happens behind your back • The SQL API • translates the logical names to table columns • selects the required tables and joins • including otherwise unused tables required for the joins • generates and executes a valid SQL SELECT statement • creates a statement accessor object • dynamically generates a class for the row objects with accessors for each query value Tim Adye
BbkUser --dataset=A0-Run4-OnPeak-R14 --is_local=1 --file_status=0 dse_lumi events gbytes file Our Example • That first BbkUser command involved 5 tables • including one that provides the join between dataset and collection tables SELECT dse.lumi_sum AS "dse_lumi", dse.output_nev AS "events", file.bytes, dse.name AS "collection", file.suffix AS "file_suffix", ds.id AS "ds_id", dse.id AS "dse_id", dtd.id AS "dtd_id", dtd.link_status FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile, bbk_files file, bbk_dstodse dtd WHERE ds.id=dtd.ds_id AND dtd.dse_id=dse.id AND dse.id=file.dse_id AND file.id=dfile.file_id AND ds.name='A0-Run4-OnPeak-R14' AND dse.is_local='1' AND dfile.status='0'; The SQL API can even pretty-print it like this for you (What’s shown here is somewhat abbreviated: actual command includes full database and table names in case of ambiguities) Tim Adye
Table schema configuration classes • Mapping between logical names and table columns is defined in the configuration classes. • One class per table • Can also define special properties of each column (eg. whether to allow ranges (“100-199”) for selection). • Possible joins between tables defined here too • Use logical column names for join conditions, so one table class does not need to know about column names in other classes. • In most cases it’s just a matter of listing logical vs. column names • with a little Perl syntactic sugar • Inheritence of config classes expresses commonalities • eg. common id and created, and modified columns Tim Adye
Example Table Configuration sub table { return 'bbk_files' } sub tableConfig { return { alias => 'file', columns => [ bytes => 'bytes‘, uuid => 'uuid', checksum => 'checksum', file_suffix => 'suffix', nfiles => 'COUNT(DISTINCT file_id)', gbytes => '(bytes/1073741824)', tot_gbytes => 'SUM(bytes)/1073741824', file_dse_id => { column => 'dse_id', selectorType => 'range' }, file => { valueAction => 'addLfnValue', selectorAction => 'lfnSelector' }, ], joins => [ dse_id => 'file_dse_id', file_id => 'dfile_file_id', ], }} Tim Adye
Putting it all together • Configuration classes must be registered with $query object my $query = new BbkSqlSelect($bbkconfig); $query->addModules(new MyTableClass($bbkconfig)); but of course it is usually simpler to provide a $query object pre-registered with all the table configs as part of a specific API. Tim Adye
Overriding and Synthetic Columns • A crutial advantage of this system is that it allows us to override the default behaviour • Allows us to hide complexities from user • Make even complex schema changes transparent to users • A logical name can refer to • ordinary database column name • SQL expression (in terms of database columns, or other logical names) • Perl method to pre- or post-process selection or query value • “synthetic” query value or selection • can return calculated value or alter behaviour • Global post-processing • Can be triggered by value, selection, or table inclusion • Allows global filtering of returned rows Tim Adye
What happens behind your back 2 • We already used some of these features without noticing! • Dataset names can be found in the bbk_dataset or the bbk_aliases table • Requires a check and translation using the alias table • Datasets can evolve with time, with collection being added or removed • Need to query dataset for any time in the past, or use tagged dataset alias (like a CVS tag) • Implemented by automatically including date selection in query, and post-processing returned results to remove deleted collections • File names are made from a collection name + a suffix • $query->addSelector('file') splits the file name for the query and the $row->file() accessor rejoins them Tim Adye
Features • Supports Oracle 8 and MySQL 3.23 • Most queries that can be expressed in both these dialects can be expressed by users via the API – without breaking the paradigm of a flat namespace • aggregatation and grouping • sorting and distinct • MySQL’s LIMIT emulated in Oracle • inner and outer joins (generates Oracle or MySQL syntax) • Does not support UNION or subqueries • Could be added, but not in MySQL 3.23 • Convenience features • automatic Getopt specification • query results display formatting and summary table generation • Configuration class summary table generation Tim Adye
Limitations • Assumes tables can be joined in a unique way • ie. the joins form an acyclic graph • can still select different joins with explicit switches • Each column must have its own unique logical name • This is usually a good thing • but if the same data is held in different columns, it would be more efficient to automatically select from tables that are already included Tim Adye
A public version • Current version has a few BaBar-specific pieces • BaBar Connection/Configuration manager – can use DBI directly • BaBar Options manager – can use Getopt directly • BaBar base objects – borrow required methods • BaBar table formatting class – publish this too … otherwise just uses standard Perl modules but with different table configs could be used elsewhere • Already do this in BaBar – used for QA and TM databases • Maybe I’m making some other assumptions that are true of our database and requirements, but not more generally so. I can’t think of any. • Needs a better name! • This is really an SQL API creator DBIx::SqlAbstractor??? Tim Adye
Possible improvements • Tidy up code! • User and config APIs are OK, but in between it’s pretty ugly • Separate functionality that can be used on its own • Already true of the DBI statement accessor class • More SQL dialects: PostgreSQL, MS SQL? • New SQL syntax: subqueries, UNIONs,… • INSERT, UPDATE, etc • these don’t need joins, so hand-coding not such a problem • Automatic selection of different join possibilities • Automatic generation of default table classes from SQL schema • Could use “The SQL Fairy” • though not much work to do it by hand Tim Adye
Why not use another package? • More than 100 DBIx and other SQL access packages in CPAN • Could not find any that do all (or even most) of • hide table structure from user • allow multi-table queries, taking care of joins automatically • do not impose their own conventions on table schema • allow query values and selections to be overridden • allow transparent post-processing of query results • provide accessor functions for query results • I believe that taken together these features provide a clear and easy to use abstraction Tim Adye
Feedback and Discussion • Would this be useful outside BaBar? • Is it a good idea to make a public release? • eg. on CPAN • Does it need any improvements? • New features • Make it compatible with some other standards • eg. sit on top of another abstraction like DBIx::Table • A better name! Tim Adye
References • BaBar Bookkeeping project http://slac.stanford.edu/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/ • BaBar Bookkeeping presentation and paper http://indico.cern.ch/contributionDisplay.py?contribId=338&sessionId=7&confId=0 D.A. Smith et al.,BaBar Book Keeping project –a distributed meta-data catalog of the BaBar event store,Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04). • CPAN Database Interfaces (see particularly DBIx) http://cpan.uwinnipeg.ca/chapter/Database_Interfaces Tim Adye