Alice detector databases – architecture and data structures

Alice detector databases – architecture and data structures Wiktor Perytwperyt@cern.ch Tomasz Traczykttraczyk@ia.pw.edu.pl

Central database Placed at CERN(temporarily placed at WUT) Plays role of central repository Contains central inventory of components copies of data from laboratories metadata, e.g. dictionaries Satellite databases Placed in laboratories-participants Contain source data produced at laboratories delivered by manufacturers working copies of data from central repository partial copies of metadata (read only) Communication Passing messages in XML Mainly off-line (batch processing) No satellite-satellite communication! Request-response model (like in HTTP) only satellite database can initiate communication Central database XML Satellite databases Alice detector databases – architecture Alice DB

Advantages Labs can input data almost independently even if communication problems occur Data from different satellite databases can be integrated All the data is available from one site it’s easier to create applications access rights can be managed in easier way Data can be restored from central database even if local database is broken It’s possible to avoid satellite-to-satellite data transfers At the end of the production phase all the data will be stored in one database – no further integration is necessary Disadvantages Data must be transferred between central and satellite databases Data from satellite databases must be integrated in central database each time it is uploaded from satellite databases Data integrity is difficult to preserve copies must be properly updated if the same data is changed concurrently in more than one database, a conflict can occur, which must be detected and solved Central + satellite architecture – pros and contras Alice DB

Central database Usage characteristics very large data volume many concurrent users twenty-four-hour utilisation critical data (breakdown may be very costly) Necessary features transaction processing procedural constraints (triggers) 7 × 24 availability high reliability continuous administration by professional DBA team on-line backup data partitioning Proper solution high-end commercial DBMS software is necessary Satellite databases Usage characteristics average or small data volume only few concurrent users used only few hours per day not critical data (can be restored from backup or from central database) Necessary features transaction processing procedural constraints (triggers) should not need complex administration (no trained personnel at labs-participants) should be monitored remotely by developers off-line backup procedures are sufficient data may not need to be partitioned should not be costly! Proper solution open-source database can be used Central database versus satellite databases Central database versus satellite databases Alice DB

Central database Oracle RDBMS Advantages very stable and reliable support for transaction processing built-in procedural language triggers support for complex data types and BLOBs support for VLDB (very large databases), e.g. data partitioning 7 × 24 availability (on-line backup, etc.) Disadvantages quite expensive complex and difficult to administer Satellite databases PostgreSQL Advantages free of charge quite easy to administer stable enough support for transaction processing built-in procedural languages triggers support for complex data types and BLOB objects Disadvantages not very fast (but fast enough for this particular application) no support for distributed processing (data replication, etc.) no support for heterogeneous systems no support for VLDB no 7 × 24 availability Proposed technology Alice DB

Problems Centrally maintained data (e.g. dictionaries) must be replicated to satellite databases Data entered/edited in satellite databases must be copied to central repository interchanged between satellite databases System is heterogeneous no built-in support for distribution, data replication, etc. no support for direct data interchange (gateways) Proposed solution XML-based data interchange data is exported to XML documents XML documents are sent over the network data is imported from XML into database structures Advantages XML is very flexible any data structures can be represented XML documents have legible format contain metadata easy to debug can even be written/changed by human XML can be written and read (parsed) by standard (and free) software tools (XML parsers) XML documents can be transformed by standard software (XSL processors) and presented using standard tools (e.g. HTML browsers) Disadvantages XML documents are quite big should be sent compressed Data interchange – technology Alice DB

Central database Central inventory of components should contain actual location and status of each components history of each component Physical flow of components Creation of new component must be signalled to central inventory a globally unique identifier for each component is generated by satellite db and registered in central inventory Arrival and departure of each component to/from any laboratory must be recorded in the inventory Destruction of the component must also be signalled to the inventory Solution Components’ data should be “checked-out” and “checked-in” from/to central inventory before/after they are processed by satellite databases Messages are passed as XML documents Satellite databases 1. Creation Central database 2. Check-in 3. Check-out 4. Destruction Monitoring of components Alice DB

Problems Lab may need to modify components’ data even if it does not have the component itself Data transfer from satellite database to some auxiliary database (e.g. on notebook) may be required To avoid possible conflicts we should not allow concurrent modifications of the same data in multiple satellite databases – some kind of centralised locks is required Temporary communication problems should not stop the work There is a danger of deadlocks(e.g. when one lab locks data set A and the second lab locks B, then the first lab requests B, and the second lab requests A) Data transfer from/to satellite database should be complete: successfully finished, or completely cancelled Proposed solution Check-in / check-out of the data must be separated from check-in / check-out of the physical component data is checked in/out to/from central repository when the transfer of read-write access rights is necessary component is registered/unregistered in satellite databaseand lab when it is physically transferred from one lab to another There must be some way to override check in/out and to force modifications of not-checked-out data conflicts resolution must occur after changes in forced mode are finished There must be some kind of deadlock detection and resolution Some kind of distributed transaction processing with ”two phase commit” is necessary Components’ flow and data flow Alice DB

Main assumptions Satellite databases will work in co-operation with central database The same data structure should be used for all labs all detectors and component types History of data changes is recorded only if absolutely necessary (in central database changes of data are traced in more detailed way) Series of test results should be stored in efficient but easy-to-query way Proposed solution Generic data structure should be used Generic “core” application has been created with possibility to create some more specialised application modules for particular labs, if necessary Natural primary keys (identifiers) are preferred alphanumeric codes used in most cases lead to more natural queries but compound keys must be used artificial (numeric) identifiers used only when no natural key can be supplied Design of primary keys enables the data to be easily integrated into central database (no key conflicts should occur) Complex object-relational data types (“nested” tables) are used to store series of results Structure of Satellite Database Alice DB

“Typical” structure Separate column for each property Separate set of tables for each type of objects (having different set of properties) Generic structure Dictionary-based dictionaries of object types, properties, etc. One generic set of tables for all objects Type # Id * Name ... Object # Id * Name ... Object type A# Id * Property A1 * Property A2 * Property A3 ... Object type B# Id * Property B1 * Property B2 * Property B3 ... Object type C# Id * Property C1 * Property C2 * Property C3 ... Propertyvalue * Value Propertydefinition # Id * Name* Optional? * Data type ... Generic structures Alice DB

Generic structure Dictionary-based dictionaries of object types, properties, etc. One generic set of tables for all objects “Typical” structure Separate column for each property Separate set of tables for each type of objects (having different set of properties) Type # Id * Name ... Object # Id * Name ... Object type A# Id * Property A1 * Property A2 * Property A3 ... Object type B# Id * Property B1 * Property B2 * Property B3 ... Object type C# Id * Property C1 * Property C2 * Property C3 ... Propertyvalue * Value Propertydefinition # Id * Name* Optional? * Data type ... Generic structures Entity (~ table) Attribute (~ column) ––– Obligatory relationship - - - Optional relationship “One to many” relationship “Many to many” relationship # Unique identifier (~ key) Alice DB

Advantages The same structure can be used in all labs it is therefore much easier to maintain The same generic application can be used on top of all the databases Structure of the central database will be very similar to the structure of satellite ones it’s easy to understand the relationships between central and satellite data no complicated translations between satellite and central structures are required it’s easier to introduce changes to both structures The structure is flexible new types of components can be introduced with no need to change the database structure The structure is quite simple less than 20 tables quite easy to understand and to memorise Disadvantages The structure is not so straightforward There may be more performance problems comparing to not-generic structures(in our case the performance of the generic structure has been tested and proved to be sufficient) The data dictionaries must be created and distributed to satellite sites Procedural constraints (triggers) are necessary to enforce data integrity (e.g. proper types of values) The generic application is a bit more difficult to create than “normal” one (but we have to create and to maintain only one universal application instead of many different specialised applications) Generic structures – pros and contras Alice DB

Satellite Database – data structure Alice DB

Satellite Database – data structure • Dictionaries • Maintained in central repository • Supplied from central database to labs • Read-only in satellite databases Alice DB

Satellite Database – data structure • Components’ data • Created and updated in satellite databases at labs • Copied to central repository Alice DB

Component Identified by globally unique internal numerical Id (contains database prefix and locally unique sequence number) user code (unique when concatenated with creator’s database code) serial number (probably unique for specific component type) Component type Identified by globally unique type code and name Each component belongs to exactly one component type Component cannot change its type(the relationship is not transferable) Manufacturer Each component has one manufacturer(lab-participant or external manufacturer) Group of components Auxiliary grouping of components Data structure – explanation Not transferable relationship Alice DB

Data structure – explanation • State of component • Existence • exists • assembled – included in compound component • partitioned – divided into new components • broken • destroyed • Final quality assessment (in percents) • Final acceptation mark • Full history of state changes is recorded • each change of the state causes insertion of a new record in the COMPONENT_STATES table Relationship is part of unique id (id consists of the relationship and # attribute) Alice DB

Data structure – explanation • Component assemblage/partition • The data structure represents a graph (digraph) • nodes represent components • arcs represent composition/derivation: • a component consists (is assembled) of other components • a component is derived (e.g. by partition or disassembling) from another one • The location of the component in compound component (e.g. location slot no) can also be recorded • Full history of the assemblage/partition processes is recorded • Component – special cases • Skeleton – partial component data • copied from central database • used in derivation graph • Virtual component – denotes group • with the same set of processes and/or parameters’ values Nodes of the graph Arcs of the graph Alice DB

Data structure – explanation • Component type derivation • The data structure represents a graph (digraph) • nodes represent component types • arcs represent composition/derivation: • a component type consists (is assembled) of components of other types • a component of particular type is derived from components of another type • derivation type can be: • assemblage • partition Alice DB

Parameter values Value is stored in text format values can be converted to proper data types as needed, e.g. to calculate an aggregation (sum, average, etc) Each value change is validated against value data type by database trigger Full history of changes of component parameters’ values is recorded each change of the value causes insertion of a new record in the PARAMETER table Definition of parameter Each component can have several parameters Components of the same type have identical set of parameters (certainly, values of these parameters may vary) Parameter is identified by a parameter code and a component type code Data structure – explanation Alice DB

Data structure – explanation • Data types of parameters • Data type of each parameter must be defined • Elementary data types are • string • number (float) • integer number • timestamp (date + time) • Derived data types can be defined • by enumeration • each allowable value can have descriptive label (Meaning) • each allowable string value can have its numerical counterpart defined (Numerical Value) • by restriction of range of allowable values • Value and High Value attributes are used Alice DB

Data structure – explanation • Processes • Processes are used to store information related to • tests • manufacturing processes (e.g. assemblage) • Each component can be described by • several types of processes • many instances of each process type (e.g. many repetitions of particular test type) • Each process can be related to a “manufacturer” – an institution which performs the process Alice DB

Compound processes Compound process can include other processes recursive relationships denote hierarchies of processes Complex test data Complex process results can be stored max. 3-dimensional data (X, Y, Z) series of results (Value Table) whole series is stored in one table row – reasonable query performance is ensured complex object-relational data type is used – each data cell can be retrieved using SQL query Numerical results are stored in text format scale and precision of numbers are preserved; scientific notation can be used text values can be converted to proper data types as needed Data structure – explanation Recursive relationship Alice DB

Parameters of processes Each process can have several parameters Processes of the same type have identical set of parameters Process can have input parameters output parameters (results) Data types of these parameters are defined similarly to data types of components’ parameters Only current values of process’ parameters are stored (no history of value changes) Data structure – explanation Alice DB

Data structure – explanation • BLOBs (Binary Large Objects) • Can be used to store binaries • pictures (photographs, etc) • other binary files • Should not be used to store structured data (e.g. results of a test in a spreadsheet format) • SQL cannot query binary data • BLOBs can be connected to • components • processes (e.g. tests) Exclusive arc:each BLOB must be related to a process or to a component Alice DB

Open structure The structure should still be be refined to reflect new or just discovered needs Generic structure features can be used to extend the abilities of the database new data types, parameters, processes The structure itself will be extended if we discover some important data that cannot be stored in current structure in efficient and legible way we find out that some information is important and/or common and should be ”honoured” by creating separate attribute(s) or new entities for it Current problems Linking components in the database with technical documentation (EDMS?) links to the documentation should be stored in the database Precise definition of components’ distribution among locations when a component of a compound type is composed of components distributed in many locations component type derivation should be supplemented with the table containing quantities in each particular location Data structure – planned refinements Alice DB

Central Database structure Alice DB

Data change logs Each change of not-versioned data should be logged dictionaries COMPONENTS table Journal tables are created for these tables(denoted by thick borderson the diagram) each data change in a table causes insertion of a record into associated journal table this record consists of new data state and some characteristics of the insert/delete/update action Minor differences No IS SKELETON attributes data in the central database must always be complete IS DELETED attribute in BLOB and PROCESS entities instances deleted from satellite database are not completely removed from the central database, but are marked as deleted Other differences More complex control structures not shown on the diagram still under development Differences between Central and Satellite structures Alice DB

Check-in/check-out support Central inventory of components in the central database Some control data in satellite databases Functions recording of checked-in/checked-out statuses of components’data recording of registered/unregistered statuses and current location of physical components To be finished soon Data transfer support Modification control data e.g. timestamps of last components’ data change (timers in satellite servers should be synchronized with CERN time servers) Logs of export/import actions In central and satellite databases Still under development User authentication and access control Dictionaries of user names and passwords roles access rights Access logs In central and satellite databases Still under development Control structures Alice DB

Check-in/check-out services Analysis of the problem use cases and scenarios (UML) Design of XML-based communication protocol Design of supporting data structures Development of transfer software based on Java, servlet technology and Oracle XDK To be finished soon Central – satellite data transfer Analysis of the problem use cases and scenarios (UML) Design of XML structures for data transfer Design of supporting data structures Development of transfer software based on Java, servlet technology and Oracle XDK Still under development Current design activities Alice DB

Problem Many local databases No trained personnel at labs-participants We need to be able to monitor the the state of remote servers from one central place Some monitoring system is needed Main functions Performing tests on remote computers, e.g. connection checking parsing remote databases’ log files Performing actions inform administrators about errors found on their servers during tests, e.g. by e-mail. Gathering and reporting information on tests’ results Proposed solution Remote Computer State Analysis System Design of RCSAS Three-layer architecture database (PostgreSQL) – stores all necessary information main program (business logic) split into two parts (client and server) Web application -- for data presentation Security connection between server and client over SSL username / password authentication between server and client Extensibility new tests and actions can be easily be added Monitoring of remote databases Alice DB

Alice detector databases – architecture and data structures