Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager
World Data Center for Climate (M&D/MPIMET, Hamburg)
HAMSTER-WS, Langen, 11.07.03
Content:
• DKRZ archive development
• CERA 1) concept and structure
• Automatic fill process
• CERA access statistics

1) Climate and Environmental data Retrieval and Archiving
DKRZ Archive Development
Problems in file archive access:
• Missing data catalogue
  • The directory structure of the Unix file system is not sufficient to organise millions of files.
• Data are not stored in an application-oriented way
  • Raw data contain time series of 4D data blocks.
  • The access pattern is time series of 2D fields.
• Lack of experience with climate model data
  • Problems in extracting relevant information from climate model raw data files.
• Lack of computing facilities at the client site
  • Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years of T106 or 50 years of T42 output at 6-hour storage intervals).
[Figure: 5D model data structure, contrasting the raw model output in 6-hour storage intervals with the application-oriented data storage]
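The reorganisation behind the application-oriented storage is essentially a transposition: the raw archive holds one 4D block (variable, level, latitude, longitude) per 6-hour time step, while applications want the time series of a single 2D field. A minimal sketch of that transposition in Python/NumPy; the dimensions and names are illustrative assumptions, not the actual DKRZ processing code:

```python
import numpy as np

# Illustrative dimensions (assumed): 120 six-hour time steps,
# 5 variables, 10 levels, a 64 x 128 grid.
n_t, n_var, n_lev, n_lat, n_lon = 120, 5, 10, 64, 128

# Raw archive layout: one 4D block (var, lev, lat, lon) per time step.
raw_blocks = [np.random.rand(n_var, n_lev, n_lat, n_lon).astype("f4")
              for _ in range(n_t)]

def time_series(blocks, var, lev):
    """Application-oriented view: the time series of one 2D field.

    Against the raw layout this touches every time step's block; after
    reorganisation the same series is stored contiguously per variable,
    so it can be read selectively in a single access.
    """
    return np.stack([b[var, lev] for b in blocks])  # (n_t, n_lat, n_lon)

series = time_series(raw_blocks, var=2, lev=0)
print(series.shape)  # (120, 64, 128)
```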
CERA Concept: Semantic Data Management
(I) Data catalogue and pointers to Unix files
  • Enable search and identification of data
  • Allow for data access as they are
(II) Application-oriented data storage
  • Time series of individual variables are stored as BLOB entries in DB tables
    • Allows for fast and selective data access
  • Storage in the standard file format (GRIB)
    • Allows for application of standard data processing routines (PINGOs)
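As a sketch of idea (II), GRIB-encoded 2D fields of one variable can be stored as BLOB rows keyed by time step. The schema and the use of sqlite3 below are illustrative assumptions (CERA ran on Oracle); only the one-BLOB-per-field idea and the GRIB format come from the slide:

```python
import sqlite3

# Hypothetical single-variable BLOB table; CERA's real schema differs.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE blob_data (
        dataset_id INTEGER,   -- points to the dataset description
        time_step  TEXT,      -- e.g. '1990-01-01T06:00'
        field      BLOB       -- one GRIB-encoded 2D field (~17-100 kB)
    )
""")

def store_field(dataset_id, time_step, grib_bytes):
    """Insert one GRIB record; keeping the standard file format means the
    retrieved bytes can go straight into GRIB tools such as the PINGOs."""
    con.execute("INSERT INTO blob_data VALUES (?, ?, ?)",
                (dataset_id, time_step, grib_bytes))

def fetch_series(dataset_id):
    """Selective access: read one variable's full time series without
    touching any other variable's data."""
    rows = con.execute(
        "SELECT field FROM blob_data WHERE dataset_id = ? ORDER BY time_step",
        (dataset_id,))
    return [r[0] for r in rows]

store_field(1, "1990-01-01T06:00", b"GRIB...dummy...7777")
print(len(fetch_series(1)))  # 1
```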
Parts of the CERA DB
• Internet access via the web-based user interface: catalogue inspection and climate data retrieval
• CERA database system: data catalogue, processed climate data, pointers to raw data files
  • DB size: 7.1 TB (12.2001), 19.6 TB (09.07.03)
• DKRZ mass storage archive: 210 TB, neglecting security copies (12.2001)

CERA database statistics:
• Current database size: 15.3778 TB
• Number of experiments: 299
• Number of datasets: 19,818
• Number of BLOBs within CERA at 20-MAY-03: 1,104,798,147
• Typical BLOB sizes: 17 kB and 100 kB
• Number of data retrievals: 1,500 – 8,000 per month
Data Structure in the CERA DB
• Experiment description, with pointers to Unix files
  • Dataset 1 description → BLOB data table
  • …
  • Dataset n description → BLOB data table
• Level 1 interface: metadata entries (XML, ASCII)
• Level 2 interface: separate files containing BLOB table data
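A minimal sketch of this hierarchy as Python dataclasses; the class and field names are assumptions chosen to mirror the diagram, not CERA's actual metadata model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BlobDataTable:
    """One DB table holding a variable's time series as BLOB entries."""
    table_name: str

@dataclass
class Dataset:
    """Dataset description (level 1 metadata), pointing to a BLOB data
    table in the database and/or to raw Unix files in the archive."""
    name: str
    blob_table: Optional[BlobDataTable] = None
    unix_file_pointers: List[str] = field(default_factory=list)

@dataclass
class Experiment:
    """Experiment description grouping dataset descriptions 1..n."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)

exp = Experiment("example_T106_run", datasets=[
    Dataset("2m_temperature", blob_table=BlobDataTable("T2M_BLOBS")),
    Dataset("raw_output", unix_file_pointers=["/archive/exp/raw_199001.grb"]),
])
print(len(exp.datasets))  # 2
```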
[Figure: primary data processing turns climate model raw data into application-oriented data storage (interface level 2)]
Automatic Fill Process (AFP)
The creation of the application-oriented data storage must be automatic!
Archive Data Flow
• Flow into the mass storage archive: 50 TB per month, rising to 75 TB per month (2006+)
• The compute server writes Unix files to the common file system, from where they migrate to the mass storage archive.
• Application-oriented data hierarchy, fed into the CERA DB system after metadata initialisation:
  2003: 4 TB/month, 2004: 10 TB/month, 2005: 17 TB/month, 2006+: 25 TB/month
• Important: the automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.
Automatic Fill Process Steps and Relations (a sketch of the control flow follows below)

Compute server:
1. The climate model calculation starts with the first month.
2. The next model month starts while primary data processing of the previous month runs; the BLOB table input is produced and stored in the dynamic DB fill cache.
3. Step 2 is repeated until the end of the model experiment.

DB server:
1. Initialisation of the CERA DB: metadata and BLOB data tables are created.
2. BLOB data table input is accessed from the DB fill cache; BLOB table injection and update of metadata.
3. Step 2 is repeated until the table partition is filled (BLOB table fill cache).
4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with step 2.
5. Close the entire table and update the metadata after the end of the model experiment.
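A minimal sketch of the AFP control flow, with the compute server as producer and the DB server as consumer of the dynamic fill cache; the in-process queue, the partition limit of 4 items, and all function names are illustrative assumptions:

```python
import queue

fill_cache = queue.Queue()   # stand-in for the dynamic DB fill cache
PARTITION_LIMIT = 4          # items per partition; CERA used ~0.5 TB partitions

def simulate_month(month):   # stand-in for one month of climate model output
    return f"raw-month-{month}"

def postprocess(raw):        # primary data processing of the previous month
    return f"blob-input({raw})"

# Compute server: produce BLOB table input into the fill cache (steps 1-3).
def run_experiment(n_months):
    for month in range(1, n_months + 1):
        fill_cache.put(postprocess(simulate_month(month)))
    fill_cache.put(None)     # end-of-experiment marker

# DB server: inject BLOBs and rotate table partitions (steps 2-5).
def fill_database():
    partition, filled = [], 0
    while (item := fill_cache.get()) is not None:
        partition.append(item)          # BLOB table injection + metadata update
        filled += 1
        if filled == PARTITION_LIMIT:   # partition full:
            print(f"partition closed, DB files to HSM ({len(partition)} items)")
            partition, filled = [], 0
    if partition:
        print(f"final partition closed ({len(partition)} items)")
    print("table closed, metadata updated")

run_experiment(10)
fill_database()
```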
Automatic Fill Process Development
• Development setup: compute server (192 CPUs, 1.5 TB memory), GFS of ca. 60 TB, automatic fill rate 4 TB/month
• AFP development with a 1 TB dynamic DB fill cache and a BLOB table fill cache
• The Oracle RDBMS is currently not connected to the GFS.
• Future: the RDBMS will move to a Linux system and be connected to the GFS; the AFP is planned to move to the DB server.
AFP Cache Sizes
• Dynamic DB fill cache
  • In order to guarantee stable operation, the fill cache should buffer the data from 10 days of production.
  • The cache size is therefore determined by the automatic data fill rate (see the computation sketch below).
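For illustration: at the 4 TB/month fill rate of the development setup, 10 days of buffering comes to roughly 1.3 TB, the same order as the 1 TB development cache. A small computation over the fill rates from the archive data-flow slide:

```python
# Fill-cache sizing: buffer 10 days of production at the automatic fill rate.
for year, tb_per_month in [(2003, 4), (2004, 10), (2005, 17), (2006, 25)]:
    cache_tb = tb_per_month / 30 * 10   # 10 days of production
    print(f"{year}: {tb_per_month} TB/month -> cache ~{cache_tb:.1f} TB")
# 2003: ~1.3 TB, 2004: ~3.3 TB, 2005: ~5.7 TB, 2006+: ~8.3 TB
```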
AFP Cache Sizes
• BLOB table partition cache
  • The number of parallel climate model calculations and the size of the BLOB table partitions determine the cache size (a sizing sketch follows below).
  • BLOB table sections of 10 years for the high-resolution model and 50 years for the medium-resolution model result in 0.5 TB partitions.
  • Present-day experience with parallel model runs results in a BLOB table partition cache equal in size to the dynamic DB fill cache.
  • A modification will be needed if the number of parallel runs and/or the table partition sizes change.
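A hedged sizing sketch: the partition cache must hold one open 0.5 TB partition per concurrently running model experiment, so the run count is the free parameter. The value of two parallel runs below is an assumption chosen only to reproduce a cache equal to the 1 TB dynamic fill cache, as the slide observes:

```python
PARTITION_TB = 0.5   # from 10-year (high-res) / 50-year (medium-res) sections

def partition_cache_tb(parallel_runs, partition_tb=PARTITION_TB):
    """One open partition per concurrently running model experiment."""
    return parallel_runs * partition_tb

print(partition_cache_tb(2))   # 1.0 TB, equal to the dynamic DB fill cache
```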