GFDL Data Portal
Current Status, Achievements and Future Development
K. Dixon, V. Balaji, S. Nikonov
GFDL, Princeton
NOAATECH-2006
History
• The Data Portal was launched in 1995 as a simple FTP server.
• The idea and the term “Data Portal” arose 3 years ago.
• Originally it served data in response to occasional requests.
• Now its main assets are IPCC data.
Common technical characteristics
Software
• Red Hat Linux
• Apache Web Server
• DODS Aggregation Server
• THREDDS
• LAS Server
• GrADS-DODS
Hardware
• Dell PowerEdge 2650 machine
• Dual-processor Intel Xeon, 2.4 GHz
• 3 GB RAM
• 7 Dell PowerVault 220S units with 14 HDs in each, 19 TB total (expansion pending up to 35 TB)
• Network bandwidth:
  • Internet: 9 Mbit/s
  • Internet2: 100 Mbit/s
Web Site Structure
Basic Metadata
• Model description
• Experiment description
• Institution
• Extra metadata for handling tripolar grids (including Ferret scripts for their visualization)
• Metadata complies with the CF standard
• Metadata accompanies each data file
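A metadata check of this kind can be sketched in a few lines. The example below is only an illustration, not the portal's actual tooling: it assumes a file's global attributes have already been read into a plain dictionary, and it tests for the `Conventions` attribute plus the descriptive global attributes that the CF conventions recommend. The function name and the sample values are hypothetical.

```python
# Descriptive global attributes recommended (not required) by the CF conventions.
CF_DESCRIPTIVE_ATTRS = ("title", "institution", "source",
                        "history", "references", "comment")


def missing_cf_attributes(global_attrs):
    """Return the names of CF global attributes that are absent or empty.

    global_attrs is assumed to be a dict of a file's global attributes;
    how it is read from the NetCDF file is outside this sketch.
    """
    missing = []
    if not str(global_attrs.get("Conventions", "")).startswith("CF-"):
        missing.append("Conventions")
    for name in CF_DESCRIPTIVE_ATTRS:
        if not str(global_attrs.get(name, "")).strip():
            missing.append(name)
    return missing


# Hypothetical attribute set for one file.
sample = {
    "Conventions": "CF-1.0",
    "title": "example experiment output",
    "institution": "NOAA GFDL",
    "source": "CM2.1",
}
print(missing_cf_attributes(sample))
```

A report like this could accompany each file so that gaps in the metadata are visible before publication.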
Basic features of the GFDL LAS server
• Dynamic data presentation chosen by the user
• Spatial/temporal subsampling with metadata included
• Defining new variables on the fly, calculated from a given formula
• Ferret visualization
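Subsampling requests to a DODS (OPeNDAP) server, such as the one in the software list, are usually expressed as index ranges appended to the dataset URL in the form `variable[start:stride:stop]`. The sketch below builds such a URL. The host name, dataset path and variable name are made up for illustration, and how LAS itself forwards a user's selection is not shown here.

```python
def dap_subset_url(base_url, variable, ranges):
    """Build a DAP2-style URL requesting an index hyperslab of one variable.

    ranges holds one (start, stop, stride) tuple per dimension;
    start and stop are inclusive, zero-based indices.
    """
    parts = []
    for start, stop, stride in ranges:
        if start < 0 or stop < start or stride < 1:
            raise ValueError(f"bad range: {(start, stop, stride)}")
        parts.append(f"[{start}:{stride}:{stop}]")
    # ".ascii" asks the server for a plain-text response.
    return f"{base_url}.ascii?{variable}{''.join(parts)}"


# Hypothetical dataset: first 12 time steps, every second grid point.
url = dap_subset_url(
    "http://portal.example/dods/cm2_run.nc",  # placeholder host and path
    "tas",                                     # placeholder variable name
    [(0, 11, 1), (0, 89, 2), (0, 143, 2)],
)
print(url)
```

Only the requested slab travels over the network, which matters on a link of a few Mbit/s.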
General Statistics, 01-Oct-2004 to 01-Oct-2005
• Total amount of CM2 Climate Model Data: 12 TB
• More than 10,000 NetCDF files, average file size: 1 GB
• Successful requests: ~62,000
• Average successful requests per day: ~200
• Distinct files requested: 5,000
• Distinct hosts served: ~850
• Data transferred: 15 TB
• Average data transferred per day: ~42 GB
• Number of journal articles submitted that include analyses of GFDL CM2 model output: > 100
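The per-day averages can be cross-checked against the yearly totals with a few lines of arithmetic. This is only a consistency check on the rounded figures quoted above; it assumes 1 TB = 1024 GB, which is what makes the 42 GB/day figure come out.

```python
from datetime import date

days = (date(2005, 10, 1) - date(2004, 10, 1)).days  # length of the reporting period

requests_per_day = 62_000 / days      # from ~62,000 successful requests
gb_per_day = 15 * 1024 / days         # from 15 TB transferred, 1 TB = 1024 GB
avg_file_gb = 12 * 1024 / 10_000      # from 12 TB in about 10,000 files

print(f"{days} days")
print(f"{requests_per_day:.0f} requests/day")
print(f"{gb_per_day:.1f} GB/day")
print(f"{avg_file_gb:.2f} GB/file")
```

The totals give roughly 170 requests per day, so the "~200" on the slide is a generous rounding; the transfer rate and the average file size agree with the quoted values.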
Current standard procedure for publishing data
• Climate Model Output Rewriter (CMOR) processing
  • configured manually for different models, experiments, variables
  • triggered manually
• Quality Control
  • performed by a scientist; includes checking metadata, time ranges, value ranges, etc.
• Splitting CMORized, QC-ed data into small (<2 GB) NetCDF files and pushing them outside the firewall to the Data Portal
  • scripts for this are configured manually
  • scripts are started manually
• Preparing a checksum report on the Data Portal
  • produced by a cron-started script
• Configuring the Aggregation Server and LAS
  • done manually
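Two of these steps, the splitting into files under 2 GB and the checksum report, lend themselves to a short sketch. The code below is an assumption-laden illustration rather than the scripts actually used: it plans the split along the time axis from a known per-step size (ignoring file header overhead), and it uses SHA-256 for the report because the source does not name a checksum algorithm. All function names are hypothetical.

```python
import hashlib
from pathlib import Path

LIMIT_BYTES = 2 * 1024**3  # the "<2 GB" limit from the procedure above


def plan_time_chunks(n_steps, bytes_per_step, limit=LIMIT_BYTES):
    """Return inclusive (first, last) time-index ranges, each smaller than limit.

    Header overhead of the output files is ignored in this sketch.
    """
    if bytes_per_step >= limit:
        raise ValueError("a single time step already reaches the limit")
    steps_per_file = (limit - 1) // bytes_per_step
    return [(i, min(i + steps_per_file, n_steps) - 1)
            for i in range(0, n_steps, steps_per_file)]


def checksum_report(directory, pattern="*.nc"):
    """Return one 'digest  filename' line per matching file, sorted by name."""
    lines = []
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256()  # algorithm is an assumption
        with path.open("rb") as fh:
            for block in iter(lambda: fh.read(1 << 20), b""):
                digest.update(block)
        lines.append(f"{digest.hexdigest()}  {path.name}")
    return "\n".join(lines)


# 1,200 monthly steps of 5 MiB each: how many files are needed?
chunks = plan_time_chunks(1200, 5 * 1024**2)
print(len(chunks), chunks[0], chunks[-1])
```

Running the same report on both sides of the firewall and comparing the two outputs would show whether any file was damaged in transfer.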
Current Data Portal workflow
Desirable Features of Data Portal
• Relational database storing metadata that describes
  • model components and model configuration
  • scenarios
  • postprocessing (model output and CMOR)
  • experiments
  • variables
  • formalized rules of Quality Control
  • data locations in the archive
  • task scheduler
  • user and group accounts
• XML as data exchange format
  • for compliance with the FMS Runtime Environment (FRE)
  • working format of existing third-party software
  • well suited to hierarchical metadata description
  • widely used, easy to exchange with other data portals
• Publisher Control Center (PCC)
  • controls CMOR subsystem
  • controls Data Publisher Manager
  • controls data quality (QAC)
Desirable Features of Data Portal (continued)
• Climate Model Output Rewriter (CMOR) subsystem
  • prepares data consistent with specific project requirements
• Data Publisher Manager
  • transfers data to the target destination according to settings from the DB
• Front-end Data Portal Software Package
  • Configuration Manager (configures Aggregation Server and Data Portal Interface)
  • Search Catalog Engine
  • Data Subsampling Engine
  • Data Computation Engine
  • Data Visualization
  • Data Delivery Manager
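The case for XML as an exchange format for hierarchical metadata can be illustrated with a small round trip. The element and attribute names below are invented for this sketch and do not reproduce the FRE or Curator formats; only the model name CM2.1 comes from the presentation.

```python
import xml.etree.ElementTree as ET


def experiment_to_xml(record):
    """Serialize one experiment record (a dict) to an XML string."""
    exp = ET.Element("experiment", name=record["name"])
    ET.SubElement(exp, "model").text = record["model"]
    ET.SubElement(exp, "scenario").text = record["scenario"]
    variables = ET.SubElement(exp, "variables")
    for var in record["variables"]:
        ET.SubElement(variables, "variable", name=var)
    return ET.tostring(exp, encoding="unicode")


def xml_to_experiment(text):
    """Parse the XML produced by experiment_to_xml back into a dict."""
    exp = ET.fromstring(text)
    return {
        "name": exp.get("name"),
        "model": exp.findtext("model"),
        "scenario": exp.findtext("scenario"),
        "variables": [v.get("name") for v in exp.find("variables")],
    }


# Hypothetical record; only "CM2.1" is taken from the slides.
record = {
    "name": "example_run",
    "model": "CM2.1",
    "scenario": "example_scenario",
    "variables": ["tas", "pr"],
}
text = experiment_to_xml(record)
print(text)
restored = xml_to_experiment(text)
```

The nested `variables` element is where the hierarchy shows: a flat key-value format would need a naming convention to express the same grouping.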
Proposed functional schema of ‘GFDL Data Factory’
Standard operating scenario of the Model Data Factory (ideal picture)
• A scientist builds a model in the existing GFDL FMS Runtime Environment (FRE) using available model components, datasets and a forcing scenario.
• FRE puts metadata about the built model, scenario and experiment into the “curator” DB and runs the experiment.
• The postprocessing subsystem extracts metadata about the postprocessing plan from the “curator” DB, executes it, and on completion puts metadata about the processed experiment back into the DB.
• The Data Publisher (DP) regularly checks the “curator” DB for new experiments marked as “public” and, if it finds any, invokes CMOR.
• CMOR retrieves metadata from the “curator” DB and processes the required data following the metadata instructions.
• DP calls QAC and then transfers the data to Data Portal storage.
• The Configuration Manager configures the Aggregation Server and Data Portal Interface and puts records about the new public data in the “curator” DB.
• End of process; the data is ready to go.
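The Data Publisher part of this scenario amounts to a polling pass over the database. The sketch below shows one such pass under simplifying assumptions: an in-memory SQLite database stands in for the “curator” DB, the `experiments` table and its columns are hypothetical, and CMOR, QAC and the transfer step are passed in as plain callables rather than real subsystems.

```python
import sqlite3


def publish_pending(conn, run_cmor, run_qac, transfer):
    """Process every experiment marked public that is not yet published.

    Returns the names of the experiments published in this pass.
    """
    rows = conn.execute(
        "SELECT id, name FROM experiments "
        "WHERE visibility = 'public' AND published = 0 ORDER BY id"
    ).fetchall()
    done = []
    for exp_id, name in rows:
        files = run_cmor(name)       # rewrite output for the target project
        if not run_qac(files):       # failed QC: leave for a later pass
            continue
        transfer(files)              # push to Data Portal storage
        conn.execute("UPDATE experiments SET published = 1 WHERE id = ?", (exp_id,))
        done.append(name)
    conn.commit()
    return done


# Stand-in database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments (id INTEGER PRIMARY KEY, name TEXT, "
    "visibility TEXT, published INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO experiments (name, visibility, published) VALUES (?, ?, ?)",
    [("exp_a", "public", 0), ("exp_b", "private", 0), ("exp_c", "public", 1)],
)

transferred = []
stubs = dict(run_cmor=lambda name: [f"{name}.nc"],
             run_qac=lambda files: True,
             transfer=transferred.extend)
first_pass = publish_pending(conn, **stubs)
second_pass = publish_pending(conn, **stubs)
print(first_pass, second_pass, transferred)
```

Marking the row only after the transfer succeeds keeps the pass safe to repeat: an experiment that fails quality control is simply picked up again on the next check.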
Database ‘curator’ design
Database compartments:
• Model Metadata Compartment: contains model descriptions and allows building a coupled model of the needed configuration
• Variables Compartment: list of all related physical variables
• Workflow Compartment: contains scenarios, experiments, institutions, projects and user info
• Postprocessing Compartment: defines the postprocessing plan for conducting an experiment
• Data Portal Compartment: contains info about experiment data
Interaction between compartments
Model Metadata Compartment (in development)
[Schema diagram. Tables shown: Coupled_Models, Model_List, Component_Medias, Models, Variables, Experiments. Related compartments: Workflow Compartment, Variables Compartment]
Data Samples from Model Compartment
[Sample tables shown: Components_Medias, Coupled_Models, Model_List, Models]
Variables Compartment
[Schema diagram. Tables shown: Variables, Variable_Bundles, Variable_Lists, Variable_List_Contents, Projects, Proj_Var_Names. Related compartment: Workflow Compartment]
Data Sample from Variables Compartment
[Sample tables shown: Proj_Var_Names, Variables, Variable_List_Contents, Variable_Lists, Variable_Bundles]
Workflow Compartment
[Schema diagram. Tables shown: GFDL_USERS, Institutions, Experiment_Status, Realization, Projects, Experiments, Scenarios]
Data Samples from Workflow Compartment
[Sample tables shown: Scenarios, Experiments]
Postprocessing Compartment
[Schema diagram. Tables shown: Post_Proc, PP_Units, Coupled_Models, Projects, GFDL_USERS, PP_Content, Average_Periods, Variable_Lists]
Data Samples from Postprocessing Compartment
[Sample tables shown: PP_Content, PP_Units]
Data Portal Compartment
[Schema diagram. Tables shown: Data_Files, Data_Grids, Variables, MissedData_Descriptors, Experiments, Coupled_Models, Variable_Bundles]
Data Samples from Data Portal Compartments
[Sample tables shown: Data_Files, MissedData_Descriptors, Data_Grids]
Curator DB on Data Portal stream
• Curator DB is already in use on the GFDL Data Portal.
• JSP technology with servlets on the back end was applied.
• New data transferred onto the Data Portal is automatically registered in Curator DB with all accompanying metadata.
• It has turned out to be the fastest way to search for data on the Data Portal: CM2.0, CM2.1
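Why a database lookup is fast compared with browsing directories can be seen from a small query sketch. The table names `Data_Files`, `Experiments` and `Variables` appear in the Data Portal Compartment, but every column, the sample rows and the file paths below are assumptions made for illustration, and Python with SQLite stands in for the JSP and servlet stack actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Experiments (id INTEGER PRIMARY KEY, name TEXT, model TEXT);
CREATE TABLE Variables   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Data_Files (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES Experiments(id),
    variable_id   INTEGER REFERENCES Variables(id),
    path TEXT
);
""")
# Hypothetical sample rows.
conn.executemany("INSERT INTO Experiments VALUES (?, ?, ?)",
                 [(1, "exp_a", "CM2.0"), (2, "exp_b", "CM2.1")])
conn.executemany("INSERT INTO Variables VALUES (?, ?)", [(1, "tas"), (2, "pr")])
conn.executemany("INSERT INTO Data_Files VALUES (?, ?, ?, ?)", [
    (1, 1, 1, "/data/exp_a/tas_0001.nc"),
    (2, 2, 1, "/data/exp_b/tas_0001.nc"),
    (3, 2, 1, "/data/exp_b/tas_0002.nc"),
    (4, 2, 2, "/data/exp_b/pr_0001.nc"),
])


def find_files(conn, model, variable):
    """Return the paths of all files for one model and one variable."""
    rows = conn.execute(
        "SELECT f.path FROM Data_Files f "
        "JOIN Experiments e ON e.id = f.experiment_id "
        "JOIN Variables v ON v.id = f.variable_id "
        "WHERE e.model = ? AND v.name = ? ORDER BY f.path",
        (model, variable),
    )
    return [path for (path,) in rows]


print(find_files(conn, "CM2.1", "tas"))
```

Because each file is registered with its experiment and variable when it arrives, a search is a single join rather than a scan of the storage tree.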
Other Aspects of Future Development
• Establish model metadata schema standards in the scientific community and develop an SQL metadata schema.
• Populate Curator with real metadata extracted from GFDL models.
• Integrate Curator DB with the GFDL FMS Modeling System.
• Customize the LAS server to use the Curator DB.
• Design user interfaces.
END
Questions? Thanks!