110 likes | 119 Views
Learn about storing and accessing Lagrangian data in databases for Underway Carbon Data Management projects, including analysis, database design, future work, and data simulation. Explore security concerns and scalability solutions.
E N D
Storing and Accessing Lagrangian Data in Database Yonghua Wei NOAA / PMEL / TMAP
Outline of Presentation • Introduction to Underway Data Management project • Analysis and database design • Data simulation • Security issue • Future work
Introduction Underway Carbon Data Management project (CDIAC) • Build web accessible data visualization systems for carbon data which are collected by sensors on research ships or commercial ships • The amount of CDIAC data will increase quickly in the coming years. The number of rows 5 years from now is estimated to be 10 million or more. • The maximum acceptable time limit for each database request is 10 seconds. • Design a database schema that can allow scalability for data size, good performance for most frequently used requests.
Introduction Typical ship tracks
dataset_ID CRUISE_DESIGNATION SECS_1970 LAT_DEC_DEGREE LONG_DEC_DEGREE JD_GMT, DATE_YYYYMMDDHHmmSS,xCO2W_PPM, xCO2A_PPM, PRES_EQUIL_hPa, PRES_SEALEVEL_hPa, EqTEMP_C, SST_TSG_C, SAL_TSG_PERMIL, WATER_FLOW_L_PERMIN, GAS_FLOW_IR_ML_PERMIN, TEMP_IR_C, PRES_IR_hPa, SHIP_HEADING_TRUE_DEGREE, SHIP_SPEED_KNOT, WIND_DIR_REL_DEGREE, WIND_SPEED_REL_M_PERS, fCO2W_SST_uATM, QC_FLAG_WATER, fCO2A_uATM, QC_FLAG_AIR, dfCO2_uATM, FLUORO_uG_PERL, WIND_SPEED_TRUE_M_PERS, WIND_DIR_TRUE_DEGREE, AIR_TEMP_C Introduction Information for each measured point
Analysis and Database Design • Main difficulty is multi-dimensional range selection and subsetting: • B+ tree or ISAM • Future solution: • R tree • Data clustering
Analysis and Database Design • One table is used to avoid costly join operation • Table should remain ordered by main index (CRUISE_DESIGNATION, SECS_1970) to take advantage of the strong correlation between two adjacent points in the same ship track • two columns called "SECS_1970" and "HOURS_1970" are added for performance enhancement • For better performance, each column uses the data type with smallest size as possible: not null, smallint, char… • Five indices are created for current data schema to address the most commonly used queries
Analysis and Database Design • Five indices are created for current data schema to address the most commonly used queries : KEY main (CRUISE_DESIGNATION,SECS_1970) KEY date (DATE_YYYYMMDDHHmmSS) KEY sec (SECS_1970) KEY hour (HOURS_1970) KEY x_y (LONG_DEC_DEGREE,LAT_DEC_DEGREE) KEY y_x (LAT_DEC_DEGREE,LONG_DEC_DEGREE)
Data Simulation • To address the problem of database scalability, a database with big amount of data is required to test its performance • Some charateristics of the simulation that are currently used: • There are 25 ships in total. • Those ships run for 5 years (2005-2009) • Each ship has 8 to 13 trips each year (random). • Each ship spend 40% of the time measuring data each year. • For each trip, the starting point and ending are random. • For each ship, the time period for one trip is random but will not overlap other trips for this ship. • Ship speed is fixed at 0.18 degree per hour. • One measurement for every 400 seconds during the trip. • Ocean and land are not considered.
Security Issue • Local host login only • User ID and password protection • LAS need super user privilege (need improvement)
Future Work • More testing on the database • Collect more information about user query pattern and optimize the database schema for higher performance. • Provide better support for metadata query. • Improve LAS to allow more interactive interface for CDIAC data query.