340 likes | 473 Views
Extending a Relational Database Intelligent storage of geospatial data John.Pickford@uk.ibm.com. 07/09/2014. Agenda. History Current methods Problems with current methods Solution – extend database to support new data types Timeseries NAG Spatial including Geodetic and GRID Conclusion.
E N D
Extending a Relational Database Intelligent storage of geospatial dataJohn.Pickford@uk.ibm.com 07/09/2014
Agenda • History • Current methods • Problems with current methods • Solution – extend database to support new data types • Timeseries • NAG • Spatial including Geodetic and GRID • Conclusion
IBM - Informix Dynamic Server • It was the Informix-IDS Database • Informix database business bought by IBM in 2001 • IDS and DB2 UDB are similar and are getting closer • Extensibility in Informix came from Stonebraker and Brown • Ingres -> Postgres -> Illustra added to Informix • DB2 UDB extensibility the same in theory, DB2 Extenders equivalent to IDS DataBlades
Current Methods • Data held in a RDBMS • Easy to manage • SQL • Easy to access from applications • C • ESQL/C, CLI • ODBC • JDBC • Easy to administer • Standard tools • Trained staff • Many 3rd party products use RDBMS
Problems with RDBMS • Not good at loading lots of data items • OK with high volume of data • Not so good with lots of small rows, e.g. ticks, readings • High latency (time between data arriving and being visible through SQL queries) • Not good with complex data • A row is just a row, no concept of an ordered set of rows • Restricted set of data types • Not good with “unusual” functions • What is unusual ? • Seasonal averaging • Multivariate analysis • VWAP
SERVER CLIENT CLIENT MEMORY DISK Standard client-server • Supports common data types • Character • Numeric • Integer • Float • Decimal • Datetime/interval • Large object • Binary • Text • Collections • List, set, multiset
Database can be extended – new data types • Distinct • e.g. lengths, mass etc • Row • e.g. address = houseNumber + road + town + postCode • Opaque • User defines how the data is stored • In row • Binary large object • External (file or some external process) • Once created it is like an additional built-in type
Opaque data type implementation(1) • Opaque types are implemented by a series of functions • Cast functions • lvarchar -> myType • myType -> lvarchar • Comparison function • B-tree index support • R-tree index support • Mathematical operators • + - x /
Opaque data type implementation(2) • Functions implemented in • C • Java • SPL • Made known to the server by an SQL statement • Functions usually execute as part of the main database server process • Types handled correctly by SQL layer • Optimiser • Indexing • Mathematical operations • Aggregates
Opaque data type implementation(3) • Simple example, complex numbers • Stored as 2 double precision numbers • Function to convert the string “12.34 + 56.78i” to a C structure containing 2 floats • Function to convert a C structure with 2 floats into the string “12.34 + 56.78i” • Functions to add, subtract, multiply and divide complex numbers • complex_t *Plus(complex_t, complex_t) • Can now be used in SQL statements: • create table (u complex_t); • insert into table values (“1 + 2i”); • select u * u from tab; • “-3 + 4i” • Comparisons OK, what about ordering ?
Virtual Table/Index Interface • Tables can be replaced by functions (VTI) • Insert function • Select function • First, next, previous, nth, last • Delete function • Can create own indexing methods • Specialist area • Example of VTI – can make Google look like a database table: select * from google where search = “IBM+IDS+TimeSeries”;
Problems storing TimeSeries data in RDBMS • “Tall-thin” tables, primary key roughly same size as data • Each row selected can cause a disk read • Extra indexing often added for index only reads • Solutions include • storing more than one element per row - makes management difficult • Using a separate timeseries database – expensive, difficult to manage
Timeseries – the problem • Ordinary RDBMs store the values in rows, these rows are stored randomly in tables even though access is usually in time order. You need to create an index to store the key value for each row in time order. This takes more space, data access can be slow with over complicated SQL statements.
Timeseries – the solution • Store elements as vectors in time order
CLIENT CLIENT MEMORY DISK SERVER ORDBMS - Extended to Support TimeSeries • Supports common data types • Supports extended data types and functionality • TimeSeries TimeSeries
Performance (time/space) Write SQL functions that work on TimeSeries Advantages of TimeSeries
TimeSeries • Create a type to hold the data: create row type rt_t ( ts datetime year to fraction(5), minTemp temperature, maxTemp temperature); • Create the table: create table tab1 ( sensorId integer, temps timeseries(rt_t)); • Use the data: select temps, getLastElem(temps), getnelems(temps), clip(temps, t1, t2), func(temps, t1, t2) from tab1 where sensorId = 99;
Timeseries functions • Built in SQL functions • Used in SQL statements and SPL functions • Server and Client C API • Common way of writing TimeSeries functions • Server and Client Java Class Library
CLIENT MEMORY DISK SERVER CLIENT ORDBMS - Extended to Support Complex Analysis TimeSeries • Supports common data types • Supports extended data types and functionality • TimeSeries • NAG types and functions • Moves processing closer to the data • Data reduction at earliest opportunity • Functionality common to very different clients NAG
Vector and Matrix data types NAG Functions Roots of equations Curve and surface fitting Matrix factorisation Eigenvalues and Eigenvectors Linear algebra support Simple correlation on statistical data Random number generation Linear statistical modeling Smoothing Time series analysis Sorting Optimisation PDE Black-Scholes NAG DataBlade
Timeseries data can be analysed using NAG functions in the server (1) • Timeseries and NAG functions can be combined: select tstovec(clip(temp, t1, t2), “maxTemp”) from tab1 where sensorid = 99; select f001(tstovec(clip(temp, t1, t2), “maxTemp”)) from tab1 where sensorid = 99; • f001(timeseries, time, time) could be a stored procedure that calls the NAG function G13AAF() that carries out seasonal differencing of a timeseries.
Timeseries data can be analysed using NAG functions in the server (2) • The function f001() could be a stored procedure calling NAG functions: create function f001(x vecDblType) returning vecDblType; define ifail, integer; define xd vecDblType; -- The output vector xd needs to be the same size -- as the input vector let xd = copy(x); -- Call g13aaf() let ifail = udrg13aaf(x, 2, 1, 4, xd, ‘0’); return xd; end function;
CLIENT SERVER CLIENT Data Feed RTL CLIENT MEMORY RTL MEMORY DISK ORDBMS - Extended to Support RealTime Data TimeSeries • Supports common data types • Supports extended data types and functionality • TimeSeries • NAG • RTL • Ticks made available to client programs in < .01s • High load rates, up to 100,000 s-1 NAG RealTime Loader
RealTime Loader Memory ORDBMS Disk RealTime Loader • Ticks in the RealTime Loader memory appear as part of the ORDBS time series • SQL access to the RTL data
CLIENT CLIENT SERVER Data Feed RTL CLIENT MEMORY RTL MEMORY DISK ORDBMS - Extended with Customer Functionality TimeSeries • Supports common data types • Supports extended data types and functionality • TimeSeries • NAG • RTL • Customer types and functionality NAG RealTime Loader Custom
Spatial extensions • Spatial DataBlade – Implements the Open GIS Consortium types • Spatial datatypes • Point • X, Y, Z, M • Line, Polygon • Set of points • MultiPoint, MultiLineString, MultiPolygon • Spatial functions • Area, BoundingBox, Contains, Crosses, Distance, Overlap, Union, Within • R-Tree index • Geodetic DataBlade • Uses longitude, latitude, height, time, measure • Grid DataBlade
Spatial query • Spatial data types can be added to the table: create table tab1 ( sensorId integer, location ST_Point, temps timeseries(rt_1)); • and used in queries: select sensorId, location, temps, getLastElem(temps), getnelems(temps), clip(temps, t1, t2), func(temps, t1, t2) from tab1, counties where ST_Within(counties.border, location) and counties.name = “Cheshire”;
GRID Datablade (1) • Written, supplied and supported by an IBM partner, Barrodale Computing Services Ltd, Victoria, Canada • Stores data in “Smart Binary Large Objects” • Provides functions to load data from common data formats • Provides functions to extra data from the grids • Can store: • 1D: timeseries, vectors • 2D: raster images, arrays • 3D: spatial volumes, images at different times • 4D: volumes at different times • 5D: 4D grids with a set of variables at each 4D point
GRID Datablade (2) • Operations: • Interpolation • Affine transformation • Projection • Windowing • Update • Storage formats: • CDF • GRIB • HDF • NetCDF • SDTS
Other DataBlades • Node • Stores tree structures • Ancestor, parent, sibling, descendants • Period • Uses R-Tree index • Same problem as spatial search • N-dimensional index • Just the R-tree index (up to 5 dimensions) • GMP extended precision floats and integers • Text • Image
DataBlade Tools • BladeSmith • Generates templates for the functions • Maintains SQL to manage types/functions • Blade Packager • Blade Manager
References • “Object-relational DBMs: Tracking the Next Great Wave”, Michael Stonebraker, Paul Brown, Dorothy Moore • ISBN 1-55-860452-9 • “Informix Dynamic Server.2000 – Server-Side Programming in C”, Jacques Roy • ISBN 0-13-013709-X • “Storing and Manipulating Gridded Data in Databases” • http://www.barrodale.com/grid_Demo/gridInfo.pdf • IBM website • http://www-306.ibm.com/software/data/informix/blades/spatial/
Conclusion • ORDBMS extensibility • Better performance • Faster • Less space • More functionality • Different packages can be combined into complex systems • Already supports useful types and functions: • Geodetic • GRID (from BCS) • Image • NAG • Spatial • Text • Timeseries • User additions • Easy to do