DAP, ERDDAP, and Tabular (Sequence) Datasets
Bob Simons <bob.simons@noaa.gov>, NOAA NMFS SWFSC ERD
Try it: http://coastwatch.pfeg.noaa.gov/erddap
[Diagram labels: OBIS, SOS, Custom, DAP, ERDDAP, ..., Database, Files; ERDDAP; Your Favorite Client Software]
My Goals for this Presentation
• Tell you more about ERDDAP.
• Raise awareness and appreciation of tabular data.
• Convince you that tabular datasets are best served as DAP sequences. And that serving them in DAP as 1D or 2D gridded datasets is a bad idea. (This has nothing to do with how they are stored.)
Bonus: 3 powerful ideas:
• Abstractions (capture the essence; hide the instance details)
• Representations (different file formats)
• Reusability (value is multiplied)
ERDDAP Features
• (Re)serves diverse local and remote datasets. Abstraction: thanks to DAP, the source differences are hidden.
• Serves gridded and tabular datasets.
• Offers a unified place to search for datasets: full-text, category-based, or advanced.
• Encourages improved metadata, so users can understand the dataset.
• Offers a standard way to request data from any dataset. For humans: forms on web pages. For computers: DAP, WMS, (SOS) web services.
• Offers a choice of response file formats (different representations).
• Standardizes time formats. (Here, different representations are trouble.)
  As strings: ISO 8601:2004(E), e.g., 2014-07-01T20:00:00Z
  As numbers: seconds since 1970-01-01T00:00:00Z
• Is reusable.
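The two time representations named on this slide can be converted into each other with the Python standard library. This is a minimal sketch, not ERDDAP code; the function names are made up for illustration, and it assumes the string always uses the exact `YYYY-MM-DDThh:mm:ssZ` form shown above.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # the ISO 8601 form shown on the slide


def iso_to_epoch_seconds(iso_string: str) -> int:
    """Convert e.g. '2014-07-01T20:00:00Z' to seconds since 1970-01-01T00:00:00Z."""
    parsed = datetime.strptime(iso_string, ISO_FORMAT)
    # strptime returns a naive datetime; mark it as UTC before converting.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def epoch_seconds_to_iso(seconds: int) -> str:
    """Convert seconds since 1970-01-01T00:00:00Z back to the ISO 8601 string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(ISO_FORMAT)


print(iso_to_epoch_seconds("2014-07-01T20:00:00Z"))  # 1404244800
print(epoch_seconds_to_iso(1404172800))              # 2014-07-01T00:00:00Z
```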
Tabular Datasets
Tabular data sources: databases, OBIS, SOS, CSV files, flat .nc files, CF DSG .nc files, ...
• Geospatial: CF Discrete Sampling Geometry (DSG) feature types:
  Point: whale sightings
  Profile: disposable CTD
  TimeSeries: moored buoy
  TimeSeriesProfile: CTD
  Trajectory: ship
  TrajectoryProfile: profiling glider
• Non-Geospatial: laboratory data, references, fish disease lists, ecosystem (what eats what), ...
Larry Ellison is rich because databases are reusable for numerous types of data.
(ERD)DAP Data Requests: Gridded vs. Tabular Datasets
• Gridded Datasets (DAP projection constraints)
  DAP: ?temperature[437][46:1:162][122:282]
  ERDDAP: ?temperature[(2014-07-01)][(22):(51)][(-145):(-105)]
• Tabular Datasets (DAP selection constraints)
  DAP: ?s.id,s.owner,s.time,s.latitude,s.longitude,s.wtemp&s.id="sp031"&s.time>=1404172800
  ERDDAP: ?id,owner,time,latitude,longitude,wtemp&id="sp031"&time>=2014-07-01
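A tabular request like the ERDDAP one above can be assembled into a full URL. This is a sketch under assumptions: the dataset ID `exampleBuoys` is hypothetical, the helper `tabledap_url` is made up for illustration, and the `/tabledap/<datasetID><fileType>?<query>` layout reflects my understanding of ERDDAP's tabular service. Characters such as `"` and `>` are percent-encoded so the query survives transport.

```python
from urllib.parse import quote

ERDDAP_BASE = "http://coastwatch.pfeg.noaa.gov/erddap"  # server named on the title slide


def tabledap_url(dataset_id, variables, constraints, file_type=".csv"):
    """Build a tabular request: the variables to return, then '&'-separated constraints."""
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode characters such as '"' and '>' but leave '=' readable.
        query += "&" + quote(constraint, safe="=().,:-_")
    return f"{ERDDAP_BASE}/tabledap/{dataset_id}{file_type}?{query}"


url = tabledap_url(
    "exampleBuoys",  # hypothetical dataset ID, not a real dataset
    ["id", "owner", "time", "latitude", "longitude", "wtemp"],
    ['id="sp031"', "time>=2014-07-01"],
)
print(url)
```

The request names variables and values from the dataset itself; no index numbers appear anywhere in it.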
(ERD)DAP Sequence Requests vs. Database SQL Requests
• (ERD)DAP: ?id,owner,type,time,latitude,longitude,wtemp&id="46088"&time>=2014-07-01
• SQL: SELECT id,owner,type,time,latitude,longitude,wtemp FROM s WHERE id="46088" AND time>=2014-07-01
Pablo Picasso: "Good artists copy, great artists steal."
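The parallel between the two request styles is close enough that one can be rewritten mechanically into the other. The sketch below is illustrative only: `sequence_query_to_sql` is a made-up helper, it handles only simple filter constraints (not functions such as distinct()), and a real database would also need proper literal quoting and parameterized values.

```python
def sequence_query_to_sql(query: str, table: str) -> str:
    """Rewrite '?var1,var2&c1&c2' as 'SELECT var1,var2 FROM table WHERE c1 AND c2'."""
    parts = query.lstrip("?").split("&")
    columns = parts[0]                             # comma-separated variable list
    constraints = [part for part in parts[1:] if part]
    sql = f"SELECT {columns or '*'} FROM {table}"  # no variables listed means all columns
    if constraints:
        sql += " WHERE " + " AND ".join(constraints)
    return sql


dap = '?id,owner,type,time,latitude,longitude,wtemp&id="46088"&time>=2014-07-01'
print(sequence_query_to_sql(dap, "s"))
```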
Related Tables vs. One Table
[Figure: related tables (a normalized Observation table and a Buoy table) vs. one table (their join, denormalized)]
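The normalized-versus-joined idea can be tried with Python's built-in sqlite3 module. The table layouts and every value below are invented for illustration; they are not taken from the slide's figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: buoy metadata is stored once, observations reference it by id.
    CREATE TABLE buoy (id TEXT PRIMARY KEY, owner TEXT, latitude REAL, longitude REAL);
    CREATE TABLE observation (id TEXT, time TEXT, wtemp REAL);
    INSERT INTO buoy VALUES ('46088', 'NDBC', 48.3, -123.2);
    INSERT INTO observation VALUES ('46088', '2014-07-01T00:00:00Z', 11.2);
    INSERT INTO observation VALUES ('46088', '2014-07-01T01:00:00Z', 11.4);
""")

# Denormalized: the join repeats the buoy columns on every observation row,
# giving the single table that a user sees.
rows = conn.execute("""
    SELECT b.id, b.owner, o.time, b.latitude, b.longitude, o.wtemp
    FROM observation AS o JOIN buoy AS b ON o.id = b.id
    ORDER BY o.time
""").fetchall()

for row in rows:
    print(row)
```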
Yeah, but why doesn't ERDDAP support nested sequences?
• It does, but just internally.
• ERDDAP (re)presents the dataset as a single table.
• One table is an abstraction. It hides details.
• The average user understands a table.
• One vs. many tables: just different representations.
• This lets all tabular datasets have the same structure.
• The result of a DAP or SQL query is always one table.
• There are many file format representations of one table.
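The last point, one table with many file-format representations, can be shown with the standard library. This is a sketch with invented sample rows, and the CSV and JSON layouts here are generic ones, not necessarily the layouts ERDDAP itself produces.

```python
import csv
import io
import json

# One small table: column names plus rows (sample values are made up).
columns = ["id", "time", "wtemp"]
rows = [
    ["46088", "2014-07-01T00:00:00Z", 11.2],
    ["46088", "2014-07-01T01:00:00Z", 11.4],
]


def to_csv(columns, rows) -> str:
    """Represent the table as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows) -> str:
    """Represent the same table as a JSON array of row objects."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


print(to_csv(columns, rows))
print(to_json(columns, rows))
```

Both outputs carry the same table; only the representation differs.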
3) Tabular datasets are best served as DAP sequences. (Why DAP Sequences Rock!)
And that serving them in DAP as 1D or 2D gridded datasets is a bad idea. (This has nothing to do with how they are stored.)
Why Sequences Rock! Reason #1
If the data is coming from a relational database, OBIS, or SOS, the dataset can't be served as a gridded dataset.
• There are no index (row) numbers.
• It isn't easy/possible to know how many rows there are.
• The order of the rows may change at any time.
• New rows are added as new data arrives: frequently.
Why Sequences Rock! Reason #2
Serving tabular data in DAP as 1D or 2D gridded datasets is a bad idea.
• Logic:
  Men are mortal. Socrates is a man. So Socrates is mortal.
  Grids are handled well by DAP. Treat a table as a grid. So a table treated as a grid is handled well?
• Grid dimensions usually represent a physical continuum.
  DAP: ?temperature[408:437][46:1:162][122:282]
  ERDDAP: ?temperature[(2014-06-01):(2014-06-30)][(22):(51)][(-145):(-105)]
• No arrangement of tabular dataset dimensions works well.
  2D [buoy][time]: buoy is not a continuum, and time leads to wasted space.
  1D [time]: fine, but then you need 1000 datasets (1 per buoy).
  1D [row]: aggregated, but row isn't a continuum.
  In every case, it's hard to know which rows to request. The rows you want are scattered through the dataset, so you have to either download everything or make numerous requests.
• Serving a DSG file directly: too many formats, too hard to query.
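The "wasted space" in the 2D [buoy][time] arrangement can be quantified. The sketch below uses a made-up helper and invented observations: each buoy reports at its own times, so a grid spanning every buoy and every time is mostly empty.

```python
def grid_fill_fraction(observations) -> float:
    """Fraction of cells in a [buoy][time] grid that actually hold an observation.

    observations: iterable of (buoy_id, time) pairs.
    """
    cells = set(observations)
    if not cells:
        return 0.0
    buoys = {buoy for buoy, _ in cells}
    times = {time for _, time in cells}
    # The grid needs one cell for every (buoy, time) combination.
    return len(cells) / (len(buoys) * len(times))


# Three hypothetical buoys reporting at non-overlapping times.
observations = [
    ("A", 0), ("A", 1), ("A", 2),
    ("B", 10), ("B", 11),
    ("C", 20),
]
print(grid_fill_fraction(observations))  # 6 filled cells out of 3 * 6 = 18
```

As buoys are added with their own sampling times, the time axis grows and the filled fraction shrinks further.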
Why Sequences Rock! Reason #3
DAP sequence requests use the terminology of the dataset. (It's easy.)
• ?id,owner,type,latitude,longitude&distinct()
• ?id,type,latitude,longitude&owner="NDBC"&distinct()
• ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&distinct()
• ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01&distinct()
• ?&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01
Making these requests with index numbers is a difficult (not for Roberto), multi-step programming task. And it's inefficient.
Why Sequences Rock! Reason #4
Because declarative languages (SQL, DAP selection constraints) let you describe what you want, not how to get it.
  ?id,owner,type,latitude,longitude&distinct()
  ?id,type,latitude,longitude&owner="NDBC"&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01&distinct()
  ?&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01
With imperative languages (C, Fortran, Java, Python), you must describe, step-by-step, how to solve the problem.
1) Request all latitudes.
2) Filter.
3) Request all longitudes.
4) Multiple requests because data is scattered throughout the dataset.
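The imperative steps above can be sketched in a few lines. This is an illustration with invented coordinates and a made-up helper name: after downloading every latitude and longitude, the client filters them and ends up with several separate index ranges, each of which would need its own follow-up request.

```python
def matching_row_ranges(latitudes, longitudes, lat_range, lon_range):
    """Return inclusive (start, stop) row-index ranges whose rows fall inside the box."""
    ranges = []
    for row, (lat, lon) in enumerate(zip(latitudes, longitudes)):
        inside = (lat_range[0] <= lat <= lat_range[1]
                  and lon_range[0] <= lon <= lon_range[1])
        if not inside:
            continue
        if ranges and ranges[-1][1] == row - 1:
            ranges[-1] = (ranges[-1][0], row)  # extend the current run of rows
        else:
            ranges.append((row, row))          # start a new run of rows
    return ranges


# Steps 1 and 3: these values stand in for "request all latitudes/longitudes".
latitudes = [30, 60, 40, 45, 10, 50]
longitudes = [-120, -120, -110, -130, -120, -90]

# Steps 2 and 4: filter, then note that the matches need more than one request.
print(matching_row_ranges(latitudes, longitudes, (22, 55), (-145, -105)))
# [(0, 0), (2, 3)]
```

The declarative form of the same question is a single request: ?&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105.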
Why Sequences Rock! Reason #5
Because the other options all suck.
• Serving the datasets as grids doesn't work. You now understand why, right?
• Serve the data files via FTP. Getting a chunk of data is all or nothing. Makes the user deal with various file formats.
• Custom forms and web services are too much work to make.
  Custom: 6+ months per dataset? Ongoing maintenance. No consistency!
  Reusable: 1 day, minimal maintenance, consistent!
• Give trusted colleagues access to the database or the files. That's not making the data public!
• Don't let anyone else use the data. This is actually the #1 method of fisheries data distribution.