490 likes | 590 Views
Ameriflux Scientific Data Server Technology Overview. Catharine van Ingen, MSFT (vaningen@microsoft.com) 11 December 2006. Outline. Overview Database schema Data cube structure Data staging pipeline Work in progress and future plans. Overview. Ameriflux Collaboration Overview.
E N D
Ameriflux Scientific Data ServerTechnology Overview Catharine van Ingen, MSFT (vaningen@microsoft.com) 11 December 2006
Outline • Overview • Database schema • Data cube structure • Data staging pipeline • Work in progress and future plans
Ameriflux Collaboration Overview 149 Sites across the Americas Each site reports a minimum of 22 common measurements. Communal science – each principle investigator acts independently to prepare and publish data. Data published to and archived at Oak Ridge. Total data reported to date on the order of 160M half-hourly measurements (includes 3 processing levels). http://public.ornl.gov/ameriflux/ 4
Ameriflux Scientific Data Server - Goals • Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources • Simplify provenance by providing a common “safe deposit box” for assembled data • Interact simply with existing and emerging Internet portals for data and metadata download, and, over time, upload • Simplify data assembly by adding automation • Simplify name space confusion by adding explicit decode translation • Support basic analyses across the entire dataset for both data cleaning and science • Simplify mundane data handling tasks • Simplify quality checking and data selection by enabling data browsing
Ameriflux Scientific Data Server - Non-Goals • Replace the large Internet data source sites • The technology developed may be applicable, but the focus is on the group collaboration scale and usability • Very large datasets require different operational practices • Perform complex modeling and statistical analyses • There are a lot of existing tools with established trust based on long track records • Only part of a full LIMS (laboratory information management system) • Develop a new standard schema or controlled vocabulary • Other work on these is progressing independently • Due to the heterogeneity of the data, more than one such standard seems likely to be relevant
Large Data Archives Local measurements Ameriflux Scientific Data Server - Workflows • Staging: adding data or metadata • New downloaded or field measurements added • New derived measurements added • Editing: changing data or metadata • Existing older measurements re-calibrated or re-derived • Data cleaning or other algorithm changes • Gap filling • Sharing: making the latest acquired data available rapidly • Even before all the checks have been made • Browsing new data before more detailed analyses • Private Analysis: Supporting individual researchers (MyDB) • Stable location for personal calibrations, derivations, and other data transformations • Import/Export to analysis tools and models • Curating: data versioning and provenance • Simple parent:child versioning to track collections of data used for specific uses
Databases: SQL Server 2005 • All data, descriptive ancillary data and metadata held in relational databases • Access via SQL query • Metadata is important too but handled differently than data • While separate databases are shown, the datasets actually reside in a single database today • Mapping is transparent to the scientist due to dataset grouping and versioning • Separate databases used for performance • Unified databases used for simplicity • New ancillary data and data are staged with a temporary database • Minimal quality checks applied • All name and unit conversions • Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically via ODBC connection, or ? • Database may be replicated for availability and load balancing
Data cubes: SQL Server Analysis Services • A data cube is a database specifically for data mining (OLAP) • Simple aggregations (sum, min, max, count) can be pre-computed for speed • Additional calculations (median) can be computed dynamically • Both operate along dimensions such as time, site, or datumtype • Constructed from a relational database via Visual 2005 project • A specialized query language (MDX) is used • Client tool integration is evolving • Excel PivotTables allow simple data viewing • More powerful charting with Tableaux or ProClarity (commercial mining tools) • Management similar, but different to SQL Server • Most tools are shared, but actual usage differs due to functional and organizational differences • References: • Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer) by Sivakumar Harinath and Stephen Quinn • MDX Solutions: With Microsoft SQL Server Analysis Services by George Spofford
Simple plots: SQL Server Reporting Services • Server-side rendering for simple tables and charts • Initially developed for static rendering of database reports such as quarterly sales • Report layout via Visual Studio 2005 project • Report description language is XML and stored in a relational database • Accessed via ASP.NET executing in IIS • Actual data source can be either relational database or data cube • Reports can be downloaded in various formats (eg Excel, jpg). • Not intended for large quantities of data (>50000 points) or highly customized graphics • References: SQL Books Online (online product documentation)
Data Import/Export: SQL Server Integration Services • Simple GUI-based tool generation for data migration, import, export • Good for exploration and/or rapid script development • Includes several format converters • Eliminates the need for much of the “glue” code • We’ve had mixed luck with for more complex operations such as custom selection for export to flat file and/or ingesting all data from the ORNL web site. • References: • Extending SSIS 2005 with Script by Donald Farmer or Professional SQL Server 2005 • Integration Services (Programmer to Programmer) by Brian Knight et al
Schema Overview Subsets will be considered in turn
L2 Data Decode Discovered column headings are represented as: [Datumtype] [repeat][_offset][_offset][extended datumtype][units] • Datumtype: the measurement primary datumtype. • Example: TA, PREC, or LE. • Repeat: an optional number indicating that multiple measurements were taken at the same site and offset. • Example: include TA2. • [_offset][_offset]: major and minor part of the z offset. • Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7m). • Extended datumtype: any remaining column text. • Example: “fir”, “E”, “sfc”. • Units: measurement units (should only be one per datumtype) • Example: w/m2, or deg C.
L3/L4 European Data Decode European discovered column headings can be represented as: [qualityprefix][Datumtype][count][extended datumtype][qualitypostfix] • QualityPrefix: “qf_” quality flags computed at level 3 processing • Datumtype: measurement primary datumtype, although other text can be used (such as G for soil heat flux) • Count: an optional number a site-specific offset depth • Replaces offset from previous column decode • We’ve asked this be changed; likely an artifact of the original ORNL column format • Extended datumtype: a datumtype-specific combination of additional modifying text OR deriviation algorithm identifier • QualityPrefix: “qc” quality flags computed at level 4 processing Key issue: merging the quality data in columns with the correct data value column. Not all quality columns are 1:1 with data columns
Data Table • Data values are stored in a “tall” normalized or atomized table • Only valid data values are retained • Each row contains exactly one value; all other columns are foreign keys to other tables • Data are stored as single precision floating point value • Horizontal partitioned tables are created for BigPlot • Various views created to for cube building and report generation
Times, repeat, offset • Times table algorithmically generated on half-hour intervals • All times are site local • Considering adding UTC via view with site-specific conversions or additional column • Repeat table monitonically increases • Offset table generated by discovery of staged data • Views of offset and times used for cube and report menu generation • Multiple time represented to support hierarchies • Offsets ordered to simplify usage
Datumtype and Exdatumtype • Datumtype holds primary (not “other”) data datumtypes and all ancillary datumtypes (eg LAI) • Columns for cube building and per-datumtype handling • intoCube: determines whether datumtype flows to cube • dailyCalc: indicates whether daily value should be average or cumulative • repeatCalc: indicates whether repeats are averaged, ignored, or summed when building default cube • siteCalc: determines how numeric site ancillary data should flowed into cube. Eg LAI is max • Ordered views used for cube building and report generation
Quality • Holds datumtype specific European quality indicators • Algorithmically generated given quality definitions • Intended to be extensible • qualityID indicates remaining column applicability • Additional columns can be added as new quality metric emerge • Flows into cube as a quality dimension • Will support selection and plotting by quality • Still being debugged
Site Ancillary Data • Ancillary data holds site specific properties such as biome, climate, canopy height • May be text or numeric • Separate table necessary as multiple reported measurements exist due to different sources or changes over time • Source column tracks provenance • To flow into the cube, need an algorithm to select one and only one value • Leaf area index uses “max” • Over time, could use start/stop times to generate values and treat as data
Site, Investigator • Site and investigator initially created by scrape of ORNL site • Now maintained by manual query • Spline connects the two and tracks the relevant start/stop time • Investigator could be removed with GUID identifier to alternate (membership) database • Deploying CasJobs or other workflow or authentication mechanism will necessitate this
Datasets, Sitesets • Datasets used to expose data versions such as NEE_or_MDS or September download • Sitesets used to build datasets and support incremental data staging or other changes to existing data • Original single level versioning scheme replaced by this two level scheme due to better mix/match ability with different data sources across datumtypes
Decoding Tables • The conversion from column header to decoded data is table driven • Sites and column headers are discovered, then decoded. • Allows for multiple different representations of the same site/datumtype. • Checksums used to suppress load if newly scraped data file has not changed
Versioning: Sitesets and Datasets • A siteset is a collection of similar data from a single site • May include actual observations, measurements derived from observations by known algorithm, derived from observations by private (investigator specific algorithm) • Derived from a single source – actual measurements, download from national data set, simulation or other • The Ameriflux site is only one such source • Sitesets are primarily internal bookkeeping • A dataset is a collection of sitesets • No restriction on sitesets in the collection, although some combinations won’t make scientific sense • Datasets are the external scientist-facing “version” Reduces older data churn, maps to (likely) scientist data usage, avoids even more complex queries and views
Server and Researcher Sets • Sitesets and datasets may be maintained by the server or maintained by a researcher • Server Sets are maintained on behalf of all researchers in the collaboration • Accessible to all researchers • Set change policy decisions determined by the server (administrator) on behalf of the collaboration • Researcher Sets are maintained by a specific researcher or small subset group of researchers in the collaboration • Accessible only to owning researcher and/or collaborators selected by the owner • Set change decisions determined by the owner
Fixed, AppendOnly, and Dynamic Sets • Fixed sets are used to define “known good” sets for long term data curation • Contains data of a known quality as of a specific point in time • Analogous to a software release • Once fixed, always fixed • A fixed dataset may only contain fixed sitesets • AppendOnly sets are used to define “latest good” sets for common accesses • Contains data of a known quality although additional data may be added at any time • May not be necessary, but offers a little protection against inadvertent staging side effects • An appendonly dataset may contain both fixed and append only sitesets • Dynamic sets are used to define “latest” sets for rapid data access, active analysis or cleaning • Contains data of a relatively well known quality. • No restriction on changes • A dynamic dataset may contain any siteset
Dataset and Siteset Timestamps • Significant changes to a dataset or siteset are tracked by timestamps • CreateTime: dataset or siteset creation • LastAppendTime: last addition of new siteset to a dataset or new data to a siteset (includes new derived measurements, new measurement types and recent data additions) • LastModifyTime: last deletion or substitution of a siteset member of a dataset or change to older previously existing data values of a siteset • FixTime: dataset or siteset converted to fixed • DeleteTime: all dataset or siteset data deleted • Notes: • Intended to be used as “something has changed, query for more details” indicators only • DeleteTime retained for some period to help the “oops” cases to help scientists determine what happened in inadvertent or forgotten scenarios • All timestamps are server local time
Dataset and Siteset provenance • creatorid tracks investigator that created the dataset or siteset • includes any software automation tracked via investigator table entry • prepares for investigator private sets in MyDB/MyCube • name is (relatively) short text string for simple identification • Commonly used to select dataset when building a cube or other user interaction • description is longer text string for reference • Siteset howmade is a controlled vocabulary text string to track data origins • Text used to simplify data viewing. • Initial strings: • ORNLsite: non-gap filled data downloaded from ORNL website • ANN: gap-filled data using ANN method • gap: gap-filled data using unknown or unspecified method • Siteset path is the URL (if available) that was used when downloading data • NULL used when URL unavailable • Treated as a text string only, no sanity checking or format enforcement
Siteset and Dataset Scenarios • Sitesets and datasets are just mechanisms, we also need policies to determine the actual behavior • Usage cases of interest: • New staged measurements • Existing older measurements re-calibrated or re-derived • Data cleaning or other algorithm changes • New derived measurements added • Gap filling • Making the latest acquired data available rapidly • Supporting individual researchers (MyDB) • Data version provenance: parents and children
Example: data version provenance • When making a new version, it’s good to track the original version • Full provenance is possible, but very easily leads to poor usability • Proposal: track siteset parents • A siteset has exactly one parent • A siteset may have any number of children due to gap filling, other algorithm change applied, or new data reported • When a new siteset is made by combining two different sitesets, the combination can be determined by annotation only • When a siteset is deleted, the youngest surviving ancestor become the parent to any surviving children • Relationships between datasets are more complex and should be determined by the constituent siteset relations
Database schema learnings • The tall table has both advantages and disadvantages • Best for building a cube and most extensible as new datumtypes or exdatumtypes et al are encountered • Queries are fairly complex to write and join multiple tables to specify siteset, site, datumtype, exdatumtype, offset, and repeat values • Views can simplify the query, but not the performance • Materialized views and pivot tables too restricted for practical application here • Build infrastructure for machine schemas and use the right one for the right job • 100+M rows in any table takes care • Clustered unique index to improve performance of common queries • Updates affecting many rows can fill the log; allocate 2-5x storage • Full backups are far simpler operationally and footnetting often a good alternative
Sample Cube Construction Visual Studio UI for construction, deployment, and browsing
Datacube structure • Structure follows tall table organization • Each dimension corresponds to a foreign key • Time has multiple hierarchies • Dataset replaces siteset as it is the scientist-facing concept • Accumulating computed measures as we discover them • Median, standard deviation, variance can be expensive • HasDailyData, HasYearlyData, DailyValue, Average cheap and handy • Development uses smaller subset cubes • Harvard Forest for L2 data and SeptemberORNL development • VairaWillow for L2/L3/L4 data and NovemberORNL development • Spreadsheets for Q/A checking
Minimalist MDX Query SELECT NON EMPTY { [Measures].[Average]} ON COLUMNS, NON EMPTY { [Timeline].[Year To Day].[Yydoy].&[2002-001]: [Timeline].[Year To Day].[Yydoy].&[2002-365] } ON ROWS FROM [NovemberORNL] WHERE( [Datumtype].[Datumtype].&[42],[Site].[Site].&[29], [Exdatumtype].[Exdatumtype].&[1], [Dataset].[Dataset].&[6], [Offset].[Offset].&[0 (cm)]) Retrieves daily average GPP values for Tonzi Ranch for 2002
Report Generation Cube Query SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS, NON EMPTY { ([Site].[Site].[Site].ALLMEMBERS * [Datumtype].[Datumtype].[Datumtype].ALLMEMBERS * [Timeline].[Year To Month].[Yymmdd].ALLMEMBERS ) } DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS FROM ( SELECT ( { [Offset].[Offset].&[0 (cm)] } ) ON COLUMNS FROM ( SELECT ( { [Exdatumtype].[Exdatumtype].&[1] } ) ON COLUMNS FROM ( SELECT ( STRTOSET(@TimelineYear, CONSTRAINED) ) ON COLUMNS FROM ( SELECT ( STRTOSET(@SiteSite, CONSTRAINED) ) ON COLUMNS FROM ( SELECT ( STRTOSET(@DatumtypeDatumtype, CONSTRAINED) ) ON COLUMNS FROM [SeptemberORNL]))))) WHERE ( IIF( STRTOSET(@TimelineYear, CONSTRAINED).Count = 1, STRTOSET(@TimelineYear, CONSTRAINED), [Timeline].[Year].currentmember ), [Exdatumtype].[Exdatumtype].&[1], [Offset].[Offset].&[0 (cm)] ) Query used for report generation; STRTOSET in the above retrieves report parameters Still retrieves daily average GPP values for Tonzi Ranch for 2002
Excel Cube Query SELECT NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownLevel({[Datumtype].[Datumtype].[All]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS , NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownMember({DrillDownLevel({[Timeline].[Year To Day].[All]})}, {[Timeline].[Year To Day].[Year].&[2004]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS FROM [NovemberORNL] WHERE ([Measures].[Average], [Dataset].[Dataset].&[6], [Site].[Site].&[29], [Offset].[Offset].&[0 (cm)], [Exdatumtype].[Exdatumtype].&[1]) Query as generated by Excel Pviot Table Still retrieves daily average GPP values for Tonzi Ranch for 2002
Datacube learnings • It’s fairly simple to build a cube • We’ve built lots • The GUI tool is straight forward and the tutorials help • It’s not so simple to build a good cube • Incorrectly specified hierarchies can give either incorrect answers or catastrophically poor performance or both • Ordering all dimensions for good usability • Build times can be hours and/or timeout • Aggregations are designed in and take extra configuration • Once aggregations built, still need to watch performance trace to ensure that they’re used • Data cubes best for daily or larger time periods. Use the database for half-hour (leaf) measurements. • It’s easy to write a very slow MDX query • Not Empty very important • It’s really easy to write an MDX query that is hard to understand • Knowing more about what queries are of interest to the scientists will help us tune the existing cube and/or build targeted cubes
Data Staging Pipeline • Data can be downloaded from internet sites regularly • Sometimes the only way to detect changed data is to compare with the data already archived • The download is relatively cheap, the subsequent staging is expensive • New or changed data discovered during staging • Simple checksum before load • Chunk checksum after decode • Comparison query if requested • Decode stage critical to handle the uncontrolled vocabularies • Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers • Different sites use different units • Incremental copy moves staged data to one or more sitesets • Automated via siteset:site:source mapping
Updates to Basic Data Checks • Applied checks will be constantly increasing • Waiting for descriptions of current ORNL checks • Investigations pending: • Per-site, per datumtype limits (eg Summer is hotter Florida than in Alaska) • Systematic time errors (eg GMT rather than local time) • Time dependent checks for datumtypes max, min, zero at night or day day • Unit conversions or other simple conversions (eg relative humidity) • Add a new table to log results of all checks • Enables standardized reporting longer term • Can be copied to active database as part of provenance metadata for user access • Schema TBD
Copying Only Changed Data • Proposed approach: throw processing at the problem • File checksum can be used for older, retired sites or slowly reporting sites • Only addresses the “no new or changed data” case • Recommend as a filter to suppress load rather than suppress copy. • Time (yearly?) chunked count and checksum can be used for older, unlikely to change data • Reduces the total data that must be explicitly compared • Recommended given the likely query time for full comparison • Exhaustive timestep-by-timestep comparison between data in archive database and staging database used to determine the amount of change • Limit amount of returned data to bound query time • Compare only makes sense with“like” data (same processing and gap fill method) • Desired behavior for each site: • If the target siteset is fixed, do nothing • If the target siteset is append only and the data are not changed, append only the new data • If the target siteset is flexible, append new data and replace any changed data • Otherwise, create a new siteset and copy all data
L3/L4 European Quality and Gap-Filled Data • Additional column headings to denote quality flags, derived quantity algorithm and gap fill algorithm • Example: Rg_f, Rg_fqc, qf_NEE_st, NEE_st_fNDS, NEE_st_fMDSqc • _f indicates fill • qc indicates quality • st, or, MDS, and ANN indicate algorithms • Approach: load, decode, convert, delete • Loading unchanged: quality flags treated as data values. • Decoding upgraded: additional column headers decoded and flag/data indicator added • Convert: builds quality flag row and connect with the corresponding measurement • Delete: removes quality flag rows from the data table prior to data copy from the staging database
Work in progress: improving current capabilities • Debug and deploy quality flags on L3/L4 data • Implement incremental data staging to enable speedy and simple data editing by an actual scientist (rather than a programmer) • Implement expanded metadata handling to enable scientist to add site characteristics and sort sites on those expanded definitions • Investigating integration with ArcGIS to get spatial data analyses • Investigating MatLab, R2, and SPlus integration • Direct SQL queries available from MatLab and R2 • Given the query complexity, should we / can we build a wizard? • Get BigPlot working again and invest in usability • Last summer’s effort was really research into feasiblity
Future Plans: transitioning from prototype • Add data download • Current tablular report intended as stopgap/technology demonstration only • Likely starting point: adapting Euroflux collaboration download to ftp zip file given the good usability and compression • Alternate would be to write CSV file converter via SSIS • CasJobs and MyDb to support individual scientist analysis including data editing for correction, recalibration, etc • Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions • Automate MyCube construction to pair with MyDb • Internal Microsoft experience with real life data suggests this is possible and may even be fairly straight forward for cubes with with common dimensions. • Review current deployment for true DMZ deployment
Longer Term Futures • Handling imagery and other remote sensing information • Curating images is different from curating time series data • Using both together enables new science and new insights • Graphical selection and display of data • Support for user specified calculations within the database • We’ve done LE_Pot, but there are clearly others • Support for direct connections to analysis and statistical packages to cube as well as database • Leverage stored procedure (SQL) and nice wizards (MDX/SQL) to simplify query generation • Linkage with models • Additional (emerging) data standards such as NetCDF • Handling “just in time” data delivery and model result curation • Need for workflow and single sign on credentials • Data mining subscription services • Handling of a broader array of data types