Components of a Data Analysis System
Scientific Drivers in the Design of an Analysis System
Data Import
• Format
  • Either widely used/accepted, or
  • Can be converted easily from something widely used
  • User need not know the details of the format
  • Well documented (e.g., which flavor of latitude)
• Fast access
  • Disk I/O speeds do not follow Moore's law
  • Read speed is more important than write speed
  • Caching
  • File size is only important to keep access times low
• Content must represent the details of the data
  • E2E: the full intent of the observer must be embedded
Data Export
• Format
  • Either widely used/accepted, or
  • Can be converted easily into something widely used
  • User need not know the details of the format
  • Well documented (e.g., which flavor of latitude)
• You can read what you write
  • Import format == export format
• Fast access
  • Disk I/O speeds do not follow Moore's law
  • Read speed is more important than write speed
• Content must represent the details of the data
  • E2E: the full intent of the observer must be embedded
  • Includes user annotations/comments
Database System
• Ability to work with more than one data set
• Database for both export and import files
• Large data volumes
  • Access by scan number alone is no longer sufficient
  • Require the ability to select subsets of data via sophisticated database queries
• Moderate number of columns in the database index
• 'Index' to the data kept in memory to speed data access
• File summaries at various levels of detail
• Various levels of 'granularity'
  • Calibrated and raw data
• E2E: users can add annotations/comments
• Security – only the observer can access the data
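The "subsets via sophisticated queries" point can be sketched with a small in-memory index. The schema and column names below are illustrative assumptions, not an actual observatory index format:

```python
import sqlite3

# Hypothetical scan index with a modest number of columns
# (names are illustrative only, not a real GBT/NRAO schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_index ("
    "scan INTEGER, source TEXT, rest_freq REAL, tsys REAL, calibrated INTEGER)"
)
conn.executemany(
    "INSERT INTO scan_index VALUES (?, ?, ?, ?, ?)",
    [
        (101, "W3OH", 1665.4, 25.0, 1),
        (102, "W3OH", 1667.4, 27.5, 0),
        (103, "OrionA", 23694.5, 80.0, 1),
    ],
)

# More expressive than "give me scan N": select calibrated scans
# of one source below a system-temperature threshold.
selected = conn.execute(
    "SELECT scan FROM scan_index "
    "WHERE source = ? AND calibrated = 1 AND tsys < ?",
    ("W3OH", 30.0),
).fetchall()
print(selected)  # [(101,)]
```

Keeping the index small and in memory (as the bullet suggests) is what makes this kind of query fast even for large data volumes.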
Data Archive
• Write speed is more important than read speed
• File size is very important
• Cannot anticipate the types of user queries
  • Large number of columns in the database index
  • Very sophisticated/fast RDBMS
• Storage need not use a widely used data format
  • Format can be very different from that used by the analysis system
  • Export format should be a widely used data format
Interactive On-Line Data Analysis
• The ability to access data ASAP
• Import file updates automatically as observations proceed (real-time "filler")
  • Index to the file updates automatically
  • Updates happen per 'integration' (spectral line) or per N seconds (continuum)
  • Minimum integration time ~ a few times the minimum cycle time of the real-time "filler"
• The analysis system is automatically aware of the updated index
• Read-protect online/filled data?
• User should be able to 'see' the data within an 'integration' of when it was taken (or within N seconds)
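One simple way for the analysis side to become "automatically aware" of filler updates is to poll the index file's modification time. This is a minimal sketch under that assumption; the file name, polling interval, and watcher API are invented for illustration:

```python
import tempfile
import time
from pathlib import Path

def watch_index(index_path, poll_seconds=0.1, max_polls=3):
    """Poll an index file's modification time so the analysis layer
    notices new integrations written by the real-time 'filler'.
    Returns the mtimes at which an update was detected."""
    index_path = Path(index_path)
    last_mtime = 0.0
    updates = []
    for _ in range(max_polls):
        mtime = index_path.stat().st_mtime if index_path.exists() else 0.0
        if mtime > last_mtime:
            last_mtime = mtime
            updates.append(mtime)  # a real system would reread the index here
        time.sleep(poll_seconds)
    return updates

# Demo: the watcher sees the index once the filler has written it.
with tempfile.TemporaryDirectory() as d:
    idx = Path(d) / "index"
    idx.write_text("scan 101\n")  # stand-in for a filler update
    updates = watch_index(idx, poll_seconds=0.01, max_polls=2)
print(len(updates))  # 1
```

Polling at a fraction of the minimum integration time keeps the user's view within one 'integration' of real time, as the slide requires; a production system might use filesystem notifications instead.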
User Interface
• Command line
  • A familiar syntax is better than a good syntax
  • Procedural, with byte-code compilation (performance)
  • History, minimum-match or command completion
  • Useful error messages
  • Interruptible
  • Error trapping and exception handling
  • Ability to "undo"
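Minimum match (familiar from UniPOPS-style command lines) resolves any unambiguous abbreviation of a command. A minimal sketch, with an invented command list for illustration:

```python
def min_match(abbrev, commands):
    """Resolve an abbreviated command by minimum match: return the
    unique command starting with `abbrev`, or raise a useful error
    if the abbreviation is unknown or ambiguous."""
    hits = [c for c in commands if c.startswith(abbrev)]
    if len(hits) == 1:
        return hits[0]
    if not hits:
        raise ValueError(f"unknown command: {abbrev!r}")
    raise ValueError(f"ambiguous command {abbrev!r}: matches {hits}")

# Hypothetical single-dish command set, for illustration only.
commands = ["baseline", "boxcar", "gauss", "getscan", "hanning"]
print(min_match("han", commands))  # hanning
print(min_match("ge", commands))   # getscan
```

Note the error paths double as the "useful error messages" bullet: an ambiguous abbreviation reports every candidate rather than failing silently.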
User Interface
• GUIs are best for:
  • Interacting with data visualizations
  • Filling in forms
    • Database queries
    • Options for data pipelines
  • Browsing for data files
  • Defining E2E data flow (à la LabVIEW)
Imaging Tools
• Visualization
  • Shouldn't try to recreate what is already available in another package – export instead
  • Data flagging – pick a system that works
• Graphics
  • Traditional capabilities (zoom in/out, scroll, print, save, …)
  • Data volume requires great performance and smart libraries (screen resolution << # data pts)
  • Interactive feedback (e.g., defining baseline regions)
  • Publishable plots, or export into something else?
    • Default plot style
    • Ability to tweak everything (label formats; character sizes; add, remove, move annotation; tick-mark size; major/minor ticks; full box; grid; multiple X and Y axes; …)
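When the screen has far fewer pixels than there are data points, a "smart" plotting library should decimate before drawing. One standard technique (named here as an illustration, not a claim about any particular library) is min/max decimation, which preserves narrow spikes that plain subsampling would drop:

```python
def minmax_decimate(y, n_bins):
    """Reduce a long data vector to ~2*n_bins points by keeping the
    minimum and maximum of each bin, so narrow features survive
    even when # data points >> screen resolution."""
    out = []
    size = max(1, len(y) // n_bins)
    for i in range(0, len(y), size):
        chunk = y[i:i + size]
        out.extend([min(chunk), max(chunk)])
    return out

data = [0.0] * 1000
data[500] = 10_000.0  # a narrow spike one sample wide
reduced = minmax_decimate(data, 100)
print(len(reduced), max(reduced))  # 200 10000.0
```

A naive "every tenth point" subsample of the same data would miss the spike entirely, which is exactly the failure mode the slide's performance bullet is warning about.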
Analysis Algorithms
• Algorithms well documented
• Study what exists in other packages
• Robustness is very important, but so is speed
  • Provide less robust but faster alternatives
• Developers should not force an algorithm on users
  • Developers should provide 'defaults' only
  • Building blocks are better than a do-all algorithm
• Ability to use and modify 'header' information as well as data
• E2E – do-alls are built out of the same building blocks
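The building-block principle can be made concrete: small, separately callable steps, with any "do-all" composed from those same steps. Function names and the deliberately trivial algorithms below are illustrative placeholders, not real package routines:

```python
def average(spectra):
    """Average several spectra channel by channel (a building block)."""
    n = len(spectra)
    return [sum(ch) / n for ch in zip(*spectra)]

def subtract_baseline(spectrum):
    """Remove a constant baseline (order 0); a real block would offer
    polynomial orders and robust fitting as user-selectable options."""
    level = sum(spectrum) / len(spectrum)
    return [v - level for v in spectrum]

def reduce_scan(spectra):
    """A 'do-all' built from the same blocks users can call directly,
    so E2E pipelines and interactive work share one code path."""
    return subtract_baseline(average(spectra))

print(reduce_scan([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))  # [-1.0, 0.0, 1.0]
```

Because `reduce_scan` only composes public blocks, a user who dislikes its defaults can rebuild the same pipeline with their own baseline step, which is the "defaults only, nothing forced" requirement in practice.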
Documentation
• On-line and hardcopy
• Tutorials/quick guides
• Cookbook
  • Based on observing types
• Reference manuals
  • Full, gory details
  • Data formats
  • Algorithms
• Searchable by keywords
• Quick, interactive command help from within the system
• Never release until these are in place
User Support/Feedback
• A familiar system minimizes staff support
• Easily accessed, on-line "help desk" and "suggestion" box
• Automatic generation of "bug" reports
• Observers of observers
Marketing
• A familiar system already has a market
  • Don't be another cereal on the supermarket shelf
• Workshops are better than papers
• Create a user community
• Responsive feedback from developers
• Independent beta testers
• Reputation and first experiences are everything
User Community
• User forums
• Newsletters
• Accept user contributions/additions
  • SourceForge-like system
  • NRAO seal of approval
  • NRAO moderator
Real-Time Data Display
• To guarantee data quality
• Product is not stored (except for hardcopy)
• Sequential processing – different from E2E/data pipeline
• Fast is more important than accurate
• Few bells and whistles – must avoid the RTD black hole
• A simple display for all observation types is more important than sophisticated displays for a few data types
• Display happens within an 'integration' of when the data were taken – tied to the real-time filler
• GUI based – the underlying language is unimportant
• Output understandable by an operator
Real-Time Data Analysis
• Pointing/focus/tipping/… are different from RTD
• Results should be stored (database)
• Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
• Accuracy is as important as speed
• More bells, whistles, and user options
• Sequential processing (non-E2E/data pipeline)
• Only a few observation types are handled
• Analysis happens within an 'integration' of when the data were taken
• GUI based – the underlying language is unimportant
• Output understandable by an operator
IDL Work Package
• SDFITS
  • Interim solution for data import/export
  • Class/IDL specific; soon AIPS++/AIPS/UniPOPS?
  • MD/BDFITS next generation (keywords, incompleteness of contents, versatility, …)
• IDL – Tom Bania
  • Uses UniPOPS as a 'model' – familiar to many
  • Very good reproduction
  • Bania-centric – needs to be generalized
IDL Work Package
• Glen Langston
  • Assess whether IDL will meet performance, extensibility, usability, … goals
  • Generalization to other observing types
  • Real-time data access and display
  • Developed on top of, and in parallel with, Tom's work (so the implementations have diverged)
  • Works well for Glen's own experiments
IDL Work Package
• Institutionalize what Tom and Glen have done
  • Code management
  • Code review
  • Combine Tom's and Glen's branches
  • Generalize the code
  • Provide ways for Tom and Glen to contribute within the same revision-control branch
• Develop 'institutionalized' code
  • Improve performance, usability, maintenance
  • Add/replace I/O components with better CS methods
Calibration Work Package
• User-tunable algorithms
  • Options for the 'real-time filler' – sequential
  • Options for the E2E pipeline – non-sequential
  • Options for interactive data reduction
• Default algorithms for all observing cases
• Extensible as new algorithms are developed
• User-defined/tweaked algorithms
• Robust and not-so-robust algorithms
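"Defaults provided, user-tunable" suggests calibration routines that take their algorithmic pieces as swappable parameters. A minimal sketch, assuming the standard cal on/off system-temperature estimate (Tsys = Tcal · ⟨off⟩ / (⟨on⟩ − ⟨off⟩)); the function names and interface are invented for illustration:

```python
def default_tsys(cal_on, cal_off, tcal):
    """Default Tsys estimate from a noise-cal on/off pair:
    Tsys = Tcal * <off> / (<on> - <off>)."""
    on = sum(cal_on) / len(cal_on)
    off = sum(cal_off) / len(cal_off)
    return tcal * off / (on - off)

def calibrate(sig, ref, tsys_func=default_tsys, **tsys_args):
    """(sig - ref) / ref scaled by Tsys, with a swappable Tsys
    estimator: a sensible default is provided, but users may pass
    their own function or tuned parameters."""
    tsys = tsys_func(**tsys_args)
    return [(s - r) / r * tsys for s, r in zip(sig, ref)]

# A user overriding the default with their own (here trivial) estimator:
print(calibrate([2.0, 4.0], [1.0, 2.0], tsys_func=lambda: 20.0))  # [20.0, 20.0]
```

The same `calibrate` entry point can then serve the real-time filler, the E2E pipeline, and interactive reduction, with only the chosen options differing between contexts.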
Calibration Work Package
• Opacity/atmosphere model
• Output units
• Efficiencies
• Source size
• Telescope model
• Tsys(f) estimates
• Differencing schemes
• Non-linearities/template fitting/…