
Components of a Data Analysis System




  1. Components of a Data Analysis System
  Scientific Drivers in the Design of an Analysis System

  2. Data Import
  • Format
    • Either widely used/accepted, or
    • Can be converted easily from something widely used
    • User need not know the details of the format
    • Well documented (e.g., which flavor of latitude)
  • Fast access
    • Disk I/O speeds do not follow Moore’s law
    • Read speed is more important than write speed
    • Caching
    • File size matters only insofar as it keeps access times low
  • Content must represent the details of the data
    • E2E: the full intent of the observer must be embedded
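The import goals above (hide the on-disk format from the user, cache to offset slow disk I/O) can be sketched as a small reader class. This is an illustration only: the `DataReader` class, the file name, and the record fields are invented for the sketch, not part of any real system.

```python
import functools

class DataReader:
    """Hypothetical importer: callers see records, never the on-disk format."""

    def __init__(self, path):
        self.path = path

    @functools.lru_cache(maxsize=None)
    def read_scan(self, scan_number):
        # A real importer would parse the (well documented) file format here;
        # we fabricate a record just to show the interface and the caching.
        return {"scan": scan_number, "latitude_convention": "geodetic"}

reader = DataReader("session.fits")
rec = reader.read_scan(7)        # first call hits "disk"
rec_again = reader.read_scan(7)  # second call is served from the cache
```

The cache means repeated access to the same scan never touches the disk again, which is where the "read speed matters more than write speed" emphasis pays off.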

  3. Data Export
  • Format
    • Either widely used/accepted, or
    • Can be converted easily into something widely used
    • User need not know the details of the format
    • Well documented (e.g., which flavor of latitude)
  • You can read what you write
    • Import format == export format
  • Fast access
    • Disk I/O speeds do not follow Moore’s law
    • Read speed is more important than write speed
  • Content must represent the details of the data
    • E2E: the full intent of the observer must be embedded
    • Includes user annotations/comments
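The "you can read what you write" requirement (import format == export format) is a round-trip property that can be tested directly. JSON stands in here for whatever widely used format a real system would pick; the record fields are illustrative.

```python
import json
import os
import tempfile

def export_data(records, path):
    # Export in the same format the importer reads, including
    # the user's annotations/comments alongside the data.
    with open(path, "w") as f:
        json.dump({"records": records, "comments": ["observer note"]}, f)

def import_data(path):
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "scan.json")
original = [{"scan": 1, "tsys_k": 25.3}]
export_data(original, path)
round_trip = import_data(path)
assert round_trip["records"] == original  # read what you wrote
```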

  4. Database System
  • Ability to work with more than one data set
  • Database covers both export and import files
  • Large data volumes
    • Access by scan number is no longer sufficient
    • Require the ability to select subsets of data via sophisticated database queries
  • Moderate number of columns in the database index
    • ‘Index’ to the data kept in memory to speed data access
  • File summaries at various levels of detail
  • Various levels of ‘granularity’
    • Calibrated and raw data
  • E2E: user can add annotations/comments
  • Security: only the observer can access the data
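Selecting subsets by query rather than by scan number can be sketched with an in-memory SQL index over the data files. The column names (`source`, `elevation`, `calibrated`) are made up for the example; the point is the query interface.

```python
import sqlite3

# In-memory index over the data files: a moderate number of columns,
# queried instead of enumerating scan numbers by hand.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE scans
               (scan INTEGER, source TEXT, elevation REAL, calibrated INTEGER)""")
con.executemany("INSERT INTO scans VALUES (?, ?, ?, ?)",
                [(1, "3C286", 45.0, 1),
                 (2, "3C286", 12.0, 0),
                 (3, "W3OH", 60.0, 1)])

# A subset of the data chosen by a query, not by scan number.
rows = con.execute(
    "SELECT scan FROM scans WHERE calibrated = 1 AND elevation > 30").fetchall()
```

Keeping this index in memory is what makes repeated subset selection cheap even when the underlying data volume is large.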

  5. Data Archive
  • Write speed is more important than read speed
  • File size is very important
  • Cannot anticipate the types of user queries
    • Large number of columns in the database index
    • Very sophisticated/fast RDBMS
  • Storage need not use a widely used data format
    • The format can be very different from that used by the analysis system
    • The export format should still be a widely used data format

  6. Interactive On-Line Data Analysis
  • The ability to access data ASAP
    • Import file updates automatically as observations proceed (real-time “filler”)
    • Index to the file updates automatically
    • Updates happen per ‘integration’ (spectral line) or per N seconds (continuum)
    • Minimum integration time ~ a few times the minimum time of the real-time “filler”
    • Analysis system is automatically aware of the updated index
  • Read-protect online/filled data?
  • User should be able to ‘see’ the data within an ‘integration’ (or N seconds) of when it was taken
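One simple way for the analysis system to become "automatically aware" of an updated index is to poll the index file's modification time each cycle. This is a minimal sketch of that idea, not the mechanism any particular filler uses; the file layout is invented.

```python
import os
import tempfile

def index_changed(path, last_mtime):
    """Return (changed, mtime): lets the analysis side notice that the
    real-time filler has appended a new integration to the index."""
    mtime = os.path.getmtime(path)
    return mtime > last_mtime, mtime

path = os.path.join(tempfile.mkdtemp(), "index")
with open(path, "w") as f:
    f.write("integration 1\n")

changed, seen = index_changed(path, 0.0)       # first look: new data present
changed_again, _ = index_changed(path, seen)   # nothing new since last look
```

A real system would likely prefer filesystem notifications or a database trigger over polling, but the contract is the same: the user sees new data within roughly one integration of when it was taken.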

  7. User Interface
  • Command line
    • A familiar syntax is better than a good syntax
    • Procedural, with byte-code compilation (performance)
    • History, min-match or command completion
    • Useful error messages
    • Interruptible
    • Error trapping and exception handling
    • Ability to “undo”
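Min-match, mentioned above, resolves an abbreviated command to its full name when the prefix is unambiguous. A sketch, with a made-up command list:

```python
def min_match(prefix, commands):
    """Resolve an abbreviation the way min-match command lines do:
    a unique prefix maps to the full command name; anything
    ambiguous or unknown raises a useful error message."""
    hits = [c for c in commands if c.startswith(prefix)]
    if len(hits) == 1:
        return hits[0]
    kind = "ambiguous" if hits else "unknown"
    raise ValueError(f"'{prefix}' is {kind}: matches {hits}")

COMMANDS = ["baseline", "badchannel", "gauss", "hanning"]
full = min_match("ga", COMMANDS)   # resolves to "gauss"
```

Note that `min_match("ba", COMMANDS)` would raise, since both `baseline` and `badchannel` match, which is exactly the "useful error messages" requirement.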

  8. User Interface
  • GUIs are best for:
    • Interacting with data visualizations
    • Filling in forms
      • Database queries
      • Options for data pipelines
    • Browsing for data files
    • Defining E2E data flow (à la LabVIEW)

  9. Imaging Tools
  • Visualization
    • Shouldn’t try to recreate what is already available in another package – export instead
  • Data flagging – pick a system that works
  • Graphics
    • Traditional capabilities (zoom in/out, scroll, print, save, …)
    • Data volume requires high performance and smart libraries (screen resolution << number of data points)
    • Interactive feedback (e.g., defining baseline regions)
    • Publishable plots, or export into something else?
    • Default plot style
    • Ability to tweak everything (label formats; character sizes; adding, removing, and moving annotation; tick-mark size; major/minor ticks; full box; grid; multiple X and Y axes; …)
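When the screen resolution is far below the number of data points, a smart plotting library draws only a min/max pair per screen column instead of every sample. A sketch of that decimation step (the function name and pixel count are illustrative):

```python
def minmax_decimate(samples, n_pixels):
    """Reduce a long data vector to one (min, max) pair per screen column,
    so plotting millions of points costs only ~2 * n_pixels segments
    while spikes (the extrema) remain visible."""
    out = []
    step = max(1, len(samples) // n_pixels)
    for i in range(0, len(samples), step):
        chunk = samples[i:i + step]
        out.append((min(chunk), max(chunk)))
    return out

data = [float(i % 100) for i in range(1_000_000)]
reduced = minmax_decimate(data, 1024)  # ~1024 pairs instead of 10^6 points
```

Keeping both the minimum and the maximum of each chunk matters for interactive work such as flagging: a single-sample interference spike survives the decimation and stays clickable.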

  10. Analysis Algorithms
  • Algorithms well documented
    • Study what exists in other packages
  • Robustness is very important, but so is speed
    • Provide less robust but faster alternatives
  • Developers should not force an algorithm on users
    • Developers should provide ‘defaults’ only
  • Building blocks are better than a do-all algorithm
    • Ability to use and modify ‘header’ information as well as data
    • E2E: do-alls are built out of the same building blocks
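The "building blocks, not a monolith" point can be made concrete: each step is a small, documented function the user can call (or replace) directly, and the do-all is just their composition. The block names and the trivial algorithms inside them are placeholders for whatever a real package provides.

```python
def remove_baseline(spectrum):
    """Building block: subtract the mean level (a stand-in for a real
    polynomial baseline fit)."""
    mean = sum(spectrum) / len(spectrum)
    return [v - mean for v in spectrum]

def hanning_smooth(spectrum):
    """Building block: 3-channel Hanning kernel (0.25, 0.5, 0.25);
    endpoints are passed through unchanged."""
    out = list(spectrum)
    for i in range(1, len(spectrum) - 1):
        out[i] = 0.25 * spectrum[i - 1] + 0.5 * spectrum[i] + 0.25 * spectrum[i + 1]
    return out

def reduce_spectrum(spectrum):
    """The 'do-all': built out of the same blocks users call directly,
    so the E2E pipeline and interactive use cannot diverge."""
    return hanning_smooth(remove_baseline(spectrum))
```

A user who dislikes the default smoothing simply calls `remove_baseline` and then their own smoother; nothing is forced on them.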

  11. Documentation
  • On-line and hardcopy
  • Tutorials/quick guides
  • Cookbook
    • Based on observing types
  • Reference manuals
    • Full, gory details
    • Data formats
    • Algorithms
  • Searchable by keywords
  • Quick, interactive command help from within the system
  • Never release until these are in place
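Keyword-searchable, in-system help can be as simple as a registry of help strings plus an apropos-style lookup. The command names and help text below are invented for the sketch.

```python
# Hypothetical help registry: one entry per command, searchable by keyword.
HELP = {
    "baseline": "Fit and subtract a polynomial baseline. Keywords: fit, polynomial",
    "gauss":    "Fit Gaussian components to a line. Keywords: fit, line, gaussian",
}

def apropos(keyword):
    """Quick interactive help: list commands whose help mentions a keyword."""
    return sorted(cmd for cmd, text in HELP.items()
                  if keyword.lower() in text.lower())

matching = apropos("fit")  # both commands mention fitting
```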

  12. User Support/Feedback
  • A familiar system minimizes the staff support load
  • Easily accessed, on-line “help desk” and “suggestion” box
  • Automatic generation of bug reports
  • Observers of observers

  13. Marketing
  • A familiar system already has a market
    • Don’t be just another cereal on the supermarket shelf
  • Workshops are better than papers
  • Create a user community
    • Responsive feedback from developers
    • Independent beta testers
  • Reputation and first experiences are everything

  14. User Community
  • User forums
  • Newsletters
  • Accept user contributions/additions
    • SourceForge-like system
    • NRAO seal of approval
    • NRAO moderator

  15. Real-Time Data Display
  • Purpose: to guarantee data quality
  • Product is not stored (except for hardcopy)
  • Sequential processing – different from E2E/data pipeline
  • Fast is more important than accurate
  • Few bells and whistles – must avoid the RTD black hole
  • A simple display for all observation types is more important than sophisticated displays for a few data types
  • Display happens within an ‘integration’ of when the data were taken – tied to the real-time filler
  • GUI based – the underlying language is unimportant
  • Output understandable by an operator

  16. Real-Time Data Analysis
  • Pointing/focus/tipping/… are different from RTD
  • Results should be stored (database)
  • Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
  • Accuracy is as important as speed
  • More bells, whistles, and user options
  • Sequential processing (non-E2E/data pipeline)
  • Only a few observation types are handled
  • Analysis happens within an ‘integration’ of when the data were taken
  • GUI based – the underlying language is unimportant
  • Output understandable by an operator

  17. IDL Work Package
  • SDFITS
    • Interim solution for data import/export
    • Class/IDL specific; soon AIPS++/AIPS/UniPOPS?
    • MD/BDFITS next generation (keywords, incompleteness of contents, versatility, …)
  • IDL – Tom Bania
    • Uses UniPOPS as a ‘model’ – familiar to many
    • Very good reproduction
    • Bania-centric – needs to be generalized

  18. IDL Work Package
  • Glen Langston
    • Assess whether IDL will meet the performance, extensibility, usability, … goals
    • Generalization to other observing types
    • Real-time data access and display
    • Developed on top of, and in parallel with, Tom’s work (so the implementations have diverged)
    • Works well for Glen’s own experiments

  19. IDL Work Package
  • Institutionalize what Tom and Glen have done
    • Code management
    • Code review
    • Combine Tom’s and Glen’s branches
    • Generalize the code
    • Provide ways for Tom and Glen to contribute within the same revision-control branch
  • Develop ‘institutionalized’ code
    • Improve performance, usability, and maintainability
    • Add/replace I/O components with better CS methods

  20. Calibration Work Package
  • User-tunable algorithms
    • Options for the ‘real-time filler’ – sequential
    • Options for the E2E pipeline – non-sequential
    • Options for interactive data reduction
  • Default algorithms for all observing cases
  • Extensible as new algorithms are developed
    • User-defined/tweaked algorithms
  • Robust and not-so-robust algorithms
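The "developers provide defaults only, users tweak anything" pattern can be sketched as a calibration entry point whose options all have overridable defaults. Every option name here (`tcal_k`, `smooth`, `opacity_model`) is invented for the illustration.

```python
# Hypothetical developer-supplied defaults: a starting point, never a mandate.
DEFAULTS = {"tcal_k": 1.5, "smooth": "hanning", "opacity_model": "simple"}

def calibrate(raw, **overrides):
    """Apply calibration with developer defaults that the user may
    override per call; unknown options fail loudly rather than being
    silently ignored."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    opts = {**DEFAULTS, **overrides}
    # ... apply opts to the raw data here ...
    return opts

opts = calibrate([], tcal_k=1.43)  # user tweaks one default, keeps the rest
```

The same entry point can serve the sequential filler, the E2E pipeline, and interactive reduction, with each caller supplying its own overrides.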

  21. Calibration Work Package
  • Opacity/atmosphere model
  • Output units
  • Efficiencies
  • Source size
  • Telescope model
  • Tsys(f) estimates
  • Differencing schemes
  • Non-linearities/template fitting/…
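Two of the items above, Tsys estimates and the opacity correction, have standard single-dish expressions that make handy examples. The Tcal value, the plane-parallel airmass approximation, and the function names are this sketch's assumptions, not the package's actual choices.

```python
import math

def tsys(tcal_k, on_cal, off_cal):
    """Standard single-dish system-temperature estimate from a
    cal-on/cal-off pair: Tsys = Tcal * <off> / (<on> - <off>)."""
    mean = lambda v: sum(v) / len(v)
    return tcal_k * mean(off_cal) / (mean(on_cal) - mean(off_cal))

def opacity_correct(ta_k, tau_zenith, elevation_deg):
    """Correct antenna temperature to above the atmosphere assuming a
    plane-parallel atmosphere, so airmass ~ 1 / sin(elevation)."""
    airmass = 1.0 / math.sin(math.radians(elevation_deg))
    return ta_k * math.exp(tau_zenith * airmass)

t = tsys(1.5, on_cal=[3.0], off_cal=[1.5])  # toy single-channel case
ta_corrected = opacity_correct(10.0, tau_zenith=0.1, elevation_deg=30.0)
```

Both are exactly the kind of "default algorithm" a calibration package should ship while still letting the user substitute a fancier atmosphere model or a frequency-dependent Tsys(f).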
