Measurement Data Archive – Project Highlights
GEC12, Nov 2011
Giridhar Manepalli
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
Why Archive?
• The obvious: for use by others or by yourself in the future
• The Fourth Paradigm
  • Data-intensive science
  • Emergent phenomena
• Funding bodies increasingly asking for data plans
• Citations from journal articles to data sets on the rise
• Consistent archiving standards enhance the use of data over time and within a domain
Measurement Data Archive
[Diagram. Labels: Experimenter X, Experimenter Y, Slice, Workspace, CNRI Archive, Internet, Public Journals, Measurement Data Template. Legend: Prototype; Digital Object (DO); Data Model TBD. Example identifier shown: 10510.0.1/0-L2NucmlnZW5p. Example Object A contains Run 1 Logs, Run 2 Logs, and Metadata.]
Key:
1. Experiment Initiated
2. Measurement Data Collected
3. Measurement Data Archived
4. Archived Data Referenced
5. Archived Data Retrieved
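The five steps in the key can be read as a small lifecycle. The sketch below is a toy in-memory model of that lifecycle, not the CNRI prototype: the class names `Workspace`, `Archive`, and `DigitalObject`, the `example-prefix/` identifier scheme, and every method name are assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical archived unit: an identifier plus named data elements."""
    identifier: str
    elements: dict


@dataclass
class Workspace:
    """Hypothetical staging area that collects measurement data per slice."""
    collected: dict = field(default_factory=dict)

    def collect(self, slice_name, element_name, payload):
        # Step 2: measurement data collected for an experiment's slice.
        self.collected.setdefault(slice_name, {})[element_name] = payload


class Archive:
    """Hypothetical archive that stores digital objects by identifier."""

    def __init__(self, prefix="example-prefix"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._objects = {}

    def archive(self, workspace, slice_name):
        # Step 3: copy the slice's data into a new digital object.
        identifier = f"{self._prefix}/{next(self._counter)}"
        elements = dict(workspace.collected[slice_name])
        self._objects[identifier] = DigitalObject(identifier, elements)
        # Step 4: the returned identifier is what a publication would cite.
        return identifier

    def retrieve(self, identifier):
        # Step 5: anyone holding the identifier can fetch the object.
        return self._objects[identifier]


workspace = Workspace()
workspace.collect("slice-x", "run1.log", b"latency=12ms")  # steps 1-2
workspace.collect("slice-x", "run2.log", b"latency=15ms")
archive = Archive()
object_id = archive.archive(workspace, "slice-x")          # step 3
retrieved = archive.retrieve(object_id)                    # step 5
print(object_id, sorted(retrieved.elements))
```

The archive copies the workspace data rather than referencing it, so later changes in the workspace do not alter what a cited identifier resolves to. That is one possible design choice, not something the slides specify.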
Current Usage
• Early adopters in GENI:
  • OnTimeMeasure - Ohio State University
  • INSTOOLS - University of Kentucky
• Possible usage in other projects:
  • DARPA Transformative Apps program, for managing data related to mobile apps
  • Internal to CNRI, for sharing documents and presentations across groups
Next Steps – I&M Standpoint
• Revisit the protocols for pushing data into the workspace
• Associate metadata with data effectively
  • Where does the metadata live?
  • How is it associated with data? At what level of granularity is it specified?
• Support GENI and I&M schemes of authentication, authorization, metadata enforcement, etc.
• Allow multiple workspace deployments
• Identify the process to push data from the workspace into the archive
  • Should metadata be enforced before data is pushed into the archive?
  • How is the data serialized in the archive?
  • How is data visibility managed in the archive?
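One of the open questions above is whether metadata should be enforced before data is pushed into the archive. A minimal sketch of what such a gate could look like follows. The required field names, the `MetadataError` class, and the `push_to_archive` function are all hypothetical, since the slides state that no I&M metadata standard has been identified yet.

```python
# Hypothetical required fields; the actual I&M metadata requirements
# had not been identified at the time of the talk.
REQUIRED_FIELDS = ("title", "creator", "experiment_id", "collected_on")


class MetadataError(ValueError):
    """Raised when a metadata record is missing required fields."""


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank, in order."""
    return [name for name in required
            if not str(metadata.get(name) or "").strip()]


def push_to_archive(archive, object_name, data, metadata):
    """Store data with its metadata only if the record is complete."""
    missing = missing_fields(metadata)
    if missing:
        raise MetadataError(f"missing metadata: {', '.join(missing)}")
    # Metadata is attached at whole-object granularity here; per-file or
    # per-record granularity would be an equally plausible choice.
    archive[object_name] = {"data": data, "metadata": dict(metadata)}


archive = {}
complete = {
    "title": "Run 1 logs",
    "creator": "Experimenter X",
    "experiment_id": "exp-001",
    "collected_on": "2011-11-01",
}
push_to_archive(archive, "run1", b"...", complete)

incomplete = {"title": "Run 2 logs"}
try:
    push_to_archive(archive, "run2", b"...", incomplete)
except MetadataError as error:
    rejected_reason = str(error)
```

Rejecting at push time keeps incomplete records out of the archive, at the cost of more friction for experimenters. Accepting the data and flagging it for later completion would be the opposite tradeoff.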
Next Steps – GENI-wide
• Extend services offered by the archive beyond data storage
  • Developed a visualization service prototype to demonstrate automatic visualization of data for DataCite
  • Designed a theoretical model for enforcing terms & conditions, licenses, etc. prior to disseminating data
• Goal: expand the archive into an ecosystem to entice communities into using it
• Use the archive for experiments, not just for I&M
Archive Services
Suite of extensible services end users can leverage by following the ID.
[Diagram. Labels: Science Times Article (Title, Data ID); Suite of Services: License Enforcement (Terms / I Agree), Visualization, Data Set Dissemination, Data Processing; Archive (Stores & Retrieves Data); Ohio University VDC Experiment; Other Experiments; Experimenter; Other Experimenters.]
• User follows Data ID into the Archive.
• User is redirected to requested Archive Service.
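The flow on this slide (follow a data ID into the archive, then get redirected to a service), combined with the terms-and-conditions model mentioned under the GENI-wide next steps, can be sketched as a small dispatcher. This is an illustration only: `ArchivedDataSet`, `resolve`, the service names, and the `example/1` identifier are hypothetical and do not describe the prototype's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedDataSet:
    """Hypothetical archived data set with terms a user must accept."""
    data_id: str
    payload: bytes
    terms: str
    accepted_by: set = field(default_factory=set)


class TermsNotAccepted(PermissionError):
    """Raised when a user requests data without agreeing to its terms."""


def disseminate(data_set, user):
    """Return the raw bytes, but only after the user agreed to the terms."""
    if user not in data_set.accepted_by:
        raise TermsNotAccepted(f"{user} must accept: {data_set.terms}")
    return data_set.payload


def summarize(data_set, user):
    """Stand-in for a data processing or visualization service."""
    payload = disseminate(data_set, user)
    return {"data_id": data_set.data_id, "size_bytes": len(payload)}


# Registry of services; new ones can be added without changing resolve().
SERVICES = {"download": disseminate, "summary": summarize}


def resolve(archive, data_id, service, user):
    """Follow a data ID into the archive and dispatch to a service."""
    return SERVICES[service](archive[data_id], user)


archive = {
    "example/1": ArchivedDataSet(
        "example/1", b"10100 11010", "Cite the data set."),
}
try:
    resolve(archive, "example/1", "download", "alice")
    first_attempt = "served"
except TermsNotAccepted:
    first_attempt = "refused"

archive["example/1"].accepted_by.add("alice")  # the user clicks "I Agree"
summary = resolve(archive, "example/1", "summary", "alice")
```

Routing every service through `disseminate` means the terms check cannot be bypassed by choosing a different service. Keeping the services in a registry is one way to make the suite extensible.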
Prototype Limitations
• Only one workspace service is deployed
  • Multiple workspaces, within and outside GENI networks, can be hosted that push data to the archive
• Authentication and authorization model is simple and redundant
  • Should conform to and use one scheme across GENI (or at least across I&M)
• No metadata standard applied
  • I&M metadata requirements must be applied once identified
What is Metadata and Why Do I Need It?
• Lots of miscommunication because
  • Metadata is not a type of data
  • Metadata is a type of relationship between two pieces of data
• Needed for Understanding and Finding
  • Understanding (sometimes called Descriptive MD)
    • How do I parse this?
    • How do I interpret this?
  • Finding (sometimes called Subject MD)
    • Finding one item in a population of 10 is easy
    • Finding one item in a population of 1M is impossible w/o some way to distinguish them
• Generally requires a human in the loop at some level
  • Sometimes the object is self-describing (journal article)
  • Automatic indexing/classification works for some domains
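The claim that metadata is a relationship rather than a kind of data can be made concrete with a small data structure. In the sketch below, every item is plain data, and only a `Describes` link makes one item metadata for another. The class and function names, file names, and contents are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Any piece of data, identified by a name."""
    name: str
    content: str


@dataclass(frozen=True)
class Describes:
    """A directed relationship: `description` is metadata for `target`."""
    description: Item
    target: Item
    kind: str  # "descriptive" (understanding) or "subject" (finding)


log = Item("run1.log", "t=0 latency=12ms\nt=1 latency=15ms")
schema = Item("log-format.txt", "Each line: t=<seconds> latency=<ms>")
keywords = Item("run1.keywords", "latency, GENI, measurement")
readme = Item("README", "log-format.txt was written by Experimenter X")

relations = [
    Describes(schema, log, "descriptive"),
    Describes(keywords, log, "subject"),
    # The schema is metadata for the log, yet plain data relative to
    # the README that describes it.
    Describes(readme, schema, "descriptive"),
]


def metadata_for(target, relations, kind=None):
    """Return the items that describe `target`, optionally by kind."""
    return [r.description for r in relations
            if r.target == target and (kind is None or r.kind == kind)]


def find(term, relations):
    """Find targets whose subject metadata mentions `term`."""
    return [r.target for r in relations
            if r.kind == "subject"
            and term.lower() in r.description.content.lower()]


print([item.name for item in metadata_for(log, relations)])
print([item.name for item in find("geni", relations)])
```

Because the relationship is stored separately from the items, one object can carry several metadata records, and the same item can be data in one relation and metadata in another.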
Why is Metadata Hard?
• To be effective it must be consistent, and consistently applied, within a given domain
  • What is the scope of the domain?
  • What aspects of the object need to be described?
  • What is the vocabulary? Is it open or closed?
• Even within a defined domain, there are many points of view
  • Especially true for any sort of subject description
  • May have to allow for multiple metadata records for a single described object
• Spending time on creating good metadata is Good For You
  • The best sources for good metadata are the creators/owners of the described object, but they may lack interest and training
  • Some types of metadata are difficult to automate, e.g., a good title
• Keep it simple – favor consistency and coverage over depth
Misc Points
• Precision and Recall are useful concepts in searching
  • Precision: what % of the search results are on target?
  • Recall: what % of the correct result set did my search retrieve?
  • Desirable tradeoff is situational
• Consider University Libraries as reliable archive holders
• Variety of approaches to managing a useful vocabulary of terms
  • Controlled vocabulary: set of terms – use these instead of slight variations
  • Taxonomy: parent-child relationships
  • Ontologies: introduce other types of relationships
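The precision and recall definitions above can be worked through with a few lines of code. This sketch uses made-up identifiers, and the convention of returning 0.0 when a set is empty is an assumption, since the slides do not cover that edge case.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one search.

    retrieved: identifiers the search returned
    relevant:  identifiers that are actually correct for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned results that are on target.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the correct results that the search returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"run1", "run2", "run3", "run4"}

# A narrow query returns only correct items but misses half of them.
narrow = precision_recall(["run1", "run2"], relevant)

# A broad query finds every correct item but returns as much noise.
broad = precision_recall(
    ["run1", "run2", "run3", "run4", "x1", "x2", "x3", "x4"], relevant)

print("narrow:", narrow)
print("broad:", broad)
```

The narrow query scores high on precision and low on recall, and the broad query the reverse, which is the situational tradeoff the slide refers to.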