Research Findings Concerning the Utility and Scalability of the XFDU and Related Technologies in the Packaging and Validation of Very Large Digital Information Products
Louis Reich, NASA/GSFC/CSC, June 10, 2008, Arlington, VA



Presentation Transcript


  1. Research Findings Concerning the Utility and Scalability of the XFDU and Related Technologies in the Packaging and Validation of Very Large Digital Information Products. Louis Reich, NASA/GSFC/CSC, June 10, 2008, Arlington, VA

  2. Overview of Presentation
  • Background
  • Open Archival Information System (OAIS) Reference Model
  • XML Formatted Data Unit (XFDU)
  • XFDU Toolkit Library

  3. In the Beginning: OAIS Reference Model
  OAIS Mandatory Responsibilities:
  • Negotiating and accepting information
  • Obtaining sufficient control of the information to ensure long-term preservation
  • Determining the "designated community"
  • Ensuring that information is independently understandable
  • Following documented policies and procedures
  • Making the preserved information available
  (Slide figures: OAIS Information Model, OAIS Environment and Data Flows, OAIS Functional Model)

  4. XML Formatted Data Units: Information Packages

  5. XML Packaging Standard Rationale
  Technology and Requirements Evolution
  • Physical media --> Electronic transfer
  • No standard language for metadata --> XML
  • Homogeneous Remote Procedure Call --> CORBA, SOAP
  • Little understanding of long-term preservation --> OAIS RM
  • Record formats --> Self-describing data formats
  New Requirements
  • Describe multiple encodings of a data object
  • Better describe the relationships among a set of data objects
  Technical Drivers
  • Use of XML-based technologies
  • Designed to be extensible to include new XML technologies as they emerge
  • Linkage of data and software
  • Direct mapping to OAIS Information Models
  • Support both media and network exchange
  • Support for multiple encoding/compression on individual objects or on the entire package
  • Mapping to current SFDU Packaging and Data Description Metadata where possible
  • Maximal use of existing standards and tools from similar efforts

  6. XFDU Conceptual View

  7. XFDU Manifest Logical View

  8. XFDU Reference Implementation: XFDU Toolkit Library
  • Java class library implemented in parallel with specification development
  • The XFDU API library can be used within any other application that needs to perform XFDU packaging.
  • To use the XFDU toolkit in this manner, simply put all the jar files inside the lib directory on the CLASSPATH of your application. Javadoc documentation for the library can be found at: http://sindbad.gsfc.nasa.gov/xfdu
  • The XFDU API library conceptually consists of two layers:
  • The first layer (xfdu.core.*) is a low-level API representing each structure in the XFDU schema.
  • The second layer (xfdu.packaging.*) is a higher-level API. It aggregates parts of the functionality from the first layer, allowing easier construction and manipulation of an XFDU package.
  • A crude GUI is also supplied for testing and demonstration.
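The CLASSPATH setup described above can be sanity-checked with a one-off launcher. The class below is a hypothetical illustration; only the convention of putting the lib/ jars on the application's classpath comes from the slide.

```java
// Hypothetical check class, launched for example as:
//   java -cp "lib/*:myapp.jar" ClasspathCheck
// so that every XFDU toolkit jar in lib/ is visible to the application.
public class ClasspathCheck {
    public static void main(String[] args) {
        // Print the effective classpath so a packaging application can
        // confirm the toolkit jars were picked up before calling the API.
        System.out.println(System.getProperty("java.class.path"));
    }
}
```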

  9. Current XFDU Research Results

  10. XFDU Validation Functionality Rationale
  • The study of XML Schema specialization mechanisms indicates the importance of alternative validation mechanisms to enable the level of specialization required for effective use of the XFDU as a base schema, reusable for defining more specific packages for domain interoperability.
  • Currently the XFDU incorporates Schematron rules for the optional validation of semantic rules specified in the XFDU Recommendation.
  • Experience with the PAIS SIP schema development indicated that while this technique could be used to validate structural specializations, the complex rule definitions and the apparent inconsistency of the XML schema would be significant user issues.
  • The following experiment is an effort to allow verification of individual XML-formatted data and metadata using included or referenced XML schemas, without involving the specialization of the underlying XFDU schema.

  11. XFDU Validation Functionality Design
  • A validation handler was created that implemented the ValidationHandler interface defined in the NASA XFDU Java library.
  • This handler was used as a custom validation handler plug-in for the validation API. The plug-in is then used during execution of the XFDU API for validation of an XFDU package.
  • The validation handler used Sun's Multi Schema Validator in combination with a SAX parser to perform the actual validation.
  • For the purposes of the test, intentional errors were introduced in the sample XML file.
  • During the validation of each data object found in the package, the validation handler would traverse to its representation metadata via the value of the repID attribute and retrieve the corresponding schema.
  • Then the validator would validate the content of the data object against the schema.
  • Validation errors were collected during the execution. For example, the test XML was modified to include a string value in an element that requires an integer value, thereby triggering a validation error.
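As a rough sketch of the per-data-object validation step described above, the snippet below collects schema errors rather than stopping at the first one, using the standard JAXP validator (javax.xml.validation) as a stand-in for Sun's Multi Schema Validator. The schema, the instance documents, and the class name are illustrative, not the toolkit's actual ValidationHandler interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DataObjectValidator {

    // Validates one data object against the schema retrieved from its
    // representation metadata, collecting recoverable errors instead of
    // aborting, as in the test run described above.
    public static List<String> validate(String schemaXsd, String instanceXml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(schemaXsd)));
        Validator validator = schema.newValidator();
        final List<String> errors = new ArrayList<String>();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { errors.add("WARNING: " + e.getMessage()); }
            public void error(SAXParseException e)   { errors.add("ERROR: " + e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        validator.validate(new StreamSource(new StringReader(instanceXml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical schema requiring an integer, mirroring the intentional
        // error injected into the sample XML (a string where an integer is required).
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                   + "<xs:element name='count' type='xs:integer'/></xs:schema>";
        System.out.println(validate(xsd, "<count>42</count>").isEmpty() ? "valid" : "invalid");
        System.out.println(validate(xsd, "<count>error value</count>").isEmpty() ? "valid" : "invalid");
    }
}
```

With the schema above, the first call reports no errors and the second collects a datatype violation, analogous to the "does not satisfy the integer type" message from MSV.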

  12. Validation Functionality Test Observations and Conclusions
  • To measure the performance overhead introduced by validation, the test was run 10 times.
  • Validation of 296 identical data objects of 1.8 KB each against one XML schema resulted in an average overhead of 1.8 seconds, an average of about 5 milliseconds per object.
  • The errors introduced in the manifest were reported as: "ERROR: com.sun.msv.verifier Validity Violation: 'error value' does not satisfy the 'integer' type".
  • Using Sun's Multi Schema Validator in combination with a SAX parser to validate XML-formatted data objects against their associated representation metadata (XML Schema or DTD) does not introduce any significant overhead in XFDU processing.
  • The overhead will vary depending on parameters such as the size of the data objects, the number of schemas, the size of each schema, and the number of errors in each data object.
  • The XFDU library validation API can be used to plug in such validation tools.
  • This technique would enable projects to validate more specialized XFDU instances without creating complex and potentially conflicting specializations of the underlying XFDU schema.

  13. XFDU Validation Functionality-3: Scalability Testing
  • The mechanism of validating individual data objects provides significant added value to the XFDU Toolkit Library and may provide a practical solution to the unresolved issue of extending the XFDU XML Schema by third parties. To validate the utility of this technique, the scalability of the various factors that affect performance must be tested up to a realistic maximum.
  • Validation via XSD with 5 schemas and 5 test XML files spread over the sample
  • We also created a scalability test for Schematron using the size of the manifest as the independent variable.

  14. XFDU Validation Functionality-4

  15. XFDU Validation Massive Scalability

  16. XFDU Validation Functionality Issues and Conclusions
  • The Schematron mechanism does not scale very well.
  • Given that Schematron uses XSLT and XPath, the problem could be similar to the JXPath issue we encountered in the XFDU creation tests.
  • However, we didn't implement Schematron, so it is doubtful we can optimize it.
  • We will need to include manifest size limits for the use of Schematron to validate XFDU rules in the XFDU Best Practices.
  • The massive scalability tests were remarkably consistent and showed scalability to higher than expected levels.

  17. XFDU Scalability Testing

  18. XFDU Scalability Analysis
  • The first observation was that the XFDU creation time was linear up to 110,000 objects. However, the creation time for an additional 50,000 appeared to become exponential. An analysis of the timing of the phases of the creation process indicated that the compute-intensive activity of building the manifest was the major problem, with the mixed compute/IO ZIP operation also taking a longer time.
  • Resolution: The first step in the resolution process was to create an additional test case between 110,000 and 157,000 objects to better understand the problem. That test case appears as the 141,470-object column. We increased the Java VM memory allocation from one gigabyte to two gigabytes, and the creation time appeared to be linear. The following graphs illustrate these results. In both graphs, the y-axis is the number of objects and the x-axis is time in seconds.
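The mixed compute/IO ZIP phase mentioned above can be sketched with java.util.zip from the standard library. The entry names and payload below are illustrative; the real runs packed 110,000+ objects under an enlarged heap (e.g. launching with java -Xmx2g), not the 1,000 shown here.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipPackagingSketch {

    // Packs n small data objects into a single ZIP stream, standing in for
    // the ZIP step of XFDU package creation whose cost grew with object count.
    public static int pack(int n, ByteArrayOutputStream sink) throws Exception {
        byte[] payload = "sample data object".getBytes("UTF-8");
        ZipOutputStream zip = new ZipOutputStream(sink);
        try {
            for (int i = 0; i < n; i++) {
                zip.putNextEntry(new ZipEntry("data/object-" + i + ".xml"));
                zip.write(payload);
                zip.closeEntry();
            }
        } finally {
            zip.close();
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        int written = pack(1000, sink);
        long millis = (System.nanoTime() - start) / 1000000L;
        System.out.println(written + " entries, " + sink.size() + " bytes, " + millis + " ms");
    }
}
```

Timing this loop as the object count grows is the kind of measurement behind the creation-time curves discussed above.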

  19. XFDU Scalability Analysis

  20. XFDU Scalability Results
  • The results of this series of performance and scalability testing were very encouraging but also pointed out some concerns that may need action in future research plans. The primary conclusion is that the XFDU approach and toolkit implementation scale well for the volumes expected in the ERA and in NASA data products in the next decade.
  • However, the testing experience did give rise to the following issues:
  • The bug in Java's ZIP library was initially reported in 2002 and does not have a high enough priority to have been fixed. This is of concern because it may show a tendency to "upgrade" operating systems to 64-bit processors without upgrading all the utilities.
  • The running of each large test took considerable time, and test results varied due to server load conditions we could not control. We need user feedback on whether 2-2.5 hours is an acceptable time to gather 100,000 files and create the XFDU.
  • We have not done any testing in the federated GRID environment that is predicted for the ERA timeframe. This is a necessity and must be addressed prior to reaching any firm conclusions.

  21. Current Activities and Results
  • The current round of XFDU-PAWN integration testing shows that the integration of two independently developed Java applications can appear to be successful on a centralized system.
  • However, any level of distribution introduces a new weakest link into the system: the network.
  • There is an urgent need to validate the utility of the XFDU concepts and test the performance of the XFDU Toolkit library in a GRID environment with appropriate network speeds, multiple simultaneous users, and access to large datasets composed of NARA ERA data types. Further XFDU/PAWN testing requires such an environment.

  22. Current Issue
  • There appears to be real requirements drift as users become accustomed to the massive storage available.
  • In our JPL/NSSDC testbed, the largest single volume needing to be archived has gone from 50 gigabytes to 1-3 terabytes in less than a year.
