An Efficient and Transparent Transaction Management based on the Data Workflow of HVEM DataGrid. Im Young Jung Seoul National University. Introduction. Transaction Management for a safe data update and insertion on e-Science DataGrid
An Efficient and Transparent Transaction Management based on the Data Workflow of HVEM DataGrid
Im Young Jung
Seoul National University
Introduction
• Transaction management for safe data update and insertion on an e-Science DataGrid
  • Heterogeneous storages are used according to the characteristics and size of the data
  • The data workflow determines the storing precedence of data across heterogeneous storages within a transaction
• In this paper
  • An efficient and transparent transaction management scheme for HVEM DataGrid
  • A transaction is divided into sub-transactions according to the transaction states, and the sub-transactions are classified
  • Transaction hierarchy and parallelism provide
    • efficient and safe upload of large data to HVEM DataGrid
    • transparency within a transaction, including simultaneous access to heterogeneous storages
  • Automatic garbage collection
HVEM Grid
• High Voltage Electron Microscope (HVEM)
  • Lets scientists carry out 3D structure analysis of new materials at micrometer scale
• HVEM Grid
  • Remote users can perform the same tasks as on-site scientists.
    • Remote control of the HVEM
    • Storing, retrieving, and searching data through HVEM DataGrid
    • Processing data through HVEM Computational Grid
HVEM DataGrid
• Designed for biological experiments using HVEM
• A logical view of the DB and the file storage as one storage
  • Small metadata is stored in the DB
    • Information on materials, material handling methods, HVEM experiments, images, and experimenters
  • Large files are stored in the file storages
    • 2D or 3D image files and documents related to HVEM experiments
• Internal process to find files
  • Users first search the DB for the logical path of a file in the file storage, then retrieve the file from the file storage
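The two-step lookup on this slide (search the DB for a logical path, then fetch the file) can be sketched as follows. This is a minimal illustration, not the HVEM DataGrid implementation: the table name `image_meta`, its columns, the sample path, and the in-memory dictionary standing in for the file storage are all assumptions.

```python
import sqlite3

# Hypothetical schema: one metadata row per image, holding its logical path.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE image_meta ("
    "image_id INTEGER PRIMARY KEY, material TEXT, logical_path TEXT)"
)
db.execute(
    "INSERT INTO image_meta VALUES (1, 'sample-A', '/hvem/exp1/img001.tif')"
)

# Stand-in for the file storage: logical path -> file content.
file_storage = {"/hvem/exp1/img001.tif": b"<image bytes>"}


def retrieve_by_material(material):
    """Search the DB for logical paths, then fetch each file from storage."""
    rows = db.execute(
        "SELECT logical_path FROM image_meta "
        "WHERE material = ? ORDER BY image_id",
        (material,),
    ).fetchall()
    return [(path, file_storage[path]) for (path,) in rows]
```

The caller only names the material; the logical path stays an internal detail, which is the "one storage" view the slide describes.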
HVEM DataGrid
• Unified data management
  • Storing precedence among data
    • When the biological information for images is stored, the images themselves must be kept in HVEM Grid at the same time
  • Relational semantics between the various data stored in distributed, heterogeneous storages
• Uploading many large files to HVEM DataGrid efficiently and safely
  • Upload dependency and serialization
  • Transactions ensure that parallel uploads are safe
An efficient and transparent transaction management
• Requirements for transactions on HVEM DataGrid
  • Consider the semantics of HVEM DataGrid
    • A project is composed of several experiments
    • The data for an experiment should be inserted according to its data workflow
    • A file and its metadata should be stored to HVEM DataGrid together; otherwise, all of them should be deleted
  • Support
    • long-lived transactions, bounded by the time limit of an experiment or project
    • short-lived transactions, which physically store the data to HVEM DataGrid
  • Optimizing the upload of large files to reduce blocking time must still ensure safe transactions
  • An asynchronous, parallel upload scheme must preserve upload dependency and ensure safe transactions
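The "file and metadata together, otherwise delete everything" requirement can be illustrated with a small all-or-nothing store. This is only a sketch under assumptions: both storages are modeled as plain dictionaries, and the function name and the `fail_db` switch (used to simulate a DB failure) are hypothetical.

```python
def store_with_metadata(file_storage, db, path, content, metadata,
                        fail_db=False):
    """Store a file and its metadata together, or leave no trace of either."""
    file_storage[path] = content          # step 1: write the large file
    try:
        if fail_db:                       # hypothetical failure injection
            raise IOError("simulated DB failure")
        db[path] = metadata               # step 2: write the small metadata
    except IOError:
        # Undo step 1 so that no orphaned file is left behind.
        del file_storage[path]
        db.pop(path, None)
        return False
    return True
```

With `fail_db=True` the call returns `False` and both dictionaries are left as they were before the call.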
An efficient and transparent transaction management
• Transaction hierarchy
  • Transaction units act as checkpoints on incomplete data insertion
    • They confine the rollback extent
    • When the data for an experiment or a project has not been inserted to HVEM DataGrid by its time limit, the experiment or the project is removed by rolling back TnE or TnP
  • TnS((((1)2)5)2)
    • (1) is the identity of the TnP it belongs to
    • The next index '2' is the identity of the TnE, and so on
  • [Figure: transaction hierarchy. Labels: For Project; For Experiment; For a group of TnSs (parallel processing); For storing data to physical storage]
• Support for autonomous garbage collection
  • Inserting and deleting data on HVEM DataGrid is left to the users.
  • When users stop inserting experimental data for any reason without deleting the related data, HVEM DataGrid accumulates a large amount of garbage.
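How a nested index such as TnS((((1)2)5)2) can confine the rollback extent is sketched below. The slide only states that the first index identifies the TnP and the second the TnE; reading the third and fourth as TnW and TnS identities is an assumption, as are the function names.

```python
import re

# Assumed order of the nesting levels, outermost first.
LEVELS = ("TnP", "TnE", "TnW", "TnS")


def parse_index(label):
    """Turn 'TnS((((1)2)5)2)' into {'TnP': 1, 'TnE': 2, 'TnW': 5, 'TnS': 2}."""
    ids = [int(n) for n in re.findall(r"\d+", label)]
    if len(ids) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} indices in {label!r}")
    return dict(zip(LEVELS, ids))


def rollback_extent(labels, level, target):
    """Return the logged TnS labels that share the target's ancestors up to
    and including `level`, i.e. the units undone by rolling back that level."""
    depth = LEVELS.index(level) + 1
    wanted = [parse_index(target)[name] for name in LEVELS[:depth]]
    return [
        label for label in labels
        if [parse_index(label)[name] for name in LEVELS[:depth]] == wanted
    ]
```

Rolling back at the TnW level touches only the sub-transactions of one group, while rolling back at the TnE or TnP level removes a whole experiment or project.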
Transaction management scheme
• HVEM DataGrid forks two processes, one to connect to the DB and one to connect to the file storage.
• When both connections succeed, it takes the next request, and so on.
• State changes of TnS(((())j)i)
  • jSiS → jSiD (notification from the DB) or jSiF (notification from the file storage) → jSiE (both notifications have arrived): TnS completes
• On a light failure (LF), caused by a temporary network or server failure, the transaction is retried a fixed number of times
• When the retries fail, a serious failure (SF) is assumed and the rollback process starts
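The state changes and the LF/SF handling on this slide can be sketched as a small state machine with bounded retries. The single-letter state names are taken from the last letter of jSiS, jSiD, jSiF and jSiE; the function names, the use of `ConnectionError` to model a light failure, and the default retry count are assumptions.

```python
# Transitions of one TnS: "S" (start), "D" (DB notified),
# "F" (file storage notified), "E" (both notified, TnS complete).
TRANSITIONS = {
    ("S", "db"): "D",
    ("S", "file"): "F",
    ("D", "file"): "E",
    ("F", "db"): "E",
}


def next_state(state, event):
    """Advance a TnS state on a notification from 'db' or 'file'."""
    return TRANSITIONS.get((state, event), state)


def run_with_retries(operation, max_retries=3):
    """Retry on light failures; report a serious failure (SF) once the
    first attempt and all retries have failed."""
    for _ in range(1 + max_retries):
        try:
            return ("ok", operation())
        except ConnectionError:  # assumed stand-in for a light failure
            continue
    return ("SF", None)
```

A caller that receives `("SF", None)` would start the rollback process; the order in which the two notifications arrive does not matter for reaching state "E".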
Evaluation
• Analysis
  • Transparency
    • Through the transaction hierarchy and fine-grained state management, the transaction manager in HVEM DataGrid provides a transparent transaction that uploads image files to the file storage and stores their metadata in the DB simultaneously.
  • Serializability
    • Many TnSs are upload-serializable because their state changes are logged through the transaction index.
    • To keep the upload dependency, the transaction manager protects the first user who enters a TnW.
    • If that user withdraws from the TnW, another user can initiate it.
  • Transaction performance
    • The transaction scheme supports asynchronism and parallelism
• Experiment setting
  • Because the sub-transaction time on the DB is negligible compared with that on the file storage, owing to the data size, only the upload time of image files was considered
  • The semantics of the data workflow in HVEM DataGrid are taken into account
  • For asynchronous file transfer, the intervals between file transfer requests are chosen randomly within 50 sec
  • The file storages are assumed to be physically distributed
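The experiment setting (randomly spaced requests, parallel uploads) can be mimicked with a thread pool. This is a rough sketch, not the evaluation code behind the slides: `upload` is a placeholder that transfers nothing, the uniform distribution for the intervals is an assumption (the slide only says "randomly within 50 sec"), and the worker count is arbitrary.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def request_intervals(count, max_interval=50.0, seed=0):
    """Draw request intervals uniformly within max_interval seconds."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_interval) for _ in range(count)]


def upload(name, size):
    """Placeholder for one file transfer; returns what was 'uploaded'."""
    return name, size


def upload_all(files, workers=4):
    """Submit every upload to a thread pool and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload, name, size)
                   for name, size in files.items()]
        return dict(future.result() for future in futures)
```

In a real measurement the intervals would be used to delay each submission, and `upload` would perform the actual transfer to a file storage.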
Evaluation
• Overhead
  • Log management cost
    • The additional cost is the log for TnP, TnE and TnW; general transaction management already requires the log for TnS
    • The log for TnP, TnE and TnW is smaller than that for TnS because they function as checkpoints rather than as real transaction units.
  • Rollback cost
    • Cascading rollback of the TnSs in a TnW, caused by the upload dependency under parallel processing of TnSs
    • On an LF, if the retry succeeds, the gain from transaction parallelism can be very large, especially when handling large files
    • SFs and LFs are expected to be rare because an e-Science DataGrid is not as heavily used as multimedia storage
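The cascading rollback mentioned under rollback cost can be pictured as collecting every TnS that transitively depends on a failed one. The slides do not say how upload dependencies are represented, so the `depends_on` mapping (TnS name to the names it depends on) and the function name are assumptions made for illustration.

```python
def cascade_rollback(depends_on, failed):
    """Return the failed TnS plus every TnS that transitively depends on it."""
    to_roll_back = {failed}
    changed = True
    while changed:
        changed = False
        for tns, prerequisites in depends_on.items():
            # A TnS must be undone if any of its prerequisites is undone.
            if tns not in to_roll_back and to_roll_back.intersection(prerequisites):
                to_roll_back.add(tns)
                changed = True
    return to_roll_back
```

The size of the returned set is one way to think about rollback cost: independent TnSs in the same TnW are untouched, while a long dependency chain is undone in full.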
Conclusion
• Transaction management on HVEM Grid
  • Safety
    • Ensures safe transactions that respect the data workflow in HVEM DataGrid
  • Efficiency
    • Improves the performance of large file uploads through asynchronism and parallelism
  • Transparency
    • Data management across the heterogeneous storages
  • Automatic garbage collection
    • Reduces garbage