Large imports in caArray Design Discussion
Background • How the current release (2.4.1) works • User uploads MAGE-TAB files and data files. • Each file is stored as a multipart blob in the database. • User selects a set of MAGE-TAB files and data files and validates and/or imports. • System validates, and creates the sample-data relationships in the database. • If a data file is “parseable” by caArray, the individual values are also parsed out and stored in the database.
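To make the multipart-blob mechanism above concrete, here is a minimal sketch of how an uploaded file might be chunked into BLOB rows. The table name, column names, and chunk size are assumptions for illustration, not the actual caArray schema.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

/**
 * Sketch of the 2.4.1-style approach: an uploaded file is broken into
 * fixed-size chunks and each chunk is stored as one BLOB row.
 * CAARRAYFILE_BLOB, FILE_ID, CHUNK_INDEX, and CONTENTS are hypothetical names.
 */
public class MultipartBlobWriter {

    private static final int CHUNK_SIZE = 50 * 1024 * 1024; // 50 MB per blob part (assumed)

    public static void storeAsBlobs(Connection conn, long fileId, String path) throws Exception {
        String sql = "INSERT INTO CAARRAYFILE_BLOB (FILE_ID, CHUNK_INDEX, CONTENTS) VALUES (?, ?, ?)";
        try (InputStream in = new FileInputStream(path);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            byte[] buffer = new byte[CHUNK_SIZE];
            int chunkIndex = 0;
            int read;
            while ((read = in.read(buffer)) > 0) {
                ps.setLong(1, fileId);
                ps.setInt(2, chunkIndex++);
                ps.setBytes(3, java.util.Arrays.copyOf(buffer, read));
                ps.executeUpdate(); // each chunk becomes one BLOB row in the database
            }
        }
    }
}
```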
Problems with Current Solution • Database size: • Storing the native files in the database causes the database size to grow rapidly, leading to a risk of performance problems down the road. • Limit on importable file set size: • There is a MySQL 4GB limit on a single import transaction, forcing the user to manually break a file set down into smaller chunks, each with its own MAGE-TAB file, before importing.
Solution 1: Store native files on file system • Store native files on the file system. • This alleviates the rapid database growth problem. • It also alleviates the MySQL limit problem to some extent, because only the parsed values now go into the database, not the native file itself. • Status = Already implemented on the trunk in recent 2.5.0 milestone tags. • Does not completely solve the problem of the user having to manually chunk the file set.
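A minimal sketch of what Solution 1 could look like, assuming a configurable storage root and a per-project directory layout. This is illustrative rather than the actual 2.5.0 implementation; the key point is that only a file reference, not the file contents, is persisted in the database.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/**
 * Sketch of Solution 1: the uploaded native file is moved to a managed
 * directory on the file system, and the database keeps only its location.
 * The storage root and naming scheme are assumptions.
 */
public class FileSystemStorage {

    private final Path storageRoot;

    public FileSystemStorage(String root) {
        this.storageRoot = Paths.get(root);
    }

    /**
     * Moves the uploaded file under the storage root and returns the relative
     * path to persist in the database in place of a BLOB.
     */
    public String storeNativeFile(Path uploadedFile, long projectId) throws IOException {
        Path targetDir = storageRoot.resolve(String.valueOf(projectId));
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(uploadedFile.getFileName());
        Files.move(uploadedFile, target, StandardCopyOption.REPLACE_EXISTING);
        return storageRoot.relativize(target).toString();
    }
}
```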
Solution 2: Break import into multiple smaller transactions • [Note: Switching from MySQL to Postgres was considered as an alternative, but we would still have had the problem of very long-running transactions leading to potential lock wait timeouts on sensitive CSM tables.] • This solution makes the splitting transparent to the user. • It will replace the Perl script that we wrote for the curators to help them split the MAGE-TAB files and create multiple smaller file sets, each with its own MAGE-TAB files and data files.
Proposed Workflow: Splitting • User selects the full MAGE-TAB set and associated data files and chooses to Import. • System adds the import job to the Job Queue. • System first validates the file set and informs the user if there are errors. • User has the opportunity to correct files, re-upload, and click on Import again. • Once validation succeeds, the System splits the MAGE-TAB set into multiple sets, each referencing a subset of the data files. • IDF will be cloned; SDRF will be split, likely one row at a time. • Each child IDF, SDRF, and the data files referenced by that SDRF will be persisted as a MAGE-TAB Transaction Set. • Assumption: a single row will not have data files totaling more than 4GB. This agrees with the file sizes that need to be supported by the user community.
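A minimal sketch of the splitting step described above, assuming tab-delimited SDRF rows and that data files are referenced by their exact uploaded names. Class and record names are illustrative, not the caArray domain model.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the splitting step: the validated SDRF is split one data row at a
 * time; each row is paired with a clone of the IDF and the data files that
 * row references, forming one "MAGE-TAB Transaction Set".
 */
public class MageTabSplitter {

    /** One importable unit: a cloned IDF, a single-row SDRF, and the data files that row references. */
    public record TransactionSet(String idfContent, String sdrfContent, Set<String> dataFiles) {}

    public List<TransactionSet> split(String idfContent, String sdrfHeader,
                                      List<String> sdrfDataRows, Set<String> uploadedDataFiles) {
        List<TransactionSet> sets = new ArrayList<>();
        for (String row : sdrfDataRows) {
            // Each child SDRF keeps the original header plus exactly one data row.
            String childSdrf = sdrfHeader + "\n" + row;
            // Collect the uploaded data files this row references (simple exact-name match on cells).
            Set<String> referenced = new LinkedHashSet<>();
            for (String cell : row.split("\t")) {
                if (uploadedDataFiles.contains(cell)) {
                    referenced.add(cell);
                }
            }
            // The IDF is cloned unchanged for every child set.
            sets.add(new TransactionSet(idfContent, childSdrf, referenced));
        }
        return sets;
    }
}
```

Splitting one row at a time keeps each transaction set within the 4GB-per-row assumption stated above; rows could also be grouped into larger sets as long as the total size of the referenced data files stays under that limit.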
Proposed Workflow: Import • Each MAGE-TAB Transaction Set will be imported. • Each import will be an atomic transaction. • If an SDRF refers to a data file that has already been imported in a previous import, the System will know to use the already-imported file. • Transparent to the user: the Job Queue will show only the parent import job. • This ensures that the Job Queue functionality (job statuses, Cancel Job) works just as it does today.
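A minimal sketch of the per-set import loop, showing one atomic transaction per MAGE-TAB Transaction Set and the reuse of data files already imported by an earlier set. The transaction and persistence calls are placeholders for whatever the container-managed services actually provide.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the import step: each MAGE-TAB Transaction Set runs in its own
 * atomic transaction, and data files already imported by a previous set are
 * reused rather than imported again.
 */
public class TransactionSetImporter {

    /** Mirrors the transaction set produced by the splitting step. */
    public record TransactionSet(String idfContent, String sdrfContent, Set<String> dataFiles) {}

    private final Set<String> alreadyImported = new HashSet<>();

    public void importAll(List<TransactionSet> sets) {
        for (TransactionSet set : sets) {
            try {
                // One atomic transaction per set, so a failure rolls back only this subset.
                beginTransaction();
                for (String dataFile : set.dataFiles()) {
                    if (alreadyImported.contains(dataFile)) {
                        continue; // data file already imported by a previous set; reuse it
                    }
                    importDataFile(dataFile);
                    alreadyImported.add(dataFile);
                }
                importAnnotations(set.idfContent(), set.sdrfContent());
                commit();
            } catch (Exception e) {
                rollback();
                markSetAsImportFailed(set, e);
            }
        }
    }

    // Placeholders for the container-managed transaction and persistence calls.
    private void beginTransaction() {}
    private void commit() {}
    private void rollback() {}
    private void importDataFile(String fileName) {}
    private void importAnnotations(String idf, String sdrf) {}
    private void markSetAsImportFailed(TransactionSet set, Exception e) {}
}
```

Because each set commits independently, a failure rolls back only that subset, which is what makes the partial-failure handling on the following slides possible.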
Proposed Workflow: Manage Data UI • The smaller split SDRFs and IDF clones are invisible to the user. • When a particular MAGE-TAB Transaction Set is being imported, the status of the involved files will change to “Importing”, “Imported”, etc. • The remaining files will stay in the “Uploaded” or “Validated” state.
Proposed Workflow: Failed Transactions • If some of the MAGE-TAB Transaction Set imports fail: • Files will have the status of Import Failed. • Parent SDRF may have the status “Partially Imported” (if some sets imported successfully and others failed). • In case of partial failure, the user can delete the bad files, upload the corrected files with a new IDF and SDRF, and re-import that new set. • Revisit this Assumption: Once any subset has started importing, the parent import job cannot be cancelled by the user.
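A minimal sketch of how the parent SDRF status could be derived once all child transaction sets have finished. The status enum is illustrative and follows the names used on this slide rather than caArray's actual file status type.

```java
import java.util.List;

/**
 * Sketch of deriving the parent SDRF status from the outcomes of its child
 * transaction sets: all succeeded, all failed, or a mix ("Partially Imported").
 */
public class ParentStatusResolver {

    public enum Status { IMPORTED, IMPORT_FAILED, PARTIALLY_IMPORTED }

    /** Each entry is true if the corresponding child transaction set imported successfully. */
    public Status resolve(List<Boolean> childResults) {
        boolean anySucceeded = childResults.contains(Boolean.TRUE);
        boolean anyFailed = childResults.contains(Boolean.FALSE);
        if (anySucceeded && anyFailed) {
            return Status.PARTIALLY_IMPORTED; // some sets imported successfully, others failed
        }
        return anyFailed ? Status.IMPORT_FAILED : Status.IMPORTED;
    }
}
```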