Large imports in caArray Design Discussion
Background • How the current release (2.4.1) works • User uploads MAGE-TAB files and data files. • Each file is stored as a multipart blob in the database. • User selects a set of MAGE-TAB files and data files and validates and/or imports. • System validates, and creates the sample-data relationships in the database. • If a data file is “parseable” by caArray, the individual values are also parsed out and stored in the database.
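To make the multipart-blob mechanism above concrete, here is a minimal sketch of how an uploaded file might be chunked into BLOB rows. The table name, column names, and chunk size are assumptions for illustration, not the actual caArray schema.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

/**
 * Sketch of the 2.4.1-style approach: an uploaded file is broken into
 * fixed-size chunks and each chunk is stored as one BLOB row.
 * CAARRAYFILE_BLOB, FILE_ID, CHUNK_INDEX, and CONTENTS are hypothetical names.
 */
public class MultipartBlobWriter {

    private static final int CHUNK_SIZE = 50 * 1024 * 1024; // 50 MB per blob part (assumed)

    public static void storeAsBlobs(Connection conn, long fileId, String path) throws Exception {
        String sql = "INSERT INTO CAARRAYFILE_BLOB (FILE_ID, CHUNK_INDEX, CONTENTS) VALUES (?, ?, ?)";
        try (InputStream in = new FileInputStream(path);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            byte[] buffer = new byte[CHUNK_SIZE];
            int chunkIndex = 0;
            int read;
            while ((read = in.read(buffer)) > 0) {
                ps.setLong(1, fileId);
                ps.setInt(2, chunkIndex++);
                ps.setBytes(3, java.util.Arrays.copyOf(buffer, read));
                ps.executeUpdate(); // each chunk becomes one BLOB row in the database
            }
        }
    }
}
```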
Problems with Current Solution • Database size: • Storing the native files in the database causes the database size to grow rapidly, leading to a risk of performance problems down the road. • Limit on importable file set size: • There is a MySQL 4GB limit on a single import transaction, forcing the user to manually break a file set down into smaller chunks, each with its own MAGE-TAB file, before importing.
Solution 1: Store native files on file system • Store native files on the file system. • This alleviates the rapid database growth problem. • It also alleviates the MySQL limit problem to some extent, because only the parsed values now go into the database, not the native file itself. • Status = Already implemented on the trunk in recent 2.5.0 milestone tags. • Does not completely solve the problem of the user having to manually chunk the file set.
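A minimal sketch of what Solution 1 could look like, assuming a configurable storage root and a per-project directory layout. This is illustrative rather than the actual 2.5.0 implementation; the key point is that only a file reference, not the file contents, is persisted in the database.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/**
 * Sketch of Solution 1: the uploaded native file is moved to a managed
 * directory on the file system, and the database keeps only its location.
 * The storage root and naming scheme are assumptions.
 */
public class FileSystemStorage {

    private final Path storageRoot;

    public FileSystemStorage(String root) {
        this.storageRoot = Paths.get(root);
    }

    /**
     * Moves the uploaded file under the storage root and returns the relative
     * path to persist in the database in place of a BLOB.
     */
    public String storeNativeFile(Path uploadedFile, long projectId) throws IOException {
        Path targetDir = storageRoot.resolve(String.valueOf(projectId));
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(uploadedFile.getFileName());
        Files.move(uploadedFile, target, StandardCopyOption.REPLACE_EXISTING);
        return storageRoot.relativize(target).toString();
    }
}
```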
Solution 2: Break import into multiple smaller transactions • [Note: Switching from MySQL to Postgres was considered as an alternative, but we would still have had the problem of very long-running transactions leading to potential lock wait timeouts on sensitive CSM tables.] • This solution makes the splitting transparent to the user. • It will replace the Perl script that we wrote for the curators to help them split the MAGE-TAB files and create multiple smaller file sets, each with its own MAGE-TAB files and data files.
Proposed Workflow: Splitting • User selects the full MAGE-TAB set and associated data files and chooses to Import. • System adds the import job to the Job Queue. • System first validates the file set and informs the user if there are errors. • User has the opportunity to correct files, re-upload, and click on Import again. • Once validation succeeds, the System splits the MAGE-TAB set into multiple sets, each referencing a subset of the data files. • IDF will be cloned; SDRF will be split, likely one row at a time. • Each child IDF, SDRF, and the data files referenced by that SDRF will be persisted as a MAGE-TAB Transaction Set. • Assumption: a single row will not have data files totaling more than 4GB. This agrees with the file sizes that need to be supported by the user community.
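A minimal sketch of the splitting step described above, assuming tab-delimited SDRF rows and that data files are referenced by their exact uploaded names. Class and record names are illustrative, not the caArray domain model.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the splitting step: the validated SDRF is split one data row at a
 * time; each row is paired with a clone of the IDF and the data files that
 * row references, forming one "MAGE-TAB Transaction Set".
 */
public class MageTabSplitter {

    /** One importable unit: a cloned IDF, a single-row SDRF, and the data files that row references. */
    public record TransactionSet(String idfContent, String sdrfContent, Set<String> dataFiles) {}

    public List<TransactionSet> split(String idfContent, String sdrfHeader,
                                      List<String> sdrfDataRows, Set<String> uploadedDataFiles) {
        List<TransactionSet> sets = new ArrayList<>();
        for (String row : sdrfDataRows) {
            // Each child SDRF keeps the original header plus exactly one data row.
            String childSdrf = sdrfHeader + "\n" + row;
            // Collect the uploaded data files this row references (simple exact-name match on cells).
            Set<String> referenced = new LinkedHashSet<>();
            for (String cell : row.split("\t")) {
                if (uploadedDataFiles.contains(cell)) {
                    referenced.add(cell);
                }
            }
            // The IDF is cloned unchanged for every child set.
            sets.add(new TransactionSet(idfContent, childSdrf, referenced));
        }
        return sets;
    }
}
```

Splitting one row at a time keeps each transaction set within the 4GB-per-row assumption stated above; rows could also be grouped into larger sets as long as the total size of the referenced data files stays under that limit.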
Proposed Workflow: Import • Each MAGE-TAB Transaction Set will be imported. • Each import will be an atomic transaction. • If an SDRF refers to a data file that has already been imported in a previous import, the System will know to use the already-imported file. • Transparent to the user: the Job Queue will show only the parent import job. • This ensures that the Job Queue functionality (job statuses, Cancel Job) works just as it does today.
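A minimal sketch of the per-set import loop, showing one atomic transaction per MAGE-TAB Transaction Set and the reuse of data files already imported by an earlier set. The transaction and persistence calls are placeholders for whatever the container-managed services actually provide.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the import step: each MAGE-TAB Transaction Set runs in its own
 * atomic transaction, and data files already imported by a previous set are
 * reused rather than imported again.
 */
public class TransactionSetImporter {

    /** Mirrors the transaction set produced by the splitting step. */
    public record TransactionSet(String idfContent, String sdrfContent, Set<String> dataFiles) {}

    private final Set<String> alreadyImported = new HashSet<>();

    public void importAll(List<TransactionSet> sets) {
        for (TransactionSet set : sets) {
            try {
                // One atomic transaction per set, so a failure rolls back only this subset.
                beginTransaction();
                for (String dataFile : set.dataFiles()) {
                    if (alreadyImported.contains(dataFile)) {
                        continue; // data file already imported by a previous set; reuse it
                    }
                    importDataFile(dataFile);
                    alreadyImported.add(dataFile);
                }
                importAnnotations(set.idfContent(), set.sdrfContent());
                commit();
            } catch (Exception e) {
                rollback();
                markSetAsImportFailed(set, e);
            }
        }
    }

    // Placeholders for the container-managed transaction and persistence calls.
    private void beginTransaction() {}
    private void commit() {}
    private void rollback() {}
    private void importDataFile(String fileName) {}
    private void importAnnotations(String idf, String sdrf) {}
    private void markSetAsImportFailed(TransactionSet set, Exception e) {}
}
```

Because each set commits independently, a failure rolls back only that subset, which is what makes the partial-failure handling on the following slides possible.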
Proposed Workflow: Manage Data UI • The smaller split SDRFs and IDF clones are invisible to the user. • When a particular MAGE-TAB Transaction Set is being imported, the status of the involved files will change to “Importing”, “Imported”, etc. • The remaining files will stay in the “Uploaded” or “Validated” state.
Proposed Workflow: Failed Transactions • If some of the MAGE-TAB Transaction Set imports fail: • Files will have the status of Import Failed. • Parent SDRF may have the status “Partially Imported” (if some sets imported successfully and others failed). • In case of partial failure, the user can delete the bad files, upload the corrected files with a new IDF and SDRF, and re-import that new set. • Revisit this Assumption: Once any subset has started importing, the parent import job cannot be cancelled by the user.
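A minimal sketch of how the parent SDRF status could be derived once all child transaction sets have finished. The status enum is illustrative and follows the names used on this slide rather than caArray's actual file status type.

```java
import java.util.List;

/**
 * Sketch of deriving the parent SDRF status from the outcomes of its child
 * transaction sets: all succeeded, all failed, or a mix ("Partially Imported").
 */
public class ParentStatusResolver {

    public enum Status { IMPORTED, IMPORT_FAILED, PARTIALLY_IMPORTED }

    /** Each entry is true if the corresponding child transaction set imported successfully. */
    public Status resolve(List<Boolean> childResults) {
        boolean anySucceeded = childResults.contains(Boolean.TRUE);
        boolean anyFailed = childResults.contains(Boolean.FALSE);
        if (anySucceeded && anyFailed) {
            return Status.PARTIALLY_IMPORTED; // some sets imported successfully, others failed
        }
        return anyFailed ? Status.IMPORT_FAILED : Status.IMPORTED;
    }
}
```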