
Large imports in caArray



1. Large imports in caArray: Design Discussion

2. Background
• How the current release (2.4.1) works:
  • User uploads MAGE-TAB files and data files.
  • Each file is stored as a multipart blob in the database (sketched below).
  • User selects a set of MAGE-TAB files and data files and validates and/or imports.
  • System validates and creates the sample-data relationships in the database.
  • If a data file is “parseable” by caArray, the individual values are also parsed out and stored in the database.
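For illustration, a minimal JPA-style sketch of what per-file multipart blob storage can look like. The class names, chunk size, and mappings here are assumptions made for this sketch, not caArray's actual schema:

    import java.util.ArrayList;
    import java.util.List;
    import javax.persistence.*;

    // Hypothetical sketch of multipart blob storage: one row per file,
    // with the file contents split across fixed-size chunk rows so no
    // single column has to hold a multi-gigabyte value.
    @Entity
    public class CaArrayFile {
        @Id @GeneratedValue
        private Long id;

        private String name;

        @OneToMany(cascade = CascadeType.ALL)
        @OrderColumn(name = "chunk_index")
        private List<FileChunk> chunks = new ArrayList<>();
    }

    @Entity
    class FileChunk {
        // Assumed chunk size; writers slice file contents into
        // CHUNK_SIZE pieces before persisting.
        static final int CHUNK_SIZE = 10 * 1024 * 1024;

        @Id @GeneratedValue
        private Long id;

        @Lob
        private byte[] contents;
    }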

3. Problems with Current Solution
• Database size:
  • Storing the native files in the database causes the database size to grow rapidly, leading to a risk of performance problems down the road.
• Limit on importable file set size:
  • There is a MySQL 4GB limit on a single import transaction, forcing the user to manually break a file set into smaller chunks, each with its own MAGE-TAB file, before importing.

4. Solution 1: Store native files on file system
• Store native files on the file system (sketched below).
  • This alleviates the rapid-database-growth problem.
  • It also alleviates the MySQL limit problem to some extent, because only the parsed values now go into the database, not the native file itself.
• Status: already implemented on the trunk in recent 2.5.0 milestone tags.
• Does not completely solve the problem of the user having to manually chunk the file set.
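A minimal sketch of Solution 1, assuming a hypothetical FileSystemStorage helper: the uploaded bytes are streamed to disk, and only the resulting path (plus metadata) would be persisted in the database:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.*;

    // Hypothetical sketch: native files live on the file system;
    // the database keeps only the storage path and metadata.
    public class FileSystemStorage {
        private final Path dataRoot;

        public FileSystemStorage(Path dataRoot) {
            this.dataRoot = dataRoot;
        }

        /** Streams an upload to disk and returns the path to persist in the DB. */
        public Path store(String projectId, String fileName, InputStream upload)
                throws IOException {
            Path dir = dataRoot.resolve(projectId);
            Files.createDirectories(dir);
            Path target = dir.resolve(fileName);
            // Stream rather than buffer, so multi-gigabyte files never
            // have to fit in memory or in a single DB transaction.
            Files.copy(upload, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        }
    }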

5. Solution 2: Break import into multiple smaller transactions
• [Note: Switching from MySQL to Postgres was considered as an alternative, but we would still have had the problem of very long-running transactions leading to potential lock wait timeouts on sensitive csm tables.]
• This solution makes the splitting transparent to the user (see the sketch below).
• It will replace the Perl script that we wrote for the curators to help them split the MAGE-TAB files and create multiple smaller file sets, each with its own MAGE-TAB files and data files.
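A minimal sketch of the per-subset transaction loop, assuming JPA and a hypothetical MageTabTransactionSet type. Each subset commits independently, so no single transaction approaches the 4GB limit, and a failure rolls back only that subset:

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;

    // Hypothetical sketch of Solution 2: one transaction per subset.
    public class ChunkedImporter {

        /** Hypothetical handle on one split IDF/SDRF plus its data files. */
        public interface MageTabTransactionSet {
            void importInto(EntityManager em);
            void markFailed(Exception cause);
        }

        private final EntityManagerFactory emf;

        public ChunkedImporter(EntityManagerFactory emf) {
            this.emf = emf;
        }

        public void importAll(List<MageTabTransactionSet> sets) {
            for (MageTabTransactionSet set : sets) {
                EntityManager em = emf.createEntityManager();
                try {
                    em.getTransaction().begin();
                    set.importInto(em);            // persist this subset only
                    em.getTransaction().commit();  // one atomic commit per subset
                } catch (RuntimeException e) {
                    if (em.getTransaction().isActive()) {
                        em.getTransaction().rollback();
                    }
                    set.markFailed(e);             // later subsets still proceed
                } finally {
                    em.close();
                }
            }
        }
    }

Committing per subset also releases locks between subsets, which is what addresses the lock-wait-timeout concern in the note above.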

6. Proposed Workflow: Splitting
• User selects the full MAGE-TAB set and associated data files and chooses to Import.
• System adds the import job to the Job Queue.
• System first validates the file set and informs the user if there are errors.
  • User has the opportunity to correct files, re-upload, and click Import again.
• Once validation succeeds, the System splits the MAGE-TAB set into multiple sets, each referencing a subset of the data files.
  • The IDF will be cloned; the SDRF will be split, likely one row at a time (see the sketch below).
  • Each child IDF, its SDRF, and the data files referenced by that SDRF will be persisted as a MAGE-TAB Transaction Set.
• Assumption: a single row will not have data files totaling more than 4GB. This agrees with the file sizes that need to be supported by the user community.
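A minimal sketch of the one-row-at-a-time SDRF split, with hypothetical names. Real SDRF parsing is column-typed ("Array Data File", etc.); this sketch treats rows as opaque tab-delimited lines:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of the splitting step: the SDRF header is
    // cloned into every child, and each data row becomes its own
    // one-row child SDRF.
    public class SdrfSplitter {

        /** Splits an SDRF (header + data rows) into one-row child SDRFs. */
        public static List<List<String>> split(List<String> sdrfLines) {
            String header = sdrfLines.get(0);
            List<List<String>> children = new ArrayList<>();
            for (String row : sdrfLines.subList(1, sdrfLines.size())) {
                // One row per child keeps each transaction set small, under
                // the assumption that one row's data files total under 4GB.
                children.add(Arrays.asList(header, row));
            }
            return children;
        }

        /** Data file referenced by one row, given the data-file column index. */
        public static String referencedDataFile(String row, int dataFileColumn) {
            return row.split("\t", -1)[dataFileColumn];
        }
    }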

7. Proposed Workflow: Import
• Each MAGE-TAB Transaction Set will be imported.
  • Each import will be an atomic transaction.
  • If an SDRF refers to a data file that has already been imported by a previous transaction, the System will know to reuse the imported file (see the sketch below).
• Transparent to the user: the Job Queue will show only the parent import job.
  • This ensures that the Job Queue functionality (job statuses, Cancel Job) works just as it does today.
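A minimal sketch of the reuse check, assuming a hypothetical per-subset importer interface: files already imported by an earlier transaction set are linked rather than imported again:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of cross-subset file reuse. All names here
    // are illustrative, not caArray's actual API.
    public class ImportCoordinator {

        /** Hypothetical per-subset import operations. */
        public interface TransactionSetImporter {
            List<String> referencedDataFiles();
            void importFile(String fileName);   // parse/store the file
            void linkExisting(String fileName); // point at the prior import
        }

        private final Set<String> importedFiles = new HashSet<>();

        public void importSet(TransactionSetImporter set) {
            for (String fileName : set.referencedDataFiles()) {
                if (importedFiles.contains(fileName)) {
                    set.linkExisting(fileName); // already imported earlier
                } else {
                    set.importFile(fileName);
                    importedFiles.add(fileName);
                }
            }
        }
    }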

8. Proposed Workflow: Manage Data UI
• The smaller split SDRFs and IDF clones are invisible to the user.
• When a particular MAGE-TAB Transaction Set is being imported, the status of the involved files will change to “Importing”, “Imported”, etc.
• The remaining files will stay in the “Uploaded” or “Validated” state.

9. Proposed Workflow: Failed Transactions
• If some of the MAGE-TAB Transaction Set imports fail:
  • Those files will have the status “Import Failed”.
  • The parent SDRF may have the status “Partially Imported” (if some sets imported successfully and others failed; see the sketch below).
• On a partial failure, the user can delete the bad files, upload the corrected files with a new IDF-SDRF pair, and re-import that new set.
• Revisit this assumption: once any subset has started importing, the parent import job cannot be cancelled by the user.
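A minimal sketch of how the parent status could be derived from child outcomes, using the status names from this slide; the enum and method are hypothetical:

    import java.util.List;

    // Hypothetical sketch: the parent SDRF's status is an aggregate of
    // its child transaction sets' outcomes.
    public class StatusAggregator {

        public enum Status { IMPORTED, IMPORT_FAILED, PARTIALLY_IMPORTED }

        public static Status parentStatus(List<Status> childStatuses) {
            boolean anyFailed = childStatuses.contains(Status.IMPORT_FAILED);
            boolean anySucceeded = childStatuses.contains(Status.IMPORTED);
            if (anyFailed && anySucceeded) {
                return Status.PARTIALLY_IMPORTED; // mixed outcome
            }
            return anyFailed ? Status.IMPORT_FAILED : Status.IMPORTED;
        }
    }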
