Sequence Search
The Problem as defined in Windows 3.1 days • Search for a sequence in a database several megabytes in size, on a machine with 640 KB of memory, as quickly as possible, and return all matching sequences • If the sequence does not exist, add it to the data file • Each sequence is given a unique identifier • A sequence may be a subset of another sequence
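The search-or-add behaviour described above can be sketched in a few lines. This is only an illustration, not the original implementation: it assumes that "matching" means the query occurs as a contiguous run inside a stored sequence, it keeps everything in an in-memory dict rather than a data file, and the names `contains` and `search_or_add` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Seq = Tuple[str, ...]


def contains(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """True if needle occurs as a contiguous run inside haystack (assumed meaning of 'match')."""
    n = len(needle)
    return any(tuple(haystack[i:i + n]) == tuple(needle)
               for i in range(len(haystack) - n + 1))


def search_or_add(store: Dict[int, Seq],
                  query: Sequence[str]) -> Tuple[List[int], Optional[int]]:
    """Return (ids of matching sequences, id of newly added sequence or None)."""
    q = tuple(query)
    hits = [sid for sid, seq in store.items() if contains(seq, q)]
    if hits:
        return hits, None
    # No match: store the query under a fresh unique identifier.
    new_id = max(store, default=0) + 1
    store[new_id] = q
    return [], new_id


store = {1: tuple("ABCFTR"), 2: tuple("BCFT")}
found, added = search_or_add(store, "BCFT")     # query is contained in both entries
missing, new_id = search_or_add(store, "XYZW")  # query is unknown, so it is added
```

A linear scan like this would not meet the 640 KB constraint on a multi-megabyte database; it only pins down the expected inputs and outputs.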
Givens • Unlimited disk space • Sequences are made up of amino acids from a growing set • Each amino acid in the database is given an entry number • Sequences contain at least 4 amino acids and may be of any length above that
Limitations • Maximum network speed 2 Mb/sec (LANtastic) • No SQL databases, only Paradox available
Current Situation • Amino acid table consists of approximately 700 entries • Over 38,000 unique sequences • Sequences occupy over 11 MB of Paradox table • Two auxiliary tables, 17 MB in total • Negative result returned almost instantly
Implementation • Each amino acid is represented by a letter or by its entry number in the AA table, e.g. ABCFTR or ABS(123)DFR • The sequence as entered is converted to a hex-triple representation, e.g. 00100200301F • Hex-triple was chosen because it requires only 3 characters to represent up to 4095 distinct amino acids, which makes for shorter sequence representations
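A minimal sketch of the hex-triple conversion described above, under stated assumptions: the letter-to-entry table `LETTER_TO_ENTRY` is invented for illustration (the real AA table is not shown), a number in parentheses is taken to be an entry number as the ABS(123)DFR example suggests, and all function names are hypothetical. The last helper shows one consequence of the fixed-width encoding: a substring hit in the encoded text only counts if it starts on a 3-character boundary.

```python
import re
from typing import Dict, List

# Hypothetical letter -> entry-number table; the real AA table is not given in the slides.
LETTER_TO_ENTRY: Dict[str, int] = {"A": 1, "B": 2, "C": 3}

# One token is either a single letter or an entry number in parentheses.
TOKEN = re.compile(r"([A-Z])|\((\d+)\)")


def parse(text: str) -> List[int]:
    """Turn input such as 'AB(31)C' into a list of entry numbers."""
    entries: List[int] = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        if m.group(1):
            entries.append(LETTER_TO_ENTRY[m.group(1)])
        else:
            entries.append(int(m.group(2)))
        pos = m.end()
    return entries


def to_hex_triples(entries: List[int]) -> str:
    """Encode each entry number as exactly three uppercase hex digits."""
    for e in entries:
        if not 0 <= e <= 0xFFF:  # three hex digits cover 0x000..0xFFF
            raise ValueError(f"entry {e} does not fit in three hex digits")
    return "".join(f"{e:03X}" for e in entries)


def from_hex_triples(encoded: str) -> List[int]:
    """Decode a hex-triple string back into entry numbers."""
    if len(encoded) % 3:
        raise ValueError("length must be a multiple of 3")
    return [int(encoded[i:i + 3], 16) for i in range(0, len(encoded), 3)]


def find_aligned(haystack: str, needle: str) -> List[int]:
    """Positions (in amino-acid units) where needle starts on a triple boundary."""
    hits: List[int] = []
    i = haystack.find(needle)
    while i != -1:
        if i % 3 == 0:
            hits.append(i // 3)
        i = haystack.find(needle, i + 1)
    return hits


encoded = to_hex_triples([1, 2, 3, 31])  # the slide's example: 00100200301F
```

Without the alignment check, the encoding of entry 256 (`100`) would appear to occur inside `010001` (entries 16 and 1), which is a false match.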
Do we need to update the system? • Yes, we want to be rid of Paradox