Andrew Dalke Dalke Scientific Software, LLC. O|B|F Flatfile Indexing. One of the Biohackathon projects.
Use case
Sally, a bioinformatics researcher, needs fast access to many different records from GenBank. She is in a small group with little experience in database management systems, so she wants a simple system that doesn't involve a client/server model. She also wants the different tools she has (written for the different Bio* projects) to be able to access the same system, so she doesn't have to keep extracting data with one tool for use by another.
Background
- Have a set of large data files
- Each contains many records
- Records have identifiers: id, accession, gid, entry name, etc.
- Want to retrieve a record given an identifier
- Don't want to set up a database server
- Make an indexer
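The indexing idea above can be sketched in a few lines of Python. This is only an illustration, not the O|B|F implementation: it assumes GenBank-style text in which each record ends with a line containing only `//` and the identifier is the first word after `ACCESSION`. The names `scan_records`, `build_index` and `fetch` are hypothetical, and the sample records are made up.

```python
import io

def scan_records(handle):
    """Yield (accession, start_byte, length) for each record in a binary handle.

    Assumption: records end with a line holding only "//" and carry an
    ACCESSION line whose first word is the identifier.
    """
    start = handle.tell()
    accession = None
    while True:
        line = handle.readline()
        if not line:
            break
        if accession is None and line.startswith(b"ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                accession = fields[1].decode("ascii")
        if line.rstrip() == b"//":
            end = handle.tell()
            if accession is not None:
                yield accession, start, end - start
            start, accession = end, None

def build_index(handle):
    """Map each accession to the (start_byte, length) of its record."""
    return {acc: (start, length) for acc, start, length in scan_records(handle)}

def fetch(handle, index, accession):
    """Seek straight to a record and read exactly its bytes."""
    start, length = index[accession]
    handle.seek(start)
    return handle.read(length)

# Made-up two-record sample, held in memory instead of a real data file.
sample = (
    b"LOCUS       AB000001\nACCESSION   AB000001\nORIGIN\n        1 acgt\n//\n"
    b"LOCUS       AB000002\nACCESSION   AB000002\nORIGIN\n        1 ttga\n//\n"
)
demo_handle = io.BytesIO(sample)
demo_index = build_index(demo_handle)
demo_record = fetch(demo_handle, demo_index, "AB000002")
```

Once the index exists, retrieving a record costs one seek and one read, whatever the size of the data file.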
Indexer
- Nothing new here
- "Everyone" has written one
- Spec out a standard and use it

"Schema"
- Primary identifier -> (filename, start byte, length)
  (Actually, the filename is normalized to a fileid)
- Secondary identifier *-* primary identifier
  [Diagram: several secondary-identifier boxes linked to the primary identifier, each link marked "*" at both ends]
Index as flat-file

config.dat
    index \t flat/1
    fileid_1 \t /path/to/here
    fileid_2 \t /path/to/there
    ....

key_ID.key
    P12345 \t 1 \t 10000 \t 100
    ....

id_ACC.index
    GI22222 \t P00012
    GI22222 \t P12345
    GI22223 \t P86753
    ....

The .key and .index files are fixed width and sorted. This allows fast binary searches.
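A binary search over fixed-width, sorted records might look like the Python sketch below. It is an illustration under several assumptions, not the spec: the record width is passed in as a parameter, records are padded with spaces, and the `.key` columns are read as (key, fileid, start byte, length) to match the schema slide. The rows for P00012 and P86753 in the key table are invented for the demo, and `pack_records`, `lookup` and `resolve` are hypothetical names.

```python
import io

def pack_records(rows, width):
    """Tab-join each row, pad it to `width` bytes, and sort the rows by key."""
    records = []
    for row in sorted(rows, key=lambda r: r[0]):
        rec = "\t".join(str(f) for f in row).encode("ascii")
        if len(rec) > width:
            raise ValueError("record does not fit in the fixed width")
        records.append(rec.ljust(width))  # pads with ASCII spaces
    return io.BytesIO(b"".join(records))

def lookup(handle, width, key):
    """Return the non-key fields of every record whose first field is `key`."""
    wanted = key.encode("ascii")
    handle.seek(0, io.SEEK_END)
    count = handle.tell() // width

    def read_fields(i):
        # Fixed width means record i always starts at byte i * width.
        handle.seek(i * width)
        return handle.read(width).rstrip(b" ").split(b"\t")

    # Binary search for the first record whose key is >= wanted.
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if read_fields(mid)[0] < wanted:
            lo = mid + 1
        else:
            hi = mid

    # Equal keys are adjacent in a sorted file, so walk forward from there.
    matches = []
    while lo < count:
        fields = read_fields(lo)
        if fields[0] != wanted:
            break
        matches.append([f.decode("ascii") for f in fields[1:]])
        lo += 1
    return matches

WIDTH = 32  # assumed record width for this demo
key_file = pack_records(
    [("P12345", 1, 10000, 100), ("P00012", 2, 0, 250), ("P86753", 1, 10100, 80)],
    WIDTH,
)
acc_file = pack_records(
    [("GI22222", "P00012"), ("GI22222", "P12345"), ("GI22223", "P86753")],
    WIDTH,
)

def resolve(secondary_id):
    """Secondary id -> primary ids -> (primary id, fileid, start, length)."""
    locations = []
    for (primary_id,) in lookup(acc_file, WIDTH, secondary_id):
        for fileid, start, length in lookup(key_file, WIDTH, primary_id):
            locations.append((primary_id, int(fileid), int(start), int(length)))
    return locations
```

Each probe is a single seek to `i * width`, so a lookup touches roughly log2(N) records instead of scanning the file. A secondary identifier such as GI22222 can map to more than one primary identifier, which is why `lookup` returns a list.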
Index in BerkeleyDB
- Use BDB tables for the key/value information
    - Faster
    - More scalable
    - Easier to edit, modify
    - More space efficient
- But it has an external dependency
- Client code can determine the format automatically
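One way a client could determine the format automatically is to read the `index` line of config.dat before opening anything else. The Python sketch below assumes that approach. It recognizes only the `flat/1` value shown on the flat-file slide, since the tag used for BerkeleyDB indexes is not given here. `read_config` and `is_flat_index` are hypothetical names.

```python
def read_config(text):
    """Parse tab-separated config.dat content.

    Returns (index_format, files), where files maps the part of the key
    after "fileid_" to the rest of that line.
    """
    index_format = None
    files = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split("\t", 1)
        if key == "index":
            index_format = value
        elif key.startswith("fileid_"):
            files[key[len("fileid_"):]] = value
    return index_format, files

def is_flat_index(index_format):
    """True when the config declares a flat-file index such as "flat/1"."""
    return index_format is not None and index_format.startswith("flat/")

# The config.dat content from the flat-file slide, written with real tabs.
demo_config = "index\tflat/1\nfileid_1\t/path/to/here\nfileid_2\t/path/to/there\n"
demo_format, demo_files = read_config(demo_config)
```

A client would branch on `demo_format` to pick the flat-file reader or the BerkeleyDB reader, so callers never need to say which kind of index they are opening.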
Bio* support
- Biopython: Andrew Dalke
- Bioperl: Michele Clamp & Lincoln Stein
- BioJava: Matthew Pocock
- BioRuby: Toshiaki Katayama (starting)
- BioC: Steve Searle
And they really do interoperate!
TODO
- Still tweaking the spec
    - How to handle format
    - Non-ASCII filenames / internationalization
- Need a cross-platform regression test suite