The search for a self-documenting image-file format for macromolecular crystallography Development of imgCIF R.M. Sweet, Brookhaven Biology Herbert Bernstein, Math and Comp. Sci., Dowling College
Motivation:
• Since the early days of computing, crystallography has been a discipline that pushes the limits of computation (I remember a Burroughs 220 tube-type machine in about 1963).
• It’s a data-rich science: results are created by computation and stored as numbers.
• Early on, crystallographers created the Crystallographic Information Framework (CIF), a relational database schema stored as 80-column text.
• It organized the raw data, the metadata describing the experiment, and the resulting structure.
• Nowadays, the diffraction experiment for small molecules is like a spectroscopy: give the specimen to the machine; hit <Enter>; look at the answer; hit <Enter> again to submit the CIF file to the Cambridge Crystallographic Data Centre.
High-throughput (HTP) macromolecular crystallography is approaching this situation.
• The Protein Data Bank (PDB) was created for this discipline in the ’70s to store the data, long before it was HTP.
• The PDB started to use a relational database and a flat-text form of that schema (mmCIF) in the ’90s.
• Some of us felt that the raw data should carry metadata in an organized way from the experiment to the deposition of the structure in the PDB.
• This would help the experimenter keep records, and it would help the programmer who wanted to use the data.
• He or she ought not to have to worry about how or where the original experiments were done; there should be a complete annotation of the work.
A growing need is FedEx data: Data are taken somewhere by one person, used somewhere else by someone else. It MUST be transparent where they came from, what they’re about.
A little history: There has been over a decade of discussion and activity on this question.
• March ’95 – The subject of internally documented images was raised in a workshop on GUIs at BNL.
• July ’95 – The SR-SIG endorsed creation of a standard image format with header at the ACA meeting.
• Early ’96 – Intense e-mail discussions made progress.
• August ’96 – Report from this group at the CIF workshop of the Seattle IUCr meeting.
• October ’97 – Major workshop at BNL. Led to two years of off-line work to establish the imgCIF/CBF standard: Andy Hammersley, Herb Bernstein, Paul Ellis.
• August ’99 – Reported to IUCr COMCIFS.
• July ’00 – Submitted to COMCIFS.
• December ’00 – Approved by COMCIFS.
• May ’05 – ACA data committee says, “Let’s get on with it!”
• July ’06 – Bernstein and Sweet hold yet another workshop to regain momentum.
The plan for handling image files is that each file should have a header that is essentially text, followed by the image, probably as raw binary.
The imgCIF dictionary is an add-on to the mmCIF dictionary. The dictionary is well documented. Shown here is an example from the dictionary of the data-array loop, this time as hexadecimal characters.

loop_
_array_data.array_id
_array_data.binary_id
_array_data.data
image_1 1
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_CANONICAL"
Content-Transfer-Encoding: X-BASE16
X-Binary-Size: 3927126
X-Binary-ID: 1
Content-MD5: u2sTJEovAHkmkDjPi+gWsg==

# Hexadecimal encoding, byte 0, byte order ...21
#
H4< 0050B810 00000000 00000000 00000000 000F423F 00000000 00000000 ...
....
--CIF-BINARY-FORMAT-SECTION----
;
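The header lines in this example follow MIME conventions, so a general-purpose MIME parser can read them. The following is a minimal sketch, assuming Python and its standard-library `email` package; the function name `read_binary_header` and the dictionary keys are made up for illustration and are not part of imgCIF or of any imgCIF library.

```python
from email import message_from_string

# Header lines copied from the dictionary example; the blank line ends the header.
HEADER = (
    'Content-Type: application/octet-stream; conversions="x-CBF_CANONICAL"\n'
    "Content-Transfer-Encoding: X-BASE16\n"
    "X-Binary-Size: 3927126\n"
    "X-Binary-ID: 1\n"
    "Content-MD5: u2sTJEovAHkmkDjPi+gWsg==\n"
    "\n"
)


def read_binary_header(text):
    """Return the fields of one binary-section header as a plain dict.

    Hypothetical helper: the key names are chosen for this sketch only.
    """
    msg = message_from_string(text)
    return {
        "content_type": msg.get_content_type(),
        "conversions": msg.get_param("conversions"),
        "encoding": msg["Content-Transfer-Encoding"],
        "binary_size": int(msg["X-Binary-Size"]),
        "binary_id": int(msg["X-Binary-ID"]),
        "md5_base64": msg["Content-MD5"],
    }


fields = read_binary_header(HEADER)
for name, value in fields.items():
    print(name, "=", value)
```

This reads only the text header; decoding the hexadecimal body and checking it against the Content-MD5 value is left out because it depends on details of the encoding not shown on the slide.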
And here is a clear-text description from the dictionary of the way to read and write components of the file.

save__array_data.data
_item_description.description
;
The value of _array_data.data contains the array data encapsulated in a STAR string.

The representation used is a variant on the Multipurpose Internet Mail Extensions (MIME) specified in RFC 2045-2049 by N. Freed et al. The boundary delimiter used in writing an imgCIF or CBF is "--CIF-BINARY-FORMAT-SECTION--" (including the required initial "--").

The Content-Type may be any of the discrete types permitted in RFC 2045; "application/octet-stream" is recommended. If an octet stream was compressed, the compression should be specified by the parameter 'conversions="x-CBF_PACKED"' or the parameter 'conversions="x-CBF_CANONICAL"'.
. . .
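The boundary delimiter described here is what lets a reader find the binary section inside the text field. Below is a rough Python sketch of that step under two assumptions: the closing line is the delimiter with two extra hyphens (the usual MIME convention, and the form that appears in the dictionary's data-array example), and a blank line separates header from body as in MIME. The function `split_section` and the sample content are hypothetical.

```python
OPEN = "--CIF-BINARY-FORMAT-SECTION--"
CLOSE = OPEN + "--"  # assumed closing form, following MIME practice


def split_section(field_text):
    """Split one text field into (header_lines, body_lines).

    Raises ValueError if either delimiter line is missing.
    """
    lines = [line.rstrip() for line in field_text.splitlines()]
    start = lines.index(OPEN)
    end = lines.index(CLOSE, start + 1)
    inner = lines[start + 1:end]
    # Assumed MIME rule: the first blank line ends the header.
    if "" in inner:
        blank = inner.index("")
        return inner[:blank], inner[blank + 1:]
    return inner, []


# Made-up sample; "PLACEHOLDER" stands in for the encoded image data.
sample = "\n".join([
    OPEN,
    "Content-Type: application/octet-stream",
    "",
    "PLACEHOLDER",
    CLOSE,
])
header, body = split_section(sample)
print(header)
print(body)
```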
Experimental details are saved, e.g. the beam-collimation method:

save__diffrn_radiation.collimation
    _item_description.description
;
    The collimation or focusing applied to the radiation.
;
    _item.name                  '_diffrn_radiation.collimation'
    _item.category_id           diffrn_radiation
    _item.mandatory_code        no
    _item_aliases.alias_name    '_diffrn_radiation_collimation'
    _item_aliases.dictionary    cif_core.dic
    _item_aliases.version       2.0.1
    _item_type.code             text
    loop_
    _item_examples.case         '0.3 mm double-pinhole'
                                '0.5 mm'
                                'focusing mirrors'
save_
Or the divergence of the x-ray beam:

save__diffrn_radiation.div_y_source
    _item_description.description
;
    Beam crossfire in degrees parallel to the laboratory Y axis (see AXIS category).

    This is a characteristic of the xray beam as it illuminates the sample (or specimen) after all monochromation and collimation.

    This is the esd of the directions of photons in the Y-Z plane around the mean source beam direction.

    Note that some synchrotrons specify this value in milliradians, in which case a conversion would be needed. To go from a value in milliradians to a value in degrees, multiply by 0.180 and divide by Pi.
;
    _item.name                  '_diffrn_radiation.div_y_source'
    _item.category_id           diffrn_radiation
    _item.mandatory_code        no
    _item_type.code             float
    _item_units.code            degrees
    _item_default.value         0.0
save_
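The unit conversion quoted in this definition is simple enough to write out. As a small illustration in Python (the function name `mrad_to_degrees` is made up here), multiplying by 0.180 and dividing by Pi is the same as converting milliradians to radians (factor 0.001) and then radians to degrees (factor 180/Pi):

```python
import math


def mrad_to_degrees(mrad):
    """Convert a beam divergence from milliradians to degrees.

    Follows the dictionary text: multiply by 0.180 and divide by Pi.
    """
    return mrad * 0.180 / math.pi


# A crossfire of 1 mrad comes out at roughly 0.0573 degrees.
print(mrad_to_degrees(1.0))
```

A value converted this way is in the units that _diffrn_radiation.div_y_source expects (degrees).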
At the BNL PXRR we have substantial infrastructure assembled to create the information for the header, and then to use it to create a final report.
When data are taken, the data-collection system is linked to the group and its project. All of this, plus the experimental parameters, goes into the image headers.
A plan to complete the project:
• A stumbling block has been getting all of the data-reduction software writers to accept and use the standard.
• The plan is to get sets of a) beamline guy, b) hardware vendor, and c) software person to collaborate to get the system working, one facility at a time.
• Begin to develop the habit of carrying metadata with intermediate results.
• Ultimately create the PDB report, nearly complete, from parameters that started in the imgCIF.