PepcDB Reporting at CESG: More Trials and Fewer Tribulations
PPCW Bottlenecks Meeting, 20 March 2007
Craig A. Bingman, cbingman@biochem.wisc.edu
(U54 GM074901-01 P50 GM064598 JLM, P.I.)
CESG Bioinformatics
• George N. Phillips Jr.: Faculty Executive
• Craig Bingman: Section leader
• Xiaokang Pan: PepcDB, domains
• Gary Wesenberg: Scoring, RT, PDB
• Bryan Ramirez: System administrator
• Tony Kamenick: Assistant sysadmin
Sesame
• John L. Markley: CESG P.I.
• Zsolt Zolnai: Sesame Project Management
• John Primm: Project Manager
• David Aceti: QA, Sesame “Lab Master”
All CESG Team Members
TargetDB vs. PepcDB
• TargetDB was conceived early/pre-PSI-1 as a mechanism for avoiding duplication of effort between structural genomics centers.
• Asynchronous communication between centers and NIH.
• TargetDB communicates project status of target only.
• TargetDB is single-threaded.
• TargetDB was not meant to communicate information to the outside scientific community.
• PepcDB was conceived as a mechanism for communication of scientific details between centers and the outside world.
• Asynchronous communication with the outside world.
• PepcDB communicates target status, protocols and timeline of efforts.
• PepcDB is multi-threaded.
• PepcDB is a contractual obligation for all PSI-2 centers.
• Along with structures deposited in PDB, and the materials repository, PepcDB will be one of the enduring legacies of PSI.
CESG PepcDB, Past and Present
• Year 2-3 data
  • Successful implementation of Sesame (hierarchical relationships between db items).
  • TargetDB-centric, single-threaded view.
  • Targets were constrained to exist in one workgroup from selection to structure solution.
  • Protocols were primitive.
• Year 4-5 data
  • Protocols became more descriptive.
  • Protocols described multiple pipeline stages.
  • Targets moved through multiple workgroups.
  • Pipeline was assumed to move unidirectionally from Selection -> Deposition.
• PSI-2 data
  • Atomic protocols describing single pipeline stage.
  • Pipeline is multipass, multithreaded, characterized by extensive salvage.
  • Targets move back to vector selection, from initial selection, PCR and entry vector.
  • Pipeline is non-deterministic, adaptive, dynamic to maximize success.
Failure of CESG PepcDB, Mark 1
• Codebase had grown by accretion, not design.
• Code assumed linear, forward progression through pipeline stages.
• More than half of the code was devoted to data entry error trapping/handling.
• Global reset was required to handle new pipeline practices, dominated by multipath cloning strategy, multipath expression strategy, salvage intensive operation.
• New conceptualization of our PepcDB reporting was required.
• Core concept: Well-formed PepcDB = finite, directed, acyclic graph.
  • Database items = nodes
  • Directed links = edges
• Data in Sesame needed to be corrected.
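The "finite, directed, acyclic graph" criterion can be checked mechanically. The following is a minimal sketch, not CESG's code: it assumes the links between database items are available as (parent, child) pairs, and the function name is_acyclic and the item names in the usage lines are hypothetical. It uses Kahn's algorithm, a standard way to test a directed graph for cycles.

```python
from collections import defaultdict, deque

def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) pairs has no cycle."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Kahn's algorithm: repeatedly remove nodes that have no incoming links.
    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Any node never removed sits on, or downstream of, a cycle.
    return removed == len(nodes)

# Hypothetical item names, for illustration only.
forward_only = [("target", "clone"), ("clone", "expression")]
loops_back = forward_only + [("expression", "target")]
print(is_acyclic(forward_only), is_acyclic(loops_back))
```

A salvage pathway that sends a target back to an earlier stage stays acyclic as long as the repeat attempt is recorded as a new item rather than as a link back to an existing one.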
Visualization Tool for Graphs
• dot, a language for describing graphs
• dot has a very simple syntax:
    digraph G { A -> B -> D; A -> C; }
• dot has powerful layout minimizers to display hierarchical graphs.
• Implementations are available for perl, python, java, others.
• CESG has used the perl variant of dot/Graphviz to produce plots of linkages between database items.
digraph G { A -> B; A -> C -> D; }
digraph G { A -> B -> D; A -> C -> D; }
digraph G { A -> B -> D; D -> A; }
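Graph descriptions like these can be generated from a list of links instead of being written by hand. The sketch below is an illustration in Python, not the perl tooling CESG used; the function name to_dot is hypothetical, and it assumes node names contain no double quotes.

```python
def to_dot(edges, name="G"):
    """Render (parent, child) pairs as DOT source for a directed graph."""
    lines = [f"digraph {name} {{"]
    for parent, child in edges:
        # Quote identifiers so names with spaces or punctuation stay valid DOT.
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)

# One edge per statement; equivalent to chaining A -> C -> D on one line.
tree = [("A", "B"), ("A", "C"), ("C", "D")]
print(to_dot(tree))
```

Saving the printed text to a file such as graph.dot and running `dot -Tpng graph.dot -o graph.png` with Graphviz installed should produce the corresponding plot.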
CESG PepcDB Stats
• Protocols: 68
• Targets: 7553
• Trials: 14044
• Protocol Instances: 57195
• Each target has on average two trials.
• Each trial has on average about four protocols.
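The two averages follow directly from the counts. A short check, assuming "protocols per trial" means protocol instances divided by trials:

```python
# Counts as reported on the slide.
targets = 7553
trials = 14044
protocol_instances = 57195

trials_per_target = trials / targets
# Assumption: "protocols per trial" is protocol instances divided by trials.
protocols_per_trial = protocol_instances / trials

print(f"{trials_per_target:.2f} trials per target")
print(f"{protocols_per_trial:.2f} protocol instances per trial")
```

This gives roughly 1.86 trials per target and 4.07 protocol instances per trial, consistent with the rounded figures of two and four.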
PSI PepcDB Toolkit
• Project database capable of establishing hierarchical relationships between units of work.
• Establish master database that manages unique keys for work units.
• Implement barcodes (e.g. ZPL) that extend database to physical items.
• Implement atomic protocols and associated actions.
• Develop tool set for visualizing data.
• Develop code capable of assembling lists of parent-child units of work, protocols, actions.
• Rehearse data entry prior to pipeline implementation of new techniques.
• Reach project-wide agreement on definition of actions and how to link units of work.
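As one possible reading of "assembling lists of parent-child units of work", the sketch below walks the links downward from a single unit and lists everything reachable from it. It assumes links are stored as (parent, child) pairs; the function name descendants and the unit names are hypothetical and do not reflect the Sesame schema.

```python
from collections import defaultdict

def descendants(links, root):
    """List root and every unit of work reachable from it, in depth-first order."""
    children = defaultdict(list)
    for parent, child in links:
        children[parent].append(child)

    ordered, seen, stack = [], {root}, [root]
    while stack:
        node = stack.pop()
        ordered.append(node)
        # Reverse so children are visited in the order they were recorded.
        for child in reversed(children[node]):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return ordered

# Hypothetical unit names, for illustration only.
links = [("target", "trial-1"), ("target", "trial-2"), ("trial-1", "clone-1")]
print(descendants(links, "target"))
```

The seen set keeps a unit that is reachable along two paths from being listed twice, and it also guarantees termination if bad data introduces a cycle.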
Future
• Push towards zero errors in PSI-2 PepcDB.
• Continue correcting PSI-1 data.
• Implement data visualization tools in Sesame.
• Expand the scope of data reported to PepcDB.
  • Report all crystallization trials (year 5 to now).
  • Consolidate and report data for new tags (elemental analysis, mass, etc.).
• Switch over to Sesame for PepcDB report generation.