Curation of Chemistry Data from the Laboratory to Publication. Jeremy Frey & Simon Coles, School of Chemistry, University of Southampton. The CombeChem Project. End-to-end linking of data and information. Laboratory to publication and back again.
Curation of Chemistry Data from the Laboratory to Publication Jeremy Frey & Simon Coles School of Chemistry University of Southampton
The CombeChem Project
• End-to-end linking of data and information
• Laboratory to publication and back again
• Very long data chains can be involved, e.g. from a chemistry lab to mouse genetic expression
• The exponential world of combinatorial synthesis and high-throughput analysis meets the exponentially growing power of computing
• “Automation, Semantics & the Grid”
Data Curation Workshop
Goal: not just one laboratory but many co-laboratories working together. [Diagram labels: Smart Laboratory, Smart HCI, Knowledge, Literature, Report, Plan & COSHH, Information Integration, Digital Model, Analysis, Synthesis, Smart Storage, Smart Dissemination]
Problems with ‘Small Laboratory’ Working Practice
“Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers – or if they did might not know what those numbers meant”
“Lost in some research assistant’s computer, the data are often irretrievable or an undecipherable string of digits”
“To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data”
“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”
‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
The concept of Publication@Source
• Trace all the way back from publication to the original data – provenance
• The data is the key - DataGrid
• Start as you mean to go on – ELNs are a necessity
• Curation of subsequently produced data
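The provenance bullet above can be sketched in a few lines of Python. This is only an illustration, assuming each item records the identifier of the item it was derived from; the identifiers and the `derived_from` field are hypothetical and are not taken from the CombeChem data model.

```python
# Hypothetical records: each item names the item it was derived from.
records = {
    "paper:fig2": {"derived_from": "analysis:fit-01"},
    "analysis:fit-01": {"derived_from": "processed:spectrum-17"},
    "processed:spectrum-17": {"derived_from": "raw:spectrum-17"},
    "raw:spectrum-17": {"derived_from": None},  # original laboratory data
}


def trace_provenance(item_id, records):
    """Follow derived_from links from a published item back to the raw data."""
    chain = []
    seen = set()
    current = item_id
    while current is not None:
        if current in seen:
            raise ValueError(f"provenance loop at {current}")
        if current not in records:
            raise KeyError(f"broken provenance chain: {current} is not recorded")
        seen.add(current)
        chain.append(current)
        current = records[current]["derived_from"]
    return chain


chain = trace_provenance("paper:fig2", records)
print(" <- ".join(chain))
```

A missing record raises an error rather than silently ending the chain, which is the failure the slide warns about: data that cannot be traced back to its source.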
Observations are never collected on note pads, filter paper or other temporary paper for later transfer into a notebook. If you are caught using the “scrap of paper” technique, your improperly recorded data may be confiscated by your TA.
Lab books are a big block to publication@source: if it’s not digital, it is more difficult to share. Only some equipment is networked. This is where it all starts: The Lab & The Lab Book. Need a usable digital lab book. Design by analogy to help Chemists and Computer Scientists work together.
COSHH: leverage off things we already have to do
PLAN Process Record
getRecord(): there is a potential containment problem in pulling back partial RDF graphs from the triple store. This was solved by using multiple triple stores, but boundaries are a major issue for the future.
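The containment problem can be illustrated with a toy example. The sketch below is not the project's getRecord() implementation; `get_record`, the `max_depth` parameter and all triple names are hypothetical. It shows how the size of a "record" depends on where the traversal is told to stop.

```python
# Hypothetical triples (subject, predicate, object); names are illustrative only.
TRIPLES = [
    ("exp:1", "hasSample", "sample:A"),
    ("exp:1", "hasOperator", "person:jf"),
    ("sample:A", "hasCompound", "compound:X"),
    ("compound:X", "usedIn", "exp:2"),
    ("exp:2", "hasSample", "sample:B"),
]


def get_record(triples, root, max_depth):
    """Collect triples reachable from root, following links at most max_depth hops."""
    subjects = {s for s, _, _ in triples}
    record = []
    visited = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for s, p, o in triples:
            if s in frontier:
                record.append((s, p, o))
                # Follow the object only if it is itself described in the store.
                if o in subjects and o not in visited:
                    visited.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return record


shallow = get_record(TRIPLES, "exp:1", max_depth=1)
deep = get_record(TRIPLES, "exp:1", max_depth=4)
print(len(shallow), "triples at depth 1;", len(deep), "triples at depth 4")
```

With a shallow limit the record omits the compound; with a deep limit it pulls in a second experiment that merely shares the compound. Neither boundary is obviously right.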
Architecture [diagram labels: SURIG (×4); Data stores; “Client” Libraries; SOAP; Planner0; Semantic Data; PHP; Jena; Viewer0; Institutional archives and metadata publication; Bench Applications; Weights & Measures; Java; Other services]
The Analytical Laboratory
• Capture information from places you would not want to put your eyes
• Capture environmental data automatically
• Capture people and movements
• Provide this information in real time as well as for the laboratory record
Pub-Sub systems provide the flexible & extensible approach to distribution. [Diagram labels: Data Source (×2), BLOG, Message Broker, Translator Service, Mobile phone, Web Client, Archive Client, PDA]
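A minimal in-memory sketch of the publish-subscribe pattern follows. It is not the broker used in the project; the `MessageBroker` class, the topic name and the 25 °C alert threshold are all assumptions made for illustration. The point is that a data source publishes once and any number of clients (archive, web, phone) can be attached without changing the source.

```python
from collections import defaultdict


class MessageBroker:
    """Minimal in-memory broker: sources publish to topics, clients subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(topic, message)


archive = []   # stands in for the archive client
alerts = []    # stands in for a mobile phone or web client


def archive_client(topic, message):
    archive.append((topic, message))


def alert_client(topic, message):
    # Hypothetical threshold; the slides do not give one.
    if message["celsius"] > 25.0:
        alerts.append(f"{topic}: {message['celsius']} C")


broker = MessageBroker()
broker.subscribe("lab/temperature/room", archive_client)
broker.subscribe("lab/temperature/room", alert_client)

for reading in (21.5, 22.0, 27.3):
    broker.publish("lab/temperature/room", {"celsius": reading})

print(len(archive), "readings archived;", len(alerts), "alert(s)")
```

Adding a new consumer, such as a blog poster, is one more `subscribe` call; the data source code is untouched.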
Temperature – room, laser. Air conditioning failed. Door & interlock, motion sensors.
Databases - Our experience
• What do you do when the actual users keep changing their minds?
• Is a traditional relational database suitable?
• Danger of reinforcing scientific bias against relational databases for laboratory data.
• RDF & triple stores were again the solution
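One way to see why triples cope with changing requirements: a new kind of property is just another triple, with no schema migration. The sketch below uses a plain Python set as a toy triple store; the property names and values are invented for illustration and this is not the store used in the project.

```python
# A toy triple store: a set of (subject, predicate, object) tuples.
store = set()


def add(subject, predicate, obj):
    store.add((subject, predicate, obj))


def describe(subject):
    """Return every property recorded for a subject as a dict of lists."""
    properties = {}
    for s, p, o in sorted(store, key=str):
        if s == subject:
            properties.setdefault(p, []).append(o)
    return properties


add("compound:X", "meltingPoint_C", 122.4)
add("compound:X", "solvent", "ethanol")
# Later the users change their minds and want to record something new.
# No table has to be altered; the new property is simply added.
add("compound:X", "crystalColour", "pale yellow")

print(describe("compound:X"))
```

In a relational design the third `add` would typically require a new column or table before the data could be stored.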
RDF/RDFS: high-level schema for chemical properties
Triple Stores - The Heart of the Semantic Web. Scaling - 3Store response. Memory leak in testing program!
The Semantic Web! Scaling the triplestores
Moved from…
• A model of harvesting data from multiple sources into one scalable store
to
• A model of distributed RDF sources, caching what is needed for the task at hand into multiple fit-for-purpose stores
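The second model can be sketched as a small task-specific store that pulls triples from distributed sources only when asked and keeps what it has fetched. Everything here is an assumption for illustration: the `CachingStore` class, the two in-memory "sources" and the triple names are hypothetical, and real sources would be remote RDF endpoints.

```python
class CachingStore:
    """Task-specific store that fetches from remote sources only when needed."""

    def __init__(self, sources):
        self._sources = sources      # name -> callable(subject) -> list of triples
        self._cache = {}             # subject -> list of triples
        self.fetch_count = 0         # number of calls made to sources

    def triples_for(self, subject):
        if subject not in self._cache:
            found = []
            for fetch in self._sources.values():
                self.fetch_count += 1
                found.extend(fetch(subject))
            self._cache[subject] = found
        return self._cache[subject]


# Two hypothetical RDF sources, modelled as in-memory lists.
CRYSTAL_SOURCE = [("compound:X", "unitCell", "cell:1")]
SPECTRA_SOURCE = [("compound:X", "hasSpectrum", "spectrum:17"),
                  ("compound:Y", "hasSpectrum", "spectrum:18")]


def make_source(triples):
    return lambda subject: [t for t in triples if t[0] == subject]


store = CachingStore({
    "crystals": make_source(CRYSTAL_SOURCE),
    "spectra": make_source(SPECTRA_SOURCE),
})

first = store.triples_for("compound:X")
second = store.triples_for("compound:X")   # served from the cache
print(len(first), "triples;", store.fetch_count, "source calls")
```

The store only ever holds triples for subjects the task has touched, so it stays small even when the sources are large.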
Experiments on the Grid: The NCS Service HTTPS
ADS Binary raw data archived in Atlas Datastore x300 £’s
A Data-Rich Subject – the Crystallography Problem 1.5,000,000 30,000,000 450,000
The eCrystals Digital Repository http://ecrystals.chem.soton.ac.uk
Access to the underlying data
The eCrystals ‘Global’ Model [diagram labels: Data analysis, transformation, mining, modelling; Presentation services / portals; Data discovery, linking, citation; Publishers: peer-review journals, conference proceedings, etc.; Aggregator services; Publication; Laboratory repository; Deposit; Validation; Institutional data repositories; Search, harvest; Validation; Preservation and curation; Deposit]
Laboratory Repositories and Information Management
Need for a data archive in the laboratory. Not just the published spectra!
The R4L Repository: create new compound; add experiment data and metadata; deposit; search / browse
Several groups making and analysing; the library. Administrative Domains: transfer or share the data. [Diagram labels: National Archive, Research Group, Researcher, International Database, Research Group, Institution]
Paper organized using RDF. SVG “active” graphics. Link to data, follow links back to the raw data archive R4L. Link to simulation, full simulation data archived in BioSimGrid.
Summary:
• Making sure other people can find, understand and re-use your data easily and with confidence (even when there is a huge amount of it!)
• Make use of Plans to inform the digital context - metadata in advance
• Have concern for the “End-to-End life cycle” of chemistry information from the start
• Understanding Usability and Human Computer Interaction is vital for adoption