Or How on Earth do we Store this Stuff?
Smart Storage for Physical Properties
Kieron Taylor with Jeremy Frey and Jonathan Essex

What makes up chemical data?
• Numbers - big, small, precise and vague
• Circumstances - How hot? What pressure?
• Assumptions
  • This is pretty pure, let's say it's pure
  • Standard conditions? More or less
  • That peak on the spectrum isn't important
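
The three ingredients above (a number, its circumstances, its assumptions) can be sketched as a single record type. This is a minimal illustration, not the authors' data model: the class name `Measurement`, its field names, and the 25 C temperature condition are all assumptions made for the example. The solubility value and error are taken from the cyclohexanone table in these slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    """One reported value plus the context needed to interpret it (hypothetical layout)."""
    prop: str                          # which property, e.g. "solubility"
    value: float                       # the number itself
    units: str
    error: Optional[float] = None      # None means no uncertainty was reported ("vague")
    conditions: Tuple[tuple, ...] = () # (name, value, units) entries: how hot, what pressure
    assumptions: Tuple[str, ...] = ()  # free-text statements made by whoever recorded it

    def interval(self):
        """Return (low, high) bounds, or None when the uncertainty is unknown."""
        if self.error is None:
            return None
        return (self.value - self.error, self.value + self.error)


# Value and error are from the slides; the temperature condition is invented.
solubility = Measurement(
    prop="solubility",
    value=2500.0,
    units="mg/L",
    error=50.0,
    conditions=(("temperature", 25.0, "C"),),
    assumptions=("sample treated as pure", "standard conditions, more or less"),
)

print(solubility.interval())
```

Keeping the assumptions as explicit fields, rather than dropping them, is what lets a later step decide whether two values are comparable.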

Using the Data: QSPR (quantitative structure-property relationships)
• Take lots of data
• Magical statistics occur
• Validate results
• Predictive model
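
As a rough sketch of that pipeline, the snippet below fits a one-descriptor linear model and checks it against held-out points. Everything here is an assumption for illustration: the data are synthetic (generated from a known straight line, not real measurements), ordinary least squares stands in for the "magical statistics", and a real QSPR study would use many descriptors and a more careful validation scheme.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the given points."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# 1. Take lots of data: SYNTHETIC toy values, property = 2 * descriptor + 1
#    with a small deterministic wobble of +/-0.1 standing in for noise.
descriptors = [float(i) for i in range(10)]
properties = [2.0 * x + 1.0 + (0.1 if i % 2 else -0.1)
              for i, x in enumerate(descriptors)]

# Hold out every third point so validation uses data the fit never saw.
holdout_idx = [i for i in range(10) if i % 3 == 0]
train_idx = [i for i in range(10) if i % 3 != 0]

# 2. Statistics: fit on the training points only.
slope, intercept = fit_line([descriptors[i] for i in train_idx],
                            [properties[i] for i in train_idx])

# 3. Validate: error on the held-out points.
validation_rmse = rmse([descriptors[i] for i in holdout_idx],
                       [properties[i] for i in holdout_idx],
                       slope, intercept)

# 4. Predictive model: use the fitted line on a new descriptor value.
predicted = slope * 12.0 + intercept
print(slope, intercept, validation_rmse, predicted)
```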

So What is Real Data like?
• Bad - take the commercial Physprop Database
• Can we handle these melting points?

Let's Make a Database
• One data source is not enough
• Good(?) data isn't free
• Different sources have varied styles of content
• Most database software is not suited to data mining
• We cannot simply plumb these varied sources for data; we must reconcile them to make sensible statistics

Relational Design
For one molecule: Cyclohexanone

Property       Value   Error   Units  Source       Method        Author  Note
Solubility     2500    ±50     mg/L   Physprop     Laboratory    ...
               2650    ±60     mg/L   Southampton  Simulation    Me      Superseded
               2599    ±25     mg/L   Southampton  Simulation B  Me
Melting point  -31     ±0.1    °C     Detherm      Laboratory    ...
Boiling point  155.4   ±0.5    °C     Merck Index  Laboratory    ...

Decomposing

Property       Value   Units
Solubility     2500    mg/L
Melting point  -31     °C
Boiling point  155.4   °C

Property       Value   Error   Units  Source
Solubility     2500    ±50     mg/L   Physprop
               2650    ±60     mg/L   Our lab
Melting point  -31     ±0.1    °C     Detherm
Boiling point  155.4   ±0.5    °C     Merck Index

Property       Value   Error   Units  Source       Method      Author
Solubility     2500    ±50     mg/L   Physprop     Laboratory  ...
               2650    ±60     mg/L   Southampton  Simulation  Me
Melting point  -31     ±0.1    °C     Detherm      Laboratory  ...
Boiling point  155.4   ±0.5    °C     Merck Index  Laboratory  ...

Arbitrary numbers of points are hard to store in relational databases.
We're not done yet: we still have to account for multiple experimental conditions, statements of validity and molecules.
Provenance = senary relational model?
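
To make the relational difficulty concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are assumptions, not the authors' schema, and the Author column is left out because the slides elide most of its values. The row values are copied from the cyclohexanone table. The second table hints at the problem: each measurement can carry any number of experimental conditions, so every new kind of context means another table and another join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    id       INTEGER PRIMARY KEY,
    molecule TEXT NOT NULL,
    property TEXT NOT NULL,
    value    REAL NOT NULL,
    error    REAL,
    units    TEXT,
    source   TEXT,
    method   TEXT,
    note     TEXT
);
-- A measurement may have zero or many conditions, so they cannot be
-- columns of the table above; they need a table of their own.
CREATE TABLE condition (
    measurement_id INTEGER REFERENCES measurement(id),
    name  TEXT,
    value REAL,
    units TEXT
);
""")

# Rows taken from the slide's table for cyclohexanone.
rows = [
    ("cyclohexanone", "solubility", 2500, 50, "mg/L", "Physprop", "Laboratory", None),
    ("cyclohexanone", "solubility", 2650, 60, "mg/L", "Southampton", "Simulation", "Superseded"),
    ("cyclohexanone", "solubility", 2599, 25, "mg/L", "Southampton", "Simulation B", None),
    ("cyclohexanone", "melting point", -31, 0.1, "C", "Detherm", "Laboratory", None),
    ("cyclohexanone", "boiling point", 155.4, 0.5, "C", "Merck Index", "Laboratory", None),
]
conn.executemany(
    "INSERT INTO measurement "
    "(molecule, property, value, error, units, source, method, note) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Solubility values that have not been marked with a note such as "Superseded".
current_solubility = conn.execute(
    "SELECT value, source FROM measurement "
    "WHERE property = 'solubility' AND note IS NULL "
    "ORDER BY value"
).fetchall()
print(current_solubility)
```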

RDF Triplestore is the Solution
• RDF describes trees and networks of entities
• Data of this complexity lends itself well to a tree representation
• RDF trees enable additional clever things
• Triplestores provide persistent RDF models
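
The tree idea can be illustrated without any RDF library: a nested record flattens into (subject, predicate, object) triples, with each nested branch becoming its own node. This is a sketch only. The predicate names (`hasMeasurement`, `provenance`, and so on), the `mol:` prefix, and the helper `to_triples` are hypothetical and are not the authors' vocabulary; the values come from the cyclohexanone table.

```python
def to_triples(subject, tree, counter=None):
    """Flatten a nested dict into (subject, predicate, object) triples.

    Nested dicts become anonymous nodes named _:b0, _:b1, ... in the
    order they are encountered; lists yield one triple per element.
    """
    if counter is None:
        counter = [0]  # shared mutable counter for naming anonymous nodes
    triples = []
    for predicate, obj in tree.items():
        items = obj if isinstance(obj, list) else [obj]
        for item in items:
            if isinstance(item, dict):
                node = f"_:b{counter[0]}"
                counter[0] += 1
                triples.append((subject, predicate, node))
                triples.extend(to_triples(node, item, counter))
            else:
                triples.append((subject, predicate, item))
    return triples


# One molecule, any number of measurements, each with its own provenance branch.
molecule = {
    "name": "cyclohexanone",
    "hasMeasurement": [
        {"property": "solubility", "value": 2500, "error": 50, "units": "mg/L",
         "provenance": {"source": "Physprop", "method": "Laboratory"}},
        {"property": "melting point", "value": -31, "error": 0.1, "units": "C",
         "provenance": {"source": "Detherm", "method": "Laboratory"}},
    ],
}

triples = to_triples("mol:cyclohexanone", molecule)
for triple in triples:
    print(triple)
```

Adding a third measurement, or a new kind of branch such as experimental conditions, only adds more triples; nothing like a table definition has to change.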

What can we do with this?
• Store almost any chemical data as normal
• Track the where, when and how of each and every data point
• Filter values by whether they are real or simulated, old or new, from a particular source, or produced by a particular person
• Bolt on RDF schemas such as FOAF and our units system
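
Filtering by provenance amounts to pattern matching over triples. The sketch below uses a plain Python list in place of a triplestore; a real store would answer the same questions through its own query language. The node names (`m1`, `person:me`), the predicates other than `foaf:name`, and the helpers `match` and `values_where` are assumptions for illustration. The values are the three solubility entries from the cyclohexanone table.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def values_where(triples, **criteria):
    """Sorted values of every measurement node satisfying all criteria."""
    subjects = {t[0] for t in match(triples, p="value")}
    for predicate, wanted in criteria.items():
        subjects &= {t[0] for t in match(triples, p=predicate, o=wanted)}
    return sorted(t[2] for t in match(triples, p="value") if t[0] in subjects)


triples = [
    ("m1", "property", "solubility"), ("m1", "value", 2500),
    ("m1", "source", "Physprop"), ("m1", "method", "Laboratory"),

    ("m2", "property", "solubility"), ("m2", "value", 2650),
    ("m2", "source", "Southampton"), ("m2", "method", "Simulation"),
    ("m2", "author", "person:me"),

    ("m3", "property", "solubility"), ("m3", "value", 2599),
    ("m3", "source", "Southampton"), ("m3", "method", "Simulation B"),
    ("m3", "author", "person:me"),

    # A bolted-on FOAF-style description of the author node.
    ("person:me", "foaf:name", "Me"),
]

lab_values = values_where(triples, method="Laboratory")
southampton_values = values_where(triples, source="Southampton")
my_values = values_where(triples, author="person:me", property="solubility")
print(lab_values, southampton_values, my_values)
```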

What have we done with this?
http://green.chem.soton.ac.uk/triangle/query.html

Thanks to:
• AKT and Steve Harris for 3store
• Rob Gledhill for web tech and discussion
• Perl for s/ / /g