Project Prospect and the Semantic Web. Colin Batchelor Royal Society of Chemistry, Cambridge, UK batchelorc@rsc.org. Project Prospect and the Semantic Web. Who we are What we’ve done Motivation Means The InChI and the Semantic Web Ontology development for chemistry
Project Prospect and the Semantic Web
Colin Batchelor
Royal Society of Chemistry, Cambridge, UK
batchelorc@rsc.org
Project Prospect and the Semantic Web
• Who we are
• What we’ve done
• Motivation
• Means
• The InChI and the Semantic Web
• Ontology development for chemistry
• RXNO and MOP
Royal Society of Chemistry: Advancing the Chemical Sciences
• Learned and professional society
• Scientific publisher
• 25 journals, 8 databases and a growing book program
• 8000 articles yearly
• Covering a broad spectrum of chemical sciences from systems biology (Molecular BioSystems) to physical and theoretical chemistry (PCCP)
The motivation
• Scientific papers are formulaic and consistently structured (but not necessarily IMRD: see later)
• There may be infinitely many possible chemical compounds
BUT
• Nomenclature is productive and susceptible to machine parsing
[Workflow diagram. Labels: Data capture; Editing and proof-reading; Enhanced HTML; Database; Text mining (Oscar); Manual QA; Enhanced RSS]
Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
• Brand names
• Grinding
• Figure–ground
• Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
Regular polysemy
Brand names
• “Learning to buy a Renault and talk to BMW”
Grinding
• “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”
vs.
• “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”
Regular polysemy
Figure–ground
• Audrey Hepburn painted the door (figure)
• Audrey Hepburn walked through the door (ground)
• The Incredible Hulk walked through the door (ambiguous)
Can ChEBI handle this?
• Imidazoles (!) (CHEBI:24780)
• Imidazole (CHEBI:16069)
• Imidazole ring: not yet
• Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues
Disambiguation
One Sense per Discourse (Gale et al. 1992)
… this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993)
… matches our intuitions
Disambiguation: toy model
CLASS:
• w(–1) = a, an, the, this
• w(0) plural (bit of a cheat, as not a collocation)
PART:
• w(–1) = bridging, terminal
• w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
• w(+1)w(+2) = “building block”, “protecting group”, “side chain”
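The toy model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Project Prospect implementation: the cue lists are copied from the slide (which notes that the w(+1) list is longer), the function name `classify` is hypothetical, PART cues are checked before CLASS cues because the slide gives no ordering, and "plural" is approximated by a trailing "s".

```python
# Cue words copied from the slide; the real w(+1) list has "many more" entries.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i].

    Returns "PART", "CLASS" or "UNKNOWN" using only the neighbouring words.
    """
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""

    # Assumption: PART cues win over CLASS cues when both are present.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" stands in for a proper plural test.
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "UNKNOWN"


if __name__ == "__main__":
    print(classify("the imidazole core".split(), 1))
    print(classify("imidazoles are weak bases".split(), 0))
```

Checking PART first means "the imidazole core" is read as a part even though "the" is a CLASS cue; a real system would need to weigh conflicting cues rather than rank them.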
Why is this hard?
• Coordination resolution
• Part of speech ambiguity: “tosylates”, noun or verb?
Why is this hard?
How many numbered compounds are actually named in a given paper?
• iloprost (1)
• tributyl-1-hexynylstannane (2)
• the desired 2-heptyne (3)
• methyl–Pd(II) iodide 4 or 4′
• alkynylstannane 5
• the hypervalent stannate 6
• (alkynyl)(methyl)Pd(II) complex 7
• the desired methylalkyne 8
• compounds 9–14
• the stannyl precursors 15 and 16
• methylated compounds 17 and 18
• stannyl precursor 19
• iloprost methyl ester 20
“iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!
Why is this hard?
For compound names:
• ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
• ~20% PubChem
• ~20% ChemDraw
For compound numbers:
• ~70% author ChemDraw
• ~30% editors
What are we marking up?
• Chemical compounds (InChI, ChEBI)
• Chemical classes and parts (ChEBI)
• Nanoparticles (in ChEBI from end of October)
• Chemical terms from the IUPAC Gold Book
• Name reactions (RXNO)
• Gene products: function, process, location (GO)
• Nucleotide and polypeptide sequence terms (SO)
• Cell types (CL)
What InChI is for
• Can represent complete molecules (may be ions or radicals) of fewer than 1024 heavy (non-H) atoms.
However:
• Cannot yet represent metal atom geometry.
• Cannot yet represent polymers.
• Cannot yet represent diradicals etc.
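As a rough illustration of the heavy-atom limit quoted above, the sketch below counts non-hydrogen atoms from the formula layer of an InChI string (the layer immediately after the version prefix). The function names are hypothetical, the 1024 figure is taken from the slide rather than from the InChI software itself, and the parser assumes the formula layer is present and handles only simple dot-separated components with optional multipliers.

```python
import re

# One element symbol followed by an optional count, e.g. "C15" or "O".
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
# One formula component with an optional leading multiplier, e.g. "2ClH".
COMPONENT = re.compile(r"(\d*)((?:[A-Z][a-z]?\d*)+)")

HEAVY_ATOM_LIMIT = 1024  # figure quoted on the slide


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms in the formula layer of an InChI string."""
    layers = inchi.split("/")
    if len(layers) < 2:
        raise ValueError("no formula layer found")
    total = 0
    for component in layers[1].split("."):
        match = COMPONENT.fullmatch(component)
        if match is None:
            raise ValueError(f"unrecognised formula component: {component!r}")
        multiplier = int(match.group(1) or 1)
        for symbol, count in ELEMENT.findall(match.group(2)):
            if symbol != "H":
                total += multiplier * int(count or 1)
    return total


def within_limit(inchi):
    """True if the molecule has fewer heavy atoms than the quoted limit."""
    return heavy_atom_count(inchi) < HEAVY_ATOM_LIMIT


if __name__ == "__main__":
    example = "InChI=1/C15H22O9/c1-8(16)19-6-15/h11-13H"  # truncated, formula only matters
    print(heavy_atom_count(example), within_limit(example))
```

Only the formula layer is read, so the check says nothing about the other limitations listed on the slide (metal geometry, polymers, diradicals).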
What InChI is not for
• Classes of molecule
• Parts of molecule
(these have been done in ChemBlast)
InChI in RDF
(We don’t like this.)
We use the RSS content module. (As if articles contained molecules.)
And we use info:inchi URIs.
Look…
Some RDF
<content:items>
  <rdf:Bag>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
    </rdf:li>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
    </rdf:li>
  </rdf:Bag>
</content:items>
<content:items>
  <content:item>
    <owl:Class rdf:ID="GO_0016298">
      <rdfs:label>lipase activity</rdfs:label>
    </owl:Class>
  </content:item>
</content:items>
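A fragment shaped like the first `content:items` block above could be generated along the following lines. This is a minimal sketch using Python's standard `xml.etree.ElementTree`, not the RSC's production code: the function names are hypothetical, the namespace URIs are the standard ones for RDF and the RSS 1.0 content module, and the InChI is appended to `info:inchi/` without any escaping, mirroring the slide.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT = "http://purl.org/rss/1.0/modules/content/"

# Make the serializer emit the familiar rdf: and content: prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("content", CONTENT)


def inchi_uri(inchi):
    """Build an info:inchi URI from an InChI string (no escaping applied)."""
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    return "info:inchi/" + inchi


def content_items(inchis):
    """Wrap each InChI in content:items / rdf:Bag / rdf:li / content:item."""
    items = ET.Element(f"{{{CONTENT}}}items")
    bag = ET.SubElement(items, f"{{{RDF}}}Bag")
    for inchi in inchis:
        li = ET.SubElement(bag, f"{{{RDF}}}li")
        ET.SubElement(li, f"{{{CONTENT}}}item",
                      {f"{{{RDF}}}about": inchi_uri(inchi)})
    return items


if __name__ == "__main__":
    root = content_items(["InChI=1/CH4/h1H4"])  # methane, as a short example
    print(ET.tostring(root, encoding="unicode"))
```

The serialized output carries explicit `xmlns` declarations on the root element, which the slide omits because the fragment sits inside a larger RSS document.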
RXNO
David Barden
Colin Batchelor
Celia Gitterman
RXNO: the name reaction ontology (1)
• Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin
• They’re pretty unambiguous and well-suited to logical definitions
• But what organizing principle do we use?
RXNO: the name reaction ontology (2)
• Sort reactions by what they do to the ‘skeleton’ of the molecule.
• Skeleton-changing reactions:
  • Joinings, cleavings, rearrangements, ring formation, ring expansion
• Skeleton-preserving reactions:
  • Additions, eliminations, substitutions, protections, deprotections
RXNO: the name reaction ontology (3)
• Quality? Subjectivity?
• Get our curators to assign reactions to categories without conferring, check percentage agreement, discuss disagreements, improve guidelines, iterate to convergence.
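The "check percentage agreement" step can be illustrated with a small sketch. The slide does not say how agreement was computed, so pairwise agreement over all curator pairs is an assumption here, and the curator names and category assignments below are invented purely for the example.

```python
from itertools import combinations


def percent_agreement(assignments):
    """Pairwise percentage agreement.

    assignments maps curator -> {reaction: category}. Only reactions
    categorised by every curator are compared.
    """
    curators = sorted(assignments)
    shared = set.intersection(*(set(assignments[c]) for c in curators))
    agree = total = 0
    for reaction in shared:
        for a, b in combinations(curators, 2):
            total += 1
            agree += assignments[a][reaction] == assignments[b][reaction]
    return 100.0 * agree / total if total else 0.0


def disagreements(assignments):
    """Reactions on which the curators did not all choose the same category."""
    curators = sorted(assignments)
    shared = set.intersection(*(set(assignments[c]) for c in curators))
    return sorted(r for r in shared
                  if len({assignments[c][r] for c in curators}) > 1)


# Invented data: three hypothetical curators, three reactions.
EXAMPLE = {
    "curator_a": {"Wittig reaction": "joining",
                  "Diels-Alder reaction": "ring formation",
                  "Claisen rearrangement": "rearrangement"},
    "curator_b": {"Wittig reaction": "joining",
                  "Diels-Alder reaction": "ring formation",
                  "Claisen rearrangement": "rearrangement"},
    "curator_c": {"Wittig reaction": "joining",
                  "Diels-Alder reaction": "joining",
                  "Claisen rearrangement": "rearrangement"},
}

if __name__ == "__main__":
    print(round(percent_agreement(EXAMPLE), 1))
    print(disagreements(EXAMPLE))
```

The list returned by `disagreements` is what would feed the "discuss disagreements, improve guidelines" step; raw percentage agreement does not correct for chance, so a chance-corrected statistic might be preferred in practice.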
The spectroscopist’s tale
The enriched html version came as something of a revelation and the current emphasis on links to, and through biomolecular terminology was very much a plus for us, since my colleagues and I are a mix of physical and biological chemists who are dabbling in inter-disciplinary waters. Given the steadily increasing burden of keeping up with the current literature and accessing earlier publications - a fortiori when conventional disciplinary boundaries are being crossed - the ability to 'grow a tree' from current articles (including one's own) is going to make 'targeted sleuthing' a great deal easier.
John Simons, Oxford
The high-throughput screener’s tale
An interesting opportunity particularly for managers, students and beginners that are not that deeply immersed in the detail and the terminology. It further opens access to those who want to explore areas they are not specialists in. Great idea!
Eberhard Krausz, MPI-CBG Dresden
Lastly…
“My only criticism would be the need for a time warning… I spent 4 hours digging about which generated at least six new research ideas printed half a ream of paper and I missed my bus home. At least it was a new excuse my wife had not heard, so another first.”
An analytical chemist, The North.