Chemical named entity recognition and literature mark-up. Colin Batchelor, Informatics Department, Royal Society of Chemistry, batchelorc@rsc.org
Overview • Project Prospect: what we find and how we find it. • RDF: how should we be disseminating it? • Next steps: basics for a chemical ontology.
Project Prospect: What do we find? • Chemical compounds • Chemical terms from the IUPAC Gold Book • Gene products: function, process, location • Nucleotide and polypeptide sequence terms • Cell types
Project Prospect: How do we find it? For compound names: • ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007) • ~20% PubChem • ~20% ChemDraw. For compound numbers: • ~70% author ChemDraw • ~30% editors
RDF: how we do it now • We use the Content module from RSS 1.0: http://web.resource.org/rss/1.0/modules/content • But in what sense does an article “contain” pyridine or base pairs? • We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
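To make the contrast concrete, here is a minimal sketch of what article-to-entity statements with an explicit predicate might look like, serialised as N-Triples. The `http://example.org/prospect#` namespace and the predicate name `mentions` are placeholders assumed for illustration, not an existing RSC vocabulary; the article and entity IRIs are the ones from the feed example in this talk.

```python
# Hypothetical predicate namespace, assumed for illustration only.
EX = "http://example.org/prospect#"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one triple whose three terms are all IRIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# IRIs taken from the RSS example in this talk.
article = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
entities = [
    "http://purl.org/obo/owl/SO#SO:0000028",
    "info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)"
    "8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1",
]

# One explicit "mentions" statement per entity instead of an untyped bag.
lines = [ntriple(article, EX + "mentions", entity) for entity in entities]

if __name__ == "__main__":
    print("\n".join(lines))
```

Unlike membership in a content bag, each line names the relationship it asserts, so “mentions” and “is_about” could be kept apart.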
RDF: what it looks like now
<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
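A consumer of such a feed could recover the entity identifiers per article roughly as follows. This is a sketch using only the Python standard library; it assumes the feed is wrapped in an `rdf:RDF` element that declares the standard RSS 1.0, RDF and Content-module namespaces (the slide omits the wrapper), and the title text is a placeholder.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RSS 1.0, RDF and the RSS 1.0 Content module.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

INCHI = (
    "info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)"
    "8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"
)

# Cut-down feed; the rdf:RDF wrapper and the title are assumptions.
FEED = f"""<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <title>Placeholder title</title>
    <content:items>
      <rdf:Bag>
        <rdf:li><content:item rdf:about="{INCHI}"/></rdf:li>
        <rdf:li><content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/></rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def entities_by_article(feed_xml: str) -> dict:
    """Map each item's rdf:about IRI to the IRIs of its content:item entries."""
    root = ET.fromstring(feed_xml)
    about = "{%s}about" % NS["rdf"]  # ElementTree spells qualified names as {uri}local
    result = {}
    for item in root.findall("rss:item", NS):
        result[item.get(about)] = [
            entry.get(about) for entry in item.findall(".//content:item", NS)
        ]
    return result


if __name__ == "__main__":
    for article, found in entities_by_article(FEED).items():
        print(article, found)
```

The output tells a reader which identifiers are attached to the article, but not how they relate to it, which is the gap the “is_about” and “mentions” predicates would fill.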
Basics for a chemical ontology • Unambiguous representation of objects of chemical discourse • Proper parthood relations
Basics for a chemical ontology: 1. Objects of chemical discourse. Must be able to represent and clearly distinguish: • Compounds • Classes of compound • Parts of molecules • Mixtures. Would be nice to have: • Disambiguation cues for the first three
Can ChEBI handle this? • Imidazoles (!) (CHEBI:24780) • Imidazole (CHEBI:16069) • Imidazole ring: not yet • Imidazolyl group: not yet (but methyl, benzyl, etc.) … and there are no disambiguation cues
Disambiguation • One Sense per Discourse (Gale et al. 1992) … this doesn’t hold at all • One Sense per Collocation (Yarowsky 1993) … matches our intuitions
Disambiguation: What a one-sense-per-collocation feature set might look like. CLASS: • w(–1) = a, an, the, this • w(0) plural (bit of a cheat, as not a collocation). PART: • w(–1) = bridging, terminal • w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more) • w(+1)w(+2) = “building block”, “protecting group”, “side chain”
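The feature set above can be read as a small rule-based tagger. The sketch below is one possible reading, not the method used in Project Prospect: the word lists are only the examples given on the slide, the plural test is a crude "ends in s" check, and letting PART cues take precedence over CLASS cues is an assumption.

```python
# Cue words copied from the slide; the real lists are longer ("and many more").
CLASS_BEFORE = {"a", "an", "the", "this"}
PART_BEFORE = {"bridging", "terminal"}
PART_AFTER = {"backbone", "bridge", "chain", "core", "dyad",
              "fluorophore", "fragment", "framework"}
PART_AFTER_BIGRAMS = {("building", "block"), ("protecting", "group"),
                      ("side", "chain")}


def collocation_sense(tokens, i):
    """Guess whether the chemical name at tokens[i] denotes a PART or a CLASS.

    Returns "PART", "CLASS", or None when no cue fires.
    """
    words = [t.lower() for t in tokens]
    prev_word = words[i - 1] if i > 0 else None                # w(-1)
    next_word = words[i + 1] if i + 1 < len(words) else None   # w(+1)
    next_bigram = tuple(words[i + 1:i + 3])                    # w(+1)w(+2)

    # Assumption: PART cues win when both kinds of cue are present,
    # e.g. "the imidazole core".
    if (prev_word in PART_BEFORE or next_word in PART_AFTER
            or next_bigram in PART_AFTER_BIGRAMS):
        return "PART"
    # Crude plural test standing in for the slide's "w(0) plural" feature.
    if prev_word in CLASS_BEFORE or words[i].endswith("s"):
        return "CLASS"
    return None


if __name__ == "__main__":
    print(collocation_sense("the imidazole core".split(), 1))
    print(collocation_sense("an imidazole".split(), 1))
    print(collocation_sense("imidazole was added".split(), 0))
```

A name with no cue on either side is left untagged, which in practice would fall back to the compound reading or to another classifier.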
Basics for a chemical ontology: 2. Parthood relations. Parthood in ChEBI means at least three things: • Is necessarily chemically part of: carbonyl group part_of carbonyl compounds
Basics for a chemical ontology: 2. Parthood relations • Is possibly chemically part of: lead(2+) part_of lead diacetate (most lead(2+) isn’t); electron part_of muonium (!)
Basics for a chemical ontology: 2. Parthood relations • Is part of a mixture: kanamycin A part_of kanamycin
Basics for a chemical ontology: 2. Parthood relations. Solution 1: define relationships according to the pattern “all instances of X have a relationship with some Y” (Smith et al., “Relations in biomedical ontologies”, 2005). • Carbonyl compound has_part carbonyl group • Lead diacetate has_part lead(2+) (?!) • Muonium has_part electron • Kanamycin has_part kanamycin A (?!)
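The all-some pattern can be checked mechanically on toy data. The sketch below is an illustration only: the instance names (`lda_1`, `pb_1`, `pb_2`) are invented, and the instance-level pairs are made up to mirror the lead example. It shows why the direction of the class-level relation matters: every lead diacetate instance has some lead(2+) as a part, but not every lead(2+) ion is part of some lead diacetate.

```python
def all_some(instances_of_x, related, instances_of_y):
    """True iff every instance of X is related to at least one instance of Y.

    `related` is a set of (x, y) pairs holding between individual instances.
    """
    return all(
        any((x, y) in related for y in instances_of_y)
        for x in instances_of_x
    )


# Invented instances: one lead diacetate entity and two lead(2+) ions,
# only one of which sits in lead diacetate.
lead_diacetates = {"lda_1"}
lead_ions = {"pb_1", "pb_2"}

has_part_pairs = {("lda_1", "pb_1")}
part_of_pairs = {(part, whole) for (whole, part) in has_part_pairs}

# Class-level "lead diacetate has_part lead(2+)" holds on this data ...
diacetate_has_lead = all_some(lead_diacetates, has_part_pairs, lead_ions)
# ... but class-level "lead(2+) part_of lead diacetate" does not.
lead_part_of_diacetate = all_some(lead_ions, part_of_pairs, lead_diacetates)

if __name__ == "__main__":
    print(diacetate_has_lead, lead_part_of_diacetate)
```

The check says nothing about whether `has_part` is the right relation for an ion in a salt or a component in a mixture, which is what the “(?!)” marks on the slide are flagging.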
Basics for a chemical ontology: 2. Parthood relations. Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships. • Carbonyl compound molecule has_part carbonyl substituent • Muonium atom has_part electron • Kanamycin has_component kanamycin A • Lead diacetate has_component lead(2+) (?!)
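One way to picture Solution 2 is to tag each term with the level it lives at and let each relation accept only subjects of the matching level. The sketch below is a discussion aid under stated assumptions, not a proposal from the talk: the `Level` enum, the `Term` class, the `relate` function, the level assigned to each example term, and the choice to check only the subject's level are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MOLECULAR = "molecular"
    SAMPLE = "sample"


# Assumption: has_part is molecular-level, has_component is sample-level.
RELATION_LEVEL = {
    "has_part": Level.MOLECULAR,
    "has_component": Level.SAMPLE,
}


@dataclass(frozen=True)
class Term:
    name: str
    level: Level


def relate(subject: Term, relation: str, obj: Term) -> tuple:
    """Return a (subject, relation, object) triple, or raise ValueError if
    the relation is not allowed for the subject's level."""
    if RELATION_LEVEL[relation] is not subject.level:
        raise ValueError(
            f"{relation} expects a {RELATION_LEVEL[relation].value}-level "
            f"subject, but {subject.name!r} is {subject.level.value}-level"
        )
    return (subject.name, relation, obj.name)


# Example terms from the slide; the level given to each is an assumption.
muonium_atom = Term("muonium atom", Level.MOLECULAR)
electron = Term("electron", Level.MOLECULAR)
kanamycin = Term("kanamycin", Level.SAMPLE)
kanamycin_a = Term("kanamycin A", Level.SAMPLE)

if __name__ == "__main__":
    print(relate(muonium_atom, "has_part", electron))
    print(relate(kanamycin, "has_component", kanamycin_a))
```

With this split, "kanamycin has_part kanamycin A" is rejected rather than silently accepted, while the lead diacetate case still needs a decision about which level a salt belongs to.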
Open questions How do we represent the relationship between named entities and documents? How do we integrate ontologies and word-sense disambiguation? What is the best way of distinguishing molecules and samples?
Acknowledgements • University of Cambridge: Peter Corbett • OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo) • www.projectprospect.org