Chemical named entity recognition and literature mark-up

Colin Batchelor
Informatics Department, Royal Society of Chemistry
batchelorc@rsc.org

Overview
Project Prospect: what we find and how we find it.
RDF: how should we be disseminating it?
Next steps: basics for a chemical ontology.

Project Prospect: What do we find?
• Chemical compounds
• Chemical terms from the IUPAC Gold Book
• Gene products: function, process, location
• Nucleotide and polypeptide sequence terms
• Cell types

Project Prospect: How do we find it?
For compound names:
• ~60% Oscar (Corbett and Murray-Rust 2006; Batchelor and Corbett 2007)
• ~20% PubChem
• ~20% ChemDraw
For compound numbers:
• ~70% author ChemDraw
• ~30% editors

RDF in an RSS reader

RDF: how we do it now
Content module from RSS 1.0: http://web.resource.org/rss/1.0/modules/content
In what sense does an article “contain” pyridine or base pairs?
We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.

RDF: what it looks like now
<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff …] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
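As a rough illustration of how a consumer might read such an item, here is a minimal Python sketch using only the standard library. The wrapping rdf:RDF element, the sample title and the predicate URI http://example.org/vocab#mentions are assumptions made for this example; they are not part of the RSC feed. The sketch pulls out the content:item URIs and then restates them as explicit "mentions" statements, the kind of predicate the previous slide asks for.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
CONTENT = "http://purl.org/rss/1.0/modules/content/"

# Hypothetical predicate URI, used only for illustration.
MENTIONS = "http://example.org/vocab#mentions"

INCHI_URI = (
    "info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22"
    "(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"
)
SO_URI = "http://purl.org/obo/owl/SO#SO:0000028"

# A cut-down feed containing one item; the wrapper and title are assumed.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns="{RSS}" xmlns:content="{CONTENT}">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <title>Example title</title>
    <content:items>
      <rdf:Bag>
        <rdf:li><content:item rdf:about="{INCHI_URI}"/></rdf:li>
        <rdf:li><content:item rdf:about="{SO_URI}"/></rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def annotations(xml_text):
    """Map each article URI to the list of entity URIs in its content:items bag."""
    root = ET.fromstring(xml_text)
    result = {}
    for item in root.iter(f"{{{RSS}}}item"):
        article = item.get(f"{{{RDF}}}about")
        entities = [
            ci.get(f"{{{RDF}}}about") for ci in item.iter(f"{{{CONTENT}}}item")
        ]
        result[article] = entities
    return result


def mentions_triples(annotation_map):
    """Restate the annotations as N-Triples-style lines with an explicit predicate."""
    return [
        f"<{article}> <{MENTIONS}> <{entity}> ."
        for article, entities in annotation_map.items()
        for entity in entities
    ]


for line in mentions_triples(annotations(SAMPLE)):
    print(line)
```

Whether "mentions" or "is_about" is the right predicate for a given entity is exactly the open question; the sketch only shows that the switch from a bag of contents to explicit statements is mechanically simple.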

Basics for a chemical ontology
• Unambiguous representation of objects of chemical discourse
• Proper parthood relations

Basics for a chemical ontology: 1. Objects of chemical discourse
Must be able to represent and clearly distinguish:
• Compounds
• Classes of compound
• Parts of molecules
• Mixtures
Would be nice to have:
• Disambiguation cues for the first three

Imidazole

An imidazole

The imidazole side-chain/group/ring

Can ChEBI handle this?
• Imidazoles (!) (CHEBI:24780)
• Imidazole (CHEBI:16069)
• Imidazole ring: not yet
• Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues

Disambiguation
One Sense per Discourse (Gale et al. 1992) … this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993) … matches our intuitions

Disambiguation: What a one sense per collocation feature set might look like
CLASS:
• w(-1) = a, an, the, this
• w(0) plural (bit of a cheat, as not a collocation)
PART:
• w(-1) = bridging, terminal
• w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
• w(+1)w(+2) = “building block”, “protecting group”, “side chain”
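The feature set above can be sketched as a small rule-based tagger in Python. This is an illustration under stated assumptions, not the Project Prospect implementation: the tokenizer, the "ends in s" plural test, the rule that PART cues take priority over CLASS cues, and the COMPOUND default when no cue fires are all choices made for this example. Only the cue words themselves come from the slide.

```python
import re

# Cue words taken from the slide (the slide notes there are many more).
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {
    "backbone", "bridge", "chain", "core",
    "dyad", "fluorophore", "fragment", "framework",
}
PART_NEXT_BIGRAMS = {
    ("building", "block"), ("protecting", "group"), ("side", "chain"),
}


def tokenize(text):
    """Crude lower-casing tokenizer; splits 'side-chain' into two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i] from its neighbours."""
    prev = tokens[i - 1] if i > 0 else None
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
    # Assumption: PART cues win, so "the imidazole side-chain" is a part,
    # even though "the" on its own would suggest a class.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" stands in for a real plural test.
    if prev in CLASS_PREV or tokens[i].endswith("s"):
        return "CLASS"
    # Assumption: with no cue at all, fall back to the specific compound.
    return "COMPOUND"


def sense_of(sentence, name):
    """Classify the first occurrence of `name` (or its plural) in `sentence`."""
    tokens = tokenize(sentence)
    for i, tok in enumerate(tokens):
        if tok in (name, name + "s"):
            return classify(tokens, i)
    return None


for s in (
    "Imidazole was added to the mixture.",
    "An imidazole was isolated.",
    "The imidazole side-chain of histidine is protonated.",
):
    print(sense_of(s, "imidazole"), "<-", s)
```

The three test sentences mirror the three slides "Imidazole", "An imidazole" and "The imidazole side-chain/group/ring".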

Basics for a chemical ontology: 2. Parthood relations
Parthood in ChEBI means at least three things:
• Is necessarily chemically part of: carbonyl group part_of carbonyl compounds
• Is possibly chemically part of: lead(2+) part_of lead diacetate (most lead(2+) isn’t); electron part_of muonium (!)
• Is part of a mixture: kanamycin A part_of kanamycin

Basics for a chemical ontology: 2. Parthood relations
Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)
• Carbonyl compound has_part carbonyl group
• Lead diacetate has_part lead(2+) (?!)
• Muonium has_part electron
• Kanamycin has_part kanamycin A (?!)
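The all-some pattern can be made concrete with a toy check in Python. The individuals below (two lead ions, one unit of lead diacetate) are invented purely for illustration, and the function name all_some is an assumption of this sketch. It shows why "lead(2+) part_of lead diacetate" fails as a class-level statement while the inverse "lead diacetate has_part lead(2+)" passes.

```python
def all_some(instances_of_x, relation, instances_of_y):
    """Class-level 'X rel Y' holds if every instance of X is related
    to at least one instance of Y (the all-some pattern)."""
    return all(relation.get(x, set()) & instances_of_y for x in instances_of_x)


# Hypothetical individuals, for illustration only.
lead_ions = {"pb_ion_1", "pb_ion_2"}
lead_diacetate_units = {"pb_oac2_unit_1"}

# Instance-level relations: pb_ion_2 is not in any lead diacetate.
part_of = {"pb_ion_1": {"pb_oac2_unit_1"}}
has_part = {"pb_oac2_unit_1": {"pb_ion_1"}}

# Not every lead(2+) is part of some lead diacetate.
print(all_some(lead_ions, part_of, lead_diacetate_units))

# Every lead diacetate has some lead(2+) as a part.
print(all_some(lead_diacetate_units, has_part, lead_ions))
```

Note that with this definition a class with no instances satisfies any relationship vacuously, which is one reason toy checks like this are only a starting point.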

Basics for a chemical ontology: 2. Parthood relations
Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships
• Carbonyl compound molecule has_part carbonyl substituent
• Muonium atom has_part electron
• Kanamycin has_component kanamycin A
• Lead diacetate has_component lead(2+) (?!)
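One way to picture Solution 2 is to tag each entity and each relation with the level it lives at and reject mismatches. The Python sketch below is only a reading of the four examples on the slide: the Level enum, the two lookup tables and the function level_consistent are hypothetical names, and assigning kanamycin and lead diacetate to the sample level follows the slide's use of has_component rather than any settled ontology.

```python
from enum import Enum


class Level(Enum):
    MOLECULE = "molecule"
    SAMPLE = "sample"


# Relation names from the slide, tagged with the level of their subject.
RELATION_LEVEL = {
    "has_part": Level.MOLECULE,
    "has_component": Level.SAMPLE,
}

# Subjects from the slide's examples (level assignment is this sketch's reading).
ENTITY_LEVEL = {
    "carbonyl compound molecule": Level.MOLECULE,
    "muonium atom": Level.MOLECULE,
    "kanamycin": Level.SAMPLE,
    "lead diacetate": Level.SAMPLE,
}


def level_consistent(subject, relation):
    """True if the relation is used at the level the subject belongs to."""
    return ENTITY_LEVEL[subject] is RELATION_LEVEL[relation]


# The Solution 2 statement passes; the Solution 1 statement is flagged.
print(level_consistent("kanamycin", "has_component"))
print(level_consistent("kanamycin", "has_part"))
```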

Open questions
How do we represent the relationship between named entities and documents?
How do we integrate ontologies and word-sense disambiguation?
What is the best way of distinguishing molecules and samples?

Acknowledgements
University of Cambridge: Peter Corbett
OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
www.projectprospect.org
