Chemical Entity extraction using the chemicalize.org-technology

Josef Scheiber, Novartis Pharma AG – NITAS/TMS

Where the story of this project started ... A day in October 2008 Some time around 7:45 in the morning ... Novartis Campus Dreirosenbrücke

Vision for text mining: Integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale
• Make text corpora searchable for chemistry
• Generate chemistry databases for use in research based on scientific papers or patents
• Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
• Patent analysis for MedChem projects
Connection table

Mining for Chemical Knowledge - Rationale
Information on compounds targeting GPCRs
HELP
Information explosion
Source: Banville, Debra L. “Mining chemical structural information from the drug literature.” Drug Discovery Today, Number 1/2, Jan. 2006, p. 35-42

Example: Project Prospect – Royal Society of Chemistry
• Enhancing journal articles with chemical features
This helps you identify other articles that discuss the same molecule

Mining for Chemical Knowledge – Focus for today
• Make text corpora searchable for chemistry
• Generate chemistry databases for use in research based on scientific papers or patents
• Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
• Patent analysis for MedChem projects
Connection table

A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )
Vardenafil (2003, Bayer) – € 1.24 billion (USD 1.6 billion)
Sildenafil (1998, Pfizer) – € 11.7 billion (USD 15.1 billion)
Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database

Conventional Database Building

Facts – current standard ... (ACS) owes most of its wealth to its two 'information services' divisions — the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million — 82% of the society's revenue — and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ... Source: ACS homepage

Facts
• Established application
• Straightforward use
• De facto gold standard
• Unique data source
• Very costly
• No structure export for a reasonable price
• Very limited in large-scale follow-up analysis
• Most recent patents not available

Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now – What would be the perfect solution? All patent offices require that all claimed structures be provided in a machine-readable version, available for one-click download

Text extraction Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in machine-readable format
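The definition above can be illustrated with a minimal sketch. It is not the chemicalize.org technology: it assumes a toy name-to-SMILES dictionary (`NAME_TO_SMILES`, a hypothetical stand-in for a real name-to-structure converter) and a hypothetical helper `extract_molecules` that finds the names in a text and emits the structures as JSON, one possible machine-readable format.

```python
import json
import re

# Toy name-to-structure dictionary (an assumption for illustration only;
# a real pipeline would use a name-to-structure converter instead).
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
}


def extract_molecules(text):
    """Return every dictionary name found in the text with its SMILES and offset."""
    hits = []
    # Tokenize on word-like runs; digits and hyphens are allowed inside a name.
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9\-]*", text):
        name = match.group(0).lower()
        if name in NAME_TO_SMILES:
            hits.append(
                {
                    "name": name,
                    "smiles": NAME_TO_SMILES[name],
                    "offset": match.start(),
                }
            )
    return hits


sample = "The claimed tablet contains aspirin and caffeine dissolved in ethanol."
records = extract_molecules(sample)

# Machine-readable output: one JSON record per molecule mention.
print(json.dumps(records, indent=2))
```

A plain dictionary lookup only finds names it already knows; systematic names and typos need the name-to-structure and error-tolerance steps discussed on later slides.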

Mining for Chemical Knowledge: Technologies from providers

The objective: To provide NIBR scientists with a tool that offers sophisticated text analysis methods and thereby leverages the methods of TMS

Mining for Chemical Knowledge – Novartis Tools – the chemicalize technology is working under the hood!
Clipboard analysis
Identified structures
Patent text
View structure onMouseOver
Export to other applications

Mining for Knowledge – Novartis Tools: Input example: J Med Chem paper

Mining for Chemical Knowledge – Use Case
A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project
This enables the identification of the compounds most representative of a competitor patent
Identification of core scaffold
Analysis of substitution patterns

Example – A text-based patent
A patent example
Automated text extraction: 452 compounds
Reference: 636 compounds
71%

Example – An image-based patent
• Text extraction is not suitable for this case: it finds only a meager 40 molecules, against 1129 in the reference. Why?
An entirely image-based patent example
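The figures from the two patent examples can be put side by side with a short calculation. This is only a sketch: it assumes that every extracted compound also appears in the reference set, so that extracted/reference can be read as a recovery rate, which the slides do not state explicitly. The helper name `recovery_percent` is hypothetical.

```python
def recovery_percent(extracted, reference):
    """Share of reference compounds recovered by text extraction, in percent."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return 100.0 * extracted / reference


# Counts taken from the two example slides.
text_based = recovery_percent(452, 636)
image_based = recovery_percent(40, 1129)

print(f"text-based patent:  {text_based:.0f}%")
print(f"image-based patent: {image_based:.1f}%")
```

The gap between the two rates shows why text extraction alone is not enough when the structures are only present as drawings.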

Language issues – e.g. Japanese patents

Encountered problems
• OCR (Optical Character Recognition)!!
• USPTO and WIPO patents are now available in full text in most cases
• Typos!
• Name2Struct problems (less of an issue here)
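One common way to soften the OCR and typo problems listed above is approximate matching against a list of known names. The sketch below uses Python's standard `difflib` module; the name list `KNOWN_NAMES`, the helper `normalize_name` and the cutoff of 0.85 are illustrative assumptions, not part of the tool described in this talk.

```python
import difflib

# Hypothetical list of known compound names to match against.
KNOWN_NAMES = ["sildenafil", "vardenafil", "tadalafil"]


def normalize_name(token, cutoff=0.85):
    """Map a possibly misspelled token to the closest known name, or None."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# "sildenafi1" mimics an OCR confusion of "l" and "1";
# "vardenafll" mimics a typo; "tablet" is not a compound name.
for token in ["sildenafi1", "vardenafll", "tablet"]:
    print(token, "->", normalize_name(token))
```

The cutoff is a trade-off: a lower value rescues more damaged names but also starts to map unrelated words onto compounds, so it would have to be tuned on real patent text.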

IBM initiative: Patent Mining / ChemVerse database (Steve Boyer)
• The objective is to automatically extract all molecules from all available patents and make them searchable in a database
• They leverage cloud computing and have access to all full-text patents
• This is going in absolutely the right direction
• They annotate the molecules with information from freely available databases

Future ideas: Patent Analysis
• Markush translation, Image+Target
• Ranking capabilities of outcome for the user
• “Blurred” dictionaries for translating generic terms like aryl, cycloalkyl etc.
• Select → annotate as entity → on-the-fly error correction
• Result goes into a database → crowdsourcing efforts to improve and store results
• Suggest functionality

To enable true Patinformatics analyses ... Definition by Tony Trippe:

Acknowledgements NITAS/TMS • Therese Vachon • Daniel Cronenberger • Pierre Parisot • Martin Romacker • Nicolas Grandjean • Clayton Springer • Naeem Yusuff • Bharat Lagu • Alex Fromm • Katia Vella • Olivier Kreim And many other people in different divisions of NIBR for their support
