Chemical Entity extraction using the chemicalize.org-technology Josef Scheiber Novartis Pharma AG – NITAS/TMS
Where the story of this project started ... A day in October 2008 Some time around 7:45 in the morning ... Novartis Campus Dreirosenbrücke
Vision for text mining: integration of chemical and biological knowledge
Mining for Chemical Knowledge - Rationale
• Make text corpora searchable for chemistry
• Generate chemistry databases for use in research, based on scientific papers or patents
• Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
• Patent analysis for MedChem projects
(Figure: connection table)
Mining for Chemical Knowledge - Rationale: Information on compounds targeting GPCRs; HELP; Information explosion. Source: Banville, Debra L. “Mining chemical structural information from the drug literature.” Drug Discovery Today, Number 1/2, Jan. 2006, pp. 35-42
Example: Project Prospect – Royal Society of Chemistry
• Enhancing journal articles with chemical features
This helps you identify other articles that discuss the same molecule
Mining for Chemical Knowledge – Focus for today
• Make text corpora searchable for chemistry
• Generate chemistry databases for use in research, based on scientific papers or patents
• Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
• Patent analysis for MedChem projects
(Figure: connection table)
A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )
• Vardenafil (2003, Bayer) – € 1.24 billion (USD 1.6 billion)
• Sildenafil (1998, Pfizer) – € 11.7 billion (USD 15.1 billion)
Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database
Facts – current standard ... (ACS) owes most of its wealth to its two 'information services' divisions — the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million — 82% of the society's revenue — and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ... Source: ACS homepage
Facts
• Established application
• Straightforward use
• De facto gold standard
• Unique data source
• Very costly
• No structure export for a reasonable price
• Very limited in large-scale follow-up analysis
• Most recent patents not available
Not data (search), but integration, analysis and insight, leading to decisions and discovery
Now – What would be the perfect solution? All patent offices would require applicants to provide all claimed structures in a machine-readable version, available for one-click download
Text extraction Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in machine-readable format
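The definition above can be illustrated with a minimal Python sketch. It assumes a tiny hand-made name-to-SMILES dictionary; the dictionary, the function name `extract_molecules` and the sample sentence are hypothetical and are not the chemicalize.org implementation. A real pipeline would need a full name-to-structure converter rather than a lookup table.

```python
import json
import re

# Hypothetical mini-dictionary: lower-case compound name -> SMILES string.
# A real system would use a name-to-structure engine instead.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "ethanol": "CCO",
}


def extract_molecules(text):
    """Return one record per distinct dictionary name found in the text."""
    records = []
    seen = set()
    # Tokenize into word-like chunks (letters, digits, hyphens).
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9\-]*", text):
        name = match.group(0).lower()
        if name in NAME_TO_SMILES and name not in seen:
            seen.add(name)
            records.append({
                "name": name,
                "smiles": NAME_TO_SMILES[name],
                "offset": match.start(),  # position of the first mention
            })
    return records


sample = "The compound was dissolved in ethanol; aspirin and caffeine served as controls."
result = extract_molecules(sample)
# JSON stands in here for "machine-readable format".
print(json.dumps(result, indent=2))
```

The sketch only covers exact single-word names; systematic names, misspellings and structures shown as images need the additional handling discussed on the later slides.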
The objective: To provide a tool that gives NIBR scientists access to sophisticated text analysis methods and thereby leverages the methods of TMS
Mining for Chemical Knowledge – Novartis Tools – the chemicalize technology is working under the hood!
(Screenshot labels: clipboard analysis; identified structures; patent text; view structure on mouse-over; export to other applications)
Mining for Knowledge – Novartis Tools: Input example: J Med Chem paper
Mining for Chemical Knowledge – Use Case
A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project. This enables the identification of the compounds most representative of a competitor patent:
• Identification of the core scaffold
• Analysis of substitution patterns
Example – A text-based patent (a patent example)
Automated text extraction: 452 compounds; reference: 636 compounds (71%)
Example – An image-based patent (an entirely image-based patent example)
• Text extraction is not suitable for this case: it finds only a meager 40 molecules, versus 1129 in the reference. Why?
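The two examples above compare the number of extracted compounds with a reference count. As a rough sketch, the ratio can be computed as below. The function name `recall_percent` is hypothetical, and the calculation assumes that every extracted compound also appears in the reference set, which the slides do not state.

```python
def recall_percent(found, reference):
    """Share of reference compounds recovered by extraction, in percent.

    Assumes every extracted compound is also in the reference set
    (no false positives); the slides do not confirm this.
    """
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return 100.0 * found / reference


text_based = recall_percent(452, 636)    # text-based patent example
image_based = recall_percent(40, 1129)   # image-based patent example

print(f"text-based patent:  {text_based:.1f}%")
print(f"image-based patent: {image_based:.1f}%")
```

Under that assumption the text-based patent reproduces the 71% quoted on the slide, while the image-based patent recovers only a few percent of the reference compounds.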
Encountered problems
• OCR (Optical Character Recognition)!!
• USPTO and WIPO patents are now available as full text in most cases
• Typos!
• Name2Struct problems (less of an issue here)
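OCR errors and typos of the kind listed above are often handled by fuzzy matching against a list of known names. The following is a minimal sketch using Python's standard `difflib` module; the name list, the function name `correct_name` and the cutoff of 0.8 are illustrative assumptions, not the approach used in the Novartis tools.

```python
import difflib

# Hypothetical list of known compound names to match against.
KNOWN_NAMES = ["sildenafil", "vardenafil", "tadalafil"]


def correct_name(token, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a possibly garbled token, or None.

    The cutoff is a similarity ratio between 0 and 1; 0.8 is an
    arbitrary choice for this sketch.
    """
    matches = difflib.get_close_matches(
        token.lower(), known_names, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# "sildenafi1" imitates an OCR error (digit 1 instead of the letter l).
for token in ["sildenafi1", "Vardenafil", "patent"]:
    print(token, "->", correct_name(token))
```

A high cutoff keeps ordinary words from being mapped to compound names, at the cost of missing heavily garbled tokens; the right value depends on the corpus.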
IBM initiative: Patent Mining / ChemVerse database (Steve Boyer)
• The objective is to automatically extract all molecules from all available patents and make them searchable in a database
• They leverage cloud computing and have access to all full-text patents
• This is going in absolutely the right direction
• They annotate the molecules with information from freely available databases
Future ideas: Patent Analysis
• Markush translation, Image+Target
• Ranking capabilities of outcome for the user
• “Blurred” dictionaries for translating generic terms such as aryl, cycloalkyl, etc.
• Select and annotate as entity: on-the-fly error correction
• Result goes in a database: crowdsourcing efforts to improve and store results
• Suggest functionality
To enable true Patinformatics analyses ... Definition by Tony Trippe:
Acknowledgements NITAS/TMS • Therese Vachon • Daniel Cronenberger • Pierre Parisot • Martin Romacker • Nicolas Grandjean • Clayton Springer • Naeem Yusuff • Bharat Lagu • Alex Fromm • Katia Vella • Olivier Kreim And many other people in different divisions of NIBR for their support