Addressing the challenges faced by researchers to analyze IP (patents) and scientific literature using text and image analysis technology. Collaborating with corporate sponsors and partners to gain new insights. Exploring chemical and biomedical information for advancements.
IBM Research An Inter-Corporate Collaboration on Computer Curation of Intellectual Property & the Scientific Literature
What we are trying to accomplish: the challenges of today's researchers. Applying text & image analysis technology to better understand IP (patents) and the scientific literature: computer ‘curation’ of the literature. Stephen Boyer, Ph.D., Sboyer@us.ibm.com, 408-858-5544
The Problem: all content and no discovery?
What we are trying to accomplish: the challenges of today's researchers
The problem: Gain a better understanding of IP (patents) and the scientific literature.
The question: Can we use computers to “read” documents, identify critical entities, and perform meaningful associations that can help us with our work?
What we did:
1) Apply text analytics technology to analyze patents & the scientific literature (>30 M IP documents & Medline abstracts)
2) Apply image analytics to IP documents
3) Explore how these technologies can be applied to foreign documents (for example Chinese & Japanese patents)
The value: Provide new insights into chemical & biomedical information (still a work in progress).
Collaborators: a collaborative work in progress. Corporate Sponsors and Other Informal Collaborators (partners) • IBM Research • Novartis • Pfizer • DuPont • Lilly • Boehringer-Ingelheim • Roche / Genentech • AstraZeneca (AZ) • Bristol-Myers Squibb (BMS) • NIH • University of Texas • EMBL-EBI • University of Dundee • UC Davis • ChemAxon • CambridgeSoft • Dalhousie • Univ of New Mexico
Why this is important! What are the differences between these two molecules?
[Two structure diagrams: sildenafil and vardenafil]
Chemistry: 1 carbon, 1 nitrogen, 1 double bond, 1 hydrogen
Business: $1.7B in revenue; an opportunity loss of $320M; a revenue gain of $320M
• Pfizer patented molecule: Sildenafil (Viagra) • Annual sales of >$1.7 billion • 1st to market, but didn’t patent (cover) the full chemical space
• Bayer patented molecule: Vardenafil (Levitra) • Annual sales of ~$320 million • Late to market, found a “similar” molecule and gained “share”
Example IP Challenge: the challenges of today's researchers. [Diagram labels] Additional Properties; Relationships; How do I find entities from the docs?; How do I find entities’ relationships?; New IP; Web, Scientific & News; Worldwide Patents; Medline; How do I exploit other information sources?; New Insights
Can you find the key molecules in unstructured text, for example a scientific journal or patent? Chemical nomenclature can be daunting: a) (2P/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1 H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d,1 H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H),1.49 (s, 9H). b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1 H-NMR (CD3 OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H). EXAMPLE 24 (2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by
Identify the chemical names, then convert them to structures [chemical names -> structures]! Entity identification. [Same patent passage as on the previous slide] What is this compound?
Problem: I need to find information about Valium (nomenclature issues)
Valium (Trade Name) = Diazepam (Generic Name) = CAS # 439-14-5 (Chemical ID #)
Valium has >149 “names”: ALBORAL, ALISEUM, ALUPRAM, AMIPROL, ANSIOLIN, ANSIOLISINA, APAURIN, APOZEPAM, ASSIVAL, ATENSINE, ATILEN, BIALZEPAM, CALMOCITENE, CALMPOSE, CERCINE, CEREGULART, CONDITION, DAP, DIACEPAN, DIAPAM, DIAZEMULS, DIAZEPAN, DIAZETARD, DIENPAX, DIPAM, DIPEZONA, DOMALIUM, DUKSEN, DUXEN, E-PAM, ERIDAN, EVACALM, FAUSTAN, FREUDAL, FRUSTAN, GIHITAN, HORIZON, KIATRIUM, LA-III, LEMBROL, LEVIUM, LIBERETAS, METHYL DIAZEPINONE, MOROSAN, NEUROLYTRIL, NOAN, NSC-77518, PACITRAN, PARANTEN, PAXATE, PAXEL, PLIDAN, QUETINIL, QUIATRIL, QUIEVITA, RELAMINAL, RELANIUM, RELAX, RENBORIN, RO 5-2807, S.A. R.L., SAROMET, SEDAPAM, SEDIPAM, SEDUKSEN, SEDUXEN, SERENACK, SERENAMIN, SERENZIN, SETONIL, SIBAZON, SONACON, STESOLID, STESOLIN, TENSOPAM, TRANIMUL, TRANQDYN, TRANQUASE, TRANQUIRIT, TRANQUO-TABLINEN, UMBRIUM, UNISEDIL, USEMPAX AP, VALEO, VALITRAN, VALRELEASE, VATRAN, VELIUM, VIVAL, VIVOL, WY-3467
There are many different chemical names for Valium (entity identification):
7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE
7-CHLORO-1-METHYL-5-PHENYL-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-2-OXO-5-PHENYL-3H-1,4-BENZODIAZEPINE
1-METHYL-5-PHENYL-7-CHLORO-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-3H-1,4-BENZIODIAZEPIN-2(1H)-ONE
= Valium = Diazepam = CAS # 439-14-5
Problems of ‘taxonomy’ & name normalization (Taxonomies & Dictionaries). The scientist simply wants information about Valium and has to choose keywords. Sources: Medline, in-house database, Chem. Abstracts, patent database. Multiple documents contain information about Valium under different names: DIAPAM; 439-14-5 (Chemical ID); Pereira notebook 23a; 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE; Sedapam; 7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE; Diazepam
Considerations for searching documents (or web pages) for chemical substances: Chemicals have a wide variety of trivial and official names. A text search on one name cannot find documents that refer to the chemical by one of its alternative names. Synonym expansion is insufficient. Searching by structure will find all such cases. Name normalization is important. Source: J Cooper / IBM
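A minimal sketch of name normalization, assuming a hand-built synonym table: every known name maps to one canonical key (here the CAS number from the Valium slides), and documents are matched on that key rather than on the raw text. The table layout, the function names and the document IDs are illustrative assumptions, not the system described in the talk.

```python
# Hypothetical synonym table: each known name maps to one canonical ID.
# The names and the CAS number are taken from the Valium slides.
CANONICAL_ID = "439-14-5"

SYNONYMS = {
    "valium": CANONICAL_ID,
    "diazepam": CANONICAL_ID,
    "diapam": CANONICAL_ID,
    "sedapam": CANONICAL_ID,
    "7-chloro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": CANONICAL_ID,
    "7-chloro-1,3-dihydro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": CANONICAL_ID,
}


def normalize(name):
    """Return the canonical ID for a name, or None if the name is unknown."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


def find_documents(query, documents):
    """Return IDs of documents that mention the queried substance under any name."""
    target = normalize(query)
    if target is None:
        return []
    return [doc_id for doc_id, names in documents.items()
            if any(normalize(n) == target for n in names)]


# Hypothetical documents, each reduced to the names extracted from it.
DOCS = {
    "medline-1": ["DIAPAM"],
    "patent-2": ["7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE"],
    "notebook-3": ["Benzene"],
}
HITS = find_documents("Valium", DOCS)
```

A plain keyword search for "Valium" would match none of these documents; the lookup through the canonical key matches the first two. A table like this only covers names someone has listed, which is why the slides go on to search by structure.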
Finding similar structures, not just text! Further, we would like to find compounds which are supersets of the given structure, for example toluene and methylnaphthalene. Find documents with similar structures: text searches won’t find them. Source: J Cooper / IBM
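The superset idea can be sketched as a subgraph search: a molecule contains a query if the query's atoms can be mapped onto distinct atoms of the molecule so that every query bond lands on a bond. The sketch below is a simplified assumption, not the tooling used in the talk: it ignores hydrogens, bond orders and aromaticity, and the atom numbering of the example molecules is hand-written.

```python
def adjacency(n_atoms, bonds):
    """Build a neighbour-set list from a list of (atom, atom) bonds."""
    adj = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def contains_substructure(target, query):
    """True if the query graph can be embedded in the target graph (backtracking)."""
    t_atoms, t_bonds = target
    q_atoms, q_bonds = query
    t_adj = adjacency(len(t_atoms), t_bonds)
    q_adj = adjacency(len(q_atoms), q_bonds)
    mapping = {}  # query atom index -> target atom index

    def extend(q):
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # every already-mapped neighbour of q must be bonded to t
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]
        return False

    return extend(0)


RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
BENZENE = (["C"] * 6, RING)
TOLUENE = (["C"] * 7, RING + [(0, 6)])  # atom 6 is the methyl carbon
# 1-methylnaphthalene: ten ring carbons in two fused rings, atom 10 is the methyl carbon
METHYLNAPHTHALENE = (["C"] * 11, [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7),
    (7, 8), (8, 9), (9, 0), (4, 9), (0, 10),
])
```

With these definitions, `contains_substructure(METHYLNAPHTHALENE, TOLUENE)` succeeds while the reverse does not, which is the toluene/methylnaphthalene relationship named on the slide. Backtracking is exponential in the worst case, so a production system would rely on a cheminformatics toolkit instead.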
The Solution (the proposed solution): applying text and image analytics to better understand IP (patents) & the scientific literature: computer ‘curation’ of the literature
Patents contain molecular data in multiple forms: text, images, and manually created chemical Complex Work Units (CWUs) • As text: chemical names found in the text of documents • As bitmap images: pictures of chemicals found in the document • As (manually created) chemical Complex Work Units (CWUs)
Text Analytics: Let's start with text analysis. The computer ‘reads’ documents and attempts to determine domain-specific entities, for example chemical names, gene names, disease names, etc.
Entity extraction
Step 1: Identify the chemical entities
Chemical entities extracted from the page:
7-chloro-1,3-dihydro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one
N-aminoacetyl-5-chloro-N-methylanthranilic acid
Phosphorus pentachloride
benzene
aluminum chloride
5-chloro-N-methyl-N-phthalimidoacetylanthranilic acid
hydrazine
Step 2: Extract chemical names and load into tables
Step 3: Convert words to structures (Name → Structure program; language-free entities)
Convert the chemicals into machine-readable formats! Example name: 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
Connection tables (the example shown is benzene):
  6  6  0  0  0  0  0  0  0  0999 V2000
    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
M  END
SMILES strings: c1ccccc1
InChI: InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H
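To show what "machine readable" means for the connection table on this slide, here is a minimal sketch that reads the atom and bond blocks and derives the molecular formula. It makes two simplifying assumptions: fields are split on whitespace (real V2000 files use fixed-width columns, which matters for larger molecules), and every atom is a carbon whose unused valences are filled with implicit hydrogens. The function names are hypothetical.

```python
# Connection table from the slide (counts line, atom block, bond block).
MOLBLOCK = """\
  6  6  0  0  0  0  0  0  0  0999 V2000
    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
M  END
"""


def parse_ctab(block):
    """Return (atoms, bonds): atoms as (symbol, x, y), bonds as (a, b, order), 0-based."""
    lines = block.strip().splitlines()
    n_atoms, n_bonds = (int(v) for v in lines[0].split()[:2])
    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, _z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y)))
    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a - 1, b - 1, order))  # the file numbers atoms from 1
    return atoms, bonds


def hydrocarbon_formula(atoms, bonds):
    """Formula for an all-carbon skeleton, filling each carbon up to valence 4 with H."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    hydrogens = sum(4 - u for u in used)
    return f"C{len(atoms)}H{hydrogens}"


ATOMS, BONDS = parse_ctab(MOLBLOCK)
FORMULA = hydrocarbon_formula(ATOMS, BONDS)
```

The derived formula agrees with the formula layer of the InChI string on the same slide (C6H6), which is a quick consistency check between two representations of one structure.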
Step 4: Automate the process. Scale up & automate the process: any text (web pages, Medline, patents), IBM servers, HealthCare Life Science data warehouse (examples: Valium, Benzene) • 11 million patent documents • 18 million Medline abstracts • 100 million chemical structures (>12 million unique)
Summary of overall text analysis operations for chemistry (HMM, CRF, CFG). Overall process flow for text analysis: Paper, Words, Chemical Names, [Name=Structure], SMILES string and 2D structure (example: toluene / methyl benzene, SMILES CC1=CC=CC=C1). Dictionary of the English Language minus the Dictionary of Desired Entities. Computational resources: Blue Gene enabled, with options to compute 300 properties per molecule.
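One possible reading of "Dictionary of the English Language minus the Dictionary of Desired Entities" is a stop list: ordinary English words are discarded, except those that are themselves entities of interest, and whatever survives is a candidate name. The sketch below illustrates that reading only; the word lists are tiny stand-ins, and the statistical models named on the slide (HMM, CRF, CFG) are not reproduced here.

```python
# Tiny stand-in for an English dictionary. Note that it contains "benzene",
# which is both an English word and an entity we want to keep.
ENGLISH_WORDS = {
    "the", "was", "added", "to", "and", "then", "in", "of",
    "solution", "dissolved", "benzene",
}
# Stand-in for the dictionary of desired entities.
DESIRED_ENTITIES = {"benzene", "hydrazine", "toluene"}

# English minus desired entities: the words that are safe to ignore.
STOP_WORDS = ENGLISH_WORDS - DESIRED_ENTITIES


def candidate_entities(text):
    """Return tokens that are neither stop words nor plain numbers."""
    candidates = []
    for raw in text.split():
        token = raw.strip(".,;:")  # keep brackets and hyphens: names need them
        if not token or token.isdigit():
            continue
        if token.lower() not in STOP_WORDS:
            candidates.append(token)
    return candidates


FOUND = candidate_entities("The hydrazine was added to benzene and then toluene.")
```

Without the subtraction, "benzene" would be thrown away as an ordinary dictionary word. The surviving candidates would still need a name-to-structure step to confirm that they are chemicals.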
Why use Blue Gene? • Find and compute the 3D structure of every molecule on every page of every patent (and Medline abstract) • Identify every protein (from our dictionary of >350K proteins) on every page of every patent (and Medline abstract) • Identify every disease (from our list of 14,500) on every page of every patent and map it to Medline MeSH codes • Identify the occurrence of every biomarker (from our dictionary of 485 biomarkers) on every page of every patent • ... your request goes here! • Equivalent to 240K simultaneous Google searches: compute properties & find relationships (data warehouse)
Examples Chemicals derived from text analytics
Examples of structures created via automated chemical annotation Chemicals derived from text analytics
Leading Causes of Annotator Problems*
Improper spacing within the chemical name: 2-_(Bicyclo_[2.2._1]_hept-5-en-2-ylamino)_-5-_[2-_(4-chloro-3-methylphenoxy)_ethyl]-l,_3-_thiazol-4_(5H)-one
Run-on lists: indane, 1,2,_3,4- tetrahydroquinoline, 3,_4-dihydro-2H-1,_4-benzoxazine, 1,5-naphthyridine, 1, 8- naphthyridine
Numbering of compounds: Comparative Example 3, 2-bromo-4- (1, 3-dioxo-1, 3-dihydro-2H-isoindol-2-yl) butanoic acid 4-(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl) butanoic acid
Formatting issues: 2-[2-(bicyclo [2.2. 1] hept-5-en-2-ylamino) -4-oxo-4, 5-dihydro-1, 3-thiazol-5-yl] -N-<BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> (4-metlioxyphenyl)-N-methylacetamide
Missing or incorrect parentheses: 5-(2-anilinoethyl)-2-[(2-cyclohex-1-en-1-ylethyl)amino}-1,3-thiazol-4(5H)-one
* Typical problems encountered when dealing with OCR text, using WO/2005/075471 as an example
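Two of the problem classes above lend themselves to simple pre-checks before a name is passed to a name-to-structure converter. The sketch below is a heuristic illustration, not the annotator from the talk: `repair_spacing` removes stray whitespace around hyphens, inside brackets and between locants, and `brackets_balanced` flags names whose brackets do not pair up. Both function names are hypothetical, and the spacing rules can over-merge genuine list separators.

```python
import re

PAIRS = {")": "(", "]": "[", "}": "{"}


def repair_spacing(name):
    """Remove OCR-style stray whitespace from a chemical name (heuristic)."""
    name = re.sub(r"([(\[{])\s+", r"\1", name)         # no space after an opening bracket
    name = re.sub(r"\s+([)\]}])", r"\1", name)         # no space before a closing bracket
    name = re.sub(r"\s*-\s*", "-", name)               # no space around hyphens
    name = re.sub(r"(?<=\d)([,.])\s+(?=\d)", r"\1", name)  # "1, 3" -> "1,3"
    return name


def brackets_balanced(name):
    """True if (), [] and {} are properly nested and closed."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


# Examples taken from the slide.
SPACED = "2-bromo-4- (1, 3-dioxo-1, 3-dihydro-2H-isoindol-2-yl) butanoic acid"
REPAIRED = repair_spacing(SPACED)
BAD_BRACKETS = ("5-(2-anilinoethyl)-2-[(2-cyclohex-1-en-1-ylethyl)amino}"
                "-1,3-thiazol-4(5H)-one")
```

Names that fail the bracket check can be routed to a manual queue or a repair step instead of being silently dropped. Character confusions such as "l" for "1" or "metlioxy" for "methoxy" need a different treatment, for example approximate matching against known name fragments.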
OCR Errors: Compound Names. Searching full-text patents (WO, EP, US, FR, GB, DE, JP) for the term “Simvastatin” yields 9030 patents (3666 INPADOC families). But there are 392 more patents which are not found due to typos and OCR errors:
OCR Errors: Chemical Names If you think that was bad... look at the IUPAC names:
Transposed Characters: Some errors cannot originate from an erroneous OCR process. Accidentally transposed characters are another source of variations:
ehtyl: 1565 patents
mehtyl: 840 patents
compuond: 231 patents
relaese: 44 patents
formual: 1689 patents
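Transposed characters can be caught with an edit distance that counts a swap of two adjacent characters as a single edit (the optimal string alignment variant of Damerau-Levenshtein distance). The sketch below applies it to the misspellings listed above; the vocabulary and the `closest_term` helper are illustrative assumptions.

```python
def osa_distance(a, b):
    """Edit distance where insert, delete, substitute and adjacent swap each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def closest_term(token, vocabulary, max_distance=1):
    """Return the nearest vocabulary term within max_distance, else None."""
    best = min(vocabulary, key=lambda term: osa_distance(token, term))
    return best if osa_distance(token, best) <= max_distance else None


VOCABULARY = ["ethyl", "methyl", "compound", "release", "formula"]
MISSPELLED = ["ehtyl", "mehtyl", "compuond", "relaese", "formual"]
CORRECTED = [closest_term(word, VOCABULARY) for word in MISSPELLED]
```

Each misspelling from the slide is exactly one swap away from the intended word, so a threshold of 1 recovers all five. A low threshold matters for chemistry, where "ethyl" and "methyl" are themselves only one edit apart.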
Chemical Name Annotation of US patents backfile (1976-2005) & US patent applications (2002-2005)
Rule 112 Analysis: Preliminary Results as of June 20, 2006
65,645,252 = # of molecules identified (total)*
3,623,248 = # of unique molecules
1,830,575 = # of molecules passing the Lipinski rules
363,993 = # of documents with possible 112 violations
17,122 = # of 2005 pre-grants with possible 112 violations
* All identified molecules were successfully converted to SMILES strings
Analysis & Results: post-processing with Pipeline Pilot. Molecules: TOTAL 65,645,252; UNIQUE 3,623,248; DRUG¹ 1,830,575. ¹ Passing Lipinski’s “Rule of 5” http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five
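For readers unfamiliar with the "Rule of 5" filter, here is a minimal sketch of the check on precomputed properties. The thresholds are the standard Lipinski criteria; the pass criterion (at most one violation) is a common convention and an assumption here, since the slides do not state which variant was used. The diazepam values are approximate literature figures included for illustration, and the second molecule is entirely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors (OH and NH groups)
    h_acceptors: int    # hydrogen-bond acceptors (N and O atoms)


def lipinski_violations(p):
    """Count how many of the four Rule-of-5 limits a molecule exceeds."""
    return sum([
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def passes_rule_of_five(p, max_violations=1):
    """Assumed convention: a molecule passes with at most one violation."""
    return lipinski_violations(p) <= max_violations


# Approximate values for diazepam, for illustration only.
DIAZEPAM = Properties(mol_weight=284.7, log_p=2.8, h_donors=0, h_acceptors=3)
# Hypothetical large, greasy molecule that breaks three limits.
HYPOTHETICAL = Properties(mol_weight=820.0, log_p=6.4, h_donors=7, h_acceptors=9)

# Share of unique molecules reported as passing, from the slide's counts.
DRUG_LIKE_SHARE = 1_830_575 / 3_623_248
```

From the counts on the slide, roughly half of the unique molecules (about 50.5%) passed the filter.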
IBM's Research Collaboration on Computer Curation Automated Text & Image Analysis ! ‘Annotation Factory’ Data Warehouse Data • Annotators • Chemicals • Biomarkers • Genes • Proteins • Cell Lines • Cell Types • People • Institutions • Diseases • Symptoms • Other Full-Text Chemical Structures Journals Attributes Medline Search Patents Entities Edgar Analysis Relationships Web Co-occurrence Lipinski Rules Section 112 Trends, Molecular Networks & Time lines "UIMA" Blue Gene SciTegic Pipeline Pilot and other Partner Tools
What about processing image data? Image entity recognition: IBM pioneered a process for converting images of chemical structures into MOL files (machine-readable representations of chemical structures). We can also analyze the image content of patents & journals.
Seminal paper on converting chemical images into MOL files Optical Recognition of Chemical Structures (OROCS)
Optical recognition of chemical structures (OROCS): how it works. Stages: Scan, Separate, Vectorize, Segment, Cleanup, OCR, Structure Recognition, Aggregation, Post Process. Example output (SMILES): O=C(CN1C2(C3=CC=CC=C3)OC(C)=CC1=O)N(C)C4=C2C=C(Cl)C=C4
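A recognized structure that comes out as a SMILES string can be given a cheap syntactic sanity check before any chemistry toolkit is involved: parentheses must balance and every ring-closure label must be opened and closed. The sketch below covers only those two checks and is not a SMILES parser; the function name is hypothetical.

```python
def smiles_looks_well_formed(smiles):
    """Check branch parentheses and ring-closure labels; skip bracket atoms."""
    depth = 0
    open_rings = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Bracket atoms may contain digits (isotopes, charges): skip them.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label, written as %nn.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {label}
            i += 3
            continue
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens, second closes
        i += 1
    return depth == 0 and not open_rings


# SMILES string shown on the OROCS slide.
SLIDE_SMILES = "O=C(CN1C2(C3=CC=CC=C3)OC(C)=CC1=O)N(C)C4=C2C=C(Cl)C=C4"
```

The string from the slide passes both checks. Passing them says nothing about valence or chemical plausibility, so this only catches truncated or garbled output early.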
Optimization of the image processing process: extract the images from the page; isolate the chemical images; OCR the chemical image; SMILES string. Pre-processing of the images makes a significant difference.
This shows the selective extraction of image data from within the patent Individual images
Examples: results from OCR of chemical images. Image extracted from the page; structure generated from the image; SMILES string generated from the image. Chemical derived from OCR of image data. Source: Dr John Kinney
Learning from the Exceptions
• Radicals, polymers, organometallics
• Name lookup table differences
• “formal”
• Structure conventions differ
• e.g., CH3MgBr vs. CH3Mg+.Br-
• Ionization state/stereochemistry
• Internal error corrections
• Some names are incomplete and therefore ambiguous!
Differences of opinion: often tagged as ambiguous. Where do the punctuation marks belong?
Structures from Images: Image-to-structure software is very effective on clean, crisp images. Like text, image quality in documents varies greatly! Improper structure assignments are common.
Structure Recognition Process: Clipped images from documents are used; processing of full-page images is slow and gives many errors. OSRA (NIH) is run to produce SDFile output. A Pipeline Pilot protocol is used to analyze and filter the resulting structure set.
Criteria for filtering invalid structures • Presence of non-element atoms (R, X, etc.) • Inappropriate internal coordinates (bond lengths and angles) of the 2D representation • Over-assigned stereochemistry can be corrected rather than removing the entire structure
Examples of common errors in translation (error, example structure, filter rule)
Error: Double bond interpreted as two single bonds. Filter rule: The minimum bond distance where neither atom is hydrogen is required to be greater than 0.85 Å.
Error: Aromatic bond interpreted as exocyclic bond from ring. Filter rule: The minimum bond angle from an exocyclic terminal atom to the ring atoms is required to be greater than 50°.
Examples of common errors in translation (error, example structure, filter rule)
Error: Atom found in center of single bond. Filter rule: The maximum bond angle of a carbon with exactly two single bonds is required to be less than 155°.
Error: Single bond divided into two single bonds. Filter rule: The minimum bond angle which includes any terminal atom is required to be greater than 10°.
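The geometric filter rules can be sketched directly on 2D coordinates. The code below is an illustrative reading of three of the four rules (minimum heavy-atom bond length, maximum angle at a carbon with exactly two single bonds, minimum angle involving a terminal atom); the exocyclic-ring rule is left out because it needs ring perception. The data layout, the rule labels and the reading of "includes any terminal atom" as "one end of the angle is a terminal atom" are assumptions, not the Pipeline Pilot protocol itself.

```python
import math
from itertools import combinations


def angle_deg(coords, center, a, b):
    """Angle a-center-b in degrees, from 2D coordinates."""
    ax, ay = coords[a][0] - coords[center][0], coords[a][1] - coords[center][1]
    bx, by = coords[b][0] - coords[center][0], coords[b][1] - coords[center][1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        return 0.0  # coincident atoms: treat as a degenerate angle
    cos = (ax * bx + ay * by) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def failed_filters(symbols, coords, bonds):
    """Return the sorted labels of the filter rules a structure violates."""
    neighbours = {i: [] for i in range(len(symbols))}
    for a, b, order in bonds:
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))
    failed = set()
    for a, b, _ in bonds:
        heavy = "H" not in (symbols[a], symbols[b])
        if heavy and math.dist(coords[a], coords[b]) <= 0.85:
            failed.add("short_bond")
    for center, nbrs in neighbours.items():
        two_single = len(nbrs) == 2 and all(order == 1 for _, order in nbrs)
        for (a, _), (b, _) in combinations(nbrs, 2):
            angle = angle_deg(coords, center, a, b)
            if symbols[center] == "C" and two_single and angle >= 155:
                failed.add("flat_carbon")
            terminal = len(neighbours[a]) == 1 or len(neighbours[b]) == 1
            if terminal and angle <= 10:
                failed.add("sharp_terminal_angle")
    return sorted(failed)


# Benzene coordinates from the connection-table slide (a valid structure).
BENZENE_XY = [(6.7092, 5.6087), (6.7076, 4.5056), (7.6607, 3.9551),
              (8.6160, 4.5062), (8.6121, 5.6136), (7.6583, 6.1591)]
BENZENE_BONDS = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1)]
# Hypothetical artefact: an extra carbon sitting in the middle of a straight bond.
STRAIGHT_XY = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
STRAIGHT_BONDS = [(0, 1, 1), (1, 2, 1)]
```

Benzene passes all three checks, while the straight three-atom chain trips the 155° rule, matching the "atom found in center of single bond" error described on the slide.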
Conversion Statistics • 20,081 patents with 487,537 clip files 35% clean
Text Processing Operations Image Processing Operations PTO/ Data Processing Operations Text [ChemList ] Chem CWU’s ‘Clip’ Images [Name=Structure] CDX / MOL files OSRA /Clide SMILES SDF files SDF files Multi-step post processing – Operations Multi-step post processing – Operations Multi-step post processing – Operations