Chemical name interpretations & Molecular time lines -


This shows the detailed record view, with molecular links.

This shows the chemicals report with molecular timeline and mouse-over of chemical names.

Exploring co-table analysis of molecules with Gene IDs. For example: show me all of the co-occurrences of these (x) molecules with these (any / all) genes.

1 From the main menu, select the Analyze tab.

2 From the Analyze menu, select the Cotable tab.

3 Enter the InChI keys for the molecules of interest. Click here to enter a sample (test) set of molecules.

4 Select the patent field to explore “patents”. These are the molecules of interest (InChI keys to explore). Select the patent field here.

5 Select facet = patent field + Gene, then click Analyze. Molecules; Facet = Patents + Genes.

This shows the “cotable” results: co-occurrences of molecules + NCBI Gene IDs. These are the NCBI Gene ID numbers. To transpose the charts or export the data, click here.

This shows the transposed chart of co-occurrences of molecules + NCBI Gene IDs. Click here to see the patents containing this molecule + this particular gene.
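To make the idea of a co-table concrete, here is a minimal Python sketch of how such a molecule-by-gene co-occurrence count could be computed. It is not the SIMPLE / SIIP implementation; the patent numbers, InChI key placeholders, gene IDs, and function names (`cotable`, `transpose`, `supporting_docs`) are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical annotated patents: each lists the InChI keys and NCBI Gene IDs
# found in its text. Every identifier below is a made-up placeholder.
patents = {
    "US-0000001": {"molecules": {"KEY-A", "KEY-B"}, "genes": {"1001", "1002"}},
    "US-0000002": {"molecules": {"KEY-A"}, "genes": {"1001"}},
    "US-0000003": {"molecules": {"KEY-B"}, "genes": {"1002", "1003"}},
}


def cotable(docs, molecules_of_interest):
    """Count, per (molecule, gene) pair, the documents that mention both."""
    counts = Counter()
    for doc in docs.values():
        for mol in doc["molecules"] & molecules_of_interest:
            for gene in doc["genes"]:
                counts[(mol, gene)] += 1
    return counts


def transpose(counts):
    """Swap rows and columns: (molecule, gene) becomes (gene, molecule)."""
    return Counter({(gene, mol): n for (mol, gene), n in counts.items()})


def supporting_docs(docs, mol, gene):
    """The documents behind one cell, i.e. what a click on the cell would list."""
    return sorted(
        doc_id
        for doc_id, doc in docs.items()
        if mol in doc["molecules"] and gene in doc["genes"]
    )


table = cotable(patents, {"KEY-A", "KEY-B"})
for (mol, gene), n in sorted(table.items()):
    print(mol, gene, n)
```

Each cell of the table is simply a document count, so transposing the chart or drilling down to the patents behind a cell needs no extra data beyond the per-document annotations.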

Co-table Analysis. For example: show me all documents where Imitrex was mentioned with “any” sign and/or symptom (note: these are terms such as headache, vomiting, nausea, etc.; there are > 680 of them).

1 Draw a compound of interest

2 Click: view compound in co-table

3 Select a MeSH category for co-occurrence analysis

4 Click analyze

This shows the number of documents that contained the source molecule and ANY of the MeSH C23 terms. Click on the numbers to “link to” the documents.

Type in a new MeSH code to change the analysis from ‘signs & symptoms’ (C23) to diseases (C01)

This shows the number of documents that contained the source molecule and ANY of the MeSH – disease (C01) terms
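Switching the analysis from one MeSH code to another amounts to changing a prefix filter on the MeSH tree numbers attached to each document. The Python sketch below illustrates this under stated assumptions: the documents, drug names, and tree numbers (`C23.001`, `C01.001`, and so on) are invented placeholders, and `in_category` and `cooccurrence` are hypothetical helper names, not part of the application shown in the slides.

```python
from collections import Counter

# Hypothetical documents: molecules found in the text, plus MeSH annotations
# as {tree number: term}. Tree numbers and terms are placeholders.
docs = [
    {"id": "d1", "molecules": {"drug-1"},
     "mesh": {"C23.001": "headache", "C01.001": "infection"}},
    {"id": "d2", "molecules": {"drug-1", "drug-2"},
     "mesh": {"C23.001": "headache", "C23.002": "nausea"}},
    {"id": "d3", "molecules": {"drug-2"}, "mesh": {"C23.002": "nausea"}},
    {"id": "d4", "molecules": {"drug-1"}, "mesh": {}},
]


def in_category(tree_number, category):
    """True if the tree number is the category itself or lies beneath it."""
    return tree_number == category or tree_number.startswith(category + ".")


def cooccurrence(docs, molecule, category):
    """Per-term document counts, plus the docs matching ANY term in the category."""
    per_term = Counter()
    any_term_docs = []
    for doc in docs:
        if molecule not in doc["molecules"]:
            continue
        terms = {name for tn, name in doc["mesh"].items() if in_category(tn, category)}
        if terms:
            any_term_docs.append(doc["id"])
            per_term.update(terms)
    return per_term, any_term_docs


# Same molecule, two categories: only the category argument changes.
symptoms, symptom_docs = cooccurrence(docs, "drug-1", "C23")
diseases, disease_docs = cooccurrence(docs, "drug-1", "C01")
print(symptoms, symptom_docs)
print(diseases, disease_docs)
```

Comparing two drugs, as in the later slides, is then a matter of calling `cooccurrence` once per molecule with the same category and placing the per-term counts side by side.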

This shows the comparison of 2 drugs and the co-occurrence of MeSH Symptoms (C23) terms

This shows the comparison of different statins and the co-occurrence of MeSH terms. Chemical structures vs. signs and symptoms: Medline co-occurrence of statin structures vs. MeSH.

Screenshots from our SIMPLE / SIIP Web application

Chemical Search using ChemAxon w/ DB2 Search Proximal Search Nearest Neighbor Search

Clustering Claims Originality BioTerm Analysis Discovery

Landscape Analysis Visualization Networks

IBM’s Massively Parallel Probabilistic Architecture

Watson generates and scores many hypotheses using an extensible collection of Natural Language Processing, Machine Learning and Reasoning Algorithms. These gather and weigh evidence over both unstructured and structured content to determine the answer with the best confidence.

[Architecture diagram. Stages: Question, Question/Topic Analysis, Query Decomposition, Hypothesis Generation (Primary Search, Candidate Answer Generation), Soft Filtering, Hypothesis & Evidence Scoring (Supporting Evidence Retrieval, Evidence Retrieval, Deep Evidence Scoring, Answer Scoring), Synthesis, Final Merging & Ranking (Trained Models), Answer, Confidence. Inputs: A. Sources, E. Sources.]

Source: J Kreulen
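The flow of stages named above (hypothesis generation, soft filtering, evidence scoring, final merging and ranking) can be pictured with a toy Python sketch. It is in no way Watson's implementation: the knowledge source, the single word-overlap scorer, the threshold, and the normalisation into a "confidence" are all assumptions chosen to keep the example small.

```python
# Toy knowledge source: (candidate answer, supporting passage). Invented data.
SOURCES = [
    ("aspirin", "aspirin is used to treat headache and fever"),
    ("aspirin", "aspirin relieves headache pain"),
    ("statin", "a statin lowers cholesterol"),
    ("caffeine", "caffeine may ease headache"),
]


def words(text):
    return set(text.lower().split())


def generate_hypotheses(question):
    """Primary search: any passage sharing a word with the question yields a candidate."""
    q = words(question)
    return [(ans, passage) for ans, passage in SOURCES if q & words(passage)]


def overlap_score(question, passage):
    """One toy evidence scorer: fraction of question words found in the passage."""
    q = words(question)
    return len(q & words(passage)) / len(q)


def answer(question, soft_threshold=0.2):
    merged = {}
    for ans, passage in generate_hypotheses(question):
        score = overlap_score(question, passage)
        if score < soft_threshold:  # soft filtering: drop weak candidates early
            continue
        # merging: keep the best evidence score seen for each distinct answer
        merged[ans] = max(merged.get(ans, 0.0), score)
    total = sum(merged.values())
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    # ranking: normalise scores so they sum to 1 and read as rough confidences
    return [(ans, score / total) for ans, score in ranked]


result = answer("what treats headache pain")
print(result)
```

A real system would combine many scorers through trained models rather than a single overlap measure; the sketch only shows how candidates fan out, get pruned, and are merged back into one ranked list with confidences.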

Technical Issues to consider when applying QA systems like Watson

Nature of Domain: Open vs. Closed
Closed domain implies all knowledge is contained within a specific domain characterized by ontologies, and there is no need to go outside the domain. Jeopardy is an open-domain example, where the questions draw on general knowledge.

Knowledge/Data Sources: Availability
QA systems are natural language search engines. Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient or inadequate, then it is not possible for the system to provide an answer. In some cases one would need to envisage interactive QA that requires human interaction to guide the search. Another very important consideration is the availability of sufficient sample data for training (i.e. a training corpus).

Need for multi-modality
Is there a need for transcription from speech to text before a question is answered? This would require integration of speech-to-text capabilities that are not really ready for real-time applications.

Latency
Watson is capable of processing 500 GB of information per second with a 3 sec response to questions, and it kept most of its knowledge sources in memory (as opposed to disk) for speed. What is the latency requirement for the application?

Multi-Lingual or Cross-Lingual Support
Watson can support only English at this time; with language-specific parsers, other languages can be supported. If knowledge sources or QA are required in multiple languages, then the application would not be a good candidate. Additionally, if cultural context has to be accommodated in the answer, then it would not be prudent to deploy QA systems directly interacting with users.

Question Type
Decomposition and classification of the question is critical to how QA systems work. The bulk of the question types in Jeopardy were factoid questions. Watson did not include 2 question categories: one is audio/video type questions that require looking at a video to answer, and the other is questions that require special instructions (e.g. verbal instructions to explain a question).

Answer Types
Watson is not designed to curate a task-oriented system. It can handle temporal and geo-spatial reasoning in its answers. As it stands, it cannot handle business process type of reasoning (to do task B, tasks A and C must be completed, etc.).

Watson Infrastructure
DeepQA Application (Java/C++)
Apache Hadoop + Apache UIMA
SUSE Linux Enterprise Server 11
• 90 Power 750 Servers
• Each Server: 3.5GHz POWER7 8 Core Processor with 4 threads/core
• Total: 2880 POWER7 Cores with 16TB RAM
• Processing speed: 500GB/sec; 80 TeraFLOPS
• 94th on Top 500 Supercomputers
• Note: This hardware is for Jeopardy. Any other application of Watson will require appropriate sizing and optimization for purpose.
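Since any other application needs its own sizing, it helps to see what the quoted totals imply per server. The short Python sketch below only does arithmetic on the figures from the slide; the per-server values it derives (cores, processors, RAM) are inferred from those totals rather than stated in the source, and the RAM figure assumes 1 TB = 1024 GB.

```python
# Figures quoted on the slide.
servers = 90
total_cores = 2880
cores_per_processor = 8
threads_per_core = 4
total_ram_tb = 16

# Derived values (inferred from the totals, not stated on the slide).
cores_per_server = total_cores // servers                         # 32
processors_per_server = cores_per_server // cores_per_processor   # 4
hardware_threads = total_cores * threads_per_core                 # 11520
ram_per_server_gb = total_ram_tb * 1024 / servers                 # about 182 GB

print(cores_per_server, processors_per_server, hardware_threads)
print(round(ram_per_server_gb), "GB of RAM per server")
```

The totals are therefore consistent only if each server carries several 8-core processors, which is worth keeping in mind when scaling the configuration down for a smaller workload.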

I would like to acknowledge the IBM Almaden Research team: Jeff Kreulen, Ying Chen, Scott Spangler, Alfredo Alba, Tom Griffin, Eric Louie, Su Yan, Issic Cheng, Prasad Ramachandran, Bin He, Ana Lelescu, Qi He, Linda Kato, Brad Wade, John Colino, Meenakshi Nagarajan, Timothy J Bethea, German Attanasio, Laura Anderson, Robert Prill, plus a host of folks from IBM China Labs.

Back-up slides

Challenges ahead
• Access to full text
• Language issues
  • Chinese
  • Japanese
  • Korean
  • Other
• Legal issues
• Web data
• Integration with medical content

Attempts to process Chinese Patent Documents. Extracting chemical structures from Chinese patents… Chemicals from Chinese Patents

Computer Curation Process Overview & integration with our collaborators

[Process diagram. Data Sources: U.S. Patents (1976-2009), U.S. Pre-Grants (All), PCT & EPO Apps, Medline Abstracts (>18 M), Selected Internet Content, In-House Content. Services Hosted at IBM Almaden: Annotation Factory (Parse & Extract data; Annotator 1, Annotator 2; Computational Analytics), ADU* Database + computed Meta Data, IP Database (e.g. DB2), ChemVerse db (Semantic Associations), eClassifier & Other Data Associations. User Applications: ChemVerse, Knime or Pipeline Pilot, BIW, Cognos/DDQB/Other Apps, ChemAxon Search, SIMPLE; View selected Documents & Reports.]

* ADU = Automated Data Update
