Unstructured Machine Learning: Providing the link between Genetic Data and Published Research
Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street, Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com
What is Machine Learning?
• creating computer programs that get better with experience
• learn how to make expert judgments
• discover previously hidden, potentially useful information (data mining)

How does it work?
• user provides learning system with examples of concept to be learned
• induction algorithm infers a characteristic model of the examples
• model is used to predict whether or not future novel instances are also examples, and it does this very consistently and very, very quickly!
Structured Learning

Mushroom Data

Weight   Damage   Dirt    Firmness   Quality
heavy    high     mild    hard       poor
heavy    high     mild    soft       poor
normal   high     mild    hard       good
light    medium   mild    hard       good
light    clear    clean   hard       good
normal   clear    clean   soft       poor
heavy    medium   mild    hard       poor
. . .

[Figure: decision tree learned from the mushroom data, with nodes weight (heavy / normal / light), dirt (mild / clean) and firmness (hard / soft), and leaves good / poor]
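The induction step on this slide can be sketched in a few lines. The code below is a minimal ID3-style decision-tree learner run on the seven mushroom rows shown above. It is an illustration only: the slides do not say which induction algorithm was used, the function names (`entropy`, `split`, `build`, `predict`) are hypothetical, and because ties between equally good attributes are broken by column order, the tree it produces may differ from the one drawn on the slide.

```python
from collections import Counter
from math import log2

# Attribute names and the seven example rows from the slide's table.
# The last field of each row is the class label (quality).
ATTRS = ["weight", "damage", "dirt", "firmness"]
ROWS = [
    ("heavy", "high", "mild", "hard", "poor"),
    ("heavy", "high", "mild", "soft", "poor"),
    ("normal", "high", "mild", "hard", "good"),
    ("light", "medium", "mild", "hard", "good"),
    ("light", "clear", "clean", "hard", "good"),
    ("normal", "clear", "clean", "soft", "poor"),
    ("heavy", "medium", "mild", "hard", "poor"),
]


def entropy(rows):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def split(rows, i):
    """Group rows by the value of attribute column i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return groups


def build(rows, attrs):
    """Return a label (leaf) or a (attribute_name, {value: subtree}) node."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:
        # No attributes left to test: fall back to the majority label.
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def remainder(i):
        # Weighted entropy left after splitting on column i (lower is better).
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, i).values())

    best = min(attrs, key=remainder)
    rest = [a for a in attrs if a != best]
    return (ATTRS[best], {v: build(g, rest) for v, g in split(rows, best).items()})


def predict(tree, example):
    """Walk the tree for a dict of attribute values; None if a value is unseen."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches.get(example[name])
    return tree


tree = build(ROWS, list(range(len(ATTRS))))
print(tree)
print(predict(tree, {"weight": "heavy", "damage": "clear", "dirt": "clean", "firmness": "hard"}))
```

On this data the first split is on weight, since every heavy example is poor and every light example is good. With only seven rows the model is a toy, but it shows the loop the slide describes: examples in, model inferred, novel instances classified.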
Unstructured Learning
• data does not have fixed fields with specific values
• examples: images, continuous signals, expression data, text
• learning proceeds by correlating the presence or absence of any and all salient attributes

Document Classification
• given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant
• prioritize them: i.e. quantify “how relevant” documents are to the topic
• not limited to keywords (nor is it misled by them)
• adapt to the user’s needs (ephemeral or long-term)
How Text Mining Works
• Users supply the system with training data: documents that are good examples of the desired category
• The system builds ‘classifiers’: statistical models based on the training data
• The system classifies novel data: identifies other documents about the desired category
• Results are displayed or stored: files can be viewed, routed to end users or stored in databases
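The four steps above can be sketched with a small statistical text classifier. The example below uses multinomial naive Bayes with add-one smoothing, chosen only because it is compact; the slides do not state which model Reel Two's Classification System uses. The class name `RelevanceClassifier`, the training sentences and the use of explicit negative examples are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class RelevanceClassifier:
    """Toy naive Bayes model: relevant vs. not relevant to one category."""

    def __init__(self, positive_docs, negative_docs):
        # Step 1 and 2: take training documents and build word statistics.
        self.counts = {True: Counter(), False: Counter()}
        for label, docs in ((True, positive_docs), (False, negative_docs)):
            for doc in docs:
                self.counts[label].update(tokens(doc))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {k: sum(c.values()) for k, c in self.counts.items()}
        self.log_prior_odds = math.log(len(positive_docs) / len(negative_docs))

    def _log_likelihood(self, word, label):
        # Add-one (Laplace) smoothing so unseen words never give log(0).
        return math.log(
            (self.counts[label][word] + 1) / (self.totals[label] + len(self.vocab))
        )

    def score(self, doc):
        """Log-odds that doc belongs to the category; higher means more relevant."""
        s = self.log_prior_odds
        for w in tokens(doc):
            if w in self.vocab:  # words never seen in training are ignored
                s += self._log_likelihood(w, True) - self._log_likelihood(w, False)
        return s


# Hypothetical training data: good examples of the category, plus counter-examples.
positive = [
    "p38 MAP kinase activation by phosphorylation",
    "kinase cascade phosphorylates the target protein",
]
negative = [
    "quarterly sales report for the retail division",
    "football match report and league table",
]
clf = RelevanceClassifier(positive, negative)

# Step 3 and 4: classify novel documents and display them ranked by relevance.
incoming = ["retail sales fell last quarter", "MAP kinase cascade in yeast"]
ranked = sorted(incoming, key=clf.score, reverse=True)
for doc in ranked:
    print(f"{clf.score(doc):+.2f}  {doc}")
```

Because the score is a sum over every known word rather than a match on a fixed keyword, a document can rank highly without containing any single required term, which is the sense in which the earlier slide says classification is "not limited to keywords".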
Classification System
• Familiar Windows-style interface
• Drag-and-drop documents to create custom categories
• Client-specific categories
• Classified documents are ranked by relevance
• View contents of individual documents: sentences are highlighted by their relevance to the category
The Gene Ontology – A Good First Step

The Initial Problem: Individual curators evaluate data differently:
• Activation of p38 MAP Kinase
• MAPK-KK Cascade
• Protein Modification

The Initial Solution: The Gene Ontology (GO), a controlled vocabulary with defined relationships between items.

GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function. Of these, only about 3,800 GO Terms are ‘active’, that is, terms that have more than just one or two publications attached.

While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related.
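One way to picture "a controlled vocabulary with defined relationships" is as a graph of terms with parent links. The sketch below is a toy: the three labels come from the slide, but the parent links, the extra term "Signal Transduction" and the helper names `ancestors` and `related` are invented for illustration and do not reproduce the real GO structure.

```python
# Hypothetical mini-ontology. Each term maps to its parent terms.
# The links below are assumptions for illustration, NOT actual GO relationships.
PARENTS = {
    "Biological Process": [],
    "Protein Modification": ["Biological Process"],
    "Signal Transduction": ["Biological Process"],
    "MAPK-KK Cascade": ["Signal Transduction"],
    "Activation of p38 MAP Kinase": ["MAPK-KK Cascade", "Protein Modification"],
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def related(a, b):
    """True if one term is equal to, or an ancestor of, the other."""
    return a in ancestors(b) or b in ancestors(a)


# Two curators who chose different labels can still be compared:
print(related("Activation of p38 MAP Kinase", "Protein Modification"))
print(related("MAPK-KK Cascade", "Protein Modification"))
```

With explicit relationships, two annotations made at different levels of detail can be recognized as compatible, which is not possible when each curator uses free text.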
GO KDS – Filling the gaps in GO

GO is only a partial solution
• Enormous gap between GO-annotated docs (27,000) and the full MEDLINE database (12 million entries): only about 0.2% of MEDLINE is annotated.
• Updates lag behind.
• Scientists must understand and agree to use the GO.
• Knowledge changes and alters definitions.

Using GO “as is” takes too long and delivers too little.

The Gene Ontology Knowledge Discovery System (GO KDS) bridges the gap by classifying all of MEDLINE.
• New documents are classified as they’re added
• Scientists can now annotate gene targets quickly and reliably
• GO KDS is updated along with GO and MEDLINE
GO KDS Interface Tour
• Current GO term(s) open
• Location of listed term in GO
• All sub-terms for the listed term: click on a term to further refine your search
• Enter a keyword to search in this GO category
• Original GO classifications (by domain expert)
• KDS discovers novel classifications
• Color of stars identifies the GO branch; number of stars indicates confidence of category placement
• Opens abstract in separate window
GO KDS Key Benefits
www.go-kds.com
• Quickly sort documents into the categories most relevant to the user
• Replace laborious annotation by domain experts with a trainable, automated system
• Discover conceptual links between previously unrelated scientific domains
• Identify key articles for pertinent research
• Integrate public, private and proprietary documents
How is document classification useful?

Patent preparation
• Searching patent databases
• Collecting relevant documents
• Synthesizing information

Life Science Research
• Finding relevant literature
• Prioritizing articles/reports
• Discovering hidden connections
• Distributing information

Drug Approval
• Collecting information
• Organizing/Collating documents
• Satisfying approval criteria
Intelligent Text Mining: Therapeutic Courses

One Reel Two client is using Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.

The Problem: The client must generate e-learning courses from hundreds of pages of reports, literature and product documentation supplied by the client.

Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.

New Solution: Use Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories.
Intelligent Text Mining – Patent Analysis

Search patent filings for the ideas or concepts behind one’s analysis
• Explore state of prior art, competitive landscape or ‘innovation gaps’
• Overcome intentionally vague language in patent filings

Example Project
Identifying ‘Mechanism of Action’ in life science patents. Patents are classified according to a taxonomy built by the client:
Alzheimer’s Patents
  MoA: 5-HT Inhibitor
  MoA: Acetylcholinesterase
  MoA: Antioxidant
  MoA: Antiviral…

Sample Output
• ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological. MECHANISM OF ACTION - Neurotransmitter release modulator.
• In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.
• The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.
“Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”
Caroline Kovak, General Manager, IBM Life Sciences
Appendix: GO KDS Interface 1. Search for a particular GO term by opening one of the main branches
Appendix 2. ‘Drill down’ through the taxonomy to find a term of interest. Click on that term.
Appendix 3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term.’
Appendix 4. Scroll down to view abstracts.
Appendix 5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.
Appendix 6. View the data intersection between GO categories. Scroll through to view abstract.
Appendix 7. GO terms identify concepts embodied in the abstracts, enabling quick review.
Appendix 8. Select an abstract of interest, and click to open the complete abstract.
Appendix 9. The abstract will open in a new window, allowing you to continue with your search, or to link directly to the journal.