Developing a Semantic Search Application: A Pharma Case Study
Tom Reamy, Chief Knowledge Architect, KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp: Washington DC, 2013
KAPS Group: General
• Knowledge Architecture Professional Services – Network of Consultants
• Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
• Strategy – IM & KM – Text Analytics, Social Media, Integration
• Services:
  • Taxonomy/Text Analytics development, consulting, customization
  • Text Analytics Fast Start – Audit, Evaluation, Pilot
  • Social Media: text-based applications – design & development
• Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
• Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
• Presentations, Articles, White Papers – http://www.kapsgroup.com
Project
• Agile methodology
• Goal – evaluate semantic technologies' ability to:
  • Replace manual annotation of scientific documents – automated or semi-automated
  • Discover new entities and relationships
  • Provide users with self-service capabilities
• Goal – assess feasibility and level of effort
Components – Technology, Resources
• Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization
• Initial integration – passing results between components as XML (see the sketch below)
• Content – scientific journal articles
• Taxonomy – MeSH – selected a small subset
• Access to a "customer" – critical for success
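To make that integration point concrete, here is a minimal sketch of passing extraction results from one component to another as XML, using Python's standard library. The element and attribute names (document, entity, type) are illustrative assumptions, not the project's actual schema.

    import xml.etree.ElementTree as ET

    # Hypothetical extraction results from one component
    # (field and tag names are assumptions, not the actual schema).
    results = {"pmid": "12345",
               "entities": [("Drug", "clonidine"), ("Disease", "hypertension")]}

    # Serialize to XML to hand off to the next component.
    doc = ET.Element("document", pmid=results["pmid"])
    for etype, text in results["entities"]:
        ET.SubElement(doc, "entity", type=etype).text = text
    payload = ET.tostring(doc, encoding="unicode")

    # The receiving component parses the same XML back into objects.
    parsed = ET.fromstring(payload)
    entities = [(e.get("type"), e.text) for e in parsed.findall("entity")]
    print(entities)  # [('Drug', 'clonidine'), ('Disease', 'hypertension')]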
Three Rounds – Iterations
• Visualization – faceted search; sort by date, author, journal (see the sketch below)
• Cambridge Semantics
• Round 1 – PDFs from their database
  • Needed to create additional structure and metadata
  • No such thing as unstructured content
• Rounds 2 & 3 – XML with full metadata from PubMed
• Entity recognition – species, document type, study type, drug names, disease names, adverse events
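A minimal sketch of the faceted-search behavior described above: documents carry facet values (species, document type, drug), the user narrows results by facet, and hits are sorted by date. The field names and sample records are illustrative assumptions.

    # Hypothetical document records with facet metadata (field names are assumptions).
    docs = [
        {"title": "Clonidine in hypertension", "journal": "J Hypertens", "date": "2011-04",
         "facets": {"species": "humans", "doc_type": "clinical trial", "drug": "clonidine"}},
        {"title": "Clonidine in rats", "journal": "Pharmacol Res", "date": "2009-07",
         "facets": {"species": "animals", "doc_type": "study", "drug": "clonidine"}},
    ]

    def faceted_search(docs, **selected):
        """Keep documents whose facets match every selected value, newest first."""
        hits = [d for d in docs
                if all(d["facets"].get(k) == v for k, v in selected.items())]
        return sorted(hits, key=lambda d: d["date"], reverse=True)

    for d in faceted_search(docs, drug="clonidine", species="humans"):
        print(d["date"], d["journal"], "-", d["title"])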
Components & Approach • Rules or sample documents? • Need more precision and granularity than documents can do • Training sets – not as easy as thought • First Rules – text indicators to define sections of the document • Objectives, Abstract, Purpose, Aim – all the “same” section • Separate logic of the rules from the text • Stable rules, changing text • Scores – relevancy with thresholds • Not just frequency of words
Document Type Rules
  (START_2000, (AND,
    (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]",
      _/article:"[Results]", _/article:"[Discussion]"),
    (OR, _/article:"clinical trial*", _/article:"humans"),
    (NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"), …
• Clinical Trial rule, in plain language:
  • If the article has sections like Abstract or Methods
  • AND has phrases around "clinical trial" / "humans"
  • AND does not have words like "animals" within 5 words of the "clinical trial" words
  • then count it and add to a relevancy score
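A hedged Python restatement of the same logic. The operators above (START_2000, DIST_5, etc.) belong to the categorization platform; the sketch below only approximates the AND / OR / not-within-5-words behavior, and the term lists and score value are assumptions.

    import re

    SECTIONS = ["abstract", "methods", "objective", "results", "discussion"]
    TRIAL_TERMS = ["clinical trial", "clinical trials", "humans"]
    EXCLUDE_NEAR = ["approved", "safe", "use", "animals"]

    def clinical_trial_score(text, window=5):
        """Rough analogue: section markers AND trial phrases, unless an exclusion word is nearby."""
        words = re.findall(r"\w+", text.lower())
        lowered = " ".join(words)
        has_section = any(s in lowered for s in SECTIONS)        # OR over section markers
        has_trial = any(t in lowered for t in TRIAL_TERMS)       # OR over trial phrases
        # NOT: an exclusion word within `window` words of a trial word
        near_exclusion = any(
            abs(i - j) <= window
            for i, w in enumerate(words) if w in ("trial", "trials", "humans")
            for j, x in enumerate(words) if x in EXCLUDE_NEAR
        )
        return 10 if (has_section and has_trial and not near_exclusion) else 0

    print(clinical_trial_score("Methods: a randomized clinical trial in humans.") > 0)    # True
    print(clinical_trial_score("Objective: clinical trial of the drug in animals.") > 0)  # False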
Rules for Drug Names and Diseases
• Primary issue – major mentions, not every mention
• Combination of noun phrase extraction and categorization
  • Results – virtually 100%
• Taxonomy of drug names and diseases
  • Capture general diseases like thrombosis and specific types like deep vein, cerebral, and cardiac
  • Combine text about arthritis and its synonyms with text like "Journal of Rheumatology" (see the sketch below)
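A minimal sketch of that taxonomy-driven matching: specific types roll up to a general disease, and synonyms plus contextual evidence such as a journal title all count toward the same concept. The taxonomy entries here are small illustrative samples, not the project's actual vocabulary.

    # Tiny illustrative slice of a disease taxonomy: canonical name -> evidence terms.
    DISEASE_TAXONOMY = {
        "thrombosis": ["thrombosis", "deep vein thrombosis", "cerebral thrombosis", "cardiac thrombosis"],
        "arthritis":  ["arthritis", "rheumatoid arthritis", "osteoarthritis", "journal of rheumatology"],
    }

    def match_diseases(text):
        """Return canonical diseases whose terms (synonyms, specific types, journal evidence) appear."""
        lowered = text.lower()
        return {disease: [t for t in terms if t in lowered]
                for disease, terms in DISEASE_TAXONOMY.items()
                if any(t in lowered for t in terms)}

    print(match_diseases("Deep vein thrombosis after surgery"))
    # {'thrombosis': ['thrombosis', 'deep vein thrombosis']}
    print(match_diseases("Published in the Journal of Rheumatology"))
    # {'arthritis': ['journal of rheumatology']}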
Rules for Drug Names and Diseases
  (OR, _/article/title:"[clonidine]",
    (AND, _/article/mesh:"[clonidine]", _/article/abstract:"[clonidine]"),
    (MINOC_2, _/article/abstract:"[clonidine]"),
    (START_500, (MINOC_2, "[clonidine]")))
• Meaning:
  • Any variation of the drug name in the title – high score
  • Any variation in the MeSH keywords AND in the abstract – high score
  • Any variation in the abstract at least 2x – good score
  • Any variation in the first 500 words at least 2x – suspect
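A hedged Python restatement of that scoring logic: a drug-name variant in the title scores highest, MeSH plus abstract next, two abstract mentions a good score, and two mentions only in the first 500 words a low "suspect" score. The numeric weights are assumptions, and the platform operators (MINOC_2, START_500) are only approximated.

    def drug_mention_score(drug, title, mesh_terms, abstract, body, first_n=500):
        """Approximate the rule: where and how often the drug name appears drives the score."""
        drug = drug.lower()
        abstract_hits = abstract.lower().count(drug)
        first_words = " ".join(body.lower().split()[:first_n])
        if drug in title.lower():
            return 100                                  # any variation in the title - high score
        if any(drug in m.lower() for m in mesh_terms) and abstract_hits >= 1:
            return 90                                   # in MeSH keywords AND in abstract - high score
        if abstract_hits >= 2:
            return 70                                   # at least 2x in the abstract - good score
        if first_words.count(drug) >= 2:
            return 30                                   # only 2x in the first 500 words - suspect
        return 0

    print(drug_mention_score("clonidine",
                             title="Clonidine for resistant hypertension",
                             mesh_terms=["Clonidine", "Hypertension"],
                             abstract="Clonidine was compared with placebo...",
                             body="..."))  # 100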
Rules for Drug Names and Diseases – Results
• Wide range by type – 70-100% recall and precision
• Focus mostly on precision – recall is difficult to test
• One deep-dive area indicated that 90%+ scores for both precision and recall could be reached with a moderate level of effort
• Effort is not linear – 30% accuracy does not mean you are 1/3 done
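For reference, a minimal worked example of the two metrics behind these numbers; the counts are made up purely for illustration. Recall is harder to test because its denominator requires knowing every true mention, which means manually reviewing a sample of the corpus.

    # Precision = correct extractions / all extractions; recall = correct extractions / all true mentions.
    true_positives, false_positives, false_negatives = 90, 10, 30   # illustrative counts only

    precision = true_positives / (true_positives + false_positives)  # 0.90
    recall = true_positives / (true_positives + false_negatives)     # 0.75
    print(f"precision={precision:.2f}, recall={recall:.2f}")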
Iteration 3
• Complete treatment of disease state:
  • Indication (the disease you want to treat)
  • Concomitant disease
  • Adverse or side effects
• Use XML metadata – some variant of "adverse"
• Any combination of words associated with a disease (e.g., depression) and any of the words that indicate an adverse event or effect (see the sketch below)
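A minimal sketch of that adverse-event logic: flag a document when a disease term co-occurs with an adverse-event indicator, or when the XML metadata carries some variant of "adverse". The term lists and field names are illustrative assumptions.

    ADVERSE_INDICATORS = ["adverse", "side effect", "side-effect", "toxicity", "intolerance"]
    DISEASE_TERMS = ["depression", "hypotension", "bradycardia"]

    def adverse_event_hit(text, metadata_keywords=()):
        lowered = text.lower()
        # Metadata route: any XML keyword containing a variant of "adverse"
        if any("adverse" in kw.lower() for kw in metadata_keywords):
            return True
        # Text route: a disease word in combination with an adverse-event indicator
        return (any(d in lowered for d in DISEASE_TERMS)
                and any(a in lowered for a in ADVERSE_INDICATORS))

    print(adverse_event_hit("Depression was reported as a side effect of treatment."))            # True
    print(adverse_event_hit("Depression outcomes improved.", metadata_keywords=["Adverse Effects"]))  # True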
Conclusion
• The project was a success!
  • Useful results – as defined by the customer
  • Reasonable and doable effort level – both for initial development and for maintenance
• Essential success factors:
  • Rules, not document training sets (training sets only a starting point)
  • A full platform for disambiguation – noun phrase extraction, major vs. minor mention
  • Separation of logic and text
• Semantic search works – if you do it smart!
Questions?
Tom Reamy – tomr@kapsgroup.com
KAPS Group – Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com – March 17-19, San Francisco