Literature Mining BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University
Announcement • HW #3 is cancelled. The grades will be adjusted accordingly.
Acknowledgement Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Acknowledgement • Dr. Hongyu Peng (Brandeis Univ.) and Dr. Hagit Shatkay (http://www.shatkay.org) provided part of the slides.
Connecting the dots • The story of thalidomide: from sedative, to cause of birth defects, to anti-cancer drug
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Information Retrieval (IR) • Finding the papers • IR systems aim to identify the text segments (be they full articles, abstracts, paragraphs, or sentences) that pertain to a certain topic (e.g., the yeast cell cycle). • E.g., PubMed, Google Scholar • Ad hoc IR • Text categorization (pre-defined set of categories) • Advanced systems integrate entity recognition
Ad Hoc IR • User provides a query • Boolean model • Index-based (e.g., "gene AND CD")
Boolean Queries
DB: database of documents.
Vocabulary: {t1, …, tM} (terms in DB, produced by the tokenization stage)
Index structure: maps each term to all the documents containing it, e.g., index entries for "acquired immunodeficiency", "asthma", "blood", "blood pressure".
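A minimal sketch of an inverted index with Boolean AND retrieval; the toy documents and whitespace tokenizer are illustrative assumptions, not part of any real system:

```python
from collections import defaultdict

docs = {
    1: "gene expression in crohn disease",
    2: "cytosine deaminase gene therapy",
    3: "blood pressure and capillary density",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # trivial tokenization stage
        index[term].add(doc_id)

def boolean_and(*terms):
    """Boolean AND query: documents containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("gene", "disease"))   # -> {1}
```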
Ad Hoc IR • User provides a query • Boolean model • Challenges: • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy: "CD" (54,745 PubMed entries) can mean cytosine deaminase, cortical dysplasia, compact disk, Crohn's disease, Chagas' disease, capillary density, …
Ad Hoc IR • User provides a query • Vector-based model • Similarity query (vector based) • Semantic search, TIME (Sept 5, 2005): "Search engines are good at matching words … The next step is semantic search – looking for meaning, not just matching key words. … Nervana, which analyzes language by linking word patterns contextually to answer questions in defined subject areas, such as medical-research literature."
The Vector Model
DB: database of documents. Vocabulary: {t1, …, tM} (terms in DB). Document d ∈ DB: a vector <w1d, …, wMd> of weights.
Weighting Principles
• Document frequency: terms occurring in a few documents are more useful than terms occurring in many.
• Local term frequency: terms occurring frequently within a document are likely to be significant for the document.
• Document length: a term occurring the same number of times in a long document and in a short one has less significance in the long one.
• Relevance: terms occurring in documents judged as relevant to a query are likely to be significant (w.r.t. the query). [Sparck Jones et al. 98]
Some Weighting Schemes:
• Binary: Wid = 1 if ti ∈ d, 0 otherwise.
• TF: Wid = fid = # of times ti occurs in d (considers local term frequency).
• TF × IDF (one version): Wid = fid / fi, where fi = # of docs containing ti (considers local term frequency and (inverse) document frequency).
Vector-Based Similarity
Document d = <w1d, …, wMd> ∈ DB; query q = <w1q, …, wMq> (q could itself be a document in DB...)
Sim(q, d) = cosine(q, d) = (q · d) / (|q| |d|)
[Salton 89; Witten et al. 99] (introductory IR)
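A sketch tying the two slides together: TF × IDF weights in the "one version" above (Wid = fid / fi) and cosine similarity for ranking. The corpus and query are toy data:

```python
import math
from collections import Counter

docs = [
    "fish oil reduces blood viscosity",
    "blood viscosity increases in raynaud syndrome",
    "fish oil and platelet aggregability",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# f_i = number of documents containing term t_i
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf_vector(tokens):
    """W_id = f_id / f_i: local term frequency over document frequency."""
    tf = Counter(tokens)
    return [tf[t] / df[t] for t in vocab]

def cosine(u, v):
    """Sim(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tfidf_vector("blood viscosity".split())
for text, doc in zip(docs, tokenized):
    print(f"{cosine(q, tfidf_vector(doc)):.3f}  {text}")
```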
Probabilistic Models
Query q; document d.
• Goal: find all d such that P(relevant | d, q) is high.
• Maximize the log-odds: log[ P(relevant | d, q) / P(irrelevant | d, q) ]
[Sparck Jones et al. 98, Sahami 98, Ponte & Croft 98, Hofmann 99]
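One common way to realize this log-odds ranking (not necessarily the one the cited papers use) is a naive Bayes model over terms. A minimal sketch with made-up relevance-labeled documents:

```python
import math
from collections import Counter

# Toy training data: documents judged relevant / irrelevant to a query.
relevant   = ["fish oil raynaud syndrome", "fish oil blood viscosity"]
irrelevant = ["compact disk storage", "cortical dysplasia imaging"]

vocab = {t for d in relevant + irrelevant for t in d.split()}

def term_probs(docs):
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    # Laplace smoothing so unseen terms don't send the log to -infinity.
    return {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}

p_rel, p_irr = term_probs(relevant), term_probs(irrelevant)

def log_odds(doc):
    """log P(relevant | d, q) - log P(irrelevant | d, q), up to the prior."""
    return sum(math.log(p_rel[t]) - math.log(p_irr[t])
               for t in doc.split() if t in vocab)

print(log_odds("fish oil viscosity"))     # high  -> rank as relevant
print(log_odds("compact disk imaging"))   # low   -> rank as irrelevant
```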
Latent Semantic Analysis [Deerwester, Dumais et al. 1988, 1990] • Motivation: • Overcoming synonymy and polysemy. • Reducing dimensionality. • Idea: • Project from the "explicit term" space to a lower-dimensional "abstract concept" space. • Methodology: • Singular value decomposition (SVD) applied to the document-term matrix. • The dimensions with the highest singular values are used as the features for representing documents.
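A minimal sketch of the projection step with a truncated SVD; the document-term matrix here is made up to show two latent "concepts":

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms).
A = np.array([
    [2, 1, 0, 0],   # docs 1-2 use the first pair of terms
    [1, 2, 0, 0],
    [0, 0, 2, 1],   # docs 3-4 use the second pair of terms
    [0, 0, 1, 2],
], dtype=float)

# SVD: A = U * diag(s) * Vt; keep the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = U[:, :k] * s[:k]   # documents in the k-dim "concept" space

print(np.round(docs_latent, 2))  # docs 1-2 and 3-4 cluster in concept space
```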
Information Retrieval - Details (cont.)
Text Categorization (semantic)
Automatically place documents in the right categories (e.g., Cancer, Apoptosis, Elongation) so as to make them easy to find.
Information Retrieval - Details (cont.)
Rule-Based Text Classification
A knowledge-engineering approach. Boolean rules (DNF), based on the presence/absence of specific terms within the document, decide its membership in the class (e.g., the CONSTRUE system [Hayes et al. 90, 92]).
Example:
If ((<GENE_Name> ∧ transcript) ∨ (<GENE_Name> ∧ Western Blot) ∨ (<GENE_Name> ∧ Northern Blot))
Then GeneExpressionDoc
Else ¬GeneExpressionDoc
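A toy rendering of that DNF rule (not the actual CONSTRUE system; the function name and examples are illustrative):

```python
def is_gene_expression_doc(text, gene_name):
    """DNF rule: class membership if the gene co-occurs with 'transcript'
    or with a Western/Northern blot experiment."""
    t = text.lower()
    if gene_name.lower() not in t:
        return False
    return ("transcript" in t) or ("western blot" in t) or ("northern blot" in t)

print(is_gene_expression_doc("CDC28 transcript levels rise in G1", "CDC28"))  # True
print(is_gene_expression_doc("CDC28 localizes to the nucleus", "CDC28"))      # False
```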
Information Retrieval-Details(cont.) Machine Learning for Text Classification (supervised) • Take a training set of pre-classified documents • Build a model for the classes from the training examples • Assign each new document to the class that best fits it • (e.g. closest or most-probable class.) • Types of class assignment: • Hard: Each document belongs to exactly one class • Soft: Each document is assigned a “degree of membership” in several classes • Methods • Nearest neighbor • Summarizing document vectors • SVM, Bayesian, boosting
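A minimal sketch of the supervised route using scikit-learn (assumed installed; the training texts and labels are made up, and SVM is just one of the methods listed above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Training set of pre-classified documents (toy examples).
train_texts = [
    "p53 induces apoptosis in tumor cells",
    "caspase activation triggers apoptosis",
    "ribosome stalling during elongation",
    "elongation factor binds the ribosome",
]
train_labels = ["apoptosis", "apoptosis", "elongation", "elongation"]

# Build a model for the classes from the training examples.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

# Hard assignment: each new document goes to exactly one class.
print(clf.predict(["bcl-2 blocks apoptosis"]))
```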
Evaluating Extraction and Retrieval • To say how good a system is, we need: • Performance metrics (numerical measures) • Benchmarks on which performance is measured (the gold standard).
Evaluating Extraction and Retrieval (cont.)
Performance Metrics
N items (e.g., documents, terms, or sentences) in the collection.
REL: relevant items in the collection; these SHOULD be extracted or retrieved.
RETR: items actually extracted/retrieved, some correctly (A = |REL ∩ RETR|), some incorrectly (B = |RETR − REL|).
|RETR| = A + B
Evaluating Extraction and Retrieval (cont.)
Performance Metrics (cont.) [Venn diagram of REL and RETR within the collection]
• |REL ∩ RETR| = A (relevant and retrieved)
• |RETR − REL| = B (retrieved but not relevant)
• |¬REL − RETR| = C (neither relevant nor retrieved)
• |REL − RETR| = D (relevant but not retrieved)
• |Collection| = N
Performance Metrics (cont.)
Precision: P = A / (A + B). How many of the retrieved/extracted items are correct.
Recall: R = A / (A + D). How many of the items that should be retrieved are recovered.
Accuracy: (A + C) / N (ratio of correctly classified items).
Combination Scores:
F-score: 2PR / (P + R). Harmonic mean, in the range [0, 1].
Fβ-score: (1 + β²)PR / (β²·P + R). β > 1 prefers recall; β < 1 prefers precision.
E-measure: 1 − Fβ-score. Inversely proportional to performance (an error measure).
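The definitions above translate directly into code; a minimal sketch with hypothetical item-id sets:

```python
def pr_metrics(relevant, retrieved, beta=1.0):
    """Precision, recall, and F_beta from sets of relevant and retrieved ids."""
    rel, ret = set(relevant), set(retrieved)
    a = len(rel & ret)                      # A = |REL intersect RETR|
    p = a / len(ret) if ret else 0.0        # precision = A / (A + B)
    r = a / len(rel) if rel else 0.0        # recall    = A / (A + D)
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if (p + r) else 0.0
    return p, r, f

print(pr_metrics(relevant={1, 2, 3, 4}, retrieved={1, 2, 5}))
# -> (0.667, 0.5, 0.571): 2 of 3 retrieved are correct; 2 of 4 relevant found
```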
Performance Metrics (cont.)
Precision-Recall Curves
[Figure: precision-recall curve for 7 retrieved, ranked documents, with 4 relevant documents in the collection; recall axis marked at 25%, 50%, 75%, 100%]
Performance Metrics (cont.)
Accounting for Ranks
For a given rank n, Pn: precision at rank n (P@n).
R-Precision: PR, where R is the number of relevant documents.
Average Scores:
Average Precision: average of the precision over all the ranks at which a relevant document is retrieved.
Mean Average Precision: mean of the Average Precision over all the queries.
Micro-Average: average over individual items across queries.
Macro-Average: average over queries.
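A minimal sketch of Average Precision; the ranked list and relevant set echo the 7-retrieved / 4-relevant setup of the curve on the previous slide but are otherwise made up:

```python
def average_precision(ranked_ids, relevant):
    """Average of P@n over the ranks at which a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for n, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / n          # P@n at this relevant hit
    return total / len(relevant) if relevant else 0.0

# 7 retrieved and ranked; 4 relevant documents in the collection.
print(round(average_precision([1, 9, 2, 3, 8, 7, 4], relevant={1, 2, 3, 4}), 3))
# -> 0.747 (mean of 1/1, 2/3, 3/4, 4/7)
```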
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Entity Recognition (ER) • Identifying the substance(s) • Rule- and context-based approach (manual), e.g., the suffix '-ase' for enzymes • Rule- and context-based approach (machine learning) • Dictionary-based approach • How the names are written: CDC28, cdc28, cdc28p, cdc-28 • Curation of the dictionary
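A minimal sketch of the dictionary-based approach, handling the spelling variants listed above (the one-gene dictionary is a toy; a real one would be curated):

```python
import re

# Toy gene dictionary: canonical name -> known spelling variants.
gene_synonyms = {"CDC28": ["CDC28", "cdc28", "cdc28p", "cdc-28"]}

def find_gene_mentions(text):
    """Dictionary-based ER: match every listed variant of each gene name."""
    hits = []
    for gene, variants in gene_synonyms.items():
        pattern = "|".join(re.escape(v) for v in variants)
        for m in re.finditer(rf"\b(?:{pattern})\b", text):
            hits.append((gene, m.group(), m.start()))
    return hits

print(find_gene_mentions("Levels of cdc28p rise as cdc-28 is transcribed."))
# -> both variants resolve to the canonical name CDC28
```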
Entity Recognition (ER) • Major challenge: lack of standardization of names • 'cdc2' refers to two completely unrelated genes in budding and fission yeast • 'SDS': serine dehydratase gene vs. sodium dodecyl sulfate vs. Shwachman-Diamond syndrome • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy
Entity Recognition (ER) • A simpler version of the task: deciding whether a symbol refers to a gene or its product • iHOP (Information Hyperlinked Over Proteins): http://www.pdg.cnb.uam.es/UniPub/iHOP
Vocabulary • Many, many controlled vocabularies • SNOMED, ICD, … • ICD (International Statistical Classification of Diseases and Related Health Problems)
Vocabulary • ICD examples:
573.3 Hepatitis, unspecified
  Toxic (noninfectious) hepatitis
  Use additional E code to identify cause
571.4 Chronic hepatitis
  Excludes: viral hepatitis (acute) (chronic) (070.0-070.9)
  571.49 Other
    Chronic hepatitis: active, aggressive
    Recurrent hepatitis
070 Viral hepatitis
  Includes: viral hepatitis (acute) (chronic)
  Excludes: cytomegalic inclusion virus hepatitis (078.5)
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Information Extraction (IE) • Extract pre-defined types of facts, in particular relationships between biological entities • Co-occurrence-based methods • Natural language processing (NLP) based methods
Information Extraction Usually requires: • Identifying the relevant sentences • Parsing them to extract the specific information • Assuming "well-behaved" fact sentences • Using co-occurrence relationships alone requires neither parsing nor well-structured fact sentences (see the sketch below)
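A minimal sketch of the co-occurrence approach: count entity pairs appearing in the same sentence, with no parsing at all. The entity list and sentences are toy data:

```python
import itertools
from collections import Counter

entities = {"fish oil", "blood viscosity", "raynaud syndrome"}

sentences = [
    "fish oil reduces blood viscosity in patients",
    "blood viscosity is increased in raynaud syndrome",
]

# Count entity pairs that co-occur in the same sentence.
pairs = Counter()
for s in sentences:
    found = sorted(e for e in entities if e in s)
    for a, b in itertools.combinations(found, 2):
        pairs[(a, b)] += 1

print(pairs.most_common())   # candidate relationships, ranked by frequency
```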
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Text Mining (TM) • The discovery by computer of new, previously unknown information, by automatically extracting information from different written records.
Text Mining • Based on transitivity of relationships in the co-occurrence graph: fish oil reduces (and co-occurs with) blood viscosity, platelet aggregability, and vascular reactivity, each of which is increased in (and co-occurs with) Raynaud's syndrome; hence fish oil may reduce Raynaud's syndrome. • This idea can be used to discover new facts by co-occurrence • Web tool: Arrowsmith
[Swanson 86, Swanson 87, Swanson 90, Swanson and Smalheiser 99, Weeber et al. 2001, Stapley & Benoit 2000, Srinivasan 2003, Srinivasan 2004]
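A minimal sketch of the Swanson-style transitivity step: A and C never co-occur directly, but both co-occur with the same intermediate B terms. The co-occurrence relations below are hard-coded toy data standing in for counts mined from the literature:

```python
# Term -> set of terms it co-occurs with in the literature (toy relations).
cooccurs_with = {
    "fish oil":         {"blood viscosity", "platelet aggregability",
                         "vascular reactivity"},
    "raynaud syndrome": {"blood viscosity", "platelet aggregability",
                         "vascular reactivity"},
}

def shared_intermediates(a, c):
    """B terms linking A to C by transitivity in the co-occurrence graph."""
    return cooccurs_with.get(a, set()) & cooccurs_with.get(c, set())

# Surfaces the (then-unpublished) fish oil -> Raynaud's hypothesis:
print(shared_intermediates("fish oil", "raynaud syndrome"))
```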
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Integration: combining text and biological data [Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]