Literature Mining BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University
Announcement • HW #3 is cancelled. The grades will be adjusted accordingly.
Acknowledgement Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Acknowledgement • Dr. Hongyu Peng (Brandeis Univ.) and Dr. Hagit Shatkay (http://www.shatkay.org) provided part of the slides.
Connecting the dots • The story of thalidomide: from sedative, to cause of birth defects, to anti-cancer drug
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Information Retrieval (IR) • Finding the papers • IR systems aim to identify the text segments (be they full articles, abstracts, paragraphs, or sentences) that pertain to a certain topic (e.g., the yeast cell cycle). • E.g., PubMed, Google Scholar • Ad hoc IR • Text categorization (pre-defined set of categories) • Advanced systems integrate entity recognition
Ad Hoc IR • User provides a query • Boolean model • Index-based (e.g., "gene AND CD")
Boolean Queries
DB: database of documents.
Vocabulary: {t1, …, tM} (terms in DB, produced by the tokenization stage)
Index structure: maps each term to all the documents containing it, e.g., index entries for "acquired immunodeficiency", "asthma", "blood", "blood pressure".
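A minimal sketch of an inverted index with Boolean AND retrieval; the toy documents and whitespace tokenizer are illustrative assumptions, not part of any real system:

```python
from collections import defaultdict

docs = {
    1: "gene expression in crohn disease",
    2: "cytosine deaminase gene therapy",
    3: "blood pressure and capillary density",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # trivial tokenization stage
        index[term].add(doc_id)

def boolean_and(*terms):
    """Boolean AND query: documents containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("gene", "disease"))   # -> {1}
```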
Ad Hoc IR • User provides a query • Boolean model • Challenges: • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy: "CD" (54,745 PubMed entries) can mean cytosine deaminase, cortical dysplasia, compact disk, Crohn's disease, Chagas' disease, capillary density, …
Ad Hoc IR • User provides a query • Vector-based model • Similarity query (vector based) • Semantic search, TIME (Sept 5, 2005): "Search engines are good at matching words … The next step is semantic search – looking for meaning, not just matching key words. … Nervana, which analyzes language by linking word patterns contextually to answer questions in defined subject areas, such as medical-research literature."
The Vector Model
DB: database of documents. Vocabulary: {t1, …, tM} (terms in DB). Document d ∈ DB: a vector <w1d, …, wMd> of weights.
Weighting Principles
• Document frequency: terms occurring in a few documents are more useful than terms occurring in many.
• Local term frequency: terms occurring frequently within a document are likely to be significant for the document.
• Document length: a term occurring the same number of times in a long document and in a short one has less significance in the long one.
• Relevance: terms occurring in documents judged as relevant to a query are likely to be significant (w.r.t. the query). [Sparck Jones et al. 98]
Some Weighting Schemes:
• Binary: Wid = 1 if ti ∈ d, 0 otherwise.
• TF: Wid = fid = # of times ti occurs in d (considers local term frequency).
• TF × IDF (one version): Wid = fid / fi, where fi = # of docs containing ti (considers local term frequency and (inverse) document frequency).
Vector-Based Similarity
Document d = <w1d, …, wMd> ∈ DB; query q = <w1q, …, wMq> (q could itself be a document in DB...)
Sim(q, d) = cosine(q, d) = (q · d) / (|q| |d|)
[Salton 89; Witten et al. 99] (introductory IR)
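A sketch tying the two slides together: TF × IDF weights in the "one version" above (Wid = fid / fi) and cosine similarity for ranking. The corpus and query are toy data:

```python
import math
from collections import Counter

docs = [
    "fish oil reduces blood viscosity",
    "blood viscosity increases in raynaud syndrome",
    "fish oil and platelet aggregability",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# f_i = number of documents containing term t_i
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf_vector(tokens):
    """W_id = f_id / f_i: local term frequency over document frequency."""
    tf = Counter(tokens)
    return [tf[t] / df[t] for t in vocab]

def cosine(u, v):
    """Sim(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tfidf_vector("blood viscosity".split())
for text, doc in zip(docs, tokenized):
    print(f"{cosine(q, tfidf_vector(doc)):.3f}  {text}")
```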
Probabilistic Models
Query q; document d.
• Goal: find all d such that P(relevant | d, q) is high.
• Maximize the log-odds: log[ P(relevant | d, q) / P(irrelevant | d, q) ]
[Sparck Jones et al. 98, Sahami 98, Ponte & Croft 98, Hofmann 99]
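One common way to realize this log-odds ranking (not necessarily the one the cited papers use) is a naive Bayes model over terms. A minimal sketch with made-up relevance-labeled documents:

```python
import math
from collections import Counter

# Toy training data: documents judged relevant / irrelevant to a query.
relevant   = ["fish oil raynaud syndrome", "fish oil blood viscosity"]
irrelevant = ["compact disk storage", "cortical dysplasia imaging"]

vocab = {t for d in relevant + irrelevant for t in d.split()}

def term_probs(docs):
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    # Laplace smoothing so unseen terms don't send the log to -infinity.
    return {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}

p_rel, p_irr = term_probs(relevant), term_probs(irrelevant)

def log_odds(doc):
    """log P(relevant | d, q) - log P(irrelevant | d, q), up to the prior."""
    return sum(math.log(p_rel[t]) - math.log(p_irr[t])
               for t in doc.split() if t in vocab)

print(log_odds("fish oil viscosity"))     # high  -> rank as relevant
print(log_odds("compact disk imaging"))   # low   -> rank as irrelevant
```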
Latent Semantic Analysis [Deerwester, Dumais et al. 1988, 1990] • Motivation: • Overcoming synonymy and polysemy. • Reducing dimensionality. • Idea: • Project from the "explicit term" space to a lower-dimensional "abstract concept" space. • Methodology: • Singular value decomposition (SVD) applied to the document-term matrix. • The dimensions with the highest singular values are used as the features for representing documents.
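A minimal sketch of the projection step with a truncated SVD; the document-term matrix here is made up to show two latent "concepts":

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms).
A = np.array([
    [2, 1, 0, 0],   # docs 1-2 use the first pair of terms
    [1, 2, 0, 0],
    [0, 0, 2, 1],   # docs 3-4 use the second pair of terms
    [0, 0, 1, 2],
], dtype=float)

# SVD: A = U * diag(s) * Vt; keep the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = U[:, :k] * s[:k]   # documents in the k-dim "concept" space

print(np.round(docs_latent, 2))  # docs 1-2 and 3-4 cluster in concept space
```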
Information Retrieval - Details (cont.)
Text Categorization (semantic)
Automatically place documents in the right categories (e.g., Cancer, Apoptosis, Elongation) so as to make them easy to find.
Information Retrieval - Details (cont.)
Rule-Based Text Classification
A knowledge-engineering approach. Boolean rules (DNF), based on the presence/absence of specific terms within the document, decide its membership in the class (e.g., the CONSTRUE system [Hayes et al. 90, 92]).
Example:
If ((<GENE_Name> ∧ transcript) ∨ (<GENE_Name> ∧ Western Blot) ∨ (<GENE_Name> ∧ Northern Blot))
Then GeneExpressionDoc
Else ¬GeneExpressionDoc
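A toy rendering of that DNF rule (not the actual CONSTRUE system; the function name and examples are illustrative):

```python
def is_gene_expression_doc(text, gene_name):
    """DNF rule: class membership if the gene co-occurs with 'transcript'
    or with a Western/Northern blot experiment."""
    t = text.lower()
    if gene_name.lower() not in t:
        return False
    return ("transcript" in t) or ("western blot" in t) or ("northern blot" in t)

print(is_gene_expression_doc("CDC28 transcript levels rise in G1", "CDC28"))  # True
print(is_gene_expression_doc("CDC28 localizes to the nucleus", "CDC28"))      # False
```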
Information Retrieval-Details(cont.) Machine Learning for Text Classification (supervised) • Take a training set of pre-classified documents • Build a model for the classes from the training examples • Assign each new document to the class that best fits it • (e.g. closest or most-probable class.) • Types of class assignment: • Hard: Each document belongs to exactly one class • Soft: Each document is assigned a “degree of membership” in several classes • Methods • Nearest neighbor • Summarizing document vectors • SVM, Bayesian, boosting
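A minimal sketch of the supervised route using scikit-learn (assumed installed; the training texts and labels are made up, and SVM is just one of the methods listed above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Training set of pre-classified documents (toy examples).
train_texts = [
    "p53 induces apoptosis in tumor cells",
    "caspase activation triggers apoptosis",
    "ribosome stalling during elongation",
    "elongation factor binds the ribosome",
]
train_labels = ["apoptosis", "apoptosis", "elongation", "elongation"]

# Build a model for the classes from the training examples.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

# Hard assignment: each new document goes to exactly one class.
print(clf.predict(["bcl-2 blocks apoptosis"]))
```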
Evaluating Extraction and Retrieval • To say how good a system is, we need: • Performance metrics (numerical measures) • Benchmarks on which performance is measured (the gold standard).
Evaluating Extraction and Retrieval (cont.)
Performance Metrics
N items (e.g., documents, terms, or sentences) in the collection.
REL: relevant items in the collection; these SHOULD be extracted or retrieved.
RETR: items actually extracted/retrieved, some correctly (A = |REL ∩ RETR|), some incorrectly (B = |RETR − REL|).
|RETR| = A + B
Evaluating Extraction and Retrieval (cont.)
Performance Metrics (cont.) [Venn diagram of REL and RETR within the collection]
• |REL ∩ RETR| = A (relevant and retrieved)
• |RETR − REL| = B (retrieved but not relevant)
• |¬REL − RETR| = C (neither relevant nor retrieved)
• |REL − RETR| = D (relevant but not retrieved)
• |Collection| = N
Performance Metrics (cont.)
Precision: P = A / (A + B). How many of the retrieved/extracted items are correct.
Recall: R = A / (A + D). How many of the items that should be retrieved are recovered.
Accuracy: (A + C) / N (ratio of correctly classified items).
Combination Scores:
F-score: 2PR / (P + R). Harmonic mean, in the range [0, 1].
Fβ-score: (1 + β²)PR / (β²·P + R). β > 1 prefers recall; β < 1 prefers precision.
E-measure: 1 − Fβ-score. Inversely proportional to performance (an error measure).
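The definitions above translate directly into code; a minimal sketch with hypothetical item-id sets:

```python
def pr_metrics(relevant, retrieved, beta=1.0):
    """Precision, recall, and F_beta from sets of relevant and retrieved ids."""
    rel, ret = set(relevant), set(retrieved)
    a = len(rel & ret)                      # A = |REL intersect RETR|
    p = a / len(ret) if ret else 0.0        # precision = A / (A + B)
    r = a / len(rel) if rel else 0.0        # recall    = A / (A + D)
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if (p + r) else 0.0
    return p, r, f

print(pr_metrics(relevant={1, 2, 3, 4}, retrieved={1, 2, 5}))
# -> (0.667, 0.5, 0.571): 2 of 3 retrieved are correct; 2 of 4 relevant found
```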
Performance Metrics (cont.)
Precision-Recall Curves
[Figure: precision-recall curve for 7 retrieved, ranked documents, with 4 relevant documents in the collection; recall axis marked at 25%, 50%, 75%, 100%]
Performance Metrics (cont.)
Accounting for Ranks
For a given rank n, Pn: precision at rank n (P@n).
R-Precision: PR, where R is the number of relevant documents.
Average Scores:
Average Precision: average of the precision over all the ranks at which a relevant document is retrieved.
Mean Average Precision: mean of the Average Precision over all the queries.
Micro-Average: average over individual items across queries.
Macro-Average: average over queries.
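A minimal sketch of Average Precision; the ranked list and relevant set echo the 7-retrieved / 4-relevant setup of the curve on the previous slide but are otherwise made up:

```python
def average_precision(ranked_ids, relevant):
    """Average of P@n over the ranks at which a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for n, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / n          # P@n at this relevant hit
    return total / len(relevant) if relevant else 0.0

# 7 retrieved and ranked; 4 relevant documents in the collection.
print(round(average_precision([1, 9, 2, 3, 8, 7, 4], relevant={1, 2, 3, 4}), 3))
# -> 0.747 (mean of 1/1, 2/3, 3/4, 4/7)
```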
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Entity Recognition (ER) • Identifying the substance(s) • Rule- and context-based approach (manual), e.g., the suffix '-ase' for enzymes • Rule- and context-based approach (machine learning) • Dictionary-based approach • How the names are written: CDC28, cdc28, cdc28p, cdc-28 • Curation of the dictionary
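A minimal sketch of the dictionary-based approach, handling the spelling variants listed above (the one-gene dictionary is a toy; a real one would be curated):

```python
import re

# Toy gene dictionary: canonical name -> known spelling variants.
gene_synonyms = {"CDC28": ["CDC28", "cdc28", "cdc28p", "cdc-28"]}

def find_gene_mentions(text):
    """Dictionary-based ER: match every listed variant of each gene name."""
    hits = []
    for gene, variants in gene_synonyms.items():
        pattern = "|".join(re.escape(v) for v in variants)
        for m in re.finditer(rf"\b(?:{pattern})\b", text):
            hits.append((gene, m.group(), m.start()))
    return hits

print(find_gene_mentions("Levels of cdc28p rise as cdc-28 is transcribed."))
# -> both variants resolve to the canonical name CDC28
```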
Entity Recognition (ER) • Major challenge: lack of standardization of names • 'cdc2' refers to two completely unrelated genes in budding and fission yeast • 'SDS': serine dehydratase gene vs. sodium dodecyl sulfate vs. Shwachman-Diamond syndrome • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy
Entity Recognition (ER) • A simpler version of the task: deciding whether a symbol refers to a gene or its product • iHOP (Information Hyperlinked Over Proteins): http://www.pdg.cnb.uam.es/UniPub/iHOP
Vocabulary • Many, many controlled vocabularies • SNOMED, ICD, … • ICD (International Statistical Classification of Diseases and Related Health Problems)
Vocabulary • ICD examples:
573.3 Hepatitis, unspecified
  Toxic (noninfectious) hepatitis
  Use additional E code to identify cause
571.4 Chronic hepatitis
  Excludes: viral hepatitis (acute) (chronic) (070.0-070.9)
  571.49 Other
    Chronic hepatitis: active, aggressive
    Recurrent hepatitis
070 Viral hepatitis
  Includes: viral hepatitis (acute) (chronic)
  Excludes: cytomegalic inclusion virus hepatitis (078.5)
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Information Extraction (IE) • Extract pre-defined types of facts, in particular relationships between biological entities • Co-occurrence-based methods • Natural language processing (NLP) based methods
Information Extraction Usually requires: • Identifying the relevant sentences • Parsing them to extract the specific information • Assuming "well-behaved" fact sentences • Using co-occurrence relationships alone requires neither parsing nor well-structured fact sentences (see the sketch below)
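A minimal sketch of the co-occurrence approach: count entity pairs appearing in the same sentence, with no parsing at all. The entity list and sentences are toy data:

```python
import itertools
from collections import Counter

entities = {"fish oil", "blood viscosity", "raynaud syndrome"}

sentences = [
    "fish oil reduces blood viscosity in patients",
    "blood viscosity is increased in raynaud syndrome",
]

# Count entity pairs that co-occur in the same sentence.
pairs = Counter()
for s in sentences:
    found = sorted(e for e in entities if e in s)
    for a, b in itertools.combinations(found, 2):
        pairs[(a, b)] += 1

print(pairs.most_common())   # candidate relationships, ranked by frequency
```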
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Text Mining (TM) • The discovery by computer of new, previously unknown information, by automatically extracting information from different written records.
Text Mining • Based on transitivity of relationships in the co-occurrence graph: fish oil reduces (and co-occurs with) blood viscosity, platelet aggregability, and vascular reactivity, each of which is increased in (and co-occurs with) Raynaud's syndrome; hence fish oil may reduce Raynaud's syndrome. • This idea can be used to discover new facts by co-occurrence • Web tool: Arrowsmith
[Swanson 86, Swanson 87, Swanson 90, Swanson and Smalheiser 99, Weeber et al. 2001, Stapley & Benoit 2000, Srinivasan 2003, Srinivasan 2004]
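A minimal sketch of the Swanson-style transitivity step: A and C never co-occur directly, but both co-occur with the same intermediate B terms. The co-occurrence relations below are hard-coded toy data standing in for counts mined from the literature:

```python
# Term -> set of terms it co-occurs with in the literature (toy relations).
cooccurs_with = {
    "fish oil":         {"blood viscosity", "platelet aggregability",
                         "vascular reactivity"},
    "raynaud syndrome": {"blood viscosity", "platelet aggregability",
                         "vascular reactivity"},
}

def shared_intermediates(a, c):
    """B terms linking A to C by transitivity in the co-occurrence graph."""
    return cooccurs_with.get(a, set()) & cooccurs_with.get(c, set())

# Surfaces the (then-unpublished) fish oil -> Raynaud's hypothesis:
print(shared_intermediates("fish oil", "raynaud syndrome"))
```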
[Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]
Integration: combining text and biological data [Figure from Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768]