Seeking Abbreviations From MEDLINE • Jeffrey T. Chang, Hinrich Schütze, Russ B. Altman • Presented by: Bo Han
Challenge • Huge: contains 12 million citations dating back to 1966 • Growing: 400,000 new citations per year • Common and uncontrolled use of abbreviations: authors create abbreviations in different ways
Example (abbreviation: definition) • VDR: vitamin D receptor • PTU: propylthiouracil • JNK: c-Jun N-terminal kinase • IFN: interferon • ATL: adult T-cell leukemia • Beta-EP: beta-endorphin
Previous Methods • Match initial letters (acronyms): filter out some common words, e.g. Office of Nuclear Waste Isolation (ONWI) • Use heuristics: favor matches on the first letter, on syllable boundaries, or on upper-case letters; difficulty: finding optimal weights • Use lexicon rules
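The initial-letter approach can be illustrated with a minimal sketch. The stop-word list and the function names `initials` and `matches_initials` are assumptions made for illustration; the slide does not say which common words earlier systems filtered out.

```python
# Hypothetical list of common words to skip; not taken from the slides.
COMMON_WORDS = {"of", "the", "and", "for", "in", "a", "an"}


def initials(long_form):
    """Join the first letters of the words that are not common words."""
    words = [w for w in long_form.split() if w.lower() not in COMMON_WORDS]
    return "".join(w[0].upper() for w in words)


def matches_initials(long_form, abbrev):
    """True when the abbreviation equals the filtered initial letters."""
    return initials(long_form) == abbrev.upper()
```

With these assumptions, `initials("Office of Nuclear Waste Isolation")` gives the four initial letters once "of" is dropped, while `matches_initials("propylthiouracil", "PTU")` is false, since PTU is not built from word-initial letters. Cases like the latter are what the alignment-based method is meant to cover.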
Motivation • An automatic method for detecting abbreviations is a first step toward understanding the biological literature • Rather than text-mining techniques, sequence alignment is used to find candidates • A classification algorithm is used to confirm candidates
Method • Scan the text for possible abbreviations • Align the candidates • Convert the abbreviations and alignments into feature vectors • Classify with a machine learning method
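Finding Candidates • Pattern: long form (abbreviation) • Assumption: ignore abbreviations longer than two words • Within parentheses, stop at a comma or semicolon • Search up to 3N words before the parenthesis for the long form (N = number of letters in the abbreviation)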
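The candidate search on this slide could be sketched roughly as follows. This is an illustration and not the authors' code: the regular expression, the function name `find_candidates`, and the reading of N as the number of alphanumeric characters in the abbreviation are all assumptions.

```python
import re

# Text inside one pair of parentheses; nested parentheses are not handled.
PAREN = re.compile(r"\(([^()]+)\)")


def find_candidates(sentence):
    """Return (prefix_words, abbreviation) pairs for 'long form (abbreviation)'."""
    candidates = []
    for match in PAREN.finditer(sentence):
        # Stop at the first comma or semicolon inside the parentheses.
        abbrev = re.split(r"[,;]", match.group(1))[0].strip()
        # Ignore candidates longer than two words.
        if not abbrev or len(abbrev.split()) > 2:
            continue
        # Assumption: N counts the alphanumeric characters of the abbreviation.
        n = sum(ch.isalnum() for ch in abbrev)
        if n == 0:
            continue
        # Keep at most 3N words before the opening parenthesis.
        words_before = sentence[:match.start()].split()
        candidates.append((words_before[-3 * n:], abbrev))
    return candidates
```

Each returned pair is only a candidate; whether the prefix really defines the abbreviation is decided later by alignment and classification.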
Aligning Abbreviations • Longest Common Subsequence (LCS) • Sequence alignment: maximize the number of matched letters between the long form and the abbreviation • O(NM) time for sequences of lengths N and M
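Aligning Abbreviations (example) • Long form: antioxidant response element • Abbreviation: ARE • [Figure: alternative alignments of the letters A, R, E against the long form]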
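A minimal sketch of the alignment step, using the textbook dynamic-programming algorithm for the longest common subsequence in O(NM) time. The function name `lcs_align` and the case-insensitive comparison are assumptions; the sketch returns a single optimal alignment, whereas a candidate may admit several.

```python
def lcs_align(long_form, abbrev):
    """Return positions in long_form matched to abbreviation letters (one LCS alignment)."""
    a, b = long_form.lower(), abbrev.lower()
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start, taking the leftmost match for each letter.
    matched = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            matched.append(i)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched
```

For "antioxidant response element" and ARE, this traceback matches E to the second letter of "response" rather than to the start of "element". Both alignments have three matched letters, so alignment length alone cannot tell them apart; the feature-based scoring is what separates them.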
Computing Features in an Alignment • Rather than using a scoring matrix, they use a feature vector • Chosen features (weight): • Lower/Upper Case: -1.21 • Beginning of Word: 5.54 • End of Word: -1.40 • Syllable Boundary: 2.08 • After Aligned Letter: 1.50 • Letters Aligned: 3.67 • Words Skipped: -5.82 • Aligned Letters per Word: 0.70 • Constant: -9.70
Scoring Alignment • Supervised learning • Training set = 1000 random candidates • Binary logistic regression classifier: P = 1 / (1 + exp(-w·x)) • P: probability that the candidate is an abbreviation • x: feature vector • w: weight vector • Find the w that best separates positive from negative examples
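To make the scoring step concrete, here is a minimal sketch of a logistic-regression score using the weights listed on the feature slide. The dictionary keys and the function name `abbreviation_probability` are hypothetical, and the slides do not say how each feature value is computed or normalized, so any feature values passed in are made-up inputs.

```python
import math

# Weights as listed on the feature slide; the key names are hypothetical.
WEIGHTS = {
    "lower_upper_case": -1.21,
    "beginning_of_word": 5.54,
    "end_of_word": -1.40,
    "syllable_boundary": 2.08,
    "after_aligned_letter": 1.50,
    "letters_aligned": 3.67,
    "words_skipped": -5.82,
    "aligned_letters_per_word": 0.70,
}
CONSTANT = -9.70


def abbreviation_probability(features):
    """Logistic function of the weighted feature sum; missing features count as 0."""
    z = CONSTANT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))
```

The signs of the weights show the intended behavior: letters aligned at the beginning of words push the probability up, while skipped words push it down sharply.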
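Evaluation • A gold standard corpus: http://www.medstract.org/gold-strandards.html • Recall (84%): the share of correct items that are retrieved; e.g. if 10 of 50 correct docs are not retrieved, recall = 40/50 = 80% • Precision (81%): the share of retrieved items that are correct; e.g. if 65 of 100 retrieved docs are relevant, precision = 65/100 = 65%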
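The two worked examples on this slide amount to the following arithmetic. The function names are illustrative, and the counts are the slide's own illustrative numbers, not the reported results.

```python
def recall(retrieved_correct, total_correct):
    """Fraction of all correct items that were retrieved."""
    return retrieved_correct / total_correct


def precision(retrieved_correct, total_retrieved):
    """Fraction of retrieved items that are correct."""
    return retrieved_correct / total_retrieved


# Slide examples: 10 of 50 correct items missed; 65 of 100 retrieved items relevant.
example_recall = recall(50 - 10, 50)
example_precision = precision(65, 100)
```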
Test it online: http://abbreviation.stanford.edu
Disadvantages • Strict form: candidates must follow the pattern long form (abbreviation) • Context is not considered • Some grammar rules are ignored in the feature vector • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12702985&dopt=Abstract
Conclusions • A novel algorithm for finding abbreviations • Combines sequence alignment and machine learning • Further work is expected to improve performance
References • Chang JT, Schütze H, Altman RB. Creating an Online Dictionary of Abbreviations from MEDLINE. The Journal of the American Medical Informatics Association 2002; 9(6): 612-20. • http://abbreviation.stanford.edu/ • Acronym Finder. http://www.acronymfinder.com/ • Iliopoulos I, Enright A, Ouzounis C. Textquest: document clustering of Medline abstracts for concept discovery in molecular biology. Pac Symp Biocomput 2001; 384-395.