Computational Linguistic Techniques Applied to Drugname Matching

Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
June 26, 2003

Drugname Matching
• String matching to rank similarity between drug names
• Two classes of string matching
  • orthographic: compare strings in terms of spelling, without reference to sound
  • phonological: compare strings on the basis of a phonetic representation
• Two methods of matching
  • distance: how far apart are two strings?
  • similarity: how close are two strings?

Distance and Similarity Measures: Orthographic/Phonological
• Orthographic
  • Distance: string-edit. Ex: contac / zantac = 2/6 = 0.33
  • Similarity: LCSR, DICE
    • Ex (LCSR): contac / zantac = 4/6 = 0.67
    • Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5+5) = 6/10 = 0.60
• Phonological
  • Distance: Soundex. Ex: contac / zantac = 1/4 = 0.25
  • Similarity: ALINE. Ex: contac / zantac = 0.64

Distance vs. Similarity: Examples
• Example 1: hordes vs lords
  • Distance = 2 (replace h with l, and delete e).
  • Similarity = 2 (bigrams or and rd in common).
• Example 2: water vs wine
  • Distance = 3 (replace a with i, t with n, delete r).
  • Similarity = 0 (no bigrams in common).
• We can compare (global) similarity and distance:
  • sim(w1,w2)/length
  • 1 − dist(w1,w2)/length

Orthographic Distance: string-edit
• Count the number of steps it takes to transform one string into another
• Examples:
  • Distance between hordes and lords is 2.
  • Distance between water and wine is 3.
• For "global distance", we can divide by the length of the longest string: 2/6 and 3/5 above
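The step counting above can be sketched with the standard unit-cost dynamic program (Levenshtein distance). This is a minimal illustration, assuming insertions, deletions and substitutions each cost 1; the function names are hypothetical and the slides do not say which cost scheme was actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions (assumed costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def global_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"))  # 2
print(edit_distance("water", "wine"))    # 3
```

Under these assumptions, `global_distance("hordes", "lords")` gives 2/6 and `global_distance("water", "wine")` gives 3/5, matching the figures on the slide.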

Orthographic Similarity: LCSR, DICE
• LCSR: Divide the length of the longest common subsequence by the length of the longest string
  • Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83
• DICE: Double the number of shared character bigrams and divide by the total number of bigrams in both strings
  • Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 2/5 = 0.40
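Both measures are short enough to sketch directly. The version below is an illustration under stated assumptions, not the authors' code: repeated bigrams are counted as a multiset (the slide speaks of bigram sets, which gives the same result for these examples), and all function names are hypothetical.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> list:
    """Overlapping character bigrams, e.g. 'reagir' -> re, ea, ag, gi, ir."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the shared bigrams over the total bigram count (multiset assumption)."""
    x, y = bigrams(a), bigrams(b)
    total = len(x) + len(y)
    shared = sum((Counter(x) & Counter(y)).values())
    return 2 * shared / total if total else 0.0


print(round(lcsr("reagir", "repair"), 2))  # 0.83
print(round(dice("reagir", "repair"), 2))  # 0.4
```

With these definitions, `dice("contac", "zantac")` shares the three bigrams nt, ta, ac and evaluates to 6/10 = 0.60.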

Phonological Matching
• Distance-based phonological matching
  • Soundex
• Similarity-based phonological matching
  • ALINE

Phonological Distance
• Soundex examples:
  • king and khyngge reduce to k52
  • knight and night reduce to k523 and n23
  • pulpit and phlebotomy reduce to p413
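The reductions above can be reproduced with a sketch of the common American Soundex rules. This is an assumption about the variant used: codes are padded with zeros to four characters (the slide omits the padding, so k52 appears here as K520), h and w do not break a run of equal codes, and `soundex_distance` is one hypothetical reading of how a figure such as 1/4 for contac / zantac could be obtained.

```python
# Standard Soundex digit classes; vowels, h, w and y carry no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str, length: int = 4) -> str:
    """First letter plus digit codes, truncated or zero-padded to `length`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return "".join(out)[:length].ljust(length, "0")


def soundex_distance(a: str, b: str, length: int = 4) -> float:
    """Fraction of code positions that differ (assumed interpretation)."""
    ca, cb = soundex(a, length), soundex(b, length)
    return sum(x != y for x, y in zip(ca, cb)) / length


print(soundex("king"), soundex("khyngge"))       # K520 K520
print(soundex("knight"), soundex("night"))       # K523 N230
print(soundex("pulpit"), soundex("phlebotomy"))  # P413 P413
```

The last line shows the failure case directly: two unrelated words collapse to the same code once the string is truncated to four characters.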

What went wrong?
• Truncation of word to four characters
  • Alternative: use the entire string
• Ignoring vowels
  • Alternative: use more sophisticated phonetic rules
• Using numbers instead of decomposable features
  • Alternative: use decomposable features

Phonological Similarity
• Another possible approach: compare syllable count, initial/final sounds, stress locations
  • Misses frequently confused pairs
• Alternative: use phonological features to compare two words by their sounds.
  • x# → k(s): +consonantal, +velar, +stop, -voice
  • #x → z: +consonantal, +alveolar, +fricative, +voice
• Phonological similarity of two words: optimal match between their phonological features.
  • Zantac
  • Xanax

Kondrak – ALINE (2000)
• Two fundamental components of ALINE:
  • Similarity function: uses linguistic feature analysis, with measurements based on salience, e.g., ±alveolar and ±stop are more salient than ±voice
  • Method for choosing the optimal alignment: creates an alignment based on a weighted multi-feature analysis
• Designed to align phonetic sequences for many different CL applications
• Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
• Feature weights can be fine-tuned for a specific application.
• Efficient: dynamic programming algorithm with quadratic running time
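The two components named above, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a deliberately simplified sketch. This is not ALINE itself: it performs a plain global alignment, and the feature values, salience weights, `MAX_SCORE` and `SKIP` are all assumed for illustration (only the manner values 1.0 for stops and 0.8 for fricatives appear in the slides). The inputs `zantak` and `zanaks` are rough one-letter-per-sound transcriptions, also an assumption.

```python
# Illustrative feature values in [0, 1]; mostly assumed, not ALINE's tables.
FEATURES = {
    "t": {"place": 0.85, "manner": 1.0, "voice": 0.0},
    "k": {"place": 0.60, "manner": 1.0, "voice": 0.0},
    "s": {"place": 0.85, "manner": 0.8, "voice": 0.0},
    "z": {"place": 0.85, "manner": 0.8, "voice": 1.0},
    "n": {"place": 0.85, "manner": 1.0, "voice": 1.0},
    "a": {"place": 0.60, "manner": 0.0, "voice": 1.0},
}
SALIENCE = {"place": 40.0, "manner": 50.0, "voice": 10.0}  # assumed weights
MAX_SCORE = 35.0  # assumed reward for aligning identical segments
SKIP = -10.0      # assumed insertion/deletion penalty


def sigma(p: str, q: str) -> float:
    """Similarity of two segments: MAX_SCORE minus salience-weighted feature distance."""
    delta = sum(w * abs(FEATURES[p][f] - FEATURES[q][f]) for f, w in SALIENCE.items())
    return MAX_SCORE - delta


def align_score(x: str, y: str) -> float:
    """Best global alignment score; O(len(x) * len(y)) time."""
    rows, cols = len(x) + 1, len(y) + 1
    S = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + SKIP
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + SKIP
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),  # align segments
                          S[i - 1][j] + SKIP,                           # skip in x
                          S[i][j - 1] + SKIP)                           # skip in y
    return S[-1][-1]


def similarity(x: str, y: str) -> float:
    """Alignment score normalized by the best self-alignment of the longer word."""
    return align_score(x, y) / (MAX_SCORE * max(len(x), len(y)))


print(similarity("zantak", "zanaks"))
```

The scores from this toy will not match the ALINE values reported in these slides, since the real system uses a richer feature set and tuned parameters; the sketch only shows how feature salience and a quadratic alignment fit together.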

ALINE Features: Weights and Values

Places of Articulation: Numerical Values

Manner of Articulation: Numerical Values
• stop 1.0. Example: p, b
• affricate 0.9. Example: th
• fricative 0.8. Example: f, v

Tuning of ALINE Parameters
• Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
• Parameter tuning:
  • calculate weights for drugname matching
  • "hill climbing" search against the gold standard
• Tuned parameters for the drugname task:
  • maximum score
  • insertion/deletion penalty
  • vowel penalty
  • phonological feature values
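A hill-climbing search of the kind mentioned above can be sketched generically. The code below is a minimal illustration and not the tuning procedure actually used: `hill_climb`, the fixed step size, and the toy objective are assumptions. In the real task `evaluate` would score a parameter setting against the gold standard (for example by precision); here a simple function with a known optimum stands in for it.

```python
def hill_climb(params: dict, evaluate, step: float = 0.5, max_rounds: int = 1000):
    """Greedy coordinate search: nudge one parameter at a time, keep any improvement."""
    best = dict(params)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for name in list(best):
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] += delta
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:  # local optimum for this step size
            break
    return best, best_score


# Stand-in objective with a known peak (hypothetical values, not tuned ALINE settings).
TARGET = {"max_score": 3.0, "skip_penalty": -1.0}


def toy_objective(p: dict) -> float:
    return -sum((p[k] - TARGET[k]) ** 2 for k in TARGET)


best, score = hill_climb({"max_score": 0.0, "skip_penalty": 0.0}, toy_objective)
print(best)
```

Hill climbing only guarantees a local optimum, so in practice the result depends on the starting point and step size.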

Comparison of Outputs
• ALINE: 0.792 zantac xanax; 0.639 zantac contac; 0.486 xanax contac
• EDIT: 0.500 zantac xanax; 0.667 zantac contac; 0.333 xanax contac
• LCSR: 0.545 zantac xanax; 0.667 zantac contac; 0.364 xanax contac
• DICE: 0.222 zantac xanax; 0.600 zantac contac; 0.000 xanax contac

Evaluation
• Precision and recall against an online gold standard: USP Quality Review, March 2001.
• 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced)
• Example (using DICE):
  + 0.889 atgam ratgam
  + 0.875 herceptin perceptin
  - 0.870 zolmitriptan zolomitriptan
  + 0.857 quinidine quinine
  - 0.857 cytosar cytosar-u
  + 0.842 amantadine rimantadine
  : : : :
  - 0.800 erythrocin erythromycin
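The evaluation numbers above can be checked with a short sketch. It assumes that "+" marks a pair present in the gold standard and "-" a pair that is not, which the slide does not state explicitly; the function name is hypothetical. Only the six top-ranked DICE pairs listed on the slide are used, so the recall values cover just the head of the ranking.

```python
from math import comb


def precision_recall_by_rank(ranked_labels, total_true):
    """(precision, recall) after each rank position of a scored, sorted pair list."""
    points, hits = [], 0
    for k, is_true in enumerate(ranked_labels, start=1):
        hits += int(is_true)
        points.append((hits / k, hits / total_true))
    return points


# 582 names give 582 * 581 / 2 unordered pairs.
print(comb(582, 2))  # 169071

# Labels of the six top-ranked DICE pairs, read as + = True, - = False (assumed).
top_six = [True, True, False, True, False, True]
for rank, (p, r) in enumerate(precision_recall_by_rank(top_six, total_true=399), start=1):
    print(rank, round(p, 3), round(r, 4))
```

Precision at a given recall level is then read off by walking down the ranking until the desired fraction of the 399 true pairs has been retrieved.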

Comparison of Precision at Different Recall Values

Precision of Techniques with Phonetic Transcription

Conclusion
• Experimentation with different algorithms and their combinations against the gold standard.
• ALINE: strong foundation for search modules in automating the minimization of medication errors
• Fine-tuning based on comparisons with the gold standard (e.g., re-weighting of phonological features).
• Related to pattern recognition: discover patterns of predictable matches based on feature values
