Computational Linguistic Techniques Applied to Drugname Matching • Bonnie J. Dorr, University of Maryland • Greg Kondrak, University of Alberta • June 26, 2003
Drugname Matching • String matching to rank similarity between drug names • Two classes of string matching • orthographic: Compare strings in terms of spelling without reference to sound • phonological: Compare strings on the basis of a phonetic representation • Two methods of matching • distance: How far apart are two strings? • similarity: How close are two strings?
Distance and Similarity Measures: Orthographic/Phonological • Orthographic • Distance: string-edit. Ex: contac / zantac = 2/6 = 0.33 • Similarity: LCSR, DICE. Ex (LCSR): contac / zantac = 4/6 = 0.67. Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5+5) = 6/10 = 0.60 • Phonological • Distance: Soundex. Ex: contac / zantac = 1/4 = 0.25 • Similarity: ALINE. Ex: contac / zantac = 0.64
Distance vs. Similarity: Examples • Example 1: hordes vs lords • Distance = 2 (replace h with l, and delete e). • Similarity = 2 (bigrams "or" and "rd" in common). • Example 2: water vs wine • Distance = 3 (replace a with i, replace t with n, delete r). • Similarity = 0 (no bigrams in common). • Normalizing by length puts (global) similarity and distance on a comparable scale: • sim(w1,w2)/length • 1 − dist(w1,w2)/length
Orthographic Distance: string-edit • Count the minimum number of single-character steps (insert, delete, replace) needed to transform one string into another • Examples: • Distance between hordes and lords is 2. • Distance between water and wine is 3. • For "global distance", we can divide by the length of the longest string: 2/6 and 3/5 above
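The string-edit counts above can be reproduced with a standard dynamic-programming (Levenshtein) routine. This is a minimal sketch, assuming unit cost for every insertion, deletion, and replacement, which matches the worked examples on the slide; the function names are illustrative and not from the presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and replacements needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]


def global_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"), global_distance("hordes", "lords"))
print(edit_distance("water", "wine"), global_distance("water", "wine"))
```

Keeping only two rows of the table keeps memory linear in the length of the second string, while the running time stays proportional to the product of the two lengths.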
Orthographic Similarity: LCSR, DICE • LCSR: Divide the length of the longest common subsequence by the length of the longest string • Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83 • DICE: Double the number of shared character bigrams and divide by the total number of bigrams in the two strings combined • Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 4/10 = 0.40
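Both similarity measures are short to implement, and the sketch below follows the definitions on the slide. One assumption is labeled in the code: repeated bigrams are counted as a multiset rather than a set, which makes no difference for the examples shown here. Function names are illustrative.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> list:
    """Overlapping character bigrams, e.g. 'contac' -> co on nt ta ac."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the shared bigrams over the total bigrams of both strings.
    Assumption: bigrams are counted as a multiset, not a set."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    shared = sum((ba & bb).values())
    total = sum(ba.values()) + sum(bb.values())
    return 2 * shared / total if total else 0.0


print(round(lcsr("reagir", "repair"), 2), round(dice("reagir", "repair"), 2))
```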
Phonological Matching • Distance-based phonological matching • Soundex • Similarity-based phonological matching • ALINE
Phonological Distance • Soundex Examples: • king and khyngge reduce to k52 • knight and night reduce to k523 and n23 • pulpit and phlebotomy reduce to p413
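The codes on this slide can be checked with a small Soundex routine. The sketch below assumes the common American Soundex rules (keep the first letter, code the remaining consonants, collapse adjacent repeats, let h and w pass through without separating repeats, truncate to four characters). The slide writes codes in lowercase and without zero padding, so the `pad` switch is an illustrative addition.

```python
# Digit classes used by American Soundex; vowels, h, w, and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str, length: int = 4, pad: bool = True) -> str:
    """Soundex code of word, truncated (and optionally zero-padded) to length."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat across a vowel is kept
    result = (first.upper() + "".join(digits))[:length]
    return result.ljust(length, "0") if pad else result


for w in ("king", "khyngge", "knight", "night", "pulpit", "phlebotomy"):
    print(w, soundex(w, pad=False))
```

The last pair shows the weakness the slide points at: pulpit and phlebotomy receive the same code because everything after the fourth character is discarded and vowels are ignored.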
What went wrong? • Truncation of word to four characters (alternative: use entire string) • Ignoring vowels (alternative: use more sophisticated phonetic rules) • Using numbers instead of decomposable features (alternative: use decomposable features)
Phonological Similarity • Another possible approach: Compare syllable count, initial/final sounds, stress locations • Misses frequently confused pairs • Alternative: Use phonological features to compare two words by their sounds. • x#→k(s) (word-final x): +consonantal, +velar, +stop, -voice • #x→z (word-initial x): +consonantal, +alveolar, +fricative, +voice • Phonological similarity of two words: Optimal match between their phonological features. • Example pair: Zantac / Xanax
Kondrak – ALINE (2000) • Two fundamental components of ALINE: • Similarity Function: Uses linguistic feature analysis with features weighted by salience, e.g., ±alveolar and ±stop are more salient than ±voice • Method for choosing optimal alignment: creates alignment based on a weighted multi-feature analysis • Designed to align phonetic sequences for many different CL applications • Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur) • Feature weights can be fine-tuned for a specific application. • Efficient: quadratic-time dynamic programming algorithm
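The two components named above, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a toy example. This is not ALINE: it uses plain global alignment, and every number in it (salience weights, feature values, maximum score, gap penalty) is a hypothetical placeholder chosen only so that manner and place outweigh voicing, as the slide describes.

```python
# Illustrative salience weights (assumption): manner and place outweigh voice.
SALIENCE = {"manner": 50.0, "place": 40.0, "voice": 10.0}

# Hypothetical feature values in [0, 1] for four segments. Manner reuses the
# slide's stop = 1.0 and fricative = 0.8; the place values are made up.
FEATURES = {
    "k": {"manner": 1.0, "place": 0.60, "voice": 0.0},  # velar stop
    "t": {"manner": 1.0, "place": 0.85, "voice": 0.0},  # alveolar stop
    "s": {"manner": 0.8, "place": 0.85, "voice": 0.0},  # alveolar fricative
    "z": {"manner": 0.8, "place": 0.85, "voice": 1.0},  # voiced alveolar fricative
}

MAX_SCORE = 35.0    # hypothetical score for two identical segments
GAP_PENALTY = 10.0  # hypothetical insertion/deletion penalty


def segment_similarity(p: str, q: str) -> float:
    """Maximum score minus the salience-weighted feature difference."""
    diff = sum(weight * abs(FEATURES[p][f] - FEATURES[q][f])
               for f, weight in SALIENCE.items())
    return MAX_SCORE - diff


def alignment_score(x: str, y: str) -> float:
    """Best global alignment score of two segment strings.
    Fills a len(x) by len(y) table, so the running time is quadratic."""
    prev = [-GAP_PENALTY * j for j in range(len(y) + 1)]
    for i, p in enumerate(x, start=1):
        curr = [-GAP_PENALTY * i]
        for j, q in enumerate(y, start=1):
            curr.append(max(prev[j - 1] + segment_similarity(p, q),  # align p with q
                            prev[j] - GAP_PENALTY,                   # skip p
                            curr[j - 1] - GAP_PENALTY))              # skip q
        prev = curr
    return prev[-1]


# Word-final x as "ks" against word-initial x as "z":
print(alignment_score("ks", "z"))
```

With these placeholder values the best alignment pairs s with z and skips k, because the two alveolar fricatives differ only in voicing, the least salient feature in the table.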
Manner of Articulation: Numerical Values • stop 1.0 (example: p, b) • affricate 0.9 (example: th) • fricative 0.8 (example: f, v)
Tuning of ALINE Parameters • Parameters have default settings for cognate matching task, but not appropriate for drugname matching • Parameter tuning: • calculate weights for drugname matching • “Hill Climbing” search against gold standard • Tuned parameters for drugname task • maximum score • insertion/deletion penalty • vowel penalty • phonological feature values
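A hill-climbing search of the kind named above can be sketched generically. The version below is a simple coordinate-wise variant: it nudges one parameter at a time by a fixed step and keeps any change that raises the objective. The objective function, parameter names, and step size are all assumptions; in the drugname setting the objective would presumably be some quality measure computed against the gold standard, which is not reproduced here.

```python
from typing import Callable, Dict, Tuple


def hill_climb(params: Dict[str, float],
               objective: Callable[[Dict[str, float]], float],
               step: float = 0.1,
               max_iters: int = 1000) -> Tuple[Dict[str, float], float]:
    """Greedy coordinate-wise search that maximizes objective(params)."""
    best = dict(params)
    best_score = objective(best)
    names = list(best)
    for _ in range(max_iters):
        improved = False
        for name in names:
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] += delta
                score = objective(candidate)
                if score > best_score:  # keep only strict improvements
                    best, best_score, improved = candidate, score, True
        if not improved:
            break  # local optimum at this step size
    return best, best_score


# Toy objective with a single peak at gap = 1.0, vowel = -0.5 (hypothetical names).
def toy_objective(p: Dict[str, float]) -> float:
    return -(p["gap"] - 1.0) ** 2 - (p["vowel"] + 0.5) ** 2


tuned, score = hill_climb({"gap": 0.0, "vowel": 0.0}, toy_objective)
print(tuned, score)
```

Hill climbing only finds a local optimum, so results can depend on the starting values and the step size; restarting from several starting points is a common safeguard.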
Comparison of Outputs (EDIT is reported as a similarity, 1 − distance/length)
• ALINE: zantac / xanax 0.792; zantac / contac 0.639; xanax / contac 0.486
• EDIT: zantac / xanax 0.500; zantac / contac 0.667; xanax / contac 0.333
• LCSR: zantac / xanax 0.545; zantac / contac 0.667; xanax / contac 0.364
• DICE: zantac / xanax 0.222; zantac / contac 0.600; xanax / contac 0.000
Evaluation • Precision and recall against online gold standard: USP Quality Review, March 2001. • 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced) • Example (using DICE):
+ 0.889 atgam ratgam
+ 0.875 herceptin perceptin
- 0.870 zolmitriptan zolomitriptan
+ 0.857 quinidine quinine
- 0.857 cytosar cytosar-u
+ 0.842 amantadine rimantadine
: : : :
- 0.800 erythrocin erythromycin
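The evaluation setup can be sketched as follows: score every unordered pair of names, rank the pairs by score, and measure precision and recall of the top k against the gold-standard pairs. The data structures here (a set of `frozenset` pairs for the gold standard, a cutoff `k`) and the toy scoring function are assumptions made for illustration; the actual USP list is not reproduced.

```python
from itertools import combinations
from math import comb


def rank_pairs(names, score):
    """Score every unordered pair of distinct names, highest score first."""
    pairs = combinations(sorted(set(names)), 2)
    return sorted(((score(a, b), a, b) for a, b in pairs), reverse=True)


def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked pairs against the gold set."""
    top = ranked[:k]
    hits = sum(1 for _, a, b in top if frozenset((a, b)) in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# 582 names induce 582 * 581 / 2 unordered pairs.
print(comb(582, 2))

# Toy run with a stand-in score (number of shared letters); any of the
# similarity measures discussed in the deck could be plugged in instead.
names = ["atgam", "ratgam", "zantac"]
gold = {frozenset(("atgam", "ratgam"))}
ranked = rank_pairs(names, lambda a, b: len(set(a) & set(b)))
print(precision_recall_at_k(ranked, gold, k=1))
```

Sweeping k from 1 up to the number of candidate pairs traces out a precision-recall curve, which gives a way to compare measures on the same gold standard.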
Conclusion • Experimentation with different algorithms and their combinations against gold standard. • ALINE: Strong foundation for search modules in automating the minimization of medication errors • Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features). • Related to pattern recognition: Discover patterns of predictable matches based on feature values