Text Classification for Healthcare Information Support
Rey-Long Liu (劉瑞瓏)
Dept. of Medical Informatics, Tzu Chi University, Taiwan
Background
• Text categorization (TC) is a fundamental component of information processing
• Many TC techniques have been developed
• Unfortunately, high-quality TC is often an unrealizable ideal
  • Very high precision
  • Very high recall
Background (Cont.)
• An application scenario: healthcare information support
[Figure: healthcare information support built on High-Quality TC. Labels: General Users (e.g. patients); Healthcare Professionals; Consultancy; Inquiry; Inquiry Classification; Classified Inquiry; Confirmation; Query; Classified Query; Relevant Information; Information Gathering Systems; Information Gathered; Classified Information; Classified Information Base]
Outline
• Interaction as an approach to high-quality TC
  • Main consideration
    • Reducing the amount of interaction
  • Criteria & straightforward interaction strategies
• An intelligent interaction strategy: COM (Content Overlapping Measurement)
• Empirical evaluation
  • Chinese cancer text classification
• Conclusion
Interaction for High-Quality TC
• Interaction with the user
  • Possibly a “final” approach
• More application scenarios
  • Information recommendation & archiving
  • Definitely relevant vs. potentially relevant
• Main consideration
  • Reducing the number of interactions
Interaction for High-Quality TC (Cont.)
• Evaluation criteria
  • Confirmation Precision (CP)
    • Related to the cognitive load on users
  • Confirmation Recall (CR)
    • Related to the quality of TC
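The slide names CP and CR without defining them. As a rough sketch only, the code below assumes CP is the fraction of conducted confirmations that were actually necessary (the system's tentative decision was wrong) and CR is the fraction of necessary confirmations that were conducted. These definitions, the function name `confirmation_scores`, and the set-of-ids input layout are illustrative assumptions, not taken from the slides.

```python
def confirmation_scores(confirmed, misclassified):
    """Compute (CP, CR) under the ASSUMED definitions described above.

    confirmed:      set of document ids the system asked the user to confirm
    misclassified:  set of document ids the system would have gotten wrong
                    without help (the "necessary" confirmations)
    """
    necessary_and_done = confirmed & misclassified
    # Assumed convention: an empty denominator yields a perfect score.
    cp = len(necessary_and_done) / len(confirmed) if confirmed else 1.0
    cr = len(necessary_and_done) / len(misclassified) if misclassified else 1.0
    return cp, cr


if __name__ == "__main__":
    cp, cr = confirmation_scores({1, 2, 3, 4}, {3, 4, 5})
    print(f"CP={cp:.2f}, CR={cr:.2f}")
```

Under these assumed definitions, a low CP means the user is asked about many documents that were already classified correctly, and a low CR means errors slip through unconfirmed.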
Interaction for High-Quality TC (Cont.)
• Straightforward interaction strategies
• Uniform Confirmation (UC): Preferring CR
(A) Setting two thresholds to identify the DOA range for confirmation (o: positive validation document; x: negative validation document)
[Figure: validation documents ordered by DOA between Min DOA and Max DOA, with the Rejection Threshold (RT) and the Acceptance Threshold (AT) marked]
(B) Confirmation strategy:
Prob = 1.0 (when RT ≤ DOA(d, c) ≤ AT)
Prob = 0 (when DOA(d, c) < RT)
Prob = 0 (when DOA(d, c) > AT)
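The UC rule can be sketched in a few lines. The slide assigns probability 0 only to DOA values strictly below RT or strictly above AT, so the sketch treats the range between the two thresholds as inclusive. The function name and argument names are illustrative assumptions.

```python
def uniform_confirmation(doa, rt, at):
    """Probability of asking the user to confirm document d for category c
    under Uniform Confirmation (UC).

    doa: the classifier's DOA score for (d, c)
    rt:  rejection threshold
    at:  acceptance threshold
    """
    # Every document whose DOA falls between the thresholds is confirmed.
    return 1.0 if rt <= doa <= at else 0.0


if __name__ == "__main__":
    for score in (0.1, 0.2, 0.5, 0.8, 0.9):
        print(score, uniform_confirmation(score, rt=0.2, at=0.8))
```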
Interaction for High-Quality TC (Cont.)
• Probabilistic Confirmation (PC): Preferring CP
(A) Tuning a threshold in the hope of optimizing F1 (o: positive validation document; x: negative validation document)
[Figure: validation documents ordered by DOA between Min DOA and Max DOA, with the classifier's threshold (T) marked]
(B) Confirmation strategy:
Prob = 1.0 (when DOA(d, c) = threshold)
Prob = 0 (when DOA(d, c) = Min)
Prob = 0 (when DOA(d, c) = Max)
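The slide fixes the PC probability at only three points: 1.0 at the threshold and 0 at the minimum and maximum DOA. The sketch below assumes linear interpolation between those points, which is one plausible reading and not something the slide states. The function and parameter names are illustrative.

```python
import random


def probabilistic_confirmation(doa, threshold, min_doa, max_doa):
    """Confirmation probability under PC, ASSUMING linear interpolation
    between the three anchor points given on the slide."""
    if doa == threshold:
        return 1.0
    if doa <= min_doa or doa >= max_doa:
        return 0.0
    if doa < threshold:
        # Rises from 0 at min_doa to 1 at the threshold.
        return (doa - min_doa) / (threshold - min_doa)
    # Falls from 1 at the threshold to 0 at max_doa.
    return (max_doa - doa) / (max_doa - threshold)


def should_confirm(doa, threshold, min_doa, max_doa, rng=random):
    """Draw a yes/no confirmation decision with the PC probability."""
    return rng.random() < probabilistic_confirmation(
        doa, threshold, min_doa, max_doa
    )


if __name__ == "__main__":
    for score in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(score, probabilistic_confirmation(score, 0.5, 0.0, 1.0))
```

Documents far from the threshold are rarely confirmed, which is consistent with the slide's note that PC prefers CP.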
ICCOM: Interactive Confirmation by COM
[Figure: system architecture with a Training phase and a Testing phase.
Underlying Classifier: Feature Selection; Classifier Building; Threshold Tuning; Classification.
ICCOM: (1) Content Overlap Measurement (COM); (2) Threshold Tuning based on Content Overlapping; (3) Content Overlap Measurement (COM).
Data: Training Documents for Classifier Building; Training Documents for Threshold Tuning (validation); Incoming Document; Classified/Filtered Documents; Documents to be Confirmed]
ICCOM: Interactive Confirmation by COM (content overlapping measurement)
• Procedure COM(c, d), where
  • (1) c is a category,
  • (2) d is a document for thresholding or testing
• Return: Degree of content overlap (DCO) between d and c
• Begin
  • (1) DCO = 0;
  • (2) For each term t that is positively correlated with c but does not appear in d, do
    • (2.1) DCO = DCO − χ²(t, c);
  • (3) For each term t that is negatively correlated with c but appears in d, do
    • (3.1) DCO = DCO − (number of occurrences of t in d) × χ²(t, c);
  • (4) Return DCO;
• End.
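The COM procedure above can be sketched in Python as follows. The sketch assumes the document is already tokenized and that the category's correlated terms are supplied as two dictionaries mapping each term to its term-category correlation strength. The dictionary layout, the function name `com`, and the toy weights in the usage example are illustrative assumptions.

```python
from collections import Counter


def com(doc_tokens, pos_terms, neg_terms):
    """Degree of content overlap (DCO) between a document and a category.

    doc_tokens: list of terms in document d
    pos_terms:  {term: strength} for terms positively correlated with c
    neg_terms:  {term: strength} for terms negatively correlated with c
    """
    counts = Counter(doc_tokens)
    dco = 0.0
    # Step (2): penalize each positively correlated term missing from d.
    for term, strength in pos_terms.items():
        if counts[term] == 0:
            dco -= strength
    # Step (3): penalize each occurrence of a negatively correlated term.
    for term, strength in neg_terms.items():
        dco -= counts[term] * strength
    return dco


if __name__ == "__main__":
    # Toy weights, invented purely for illustration.
    pos = {"tumor": 4.0, "liver": 2.0}
    neg = {"cough": 1.5}
    print(com(["liver", "cough", "cough", "pain"], pos, neg))
```

DCO is never positive in this formulation: 0 means full overlap with the category's content, and more negative values mean less overlap.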
ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
$$\chi^2(t, c) = \frac{N \times (AD - BC)^2}{(A + B)(A + C)(B + D)(C + D)}$$
• t is “positively correlated” with c if AD > BC; otherwise it is “negatively correlated”
• N: total number of documents
• A: number of documents that are in c and contain t
• B: number of documents that are not in c but contain t
• C: number of documents that are in c but do not contain t
• D: number of documents that are not in c and do not contain t
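The counts A, B, C, and D form a 2×2 contingency table for term t and category c. The sketch below assumes the correlation strength is the standard χ² statistic over that table, with N = A + B + C + D, and uses the slide's AD > BC rule for the sign of the correlation. The function names and the zero-denominator convention are illustrative assumptions.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs not in the category that contain the term
    c: docs in the category that do not contain the term
    d: docs not in the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Assumed convention: a degenerate table carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def is_positively_correlated(a, b, c, d):
    """The slide's rule: positive if AD > BC, otherwise negative."""
    return a * d > b * c


if __name__ == "__main__":
    print(chi_square(10, 0, 0, 10), is_positively_correlated(10, 0, 0, 10))
    print(chi_square(5, 5, 5, 5), is_positively_correlated(5, 5, 5, 5))
```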
ICCOM: Interactive Confirmation by COM (thresholding)
[Figure: validation documents (o: positive; x: negative) ordered by DOA between Min DOA and Max DOA, with the Rejection Threshold (RT) and the classifier's threshold (T) marked. Labels: Rejection; Invoking COM to compute DCO; Positive Confirmation Threshold (PCT), separating Acceptance from Confirmation; Negative Confirmation Threshold (NCT), separating Rejection from Confirmation]
ICCOM: Interactive Confirmation by COM (collaboration with the classifier)
• Procedure InteractiveHighQualityTC(c, d, T, RT, PCT, NCT), where
  • (1) c is a category,
  • (2) d is the document to be processed,
  • (3) T is the classifier’s threshold for c,
  • (4) RT is the rejection threshold for c,
  • (5) PCT is the positive confirmation threshold for c, and
  • (6) NCT is the negative confirmation threshold for c.
• Return:
  • A decision (acceptance, rejection, or confirmation) for d with respect to c.
• Begin
  • (1) DOAd = Invoke the classifier to compute DOA of d with respect to c;
  • (2) If (DOAd ≤ RT), Return “rejection”;
  • (3) Else
    • (3.1) DCOd = Invoke COM to compute DCO of d with respect to c;
    • (3.2) If (DOAd ≥ T)
      • (3.2.1) If (DCOd ≥ PCT), Return “acceptance”;
      • (3.2.2) Return “confirmation”;
    • (3.3) Else
      • (3.3.1) If (DCOd ≤ NCT), Return “rejection”;
      • (3.3.2) Return “confirmation”;
• End.
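The decision procedure can be sketched as a small Python function. The comparison operators did not survive in the slide text, so the sketch assumes the directions implied by the threshold names: a DOA at or below RT is rejected, a DOA at or above T counts as a tentative acceptance, a DCO at or above PCT confirms the acceptance automatically, and a DCO at or below NCT confirms a rejection automatically. The function name, the `compute_dco` callback, and the threshold values in the usage example are illustrative assumptions.

```python
def interactive_high_quality_tc(doa, compute_dco, t, rt, pct, nct):
    """Return 'acceptance', 'rejection', or 'confirmation' for one
    (document, category) pair.

    doa:         the classifier's DOA score for the pair
    compute_dco: zero-argument callable returning the COM score (DCO);
                 passed as a callback so COM only runs when needed
    t, rt:       classifier threshold and rejection threshold
    pct, nct:    positive and negative confirmation thresholds
    """
    # Step (2): clear rejections never reach COM.
    if doa <= rt:
        return "rejection"
    # Step (3.1): measure content overlap for everything else.
    dco = compute_dco()
    # Step (3.2): the classifier accepts; COM must agree strongly enough.
    if doa >= t:
        return "acceptance" if dco >= pct else "confirmation"
    # Step (3.3): the classifier rejects; COM must agree strongly enough.
    return "rejection" if dco <= nct else "confirmation"


if __name__ == "__main__":
    # Hypothetical threshold values, for illustration only.
    thresholds = dict(t=0.5, rt=0.2, pct=-1.0, nct=-5.0)
    print(interactive_high_quality_tc(0.7, lambda: -0.5, **thresholds))
    print(interactive_high_quality_tc(0.7, lambda: -3.0, **thresholds))
    print(interactive_high_quality_tc(0.3, lambda: -6.0, **thresholds))
```

The user is consulted only when the classifier and COM disagree, which is how the procedure keeps the number of interactions down.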
Empirical Evaluation
• Chinese disease (cancer) texts
  • 16 types of cancer (e.g. liver cancer, lung cancer) top-ranked by the Department of Health in Taiwan
  • Collected by sending cancer names to “知識+” (Knowledge+) on Yahoo! Taiwan
• For each cancer, there are 5 subcategories
  • Cause, symptom, curing, side-effect, and prevention
• Therefore, we have 80 (16 × 5) categories with 2850 documents
  • 90% for training; 10% for testing
  • 2-fold cross validation (classifier building vs. thresholding)
Empirical Evaluation (cont.) Classification of cancer information
Empirical Evaluation (cont.)
Classification of 40 symptom descriptions without cancer names
Note: For the 40 test symptom documents, RO+ICCOM conducts 35 and 51 confirmations in the 1st and 2nd folds, respectively
Conclusion
• High-quality TC is essential but often unrealizable
• Interactive confirmation may be a last resort
  • Information recommendation & archiving
  • Healthcare information support
• COM as a classifier-independent strategy for interaction