
THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY • CSIT 5220: Reasoning and Decision under Uncertainty • L09: Model-Based Classification and Clustering • Nevin L. Zhang, Room 3504, phone: 2358-7015, Email: lzhang@cs.ust.hk

L09: Model-Based Classification and Clustering • Probabilistic Models (PMs) for Classification • PMs for Clustering

Classification • The problem: • Given data: • Find mapping (A1, A2, …, An) → C • Possible solutions • ANN • Decision tree (Quinlan) • … • (SVM: Continuous data)

Probabilistic Approach to Classification

Will Boss Play Tennis?

Bayesian Networks for Classification • Naïve Bayes model often has good performance in practice • Drawbacks of Naïve Bayes: • Assumes attributes are mutually independent given the class variable • The assumption is often violated, leading to double counting of evidence • Fixes: • General BN classifiers • Tree augmented Naïve Bayes (TAN) models • …
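The Naïve Bayes classifier described above can be sketched as follows. This is a minimal illustration, assuming discrete attributes and Laplace smoothing (the smoothing is an added assumption, not from the slides). The function names `train_nb`, `log_posterior`, and `classify_nb`, and the toy weather records, are hypothetical and made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Count class and attribute-value frequencies for a Naive Bayes model."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[i][c][v] = number of class-c rows whose attribute i equals v
    value_counts = [defaultdict(Counter) for _ in range(n_attrs)]
    domains = [set() for _ in range(n_attrs)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            domains[i].add(v)
    return {"alpha": alpha, "class_counts": class_counts,
            "value_counts": value_counts, "domains": domains, "n": len(rows)}

def log_posterior(model, row):
    """Return log P(C=c) + sum_i log P(A_i=a_i | C=c) for every class c."""
    a = model["alpha"]
    n_classes = len(model["class_counts"])
    scores = {}
    for c, nc in model["class_counts"].items():
        s = math.log((nc + a) / (model["n"] + a * n_classes))
        for i, v in enumerate(row):
            count = model["value_counts"][i][c][v]
            s += math.log((count + a) / (nc + a * len(model["domains"][i])))
        scores[c] = s
    return scores

def classify_nb(model, row):
    """arg max_c P(C=c | A1=a1, ..., An=an) under the independence assumption."""
    scores = log_posterior(model, row)
    return max(scores, key=scores.get)

# Made-up toy records (outlook, humidity) -> play; not the lecture's data.
rows = [("sunny", "high"), ("sunny", "high"), ("rain", "normal"),
        ("overcast", "normal"), ("rain", "high"), ("overcast", "high")]
labels = ["no", "no", "yes", "yes", "no", "yes"]
model = train_nb(rows, labels)
print(classify_nb(model, ("sunny", "high")))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.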

Bayesian Networks for Classification • General BN classifier • Treat the class variable as just another variable • Learn a BN • Classify the next instance based on the values of the variables in the Markov blanket of the class variable • Often performs poorly: the prediction depends only on the variables in the Markov boundary of the class variable, so attributes outside it are ignored
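To make the Markov blanket concrete: in a DAG it consists of a node's parents, its children, and the other parents of its children. The sketch below assumes the network is given as a dictionary from each node to its set of parents; the helper name `markov_blanket` and the toy network are hypothetical.

```python
def markov_blanket(parents, target):
    """Parents, children, and children's other parents of `target` in a DAG.

    `parents` maps every node to the set of its parent nodes.
    """
    children = {v for v, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]}
    return (set(parents[target]) | children | spouses) - {target}

# Hypothetical learned structure: A1 -> C -> A2 <- A3, and A2 -> A4.
toy_parents = {
    "C": {"A1"},
    "A1": set(),
    "A2": {"C", "A3"},
    "A3": set(),
    "A4": {"A2"},
}
blanket = markov_blanket(toy_parents, "C")
print(sorted(blanket))
```

In this toy network A4 lies outside the blanket of C, so its observed value would not affect the predicted class.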

Bayesian Networks for Classification • Tree-Augmented Naïve Bayes (TAN) model • Capture dependence among attributes using a tree structure • Learning: • First learn a tree among the attributes using the Chow-Liu algorithm • A special structure learning problem that is easy to solve • Then add the class variable and estimate parameters • Classification: • arg max_c P(C=c|A1=a1, …, An=an) • Computed by BN inference • Many other methods
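The TAN classification rule can be written out as a factorization. This is a sketch in assumed notation: pa(A_i) denotes the parent of attribute A_i in the attribute tree (empty for the root), a symbol that does not appear in the slides.

```latex
P(C, A_1, \dots, A_n) = P(C) \prod_{i=1}^{n} P\bigl(A_i \mid \mathrm{pa}(A_i),\, C\bigr),
\qquad
c^{*} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n}
        P\bigl(A_i = a_i \mid \mathrm{pa}(A_i) = a_{\mathrm{pa}(i)},\, C = c\bigr)
```

Maximizing the joint over c gives the same answer as maximizing the posterior P(C=c|A1=a1, …, An=an), because the two differ only by the factor P(A1=a1, …, An=an), which does not depend on c. Dropping every pa(A_i) recovers the Naïve Bayes model.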

Chow-Liu Trees • Task: Find a tree model over observed variables that has maximum likelihood given data. • Maximized loglikelihood

Chow-Liu Trees • Mutual Information • Task is equivalent to finding a maximum spanning tree of the following weighted, undirected graph:
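The equivalence rests on a standard decomposition of the maximized loglikelihood of a tree model. The statement below uses assumed notation, not symbols taken from the slides: T is the tree, D the data with N records, and P-hat the empirical distribution. The weighted graph has one node per variable and an edge between every pair of variables, weighted by their empirical mutual information.

```latex
\max_{\theta} \log P(D \mid T, \theta)
  = N \sum_{(X_i, X_j) \in T} I(X_i; X_j) \;-\; N \sum_{i} H(X_i),
\qquad
I(X_i; X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j)
              \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\,\hat{P}(x_j)},
\qquad
H(X_i) = -\sum_{x_i} \hat{P}(x_i) \log \hat{P}(x_i)
```

The entropy term does not depend on T, so the maximum likelihood tree is the spanning tree with the largest total mutual information along its edges.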

Maximum Spanning Trees

Illustration of Kruskal’s Algorithm • http://www.cs.cmu.edu/~guestrin/Class/15781/recitations/r10/11152007chowliu.pdf
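Putting the pieces together, a Chow-Liu tree can be found by running Kruskal's algorithm with edges sorted from heaviest to lightest. The sketch below assumes discrete data given as a list of equal-length tuples; the function names `mutual_information` and `chow_liu_edges` and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    count_i = Counter(row[i] for row in data)
    count_j = Counter(row[j] for row in data)
    count_ij = Counter((row[i], row[j]) for row in data)
    return sum((nij / n) * math.log(nij * n / (count_i[a] * count_j[b]))
               for (a, b), nij in count_ij.items())

def chow_liu_edges(data):
    """Maximum spanning tree over variables, weighted by mutual information."""
    n_vars = len(data[0])
    # Kruskal's algorithm: consider edges from heaviest to lightest.
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))  # union-find forest over variables

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so no cycle
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Made-up data: X1 mostly agrees with X0, X2 mostly agrees with X1,
# and X0 and X2 are empirically independent.
data = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1),
        (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0)]
tree = chow_liu_edges(data)
print(sorted(tree))
```

The result is an undirected tree. To use it in a Bayesian network, pick any node as the root and direct the edges away from it.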

L09: Probabilistic Models (PMs) for Classification and Clustering • Probabilistic Models (PMs) for Classification • PMs for Clustering

A Medical Application • In medical diagnosis, a gold standard sometimes exists • Example: Lung Cancer • Symptoms: • Persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc. • Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum. • Gold standard: • Biopsy: the removal of a small sample of tissue for examination under a microscope by a pathologist

A Medical Application • Sometimes a gold standard does not exist • Example: Rheumatoid Arthritis (RA) • Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc. • Information for diagnosis: • Symptoms, medical history, physical exam • Lab tests, including a test for rheumatoid factor • (Rheumatoid factor is an antibody found in the blood of about 80 percent of adults with RA.) • No gold standard: • None of the symptoms or their combinations are clear-cut indicators of RA • Neither the presence nor the absence of rheumatoid factor settles whether one has RA

Latent Class (LC) Analysis of Hannover Rheumatoid Arthritis Data • Class-specific probabilities • Cluster 1: “disease” free • Cluster 2: “back-pain type” • Cluster 3: “joint type” • Cluster 4: “severe type”
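For reference, a latent class (LC) model has the same structure as a Naïve Bayes model, except that the class variable is never observed. In assumed notation (Z for the latent class, x_1, …, x_n for the observed symptoms), the model and the cluster assignment rule are:

```latex
P(x_1, \dots, x_n) = \sum_{z} P(Z = z) \prod_{i=1}^{n} P(x_i \mid Z = z),
\qquad
P(Z = z \mid x_1, \dots, x_n) \;\propto\; P(Z = z) \prod_{i=1}^{n} P(x_i \mid Z = z)
```

The class-specific probabilities are the terms P(x_i | Z = z). Each cluster is interpreted by looking at which symptoms are likely within it.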

To Cluster Continuous Data

Learning Gaussian Mixture Models
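Gaussian mixture models are commonly learned with the EM algorithm. The following is a minimal one-dimensional sketch under stated assumptions: the initial parameters are supplied by the caller, a fixed number of iterations is run, and a small variance floor is added to avoid degenerate components. The function names `normal_pdf` and `em_gmm_1d` and the data points are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, n_iter=50):
    """Run EM for a 1-D Gaussian mixture and return (means, variances, weights)."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: resp[n][j] = P(component j | x_n) under current parameters.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate each component from its weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # variance floor (an added safeguard)
    return mus, variances, weights

# Made-up data with two well-separated groups.
xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
mus, variances, weights = em_gmm_1d(xs, mus=[0.0, 6.0],
                                    variances=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, weights)
```

Each point is softly assigned to every component in the E-step. A hard clustering is obtained at the end by assigning each point to the component with the largest responsibility.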

http://www.socr.ucla.edu/Applets.dir/MixtureEM.html
