Learning Classifiers from Distributional Data
Harris T. Lin, Sanghack Lee, Ngot Bui and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
htlin@iastate.edu
Introduction
• Traditional classification: each instance is represented as a tuple of feature values
• BUT, due to
  • variability in sample measurements,
  • differences in the sampling rate of each feature, and
  • advances in measurement tools and storage,
  one may want to repeat the measurement for each feature and for each individual, for reliability
• Example domains
  • Electronic Health Records
  • Sensor readings
  • Extracted text features
  • …
• How to represent such data?
[Figure: example dataset — columns White blood cell, Cholesterol, Heart Rate, Temperature, Healthy?; each feature cell of Patient1 (Y), Patient2 (N), and Patient3 (N) holds a bag of repeated measurements]
Introduction
• How to represent? Option 1: align samples
  • Measurements may not be synchronous
  • Missing data
  • Unnecessarily big and sparse dataset
  • Need to adjust for weights
[Figure: Patient1's measurements aligned by time — most cells are missing ("?") because the features were not sampled at the same moments]
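To make the sparsity concrete, here is a minimal sketch (not from the paper; the timestamps and feature names are invented) of how aligning asynchronous measurements produces a mostly-empty table:

```python
import pandas as pd

# Hypothetical asynchronous measurements for one patient:
# each feature is sampled at its own times.
heart_rate = pd.Series([72, 75, 71], index=[1, 3, 5], name="heart_rate")
temperature = pd.Series([36.6, 37.1], index=[2, 6], name="temperature")

# Aligning on a shared time axis (an outer join on timestamps)
# yields one wide row per time step, mostly filled with NaN.
aligned = pd.concat([heart_rate, temperature], axis=1)
print(aligned)
#    heart_rate  temperature
# 1        72.0          NaN
# 2         NaN         36.6
# 3        75.0          NaN
# 5        71.0          NaN
# 6         NaN         37.1
```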
Introduction
• How to represent? Option 2: aggregation (a code sketch follows)
  • May lose valuable information
  • Which aggregation function?
  • The distribution of each sample set may itself contain information
[Figure: Patient1's bags collapsed into a single tuple, e.g. max for White blood cell and Cholesterol, avg for Heart Rate and Temperature]
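A minimal sketch of the aggregation option (the function choices and values below are hypothetical): each bag is collapsed into a few summary statistics, yielding a fixed-length tuple that any standard classifier can consume.

```python
# Collapse each bag of measurements into fixed summary statistics,
# turning a distributional instance into an ordinary feature vector.
def aggregate(bags, funcs=(max, min, lambda b: sum(b) / len(b))):
    """Map one instance (a list of bags, one per feature) to a flat vector."""
    return [f(bag) for bag in bags for f in funcs]

patient1 = [[72, 75, 71], [36.6, 37.1]]  # bags for heart rate, temperature
print(aggregate(patient1))
# [75, 71, 72.666..., 37.1, 36.6, 36.85]
```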
Introduction
• How to represent? Proposed approach: keep the data just as drawn
  • Each feature value is a bag of values
  • A "distributional" representation
  • Adapt learning models to this new representation
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer 3 basic solution approaches
[Figure: the dataset kept as-is — every feature cell of Patient1 (Y), Patient2 (N), and Patient3 (N) remains a bag of values]
Problem Formulation
• Distributional instance: x = (B1, …, BK), where Bk is a bag of values of the kth feature
• Distributional dataset: D = {(x1, c1), …, (xn, cn)}
• Distributional classifier learning problem: given D, a Learner outputs a Classifier that maps a new distributional instance to a predicted class
[Figure: training pairs (x1, c1), (x2, c2), …, (xn, cn) → Learner → Classifier; new instance → Classifier → predicted class]
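As a concrete (hypothetical) rendering of these definitions — the class and method names below are mine, not the paper's — a distributional instance can be stored directly as one bag per feature:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, List

@dataclass
class DistributionalInstance:
    """x = (B1, ..., BK): one bag (multiset) of observed values per feature."""
    bags: List[Counter]  # bags[k] counts the values observed for feature k

    @classmethod
    def from_lists(cls, raw: List[List[Hashable]]) -> "DistributionalInstance":
        return cls([Counter(bag) for bag in raw])

# A distributional dataset D = {(x1, c1), ..., (xn, cn)}:
patient1 = DistributionalInstance.from_lists([["hi", "hi", "lo"], ["normal"]])
D = [(patient1, "Y")]
```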
Distributional Learning Algorithms
• Restricted to discrete domains for simplicity
• 3 basic approaches
  • Aggregation
    • Simple aggregation (max, min, avg, etc.)
    • Vector distance aggregation (Perlich and Provost [2006])
  • Generative models: Naïve Bayes, with 4 different distributions over the bags
    • Bernoulli
    • Multinomial
    • Dirichlet
    • Pólya (Dirichlet-Multinomial)
  • Discriminative models
    • Standard techniques transform each generative model above into its discriminative counterpart
(A sketch of the multinomial generative variant follows.)
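Here is a minimal sketch of the multinomial generative variant, under my reading of the setup (the smoothing constant, names, and structure are assumptions, not the paper's code): per class and per feature, a Laplace-smoothed multinomial over discrete values is estimated from the training bags, and a test instance is scored by summing log-probabilities over every value in each of its bags.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(D, alpha=1.0):
    """D: list of (bags, label), where bags is a list of Counters, one per feature."""
    class_counts = Counter(label for _, label in D)
    # value_counts[label][k][v] = total count of value v over feature k's bags
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = [set() for _ in D[0][0]]  # observed values per feature
    for bags, label in D:
        for k, bag in enumerate(bags):
            value_counts[label][k].update(bag)
            vocab[k].update(bag)
    return class_counts, value_counts, vocab, alpha

def predict(model, bags):
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for label, cc in class_counts.items():
        score = math.log(cc / n)  # log prior
        for k, bag in enumerate(bags):
            total = sum(value_counts[label][k].values())
            denom = total + alpha * len(vocab[k])
            # Multinomial log-likelihood of the bag, up to a constant
            # (the multinomial coefficient is identical for every class).
            for v, cnt in bag.items():
                score += cnt * math.log((value_counts[label][k][v] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)
```

With the dataset sketched earlier, usage would be `predict(train_multinomial_nb(D), patient1.bags)`.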
Result Summary
• Datasets: 2 real-world datasets and 1 synthetic dataset
• Dataset sizes: [table not preserved in the extracted text]
• Results: DIL algorithms that take advantage of the information available in the distributional instance representation outperform or match their counterparts that fail to fully exploit that information
• Main caveat: results on discrete domains may not carry over to numerical features
Related Work
[Figure: a map of related settings — standard tabular learning (bag size = 1); Multiple Instance Learning (a bag of tuples of features, vs. the distributional tuple of bags of features); and topic models over documents (# features = 1), including supervised and multi-modal topic models — spanning discrete and numerical domains]
Future Work
• Consider ordinal and numerical features
• Consider dependencies between features
• Adapt other existing machine learning methods (e.g. kernel methods, SVMs, decision trees, nearest neighbors)
• Unsupervised setting: clustering distributional data
Conclusion
• Opportunities
  • Variability in sample measurements
  • Differences in the sampling rate of each feature
  • One may want to repeat the measurement for each feature and for each individual, for reliability
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer 3 basic solution approaches
  • Suggest that the distributions embedded in the distributional representation may improve performance