Learning Classifiers from Distributional Data
Harris T. Lin, Sanghack Lee, Ngot Bui and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
htlin@iastate.edu
Introduction
• Traditional classification: each instance is represented as a tuple of feature values
• BUT, due to
  • variability in sample measurements,
  • differences in the sampling rate of each feature, and
  • advances in measurement tools and storage,
  one may want to repeat the measurement for each feature and for each individual, for reliability
• Example domains
  • Electronic Health Records
  • Sensor readings
  • Extracted text features
  • …
• How to represent such data?
[Figure: example dataset — columns White blood cell, Cholesterol, Heart Rate, Temperature, Healthy?; each feature cell of Patient1 (Y), Patient2 (N), and Patient3 (N) holds a bag of repeated measurements]
Introduction
• How to represent? Option 1: align samples
  • Measurements may not be synchronous
  • Missing data
  • Unnecessarily big and sparse dataset
  • Need to adjust for weights
[Figure: Patient1's measurements aligned by time — most cells are missing ("?") because the features were not sampled at the same moments]
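To make the sparsity concrete, here is a minimal sketch (not from the paper; the timestamps and feature names are invented) of how aligning asynchronous measurements produces a mostly-empty table:

```python
import pandas as pd

# Hypothetical asynchronous measurements for one patient:
# each feature is sampled at its own times.
heart_rate = pd.Series([72, 75, 71], index=[1, 3, 5], name="heart_rate")
temperature = pd.Series([36.6, 37.1], index=[2, 6], name="temperature")

# Aligning on a shared time axis (an outer join on timestamps)
# yields one wide row per time step, mostly filled with NaN.
aligned = pd.concat([heart_rate, temperature], axis=1)
print(aligned)
#    heart_rate  temperature
# 1        72.0          NaN
# 2         NaN         36.6
# 3        75.0          NaN
# 5        71.0          NaN
# 6         NaN         37.1
```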
Introduction
• How to represent? Option 2: aggregation (a code sketch follows)
  • May lose valuable information
  • Which aggregation function?
  • The distribution of each sample set may itself contain information
[Figure: Patient1's bags collapsed into a single tuple, e.g. max for White blood cell and Cholesterol, avg for Heart Rate and Temperature]
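A minimal sketch of the aggregation option (the function choices and values below are hypothetical): each bag is collapsed into a few summary statistics, yielding a fixed-length tuple that any standard classifier can consume.

```python
# Collapse each bag of measurements into fixed summary statistics,
# turning a distributional instance into an ordinary feature vector.
def aggregate(bags, funcs=(max, min, lambda b: sum(b) / len(b))):
    """Map one instance (a list of bags, one per feature) to a flat vector."""
    return [f(bag) for bag in bags for f in funcs]

patient1 = [[72, 75, 71], [36.6, 37.1]]  # bags for heart rate, temperature
print(aggregate(patient1))
# [75, 71, 72.666..., 37.1, 36.6, 36.85]
```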
Introduction
• How to represent? Proposed approach: keep the data just as drawn
  • Each feature value is a bag of values
  • A "distributional" representation
  • Adapt learning models to this new representation
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer 3 basic solution approaches
[Figure: the dataset kept as-is — every feature cell of Patient1 (Y), Patient2 (N), and Patient3 (N) remains a bag of values]
Problem Formulation
• Distributional instance: x = (B1, …, BK), where Bk is a bag of values of the kth feature
• Distributional dataset: D = {(x1, c1), …, (xn, cn)}
• Distributional classifier learning problem: given D, a Learner outputs a Classifier that maps a new distributional instance to a predicted class
[Figure: training pairs (x1, c1), (x2, c2), …, (xn, cn) → Learner → Classifier; new instance → Classifier → predicted class]
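As a concrete (hypothetical) rendering of these definitions — the class and method names below are mine, not the paper's — a distributional instance can be stored directly as one bag per feature:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, List

@dataclass
class DistributionalInstance:
    """x = (B1, ..., BK): one bag (multiset) of observed values per feature."""
    bags: List[Counter]  # bags[k] counts the values observed for feature k

    @classmethod
    def from_lists(cls, raw: List[List[Hashable]]) -> "DistributionalInstance":
        return cls([Counter(bag) for bag in raw])

# A distributional dataset D = {(x1, c1), ..., (xn, cn)}:
patient1 = DistributionalInstance.from_lists([["hi", "hi", "lo"], ["normal"]])
D = [(patient1, "Y")]
```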
Distributional Learning Algorithms
• Restricted to discrete domains for simplicity
• 3 basic approaches
  • Aggregation
    • Simple aggregation (max, min, avg, etc.)
    • Vector distance aggregation (Perlich and Provost [2006])
  • Generative models: Naïve Bayes, with 4 different distributions over the bags
    • Bernoulli
    • Multinomial
    • Dirichlet
    • Pólya (Dirichlet-Multinomial)
  • Discriminative models
    • Standard techniques transform each generative model above into its discriminative counterpart
(A sketch of the multinomial generative variant follows.)
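Here is a minimal sketch of the multinomial generative variant, under my reading of the setup (the smoothing constant, names, and structure are assumptions, not the paper's code): per class and per feature, a Laplace-smoothed multinomial over discrete values is estimated from the training bags, and a test instance is scored by summing log-probabilities over every value in each of its bags.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(D, alpha=1.0):
    """D: list of (bags, label), where bags is a list of Counters, one per feature."""
    class_counts = Counter(label for _, label in D)
    # value_counts[label][k][v] = total count of value v over feature k's bags
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = [set() for _ in D[0][0]]  # observed values per feature
    for bags, label in D:
        for k, bag in enumerate(bags):
            value_counts[label][k].update(bag)
            vocab[k].update(bag)
    return class_counts, value_counts, vocab, alpha

def predict(model, bags):
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for label, cc in class_counts.items():
        score = math.log(cc / n)  # log prior
        for k, bag in enumerate(bags):
            total = sum(value_counts[label][k].values())
            denom = total + alpha * len(vocab[k])
            # Multinomial log-likelihood of the bag, up to a constant
            # (the multinomial coefficient is identical for every class).
            for v, cnt in bag.items():
                score += cnt * math.log((value_counts[label][k][v] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)
```

With the dataset sketched earlier, usage would be `predict(train_multinomial_nb(D), patient1.bags)`.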
Result Summary
• Datasets: 2 real-world datasets and 1 synthetic dataset
• Dataset sizes: [table not preserved in the extracted text]
• Results: DIL algorithms that take advantage of the information available in the distributional instance representation outperform or match their counterparts that fail to fully exploit that information
• Main caveat: results on discrete domains may not carry over to numerical features
Related Work
[Figure: a map of related settings — standard tabular learning (bag size = 1); Multiple Instance Learning (a bag of tuples of features, vs. the distributional tuple of bags of features); and topic models over documents (# features = 1), including supervised and multi-modal topic models — spanning discrete and numerical domains]
Future Work
• Consider ordinal and numerical features
• Consider dependencies between features
• Adapt other existing machine learning methods (e.g. kernel methods, SVMs, decision trees, nearest neighbors)
• Unsupervised setting: clustering distributional data
Conclusion
• Opportunities
  • Variability in sample measurements
  • Differences in the sampling rate of each feature
  • One may want to repeat the measurement for each feature and for each individual, for reliability
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer 3 basic solution approaches
  • Suggest that the distributions embedded in the distributional representation may improve performance