Parallelized Boosting
Mehmet Basbug, Burcin Cakir, Ali Javadi Abhari
Date: 10 Jan 2013
Motivating Example
• Many examples, many attributes.
• Can we find a good (strong) hypothesis relating the attributes to the final labels?
[Figure: grid of examples (rows) by attributes (columns), with a final column of labels]
Table 1. Example Data Format
User Interface
User specifies the desired options in two ways:
• Configuration file: information about the number of nodes/cores, memory, and number of iterations. To be parsed by the preprocessor.
• Behavioral classes: defining the hypotheses' "behaviors".
---------------------------------------------------------------------------
<configurations.config>
---------------------------------------------------------------------------
[Configuration 1]
working_directory = '/scratch/pboost/example'
data_files = 'diabetes_train.dat'
test_files = 'diabetes_test.dat'
fn_behavior = 'behaviors_diabetes.py'
boosting_algorithm = 'confidence_rated'
max_memory = 2
xval_no = 10
round_no = 1000
---------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
<behaviors_diabetes.py>
----------------------------------------------------------------------------------------------------
from parallel_boosting.utility.behavior import Behavioral

class BGL_Day_Av(Behavioral):
    def behavior(self, bgl_m, bgl_n, bgl_e):
        # Average of the morning, noon, and evening BGL readings
        return (self.data[:, bgl_m] + self.data[:, bgl_n] + self.data[:, bgl_e]) / 3

    def fn_generator(self):
        for k in range(1, (self.data.shape[1] - 4) // 3 + 1):
            bgl_m = 3 * k
            bgl_n = 3 * k + 1
            bgl_e = 3 * k + 2
            self.insert(bgl_m, bgl_n, bgl_e)
----------------------------------------------------------------------------------------------------
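A minimal sketch of how the preprocessor might parse the configuration file above, using Python's standard configparser. The key names come from the slide; the `load_config` helper and the quote-stripping convention are assumptions, not the project's actual API.

```python
from configparser import ConfigParser

def load_config(path):
    """Parse one named section of a pboost-style configuration file (hypothetical helper)."""
    parser = ConfigParser()
    parser.read(path)
    section = parser["Configuration 1"]
    return {
        # String values are quoted in the file, so strip the quotes.
        "working_directory": section.get("working_directory").strip("'"),
        "data_files": section.get("data_files").strip("'"),
        "fn_behavior": section.get("fn_behavior").strip("'"),
        "boosting_algorithm": section.get("boosting_algorithm").strip("'"),
        # Numeric values can be read directly.
        "max_memory": section.getint("max_memory"),
        "xval_no": section.getint("xval_no"),
        "round_no": section.getint("round_no"),
    }
```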
Pre-Processing
• User-defined Python classes: define the function behaviors and generate the set of hypotheses.
• Configuration file: gives the paths to the required data and definitions.
• Function Definitions Table: stores the hypotheses and makes them available to different cores.
• Hypothesis Result Matrix: stores the output of each hypothesis on each example.
• Sorting Index Matrix: saves the sorting indices of each example.
Table 2. Function Definitions Table
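The Hypothesis Result Matrix and Sorting Index Matrix can be sketched with NumPy as follows. This is an illustration of the data structures named above, not the project's actual code; `build_matrices` and the 16-bit index type are assumptions (the slides note that storing indices instead of results saves memory, 16 bit vs. 64 bit).

```python
import numpy as np

def build_matrices(data, functions):
    """Apply each hypothesis function to every example and record sort order."""
    # Hypothesis Result Matrix: one column of outputs per hypothesis.
    results = np.column_stack([f(data) for f in functions])
    # Sorting Index Matrix: per hypothesis, the permutation that sorts
    # the examples by that hypothesis's output. Stored as 16-bit ints,
    # which is far smaller than the 64-bit results themselves.
    sort_idx = np.argsort(results, axis=0).astype(np.uint16)
    return results, sort_idx
```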
Pre-Processing (cont'd)
Table 3. Function Output Table
Applying each function to each example is a parallelizable task. Therefore, another important step that needs to be implemented in the preprocessing part is to read the machine information from the configuration file.
Table 4. Sorting Index Table
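Since each function/example pair is independent, the evaluation step above can be parallelized. A minimal sketch with a thread pool (for portability of the example; the actual system distributes work across the nodes and cores named in the configuration file, and the function name here is assumed):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def apply_functions_parallel(data, functions, n_workers=4):
    """Evaluate each hypothesis function on all examples concurrently."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker evaluates one whole hypothesis column; the tasks
        # share no state, so this is embarrassingly parallel.
        columns = list(pool.map(lambda f: f(data), functions))
    return np.column_stack(columns)
```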
Training the boosting algorithm
Boosting (Master):
• Start with a distribution over the examples (Dt).
• For each round t = 1...T:
  • send Dt to each slave;
  • receive the best hypothesis from each slave (h1t, h2t, ...);
  • find the one with the least error (ht);
  • update Dt using ht;
  • calculate the coefficient at.
• Return the linear combination of the ht's.
Weak Learner (Slave):
• The sorting index matrix is partitioned across slaves; each slave maintains its own error matrices.
• Calculate the error for each combination (hypothesis, labeling, threshold) over the hypotheses in its given set, under the given distribution over examples (Dt).
• Return the hypothesis with the least error to the master.
Features:
• Super fast: memory based; single pass through the data; stores indices rather than results (16 bit vs. 64 bit); LAPACK & numexpr; embarrassingly parallelized.
• Several boosting algorithms.
• Flexible cross-validation (xval) structure.
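The master/slave loop above can be sketched in a single process with plain AdaBoost and threshold stumps: `best_stump` plays the role of a slave's search over (hypothesis, threshold, labeling) combinations, and `boost` is the master's round loop. This is an illustrative sketch, not the project's implementation (which is confidence-rated and distributed across workers).

```python
import numpy as np

def best_stump(X, y, w):
    """Slave's job: (feature, threshold, polarity) minimizing weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def boost(X, y, T=10):
    """Master's round loop: maintain D_t, pick h_t, compute a_t."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_t: distribution over examples
    ensemble = []
    for _ in range(T):
        j, thr, pol, err = best_stump(X, y, w)   # receive best hypothesis
        err = max(err, 1e-12)                    # avoid division by zero
        a = 0.5 * np.log((1 - err) / err)        # coefficient a_t
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-a * y * pred)               # update D_t using h_t
        w /= w.sum()
        ensemble.append((j, thr, pol, a))
    return ensemble

def predict(ensemble, X):
    """Linear combination of the h_t's, thresholded at zero."""
    score = sum(a * np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                for j, thr, pol, a in ensemble)
    return np.where(score >= 0, 1, -1)
```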
Post-Processing
• Combines and reports the collected results.
• The result after each round of iteration is stored by the master: the set of hypotheses (ht), their respective coefficients (at), and the error.
• Plot training and testing error vs. number of rounds.
• Plot ROC curves for training and testing.
• Confusion matrix showing false/true positives/negatives.
• Create a standalone final classifier.
• Report running time, amount of memory used, number of cores, ...
• Clean up extra intermediary data stored on disk.
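One of the reporting steps above, the confusion matrix, can be sketched directly (function name and dict layout are assumptions for illustration, using the ±1 label convention from the training sketch):

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """Count true/false positives/negatives for labels in {-1, +1}."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == -1) & (y_pred == -1)))
    fp = int(np.sum((y_true == -1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == -1)))
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
```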