Experience with Simple Approaches
Wei Fan‡, Erheng Zhong†, Sihong Xie†, Yuzhao Huang†, Kun Zhang$, Jing Peng#, Jiangtao Ren†
‡ IBM T. J. Watson Research Center, † Sun Yat-sen University, $ Xavier University of Louisiana, # Montclair State University
RDT: Random Decision Tree (Fan et al., '03) • "Encodes data" in trees. • At each node, an unused feature is chosen randomly: • A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node. • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen. • Growing stops when one of the following happens: • The node becomes too small or contains only one class, • Or the total height of the tree exceeds some limit.
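Below is a minimal Python sketch of this construction procedure, assuming a toy feature description (discrete features list their values; continuous ones draw a fresh random threshold each time they are chosen). The names and data format are illustrative, not the authors' implementation.

```python
import random
from collections import Counter

# Feature description for the toy example of the next slide:
# discrete features list their values; "continuous" marks the rest.
FEATURES = {"B1": [0, 1], "B2": [0, 1], "B3": "continuous"}

def build_rdt(data, used_discrete=frozenset(), depth=0,
              max_depth=5, min_size=2):
    """data: list of (feature-dict, label) pairs."""
    labels = [y for _, y in data]
    # Stop: node too small, pure, or height limit reached.
    if depth >= max_depth or len(data) <= min_size or len(set(labels)) <= 1:
        return {"counts": Counter(labels)}  # leaf
    # Candidates: continuous features are always reusable; a discrete
    # feature may appear only once on a root-to-node path.
    cands = [f for f, v in FEATURES.items()
             if v == "continuous" or f not in used_discrete]
    if not cands:
        return {"counts": Counter(labels)}
    f = random.choice(cands)
    if FEATURES[f] == "continuous":
        # A fresh random threshold each time the feature is chosen.
        lo = min(x[f] for x, _ in data)
        hi = max(x[f] for x, _ in data)
        t = random.uniform(lo, hi)
        left = [(x, y) for x, y in data if x[f] < t]
        right = [(x, y) for x, y in data if x[f] >= t]
        if not left or not right:  # degenerate split -> make a leaf
            return {"counts": Counter(labels)}
        return {"feature": f, "threshold": t,
                "<": build_rdt(left, used_discrete, depth + 1,
                               max_depth, min_size),
                ">=": build_rdt(right, used_discrete, depth + 1,
                                max_depth, min_size)}
    # Discrete split: one child per observed value; the feature is
    # now used up on this path.
    used = used_discrete | {f}
    branches = {}
    for v in FEATURES[f]:
        subset = [(x, y) for x, y in data if x[f] == v]
        if subset:
            branches[v] = build_rdt(subset, used, depth + 1,
                                    max_depth, min_size)
    return {"feature": f, "branches": branches}

# Each tree of the ensemble is grown this way with a different seed.
data = [({"B1": 0, "B2": 1, "B3": 0.2}, "+"),
        ({"B1": 1, "B2": 0, "B3": 0.7}, "-"),
        ({"B1": 0, "B2": 0, "B3": 0.5}, "+"),
        ({"B1": 1, "B2": 1, "B3": 0.9}, "-")]
tree = build_rdt(data, min_size=1)
```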
Illustration of RDT
[Figure: growing one random tree over B1: {0,1}, B2: {0,1}, B3: continuous. B1 is chosen randomly at the root; B3 is chosen with a random threshold of 0.3; B2 is chosen randomly at another node; B3 is chosen again on the same path with a different random threshold of 0.6.]
Probabilistic view of decision trees - PETs
Given an example x:
• P(y|x,θ) is estimated from the class frequencies at the leaf that x falls into, e.g. in C4.5 or CART
• these estimates serve as confidences in the predicted labels
• the dependence of P(y|x,θ) on θ is non-trivial
For example, on the iris data:
[Figure: a decision tree. Petal.Length < 2.45 → leaf setosa (50/0/0); otherwise, Petal.Width < 1.75 → leaf versicolor (0/49/5), else leaf virginica (0/1/45).]
For an x reaching the versicolor leaf: P(setosa|x,θ) = 0, P(versicolor|x,θ) = 49/54, P(virginica|x,θ) = 5/54.
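A quick numeric check of the leaf-frequency estimates quoted above (class names and counts taken from the slide):

```python
# Class counts at the versicolor leaf: (setosa, versicolor, virginica).
counts = {"setosa": 0, "versicolor": 49, "virginica": 5}
n = sum(counts.values())                       # 54 examples at the leaf
probs = {c: k / n for c, k in counts.items()}
print(probs)  # {'setosa': 0.0, 'versicolor': 0.907..., 'virginica': 0.092...}
```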
Problems of probability estimation via conventional DTs
• Probability estimates tend to approach the extremes of 1 and 0.
• Additional inaccuracies result from the small number of examples at a leaf.
• The same probability is assigned to the entire region of space defined by a given leaf.
Remedies: C4.4 (Provost, '03), BC44 (Zhang, '06), RDT (Fan, '03)
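A standard remedy for the first two problems (and the one C4.4 is commonly described as using, alongside disabled pruning) is the Laplace correction at the leaves; a minimal sketch:

```python
def laplace_estimate(k, n, num_classes):
    """Laplace-corrected leaf estimate (k + 1) / (n + C): pulls
    small-leaf estimates away from the extremes of 0 and 1."""
    return (k + 1) / (n + num_classes)

# Raw frequency vs. Laplace correction on the 0/49/5 leaf above:
print(49 / 54, laplace_estimate(49, 54, 3))  # 0.907... vs 0.877...
print(0 / 54, laplace_estimate(0, 54, 3))    # 0.0     vs 0.0175...
```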
bRDT
• "bRDT" is the averaging of RDT and BC44, where RDT is the Random Decision Tree and BC44 is Bagged C4.4.
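A minimal sketch of the combination, assuming each model exposes a hypothetical predict_proba(x) returning its estimate of P(y|x):

```python
def brdt_predict_proba(rdt_model, bc44_model, x):
    # Unweighted average of the two models' posterior estimates.
    return 0.5 * (rdt_model.predict_proba(x) + bc44_model.predict_proba(x))
```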
Sampling strategy for Tasks 1 & 2
For station Z, the negative instances are partitioned into "blocks" such that the size of each block is approximately 3 times that of the positive set, as sketched below.
[Figure: the negative instances split into Block 1, ..., Block n, alongside the positive set.]
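A sketch of that partitioning, with illustrative names; each negative block can then be paired with the full positive set to train one classifier:

```python
import random

def make_blocks(negatives, positives, ratio=3, seed=0):
    """Shuffle the negatives and cut them into blocks of roughly
    ratio * len(positives) instances, so each block gives a ~3:1
    class ratio against the full positive set."""
    rng = random.Random(seed)
    neg = list(negatives)
    rng.shuffle(neg)
    size = max(1, ratio * len(positives))
    return [neg[i:i + size] for i in range(0, len(neg), size)]
```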
Task 1 & 2 - Results
• For station V, rows 2 and 3 correspond to tasks 1 and 2.
• The optimal classifiers for tasks 1 and 2 are the same for stations W, X, Y, and Z, so there is only one row for each of these 4 stations.
Task 3 – Feature Expansion Example
Three instances with only one feature; A and B are positive while C is negative: A(0.9), B(1.0), C(1.1).
Before expansion, Distance(A, B) = Distance(B, C): 0.01 vs. 0.01 (squared Euclidean distances).
After expansion, A(0.9, 0.81, 0.64), B(1.0, 1.0, 0.69), C(1.1, 1.21, 0.74), so Distance(A, B) < Distance(B, C): 0.049 vs. 0.056.
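The numbers above check out as squared Euclidean distances; a quick verification:

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

A, B, C = (0.9,), (1.0,), (1.1,)
print(round(sq_dist(A, B), 3), round(sq_dist(B, C), 3))  # 0.01 vs. 0.01

Ae, Be, Ce = (0.9, 0.81, 0.64), (1.0, 1.0, 0.69), (1.1, 1.21, 0.74)
# ~0.049 vs. ~0.057 (the slide rounds the latter to 0.056)
print(round(sq_dist(Ae, Be), 3), round(sq_dist(Be, Ce), 3))
```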
Task 3 – Result of Test 3
Parameter-free