exhaustkd

exhaustkd Searching exhaustively for an optimal setting of Timbl’s -k and -d parameters

Overview • Introduction: • Timbl’s k & distance weighting • Idea: • Read knn sets and distances from Timbl’s output • Implementation: • exhaust.py & cvexhaust.py • Examples: • diminutive & Prosit data • Discussion: • limitations & improvements

Introduction • Knn classification without distance weighting • [Figure: a query instance "?" among neighbors of classes X and Y, with the neighborhoods for k=1, k=2 and k=3 marked]

Introduction (cont.) • Knn classification with distance weighting • [Figure: the same query instance "?" among neighbors of classes X and Y, with the neighborhoods for k=1, k=2 and k=3 marked]

Introduction (cont.) • Distance weighting methods: • Z (no weighting) • ID (inverse distance) • IL (inverse linear) • EDa (exponential decay with alpha a)
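The four weighting schemes can be sketched as a single function. This is a minimal illustration based on the commonly cited definitions (inverse distance, Dudani's inverse linear, exponential decay with beta fixed at 1); the function name, the `eps` constant and the argument names are assumptions made for this example, not Timbl's own code.

```python
import math

def vote_weight(method, d, d_nearest, d_farthest, alpha=1.0, eps=1e-9):
    """Vote weight of a neighbor at distance d (illustrative sketch only)."""
    if method == "Z":    # no weighting: every neighbor counts equally
        return 1.0
    if method == "ID":   # inverse distance; eps (assumed) avoids division by zero
        return 1.0 / (d + eps)
    if method == "IL":   # inverse linear: 1 at the nearest, 0 at the farthest distance
        if d_farthest == d_nearest:
            return 1.0
        return (d_farthest - d) / (d_farthest - d_nearest)
    if method == "ED":   # exponential decay with constant alpha
        return math.exp(-alpha * d)
    raise ValueError("unknown weighting method: " + method)
```

Note that IL needs the nearest and farthest distance of the current neighborhood, so its weights change whenever k changes.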

Idea • Knn classification is actually a two-step process: • Determine the nearest neighbor sets (= those instances with similar features, optionally using feature weighting and MVDM) • Determine the majority class within all nearest neighbors at maximally the k-th distance, optionally using distance weighting • We can do step 2 without repeating step 1!
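A small sketch of why step 2 is cheap once step 1 is stored: if the neighbor sets are kept as a list of (distance, class counts) pairs ordered by distance, the unweighted class distribution for any k is just the sum over the first k pairs. The data layout and the function name are assumptions for illustration; the numbers are taken from the first diminutive instance in the example output.

```python
from collections import Counter

def distribution_at_k(neighbor_sets, k):
    """Sum the class counts of all neighbors up to and including the k-th distance."""
    total = Counter()
    for _distance, counts in neighbor_sets[:k]:
        total.update(counts)
    return total

# Neighbor sets of one test instance: (distance, {class: count}), nearest first.
neighbor_sets = [
    (0.0, {"P": 1}),
    (0.0594251, {"E": 1}),
    (0.103409, {"E": 3, "P": 2}),
]

for k in (1, 2, 3):
    print(k, dict(distribution_at_k(neighbor_sets, k)))
```

For k=3 this reproduces the distribution { E 4, P 3 } that Timbl reports for that instance, without touching the instance base again.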

Idea (cont.) • The +v option allows you to write the knn sets and their distances to the output file • +vn := write nearest neighbors • +vdi := write distances (of the instance to be classified, and of its knn sets) • +vdb := write class distributions (of the instance to be classified, and of its knn sets) • Example: • Timbl -f dimin.train -t dimin.test -k3 +vn+di+db

=,=,=,=,+,k,u,=,-,bl,u,m,E,E { E 4.00000, P 3.00000 } 0.0000000000000
# k=1, 1 Neighbor(s) at distance: 0.00000
#    =,=,=,=,+,k,u,=,-,bl,u,m,{ P 1 }
# k=2, 1 Neighbor(s) at distance: 0.0594251
#    =,=,=,=,+,m,K,=,-,bl,u,m,{ E 1 }
# k=3, 5 Neighbor(s) at distance: 0.103409
#    =,=,=,=,+,m,O,z,-,bl,u,m,{ E 1, P 1 }
#    =,=,=,=,+,st,K,l,-,bl,u,m,{ E 1 }
#    =,=,=,=,+,m,y,r,-,bl,u,m,{ E 1, P 1 }
+,m,I,=,-,d,A,G,-,d,},t,J,J { J 8.00000 } 0.28274085738293
# k=1, 1 Neighbor(s) at distance: 0.282741
#    -,v,@,r,+,v,A,l,-,p,},t,{ J 1 }
# k=2, 6 Neighbor(s) at distance: 0.311890
#    =,=,=,=,=,=,=,=,+,k,},t,{ J 1 }
#    =,=,=,=,=,=,=,=,+,p,},t,{ J 1 }
#    =,=,=,=,=,=,=,=,+,xr,},t,{ J 1 }
#    =,=,=,=,=,=,=,=,+,l,},t,{ J 1 }
#    =,=,=,=,=,=,=,=,+,h,},t,{ J 1 }
#    =,=,=,=,=,=,=,=,+,fr,},t,{ J 1 }
# k=3, 1 Neighbor(s) at distance: 0.325529
#    +,m,K,=,-,d,@,=,-,pr,a,t,{ J 1 }
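Output of this shape can be read back with a few lines of Python. The parser below is a hedged sketch written against the layout shown above (one instance line, then "# k=..." headers, each followed by its neighbor lines); the function names and the returned dictionary layout are invented for this example, and real Timbl output may differ in whitespace or in corner cases such as class labels containing braces.

```python
import re

HEADER = re.compile(r"#\s*k=(\d+),\s*(\d+) Neighbor\(s\) at distance:\s*(\S+)")

def parse_distribution(text):
    """Turn 'E 1, P 1' into {'E': 1.0, 'P': 1.0}."""
    counts = {}
    for item in text.split(","):
        label, count = item.split()
        counts[label] = counts.get(label, 0.0) + float(count)
    return counts

def parse_timbl_output(lines):
    """Return one dict per test instance: gold class, prediction, neighbor sets."""
    instances = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:                       # start of a new k-th distance
            instances[-1]["sets"].append((float(header.group(3)), {}))
        elif line.startswith("#"):       # a neighbor: merge its class counts
            body = line[line.rfind("{") + 1:line.rfind("}")]
            merged = instances[-1]["sets"][-1][1]
            for label, count in parse_distribution(body).items():
                merged[label] = merged.get(label, 0.0) + count
        else:                            # the instance line itself
            fields = line[:line.rfind("{")].strip().split(",")
            instances.append({"gold": fields[-2], "predicted": fields[-1], "sets": []})
    return instances

SAMPLE = """=,=,=,=,+,k,u,=,-,bl,u,m,E,E { E 4.00000, P 3.00000 } 0.0000000000000
# k=1, 1 Neighbor(s) at distance: 0.00000
# =,=,=,=,+,k,u,=,-,bl,u,m,{ P 1 }
# k=2, 1 Neighbor(s) at distance: 0.0594251
# =,=,=,=,+,m,K,=,-,bl,u,m,{ E 1 }
# k=3, 5 Neighbor(s) at distance: 0.103409
# =,=,=,=,+,m,O,z,-,bl,u,m,{ E 1, P 1 }
# =,=,=,=,+,st,K,l,-,bl,u,m,{ E 1 }
# =,=,=,=,+,m,y,r,-,bl,u,m,{ E 1, P 1 }
+,m,I,=,-,d,A,G,-,d,},t,J,J { J 8.00000 } 0.28274085738293
# k=1, 1 Neighbor(s) at distance: 0.282741
# -,v,@,r,+,v,A,l,-,p,},t,{ J 1 }
"""

parsed = parse_timbl_output(SAMPLE.splitlines())
print(parsed[0]["gold"], parsed[0]["sets"])
```

Using `rfind("{")` rather than a regular expression on the whole line keeps feature values such as "}" (which occur in the diminutive data) from being mistaken for the end of the class distribution.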

Idea (cont.) • From this output, you can • read the knn members and their distances • repeat classification for smaller k’s and other distance weightings • without calculating the knn sets and their distances again • Hence • classification is potentially much faster • exhaustively trying all combinations of k and distance weighting becomes feasible
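The exhaustive search then reduces to two nested loops over k and the weighting method. The sketch below is an illustration under assumptions, not the actual exhaustkd code: the weighting formulas are the textbook ones, ties are broken alphabetically instead of by increasing k, and the two test instances are the ones from the example output.

```python
import math

# Weight of a neighbor at distance d, given the nearest (lo) and farthest (hi)
# distance in the current neighborhood. Formulas are assumptions for illustration.
WEIGHTINGS = {
    "Z": lambda d, lo, hi: 1.0,
    "ID": lambda d, lo, hi: 1.0 / (d + 1e-9),
    "IL": lambda d, lo, hi: 1.0 if hi == lo else (hi - d) / (hi - lo),
    "ED1.0": lambda d, lo, hi: math.exp(-1.0 * d),
}

def vote(sets, k, weight):
    """Weighted majority class over the neighbors at the first k distances."""
    lo, hi = sets[0][0], sets[k - 1][0]
    totals = {}
    for dist, counts in sets[:k]:
        for label, n in counts.items():
            totals[label] = totals.get(label, 0.0) + weight(dist, lo, hi) * n
    return max(sorted(totals), key=totals.get)  # alphabetical on ties (simplification)

def sweep(instances, max_k):
    """Accuracy for every (k, weighting) pair; instances are (gold, sets) tuples."""
    table = {}
    for k in range(1, max_k + 1):
        for name, weight in WEIGHTINGS.items():
            correct = sum(vote(sets, k, weight) == gold for gold, sets in instances)
            table[(k, name)] = correct / len(instances)
    return table

instances = [
    ("E", [(0.0, {"P": 1}), (0.0594251, {"E": 1}), (0.103409, {"E": 3, "P": 2})]),
    ("J", [(0.282741, {"J": 1}), (0.311890, {"J": 6}), (0.325529, {"J": 1})]),
]
table = sweep(instances, max_k=3)
for (k, name), acc in sorted(table.items()):
    print(k, name, acc)
```

Each cell costs only a pass over the stored neighbor sets, which is why the whole table is cheap compared with rerunning the classifier for every setting.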

Implementation • Python scripts • exhaustkd • cvexhaustkd • Requirements: • Python 2.1 or later • Expenv libraries • Input • List of classes • Timbl output file(s) produced with +vn+di+db and a high k • Some option settings • Output • Tables with performance measures (accuracy, recall, precision, F-score) for all combinations of k and d

$ exhaustkd -h
usage: exhaustkd [options] CLASSES FILE
       exhaustkd [options] CLASSES <FILE
purpose:
  Timbl's +vn+di+db option causes it to add the nearest neighbors and
  their distances to its output. This output can then be passed on to
  exhaustkd to perform an exhaustive classification over a range of k's
  and distance weighting metrics. It will always try Z, ID, and IL.
  Optionally, various settings of ED can be tried.
  exhaustkd tabulates the performance for all settings. Which evaluation
  metrics are reported depends on the PATTERN of the -o option. A PATTERN
  is a comma-separated list of one or more of the following symbols:
    A  = accuracy
    K  = kappa
    P  = combined precision
    R  = combined recall
    F  = combined F-score
    pC = precision on class C
    rC = recall on class C
    fC = F-score on class C
args:
  CLASSES  classes as a comma-separated list
  FILE     classifier output file

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -aFLOAT1,FLOAT2,...,FLOATn, --alphas=FLOAT1,FLOAT2,...,FLOATn
                        values to try as the alpha constant in the exponential decay metric (default is none)
  -bFLOAT, --beta=FLOAT
                        beta in F-score calculation (default is 1.0)
  -dSTRING, --delimiter=STRING
                        column delimiter (default is ' ')
  -f, --full-output     output all available evaluation metrics for every setting
  -kINT, --max-k=INT    the maximum number of nearest neighbors to try (default is 1)
  -nINT, --n-best=INT   the number of settings reported in the n-best list (default is 10)
  -oPATTERN, --output=PATTERN
                        output pattern (default is 'A,P,R,F')
  -rINT, --random-seed=INT
                        seed for random generator (default is current system time)
  -t{once|continue|random}, --tie-resolution={once|continue|random}
                        tie resolution by increasing k once (default), by increasing k continuously, or by choosing randomly
  -%, --percent         output in percentages

examples:
  exhaustkd -k5 -a1.0,2.0 X,Y,Z output_file
      perform an exhaustive classification into classes X,Y,Z with k from 1 to 5, and distance metrics Z, ID, IL, ED1.0 and ED2.0
  exhaustkd -% -opX,rX,fX X,Y,Z <output_file
      output precision, recall, and F-score percentages on class X
  exhaustkd -k10 -tcontinue -oA X,Y,Z output_file
      output accuracy when using continuous tie resolution up to k=10
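The per-class metrics named in the -o pattern (pC, rC, fC) and the role of the -b beta option can be illustrated from a confusion matrix. This is a sketch using the standard definitions of precision, recall and F-beta; the function names and the confusion counts are invented for the example, and how exhaustkd computes its "combined" scores is not covered here.

```python
def class_scores(confusion, cls, beta=1.0):
    """Precision, recall and F-beta on one class.

    confusion maps (gold, predicted) pairs to counts.
    """
    tp = confusion.get((cls, cls), 0)
    predicted = sum(n for (_gold, pred), n in confusion.items() if pred == cls)
    actual = sum(n for (gold, _pred), n in confusion.items() if gold == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

def accuracy(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(confusion.values())
    return sum(n for (gold, pred), n in confusion.items() if gold == pred) / total

# Made-up counts for a two-class task with classes "B" and "-".
confusion = {("B", "B"): 8, ("B", "-"): 2, ("-", "B"): 4, ("-", "-"): 86}
print(accuracy(confusion), class_scores(confusion, "B"))
```

With beta above 1 the F-score leans towards recall, with beta below 1 towards precision.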

Example: diminutive • Commands • Timbl -f dimin.train -t dimin.test -o out -k5 +vn+db+di • exhaustkd -d, -k5 -a1,5 -oA -% P,T,J,E,K out

================================================================================
Accuracy (%)
================================================================================
k     Z       ID      IL      ED1.0   ED5.0
1     96.74   96.74   96.74   96.74   96.74
2     97.37   96.74   96.63   96.42   96.63
3     96.42   96.42   97.05   96.53   96.95
4     95.05   95.68   96.42   95.68   96.32
5     95.37   95.26   96.42   95.47   95.79
Rank:   Score:   k:   d:
1       97.37    1    Z
2       97.05    2    IL
3       96.95    2    ED5.0
4       96.74    1    ID
5       96.74    0    Z
6       96.74    0    IL
7       96.74    0    ID

Example: Prosit breaks • Commands: • For each of the 10 folds, a Timbl run with -k31 -o out0?? • cvexhaustkd -a1,5 -k30 -% -oA,P,R,F,pB,rB,fB B,- out0?? >exhaustive-report

================================================================================
Accuracy (%)
================================================================================
k:   Z:            ID:           IL:           ED1.0:        ED5.0:
1    94.83 0.35    94.83 0.35    94.83 0.35    94.83 0.35    94.83 0.35
2    96.00 0.44    94.83 0.35    94.83 0.35    94.83 0.35    94.83 0.35
3    96.00 0.44    96.00 0.44    95.89 0.48    96.01 0.44    95.99 0.43
4    96.28 0.41    96.02 0.47    95.99 0.47    96.02 0.46    96.02 0.47
5    96.28 0.41    96.29 0.41    96.22 0.49    96.28 0.41    96.28 0.40
6    96.37 0.44    96.28 0.41    96.24 0.41    96.27 0.42    96.27 0.41
7    96.37 0.44    96.38 0.44    96.36 0.46    96.37 0.44    96.36 0.45
8    96.41 0.45    96.42 0.47    96.37 0.44    96.41 0.47    96.41 0.47
9    96.41 0.45    96.42 0.44    96.42 0.46    96.41 0.44    96.43 0.43
10   96.40 0.42    96.45 0.43    96.45 0.45    96.45 0.43    96.45 0.43
11   96.40 0.42    96.41 0.41    96.46 0.46    96.40 0.42    96.43 0.43
12   96.43 0.41    96.44 0.44    96.46 0.47    96.43 0.45    96.45 0.44
13   96.43 0.41    96.43 0.41    96.47 0.44    96.42 0.41    96.44 0.42
14   96.44 0.47    96.46 0.45    96.47 0.46    96.46 0.45    96.48 0.46
15   96.44 0.47    96.44 0.47    96.47 0.47    96.44 0.47    96.46 0.46
16   96.41 0.49    96.45 0.47    96.45 0.46    96.45 0.47    96.47 0.47
17   96.41 0.49    96.41 0.48    96.46 0.47    96.40 0.49    96.45 0.47
18   96.40 0.48    96.45 0.47    96.47 0.47    96.44 0.47    96.46 0.47
19   96.40 0.48    96.40 0.47    96.47 0.48    96.39 0.48    96.44 0.45
20   96.37 0.49    96.41 0.46    96.48 0.48    96.41 0.47    96.43 0.46
21   96.37 0.49    96.37 0.48    96.48 0.49    96.37 0.49    96.40 0.47
22   96.38 0.49    96.39 0.46    96.48 0.47    96.40 0.47    96.41 0.46
23   96.38 0.49    96.38 0.49    96.48 0.49    96.38 0.49    96.41 0.48
24   96.39 0.51    96.41 0.48    96.48 0.50    96.41 0.48    96.43 0.47
25   96.39 0.51    96.39 0.50    96.47 0.49    96.39 0.51    96.42 0.49
26   96.40 0.48    96.43 0.50    96.49 0.49    96.43 0.49    96.45 0.49
27   96.39 0.48    96.40 0.48    96.48 0.49    96.40 0.48    96.43 0.49
28   96.40 0.46    96.41 0.48    96.48 0.49    96.41 0.48    96.41 0.47
29   96.40 0.46    96.40 0.46    96.48 0.49    96.40 0.46    96.43 0.47
30   96.39 0.49    96.43 0.46    96.48 0.49    96.42 0.47    96.43 0.47

Discussion: Time • Normal time: • An average 10-fold CV Timbl experiment on the Prosit breaks requires about 30 hours (min. 20 to max. 50 hours) • Here we have k x distance weighting = 30 x 5 = 150 CV experiments • Thus, this would normally require about 150 x 30 = 4500 hours, or roughly 188 days • Time with exhaustkd: • A single 10-fold CV experiment with dumping of the knn sets requires about 30 hours • Running cvexhaustkd takes about 3 minutes (!) • Therefore, we have reduced the required time by a factor of about 150 • (BTW the “seconds taken” reported by Timbl are a little off :-)
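The arithmetic behind the factor of 150 can be checked in a few lines, using only the figures quoted on this slide (the variable names are ours):

```python
hours_per_cv = 30            # one 10-fold CV Timbl experiment
settings = 30 * 5            # 30 values of k times 5 distance weightings

naive_hours = settings * hours_per_cv      # rerun Timbl for every setting
naive_days = naive_hours / 24

exhaust_hours = hours_per_cv + 3 / 60      # one Timbl run plus 3 minutes of cvexhaustkd
speedup = naive_hours / exhaust_hours

print(naive_hours, naive_days, round(speedup, 1))
```

The 3 minutes for cvexhaustkd are negligible next to the single Timbl run, so the speedup is essentially the number of settings tried.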

Discussion: Memory & Space • Memory: • Exhaustkd works locally, reading an instance and its nn’s from file, classifying, and adding the result to a confusion matrix • Consumes very little memory (2-5MB) • Disk Space: • writing knn’s to output can take a lot of space • E.g. upsampled Prosit break data with k=31 requires about 1.8GB

Discussion: limitations • Obviously, the k of exhaustkd can never be larger than the real k (= the k of the original Timbl experiment) • Actually, the k of exhaustkd must be at least one less than the real k • Reason: tie resolution • In case of a tie, k is increased by one • Also, the output of exhaustkd may differ slightly from Timbl’s output • Reason: tie resolution • If a tie is still unresolved after increasing k, Timbl resorts to a random choice • The exact random behaviour cannot be reproduced by exhaustkd
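The three tie-resolution strategies (once, continue, random) can be sketched as follows. This is an illustration under assumptions, not the exhaustkd source: it uses unweighted votes only, the function names are invented, and the neighbor sets are those of the first diminutive instance from the example output.

```python
import random

def tied_winners(sets, k):
    """Classes sharing the highest unweighted vote over the first k distances."""
    totals = {}
    for _distance, counts in sets[:k]:
        for label, n in counts.items():
            totals[label] = totals.get(label, 0) + n
    best = max(totals.values())
    return sorted(label for label, n in totals.items() if n == best)

def resolve(sets, k, strategy="once", rng=random):
    """Pick a class at k, resolving ties by the given strategy."""
    tied = tied_winners(sets, k)
    if strategy == "random":
        return rng.choice(tied)
    # "once": look one distance further; "continue": keep going while tied.
    while len(tied) > 1 and k < len(sets):
        k += 1
        tied = tied_winners(sets, k)
        if strategy == "once":
            break
    return rng.choice(tied)   # random fallback if the tie is still unresolved

sets = [(0.0, {"P": 1}), (0.0594251, {"E": 1}), (0.103409, {"E": 3, "P": 2})]
print(resolve(sets, 2, "once"))   # tie between E and P at k=2, settled at k=3
```

The example shows why the dump needs one distance more than the largest k to be evaluated: the tie at k=2 can only be settled because the neighbors at k=3 were written out. The random fallback is also the reason results cannot match Timbl's exactly.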

Discussion: limitations (cont.)
• Timbl output (accuracy and #ties):
k    Z              ID            IL            ED1.0
1    96.74 4/5      96.74 4/5     96.63 3/5     96.74 4/5
2    97.37 10/12    96.74 0/1     96.74 4/5     96.42 0/1
3    96.42 6/8      96.42 0/1     96.74 0/1     96.53 1/1
4    95.05 3/5      95.68 0/0     96.74 1/1     95.68 0/0
5    95.37 5/5      95.26 0/0     96.42 0/0     95.47 0/0
• Exhaustkd output (average accuracy and SD):
k:   Z:             ID:           IL:           ED1.0:
1    96.78 0.05     96.74 0.00    96.74 0.00    96.74 0.00
2    97.25 0.09     96.74 0.00    96.63 0.00    96.42 0.00
3    96.46 0.05     96.42 0.00    97.05 0.00    96.53 0.00
4    95.05 0.00     95.68 0.00    96.42 0.00    95.68 0.00
5    95.37 0.00     95.26 0.00    96.42 0.00    95.47 0.00

Discussion: Limitations (cont.) • Limit on number of nn’s: • Currently, if the number of nn’s exceeds 500, Timbl will only write the first 500 • Because you don’t want to dump the whole instance base (!) • However, it would be nice if this limit were an option

Discussion: Plans • Exhaustkd can be faster: • Code not really profiled yet • Code can be (partly) compiled to C • Exhaustkd can be combined with methods that optimise the feature weighting (-w) and feature metric (-m) options • Paramsearch/Iterative Deepening • Experiment with exhaustkd’s 3 options for tie resolution: • Random • Increase k once • Increase k continuously until the tie is resolved • Wild plans: • Can exhaustkd be a part of Timbl?
