Given a relationship matrix, r, between 2 entities, u and m (e.g., the Netflix "Ratings" matrix, r_u,m = the rating user u gives to movie m). The SVD Recommender uses the ratings in r to train 2 smaller matrices: a user-feature matrix, U, and a feature-movie matrix, M. Once U and M are trained, SVD quickly predicts the rating u would give to m as the dot product p_u,m = U_u o M_m. Starting with a few features, the vector U_u = the extents to which user u "likes" each feature; M_m = the level at which each feature characterizes movie m.

SVD trains U and M using gradient descent minimization of sse (sum of squared errors). With F = (U_u, M_m) the feature vector, p_u,m = U_u o M_m the prediction, r_u,m the rating, e_u,m = r_u,m - p_u,m the error, and G = Gradient(sse):

  sse(F) = SUM_(u,m) in r ( U_u o M_m - r_u,m )^2
  G = ( SUM_(u1,n) in r e_u1,n M_n, ..., SUM_(uL,n) in r e_uL,n M_n, ..., SUM_(v,m1) in r e_v,m1 U_v, ..., SUM_(v,mL) in r e_v,mL U_v )

Line search along G: define F(t) ≡ ( U_u + t G_u, M_m + t G_m ), so p_u,m(t) ≡ ( U_u + t G_u ) o ( M_m + t G_m ) and

  sse(t) = SUM_(u,m) in r ( p_u,m(t) - r_u,m )^2 = SUM_(u,m) in r ( (U_u + t G_u) o (M_m + t G_m) - r_u,m )^2.

Setting the derivative to zero:

  0 = d sse(t)/dt = SUM_(u,m) in r 2( U_u o M_m - r_u,m + t (G_m o U_u + G_u o M_m) + t^2 G_u o G_m ) ( G_m o U_u + G_u o M_m + 2t G_u o G_m ).

Abbreviating e = U_u o M_m - r_u,m, h = G_m o U_u + G_u o M_m, g = G_u o G_m, each summand is (e + t h + t^2 g)(h + 2t g), so

  0 = t^3 2 SUM g^2 + t^2 3 SUM h g + t SUM (2 e g + h^2) + SUM e h,

i.e., a cubic a t^3 + b t^2 + c t + d = 0 with a = 2 SUM g^2, b = 3 SUM h g, c = SUM (2 e g + h^2), d = SUM e h. Solving a t^3 + b t^2 + c t + d = 0 by Cardano's formula, with p = -b/(3a), q = p^3 + (bc - 3ad)/(6a^2), r = c/(3a):

  t = ( q + [q^2 + (r - p^2)^3]^(1/2) )^(1/3) + ( q - [q^2 + (r - p^2)^3]^(1/2) )^(1/3) + p.

Worked example: with two ratings, r_u,m = 2 and r_v,n = 4, and all feature values initialized to 1, the coefficients come out a = 164, b = -168, c = -16, d = 20 (p = 0.3414, q = -0.004, r = -0.032), and the spreadsheet's formula gives descent step t = 0.0365. Synthetic division by the root t = 1 confirms 164 t^3 - 168 t^2 - 16 t + 20 = (t - 1)(164 t^2 - 4 t - 20).
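The coefficient sums above can be checked with a short script (a minimal Python sketch, not the author's spreadsheet or pTree implementation; the two-rating toy data and the e, h, g abbreviations are from the slide, and G here is the raw gradient with components SUM e*M, where e = U_u o M_m - r):

```python
# Toy data from the slide: two ratings on a 1-feature model.
ratings = {('u', 'm'): 2.0, ('v', 'n'): 4.0}
U = {'u': 1.0, 'v': 1.0}   # user-feature values
M = {'m': 1.0, 'n': 1.0}   # movie-feature values

def sse(U, M):
    return sum((U[u] * M[m] - r) ** 2 for (u, m), r in ratings.items())

def gradient(U, M):
    # G components as on the slide: G_u = SUM_m e * M_m with e = p - r.
    GU = {u: 0.0 for u in U}
    GM = {m: 0.0 for m in M}
    for (u, m), r in ratings.items():
        e = U[u] * M[m] - r
        GU[u] += e * M[m]
        GM[m] += e * U[u]
    return GU, GM

def cubic_coeffs(U, M, GU, GM):
    # sse(t) along F(t) = (U + t*GU, M + t*GM) is quartic, so sse'(t) is
    # (proportional to) a*t^3 + b*t^2 + c*t + d with these sums.
    a = b = c = d = 0.0
    for (u, m), r in ratings.items():
        e = U[u] * M[m] - r
        h = GM[m] * U[u] + GU[u] * M[m]
        g = GU[u] * GM[m]
        a += 2 * g * g
        b += 3 * h * g
        c += 2 * e * g + h * h
        d += e * h
    return a, b, c, d

GU, GM = gradient(U, M)
a, b, c, d = cubic_coeffs(U, M, GU, GM)
print(a, b, c, d)   # -> 164.0 -168.0 -16.0 20.0, the slide's coefficients
```

For this toy data, a + b + c + d = 0, confirming the derivative root at t = 1 reported below.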
Expanding one summand of the derivative: 2( U_u o M_m + t G_m o U_u + t G_u o M_m + t^2 G_u o G_m - r_u,m ). Something is wrong with the cubic root formula!
There are 3 zeros of the derivative function: -0.337, 0.361 and 1.000. The function values there are 0.047, 18.47 and 4.0. See the attachment and the chart: the blue curve is the derivative and the red curve is the function.
[Spreadsheet check of the cubic coefficients and roots for the two-rating example omitted.]

Given: 0 = ax^3 + bx^2 + cx + d.
Step 1: Divide through by a: 0 = x^3 + ex^2 + fx + g, where e = b/a, f = c/a, g = d/a.
Step 2: Do the horizontal shift x = z - e/3, which removes the square term, leaving 0 = z^3 + pz + q, where z = x + e/3, p = f - e^2/3, q = 2e^3/27 - ef/3 + g.
Step 3: Introduce u and t with u*t = (p/3)^3 and u - t = q, so
  t = -q/2 ± (1/2) sqrt(q^2 + 4p^3/27)
  u =  q/2 ± (1/2) sqrt(q^2 + 4p^3/27).
Then 0 = z^3 + pz + q has a root at z = cuberoot(t) - cuberoot(u). Both 'u' and 't' carry a '±' sign; it doesn't matter which you pick to plug into the above equation, as long as you pick the same sign for each.
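The three steps can be sketched as follows (a hypothetical helper, not from the slides; cmath handles the case q^2 + 4p^3/27 < 0, where all three roots are real and a real-arithmetic square root would fail, which may be the "something wrong" noted above):

```python
import cmath

def cubic_root(a, b, c, d):
    """One root of a*x^3 + b*x^2 + c*x + d = 0 (a != 0) via the steps above."""
    # Step 1: divide through by a.
    e, f, g = b / a, c / a, d / a
    # Step 2: shift x = z - e/3, leaving z^3 + p*z + q = 0.
    p = f - e * e / 3
    q = 2 * e ** 3 / 27 - e * f / 3 + g
    # Step 3: u - t = q and u*t = (p/3)^3; cmath handles q^2 + 4p^3/27 < 0.
    s = cmath.sqrt(q * q + 4 * p ** 3 / 27)
    t = (-q + s) / 2
    if t == 0:                       # then p = 0 and z^3 = -q
        z = -((q + s) / 2) ** (1 / 3)
    else:
        A = t ** (1 / 3)             # principal complex cube root of t
        z = A - p / (3 * A)          # the matching cube root of u is p/(3A)
    return z - e / 3                 # undo the shift

# The slide's cubic 164 t^3 - 168 t^2 - 16 t + 20 has three real roots
# (near -0.337, 0.361 and 1.000); cubic_root returns one of them.
root = cubic_root(164, -168, -16, 20)
```

Note the cube roots must be paired so their product is p/3; taking B = p/(3A) enforces that, rather than taking two independent principal cube roots.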
Linear and pseudo-binary line search on the step size t: [spreadsheet traces omitted - each round evaluates mse(t) at a sequence of trial steps (t = 0.1, 0.2, ..., 0.7, then bisecting, e.g., t = 0.25, 0.225, 0.2375), keeps the t with the lowest mse (e.g., t = 0.338 gives mse 0.023, then t = 0.23 gives 0.001), updates the feature vector fv <- fv + tG, and repeats on the new gradient.]

Since calculus isn't working (to find the min mse along F(t) = F + tG), will this type of binary search be efficient enough? Maybe so! In every direction, the mse(t) equation is quartic (degree 4), so the general shape is as below (where any subset of the local extremes can coalesce).
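A minimal sketch of such a search (Python; sse_along is the quartic sse(t) of the slides' two-rating example along the descent direction, and line_search_min is a made-up helper that assumes the bracket holds a single local minimum, which a general quartic need not satisfy):

```python
def sse_along(t):
    # sse along the descent line for the two-rating example:
    # U_u = M_m = 1 + t, U_v = M_n = 1 + 3t; targets 2 and 4.
    return ((1 + t) ** 2 - 2) ** 2 + ((1 + 3 * t) ** 2 - 4) ** 2

def line_search_min(f, lo, hi, rounds=60):
    """Pseudo-binary (ternary) bracketing search for the minimizer of f on
    [lo, hi]; assumes the interval contains a single local minimum."""
    for _ in range(rounds):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

t_best = line_search_min(sse_along, 0.0, 1.0)
# t_best ≈ 0.337 with sse ≈ 0.047 -- the same minimum the cubic's root gives.
```

Each round discards a third of the bracket, so 60 rounds shrink it by (2/3)^60, far below floating-point resolution.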
[8-movie x 20-user toy ratings grid, working-error grid, and square-error grid omitted.]

The Lotus 1-2-3 macro driving the training loop:

\a:  /rvnfv~fv~{goto}L~{edit}+.005~/XImse<omse-.00001~/xg\a~
     .001~{goto}se~/rvfv~{end}{down}{down}~
     /xg\a~

Key cell formulas:

A22:  +A2-A$10*$U2                  /* error for u=a, m=1 */
A30:  +A10+$L*(A$22*$U$2+A$24*$U$4+A$26*$U$6+A$29*$U$9)   /* updates f(u=a) */
U29:  +U9+$L*(($A29*$A$30+$K29*$K$30+$N29*$N$30+$P29*$P$30)/4)  /* updates f(m=8) */
AB30: +U29                          /* copies the f(m=8) update into the new feature vector, nfv */
W22:  @COUNT(A22..T22)              /* counts the actual ratings (users) for m=1 */
X22:  [W3] @SUM(W22..W29)           /* adds the rating counts for all 8 movies = training count */
AD30: [W9] @SUM(SE)/X22             /* averages the se's, giving the mse */
A52:  +A22^2                        /* squares all the individual errors */

Macro line by line:
/rvnfv~fv~                  copies fv to nfv after converting fv to values
{goto}L~{edit}+.005~        increments L by .005
/XImse<omse-.00001~/xg\a~   if mse is still decreasing, recalc mse with the new L
.001~                       reset L = .001 for the next round
/xg\a~                      start over with the next round
{goto}se~/rvfv~{end}{down}{down}~   "value-copies" fv to the output list

Iteration log (learning rate L, mse): starting from L = 0.125, mse = 0.225073, the rate climbs to 0.151 as mse falls to 0.195222, then resets to 0.001 while mse creeps down to 0.195211.

Notes: In 2 rounds mse is as low as Funk gets it in 2000 rounds. After 5 rounds mse is lower than ever before (and appears to be bottoming out). I know I shouldn't hardcode parameters! Experiments should be done to optimize this line search (e.g., with some binary search for a low mse). Since we have the resulting individual square errors for each training pair, we could run this, then mask off the pairs with se(u,m) > Threshold and run it again on the pairs that have not yet achieved a low se. But what do I do with the two resulting feature vectors? Do I treat it like a two-feature SVD, or do I use some linear combination of the two resulting predictions (or it could be more than two)? We need to test which works best (or other modifications) on Netflix data. Maybe, for those test pairs whose training row and column have some high errors, we apply the second feature vector instead of the first? Maybe we invoke CkNN for test pairs in this case (or use all 3 and a linear combo)? This is powerful! We need to optimize the calculations using pTrees!!!
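A hypothetical Python rendering of the spreadsheet logic (epoch and train are made-up names; the /4 average in cell U29 becomes a division by each movie's rating count, and the macro's learning-rate rule becomes the L update in train):

```python
# r maps (user, movie) -> rating; fu, fm are the two halves of the feature vector fv.
def epoch(r, fu, fm, L):
    """One training round at learning rate L (assumes every movie has a rating)."""
    # A22-style cells: error = rating - f_user * f_movie.
    err = {(u, m): rating - fu[u] * fm[m] for (u, m), rating in r.items()}
    # A30-style cells: update each user feature from its errors (old movie features).
    new_fu = {u: fu[u] + L * sum(err[u2, m] * fm[m] for (u2, m) in r if u2 == u)
              for u in fu}
    # U29-style cells: update each movie feature from the NEW user features,
    # averaged over that movie's rating count (the /4 in cell U29).
    new_fm = {}
    for m in fm:
        pairs = [(u2, m2) for (u2, m2) in r if m2 == m]
        s = sum(err[u2, m] * new_fu[u2] for (u2, _) in pairs)
        new_fm[m] = fm[m] + L * s / len(pairs)
    mse = sum(e * e for e in err.values()) / len(err)   # AD30: @SUM(SE)/count
    return new_fu, new_fm, mse

def train(r, fu, fm, rounds=10):
    """The macro's rate rule: raise L by .005 while mse keeps dropping, else reset."""
    omse, L = float('inf'), 0.001
    for _ in range(rounds):
        fu, fm, mse = epoch(r, fu, fm, L)
        L = L + 0.005 if mse < omse - 1e-5 else 0.001
        omse = mse
    return fu, fm, omse
```

On a tiny ratings dictionary this drives the mse down from its all-ones starting value, mirroring the iteration log above.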
Research Summary We data-mine big data (big data ≡ trillions of rows and, sometimes, thousands of columns, which can complicate the data mining of those trillions of rows). How do we do it? I structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees") and process those pTrees horizontally (because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures). As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time. What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are important areas of data mining also, and they are related to classification. The purpose of clustering is usually to create [or improve] a training table. It is also used for anomaly detection, a huge area in data mining. ARM is used to data-mine more complex data (relationship matrices between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rents (or based on their ratings of items). To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others). I.e., we let near-neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information"[2], by cognitive psychologist George A. Miller of Princeton University's Department of Psychology, published in Psychological Review, is one of the most highly cited papers in psychology. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
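The vertical-bit-column idea can be illustrated in a few lines (a toy sketch with plain Python lists; real pTrees compress the bit columns into trees and run the AND/ORs on machine words):

```python
# Slice an integer column into vertical bit columns, then answer the
# predicate "value > threshold" with bitwise logic over whole columns,
# instead of scanning row by row.
def bit_slices(column, bits):
    """slices[i][j] = bit i (0 = most significant) of column[j]."""
    return [[(v >> (bits - 1 - i)) & 1 for v in column] for i in range(bits)]

def gt_mask(slices, threshold, bits):
    """Row mask for 'value > threshold', built only from the bit columns."""
    n = len(slices[0])
    gt = [0] * n                    # rows already known to be greater
    eq = [1] * n                    # rows still tied with the threshold prefix
    for i in range(bits):
        t = (threshold >> (bits - 1 - i)) & 1
        for j in range(n):
            if t == 0:              # a 1 bit where the threshold has 0 wins
                gt[j] |= eq[j] & slices[i][j]
            eq[j] &= ~(slices[i][j] ^ t) & 1
    return gt

col = [3, 7, 5, 6, 2]
mask = gt_mask(bit_slices(col, 3), 5, 3)
# mask == [0, 1, 0, 1, 0]: only the rows holding 7 and 6 exceed 5
```

The same mask logic underlies the "one horizontal program across pTrees" classification step described later.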
Some current pTree Data Mining research projects
- FAUST pTree PREDICTOR/CLASSIFIER (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching).
- FAUST pTree CLUSTER/ANOMALASER.
- pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing BigData.
- pTree Text Mining: capture the reading sequence, not just the term-frequency matrix, of a text corpus (lossless capture).
- Secure pTreeBases: anonymize the identities of the individual pTrees and randomly pad them to mask their initial bit positions.
- pTree Algorithmic Tools: an expanded algorithmic tool set is being developed to include quadratic tools and even higher-degree tools.
- pTree Alternative Algorithm Implementations: implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders-of-magnitude performance increases.
- pTree O/S Infrastructure: computers and operating systems are designed to do logical operations (AND, OR, ...) rapidly; exploit this for pTree processing speed.
- pTree Recommenders: Singular Value Decomposition (SVD) recommenders, pTree Near Neighbor Recommenders, and pTree ARM Recommenders.
FAUST clustering (the unsupervised part of FAUST)
This class of partitioning/clustering methods relies on choosing a dot-product projection so that, if we find a gap in the F-values, we know that the 2 sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.

The Dot Product Projection (DPP): check for gaps in DPP_d(y) or DPP_pq(y) ≡ (y-p) o (p-q)/|p-q| (parameterized over a grid of unit vectors d = (p-q)/|p-q| in Sphere_n).
The Coordinate Projection functionals (e_j): check gaps in e_j(y) ≡ y o e_j = y_j.
The Square Distance functional (SD): check gaps in SD_p(y) ≡ (y-p) o (y-p) (parameterized over a grid of p in R^n).
The Dot Product Radius (DPR): check gaps in DPR_pq(y) ≡ sqrt( SD_p(y) - DPP_pq(y)^2 ).
The Square Dot Product Radius (SDPR): SDPR_pq(y) ≡ SD_p(y) - DPP_pq(y)^2 (easier pTree processing).

DPP-KM: 1. Check gaps in DPP_p,d(y) (over grids of p and d?). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 2. Check gaps in DPP_p,d(y) (grids of p and d?) against the density of the subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once DPP ceases to be effective.
DPP-SD: 3. Check gaps in DPP_p,d(y) (over a p-grid and a d-grid) and in SD_p(y) (over a p-grid). 3.1 Check sparse-end distances against subcluster density. (DPP_pd and SD_p share construction steps!)
SD-DPP-SDPR: DPP_pq, SD_p and SDPR_pq share construction steps:
  SD_p(y) ≡ (y-p) o (y-p) = yoy - 2 yop + pop
  DPP_pq(y) ≡ (y-p) o d = yod - pod = (1/|p-q|)(yop - yoq) - pod   (the constant pod does not affect gaps)
Calculate yoy, yop and yoq concurrently; then the constant multiples 2*yop and (1/|p-q|)*yop concurrently; then add/subtract. Calculate DPP_pq(y)^2, then subtract it from SD_p(y).
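The shared-construction point can be sketched as follows (a plain-Python stand-in for the columnwise pTree computation; functionals is a made-up name). The products yoy, yop and yoq are computed once per point and then reused for all three functionals:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def functionals(Y, p, q):
    """For each y in Y, return (SD_p(y), DPP_pq(y), SDPR_pq(y))."""
    pq = [pi - qi for pi, qi in zip(p, q)]       # p - q
    L = dot(pq, pq) ** 0.5                       # |p-q|
    pop, popq = dot(p, p), dot(p, pq)
    out = []
    for y in Y:
        yoy, yop, yoq = dot(y, y), dot(y, p), dot(y, q)   # shared pieces
        sd = yoy - 2 * yop + pop                 # SD_p(y) = (y-p)o(y-p)
        dpp = (yop - yoq) / L - popq / L         # DPP_pq(y) = (y-p)o(p-q)/|p-q|
        sdpr = sd - dpp * dpp                    # SDPR_pq(y)
        out.append((sd, dpp, sdpr))
    return out
```

For example, with p = (0,0), q = (1,0) and y = (3,4), SD is 25, DPP is -3 and SDPR is 16 (the squared perpendicular distance to the p-q line).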
FAUST DPP CLUSTER on IRIS with DPP(y) = (y-p)o(q-p)/|q-p|, where p is the min (n) corner and q is the max (x) corner of the circumscribing rectangle (midpoints or the average (a) corner are used also). IRIS: 150 irises (rows), 4 columns (Sepal Length, Sepal Width, Pedal Length, Pedal Width); the first 50 are Setosa (s), the next 50 Versicolor (e), the next 50 Virginica (i).

[Raw attribute/DPP value listings, F-value histograms and pairwise distance tables omitted.] With p = nnnn and q = xxxx: CL1 = {F < 17} (all 50 Setosa); 17 < F < 23 gives CL2 = {e8, e11, e44, e49, i39} (gap >= 4); 23 < F gives CL3 (46 Versicolor, 49 Virginica). Checking distances in [12, 28] flags s16, i39, e49, e11, {e8, e44} and i6, i10, i18, i19, i23, i32 as outlier candidates; checking [0, 4] distances flags s42 as a Setosa outlier; checking [57, 68] distances flags i10, i36, i19, i32, i18 and {i6, i23} as outliers. With the outliers removed and p = aaax, q = aaan, the thinning at F in [6, 7] splits CL3 into CL3.1 = {F < 6.5} (44 Versicolor, 4 Virginica) and CL3.2 = {F > 6.5} (2 Versicolor, 39 Virginica), with no sparse ends.

Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the StD of the F-values to ensure strong gaps (a heuristic method).
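The gap hunt itself is just a scan of the sorted F-values (a toy sketch; gaps is a made-up helper and the values below are illustrative, loosely echoing the CL1/CL2/CL3 split, not the actual IRIS projections):

```python
def gaps(fvals, min_gap):
    """Split the sorted F-values into clusters wherever two consecutive
    values differ by at least min_gap (the values-and-counts scan)."""
    fs = sorted(fvals)
    clusters, cur = [], [fs[0]]
    for a, b in zip(fs, fs[1:]):
        if b - a >= min_gap:
            clusters.append(cur)
            cur = []
        cur.append(b)
    clusters.append(cur)
    return clusters

print(gaps([3, 5, 8, 12, 16, 21, 22, 27, 30, 33], 5))
# -> [[3, 5, 8, 12, 16], [21, 22], [27, 30, 33]]
```

In the pTree setting the same scan runs over the sequence of F-values and their point counts, so the masks for each side of a gap come for free.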
"Gap Hill Climbing": mathematical analysis
[Figure: two 2-D plots of the same point set, showing a small d1-gap and the much larger d2-gap found after re-orienting the projection line through p and q.]

1. To increase the gap size, we hill-climb the standard deviation of the functional F, hoping that a "rotation" of d toward a higher StDev increases the likelihood that gaps are larger, since more dispersion allows for more and/or larger gaps. This is very heuristic, but it works.

2. We are more interested in growing the largest gap(s) of interest (or the largest thinning). F-slices are hyperplanes (assuming F = dot_d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the two entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the (n-1)-dimensional F-slice hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place).

The d2-gap is much larger than the d1-gap, but it is still not the optimal gap. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, by the d-barrel radius, measured from the center of the gap, on which each point lies)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius^2)? (Zero weighting after the first gap is identical to the previous.)
Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, that are closest together) as p and q (in this case 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radii of just these gap-slice pairs and select the closest pair as p and q.
Maximizing the Variance
Given any table, X = (X_1, ..., X_n), and any unit vector d in n-space, let F = Xod = DPP_d(X) and

  V(d) ≡ Variance(Xod) = mean( (Xod)^2 ) - mean( Xod )^2.

Expanding, with a_jk ≡ mean(X_j X_k) - mean(X_j) mean(X_k):

  V(d) = SUM_j ( mean(X_j^2) - mean(X_j)^2 ) d_j^2 + 2 SUM_j<k ( mean(X_j X_k) - mean(X_j) mean(X_k) ) d_j d_k
       = SUM_i,j a_ij d_i d_j = d^T o A o d,   subject to SUM_i d_i^2 = 1.

(We can separate out the diagonal terms or not.) The gradient is Gradient(V)(d) = 2 A o d, so given any d0 one can hill-climb to locally maximize the variance: d1 ≡ normalize( Gradient(V)(d0) ); d2 ≡ normalize( Gradient(V)(d1) ); ...

Theorem 1: for some k in {1, ..., n}, d = e_k will hill-climb V to its global maximum.
Theorem 2 (working on it): let d = e_k s.t. a_kk is a maximal diagonal element of A; then d = e_k will hill-climb V to its global maximum.

How do we use this theory? For Dot Product gap based Clustering, we can hill-climb from the maximal a_kk to a d that gives us the global maximum variance; heuristically, higher variance means more prominent gaps. For Dot Product gap based Classification, we can start with X = the table of the C Training Set Class Means, M_1, ..., M_C, where M_k ≡ MeanVectorOfClass_k. These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.

FAUST Classifier MVDI (Maximized Variance Definite Indefinite): build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply DI each round (see the next slide).
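The hill-climb d_(i+1) = normalize(Gradient(V)(d_i)) = normalize(2 A o d_i) is power iteration on A (a toy sketch; hill_climb_d is a made-up name, and the uniform starting d is an assumption - the start must not be orthogonal to the principal direction):

```python
def hill_climb_d(X, rounds=50):
    """Build A from table X (rows = points), then iterate d <- normalize(A o d).
    Returns the climbed d and V(d) = d^T A d."""
    n = len(X[0])
    cols = list(zip(*X))
    mean = lambda xs: sum(xs) / len(xs)
    # a_jk = mean(X_j X_k) - mean(X_j) mean(X_k)
    A = [[mean([row[j] * row[k] for row in X]) - mean(cols[j]) * mean(cols[k])
          for k in range(n)] for j in range(n)]
    d = [1 / n ** 0.5] * n                      # uniform start (assumption)
    for _ in range(rounds):
        Ad = [sum(A[j][k] * d[k] for k in range(n)) for j in range(n)]
        norm = sum(x * x for x in Ad) ** 0.5
        d = [x / norm for x in Ad]
    V = sum(d[j] * A[j][k] * d[k] for j in range(n) for k in range(n))
    return d, V
```

On a table whose variance is concentrated on one coordinate, the iteration converges to that coordinate axis and V(d) to that coordinate's variance, consistent with Theorem 1's e_k claim.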
FAUST DI
Given a K-class training set, TK, and a d (e.g., from D ≡ Median_TK -> Mean_TK): let m_i ≡ mean(C_i), with the classes indexed so that d o m_1 ≤ d o m_2 ≤ ... ≤ d o m_K, and let Mn_i ≡ Min{d o C_i}, Mx_i ≡ Max{d o C_i}, Mn_>i ≡ Min_j>i {Mn_j}, Mx_<i ≡ Max_j<i {Mx_j}. Then

  Definite_i = ( Mx_<i, Mn_>i )      Indefinite_i,i+1 = [ Mn_>i, Mx_<i+1 ]

and we recurse on each Indefinite set.

For IRIS, 15 records were extracted from each class for testing; the rest are the training set, TK. [Class-mean and interval tables omitted; e.g., with D = Mean_s -> Mean_e, the indefinite interval se = [25, 10] is empty and ei = [37, 48].]

Training, 1st round (D = Mean_s -> Mean_e): F < 18 -> setosa (35 setosa); 18 < F < 37 -> versicolor (15 versicolor); 37 ≤ F ≤ 48 -> IndefiniteSet2 (20 versicolor, 10 virginica); 48 < F -> virginica (25 virginica).
IndefSet2 round (D = Mean_e -> Mean_i): F < 7 -> versicolor (17 vers, 0 virg); 7 ≤ F ≤ 10 -> IndefSet3 (3 vers, 5 virg); 10 < F -> virginica (0 vers, 5 virg).
IndefSet3 round (D = Mean_e -> Mean_i): F < 3 -> versicolor (2 vers, 0 virg); 3 ≤ F ≤ 7 -> IndefSet4 (2 vers, 1 virg; here we will assign 0 ≤ F ≤ 7 -> versicolor); 7 < F -> virginica (0 vers, 3 virg).
Test, 1st round (D = Mean_s -> Mean_e): F < 15 -> setosa (15 setosa); 15 ≤ F ≤ 41 -> IndefiniteSet2 (15 vers, 1 virg); 41 < F -> virginica (14 virg).
IndefSet2 round (D = Mean_e -> Mean_i): F < 20 -> versicolor (15 vers, 0 virg); 20 < F -> virginica (0 vers, 1 virg). 100% accuracy.

Options for the sequence of D's:
Option 1: D = Mean(Class_k) -> Mean(Class_k+1), k = 1, ... (and Mean could be replaced by VOM, or ?).
Option 2: D = Mean(Class_k) -> Mean(Union_h=k+1..n Class_h), k = 1, ... (VOM?), where k is the class with max count in the subcluster.
Option 3: D = Mean(Class_k) -> Mean(Union_h not used yet Class_h), where k is the class with max count in the subcluster (VOM instead?).
Option 4: always pick the means pair which are furthest separated from each other.
Option 5: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(mean_i), F(mean_j).
Option 6: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS = X).
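One DI round can be sketched as follows (a minimal sketch; di_intervals is a made-up name, classes are given as lists of projection values already ordered by class mean, and the toy endpoints below are illustrative):

```python
import math

def di_intervals(classes):
    """classes: per-class lists of projection (d o C_i) values, ordered by
    class mean.  Returns the Definite_i list and the nonempty Indefinite
    intervals to recurse on."""
    mins = [min(c) for c in classes]
    maxs = [max(c) for c in classes]
    K = len(classes)
    definite = []
    for i in range(K):
        mx_lt = max(maxs[:i], default=-math.inf)     # Mx_<i
        mn_gt = min(mins[i + 1:], default=math.inf)  # Mn_>i
        definite.append((mx_lt, mn_gt))              # only class i strictly inside
    indefinite = []
    for i in range(K - 1):
        mn_gt = min(mins[i + 1:])                    # Mn_>i
        mx_lt1 = max(maxs[:i + 1])                   # Mx_<i+1
        if mn_gt <= mx_lt1:                          # classes overlap there
            indefinite.append((mn_gt, mx_lt1))
    return definite, indefinite

# Toy projection ranges shaped like the IRIS tables: s, e, i classes.
definite, indefinite = di_intervals([[-1, 10], [23, 48], [38, 70]])
```

With these ranges the s/e indefinite interval is empty (the classes are gap-separated), while e and i overlap on [38, 48], which becomes the set recursed on.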
FAUST MVDI on IRIS (15 records from each class were held out for testing; virg39 was removed as an outlier):
Round 1, d0 = (.33, -.1, .86, .38): x o d0 < 16.5 (16.5 = avg{23, 10}) -> setosa (sCt = 50); 16.5 < x o d0 < 38 -> versicolor (eCt = 24); [38, 48] is the indefinite set se_i (seCt = 26, iCt = 13); 48 < x o d0 -> virginica (iCt = 39).
Round 2 on se_i, d1 = (-.55, -.33, .51, .57): x o d1 < 8 -> versicolor (Ct = 21); [8, 10] is the indefinite set e_i (eCt = 5, iCt = 4); 10 < x o d1 -> virginica (Ct = 9). Since this indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in the decision tree:
  x o d0 < 16.5 -> Setosa;  16.5 ≤ x o d0 < 38 -> Versicolor;  48 < x o d0 -> Virginica;
  38 ≤ x o d0 ≤ 48:  x o d1 < 9 -> Versicolor,  x o d1 ≥ 9 -> Virginica.
[Class-mean and definite/indefinite interval tables omitted.]
FAUST MVDI on SatLog: 413 training rows, 4 attributes, 6 classes, 127 test rows. Hill-climbing the variance on the full data works much better than using only the class means: the first gradient hill-climb runs V(d) from 282 at d = (0, 0, 1, 0) up to 853 at d = (0.39, 0.84, 0.35, 0.10), and the per-class F-intervals separate much more cleanly. Recursing on each indefinite interval with a new hill-climbed d (e.g., d = (-.11, -.22, .54, .81) on t25; d = (-.15, -.29, .56, .76) on t257, where the class means and the training subset give the same d; d = (-.81, .17, .45, .33) on t13; etc.) yields a decision tree that makes 4 errors on the 127-sample SatLog test set: 96.8% accuracy. Speed? [Hill-climb traces and per-class min/max interval tables omitted.]

With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire test set (a pTreeSet), create the dot-product SPTS for each inode, and create the cut-SPTS masks; these masks give the results for the entire test set at once. Where the hill-climbs are inconclusive both ways, we predict the plurality class. For WINE: awful results!

FAUST MVDI on Concrete: 7 test errors / 30 = 77%. On Seeds: 8 test errors / 32 = 75%. [Decision-tree structure, d-vectors and interval tables omitted.]
FAUST Oblique Classifier. Formula: P(X o d)>a, where X is any set of vectors and d is an oblique unit vector (note: if d = ei, this is just PXi>a). P(X o d)>a = P(sum di Xi)>a: one AND/OR program across the basic pTrees plus two mask pTrees. E.g., let D = the vector connecting the class means and d = D/|D|. To separate r from v: D = mv-mr, a = (mv+mr)/2 o d, the midpoint of D projected onto d. P((mr-mv)/|mr-mv| o X)<a masks the vectors that fall on the mr side of the midpoint; P((mb-mr) o X)>((mr+mb)/2) o d does the same for classes r and b.
FAUST-Oblique: create a table, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: if we just pick the one class which, when paired with r, gives the maximum gap, then we can use max_gap or max_std_Int_pt instead of max_gap_midpt; we then need stdj (or variancej) in TBL. Best cut point? The mean, the vector of medians, the outermost point, the outermost non-outlier? ("Outermost" = furthest from the means, i.e., from their projections on the D-line); best rank-K points, best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex, and the same holds in higher dimensions (if the classes are convex clusters, FAUST{div, oblique_gap} finds them).
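The midpoint cut above can be sketched in a few lines. This is a minimal sketch with ordinary in-memory lists standing in for pTrees; the function and variable names are illustrative, not from the original code.

```python
# Sketch of the FAUST oblique midpoint cut: D = mv - mr, d = D/|D|,
# a = (mr+mv)/2 o d; a sample is class r if x o d < a, else class v.

def mean(cls):
    n = len(cls)
    return [sum(x[i] for x in cls) / n for i in range(len(cls[0]))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def oblique_cut(class_r, class_v):
    mr, mv = mean(class_r), mean(class_v)
    D = [v - r for r, v in zip(mr, mv)]
    norm = dot(D, D) ** 0.5
    d = [Di / norm for Di in D]
    # project the midpoint of the means onto the d-line
    a = dot([(r + v) / 2 for r, v in zip(mr, mv)], d)
    return d, a

def classify(x, d, a):
    # x o d < a masks the mr side of the midpoint; >= a the mv side
    return 'r' if dot(x, d) < a else 'v'

r_pts = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
v_pts = [[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
d, a = oblique_cut(r_pts, v_pts)
print(classify([1.5, 1.5], d, a))  # a point near mr
print(classify([8.5, 8.5], d, a))  # a point near mv
```

In the pTree setting the same cut is evaluated as one horizontal AND/OR program over the bit slices rather than per sample; the sketch only shows the geometry.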
Separate classR and classV using the midpoint-of-means (mom) method: D ≡ mRmV is the oblique vector, d = D/|D|. Viewing mR and mV as vectors (mR ≡ the vector from the origin to the point mR), a = (mR + (mV-mR)/2) o d = ((mR+mV)/2) o d. (The very same formula works when D = mVmR, i.e., when it points the other way.) FAUST Oblique: PR = P(X o d)<a.
Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across the pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class rather than mV: vomV ≡ ( median{v1 | v in V}, median{v2 | v in V}, ... ).
2. Project each class onto the d-line (e.g., the R-class); then calculate the std of these distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mr [vomr] and mv [vomv]).
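The two refinements can be sketched together. This is a hedged sketch in plain Python; the proportional placement formula (move the cut by the fraction sr/(sr+sv)) is one reasonable reading of "use the std ratio to place the CHP", not necessarily the exact rule intended.

```python
# Sketch of the two CHP refinements: (1) represent each class by its
# vector of medians, (2) place the cut on the d-line in proportion to
# each class's std of projections, not at the plain midpoint.

def vom(cls):
    """Vector of medians: componentwise median of a class."""
    out = []
    for i in range(len(cls[0])):
        vals = sorted(x[i] for x in cls)
        m = len(vals) // 2
        out.append(vals[m] if len(vals) % 2 else (vals[m-1] + vals[m]) / 2)
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def std_ratio_cut(class_r, class_v):
    vr, vv = vom(class_r), vom(class_v)
    D = [v - r for r, v in zip(vr, vv)]
    d = [Di / dot(D, D) ** 0.5 for Di in D]
    def proj_std(cls):
        projs = [dot(x, d) for x in cls]
        mu = sum(projs) / len(projs)
        return (sum((p - mu) ** 2 for p in projs) / len(projs)) ** 0.5
    sr, sv = proj_std(class_r), proj_std(class_v)
    # move the cut from vom_r toward vom_v by the fraction sr/(sr+sv),
    # so the tighter class keeps the cut closer to itself
    t = sr / (sr + sv)
    a = dot(vr, d) + t * (dot(vv, d) - dot(vr, d))
    return d, a
```

With a tight class r and a dispersed class v, the cut lands well inside the r half of the segment, which is the intended effect of weighting by dispersion.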
(Tables: L1(x,y) ValueArrays and CountArrays for the 15 sample points z1-z15, together with the x,y coordinate layout of the points on the 1-15 grid. 12/8/12.)
(The L1(x,y) ValueArrays, CountArrays and x,y layout are repeated from the previous slide.) This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis; it likewise confirms zf as an anomaly or outlier. After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps, to determine whether there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
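The "split at gaps > 2" step used throughout these slides is a one-pass scan over a sorted value array. A minimal sketch (names illustrative):

```python
# Split a sorted 1-D value array into subclusters wherever consecutive
# values differ by more than the gap threshold; size-1 or size-2
# pieces can then be flagged as outlier candidates.

def gap_split(values, gap=2):
    vals = sorted(values)
    clusters, current = [], [vals[0]]
    for v in vals[1:]:
        if v - current[-1] > gap:   # a gap wider than the threshold
            clusters.append(current)
            current = [v]
        else:
            current.append(v)
    clusters.append(current)
    return clusters

# e.g. the z1 distance column F = (14,12,12,11,10,6,1,2,0,2,2,1,2,0,5)
F = [14, 12, 12, 11, 10, 6, 1, 2, 0, 2, 2, 1, 2, 0, 5]
print(gap_split(F))  # → [[0, 0, 1, 1, 2, 2, 2, 2], [5, 6], [10, 11, 12, 12, 14]]
```

On the z1 column this reproduces the two gaps noted on the slide (between 2 and 5, and between 6 and 10), giving three pieces.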
(Tables, repeated on each of four slides: SpS(y o (x-M)/|x-M|) Value and Count arrays for z1-z15; the F column of values for x = z1; Mean M = (9, 5); the x,y layout.) Cluster by splitting at gaps > 2. The successive gaps found (10-6 and 5-2, then 6-9, then 3-7) produce cluster pTree masks, built by ORing the point masks:
z11 = 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0, z12 = 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1, z13 = 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z71 = 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1, z72 = 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
zd1 = 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1, zd2 = 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
Finally, AND each red mask (z1*) with each blue mask (z7*) and each green mask (zd*), to get the subcluster masks (12 ANDs).
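The final step, ANDing one mask from each of the three gap families, can be sketched with plain bit lists standing in for pTrees (illustrative only):

```python
# AND one mask from each gap family (z1*, z7*, zd*) to form candidate
# subcluster masks; with 3 x 2 x 2 masks this is the 12 ANDs above.

def AND(*masks):
    return [int(all(bits)) for bits in zip(*masks)]

z11 = [0,0,0,0,0,0,1,1,1,1,1,1,1,1,0]
z12 = [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1]
z13 = [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]
z71 = [1,1,1,1,1,1,0,0,0,0,1,1,1,1,1]
z72 = [0,0,0,0,0,0,1,1,1,1,0,0,0,0,0]
zd1 = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1]
zd2 = [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0]

subclusters = [AND(a, b, c)
               for a in (z11, z12, z13)
               for b in (z71, z72)
               for c in (zd1, zd2)]
nonempty = [m for m in subclusters if any(m)]
print(len(subclusters), len(nonempty))  # → 12 5
```

Of the 12 ANDs, only 5 masks are non-empty, and two of them are singletons (z6 and z15 = zf), matching the anomalies the gap analysis had already flagged.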
FAUST Clustering Methods: MCR (using Midlines of the circumscribing Coordinate Rectangle). Let nv = MinVect = (nv1, ..., nvn) and Xv = MaxVect = (Xv1, ..., Xvn) be the opposite corners of the circumscribing coordinate rectangle. (The slide's 3-D cube diagram, with corners such as (Xv1, nv2, Xv3) and bit-pattern labels 0000-1111, is omitted.)
For any FAUST clustering method, we proceed in one of 2 ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and another point, g, usually). Given d, f ≡ MinPt(xod) and g ≡ MaxPt(xod). Given f and g, d ≡ (f-g)/|f-g|. So we can do any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ... Define a sequence fk, gk, dk: fk ≡ ((nv1+Xv1)/2, ..., nvk, ..., (nvn+Xvn)/2), gk ≡ ((nv1+Xv1)/2, ..., Xvk, ..., (nvn+Xvn)/2), so dk = ek and SpS(xodk) = Xk. These f, g, d, SpS(xod) require no processing (gap-finding is the only cost). MCR(fg) adds the cost of SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)).
MCR(dfg) on Iris150: do SpS(xod) linear gap analysis first (since it is processing free), then SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)) round-gap analysis, sequencing through the {f, g} pairs and looking for outliers in each subcluster. d3 splits at 0 10 set23 ... 1 19 set45 / 0 30 ver49 ... 0 69 vir19. One f2 probe flags vir23 (1 41 vir23 / 0 47 vir18, vir32); d4 (1 6 set44 / 0 18 vir39) leaves exactly the 50 setosa in one subcluster; every other probe (d1, d2, f1, g1, f3, g3, f4, g4, ...) finds none, and the other subcluster ends as 50 versicolor and 49 virginica.
MCR(d) on Iris150+Outlier30, gap > 4. Do SpS(xodk) linear gap analysis, k = 1, 2, 3, 4. Declare subclusters of size 1 or 2 to be outliers. For any subcluster of size <= 10, create the full pairwise distance table and declare a point an outlier if all of its column values (other than the zero diagonal value) exceed the threshold (which is 4). (The slide's distance tables among the added corner tuples t2/t23/t24/t234, b124/b134/b14/ball, b24/b2/b234/b23, t13/t12/t1/t123, t124/t14/tall/t134 and b12/b1/b13/b123 all confirm them as outliers.)
d3 gives the same split as before (expected). d1 peels off the added t- and b-tuples at 17, 23, 84 and 98, with the irises from 38 set14 ... 79 vir32 in between. In SubClus1, d4 (1 6 set44 / 0 18 vir39) leaves exactly the 50 setosa. In SubClus2, d4 (0 0 t4, 1 0 t24, 0 10 ver18 ... 1 25 vir45, 0 40 b4, 0 40 b24) leaves the 49 virginica (vir39 declared an outlier) and the 50 versicolor.
MCR(d) performs well on this dataset. Accuracy: we can't expect a clustering method to separate versicolor from virginica, because there is no gap between them. This method does separate off setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It also finds the virginica outlier, vir39, the most prominent intra-class outlier (distance 29.6 from the other virginica irises, whereas no other iris is more than 9.1 from its classmates). Speed: dk = ek, so there is zero calculation cost for the d's, and SpS(xodk) = SpS(xoek) = SpS(Xk), so zero calculation cost there as well.
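The small-subcluster outlier rule described above (build the full pairwise distance table; declare a point an outlier if every off-diagonal entry in its column exceeds the threshold) can be sketched as follows (names illustrative):

```python
# For a small subcluster, declare a point an outlier if its distance
# to every other member exceeds the threshold (4, as in the text).

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def outliers(points, threshold=4.0):
    out = []
    for i, p in enumerate(points):
        others = [dist(p, q) for j, q in enumerate(points) if j != i]
        if all(d > threshold for d in others):
            out.append(i)
    return out

cluster = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [30.0, 30.0]]
print(outliers(cluster))  # → [3]: only the far point qualifies
```

This is an O(n²) check, which is why the text only applies it to subclusters of size 10 or less.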
The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed! (d2 log: the t2/t23/t24/t234 tuples split off at 5, the irises run from 20 ver1 ... 44 set16, and the b24/b2/b234/b23 tuples split off at 60.)
CCR(fgd) (Corners of the Circumscribing Coordinate Rectangle): f1 = MinVecX ≡ (minX x1, ..., minX xn) (corner 0000); g1 = MaxVecX ≡ (MaxX x1, ..., MaxX xn) (corner 1111); d = (g-f)/|g-f|. Sequence through the main-diagonal corner pairs, {f, g}, lexicographically; for each, create d. CCR(f): do SpS((x-f)o(x-f)) round-gap analysis. CCR(g): do SpS((x-g)o(x-g)) round-gap analysis. CCR(d): do SpS(xod) linear gap analysis. Notes: no calculation is required to find f and g (assuming MaxVecX and MinVecX have been calculated and residualized when PTreeSetX was captured). If the dimension is high, the main-diagonal corners are likely far from X, so the large radii make the round gaps nearly linear.
Run summary: starting from f1 = MnVec (round gap > 4: none) and g1 = MxVec (0 7 vir18 ... 1 47 ver30 / 0 53 ver49 ... 0 74 set14), almost every subsequent probe (f2-f8, g2-g8, linear gaps) finds none. Exceptions: f6 = 0101 round-gap isolates set26 and the near-outlier group ver49, set42, ver8, set36, ver44, ver11 (their pairwise distance table shows distances down to 1.4: almost outliers, Subcluster2.1; which type? we must classify); f7 = 0110 splits at 1 28 ver13 / 0 33 vir49. This ends SubClus2 = 47 setosa only, and SubClus1 = 95 versicolor and virginica samples only.
(Data listing, spanning several slides: the 150 Iris samples, SL SW PL PW for the 50 setosa, 50 versicolor and 50 virginica, each with its bit-sliced binary encoding; then the added outlier tuples t1, t2, ..., t12, ..., tall, which replace selected coordinates of the mean vector (58 30 37 12) with low outlying values (20, 5, 2, 0), and b1, b2, ..., ball, which replace them with high outlying values (90, 60, 80, 40), likewise encoded. Before adding the new tuples: MINS = 43 20 10 1, MAXS = 79 44 69 25, MEAN = 58 30 37 12; the mean is the same after the additions.)
(Distance tables: among t123, b234, tall, b134, b123 the off-diagonal distances run 12.0-118.97: all outliers! Among b12, b14, b24 they run 41.04-43.86: all outliers again!)
FM(fgd) (Furthest-from-the-Medoid) and FMO (FM using a Gram-Schmidt orthonormal basis), X a subset of Rn. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. (And BTW, use residualized STD calculations to guide the choice of good gap-width thresholds, which define what an outlier is going to be and also determine when we divide into sub-clusters.)
f1 ≡ MaxPt(SpS[(M-x)o(M-x)]); d1 ≡ (M-f1)/|M-f1|. If d1 ≠ e1, Gram-Schmidt {d1, e1, ..., ek-1, ek+1, ..., en}:
d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|
d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2|
...
dh ≡ (eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1) / |eh - (ehod1)d1 - ... - (ehodh-1)dh-1|
Thm: MaxPt[SpS((M-x)od)] = MaxPt[SpS(xod)] (shifting by Mod leaves the MaxPts the same). Repick fh ≡ MinPt[SpS(xodh)]; pick gh ≡ MaxPt[SpS(xodh)].
The procedure:
1. Choose f0 (high outlier potential? e.g., furthest from the mean, M?).
2. Do f0-round-gap analysis (plus subcluster analysis?).
3. Let f1 be such that no x is further away from f0 in some direction (all d1 dot products >= 0).
4. Do f1-round-gap analysis (plus subcluster analysis?).
5. Do d1-linear-gap analysis, d1 ≡ (f0-f1) / |f0-f1|.
6. Let f2 be such that no x is further away (in some direction) from the d1-line than f2.
7. Do f2-round-gap analysis.
8. Do d2-linear-gap analysis, d2 ≡ (f0-f2 - ((f0-f2)od1)d1) / |f0-f2 - ((f0-f2)od1)d1|. And so on.
(Run log: f = M, gap > 4, isolates b13, then t123/b234/tall/b134/b123 and ball; successive f0 round-gap probes peel off t123 (with t13, t134), b2, b3, b23, b124 (with b12, b14, b24), t24/t2, b1, b34, t3, t34 and the remaining corner tuples. SubClust-2's f1 = set42 round- and linear-gap probes find none: SubClust-2 is 50 setosa! Likely the f2, f3 and f4 analyses will find none either. SubClust-1's probes (f0 = ver19, f1 = ver49, ...) find none.)
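The dh recurrence above is ordinary Gram-Schmidt applied to {d1, e2, e3, ...}; a minimal sketch:

```python
# Gram-Schmidt construction of the orthonormal directions d2, d3, ...
# from d1 and the standard basis vectors, following the d_h recurrence:
# d_h = (e_h - sum_k (e_h o d_k) d_k), normalized.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def next_direction(e, ds):
    r = list(e)
    for d in ds:                      # subtract the projections onto
        c = dot(e, d)                 # the directions found so far
        r = [ri - c * di for ri, di in zip(r, d)]
    n = dot(r, r) ** 0.5
    return [ri / n for ri in r]

s = 2 ** -0.5
d1 = [s, s, 0.0]                      # some initial unit vector (d1 != e1)
e2 = [0.0, 1.0, 0.0]
e3 = [0.0, 0.0, 1.0]
d2 = next_direction(e2, [d1])
d3 = next_direction(e3, [d1, d2])
print(d2, d3)                         # d1, d2, d3: an orthonormal basis
```

Each new dh is orthogonal to all previous directions and has unit length, which is exactly what the linear-gap analyses along successive dh's require.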
(Distance tables: b123/b134/b234 and b24/b2/b12 mutual distances 28-44; t23/t234/t12/t24/t124/t2 mutual distances 12-53; b34/b124/b23/t13/b13 mutual distances 41-105. The scatter diagrams marking f and g for FMG-GM, the MCR f and g points, and the CRC corners f1 = MinVector, g1 = MaxVector are omitted.)
FMO(d): f1 = ball, g1 = tall; linear gap > 4 first splits off ball and b123/b134/b234, then (after b13 ... t13/t134/t123) ends at tall. f2 = vir11 with g2 = set16, b23 or b12 splits off the b-corner groups (b12; b34/b124/b23/t13/b13; b24/b2/b12); f3 = t34, g3 = vir18 find none; f4 = t4 with g4 = b4 splits 1 24 vir1 / 0 39 b4, b14, and with g4 = vir1 finds none. This ends the process: everything we flagged was an added anomaly (no false positives), but we missed t34, t14, t4, t1, t3, b1, b3. (Side probes: f1 = b13, g1 = b2 linear gap none; f2 = t2, g2 = b2 splits 1 21 set16 / 0 26 b2; f2 = t2, g2 = t234 isolates the cluster t23, t234, t12, t24, t124, t2 at 5-6, then 1 6 t2 / 0 21 ver11.)
f1=bal, RnGp>4:
1    0  ball
0   28  b123
...
1   73  t4
0   78  vir39
...
1   98  t34
0  103  t12
0  104  t23
0  107  t124
1  108  t234
0  113  t13
1  116  t134
0  122  t123
0  125  tal

FMO(fg): start with f1 = MxPt(SpS((M-x)o(M-x))). Round gaps first, then linear gaps.

       t12   t23  t124  t234
t12    0.0  51.7  12.0  53.0
t23   51.7   0.0  53.0  12.0
t124  12.0  53.0   0.0  51.7
t234  53.0  12.0  51.7   0.0

        b13 vir32 vir18   b23
b13     0.0  22.5  22.4  43.9
vir32  22.5   0.0   4.1  35.3
vir18  22.4   4.1   0.0  33.4
b23    43.9  35.3  33.4   0.0

SubCluster2.2 (almost outliers! Which type? Must classify.):
       ver49  ver8 ver44 ver11
ver49    0.0   3.9   3.9   7.1
ver8     3.9   0.0   1.4   4.7
ver44    3.9   1.4   0.0   3.7
ver11    7.1   4.7   3.7   0.0

       b124   b12   b14
b124    0.0  28.0  30.0
b12    28.0   0.0  41.0
b14    30.0  41.0   0.0

Finally, we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but we know we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).

We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples, so we should not really do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do) that we can separate Setosa, Versicolor and Virginica perfectly using FAUST Classify.

SubClus2, f1=t14, Rn>4:
0    0  t1
1    0  t14
0   30  ver8
...
1   47  set15
0   52  t3
0   52  t34

SubClus1, f1=b123, Rn>4:
1    0  b123
0   30  b13
0   30  vir32
0   30  vir18
1   32  b23
0   37  vir6

If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round-gap analysis is more productive than linear dot-product-projection gap analysis!
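FAUST Classify is only referenced above, not specified here. As a minimal stand-in for "classify using the means of another training set", assume nearest-class-mean assignment; the function name, the 2-D means and the sample point are invented for illustration:

```python
import numpy as np

def classify_by_means(x, class_means):
    """Assign x to the class whose training-set mean is nearest (Euclidean).
    A simple stand-in for mean-based classification, not FAUST Classify itself."""
    names = list(class_means)
    dists = [float(np.linalg.norm(x - class_means[n])) for n in names]
    return names[int(np.argmin(dists))]

# Invented 2-D stand-ins for class means from "another training set".
means = {"setosa":     np.array([5.0, 3.4]),
         "versicolor": np.array([5.9, 2.8]),
         "virginica":  np.array([6.6, 3.0])}
label = classify_by_means(np.array([5.1, 3.5]), means)  # nearest mean is setosa's
```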
FFG (Furthest to Furthest): compute SpS((M-x)o(M-x)) to get f1 (expensive? grab any point? a corner point?), then compute SpS((x-f1)o(x-f1)) for f1-round-gap analysis. Then compute SpS(xod1) to get g1, the point whose projection is furthest from that of f1 (for d1-linear-gap analysis). Too expensive? The gk-round-gap analysis and the linear analysis contributed very little, but we need g1 to get f2, etc. Are there other, cheaper ways to get a good f2? We also need SpS((x-g1)o(x-g1)) for g1-round-gap analysis (too expensive!).

SubClus1:
f1=b134, Rn>4:  1 0 b134 / 0 24 vir19
f1=b234, Rn>4:  1 0 b234 / 1 30 b34 / 0 37 vir10
f1=b124, Rn>4:  1 0 b124 / 0 28 b12 / 0 30 b14 / 1 32 b24 / 0 41 b1 ... / 1 59 t4 / 0 68 b3
f1=vir19, Rn>4: 1 44 t4 / 0 52 b2
g1=b2, Rn>4:    1 0 t4 / 0 28 ver36
f2=ver13, Rn>4: 1 0 ver13 / 0 5 ver43
g2=vir10, Rn>4: 1 0 vir10 / 0 6 vir44
f4=b1, Rn>4:    1 0 b1 / 0 23 ver1
g4=b4, Rn>4:    1 0 b4 / 0 21 vir15
SubClus1 has 91 points: only versicolor and virginica.

SubClus2, f1=set23, Rn>4:
1  17  vir39
0  23  ver49
0  26  ver8
0  27  ver44
1  30  ver11
0  43  t24
0  43  t2

SbCl_2.1:
g1=vir39, Rn>4: 1 0 vir39 / 0 7 set21
g1=set19, Rn>4: none
f2=set42, Rn>4: 1 0 set42 / 0 6 set9
f2=set9, Rn>4:  none
g2=set16, Rn>4: none
f3=set16, Rn>4: none
g3=set9, Rn>4:  none
f4=set, Rn>4:   none
g4=set, Rn>4:   none
LnG>4: none (at each round)

Note: what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.
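The FFG start-up can be sketched as follows: f1 maximizes SpS((M-x)o(M-x)), then g1 is the point whose projection on d1 is furthest from f1's. Two assumptions beyond the slide text: NumPy arrays stand in for pTrees, and d1 is taken as the unit vector from f1 toward the mean M (the slide leaves d1's exact definition to context).

```python
import numpy as np

def furthest_from_mean(X):
    """Index of f1 = the point maximizing (M-x)o(M-x), M the column mean."""
    M = X.mean(axis=0)
    d2 = ((X - M) ** 2).sum(axis=1)       # the SpS((M-x)o(M-x)) column
    return int(np.argmax(d2))

def furthest_projection(X, i_f1):
    """Index of g1 = the point whose projection on d1 is furthest from f1's,
    with d1 assumed to be the unit vector from f1 toward the mean M."""
    M = X.mean(axis=0)
    d1 = (M - X[i_f1]) / np.linalg.norm(M - X[i_f1])
    proj = X @ d1                          # the SpS(xod1) column
    return int(np.argmax(np.abs(proj - proj[i_f1])))

# Toy data: point 3 is furthest from the mean; point 0 projects furthest from it.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [10.0, 0.0]])
i_f = furthest_from_mean(X)
i_g = furthest_projection(X, i_f)
```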
For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by a functional (e.g., Xk, SpS((x-M)o(x-M)), SpS((x-f)o(x-f)), SpS(xod), etc.). The STDs of the columns Xk can be precomputed up front, once and for all. STDs of projection and square-distance functionals must be computed after those columns are generated (which could be done upon capture too). Good functionals produce many large gaps. In Iris150 and Iris150+Out30, I find the precomputed STD to be a good indicator of that.

A text-mining scheme might be:
1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median and STD of every column (content-word stem).
2. Throw out low-STD columns.
4'. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)

       ver49  ver8 ver44 ver11
ver49    0.0   3.9   3.9   7.2
ver8     3.9   0.0   1.4   4.6
ver44    3.9   1.4   0.0   3.6
ver11    7.2   4.6   3.6   0.0

A possible attribute selection algorithm:
1. Peel outliers from X using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (Xin = X - Xout).
2. Calculate the width of each Xin-circumscribing-rectangle edge, crewk.
4. Look for wide gaps top down (or, very simply, order by STD).
4'. Divide crewk into count{xk | x∈Xin} (but that doesn't account for duplicates).
4''. Look for a preponderance of wide thin-gaps top down.
4'''. Look for high projection-interval-count dispersion (STD).

Notes:
1. Maybe an inlier sub-cluster needs to occur in more than one functional projection to be declared an inlier sub-cluster?
2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis.

For FAUST Cluster-d (pick d, then f=MnPt(xod) and g=MxPt(xod)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles a1, ...
, am, each equi-width partitioned on [0,180), with the formula

d = e1 ∏k=n..2 cos θk  +  e2 sin θ2 ∏k=n..3 cos θk  +  e3 sin θ3 ∏k=n..4 cos θk  +  ...  +  en sin θn,

where the θi's start at 0 and increment by Δ. So

di1..in = Σj=1..n [ ej sin(ij-1 Δ) ∏k=n..j+1 cos(ik Δ) ];   i0 ≡ 0, Δ divides 180 (e.g., 90, 45, 22.5, ...).

CRMSTD(dfg): eliminate all columns with STD < threshold.

d3:
0  10  set23 ... (50 set + vir39)
1  19  set25
0  30  ver49 ... (50 ver, 49 vir)
0  69  vir19

(d3+d4)/sqr(2), clus1: none
(d3+d4)/sqr(2), clus2: none
d5 (f5=vir19, g5=set14): none.  f5:
1  0.0  vir19   (clus2)
0  4.1  vir23
g5: none

Just about all the high-STD columns find the subcluster split. In addition, they find the four outliers as well.

(d1+d3+d4)/sqr(3), clus1:
1  44.5  set19
0  55.4  vir39
(d1+d3+d4)/sqr(3), clus2: none
d5 (f5=vir23, g5=set14): none; f5: none; g5: none
d5 (f5=vir32, g5=set14): none; f5: none; g5: none
d5 (f5=vir18, g5=set14): none.  f5:
1  0.0  vir18   (clus2)
1  4.1  vir32
0  8.2  vir6
g5: none
d5 (f5=vir6, g5=set14): none; f5: none; g5: none
(d1+d2+d3+d4)/sqr(4), clus1: none
(d1+d2+d3+d4)/sqr(4), clus2: none
(d1+d3)/sqr(2), clus1: none
(d1+d3)/sqr(2), clus2:
0  57.3  ver49
0  58.0  ver8
0  58.7  ver44
1  60.1  ver11
0  64.3  ver10
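The angle-grid formula above can be sketched as follows; the first coordinate carries no sine factor (sin θ1 is taken as 1), and every generated d has unit norm. The names `unit_vector` and `direction_grid` and the 90° step are illustrative:

```python
import math
from itertools import product

def unit_vector(angles_deg):
    """n-dim unit vector from n-1 angles (theta_2..theta_n, degrees), via
    d_j = sin(theta_j) * prod_{k>j} cos(theta_k), with sin(theta_1) := 1."""
    thetas = [math.radians(a) for a in angles_deg]
    n = len(thetas) + 1
    d = []
    for j in range(n):
        s = 1.0 if j == 0 else math.sin(thetas[j - 1])
        for t in thetas[j:]:
            s *= math.cos(t)
        d.append(s)
    return d

def direction_grid(n, step_deg):
    """All directions with each angle swept over [0, 180) in step_deg steps."""
    steps = list(range(0, 180, step_deg))
    return [unit_vector(a) for a in product(steps, repeat=n - 1)]

# In 3-D with a 90-degree step, each of the 2 angles takes values {0, 90},
# giving 4 grid directions, all of unit length.
grid = direction_grid(3, 90)
```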
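The column-pruning rule used by CRMSTD above (and by step 2 of the earlier text-mining scheme) is a one-liner over the column STDs. A sketch, with an invented threshold and toy matrix:

```python
import numpy as np

def drop_low_std_columns(X, threshold):
    """Keep only columns whose standard deviation meets the threshold;
    a low-STD column cannot contain large gaps, so it is pruned."""
    stds = X.std(axis=0)
    keep = np.where(stds >= threshold)[0]
    return X[:, keep], keep

# Column 0 is nearly constant (STD ~ 0.07) and gets dropped;
# column 1 varies widely (STD ~ 3) and is kept.
X = np.array([[1.0, 5.0], [1.1, 9.0], [0.9, 1.0], [1.0, 7.0]])
Xr, kept = drop_low_std_columns(X, threshold=1.0)
```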