Discover innovative data mining methods such as pTree processing and clustering for efficient big data analysis. Learn about FAUST algorithms, Dot Product Projection, and Secure pTreeBases to enhance data exploration and decision-making processes. Explore the latest advancements in pTree algorithmic tools and recommender systems to extract meaningful insights from complex datasets.
Research of William Perrizo, C.S. Department, NDSU

I datamine big data (big data ≡ trillions of rows and, sometimes, thousands of columns, which can complicate data mining trillions of rows). How do I do it? I structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees"). I process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.

What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are also important areas of data mining, and they are related to classification. The purpose of clustering is usually to create [or improve] a training table. It is also used for anomaly detection, a huge area in data mining. ARM is used to data mine more complex data (relationship matrixes between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rentals (or based on their ratings of items).

To make a decision, we typically search our memory for similar situations (near neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others). I.e., we let near neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology. It was published in Psychological Review by the cognitive psychologist George A. Miller of Princeton University's Department of Psychology. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
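The vertical bit-column idea can be illustrated with a minimal sketch. It assumes uncompressed bit columns stored as Python integers (real pTrees are compressed trees, which this does not model), and the helper names `bit_slices`, `equals_mask` and `count_ones` are hypothetical. The sample column is the petal length of the first seven setosa rows of the IRIS table used later in the deck.

```python
def bit_slices(column, nbits):
    """Split a column of non-negative ints into nbits vertical bit columns.

    Slice k is an int whose bit r is 1 exactly when bit k of column[r] is 1.
    """
    slices = []
    for k in range(nbits):
        mask = 0
        for row, value in enumerate(column):
            if (value >> k) & 1:
                mask |= 1 << row
        slices.append(mask)
    return slices


def equals_mask(slices, nrows, target):
    """Mask of rows whose value equals target, using only AND and NOT."""
    all_rows = (1 << nrows) - 1
    result = all_rows
    for k, s in enumerate(slices):
        # Keep the slice where target has a 1 bit, its complement otherwise.
        result &= s if (target >> k) & 1 else (~s & all_rows)
    return result


def count_ones(mask):
    """Number of rows selected by a mask."""
    return bin(mask).count("1")


petal_length = [14, 14, 13, 15, 14, 17, 14]
slices = bit_slices(petal_length, nbits=5)
mask_14 = equals_mask(slices, len(petal_length), 14)
print(count_ones(mask_14))  # rows 0, 1, 4 and 6 hold the value 14
```

Each predicate costs one logical operation per bit column regardless of how the rows are laid out, which is the "horizontal" processing across column structures described above.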
Some current pTree Data Mining research projects
FAUST pTree PREDICTOR/CLASSIFIER (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching).
FAUST pTree CLUSTER/ANOMALASER.
pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing BigData.
pTree Text Mining: capture the reading sequence, not just the term-frequency matrix, of a text corpus (lossless capture).
Secure pTreeBases: This involves anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.
pTree Algorithmic Tools: An expanded algorithmic tool set is being developed to include quadratic tools and even higher degree tools.
pTree Alternative Algorithm Implementation: Implementing pTree algorithms in hardware (e.g., FPGAs) could result in orders of magnitude performance increases.
pTree O/S Infrastructure: Computers and Operating Systems are designed to do logical operations (AND, OR, ...) rapidly. Exploit this for pTree processing speed.
pTree Recommender: This includes Singular Value Decomposition (SVD) recommenders, pTree Near Neighbor Recommenders and pTree ARM Recommenders.
FAUST clustering (the unsupervised part of FAUST)

This class of partitioning or clustering methods relies on choosing a dot product projection so that if we find a gap in the F-values, we know that the 2 sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.

The Dot Product Projection (DPP): Check for gaps in $DPP_d(y)$ or $DPP_{pq}(y) \equiv (y-p)\circ(p-q)/|p-q|$ (parameterized over a grid of unit vectors $d=(p-q)/|p-q|$ on the sphere in $R^n$).
The Dot Product Radius (DPR): Check gaps in $DPR_{pq}(y) \equiv \sqrt{SD_p(y) - DPP_{pq}(y)^2}$.
The Coordinate Projection Functionals ($e_j$): Check gaps in $e_j(y) \equiv y\circ e_j = y_j$.
The Square Distance Functional (SD): Check gaps in $SD_p(y) \equiv (y-p)\circ(y-p)$ (parameterized over a grid of $p \in R^n$).
The Square Dot Product Radius (SDPR): $SDPR_{pq}(y) \equiv SD_p(y) - DPP_{pq}(y)^2$ (easier pTree processing).

DPP-KM: 1. Check gaps in $DPP_{p,d}(y)$ (over grids of p and d?). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 2. Check gaps in $DPP_{p,d}(y)$ (grids of p and d?) against the density of the subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once Dot ceases to be effective.
DPP-SD: 3. Check gaps in $DPP_{p,d}(y)$ (over a p-grid and a d-grid) and $SD_p(y)$ (over a p-grid). 3.1 Check sparse-end distances against subcluster density. ($DPP_{p,d}$ and $SD_p$ share construction steps!)
SD-DPP-SDPR: ($DPP_{pq}$, $SD_p$ and $SDPR_{pq}$ share construction steps!)
$SD_p(y) \equiv (y-p)\circ(y-p) = y\circ y - 2\,y\circ p + p\circ p$
$DPP_{pq}(y) \equiv (y-p)\circ d = y\circ d - p\circ d = \frac{1}{|p-q|}\,y\circ p - \frac{1}{|p-q|}\,y\circ q - p\circ d$
Calculate $y\circ y$, $y\circ p$, $y\circ q$ concurrently? Then do the constant multiplies $2\cdot y\circ p$ and $(1/|p-q|)\cdot y\circ p$ concurrently. Then add | subtract. Calculate $DPP_{pq}(y)^2$. Then subtract it from $SD_p(y)$.
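The shared construction steps for SD, DPP and SDPR can be sketched as follows. This is a minimal horizontal (NumPy) illustration, not a pTree implementation; the function names `dpp_sd_sdpr` and `find_gaps` and the toy data are assumptions made for the example.

```python
import numpy as np


def dpp_sd_sdpr(Y, p, q):
    """Compute DPP_pq, SD_p and SDPR_pq for every row y of Y.

    The three dot-product columns yoy, yop, yoq are computed once and reused.
    """
    yoy = np.einsum("ij,ij->i", Y, Y)   # y o y for each row
    yop = Y @ p                         # y o p
    yoq = Y @ q                         # y o q
    norm = np.linalg.norm(p - q)
    d = (p - q) / norm                  # unit vector d = (p-q)/|p-q|
    sd = yoy - 2 * yop + p @ p          # (y-p) o (y-p)
    dpp = (yop - yoq) / norm - p @ d    # (y-p) o d
    return dpp, sd, sd - dpp ** 2


def find_gaps(values, min_width):
    """Return (low, high) pairs of consecutive distinct values at least min_width apart."""
    v = np.unique(values)               # sorted distinct F-values
    return [(float(lo), float(hi))
            for lo, hi in zip(v[:-1], v[1:]) if hi - lo >= min_width]


Y = np.array([[0, 0], [1, 0], [2, 1], [10, 0], [11, 1]], dtype=float)
p = np.array([1.0, 1.0])
q = np.array([0.0, 1.0])
dpp, sd, sdpr = dpp_sd_sdpr(Y, p, q)
gaps = find_gaps(dpp, min_width=4)
print(dpp, sd, sdpr, gaps)
```

In this toy case d is the first coordinate axis, so DPP is the shifted x-coordinate and SDPR is the squared offset in y; the single gap separates the first three points from the last two.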
FAUST DPP CLUSTER on IRiS with DPP(y)=(y-p)o(q-p)/|q-p|, where p is the min (or n) corner and q is the max (x) corner of the circumscribing rectangle (mdpts or avg (a) is used also). DPP 60 59 60 58 60 58 60 59 59 58 60 58 59 62 63 61 61 60 58 60 57 59 64 56 56 57 57 59 60 58 57 58 61 62 58 61 61 58 60 59 61 57 60 57 56 58 59 59 60 60 25 27 22 29 24 26 25 37 25 31 34 29 30 24 35 27 26 31 23 32 23 31 21 25 28 SL SW PL PW set 51 35 14 2 set 49 30 14 2 set 47 32 13 2 set 46 31 15 2 set 50 36 14 2 set 54 39 17 4 set 46 34 14 3 set 50 34 15 2 set 44 29 14 2 set 49 31 15 1 set 54 37 15 2 set 48 34 16 2 set 48 30 14 1 set 43 30 11 1 set 58 40 12 2 set 57 44 15 4 set 54 39 13 4 set 51 35 14 3 set 57 38 17 3 set 51 38 15 3 set 54 34 17 2 set 51 37 15 4 set 46 36 10 2 set 51 33 17 5 set 48 34 19 2 set 50 30 16 2 set 50 34 16 4 set 52 35 15 2 set 52 34 14 2 set 47 32 16 2 set 48 31 16 2 set 54 34 15 4 set 52 41 15 1 set 55 42 14 2 set 49 31 15 1 set 50 32 12 2 set 55 35 13 2 set 49 31 15 1 set 44 30 13 2 set 51 34 15 2 set 50 35 13 3 set 45 23 13 3 set 44 32 13 2 set 50 35 16 6 set 51 38 19 4 set 48 30 14 3 set 51 38 16 2 set 46 32 14 2 set 53 37 15 2 set 50 33 14 2 ver 70 32 47 14 ver 64 32 45 15 ver 69 31 49 15 ver 55 23 40 13 ver 65 28 46 15 ver 57 28 45 13 ver 63 33 47 16 ver 49 24 33 10 ver 66 29 46 13 ver 52 27 39 14 ver 50 20 35 10 ver 59 30 42 15 ver 60 22 40 10 ver 61 29 47 14 ver 56 29 36 13 ver 67 31 44 14 ver 56 30 45 15 ver 58 27 41 10 ver 62 22 45 15 ver 56 25 39 11 ver 59 32 48 18 ver 61 28 40 13 ver 63 25 49 15 ver 61 28 47 12 ver 64 29 43 13 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 20 1 2 3 4 5 6 7 8 9 30 1 2 3 4 5 6 7 8 9 40 1 2 3 4 5 6 7 8 9 50 ver 58 26 40 12 ver 50 23 33 10 ver 56 27 42 13 ver 57 30 42 12 ver 57 29 42 13 ver 62 29 43 13 ver 51 25 30 11 ver 57 28 41 13 vir 63 33 60 25 vir 58 27 51 19 vir 71 30 59 21 vir 63 29 56 18 vir 65 30 58 22 vir 76 30 66 21 vir 49 25 45 17 vir 73 29 63 18 vir 67 25 58 18 vir 72 36 61 25 vir 65 32 51 20 vir 64 27 53 19 
vir 68 30 55 21 vir 57 25 50 20 vir 58 28 51 24 vir 64 32 53 23 vir 65 30 55 18 vir 77 38 67 22 vir 77 26 69 23 vir 60 22 50 15 vir 69 32 57 23 vir 56 28 49 20 vir 77 28 67 20 vir 63 27 49 18 vir 67 33 57 21 vir 72 32 60 18 vir 62 28 48 18 vir 61 30 49 18 vir 64 28 56 21 vir 72 30 58 16 vir 74 28 61 19 vir 79 38 64 20 vir 64 28 56 22 vir 63 28 51 15 vir 61 26 56 14 vir 77 30 61 23 vir 63 34 56 24 vir 64 31 55 18 vir 60 30 18 18 vir 69 31 54 21 vir 67 31 56 24 vir 69 31 51 23 vir 58 27 51 19 vir 68 32 59 23 vir 67 33 57 25 vir 67 30 52 23 vir 63 25 50 19 vir 65 30 52 20 vir 62 34 54 23 vir 59 30 51 18

DPP 27 23 21 26 36 32 33 32 20 27 27 24 25 31 30 27 26
SL SW PL PW ver 66 30 44 14 ver 68 28 48 14 ver 67 30 50 17 ver 60 29 45 15 ver 57 26 35 10 ver 55 24 38 11 ver 55 24 37 10 ver 58 27 39 12 ver 60 27 51 16 ver 54 30 45 15 ver 60 34 45 16 ver 67 31 47 15 ver 63 23 44 13 ver 56 30 41 13 ver 55 25 40 13 ver 55 26 44 12 ver 61 30 46 14

CL3 w outliers removed, p=aaax q=aaan
F    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Cnt  4  2  5 13  8 12  4  2 11  5  4  5  2  7  3  2

30 37 29 30 29 28 40 30 10 19 11 15 12 5 24 8 12 10 19 16 15 19 17 17 16 6 0 30 10 19 11 15 12 5 24 8 12 10 19 16 15 19 17 17 16 6 0 16 13 18 19 11 12 17 19 18 16 20

Checking [0,4] distances (s42 Setosa outlier)
F     0   1   2   3   3   3   4
     s14 s42 s45 s23 s16 s43  s3
s14   0   8  14   7  20   3   5
s42   8   0  17  13  24   9   9
s45  14  17   0  11   9  11  10
s23   7  13  11   0  15   5   5
s16  20  24   9  15   0  18  16
s43   3   9  11   5  18   0   3
s3    5   9  10   5  16   3   0

IRIS: 150 irises (rows), 4 columns (Sepal Length SL, Sepal Width SW, Petal Length PL, Petal Width PW). The first 50 are Setosa (s), the next 50 are Versicolor (e), and the next 50 are Virginica (i) irises.
FAUST DPP CLUSTER on IRIS with gap>=4, p=nnnn, q=xxxx
CL1: F<17 (50 Set)
CL2: 17<F<23 (e8, e11, e44, e49, i39)
CL3: 23<F (46 vers, 49 vir)

F:Count  0:1 1:1 2:1 3:3 4:1 5:6 6:4 7:5 8:7 9:3 10:8 11:5 12:1 13:2 14:1 15:1 19:1 20:1 21:3 26:2 28:1 29:4 30:2 31:2 32:2 33:4 34:3 36:5 37:2 38:2 39:2 40:5 41:6 42:5 43:7 44:2 45:1 46:3 47:2 48:1 49:5 50:4 51:1 52:3 53:2 54:2 55:3 56:2 57:1 58:1 59:1 61:2 64:2 66:2 68:1

Thinning=[6,7]: CL3.1 <6.5 (44 ver, 4 vir); CL3.2 >6.5 (2 ver, 39 vir). No sparse ends.

Check distances in [12,28]: s16, i39, e49, e11, {e8,e44}, i6, i10, i18, i19, i23, i32 are outliers.
F     12  13  13  14  15  19  20  21  21  21  26  26  28
     s34  s6 s45 s19 s16 i39 e49  e8 e11 e44 e32 e30 e31
s34    0   5   8   5   4  21  25  28  32  28  30  28  31
s6     5   0   4   3   6  18  21  23  27  24  26  23  27
s45    8   4   0   6   9  18  18  21  25  21  24  22  25
s19    5   3   6   0   6  17  21  24  27  24  25  23  27
s16    4   6   9   6   0  20  26  29  33  29  30  28  31
i39   21  18  18  17  20   0  17  21  24  21  22  19  23
e49   25  21  18  21  26  17   0   4   7   4   8   8   9
e8    28  23  21  24  29  21   4   0   5   1   7   8   8
e11   32  27  25  27  33  24   7   5   0   4   7   9   7
e44   28  24  21  24  29  21   4   1   4   0   6   8   7
e32   30  26  24  25  30  22   8   7   7   6   0   3   1
e30   28  23  22  23  28  19   8   8   9   8   3   0   4
e31   31  27  25  27  31  23   9   8   7   7   1   4   0

Checking [57,68] distances: i10, i36, i19, i32, i18, {i6,i23} are outliers.
F     57  58  59  61  61  64  64  66  66  68
     i26 i31  i8 i10 i36  i6 i23 i19 i32 i18
i26    0   5   4   8   7   8  10  13  10  11
i31    5   0   3  10   5   6   7  10  12  12
i8     4   3   0  10   7   5   6   9  11  11
i10    8  10  10   0   8  10  12  14   9   9
i36    7   5   7   8   0   5   7   9   9  10
i6     8   6   5  10   5   0   3   5   9   8
i23   10   7   6  12   7   3   0   4  11  10
i19   13  10   9  14   9   5   4   0  13  12
i32   10  12  11   9   9   9  11  13   0   4
i18   11  12  11   9  10   8  10  12   4   0

Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. We would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the STD of the F-values to ensure strong gaps (a heuristic method).

"Gap Hill Climbing": mathematical analysis

[Figure: two 16×16 grids (axes 0 to f) of labeled points showing p, q, the projection lines d1 and d2, and the resulting d1-gap and d2-gap.]

1. To increase gap size, we hill climb the standard deviation of the functional, F, hoping that a "rotation" of d toward a higher StDev increases the likelihood that gaps will be larger, since more dispersion allows for more and/or larger gaps. This is very heuristic but it works.

2. We are more interested in growing the largest gap(s) of interest (or largest thinning). F-slices are hyperplanes (assuming F = dot_d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values and the sequence of counts of points that give us those values that we use to find large gaps in the first place).

The d2-gap is much larger than the d1-gap. It is still not the optimal gap though. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, weighted by the d-barrel radius, from the center of the gap, on which each point lies)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius^2)? (Zero weighting after the first gap is identical to the previous method.)

Also, we really want to identify the support vector pair of the gap (the pair, one from one side and the other from the other side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q?
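The re-orientation step (take p and q as the means of the two F-slices that bound the largest gap, then project onto the line through them) can be sketched like this. It is a small NumPy illustration under the assumption that an F-slice is the set of points sharing one F-value; `largest_gap`, `reorient` and the six sample points are hypothetical.

```python
import numpy as np


def largest_gap(F):
    """Return the (low, high) F-values that bound the widest gap."""
    v = np.unique(F)                     # sorted distinct F-values
    i = int(np.argmax(np.diff(v)))
    return float(v[i]), float(v[i + 1])


def reorient(X, d):
    """One re-orientation of d using the means of the two gap-bounding F-slices."""
    F = X @ d
    lo, hi = largest_gap(F)
    p = X[np.isclose(F, lo)].mean(axis=0)   # mean of the slice just below the gap
    q = X[np.isclose(F, hi)].mean(axis=0)   # mean of the slice just above the gap
    return (q - p) / np.linalg.norm(q - p)


X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [4, 3], [3, 4]], dtype=float)
d1 = np.array([1.0, 0.0])
d2 = reorient(X, d1)

lo1, hi1 = largest_gap(X @ d1)
lo2, hi2 = largest_gap(X @ d2)
gap1, gap2 = hi1 - lo1, hi2 - lo2
print(gap1, gap2)
```

On this toy set the gap along d1 has width 2 and the gap along the re-oriented d2 is wider, which mirrors the d1-gap versus d2-gap comparison on the slide. Nothing here guarantees improvement on arbitrary data.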
Maximizing the Variance

Given any table, $X(X_1, \dots, X_n)$, with rows $x_1, \dots, x_N$, and any unit vector, d, in n-space, let $F_d(X) = X\circ d = DPP_d(X)$ be the column of values $x_1\circ d, x_2\circ d, \dots, x_N\circ d$. Writing a bar for the mean over the N rows:

$$V(d) \equiv \mathrm{Variance}(X\circ d) = \overline{(X\circ d)^2} - \left(\overline{X\circ d}\right)^2$$
$$= \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j}d_j\Big)^2 - \Big(\sum_{j=1}^{n}\overline{X_j}\,d_j\Big)^2$$
$$= \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}^2 d_j^2 + 2\sum_{j<k} x_{i,j}x_{i,k}d_jd_k\Big) - \Big(\sum_{j}\overline{X_j}^2 d_j^2 + 2\sum_{j<k}\overline{X_j}\,\overline{X_k}\,d_jd_k\Big)$$
$$= \sum_{j=1}^{n}\big(\overline{X_j^2} - \overline{X_j}^2\big)d_j^2 + 2\sum_{j<k}\big(\overline{X_jX_k} - \overline{X_j}\,\overline{X_k}\big)d_jd_k$$

We can separate out the diagonal or not:

$$V(d) = \sum_{j} a_{jj}d_j^2 + \sum_{j\ne k} a_{jk}d_jd_k = \sum_{i,j} a_{ij}d_id_j = d^{T} A\, d, \qquad a_{ij} = \overline{X_iX_j} - \overline{X_i}\,\overline{X_j},$$

subject to $\sum_{i=1}^{n} d_i^2 = 1$.

The gradient is $\nabla V(d) = 2A\,d$, with k-th component $2a_{kk}d_k + 2\sum_{j\ne k} a_{kj}d_j$.

Starting from any $d_0$, one can hill-climb it to locally maximize the variance, V, as follows: $d_1 \equiv \nabla V(d_0)$; $d_2 \equiv \nabla V(d_1)$; ... (each iterate renormalized to unit length, as the constraint requires).

Theorem 1: There exists $k \in \{1,\dots,n\}$ such that $d = e_k$ will hill-climb V to its global maximum.
Theorem 2 (working on it): Let $d = e_k$ such that $a_{kk}$ is a maximal diagonal element of A. Then $d = e_k$ will hill-climb V to its global maximum.

How do we use this theory? For Dot Product gap based Clustering, we can hill-climb from the $e_k$ with maximal $a_{kk}$ to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

For Dot Product Gap based Classification, we can start with X = the table of the C Training Set Class Means, $M_1, M_2, \dots, M_C$, where $M_k \equiv$ MeanVectorOfClass$_k$. Then $\overline{X_i} = \mathrm{Mean}(X)_i$ and $\overline{X_iX_j} = \mathrm{Mean}(M_{1,i}M_{1,j}, \dots, M_{C,i}M_{C,j})$. These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot product projections of the class means.

FAUST Classifier MVDI (Maximized Variance Definite Indefinite): Build a decision tree. 1. Find the d that maximizes the variance of the dot product projections of the class means each round. 2. Apply DI each round (see next slide).
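The quadratic form and the hill climb can be sketched in a few lines. This is a minimal NumPy illustration that assumes each step replaces d by its normalized gradient 2Ad (which coincides with power iteration on A); the function names are hypothetical, and X holds the three IRIS class means quoted on the FAUST DI slide.

```python
import numpy as np


def covariance_matrix(X):
    """A with a_ij = mean(X_i * X_j) - mean(X_i) * mean(X_j)."""
    mean = X.mean(axis=0)
    return (X.T @ X) / len(X) - np.outer(mean, mean)


def variance(A, d):
    """V(d) = d^T A d for a unit vector d."""
    return float(d @ A @ d)


def hill_climb(A, d0, steps=50):
    """Repeatedly replace d by the normalized gradient 2*A*d."""
    d = d0 / np.linalg.norm(d0)
    history = [variance(A, d)]
    for _ in range(steps):
        g = 2 * A @ d
        norm = np.linalg.norm(g)
        if norm == 0:            # d lies in the null space of A; cannot climb
            break
        d = g / norm
        history.append(variance(A, d))
    return d, history


# Class means (SL, SW, PL, PW) for setosa, versicolor, virginica.
X = np.array([[50.49, 34.74, 14.74, 2.43],
              [63.50, 30.00, 44.00, 13.50],
              [61.00, 31.50, 55.50, 21.50]])
A = covariance_matrix(X)
d0 = np.array([0.0, 0.0, 1.0, 0.0])   # e_3: petal length has the largest a_kk here
d, history = hill_climb(A, d0)
print(np.round(d, 2), round(history[0], 1), round(history[-1], 1))
```

Because A is built from only C rows, forming it is cheap, and the climb itself touches no training rows at all.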
FAUST DI

Given a K-class training set, $T_K$, and a given d (e.g., from D ≡ Mean($T_K$)→Med($T_K$)):
Let $m_i \equiv$ mean($C_i$), ordered so that $d\circ m_1 \le d\circ m_2 \le \dots \le d\circ m_K$.
$Mn_i \equiv \min\{d\circ C_i\}$, $Mx_i \equiv \max\{d\circ C_i\}$, $Mn_{>i} \equiv \min_{j>i}\{Mn_j\}$, $Mx_{<i} \equiv \max_{j<i}\{Mx_j\}$.
Definite$_i$ = ( $Mx_{<i}$, $Mn_{>i}$ )
Indefinite$_{i,i+1}$ = [ $Mn_{>i}$, $Mx_{<i+1}$ ]
Then recurse on each Indefinite.

For IRIS, 15 records were extracted from each Class for Testing. The rest are the Training Set, $T_K$. D = MEAN_s→MEAN_e.

Class means (SL SW PL PW):
s-Mean 50.49 34.74 14.74  2.43
e-Mean 63.50 30.00 44.00 13.50
i-Mean 61.00 31.50 55.50 21.50

Definite_i:  class    Mx<i  Mn>i
             s (i=1)    -1    25
             e (i=2)    10    37
             i (i=3)    48   128
Indefinite_i_i+1:  class  Mn>i  Mx<i+1
                   se       25     10   (empty)
                   ei       37     48

Training:
1ST ROUND, D = Mean_s→Mean_e:
  F < 18        setosa (35 seto)
  18 < F < 37   versicolor (15 vers)
  37 ≤ F ≤ 48   IndefiniteSet2 (20 vers, 10 virg)
  48 < F        virginica (25 virg)
IndefSet2 ROUND, D = Mean_e→Mean_i:
  F < 7         versicolor (17 vers, 0 virg)
  7 ≤ F ≤ 10    IndefSet3 (3 vers, 5 virg)
  10 < F        virginica (0 vers, 5 virg)
IndefSet3 ROUND, D = Mean_e→Mean_i:
  F < 3         versicolor (2 vers, 0 virg)
  3 ≤ F ≤ 7     IndefSet4 (2 vers, 1 virg). Here we will assign 0 ≤ F ≤ 7 to versicolor.
  7 < F         virginica (0 vers, 3 virg)

Test:
1ST ROUND, D = Mean_s→Mean_e:
  F < 15        setosa (15 seto)
  15 < F < 15   versicolor (0 vers, 0 virg)
  15 ≤ F ≤ 41   IndefiniteSet2 (15 vers, 1 virg)
  41 < F        virginica (14 virg)
IndefSet2 ROUND, D = Mean_e→Mean_i:
  F < 20        versicolor (15 vers, 0 virg)
  20 < F        virginica (0 vers, 1 virg)
100% accuracy.

Option-1: The sequence of D's is: Mean(Class_k)→Mean(Class_k+1), k=1... (and Mean could be replaced by VOM or?)
Option-2: The sequence of D's is: Mean(Class_k)→Mean(∪ h=k+1..n Class_h), k=1... (and Mean could be replaced by VOM or?)
Option-3: D seq.: Mean(Class_k)→Mean(∪ h not used yet Class_h), where k is the Class with max count in subcluster (VoM instead?)
Option-2: D seq.: Mean(Class_k)→Mean(∪ h=k+1..n Class_h) (VOM?), where k is the Class with max count in subcluster.
Option-4: D seq.: always pick the pair of means which are furthest separated from each other.
Option-5: D: Start with Median-to-Mean of IndefiniteSet, then the pair of means corresponding to max separation of F(mean_i), F(mean_j).
Option-6: D: Always use Median-to-Mean of IndefiniteSet, IS (initially, IS=X).
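The Definite and Indefinite intervals follow directly from the per-class minima and maxima of the projections. The sketch below is one way to compute them, assuming the classes are already ordered by increasing projected mean; it uses infinities where the slide uses sentinel bounds (-1 and 128), and the function name and sample F-values are made up so that the class minima and maxima match the slide's first-round table.

```python
import math


def definite_indefinite(projections):
    """projections[i] holds the F-values (d o x) of class i, classes ordered
    by increasing mean F. Returns (definite, indefinite) interval lists."""
    K = len(projections)
    mn = [min(c) for c in projections]
    mx = [max(c) for c in projections]

    definite = []
    for i in range(K):
        mx_below = max(mx[:i], default=-math.inf)      # Mx_{<i}
        mn_above = min(mn[i + 1:], default=math.inf)   # Mn_{>i}
        definite.append((mx_below, mn_above))

    indefinite = []
    for i in range(K - 1):
        low = min(mn[i + 1:])      # Mn_{>i}
        high = max(mx[:i + 1])     # Mx_{<i+1}
        indefinite.append((low, high) if low <= high else None)  # None = empty
    return definite, indefinite


# Made-up F-values whose min/max per class match the slide:
# setosa max 10, versicolor in [25, 48], virginica min 37.
F_by_class = [[0, 5, 10], [25, 30, 48], [37, 50, 60]]
definite, indefinite = definite_indefinite(F_by_class)
print(definite)    # [(-inf, 25), (10, 37), (48, inf)]
print(indefinite)  # [None, (37, 48)]
```

Only the versicolor/virginica overlap [37, 48] survives as an Indefinite interval, so only the points falling there need another round with a new d.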
FAUST MVDI (-1, 16.5=avg{23,10})s sCt=50 (16.5, 38)e eCt=24 (48.128)i iCt=39 d=(.33, -.1, .86, .38) (-1,8)e Ct=21 (10,128)i Ct=9 indef[38, 48]se_i seCt=26 iCt=13 indef[8,10]e_i eCt=5 iCt=4 Definite Indefinite i-Mean 62.8 29.2 46.1 14.5 i -1 8 e-Mean 59 26.9 49.6 18.4 e 10 17 i_e 8 10 empty d=(-.55, -.33, .51, .57) d0=(.33, -.1, .86,.38) 16.5 xod0 < 38 xod0 < 16.5 38 xod0 48 48 < xod0 Setosa Virginica Versicolor d1=(-.55, -.33, .51, .57) xod1 < 9 xod1 9 Virginica Versicolor on IRIS 15 records from each Class for Testing (Virg39 was removed as an outlier.) Definite_____ Indefinite s-Mean 50.49 34.74 14.74 2.43 s -1 10 e-Mean 63.50 30.00 44.00 13.50 e 23 48 s_ei 23 10 empty i-Mean 61.00 31.50 55.50 21.50 i 38 70 se_i 38 48 In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals; resulting in decision tree:
FAUST MVDI SatLog 413train 4atr 6cls 127test Using class means: FoMN Ct min max max+1 mn4 83 101 104 82 113 8 110 121 122 mn3 85 103 108 85 117 79 105 128 129 mn1 69 106 115 94 133 12 123 148 149 Using full data: (much better!) mn4 83 101 104 82 59 8 56 65 66 mn3 85 103 108 85 62 79 52 74 75 mn1 69 106 115 94 81 12 73 95 96 d=(0.39 0.89 0.35 0.10 ) F[a,b) 0 92 104 118 127 146 156 157 161 179 190 Class 2 2 2 2 2 2 5 5 5 5 7 7 7 7 7 7 1 1 1 1 1 1 1 4 4 4 4 4 3 3 3 3 d=(-.11 -.22 .54 .81) F[a,b) 89 102 Class 5 2 d=(-.15 -.29 .56 .76) F[a,b) 47 65 81 101 Class 7 5 5 2 2 d=(-.81, .17, .45, .33) F[a,b) 21 3541 59 Class 3 1 d=(-.01, -.19, .7, .69) d=(-.66, .19, .47, .56) F[a,b) 57 6169 87 Class 5 7 F[a,b) 5256667375 Class 333 3 4 11 cl=4 cl=7 Cl=7 Gradient Hill Climb of Variance(d) d1 d2 d3 d4 Vd) 0.00 0.00 1.00 0.00 282 0.13 0.38 0.64 0.65 700 0.20 0.51 0.62 0.57 742 0.26 0.62 0.57 0.47 781 0.30 0.70 0.53 0.38 810 0.34 0.76 0.48 0.30 830 0.36 0.79 0.44 0.23 841 0.37 0.81 0.40 0.18 847 0.38 0.83 0.38 0.15 850 0.39 0.84 0.36 0.12 852 0.39 0.84 0.35 0.10 853 Fomn Ct min max max+1 mn2 49 40 115 119 106 108 91 155 156 mn5 58 58 76 64 108 61 92 145 146 mn7 69 77 81 64 131 154 104 160 161 mn4 78 91 96 74 152 60 127 178 179 mn1 67 103 114 94 167 27 118 189 190 mn3 89 107 112 88 178 155 157 206 207 Gradient Hill Climb of Var(d)on t25 d1 d2 d3 d4 Vd) 0.00 0.00 0.00 1.00 1137 -0.11 -0.22 0.54 0.81 1747 MNod Ct ClMn ClMx ClMx+1 mn2 45 33 115 124 150 54 102 177 178 mn5 55 52 72 59 69 33 45 88 89 Gradient Hill Climb of Var(d)on t257 0.00 0.00 1.00 0.00 496 -0.15 -0.29 0.56 0.76 1595 Same using class means or training subset. Gradient Hill Climb of Var(d)on t75 0.00 0.00 1.00 0.00 12 0.04 -0.09 0.83 0.55 20 -0.01 -0.19 0.70 0.69 21 Gradient Hill Climb of Var(d)on t13 0.00 0.00 1.00 0.00 29 -0.83 0.17 0.42 0.34 166 0.00 0.00 1.00 0.00 25 -0.66 0.14 0.65 0.36 81 -0.81 0.17 0.45 0.33 88 On the 127 sample SatLog TestSet: 4 errors or 96.8% accuracy. speed? 
With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree Decision Tree, we take the entire TestSet (a PTreeSet), create the various dot product SPTS (one for each inode), and create cut SPTS masks. These masks mask the results for the entire TestSet.

Gradient Hill Climb of Var(d) on t143
   d1    d2    d3    d4   V(d)
 0.00  0.00  1.00  0.00    19
-0.66  0.19  0.47  0.56    95
 0.00  0.00  1.00  0.00    27
-0.17  0.35  0.75  0.53    54
-0.32  0.36  0.65  0.58    57
-0.41  0.34  0.62  0.58    58

For WINE (the last two columns are min and max+1):
8.40 10.33 27.00  9.63 28.65  9.9  53.4
7.56 11.19 32.61 10.38 34.32  7.7 111.8
8.57 12.84 30.55 11.65 32.72  8.7 108.4
8.91 13.64 34.93 11.97 37.16 13.1  92.2
Awful results!

Gradient Hill Climb of Var on t156161
   d1    d2    d3    d4   V(d)
 0.00  0.00  1.00  0.00     5
-0.23 -0.28  0.89  0.28    19
-0.02 -0.06  0.12  0.99   157
 0.02 -0.02  0.02  1.00   159
 0.00  0.00  1.00  0.00     1
-0.46 -0.53  0.57  0.43     2
Inconclusive both ways so predict plurality=4(17) (3ct=3 tct=6

Gradient Hill Climb of Var on t146156
   d1    d2    d3    d4   V(d)
 0.00  0.00  1.00  0.00     0
 0.03 -0.08  0.81 -0.58     1
 0.00  0.00  1.00  0.00    13
 0.02  0.20  0.92  0.34    16
 0.02  0.25  0.86  0.45    17
Inconclusive both ways so predict plurality=4(17) (7ct=15 2ct=2

Gradient Hill Climb of Var on t127
   d1    d2    d3    d4   V(d)
 0.00  0.00  1.00  0.00    41
-0.01 -0.01  0.70  0.71    90
-0.04 -0.04  0.65  0.75    91
 0.00  0.00  1.00  0.00    35
-0.32 -0.14  0.59  0.73   105
Inconclusive, predict plurality=7(62) 4(15) 1(5) 2(8) 5(7)
FAUST MVDI Concrete d0= -0.34 -0.16 0.81 -0.45 xod3<969 xod0<320 xod2<28 xod>=19.3 xod2>=662 xod2>=92 xod0>=634 xod>=18.6 d1= .85 -.03 .52 -.02 d2= .85 -.00 .53 .05 Class=m (test:1/1) Class= l or m Cl=l *test 6/9) Class=m errs0/1) Class=m errs8/12) Cl=h (test:11/12) Class=m errs0/4) Class=m errs0/0) Class=l (test:1/1) Class=m (test:2/2) xod<13.2 xod<13.2 .00 .00 1.00 .00 1.0 8.0 6 4 l 4.0 5.0 0 0 m 2.0 9.0 0 0 h 0 2 2 99 .97 .19 .08 .16 d1 13.4 19.6 0 0 l 16.9 19.9 4 3 m 13.5 16.0 0 0 h 0 13.45 18.6 99 0.97 0.19 0.06 0.15 14.4 19.6 0 0 l 16.8 18.8 0 0 m 13.5 15.8 11 1 h 0 14.366 17.816 99 Class=l errs:0/4) Class=h errs:0/5) Class=h errs:0/5) Class=h errs:0/1) d3= .81 .04 .58 .01 xod4>=681 xod3>=868 Cl=m (test:1/1) Cl=l (test:0/3) d4 = .79 .14 .60 .03 xod4<640 Cl=l *test 2/2) xod3<544 Cl=m *test 0/0) 7 test errors / 30 = 77% For Concrete min max+1 train 335.3 657.1 0 l 120.5 611.6 12 m 321.1 633.5 0 h Test 0 l ****** 1 m ****** 0 h ****** 0 321 3.0 57.0 0 l 3.0 361.0 11 m 28.0 92.0 0 h 0 l ***** 2 m ***** 0 h 92 ***** 999 .97 .17 -.02 .15 d0 13.3 19.3 0 0 l 16.4 23.5 0 0 m 12.2 15.2 25 5 h 0 13.2 19.3 23.5 Seeds d3 547.9 860.9 4 l 617.1 957.3 0 m 762.5 867.7 0 h 0 l ******* 0 m ******* 0 h . 0 ******* 617 8 test errors / 32 = 75% d2 544.2 651.5 0 l 515.7 661.1 0 m 591.0 847.4 40 h 1 l ****** 0 m ****** 11 h 662 ****** 999
FAUST Oblique Classifier

Formula: $P_{(X\circ D)>a}$, where X is any set of vectors and D is an oblique vector. (Note: if $D=e_i$, this is $P_{X_i>a}$.) The mask pTree is computed as $P_{X\circ d>a} = P_{\sum_i d_iX_i>a}$.

[Figure: scatter plot of classes r, v and b with their means m_r, m_v, m_b and the vector D drawn between class means.]

E.g., let D = the vector connecting class means and d = D/|D|. To separate r from v: $D = m_r - m_v$ (the vector from $m_v$ to $m_r$) and $a = \frac{m_v+m_r}{2}\circ d$ = the midpoint of D projected onto d. Then $P_{d\circ X > \frac{m_r+m_v}{2}\circ d}$ masks the vectors that make a shadow on the $m_r$ side of the midpoint. For classes r and b, use $D = m_r - m_b$ and $P_{d\circ X > \frac{m_r+m_b}{2}\circ d}$ in the same way. AND the 2 pTree masks (the one for classes r and v and the one for classes r and b).

FAUST-Oblique: Create a table, TBL(class_i, class_j, medoid_vector_i, medoid_vector_j).
Notes: If we just pick the one class which, when paired with r, gives the max gap, then we can use max gap or max_std_Int_pt instead of max_gap_midpt. Then we need std_j (or variance_j) in TBL.
Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier? "outermost" = "furthest from means" (their projections on the D-line); best rankK points, best std points, etc.
"medoid-to-medoid" is close to optimal provided classes are convex. The same holds in higher dimensions (if the classes are "convex" clusters, FAUST{div,oblique_gap} finds them).

Separate classR, classV using the midpoints of means (mom) method: calculate a.

[Figure: 2-D scatter plot (dim 1, dim 2) of classes R and V with means m_R, m_V, medians vom_R, vom_V, and the d-line through them.]

FAUST Oblique: $P_R = P_{(X\circ d)<a}$. $D \equiv m_V - m_R$ (the vector from $m_R$ to $m_V$) is the oblique vector and d = D/|D|. View $m_R$, $m_V$ as vectors ($m_R \equiv$ vector from the origin to pt_$m_R$); then $a = \big(m_R + \frac{m_V-m_R}{2}\big)\circ d = \frac{m_R+m_V}{2}\circ d$. (The very same formula works when D is the vector from $m_V$ to $m_R$, i.e., points to the left.)

Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).

Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use
1. the vector_of_medians, vom, to represent each class, rather than the mean $m_V$: $vom_V \equiv (\mathrm{median}\{v_1 | v\in V\}, \mathrm{median}\{v_2 | v\in V\}, \dots)$;
2. project each class onto the d-line (e.g., the R-class in the figure); then calculate the std of these distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between $m_R$ [$vom_R$] and $m_V$ [$vom_V$]).
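A compact sketch of the midpoint-of-means cut and of a std-ratio variant follows. The slide does not give the exact std-ratio formula, so the placement used here (dividing the segment between the projected means in the proportion std_R : std_V) is an assumption, as are the function names and the toy classes. The classification is done on whole arrays at once, in the spirit of bulk mask creation, but with NumPy rather than pTrees.

```python
import numpy as np


def oblique_cut(R, V, use_std_ratio=False):
    """Return (d, a) so that X o d < a selects the R side of the cut hyperplane."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR                          # oblique vector from m_R to m_V
    d = D / np.linalg.norm(D)
    a = ((mR + mV) / 2) @ d              # midpoint of the means projected onto d
    if use_std_ratio:
        sR, sV = (R @ d).std(), (V @ d).std()
        if sR + sV > 0:
            # Assumed rule: the cut sits sR/(sR+sV) of the way from m_R to m_V.
            a = mR @ d + (mV - mR) @ d * sR / (sR + sV)
    return d, a


def classify_R(X, d, a):
    """Boolean mask over all rows of X: True where the row falls on the R side."""
    return X @ d < a


R = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
V = np.array([[4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)

d, a = oblique_cut(R, V)
d_std, a_std = oblique_cut(R, V, use_std_ratio=True)
mask_R = classify_R(np.vstack([R, V]), d, a)
print(a, a_std, mask_R)
```

With equally dispersed classes the two cut points coincide; the std-ratio variant only moves the hyperplane when one class is more spread out along d than the other.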
L1(x,y) ValueArray z1 0 2 4 5 10 13 14 15 16 17 18 19 20 z2 0 2 3 8 11 12 13 14 15 16 17 18 z3 0 2 3 8 11 12 13 14 15 16 17 18 z4 0 2 3 4 6 9 11 12 13 14 15 16 z5 0 3 5 8 9 10 11 12 13 14 15 z6 0 5 6 7 8 9 10 z7 0 2 5 8 11 12 13 14 15 16 z8 0 2 3 6 9 11 12 13 14 z9 0 2 3 6 11 12 13 14 16 z10 0 3 5 8 9 10 11 13 15 z11 0 2 3 4 7 8 11 12 13 15 17 z12 0 1 2 3 6 8 9 11 13 14 15 17 19 z13 0 2 3 5 8 11 13 14 16 18 z14 0 1 2 3 7 9 10 12 14 15 16 18 20 z15 0 4 5 6 7 8 9 10 11 13 15 12/8/12 L1(x,y) Count Array z1 1 2 1 1 1 1 2 1 1 1 1 1 1 z2 1 3 1 1 1 2 1 1 1 1 1 1 z3 1 3 1 1 1 1 1 2 1 1 1 1 z4 1 2 1 1 1 1 1 2 1 2 1 1 z5 1 3 2 1 1 1 2 1 1 1 1 z6 1 2 3 2 4 1 2 z7 1 2 1 1 1 1 2 4 1 1 z8 1 2 1 1 1 2 4 1 2 z9 1 2 1 1 3 2 1 3 1 z10 1 2 2 2 1 2 2 2 1 z11 1 1 2 1 1 1 2 1 2 2 1 z12 1 1 1 1 1 1 1 2 1 1 1 2 1 z13 1 1 2 1 1 1 1 3 3 1 z14 1 1 1 1 1 1 1 2 1 1 1 2 1 z15 1 1 1 1 2 1 1 1 2 3 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9
From the L1(x,y) Value and Count Arrays (M marks the mean in the scatter plot): This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. It also confirms zf as an anomaly or outlier, since it was already declared so during the linear gap analysis. After having subclustered with linear gap analysis, it would make sense to run this round gap algorithm out only 2 steps to determine if there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
yo(x-M)/|x-M| Value Arrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 yo(x-M)/|x-M| Count Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 gap: 10-6 gap: 5-2 cluster PTree Masks (by ORing) z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
gap: 6-9 (z7) z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
gap: 3-7 (zd) zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
AND each red with each blue with each green, to get the subcluster masks (12 ANDs).
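The final ANDing step can be reproduced directly from the seven masks listed on these slides. The sketch assumes the three colour groups are the z1, z7 and zd mask families, and it stores each mask as a plain Python integer rather than a pTree; `members` is a hypothetical helper that lists the point numbers (1 to 15) selected by a mask.

```python
from itertools import product

N = 15  # points z1 ... z15 (z15 is written zf in the scatter plot)

z1_masks = {"z11": "000000111111110", "z12": "000001000000001", "z13": "111110000000000"}
z7_masks = {"z71": "111111000011111", "z72": "000000111100000"}
zd_masks = {"zd1": "000000000011111", "zd2": "111111111100000"}


def members(mask):
    """Point numbers whose bit is set; the leftmost bit corresponds to z1."""
    return [i + 1 for i in range(N) if (mask >> (N - 1 - i)) & 1]


combinations = list(product(z1_masks.items(), z7_masks.items(), zd_masks.items()))

subclusters = {}
for (a, ma), (b, mb), (c, mc) in combinations:
    mask = int(ma, 2) & int(mb, 2) & int(mc, 2)
    if mask:                                  # keep only non-empty intersections
        subclusters[(a, b, c)] = members(mask)

for key, pts in sorted(subclusters.items(), key=lambda kv: kv[1]):
    print(key, pts)
```

Of the 3 × 2 × 2 = 12 combinations, five are non-empty on these masks, and the singletons they produce are the same two points (z6 and zf) that the slides flag as outliers.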
FAUST Clustering Methods: MCR (Using Midlines of circumscribing Coordinate Rectangle)

[Figure: the circumscribing coordinate rectangle, with corners coded from MinVect = nv = (nv1,nv2,nv3) = 0000 to MaxVect = Xv = (Xv1,Xv2,Xv3) = 1111, and midline points f1 = 0½½½, g1 = 1½½½, f2 = ½0½½, g2 = ½1½½, plus ½½0½, ½½1½, ½½½0 and ½½½1 on the remaining midlines. Sub Clus1 and Sub Clus2 are marked.]

For any FAUST clustering method, we proceed in one of 2 ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and usually another point, g).
Given d, f≡MinPt(xod) and g≡MaxPt(xod). Given f and g, dk≡(f-g)/|f-g|.
So we can do any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ...

Define a sequence fk, gk, dk:
fk≡((nv1+Xv1)/2,...,nvk,...,(nvn+Xvn)/2)
gk≡((nv1+Xv1)/2,...,Xvk,...,(nvn+Xvn)/2)
dk=ek and SpS(xodk)=Xk

f, g, d, SpS(xod) require no processing (gap-finding is the only cost). MCR(fg) adds the cost of SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)).

MCR(dfg) on Iris150
Do SpS(xod) linear gap analysis (since it is processing free). Do SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)) round gap analysis. Sequence thru the {f, g} pairs. On what's left, look for outliers in SubClus1 and SubClus2.

d1 none
d2 none
d3 0 10 set23... 1 19 set45 0 30 ver49... 0 69 vir19

SubClus1: f1 none, g1 none, f2 none, g2 none, f3 none, g3 none, f4 none, g4 none
SubClus1 d4 1 6 set44 0 18 vir39
Leaves exactly the 50 setosa.

SubClus2: f1 none, g1 none, g2 none, f3 none, g3 none, f4 none, g4 none, d4 none
SubClus2 f2 1 41 vir23 0 47 vir18 0 47 vir32
Leaves 50 ver and 49 vir.
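As a small sketch of the fk, gk, dk sequence defined above: the function below builds the midline points from a MinVect and a MaxVect. The function name mcr_sequence is hypothetical, and the vectors are plain Python lists (no pTrees involved). The MINS and MAXS values are the Iris150 column minima and maxima quoted later in this deck.

```python
def mcr_sequence(min_vec, max_vec):
    """Return [(f_k, g_k, d_k)] for k = 1..n.

    f_k is the rectangle midpoint with coordinate k pushed to its minimum,
    g_k is the midpoint with coordinate k pushed to its maximum,
    d_k is the k-th standard basis vector e_k (as stated on the slide).
    """
    n = len(min_vec)
    mid = [(lo + hi) / 2 for lo, hi in zip(min_vec, max_vec)]
    seq = []
    for k in range(n):
        f_k = mid[:k] + [min_vec[k]] + mid[k + 1:]
        g_k = mid[:k] + [max_vec[k]] + mid[k + 1:]
        d_k = [1.0 if j == k else 0.0 for j in range(n)]
        seq.append((f_k, g_k, d_k))
    return seq


# Iris150 column minima and maxima (SL, SW, PL, PW) as given in this deck.
MINS = [43, 20, 10, 1]
MAXS = [79, 44, 69, 25]
sequence = mcr_sequence(MINS, MAXS)
f1, g1, d1 = sequence[0]
```

Because dk = ek, projecting onto dk just reads off column k, which is why the slide calls the d analysis processing free.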
MCR(d) on Iris150+Outlier30, gap>4:

Do SpS(xodk) linear gap analysis, k=1,2,3,4. Declare subclusters of size 1 or two to be outliers. Create the full pairwise distance table for any subcluster of size 10 and declare any point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).

Pairwise distance tables:

        t2    t23   t24   t234
t2      0.0   35.0  12.0  37
t23     35.0  0.0   37.0  12
t24     12.0  37.0  0.0   35
t234    37.0  12.0  35.0  0

        b124  b134  b14   ball
b124    0.0   52.4  30.0  43.0
b134    52.4  0.0   43.0  30.0
b14     30.0  43.0  0.0   52.4
ball    43.0  30.0  52.4  0.0

        b24   b2    b234  b23
b24     0.0   28.0  43.0  51.3
b2      28.0  0.0   51.3  43.0
b234    43.0  51.3  0.0   28.0
b23     51.3  43.0  28.0  0.0

        t13   t12   t1    t123
t13     0.0   43.0  35.0  25.0
t12     43.0  0.0   25.0  35.0
t1      35.0  25.0  0.0   43.0
t123    25.0  35.0  43.0  0.0

        t124  t14   tall  t134
t124    0.0   25.0  35.0  43.0
t14     25.0  0.0   43.0  35.0
tall    35.0  43.0  0.0   25.0
t134    43.0  35.0  25.0  0.0

        b12   b1    b13   b123
b12     0.0   30.0  52.4  43.0
b1      30.0  0.0   43.0  52.4
b13     52.4  43.0  0.0   30.0
b123    43.0  52.4  30.0  0.0

d1 0 17 t124 0 17 t14 0 17 tall 1 17 t134 0 23 t13 0 23 t12 0 23 t1 1 23 t123 0 38 set14 ... 1 79 vir32 0 84 b12 0 84 b1 0 84 b13 1 84 b123 0 98 b124 0 98 b134 0 98 b14 0 98 ball
d2 0 5 t2 0 5 t23 0 5 t24 1 5 t234 0 20 ver1 ... 1 44 set16 0 60 b24 0 60 b2 0 60 b234 0 60 b23
d3 0 10 set23... 1 19 set25 0 30 ver49... 1 69 vir19 Same split (expected)
SubClus1 d4 1 6 set44 0 18 vir39 Leaves exactly the 50 setosa as SubCluster1.
SubClus2 d4 0 0 t4 1 0 t24 0 10 ver18 ... 1 25 vir45 0 40 b4 0 40 b24 Leaves the 49 virginica (vir39 declared an outlier) and the 50 versicolor as SubCluster2.

MCR(d) performs well on this dataset.

Accuracy: We can't expect a clustering method to separate versicolor from virginica because there is no gap between them. This method does separate off setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It also finds the virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other virginica irises, whereas no other iris is more than 9.1 from its classmates).

Speed: dk = ek, so there is zero calculation cost for the d's. SpS(xodk) = SpS(xoek) = SpS(Xk), so there is zero calculation cost for it. The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!
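The pairwise-distance outlier test described for this slide can be sketched as follows. It is a minimal illustration, assuming Euclidean distance on plain tuples; the function names (distance_table, isolated_points) are hypothetical. The four points are the added outliers t2, t23, t24 and t234 with the coordinates given in the Iris150+Outlier30 table of this deck.

```python
import math


def distance_table(points):
    """Full pairwise Euclidean distance table for a dict {name: coordinates}."""
    names = list(points)
    return {a: {b: math.dist(points[a], points[b]) for b in names}
            for a in names}


def isolated_points(points, threshold):
    """Names whose distances to every other point all exceed threshold."""
    table = distance_table(points)
    return [a for a in table
            if all(dist > threshold
                   for b, dist in table[a].items() if b != a)]


# Coordinates (SL, SW, PL, PW) of four added outliers, from the data table.
SUBCLUSTER = {
    "t2": (58, 5, 37, 12),
    "t23": (58, 5, 2, 12),
    "t24": (58, 5, 37, 0),
    "t234": (58, 5, 2, 0),
}

table = distance_table(SUBCLUSTER)
outliers = isolated_points(SUBCLUSTER, threshold=4)
```

Here every off-diagonal distance is at least 12, so with threshold 4 all four points are declared outliers, matching the first distance table above.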
CCR(fgd) (Corners of Circumscribing Coordinate Rectangle)

f1=minVecX≡(minXx1..minXxn) (0000), g1=MaxVecX≡(MaxXx1..MaxXxn) (1111), d=(g-f)/|g-f|.
Sequence thru main diagonal pairs, {f, g}, lexicographically. For each, create d.
CCR(f): Do SpS((x-f)o(x-f)) round gap analysis.
CCR(g): Do SpS((x-g)o(x-g)) round gap analysis.
CCR(d): Do SpS(xod) linear gap analysis.

Notes: No calculation is required to find f and g (assuming MaxVecX and minVecX have been calculated and residualized when PTreeSetX was captured). If the dimension is high, the main diagonal corners are likely far from X, so the large radii make the round gaps nearly linear.

start f1=MnVec RnGp>4 none
g1=MxVec RnGp>4 0 7 vir18... 1 47 ver30 0 53 ver49.. 0 74 set14
This gives Sub Clus1 and Sub Clus2.

The corner pairs are f1=0000 g1=1111, f2=0001 g2=1110, f3=0010 g3=1101, f4=0011 g4=1100, f5=0100 g5=1011, f6=0101 g6=1010, f7=0110 g7=1001, f8=0111 g8=1000. Within SubClus1 and SubClus2, every f and g round gap analysis (RnGp>4) and every linear gap analysis (Lin>4) found none, with two exceptions (one run each):
f6=0101 RnGp>4 1 19 set26 0 28 ver49 0 31 set42 0 31 ver8 0 32 set36 0 32 ver44 1 35 ver11 0 41 ver13
f7=0110 RnGp>4 1 28 ver13 0 33 vir49

Subc2.1: ver49 ver8 ver44 ver11

        ver49 set42 ver8  set36 ver44 ver11
ver49   0.0   19.8  3.9   21.3  3.9   7.2
set42   19.8  0.0   21.6  10.4  21.8  23.8
ver8    3.9   21.6  0.0   23.9  1.4   4.6
set36   21.3  10.4  23.9  0.0   24.2  27.1
ver44   3.9   21.8  1.4   24.2  0.0   3.6
ver11   7.2   23.8  4.6   27.1  3.6   0.0

This ends SubClus2 = 47 setosa only.
This ends SubClus1 = 95 ver and vir samples only.
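The lexicographic walk through main diagonal corner pairs can be sketched like this. It is an illustration only: the function name diagonal_pairs is hypothetical, corners are plain lists, and it assumes the rectangle is not degenerate (MinVect differs from MaxVect), so that |g-f| is nonzero. The direction follows the d=(g-f)/|g-f| convention of this slide.

```python
from itertools import product
import math


def diagonal_pairs(min_vec, max_vec):
    """Enumerate main-diagonal corner pairs {f, g} in lexicographic order.

    A corner is coded by bits: 0 picks the column minimum, 1 the maximum.
    g is always the bitwise complement of f, so only codes starting with 0
    are needed (the rest would repeat the same pairs with f and g swapped).
    """
    pairs = []
    for bits in product((0, 1), repeat=len(min_vec)):
        if bits[0] == 1:
            break  # all remaining codes are complements of earlier ones
        f = [hi if b else lo for b, lo, hi in zip(bits, min_vec, max_vec)]
        g = [lo if b else hi for b, lo, hi in zip(bits, min_vec, max_vec)]
        length = math.dist(f, g)
        d = [(gj - fj) / length for fj, gj in zip(f, g)]
        f_code = "".join(str(b) for b in bits)
        g_code = "".join(str(1 - b) for b in bits)
        pairs.append((f_code, g_code, f, g, d))
    return pairs


# Iris150 column minima and maxima (SL, SW, PL, PW) as given in this deck.
pairs = diagonal_pairs([43, 20, 10, 1], [79, 44, 69, 25])
codes = [(f_code, g_code) for f_code, g_code, _, _, _ in pairs]
```

For four columns this yields the eight pairs 0000/1111 through 0111/1000 used in the run above.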
SL SW PL PW set 51 35 14 2 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 set 49 30 14 2 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 set 47 32 13 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 set 46 31 15 2 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 set 50 36 14 2 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 set 54 39 17 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 set 46 34 14 3 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 set 50 34 15 2 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 set 44 29 14 2 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 set 54 37 15 2 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 set 48 34 16 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 set 48 30 14 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 set 43 30 11 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 set 58 40 12 2 0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 set 57 44 15 4 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 set 54 39 13 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 set 51 35 14 3 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 set 57 38 17 3 0 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 set 51 38 15 3 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 set 54 34 17 2 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 set 51 37 15 4 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 set 46 36 10 2 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 set 51 33 17 5 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 set 48 34 19 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 set 50 30 16 2 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 set 50 34 16 4 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 set 52 35 15 2 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 set 52 34 14 2 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 set 47 32 16 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 1 0 set 48 31 16 2 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 set 54 34 15 4 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 set 52 41 15 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 1 set 55 42 14 2 0 1 1 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 set 50 32 12 2 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 set 55 35 13 2 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0 set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 set 44 30 13 2 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0 set 51 34 15 2 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 set 50 35 13 3 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 1 set 45 23 13 3 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 1 1 set 44 32 13 2 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 set 50 35 16 6 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 set 51 38 19 4 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 set 48 30 14 3 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 set 51 38 16 2 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 set 46 32 14 2 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 set 53 37 15 2 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 set 50 33 14 2 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 ver 70 32 47 14 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 ver 64 32 45 15 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 1 1 ver 69 31 49 15 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 1 ver 55 23 40 13 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 ver 65 28 46 15 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 ver 57 28 45 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 1 ver 63 33 47 16 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 0 0 ver 49 24 33 10 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 ver 66 29 46 13 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 1 ver 52 27 39 14 0 1 1 0 1 
0 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 0 ver 50 20 35 10 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 ver 59 30 42 15 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 ver 60 22 40 10 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 ver 61 29 47 14 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0 ver 56 29 36 13 0 1 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 0 1 ver 67 31 44 14 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 ver 56 30 45 15 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 ver 58 27 41 10 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 0 ver 62 22 45 15 0 1 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 ver 56 25 39 11 0 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 ver 59 32 48 18 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 ver 61 28 40 13 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 ver 63 25 49 15 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 ver 61 28 47 12 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 0 ver 64 29 43 13 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 ver 66 30 44 14 1 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 1 1 0 ver 68 28 48 14 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 ver 67 30 50 17 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 ver 60 29 45 15 0 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1 1 ver 57 26 35 10 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 ver 55 24 38 11 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 ver 55 24 37 10 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 ver 58 27 39 12 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 ver 60 27 51 16 0 1 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 ver 54 30 45 15 0 1 1 0 1 1 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 ver 60 34 45 16 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 ver 67 31 47 15 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 ver 63 23 44 13 0 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 ver 56 30 41 13 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 0 
1 0 0 1 1 0 1 ver 55 25 40 13 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1 ver 55 26 44 12 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 0 ver 61 30 46 14 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 0 SL SW PL PW ver 58 26 40 12 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 ver 50 23 33 10 0 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 ver 56 27 42 13 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 0 1 1 0 1 ver 57 30 42 12 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 ver 57 29 42 13 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 1 ver 62 29 43 13 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 ver 51 25 30 11 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 1 ver 57 28 41 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 vir 63 33 60 25 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 1 vir 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 vir 71 30 59 21 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 vir 63 29 56 18 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0 vir 65 30 58 22 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 vir 76 30 66 21 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 vir 49 25 45 17 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 vir 73 29 63 18 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 vir 67 25 58 18 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1 0 vir 72 36 61 25 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 0 0 1 vir 65 32 51 20 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 0 vir 64 27 53 19 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 vir 68 30 55 21 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 vir 57 25 50 20 0 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 vir 58 28 51 24 0 1 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0 vir 64 32 53 23 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1 vir 65 30 55 18 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 0 vir 77 38 67 22 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 1 0 
vir 77 26 69 23 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1 1 vir 60 22 50 15 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 vir 69 32 57 23 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 vir 56 28 49 20 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 vir 77 28 67 20 1 0 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 vir 63 27 49 18 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 0 vir 67 33 57 21 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 vir 72 32 60 18 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 vir 62 28 48 18 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 vir 61 30 49 18 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0 vir 64 28 56 21 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 1 vir 72 30 58 16 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 vir 74 28 61 19 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1 vir 79 38 64 20 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 vir 64 28 56 22 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 vir 63 28 51 15 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 1 1 vir 61 26 56 14 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 vir 77 30 61 23 1 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1 vir 63 34 56 24 0 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 vir 64 31 55 18 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0 vir 60 30 18 18 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 vir 69 31 54 21 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 vir 67 31 56 24 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 vir 69 31 51 23 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 vir 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 vir 68 32 59 23 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 1 1 vir 67 33 57 25 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 0 0 1 vir 67 30 52 23 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 vir 63 25 50 19 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1 vir 65 30 52 20 1 0 0 0 0 0 
1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 vir 62 34 54 23 0 1 1 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 vir 59 30 51 18 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 0 t1 20 30 37 12 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1 t2 58 5 37 12 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 t3 58 30 2 12 0 1 1 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 t4 58 30 37 0 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 t12 20 5 37 12 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 t13 20 30 2 12 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 t14 20 30 37 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 t23 58 5 2 12 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 t24 58 5 37 0 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 t34 58 30 2 0 0 1 1 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 t123 20 5 2 12 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 t124 20 5 37 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 t134 20 30 2 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 t234 58 5 2 0 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 tall 20 5 2 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 b1 90 30 37 12 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1 b2 58 60 37 12 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 b3 58 30 80 12 0 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 b4 58 30 37 40 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 b12 90 60 37 12 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 b13 90 30 80 12 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 b14 90 30 37 40 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 b23 58 60 80 12 0 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 b24 58 60 37 40 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 b34 58 30 80 40 0 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 b123 90 60 80 12 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 b124 90 60 37 40 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 b134 90 30 
80 40 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 b234 58 60 80 40 0 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 ball 90 60 80 40 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 Before adding the new tuples: MINS 43 20 10 1 MAXS 79 44 69 25 MEAN 58 30 37 12 same after additions.
FM(fgd) (Furthest-from-the-Medoid)
FMO (FM using a Gram-Schmidt Orthonormal basis)

X ⊆ Rn. Calculate M=MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. And BTW, use residualized STD calculations to guide in choosing good gap width thresholds (which define what an outlier is going to be and also determine when we divide into sub-clusters).

f1≡MxPt(SpS[(M-x)o(M-x)]). d1≡(M-f1)/|M-f1|.
If d11≠0, Gram-Schmidt {d1 e1...ek-1 ek+1..en}:
d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|
d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2|
...
dh ≡ (eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1) / |eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1|

Thm: MxPt[SpS((M-x)od)]=MxPt[SpS(xod)] (shift by Mod, MxPts are same).
Repick f1≡MnPt[SpS(xod1)]. Pick g1≡MxPt[SpS(xod1)].
Pick fh≡MnPt[SpS(xodh)]. Pick gh≡MxPt[SpS(xodh)].

1. Choose f0 (high outlier potential? e.g., furthest from mean, M?)
2. Do f0-rnd-gap analysis (+ subcluster anal?)
3. Let f1 be s.t. no x is further away from f0 (in some dir) (all d1 dot products have the same sign)
4. Do f1-rnd-gap analysis (+ subclust anal?).
5. Do d1-linear-gap analysis, d1≡ (f0-f1) / |f0-f1|.
6. Let f2 be s.t. no x is further away (in some direction) from the d1-line than f2
7. Do f2-round-gap analysis.
8. Do d2-linear-gap analysis, d2 ≡ ((f0-f2) - ((f0-f2)od1)d1) / len...

DISTANCES

        t123    b234    tall    b134    b123
t123    0.00    106.48  12.00   111.32  118.36
b234    106.48  0.00    110.24  43.86   42.52
tall    12.00   110.24  0.00    114.93  118.97
b134    111.32  43.86   114.93  0.00    41.04
b123    118.36  42.52   118.97  41.04   0.00
All outliers!

        b12     b14     b24
b12     0.00    41.04   42.52
b14     41.04   0.00    43.86
b24     42.52   43.86   0.00
All outliers again!

f=M Gp>4 1 53 b13 0 58 t123 0 59 b234 0 59 tall 0 60 b134 1 61 b123 0 67 ball
f0=t123 RnGp>4 1 0 t123 0 25 t13 1 28 t134 0 34 set42... 1 103 b23 0 108 b13
SubClust-1 f0=b2 RnGp>4 1 0 b2 0 28 ver36
SubClust-2 f0=t3 RnGp>4 none
SubClust-1 f0=b3 RnGp>4 1 0 b3 0 23 vir8 ... 1 54 b1 0 62 vir39
SubClust-2 f0=t3 LinGap>4 1 0 t3 0 12 t34
f0=b23 RnGp>4 1 0 b23 0 30 b3... 1 84 t34 0 95 t23 0 96 t234
SubClust-2 f0=t34 LinGap>4 1 0 t34 0 13 set36
f0=b124 RnGp>4 1 0 b124 0 28 b12 0 30 b14 1 32 b24 0 41 vir10... 1 75 t24 1 81 t1 1 86 t14 1 93 t12 0 98 t124
SubClust-1 f0=t24 RnGp>4 1 0 t24 1 12 t2 0 20 ver13
SubClust-2 f0=set16 LnGp>4 none
SubClust-1 f1=ver49 RdGp>4 none
SubClust-1 f0=b1 RnGp>4 1 0 b1 0 23 ver1
SubClust-2 f1=set42 RdGp>4 none
SubClust-1 f1=ver49 LnGp>4 none
SubClust-1 f0=ver19 RnGp>4 none
SubClust-2 f1=set42 LnGp>4 none
SubClust-2 is 50 setosa! Likely f2, f3 and f4 analysis will not find any.
f0=b34 RnGp>4 1 0 b34 0 26 vir1 ... 1 66 vir39 0 72 set24 ... 1 83 t3 0 88 t34
SubClust-1 f0=ver19 LinGp>4 none
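A minimal sketch of the Gram-Schmidt step used by FMO, written in plain Python: it extends a given direction d1 to an orthonormal basis using the standard basis vectors with one of them dropped. The function name gram_schmidt_basis and the drop parameter are hypothetical, and the sketch assumes the component of d1 at the dropped index is nonzero (otherwise the set is not linearly independent).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_basis(d1, drop=0):
    """Orthonormal basis [d1, d2, ..., dn] built from d1 and the standard
    basis vectors e_h with h != drop.

    Each new direction is e_h minus its projections onto the directions
    found so far, divided by its length.
    """
    n = len(d1)
    if abs(d1[drop]) < 1e-12:
        raise ValueError("d1 needs a nonzero component at the dropped index")
    norm = math.sqrt(dot(d1, d1))
    basis = [[c / norm for c in d1]]
    for h in range(n):
        if h == drop:
            continue
        e_h = [1.0 if j == h else 0.0 for j in range(n)]
        resid = list(e_h)
        for d in basis:
            coeff = dot(e_h, d)
            resid = [r - coeff * c for r, c in zip(resid, d)]
        length = math.sqrt(dot(resid, resid))
        basis.append([r / length for r in resid])
    return basis


# Hypothetical starting direction along the main diagonal of a 4-column table.
basis = gram_schmidt_basis([1.0, 1.0, 1.0, 1.0])
```

Each dh can then serve as the projection direction for one linear gap analysis, with fh and gh picked as the minimum and maximum points of SpS(xodh).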
        b123  b134  b234
b123    0.0   41.0  42.5
b134    41.0  0.0   43.9
b234    42.5  43.9  0.0

        b24   b2    b12
b24     0.0   28.0  42.5
b2      28.0  0.0   32.0
b12     42.5  32.0  0.0

        t23   t234  t12   t24   t124  t2
t23     0.0   12.0  51.7  37.0  53.0  35.0
t234    12.0  0.0   53.0  35.0  51.7  37.0
t12     51.7  53.0  0.0   39.8  12.0  38.0
t24     37.0  35.0  39.8  0.0   38.0  12.0
t124    53.0  51.7  12.0  38.0  0.0   39.8
t2      35.0  37.0  38.0  12.0  39.8  0.0

        b34   b124  b23   t13   b13
b34     0.0   61.4  41.0  91.2  42.5
b124    61.4  0.0   60.5  88.4  59.4
b23     41.0  60.5  0.0   91.8  43.9
t13     91.2  88.4  91.8  0.0   104.8
b13     42.5  59.4  43.9  104.8 0.0

[Figure: scatter plots of the data marking f and g for FMG-GM, MCR f and MCR g, and, for the CRC method, g1=MaxVector and f1=MinVector.]

FMO(d)
f1=ball g1=tall LnGp>4 1 -137 ball 0 -126 b123 0 -124 b134 1 -122 b234 0 -112 b13 ... 1 -29 t13 1 -24 t134 1 -18 t123 1 -13 tall
f2=vir11 g2=set16 Ln>4 none
f3=t34 g3=vir18 Ln>4 none
f4=t4 g4=b4 Ln>4 1 24 vir1 0 39 b4 0 39 b14
f4=t4 g4=vir1 Ln>4 none
This ends the process. Every anomaly we found was an added one, but we missed t34, t14, t4, t1, t3, b1, b3.

f1=b13 g1=b2 LnGp>4 none
f2=t2 g2=b2 LnGp>4 1 21 set16 0 26 b2
f2=t2 g2=t234 Ln>4 0 5 t23 0 5 t234 0 6 t12 0 6 t24 0 6 t124 1 6 t2 0 21 ver11
f2=vir11 g2=b23 Ln>4 1 43 b12 0 50 b34 0 51 b124 0 51 b23 0 52 t13 0 53 b13
f2=vir11 g2=b12 Ln>4 1 45 set16 0 61 b24 0 61 b2 0 61 b12
FMO(fg): start with f1≡MxPt(SpS((M-x)o(M-x))). Round gaps first, then Linear gaps.

f1=ball RnGp>4 1 0 ball 0 28 b123... 1 73 t4 0 78 vir39... 1 98 t34 0 103 t12 0 104 t23 0 107 t124 1 108 t234 0 113 t13 1 116 t134 0 122 t123 0 125 tall
SubClus2 f1=t14 Rn>4 0 0 t1 1 0 t14 0 30 ver8 ... 1 47 set15 0 52 t3 0 52 t34
SubClus1 f1=b123 Rn>4 1 0 b123 0 30 b13 0 30 vir32 0 30 vir18 1 32 b23 0 37 vir6

        t12   t23   t124  t234
t12     0.0   51.7  12.0  53.0
t23     51.7  0.0   53.0  12.0
t124    12.0  53.0  0.0   51.7
t234    53.0  12.0  51.7  0.0

        b13   vir32 vir18 b23
b13     0.0   22.5  22.4  43.9
vir32   22.5  0.0   4.1   35.3
vir18   22.4  4.1   0.0   33.4
b23     43.9  35.3  33.4  0.0

        ver49 ver8  ver44 ver11
ver49   0.0   3.9   3.9   7.1
ver8    3.9   0.0   1.4   4.7
ver44   3.9   1.4   0.0   3.7
ver11   7.1   4.7   3.7   0.0
Almost outliers! Subcluster2.2. Which type? Must classify.

Sub Clus2.2
        b124  b12   b14
b124    0.0   28.0  30.0
b12     28.0  0.0   41.0
b14     30.0  41.0  0.0

Finally we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but we know we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).

We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are not real iris samples, so we should really do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do) that we can separate Setosa, Versicolor and Virginica perfectly using FAUST Classify.

If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot product projection gap analysis!
FFG (Furthest to Furthest) computes SpS((M-x)o(M-x)) to get f1 (expensive? Grab any pt? A corner pt?), then computes SpS((x-f1)o(x-f1)) for f1-round-gap-analysis. Then it computes SpS(xod1) to get g1, the point whose projection is furthest from that of f1 (for d1 linear gap analysis). (Too expensive, since gk-round-gap-analysis and linear analysis contributed very little? But we need it to get f2, etc. Are there other cheaper ways to get a good f2?) g1-round-gap-analysis needs SpS((x-g1)o(x-g1)) (too expensive!).

SubClus2 f1=set23 Rn>4 1 17 vir39 0 23 ver49 0 26 ver8 0 27 ver44 1 30 ver11 0 43 t24 0 43 t2
SubClus1 f1=b134 Rn>4 1 0 b134 0 24 vir19
SC1 f2=ver13 Rn>4 1 0 ver13 0 5 ver43
SubClus1 f1=b234 Rn>4 1 0 b234 1 30 b34 0 37 vir10
SC1 g2=vir10 Rn>4 1 0 vir10 0 6 vir44
SubClus1 f1=b124 Rn>4 1 0 b124 0 28 b12 0 30 b14 1 32 b24 0 41 b1... 1 59 t4 0 68 b3
SbCl_2.1 g1=ver39 Rn>4 1 0 vir39 0 7 set21
Note: what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.
SC1 f4=b1 Rn>4 1 0 b1 0 23 ver1
SbCl_2.1 g1=set19 Rn>4 none
SbCl_2.1 f3=set16 Rn>4 none
SbCl_2.1 LnG>4 none
SbCl_2.1 g3=set9 Rn>4 none
SbCl_2.1 f2=set42 Rn>4 1 0 set42 0 6 set9
SC1 f1=vir19 Rn>4 1 44 t4 0 52 b2
SC1 g4=b4 Rn>4 1 0 b4 0 21 vir15
SbCl_2.1 LnG>4 none
SbCl_2.1 f4=set Rn>4 none
SbCl_2.1 f2=set9 Rn>4 none
SbCl_2.1 g4=set Rn>4 none
SC1 g1=b2 Rn>4 1 0 t4 0 28 ver36
SubClus1 has 91, only versicolor and virginica.
SbCl_2.1 g2=set16 Rn>4 none
SbCl_2.1 LnG>4 none
SbCl_2.1 LnG>4 none
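One FFG step, as I read the description above, can be sketched on a tiny made-up point set (the five 2-D points below are hypothetical, not Iris data). The helper names are hypothetical too, plain lists stand in for pTrees, and the direction uses the d1≡(M-f1)/|M-f1| convention from the FM slide. The sketch assumes the points are not all identical, so |M-f1| is nonzero.

```python
import math


def mean_vector(points):
    """Column-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]


def round_gaps(points, center, min_gap):
    """Gaps wider than min_gap between consecutive sorted distances
    from center (round gap analysis)."""
    dists = sorted(math.dist(p, center) for p in points)
    return [(a, b) for a, b in zip(dists, dists[1:]) if b - a > min_gap]


def ffg_step(points, min_gap):
    """Pick f1 furthest from the mean, do f1 round gap analysis, then pick
    g1 as the point whose projection on d1 is furthest from that of f1."""
    m = mean_vector(points)
    i_f = max(range(len(points)), key=lambda i: math.dist(points[i], m))
    f1 = points[i_f]
    gaps = round_gaps(points, f1, min_gap)
    length = math.dist(m, f1)
    d1 = [(mj - fj) / length for mj, fj in zip(m, f1)]
    proj = [sum(x * d for x, d in zip(p, d1)) for p in points]
    i_g = max(range(len(points)), key=lambda i: abs(proj[i] - proj[i_f]))
    return i_f, i_g, gaps


# Hypothetical toy data: a tight group near the origin plus one far point.
TOY = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
i_f, i_g, gaps = ffg_step(TOY, min_gap=4)
```

On this toy set the far point is chosen as f1, its round gap analysis shows a single wide gap (so it would be peeled off as an outlier), and g1 is the point at the opposite end of the d1 projection.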
For speed of text mining (and of other high dimension datamining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by the functional (e.g., Xk, SpS((x-M)o(x-M)), SpS((x-f)o(x-f)), SpS(xod), etc.). The STDs of the columns, Xk, can be precomputed up front, once and for all. STDs of projection and square distance functionals must be computed after they are generated (could be done upon capture too). Good functionals produce many large gaps. In Iris150 and Iris150+Out30, I find that the precomputed STD is a good indicator of that.

A text mining scheme might be:
1. Capture the text as a PTreeSET (after stemming the content words) and store mean, median, STD of every column (content word stem).
2. Throw out low STD columns.
4'. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)

A possible Attribute Selection algorithm:
1. Peel from X, outliers using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd.. (Xin = X - Xout)
2. Calculate widths of each Xin-Circumscribing Rectangle edge, crewk
4. Look for wide gaps top down (or, very simply, order by STD).
4'. Divide crewk into count{xk| x∈Xin}. (but that doesn't account for dups)
4''. Look for preponderance of wide thin-gaps top down.
4'''. Look for high projection interval count dispersion (STD).

Notes:
1. Maybe an inlier sub-cluster needs to occur from more than one functional projection to be declared an inlier sub-cluster?
2. STD of a functional projection appears to be a good indicator of the quality of its gap analysis.

        ver49 ver8  ver44 ver11
ver49   0.0   3.9   3.9   7.2
ver8    3.9   0.0   1.4   4.6
ver44   3.9   1.4   0.0   3.6
ver11   7.2   4.6   3.6   0.0

For FAUST Cluster-d (pick d, then f=MnPt(xod) and g=MxPt(xod)) a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles a1, ..., am, each equi-width partitioned on [0,180), with the formulas (writing θk for the k-th angle and Δ for the angular step):
d = e1·Π(k=n...2) cos θk + e2 sin θ2·Π(k=n...3) cos θk + e3 sin θ3·Π(k=n...4) cos θk + ... + en sin θn
where the θi's start at 0 and increment by Δ. So,
d(i1..in) = Σ(j=1..n) [ ej sin((ij-1)Δ) * Π(k=n..j+1) cos(θk) ]; i0≡0, Δ divides 180 (e.g., 90, 45, 22.5...)

CRMSTD(dfg): Eliminate all columns with STD < threshold. Here sqr(n) denotes the square root of n, so each sum of d's is rescaled to unit length.

d3 0 10 set23...50set+vir39 1 19 set25 0 30 ver49...50ver_49vir 0 69 vir19
(d3+d4)/sqr(2) clus1 none
(d3+d4)/sqr(2) clus2 none
d5 (f5=vir19, g5=set14) none; f5 1 0.0 vir19 clus2 0 4.1 vir23; g5 none
(d1+d3+d4)/sqr(3) clus1 1 44.5 set19 0 55.4 vir39
(d1+d3+d4)/sqr(3) clus2 none
d5 (f5=vir23, g5=set14) none, f5 none, g5 none
d5 (f5=vir32, g5=set14) none, f5 none, g5 none
d5 (f5=vir18, g5=set14) none; f5 1 0.0 vir18 clus2 1 4.1 vir32 0 8.2 vir6; g5 none
d5 (f5=vir6, g5=set14) none, f5 none, g5 none
(d1+d2+d3+d4)/sqr(4) clus1
(d1+d2+d3+d4)/sqr(4) clus2 none
(d1+d3)/sqr(2) clus1 none
(d1+d3)/sqr(2) clus2: 0 57.3 ver49 0 58.0 ver8 0 58.7 ver44 1 60.1 ver11 0 64.3 ver10 none

Just about all the high STD columns find the subcluster split. In addition, they find the four outliers as well.
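The angle-based grid of unit vectors can be sketched as below. This follows my reading of the first formula (e1 gets only cosines, ej gets sin of its own angle times the cosines of the later angles), so treat it as an assumption rather than the definitive construction. The names unit_vector and direction_grid are hypothetical, and the grid is not deduplicated, so some directions may repeat.

```python
import math
from itertools import product


def unit_vector(thetas_deg):
    """Unit vector in n dimensions from the n-1 angles theta_2..theta_n
    (degrees):
        d_1 = cos(theta_2) * ... * cos(theta_n)
        d_j = sin(theta_j) * cos(theta_{j+1}) * ... * cos(theta_n), j >= 2
    """
    th = [math.radians(t) for t in thetas_deg]
    n = len(th) + 1
    d = []
    for j in range(1, n + 1):
        prod = 1.0
        for k in range(j + 1, n + 1):
            prod *= math.cos(th[k - 2])  # theta_k is stored at index k-2
        s = 1.0 if j == 1 else math.sin(th[j - 2])
        d.append(s * prod)
    return d


def direction_grid(n, step_deg):
    """All unit vectors whose angles run over 0, step, 2*step, ... < 180."""
    angles = []
    a = 0.0
    while a < 180.0:
        angles.append(a)
        a += step_deg
    return [unit_vector(list(combo)) for combo in product(angles, repeat=n - 1)]


grid = direction_grid(3, 45.0)
```

Every vector produced this way has length 1, since the squared components telescope: the first two sum to the product of squared cosines of the later angles, and so on up to sin² + cos² of the last angle.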
CRMSTD(dfg) using IRIS rectangle on Satlog (1805 rows of R,G,IR1,IR2 with classes {1,2,3,4,5,7}). Here I made a mistake and left MinVec, MaxVec and M as they were for IRIS (so probably far from the Satlog dataset). The results were good??? Suggests random f and g?

Points are labeled class_num.

        3_361 3_84  3_100 3_315
3_361   0.0   7.3   16.4  10.1
3_84    7.3   0.0   10.7  3.9
3_100   16.4  10.7  0.0   11.5
3_315   10.1  3.9   11.5  0.0

        5_149 5_24  5_73  5_168
5_149   0.0   7.7   4.6   8.1
5_24    7.7   0.0   7.1   14.0
5_73    4.6   7.1   0.0   7.3
5_168   8.1   14.0  7.3   0.0

        3_315 2_191 3_84  3_100 2_85  3_361
3_315   0.0   120.7 3.9   11.5  122.2 10.1
2_191   120.7 0.0   119.2 115.0 7.8   121.3
3_84    3.9   119.2 0.0   10.7  120.7 7.3
3_100   11.5  115.0 10.7  0.0   116.4 16.4
2_85    122.2 7.8   120.7 116.4 0.0   122.5
3_361   10.1  121.3 7.3   16.4  122.5 0.0

d2 STD=23.7 gp>3 val cl num 1 121 1 297 0 126 3 361 0 127 3 84 0 128 3 100 0 128 3 315
(d2+d3)/sqr2 STD=23.6 1 173.2 3 244 0 183.8 3 361 0 181.7 3 84 0 184.6 3 100 0 180.3 3 315
(d1+d2)/sqr2 STD=25.3 1 153.4 3 200 0 157.7 3 315 1 157.7 3 84 0 161.2 3 361
(d1+d4)/sqr2 STD=15.5 0 59.4 5 75 1 60.1 5 24 0 64.3 5 149... 1 142.1 3 84 0 145.7 3 361
d4 STD=20.3 gp>3 val cl num 1 29 5 75 1 33 5 24 0 37 5 73... 1 150 2 85 0 154 2 191
SQRT(x-f2)o(x-f2) STD=26.7 val cl num 1 41.6 5 75 0 45.9 5 24... 1 168.8 3 244 0 180.4 3 361 0 178.1 3 84 0 179.2 3 100 0 176.1 3 315
(d1+d3)/sqr2 STD=16.8 1 159.8 3 84 0 166.9 3 361
(d3+d4)/sqr2 STD=25.7 1 39.5 5 75 0 44.5 5 24... 1 142.5 2 119 0 146.5 2 191 0 147.5 2 85
(d2+d4)/sqr2 STD=20.4 0 40.0 5 75 1 41.0 5 24... 1 109.5 3 45 0 115.0 3 361 0 115.5 3 315 0 116.0 3 84 0 117.5 3 100 same
d3 STD=17.2 gp>3 val cl num 1 139 2 191 0 145 2 85
(d1+d2+d3+d4)/sqr4 STD=25.9 0 92.5 5 75 1 95.0 5 24 0 99.0 5 149 1 101.5 5 73 0 105.0 5 121...
1 222.0 3 244 0 226.5 3 315 0 227.0 3 100 1 229.0 3 84 0 233.0 3 361 same d1+d2+d3)/sqr3 STD=25.3 1 203.8 3 84 0 209.0 3 361 SQRT(x-g2)o(x-g2) STD=26.8 val cl num 1 15.6 5 75 0 22.5 5 149 0 22.9 5 24 0 24.1 5 73 1 26.6 5 168 0 29.6 5 121... 1 162.1 2 119 0 168.7 2 191 0 169.7 2 85 d1 STD=13.6 g>3 none SQRT(x-M)o(x-M) STD=28 val cl num 1 29.6 5 75 1 34.2 5 24 0 38.7 5 149 1 39.7 5 73 0 43.7 5 168 (d1+d2+d4)/sqr3 STD=21.9 0 67.0 5 24 1 67.5 5 75 0 72.5 5 149 sqr(x-f4)o(x-f4 STD=27.8 val cl num 1 35.6 5 75 1 39.9 5 24 0 45.3 5 149 1 45.7 5 73 0 50.8 5 168... 1 176.2 2 119 0 182.9 2 191 0 182.9 2 85 d1+d3+d4/sq3 STD=22.1 0 81.4 5 24 1 77.4 5 75 SQRT(x-f1)o(x-f1) STD=27 val cl num 0 41.1 5 24 1 41.6 5 75 0 44.9 5 149... 1 172.8 3 84 0 176.6 3 361 SQRT(x-f5)o(x-f5) STD=25 val cl num 1 147.1 3 100 0 151.7 2 85 0 152.3 2 191 Skip STD<25, same outliers: 2_85, 2_191, 3_361, 3_84, 3_100, 3_315, 5_24, 5_73, 5_75, 5_149, 5_168, SQRT(x-f3)o(x-f3) STD=27.5 val cl num 1 52.2 5 75 0 58.0 5 24 1 58.2 5 149 0 61.5 5 73 1 62.5 5 168 0 66.0 5 121... 1 188.2 3 361 0 192.0 2 191 0 193.6 2 85 SQRT(x-g4)o(x-g4) STD=27.7 val cl num 1 144.8 2 119 0 148.6 3 315 0 150.7 2 191 0 150.9 3 84 0 151.8 3 100 0 151.8 2 85 0 153.9 3 361 SQRTx-g5ox-g5 STD=27.4 val cl num 0 27.8 5 75 1 29.4 5 24 0 35.1 5 73 1 35.6 5 149 0 39.4 5 71 SQRT(x-g1)o(x-g1) STD=26.3 val cl num 1 41.6 5 75 0 45.9 5 24... 1 166.1 2 119 0 172.3 2 191 0 172.8 2 85 SQRT(x-g3)o(x-g3) STD=24.9 none
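The "skip low-STD functionals" idea can be sketched on ordinary columns as follows. This is a simplified illustration: it uses the population standard deviation from Python's statistics module on plain lists rather than residualized pTree counts, the function names are hypothetical, and the threshold of 5 is chosen only for this tiny sample. The four rows are Iris rows (SL, SW, PL, PW) that appear in the data table of this deck.

```python
import statistics


def column_stds(rows):
    """Population standard deviation of each column of a row-major table."""
    return [statistics.pstdev(col) for col in zip(*rows)]


def keep_high_std(names, rows, threshold):
    """Names of the columns whose STD is at least threshold."""
    return [name for name, std in zip(names, column_stds(rows))
            if std >= threshold]


# Four Iris rows (SL, SW, PL, PW) taken from the data table in this deck.
NAMES = ["SL", "SW", "PL", "PW"]
ROWS = [
    (51, 35, 14, 2),   # setosa
    (70, 32, 47, 14),  # versicolor
    (63, 33, 60, 25),  # virginica
    (49, 30, 14, 2),   # setosa
]

stds = column_stds(ROWS)
kept = keep_high_std(NAMES, ROWS, threshold=5)
```

On this small sample SW has by far the lowest STD and is the only column dropped. The same filter applies to any functional's output column (a projection or a square-distance column) once it has been generated.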
CRMSTD(dfg) Satlog corners on Satlog
Class Means
c1M 63.6 98.4 110.3 90.2
c2M 48.4 38.5 114.5 119.9
c3M 87.8 106.1 111.0 87.8
c4M 77.1 90.2 94.7 73.9
c5M 59.8 62.2 80.4 66.7
c7M 69.2 77.9 82.3 64.5
1=red soil, 2=cotton, 3=grey soil, 4=damp grey soil, 5=soil w stubble, 6=mixture, 7=very damp grey soil
Are classes 2 and 5 isolated from the rest (and from each other)? 2 and 5 produced the greatest number of outliers. Take f5=c2M; g5 to be other means:
Sub Cluster4 Sub Cluster2 Sub Cluster1
3 361 3 84 3 100 3 315
0.0 7.3 16.4 10.1
7.3 0.0 10.7 3.9
16.4 10.7 0.0 11.5
10.1 3.9 11.5 0.0
Sub Cluster3
d5(f5=c2M,g5=c7M) g>3 STD=26 val cl num
0 -139.9 2 85
1 -138.8 2 191
0 -134.4 2 186
0 -132.1 2 119
0 -131.7 2 224
0 -130.9 2 23
:
1 -74.5 2 200
0 -70.2 2 160
0 -68.9 2 165
0 -68.2 2 86
0 -68.1 2 194
0 -67.3 2 138
0 -67.0 2 19
1 -67.0 2 223
0 -62.9 2 60
0 -62.5 2 132
0 -59.8 5 45
:
0 -14.1 7 602
0 -14.0 7 412
0 -14.0 7 420
0 -13.9 7 306
0 -13.9 7 244
0 -13.7 5 175
0 -13.2 5 15
0 -13.1 7 562
0 -13.1 7 359
0 -13.0 7 532
0 -13.0 7 530
0 -12.9 7 414
0 -12.8 5 71
0 -12.7 5 121
0 -12.2 7 636
0 -11.4 5 144
1 -11.0 7 470
0 -8.0 5 168
0 -7.9 5 24
0 -7.9 5 73
0 -7.5 5 149
1 -4.9 5 190
0 -0.8 5 75
d2 STD=23.7 val cl num
1 121 1 297
0 126 3 361
0 127 3 84
0 128 3 100
0 128 3 315
Lots of outliers were found, but the classes did not separate as subclusters (keeping in mind that they may butt up against each other, with no gap, so that they would never appear as subclusters via gap analysis methods). Suppose we have a high quality training set for this dataset with reliably accurate class means. Next, find any class gaps that might exist by using those means as our f and g points.
(d1+d2)/sqr2 STD=25.2 none
d4 STD=20.3 val cl num
1 29 5 75
1 33 5 24
0 37 5 73
..
1 150 2 85
0 154 2 191
(d1+d3)/sqr2 STD=16.6 none
(d1+d4)/sqr2 STD=15.3 none
(d2+d3)/sqr2 STD=23.4 none
(d2+d4)/sqr2 STD=23.4 none
SubCluster1 consists of 191 class=2 samples. SubCluster3 contains every subcluster. Next, on SubCluster3 we use f5=c1M and g5=c7M.
(d3+d4)/sqr2 STD=25.3 1 68.6 5 168 0 72.1 5 121 2 160 2 165 2 86 2 194 2 138 2 19 2 223 0.0 20.6 4.6 9.9 5.8 20.9 15.4 20.6 0.0 22.4 11.6 23.3 5.0 12.2 4.6 22.4 0.0 12.6 4.1 21.7 18.6 9.9 11.6 12.6 0.0 12.8 13.0 6.9 5.8 23.3 4.1 12.8 0.0 22.9 18.2 20.9 5.0 21.7 13.0 22.9 0.0 15.4 15.4 12.2 18.6 6.9 18.2 15.4 0.0 (d1+d2+d3)/sqr3 STD=25.2 none d3 STD=17.2 val cl num 1 139 2 191 0 145 2 85 (d1+d2+d4)/sqr3 STD=21.6 none (d1+d3+d4)/sqr3 STD=21.8 none (d2+d3+d4)/sqr3 STD=25.4 none (d1+d2+d3+d4)/sqr4 STD=25.4 none d1 STD=13.6 none d2 STD=23.7 val cl num val dis(1 297) 0 118 3 242 153.3 35.128 0 118 3 73 148.4 35.707 0 118 3 343 152.3 31.144 0 118 3 263 148.4 35.707 0 118 3 155 147.4 31.796 0 118 1 36 153.5 9.2736 0 118 3 221 152.3 31.144 0 118 3 244 158.3 35.707 0 120 3 50 155.6 33.090 0 120 3 344 148.1 24.617 0 120 3 200 151.8 33.136 0 120 3 310 151.9 29.189 0 120 3 202 154.0 33.136 1 121 1 297 149.8 0 dis(2_200,2_160)=12.4 outlier f1 STD=11.8 none dis(2_60,2_132) =3.9 g1 STD=14.5 none (2_132,5_45) =33.6 outliers. f2 STD=14.9 none 5 168 5 24 5 73 5 149 5 190 5 75 0.0 14.0 7.3 8.1 16.5 15.7 14.0 0.0 7.1 7.7 26.2 8.1 7.3 7.1 0.0 4.6 19.7 11.0 8.1 7.7 4.6 0.0 22.7 10.1 16.5 26.2 19.7 22.7 0.0 27.9 15.7 8.1 11.0 10.1 27.9 0.0 g2 STD=23.6 none f3 STD=16.9 none g3 STD=12.7 val cl num 1 101.9 5 73 0 105.0 5 149 SubClus3 f5=c1M, g5=c7M. f4 STD=22.3 none d5(f5=c2M,g5=c7M) g>2 STD=68 val cl num 0 4.9 3 70 1 90.2 5 33 0 92.3 5 121 1 92.5 5 179 0 187.5 1 110 : 1 216.5 3 244 1 223.3 3 315 0 225.6 3 84 0 226.6 3 100 g4 STD=11.6 val cl num 1 42.1 2 10 0 48.0 2 143.. 1 114.9 5 168 0 119.8 5 73 g4 STD=11.6 val cl num 0 52.1 2 143 52.1 0 54.6 2 145 54.6 16.278 f5 STD=24.8 none g5 STD=27.1 none
Density: A set is T-dense iff it has no distance gaps greater than T. 10/20/12
(Equivalently, every point has a neighbor in its T-neighborhood.)
We can use L1, HOB or L∞ distance, since dis_L1(x,y) ≥ dis_L2(x,y); dis_L2(x,y) ≤ 2*dis_HOB(x,y); and dis_L2(x,y) ≤ n*dis_L∞(x,y).
Definition: Y ⊆ X is T-dense iff there does not exist y ∈ Y such that dis2(y, Y-{y}) > T.
Theorem-1: If, for every y ∈ Y, dis2(y, Y-{y}) ≤ T, then Y is T-dense.
Using L1 distance, not L2=Euclidean:
Theorem-2: dis_L1(x,y) ≥ dis_L2(x,y) (from here on we will use dis_k to mean dis_Lk). Therefore: If, for every y ∈ Y, dis1(y, Y-{y}) ≤ T, then Y is T-dense.
(Proof: dis2(y, Y-{y}) ≤ dis1(y, Y-{y}) ≤ T.)
dis2(x,y) ≤ 2*disHOB(x,y).
(Proof: Let the bit pattern of dis2(x,y) be 001 b_(k-1) ... b_0. Then disHOB(x,y) = 2^k, and the most that b_(k-1) ... b_0 can contribute is 2^k - 1 (if it is all 1-bits). So dis2(x,y) ≤ 2^k + (2^k - 1) ≤ 2*2^k = 2*disHOB(x,y).)
Theorem-3: If, for every y ∈ Y, disHOB(y, Y-{y}) ≤ T/2, then Y is T-dense.
Proof: dis2(y, Y-{y}) ≤ 2*disHOB(y, Y-{y}) ≤ 2*T/2 = T.
Theorem-4: If, for every y ∈ Y, dis∞(y, Y-{y}) ≤ T/n, then Y is T-dense.
Proof: dis2(y, Y-{y}) ≤ n*dis∞(y, Y-{y}) ≤ n*T/n = T.
Pick T' based on T and the dimension, n (it can be done!). If MaxGap(y o e_k) = MaxGap(Y_k) < T' for k=1..n, then Y is T-dense. (Recall, y o e_k is just Y_k as a column of values.)
Note: We use the log n pTreeGapFinder to avoid sorting. Unfortunately, it doesn't immediately find all gaps precisely at their full width (because it descends using power-of-2 widths), but if we find all PTreeGaps, we can be assured that MaxPTreeGap(Y) ≤ MaxGap(Y), or we can keep track of "thin gaps" and thereby actually identify all gaps (see the slide on pTreeGapFinder).
Theorem-5: If Σ_(k=1..n) MaxGap(Y_k) ≤ T, then Y is T-dense.
Proof: dis1(y,x) ≡ Σ_(k=1..n) |y_k - x_k|, and |y_k - x_k| ≤ MaxGap(Y_k), x ∈ Y. So dis2(y, Y-{y}) ≤ dis1(y, Y-{y}) ≤ Σ_(k=1..n) MaxGap(Y_k) ≤ T.
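The definition and the L1 and L∞ sufficient conditions can be checked directly on a small point set. The following is a minimal sketch, assuming points are plain Python tuples; the names `dist`, `nearest_neighbor_dist`, `is_t_dense` and `dense_by_linf` are illustrative and do not come from the slides, and it uses a brute-force scan rather than pTrees.

```python
import math

def dist(x, y, kind="l2"):
    """Distance between two points under L1, L2 or L-infinity."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if kind == "l1":
        return sum(diffs)
    if kind == "linf":
        return max(diffs)
    return math.sqrt(sum(d * d for d in diffs))

def nearest_neighbor_dist(i, Y, kind="l2"):
    """dis(y, Y-{y}) for the point at index i (brute force)."""
    return min(dist(Y[i], z, kind) for j, z in enumerate(Y) if j != i)

def is_t_dense(Y, T, kind="l2"):
    """True if every point has a neighbor within T under the chosen distance."""
    return all(nearest_neighbor_dist(i, Y, kind) <= T for i in range(len(Y)))

def dense_by_linf(Y, T):
    """Sufficient test in the style of Theorem-4: L-infinity gaps at most T/n."""
    n = len(Y[0])
    return is_t_dense(Y, T / n, kind="linf")

# Hypothetical 2-D example, not taken from the slides.
Y = [(0, 0), (1, 1), (2, 1)]
print(is_t_dense(Y, 2, kind="l1"), dense_by_linf(Y, 2), is_t_dense(Y, 2))
```

Because L1 and n times L∞ both bound L2 from above, either cheaper test passing implies the L2 test passes; the converse does not hold.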
p x y
1 6 36
2 7 39
3 8 41
4 9 34
5 9 38
6 10 42
7 12 34
8 12 38
9 13 35
10 13 40
11 19 38
12 25 38
13 22 22
14 26 16
15 26 25
16 29 11
17 31 18
18 32 26
19 34 11
20 34 23
21 35 20
22 37 10
23 37 23
24 38 13
25 38 21
26 39 24
27 40 9
28 42 9
29 38 39
30 38 42
31 39 44
32 41 41
33 41 45
34 42 39
35 42 43
36 44 43
37 45 40
No gaps (ct=0 intervals) on the furthest-to-mean line, but 3 ct=1 intervals. Declare p = p12, p16, p18 an anomaly if pofM is far enough from the boundary points of its interval?
[Figure labels: Mean, M; VOM (34, 35)]
Round 2 is straightforward. So: 1. Given gaps, find ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps?). For rounded gaps, use xox?
Note: in this example, VOM works better than the mean.
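Counting points per interval along the furthest-to-center line can be sketched as follows. This is an illustration under assumptions: the 37 points are copied from the table, the center is the vector of medians (VOM), and the interval width of 4 and the helper names are hypothetical choices that the slides do not specify.

```python
import math
from statistics import median

# The 37 (x, y) points from the table, in order p1..p37.
POINTS = [(6, 36), (7, 39), (8, 41), (9, 34), (9, 38), (10, 42), (12, 34),
          (12, 38), (13, 35), (13, 40), (19, 38), (25, 38), (22, 22), (26, 16),
          (26, 25), (29, 11), (31, 18), (32, 26), (34, 11), (34, 23), (35, 20),
          (37, 10), (37, 23), (38, 13), (38, 21), (39, 24), (40, 9), (42, 9),
          (38, 39), (38, 42), (39, 44), (41, 41), (41, 45), (42, 39), (42, 43),
          (44, 43), (45, 40)]

def vector_of_medians(pts):
    """Coordinate-wise median (VOM)."""
    return (median(x for x, _ in pts), median(y for _, y in pts))

def interval_counts(pts, center, width):
    """Project onto the line from center to the furthest point; count per interval."""
    f = max(pts, key=lambda p: math.dist(p, center))
    r = math.dist(f, center)
    u = ((f[0] - center[0]) / r, (f[1] - center[1]) / r)  # unit vector center->f
    counts = {}
    for p in pts:
        t = (p[0] - center[0]) * u[0] + (p[1] - center[1]) * u[1]
        k = math.floor(t / width)  # interval index along the line
        counts[k] = counts.get(k, 0) + 1
    return f, counts

vom = vector_of_medians(POINTS)
f, counts = interval_counts(POINTS, vom, width=4)
for k in sorted(counts):
    print(k, counts[k])  # intervals with ct=1 are anomaly candidates
```

Interval indices missing from `counts` are the ct=0 intervals (gaps); indices with a count of 1 are the ct=1 intervals discussed above.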
Using vector lengths
However, if the data happens to be shifted, as it is on the right, using lengths no longer works in this example. That is, gaps in the dot product with a fixed vector, like fM, are independent of the placement of the points with respect to the origin. Length-based gapping depends on that placement.
[Figure: plot with vertical axis 0 to 100 and horizontal axis 0 to 50. Distance from the origin is in red. Distance from (7,0) is in blue.]
A square pattern does not lend itself to rounded gap boundaries.
[Figure: square grid of points at x = 0..f (hex), y = 0..7, with two further points at y = 9.]
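The contrast between length-based and projection-based gapping can be seen on a tiny example. This is a sketch with made-up points (not the data in the figure): two pairs of points on the x-axis, then the same points shifted so that the origin sits between the pairs.

```python
import math

def project(points, d):
    """Sorted dot products of each point with the unit vector along d."""
    norm = math.sqrt(sum(c * c for c in d))
    return sorted(sum(pi * di for pi, di in zip(p, d)) / norm for p in points)

def lengths(points):
    """Sorted distances of each point from the origin."""
    return sorted(math.hypot(*p) for p in points)

def gaps(values):
    """Differences between consecutive sorted values."""
    return [round(b - a, 9) for a, b in zip(values, values[1:])]

PTS = [(1, 0), (2, 0), (6, 0), (7, 0)]      # hypothetical data
SHIFTED = [(x - 4, y) for x, y in PTS]      # same shape, moved toward the origin
D = (1, 0)                                  # fixed projection direction

print(gaps(project(PTS, D)), gaps(project(SHIFTED, D)))
print(gaps(lengths(PTS)), gaps(lengths(SHIFTED)))
```

The projection gaps are the same before and after the shift, while the large gap in the lengths disappears once the two groups sit at similar distances from the origin.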
FAUST = Fast, Accurate Unsupervised and Supervised Teaching (Teaching big data to reveal information) 6/9/12
• FAUST CLUSTER-fmg (furthest-to-mean gaps for finding round clusters): C=X (e.g., X≡{p1, ..., pf} = 15 pix dataset.)
• While an incomplete cluster, C, remains, find M ≡ Medoid(C) (Mean or Vector_of_Medians or ?).
• Pick f∈C furthest from M, using S≡SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice).
• If ct(C)/dis²(f,M) > DT (DensThresh), C is complete; else split C wherever P≡PTreeSet(cofM/|fM|) has a gap > GT (GapThresh).
• End While.
• Notes: a. Euclidean and HOBbit furthest. b. fM/|fM| and just fM in P. c. Find gaps by sorting P or by the O(log n) pTree method?
Interlocking horseshoes with an outlier
[Figure: scatter plot of p1..pf forming interlocking horseshoes, axes 1..f (hex).]
C2={p5} is complete (singleton = outlier). C3={p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1)= ~4/22=.5 > DT=.3 ?), thus C1 is complete. Applying the algorithm to C4: {pa} is an outlier. C2 splits into {p9}, {pb,pc,pd}, complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!
[Figure: scatter plot of p1..pf for the table below, axes 0..f (hex), with labels M, f, M4, M0 8.3 4.2, M1 6.3 3.5, C1, C2, C3, C4.]
f1=p3, C1 doesn't split (complete).
X x1 x2
p1 1 1
p2 3 1
p3 2 2
p4 3 3
p5 6 2
p6 9 3
p7 15 1
p8 14 2
p9 15 3
pa 13 4
pb 10 9
pc 11 10
pd 9 11
pe 11 11
pf 7 8
D(x,M0) 2.2 3.9 6.3 5.4 3.2 1.4 0.8 2.3 4.9 7.3 3.8 3.3 3.3 1.8 1.5
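The loop above can be sketched in a few lines. This is a simplified illustration, not the pTree implementation: it uses the mean as M, Euclidean distance for the furthest point, and sorting to find gaps. The function name `faust_fmg`, the recursion, and the sample data are assumptions made for the example.

```python
import math

def mean(C):
    """Coordinate-wise mean of a list of points."""
    n = len(C)
    return tuple(sum(p[i] for p in C) / n for i in range(len(C[0])))

def faust_fmg(C, DT, GT):
    """Split C at gaps > GT on the M-to-f line until every piece is dense."""
    if len(C) == 1:
        return [C]                        # singleton: complete (outlier)
    M = mean(C)
    f = max(C, key=lambda p: math.dist(p, M))
    r = math.dist(f, M)
    if r == 0 or len(C) / r ** 2 > DT:    # density = ct(C) / dis(f, M)^2
        return [C]
    u = [(fi - mi) / r for fi, mi in zip(f, M)]
    proj = sorted(((sum((pi - mi) * ui for pi, mi, ui in zip(p, M, u)), p)
                   for p in C), key=lambda t: t[0])
    parts, cur = [], [proj[0][1]]
    for (a, _), (b, p) in zip(proj, proj[1:]):
        if b - a > GT:                    # gap wide enough: start a new piece
            parts.append(cur)
            cur = []
        cur.append(p)
    parts.append(cur)
    if len(parts) == 1:
        return [C]                        # no usable gap: stop splitting
    out = []
    for part in parts:
        out.extend(faust_fmg(part, DT, GT))
    return out

# Hypothetical data: two tight groups far apart.
DATA = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(faust_fmg(DATA, DT=0.5, GT=2))
```

Here a cluster with no gap wider than GT is returned as is, which is one possible stop condition; the slides leave the choice of stop condition open.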
FAUST CLUSTER-fmg: O(log n) pTree method for finding P-gaps: P ≡ ScalarPTreeSet(cofM/|fM|)
X x1 x2
p1 1 1
p2 3 1
p3 2 2
p4 3 3
p5 6 2
p6 9 3
p7 15 1
p8 14 2
p9 15 3
pa 13 4
pb 10 9
pc 11 10
pd 9 11
pe 11 11
pf 7 8
xoUp1M 1 3 3 4 6 9 14 13 15 13 13 14 13 15 10
P3 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
P2 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
P1 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1
P0 1 1 1 0 0 1 0 1 1 1 1 0 1 1 0
D(x,M) 8 7 7 6 4 2 7 6 7 4 4 6 6 7 4
D3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
D1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0
D0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0
HOBbit furthest pt list = {p1}. Pick f=p1. dens(C) = 16/8² = 16/64 = .25
If GT=2^k then add 0,1,...,2^k-1; check all k of these down to level=2^k.
Width-8 intervals:
P3' =[0,7] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ct=5
P3 =[8,15] 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 ct=10
Width-4 intervals:
P3'&P2' =[0,3] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ct=3
P3'&P2 =[4,7] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 ct=2
P3&P2' =[8,11] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 ct=2
P3&P2 =[12,15] 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 ct=8
Width-2 intervals:
P3'&P2'&P1' =[0,1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=1
P3'&P2'&P1 =[2,3] 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ct=2
P3'&P2&P1' =[4,5] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ct=1
P3'&P2&P1 =[6,7] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ct=1
P3&P2'&P1' =[8,9] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ct=1
P3&P2'&P1 =[10,11] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ct=1
P3&P2&P1' =[12,13] 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 ct=4
P3&P2&P1 =[14,15] 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 ct=4
Single values:
P3'&P2'&P1'&P0' =0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3'&P2'&P1'&P0 =1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=1
P3'&P2'&P1&P0' =2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3'&P2'&P1&P0 =3 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ct=2
P3'&P2&P1'&P0' =4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ct=1
P3'&P2&P1'&P0 =5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3'&P2&P1&P0' =6 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ct=1
P3'&P2&P1&P0 =7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3&P2'&P1'&P0' =8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3&P2'&P1'&P0 =9 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ct=1
P3&P2'&P1&P0' =10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ct=1
P3&P2'&P1&P0 =11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3&P2&P1'&P0' =12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ct=0
P3&P2&P1'&P0 =13 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 ct=4
P3&P2&P1&P0' =14 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 ct=2
P3&P2&P1&P0 =15 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 ct=2
Gaps at each red value. Get a mask pTree for each cluster by ORing the pTrees between pairs of gaps.
Next slide: use xofM instead of xoUfM.
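The interval counts on this slide come from ANDing bit slices (or their complements) and counting 1-bits. A minimal sketch of that idea follows, assuming uncompressed slices stored as Python integers used as bitmasks; `bit_slices` and `interval_counts` are illustrative names, and the sketch enumerates every interval at a level for clarity rather than descending only where needed.

```python
VALUES = [1, 3, 3, 4, 6, 9, 14, 13, 15, 13, 13, 14, 13, 15, 10]  # xoUp1M column

def bit_slices(values, nbits):
    """slices[j] is an int whose bit i equals bit j of values[i]."""
    slices = []
    for j in range(nbits):
        mask = 0
        for i, v in enumerate(values):
            if (v >> j) & 1:
                mask |= 1 << i
        slices.append(mask)
    return slices

def interval_counts(values, nbits, level):
    """Row counts for each aligned interval of width 2**(nbits - level)."""
    full = (1 << len(values)) - 1
    slices = bit_slices(values, nbits)
    counts = {}
    for prefix in range(1 << level):
        mask = full
        for t in range(level):
            j = nbits - 1 - t                        # t-th highest slice
            bit = (prefix >> (level - 1 - t)) & 1
            mask &= slices[j] if bit else (full & ~slices[j])
        lo = prefix << (nbits - level)
        hi = lo + (1 << (nbits - level)) - 1
        counts[(lo, hi)] = bin(mask).count("1")      # ct of the AND result
    return counts

for level in range(1, 5):
    empty = [iv for iv, ct in interval_counts(VALUES, 4, level).items() if ct == 0]
    print(level, empty)
```

No aligned interval of width 2 or more is empty for this column, so the only gaps found are single values; adjacent empty values can then be merged to recover wider "thin" gaps.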
FAUST CLUSTER ffd summary
If DT=1.1 then {pa} joins {p7,p8,p9}. If DT=0.5 then also {pf} joins {pb,pc,pd,pe} and {p5} joins {p1,p2,p3,p4}.
We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically and k (# of clusters) is dynamically determined.
Improvements? Better stop condition? Is fmg better than ffd? In ffd, what if k overshoots its optimal value? Add a fusion step each round? As Mark points out, having k too large can be problematic.
The proper definition of outlier or anomaly is a huge question. An outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (give users options and advice for their use).
Md: create f = furthest pt from M, d(f,M), while creating D=SPTreeSet(d(x,M))? Or, as a separate procedure, start with P=Dh (h = high bit position), then recursively Pk <-- P & Dh-k until Pk+1=0. Then back up to Pk, take any of those points as f, and that bit pattern is d(f,M). Note that this doesn't necessarily give the furthest pt from M but gives a pt sufficiently far from M. Or use HOBbit dis? Modify to get the absolute furthest pt by jumping (when the AND gives zero) to Pk+2 and continuing the AND from there. (Dh gives a decent f, at furthest HOBbit dis.)
[Figure: scatter plot of p1..pf, axes 0..f (hex).]
centroid=mean; h=1; DT=1.5 gives 4 outliers and 3 non-outlier clusters.
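The "jump when the AND gives zero" variant of the Md procedure amounts to a top-down scan of the distance bit slices. The sketch below assumes uncompressed slices held as Python integers; `furthest_rows` is an illustrative name, and the D(x,M) column is the one from the pTree gap slide.

```python
def furthest_rows(dist_values, nbits):
    """Rows with the maximum value, found by ANDing bit slices from the top bit down."""
    n = len(dist_values)
    # slices[j]: bit i is set when bit j of dist_values[i] is 1
    slices = [sum(((v >> j) & 1) << i for i, v in enumerate(dist_values))
              for j in range(nbits)]
    P = (1 << n) - 1          # start with every row as a candidate
    value = 0
    for j in range(nbits - 1, -1, -1):
        cand = P & slices[j]
        if cand:              # some candidate has this bit set: keep only those
            P = cand
            value |= 1 << j
        # otherwise skip this slice and continue with the next lower bit
    rows = [i for i in range(n) if (P >> i) & 1]
    return rows, value

D = [8, 7, 7, 6, 4, 2, 7, 6, 7, 4, 4, 6, 6, 7, 4]   # D(x,M) for p1..pf
print(furthest_rows(D, 4))
```

Stopping at the first zero AND instead of skipping gives the cheaper "sufficiently far" point described above; skipping and continuing gives the absolute furthest rows.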
Relative gap size on f-g line for fission pt.
[Figure: scatter plot of p1..pf (linked horseshoes), axes 1..f (hex).]
Declare 2 gaps (3 clusters): C1={p1,p2,p3,p4,p5,p6,p7,p8,pe,pf}, C2={p9,pb,pd}, C3={pa} (outlier).
On C1, no gaps, so C1 has converged and is declared complete.
On C2, 1 (relative) gap, and the two subclusters are uniform, so both are complete (skipping that analysis).
Does this method also work on the first example? YES.
[Figure: scatter plot of p1..pf (first example), axes 0..f (hex).]
Declare 2 gaps (3 clusters): C1={p1,p2,p3,p4,p5}, C2={p6} (outlier), C3={p7,p8,p9,pa,pb,pc,pd,pe,pf}.
On C1, 1 gap, so declare (complete) clusters C11={p1,p2,p3,p4}, C12={p5}.
On C3, 1 gap, so declare clusters C31={p7,p8,p9,pa}, C32={pb,pc,pd,pe,pf}.
On C31, 1 gap, declare complete clusters C311={p7,p8,p9}, C312={pa}.
On C32, 1 gap, declare complete clusters C321={pf}, C322={pb,pc,pd,pe}.
FAUST CLUSTER ffd on the "Linked Horseshoe" type example:
X x1 x2
p1 8 2
p2 5 2
p3 2 4
p4 3 3
p5 6 2
p6 9 3
p7 9 4
p8 6 4
p9 13 3
pa 13 7
pb 12 5
pc 11 6
pd 10 7
pe 8 6
pf 7 5
[Figure: scatter plot of p1..pf (linked horseshoes), axes 1..f (hex).]
max dis to M0 6.13
disf=p3 6.32 3.60 0.00 1.41 4.47 7.07 7.00 4.00 11.0 11.4 10.0 9.21 8.54 6.32 5.09
disg=pa 7.07 9.43 11.4 10.7 8.60 5.65 5.00 7.61 4.00 0.00 2.23 2.23 3.00 5.09 6.32
PC1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1
dis to M1 2.94 1.17 3.39 2.29 1.34 1.11 2.52
dis to f2=3 6.32 3.61 0.00 1.41 4.47 4.00 5.10
PC11 0 0 1 1 0 0 0
dis to M12 1.89 1.72 1.08 1.08 2.09
dis to f12=f 3.16 3.61 3.16 1.41 0.00
PC12 0 0 0 1 1
dis to M2 4.24 3.65 2.98 1.42 0.86 1.15 2.42 4.14
disf2=6 0.00 1.00 4.00 5.65 3.60 3.60 4.12 3.16
disg2=a 5.65 5.00 4.00 0.00 2.23 2.23 3.00 5.09
PC21 1 1 0 0 0 0 0 1
dis to M21 4.24 3.65 4.14
disf21=e 3.16 2.24 0.00
disg21=6 0.00 1.00 3.16
PC211 0 0 1
dis to M22 6.02 2.86 1.84 0.63 0.89
disf22=9 0.00 4.00 2.24 3.61 5.00
disg22=d 5.00 3.00 2.83 1.41 0.00
PC221 1 0 1 0 0
PC222 0 1 0 1 1
dis to M222 1.70 0.75 1.37
dens(C0) = 15/6.13² < DT, incomplete. M0 8.1 4.2
dens(C1) = 7/3.39² < DT, incomplete. M1 5.3 3.1
dens(C2) = 8/4.24² < DT, incomplete. M2 12.1 5.9
dens(C21) = 3/4.14² < DT, incomplete. M21 10.6 5.1
dens(C212) = 2/.5² = 8 > DT, complete. C212 complete.
dens(C221) = 2/5 < DT, incomplete. M22 11.8 5.6
dens(C222) = 1.04 < DT, incomplete. M222 11.3 6.7
M12 6.4 3
Discuss: Here, DT=.99 (DT=1.5 gives all singletons?). We expected FAUST to fail to find the interlocked horseshoes, but hoped that, e.g., pa and p9 would be the only singletons! Can we modify it so that it doesn't make almost everything an outlier (singles, doubles)? a. Look at the upper cluster boundary (margin width)? b. Use std ratio boundaries? c. Other? d. Use a fusion step to weld the horseshoes back together.
Next slide: gaps on f-g line for fission pt.