
Sparse Dot Gaps Analysis: Advanced Methods for Cluster Identification

Explore SCP, Dot Gaps, Dot Sparse Ends, and Density Analysis techniques for detecting outliers and clusters from gaps in a dataset.

vkincer

Presentation Transcript


  1. SL Gaps (SL)
    1. Check SLp(y) ≡ (y-p)o(y-p) gaps (using a p grid).

  Dot Gaps - Sparse Ends (DG-SE)
    1. Check Dotpq(y) ≡ (y-p)o(p-q)/|p-q| gaps (using grids for p and d=(p-q)/|p-q|?).
    1.1 Check distances at sparse extremes.

  SPD Gaps (SPD)
    1. Check SPDpq(y) ≡ SLp(y) - Dotpq(y)^2 gaps.

  Dot Gaps - KMeans (DG-KM)
    1. Check Dotp,d(y) gaps (grids of p and d?).
    1.1 Check distances at sparse extremes.
    2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).

  Dot Gaps - Density Analysis (DG-DA)
    1. Check Dotp,d(y) gaps (grids of p and d?) against the density of the subcluster.
    1.1 Check distances at sparse extremes against subcluster density.
    2. Apply other methods once Dot ceases to be effective.

  Dot Gaps - Square Length (DG-SL)
    1. Check Dotp,d(y) (over a grid of p,d) and SLp(y) (over a grid of p).
    1.1 Check sparse-end distances against subcluster density. (Dotpd and SLp share construction steps!)

  SL Gaps - Dot Gaps - Square Length Gaps (SL-DG-SPD)
    (Dotpq, SLp and SPDpq share construction steps!)
    SLp(y) ≡ (y-p)o(y-p) = yoy - 2yop + pop
    Dotpq(y) ≡ (y-p)od = yod - pod, where yod = (1/|p-q|)yop - (1/|p-q|)yoq and pod is a constant.
    Calculate yoy, yop, yoq concurrently? Then do the constant multiplies 2*yop, (1/|p-q|)*yop and (1/|p-q|)*yoq concurrently. Then add or subtract. Calculate Dotpq(y)^2, then subtract it from SLp(y).
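The shared construction steps of slide 1 can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the PTreeSet implementation the slides have in mind; the function names `dot` and `sl_dot_spd` are hypothetical. It uses the slide-1 convention d = (p-q)/|p-q| and includes the constant term pod when forming Dotpq(y).

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sl_dot_spd(Y, p, q):
    """Return (SL, Dot, SPD) lists for the rows of Y, reusing yoy, yop, yoq."""
    norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))  # |p-q|
    pop, poq = dot(p, p), dot(p, q)
    pod = (pop - poq) / norm           # constant p o d, with d = (p-q)/|p-q|
    SL, Dot, SPD = [], [], []
    for y in Y:
        yoy, yop, yoq = dot(y, y), dot(y, p), dot(y, q)  # shared products
        sl = yoy - 2 * yop + pop                         # (y-p)o(y-p)
        dt = (yop - yoq) / norm - pod                    # (y-p)od
        SL.append(sl)
        Dot.append(dt)
        SPD.append(sl - dt * dt)                         # SL - Dot^2
    return SL, Dot, SPD
```

For example, with p=(1,1), q=(1,3) and y=(4,2), the three values are 10, -1 and 9: the squared length of (3,1), its component along d=(0,-1), and the squared distance from y to the line through p and q.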

  2. Dot Gaps - Sparse Ends (DG-SE) 1. Check Dotp,d(y) gaps (grids of p and d?). 1.1 Check distances at sparse ends. Analyzing the thinning at [8,9]: 7 7 7 7 7 7 7 7 7 7 7 10 10 10 10 10 10 10 10 10 10 10 10 e21i4 i8 i9 i17i24i26i27i28i38i50e2 e3 e12e5 e17e19e23e29e35e37i20i34 e21 0 9 21 15 9 6 18 5 3 9 4 7 11 7 8 6 11 9 5 7 9 11 7 i4 9 0 12 6 2 7 10 8 7 2 6 12 10 15 11 13 13 9 12 15 10 10 6 i8 21 12 0 9 11 17 4 19 18 12 18 21 15 25 19 25 22 18 22 26 17 20 16 i9 15 6 9 0 6 10 9 12 12 7 12 15 11 19 13 18 15 10 16 19 13 11 9 i17 9 2 11 6 0 7 9 8 7 1 7 11 8 15 10 14 13 9 12 15 9 11 6 i24 6 7 17 10 7 0 15 2 4 7 5 7 8 9 5 9 7 4 6 11 7 7 4 i2618 10 4 9 9 15 0 16 16 9 16 17 12 22 16 22 21 16 20 24 14 19 14 i27 5 8 19 12 8 2 16 0 2 8 5 6 8 8 5 8 7 4 5 9 7 7 4 i28 3 7 18 12 7 4 16 2 0 7 3 6 9 8 6 7 9 6 5 9 7 9 5 i38 9 2 12 7 1 7 9 8 7 0 6 10 8 14 10 13 14 9 11 14 9 11 6 i50 4 6 18 12 7 5 16 5 3 6 0 9 11 9 9 7 11 7 7 8 9 9 5 e2 7 12 21 15 11 7 17 6 6 10 9 0 6 6 4 8 10 8 5 10 4 12 7 e3 11 10 15 11 8 8 12 8 9 8 11 6 0 12 6 14 12 8 10 16 3 13 7 e12 7 15 25 19 15 9 22 8 8 14 9 6 12 0 7 4 9 9 3 6 9 11 10 e5 8 11 19 13 10 5 16 5 6 10 9 4 6 7 0 9 7 5 5 11 4 9 5 e17 6 13 25 18 14 9 22 8 7 13 7 8 14 4 9 0 10 9 4 2 11 10 9 e1911 13 22 15 13 7 21 7 9 14 11 10 12 9 7 10 0 5 7 11 10 5 9 e23 9 9 18 10 9 4 16 4 6 9 7 8 8 9 5 9 5 0 6 11 7 4 4 e29 5 12 22 16 12 6 20 5 5 11 7 5 10 3 5 4 7 6 0 6 8 9 7 e35 7 15 26 19 15 11 24 9 9 14 8 10 16 6 11 2 11 11 6 0 13 11 11 e37 9 10 17 13 9 7 14 7 7 9 9 4 3 9 4 11 10 7 8 13 0 12 6 i2011 10 20 11 11 7 19 7 9 11 9 12 13 11 9 10 5 4 9 11 12 0 7 i34 7 6 16 9 6 4 14 4 5 6 5 7 7 10 5 9 9 4 7 11 6 7 0 Actual dist from each F=7 to each F=10 is >=4. F-gap from F=6 to F=11 >=4. F-gap from F=6 to F=10>=4. 
Separate at F=8.5 to CLUS2.1<8.5 (2 ver, 43 vir) and CLUS2.2>8.5 (44 ver, 4 vir) Dot gp>=4 p=nnnn q=xxxx 0 2 3 2 4 1 5 1 7 1 9 2 10 1 11 1 12 1 13 2 14 1 15 3 16 4 17 3 18 2 19 8 20 2 21 3 22 1 23 4 24 5 25 4 26 5 27 5 28 4 29 3 30 2 31 2 32 2 33 4 34 5 36 3 37 2 38 4 40 1 42 1 43 1 44 2 45 5 47 6 48 3 49 4 50 6 51 4 52 3 53 5 54 3 55 5 56 1 57 1 58 2 59 1 60 1 Dot gp>=4 CLUS2 p=aaan q=aaax 0 3 1 3 2 8 3 2 4 6 5 5 6 5 7 11 8 2 9 4 10 12 11 8 12 13 13 5 14 3 15 7 Sparse lower end i32 i18 i19 i23 i6 i36 F i32 0 4 13 11 9 9 0 i18 4 0 12 10 8 10 0 i19 13 12 0 4 5 9 3 i23 11 10 4 0 3 7 3 i6 9 8 5 3 0 5 4 i36 9 10 9 7 5 0 5 i32andi18gap>=4 outliers Thin area: (40 44) 36 37 37 38 38 38 38 40 42 43 44 44 45 45 45 45 45 47 e4 i7 e10 e31 e32 s14 i39 s16 s19 e49 s15 e44 e11 e8 s6 s34 F e4 0 9 5 3 4 34 24 34 29 11 35 9 8 10 29 34 37 i7 9 0 8 11 12 38 30 39 35 16 40 14 13 14 34 39 37 e10 5 8 0 5 6 32 23 31 27 10 33 8 9 8 27 32 38 e31 3 11 5 0 1 32 23 31 27 9 32 7 7 8 27 31 38 e32 4 12 6 1 0 31 22 30 25 8 31 6 7 7 26 30 38 s14 34 38 32 32 31 0 25 20 17 23 18 26 28 25 16 17 38 i39 24 30 23 23 22 25 0 20 17 17 20 21 24 21 18 21 40 s16 34 39 31 31 30 20 20 0 6 26 5 29 33 29 6 4 42 s19 29 35 27 27 25 17 17 6 0 21 6 24 27 24 3 5 43 e49 11 16 10 9 8 23 17 26 21 0 26 4 7 4 21 25 44 s15 35 40 33 32 31 18 20 5 6 26 0 29 33 29 7 4 44 e44 9 14 8 7 6 26 21 29 24 4 29 0 4 1 24 28 45 e11 8 13 9 7 7 28 24 33 27 7 33 4 0 5 27 32 45 e8 10 14 8 8 7 25 21 29 24 4 29 1 5 0 23 28 45 s6 29 34 27 27 26 16 18 6 3 21 7 24 27 23 0 5 45 s34 34 39 32 31 30 17 21 4 5 25 4 28 32 28 5 0 45 Soi39,s16,s19,s49,s15are "thin area" outliers AND s14is also. Separate at 42, givingCLUS1<41 (50 Setosa, 4 Versicolor, e8,e11,e44,e49)andCLUS2>=41. 
So, two rounds of Dotpd(y) gap analysis yield CLUS1 (50 Setosa plus 4 Versicolor), CLUS2.1 (43 Virginica plus 2 Versicolor) and CLUS2.2 (44 Versicolor plus 4 Virginica), and pick out 3 Virginica and 4 Setosa as outliers. (More outliers would result from applying 1.1 to the sparse ends of the 2nd round?)

Round 1: p=nnnn (n=min) and q=xxxx (x=max). Round 2: p=aaan (a=avg) and q=aaax.

Sparse Ends analysis should accomplish the same outlier detection that a few steps of SL accomplish. If an outlier is surrounded at a fixed distance, then those neighbors will show up as sparse-end neighbors, and the outlier-ness of the point will be detected by looking at the pairwise distances within that sparse end.

Sparse upper end (pairwise distances; F in the last column):

        s23  s43   s9  s39  s42  s14    F
  s23     0    5    8    7   13    7   56
  s43     5    0    3    2    9    3   57
  s9      8    3    0    1    6    3   58
  s39     7    2    1    0    7    2   58
  s42    13    9    6    7    0    8   59
  s14     7    3    3    2    8    0   60

No gap>4 outliers.

  3. CLUS1 p=nxnn q=xnxx 0 1 2 1 4 1 6 2 9 1 10 1 11 2 12 2 13 3 14 3 15 2 16 2 17 4 18 3 19 3 20 2 21 5 22 6 23 5 24 2 25 7 26 3 27 2 28 2 29 1 30 3 31 3 32 7 33 4 34 1 35 1 36 2 37 2 39 1 41 1 42 1 43 1 DG-SE (other corners) Check Dotp,d(y) gaps>=4 Check sparse ends. Sparse low end (check [0,9] 0 2 4 6 6 9 10 i23 i6 i36 i8 i31 i3 i26 i23 0 3 7 6 7 10 10 i6 3 0 5 5 6 9 8 i36 7 5 0 7 5 7 7 i8 6 5 7 0 3 5 4 i31 7 6 5 3 0 5 5 i3 10 9 7 5 5 0 4 i26 10 8 7 4 5 4 0 i3, i26, i36 >=4 singleton outliers {i23,i6}, {i8,i31} doubleton ols Sparse low end (checking [0,7] 0 1 2 3 3 4 4 4 4 4 4 4 5 6 6 6 6 6 6 6 7 7 i1 i18i19i10i37i5 i6 i23i32i44i45i49i25i8 i15i41i21i33i29i4 i3 i16 i1 0 17 18 10 4 5 15 17 18 6 5 6 6 13 11 6 7 7 8 9 9 7 i18 17 0 12 9 18 17 8 10 4 13 15 20 15 11 27 17 14 20 20 20 13 20 i19 18 12 0 14 21 17 5 4 13 15 17 23 17 9 26 17 16 19 19 20 12 21 i10 10 9 14 0 11 10 10 12 9 6 7 13 8 10 19 9 7 13 13 14 8 12 i37 4 18 21 11 0 5 17 19 19 6 4 2 5 14 9 5 6 6 7 8 10 4 i5 5 17 17 10 5 0 14 15 17 4 5 6 4 10 10 4 5 3 3 5 6 6 i6 15 8 5 10 17 14 0 3 9 11 14 19 13 5 24 14 12 16 16 17 9 18 i23 17 10 4 12 19 15 3 0 11 13 16 21 15 6 25 16 14 17 17 18 10 20 i32 18 4 13 9 19 17 9 11 0 14 16 20 15 11 27 17 14 20 20 20 12 20 i44 6 13 15 6 6 4 11 13 14 0 3 8 3 9 13 3 2 6 7 8 4 7 i45 5 15 17 7 4 5 14 16 16 3 0 6 4 12 12 2 3 7 7 9 7 5 i49 6 20 23 13 2 6 19 21 20 8 6 0 6 16 8 6 8 7 7 7 11 3 i25 6 15 17 8 5 4 13 15 15 3 4 6 0 10 12 4 3 6 6 6 5 5 i8 13 11 9 10 14 10 5 6 11 9 12 16 10 0 20 11 9 12 12 12 5 15 i15 11 27 26 19 9 10 24 25 27 13 12 8 12 20 0 11 13 8 8 9 16 8 i41 6 17 17 9 5 4 14 16 17 3 2 6 4 11 11 0 3 5 5 7 6 4 i21 7 14 16 7 6 5 12 14 14 2 3 8 3 9 13 3 0 7 7 8 4 6 i33 7 20 19 13 6 3 16 17 20 6 7 7 6 12 8 5 7 0 1 4 8 5 i29 8 20 19 13 7 3 16 17 20 7 7 7 6 12 8 5 7 1 0 3 8 5 i4 9 20 20 14 8 5 17 18 20 8 9 7 6 12 9 7 8 4 3 0 9 7 i3 9 13 12 8 10 6 9 10 12 4 7 11 5 5 16 6 4 8 8 9 0 10 i16 7 20 21 12 4 6 18 20 20 7 5 3 5 15 8 4 6 5 5 7 10 0 i26 11 11 13 8 12 9 8 10 10 6 9 13 7 4 18 
9 7 11 10 10 4 12 i36 14 10 9 8 15 12 5 7 9 9 11 17 11 7 22 11 9 14 14 16 7 15 i38 9 19 20 13 7 5 17 18 19 8 8 6 5 12 10 7 7 5 4 2 9 5 i1, i18, i19, i10, i37, i32 >=4 outliers Dotgp>=4 p=xnnn q=nxxx 0 1 1 1 2 1 3 2 4 7 5 1 6 7 7 5 8 9 9 3 10 7 11 3 12 5 13 4 14 5 15 4 16 8 17 4 18 7 19 3 20 5 21 1 22 4 23 1 24 1 31 2 33 2 34 12 35 8 36 17 37 6 38 2 39 2 Sparse hi end (checking [34,43] 34 35 36 36 37 37 39 41 42 43 e20e31e10e32e15e30e11e44e8 e49 e20 0 2 5 3 5 4 9 9 9 10 e31 2 0 5 1 6 4 7 7 8 9 e10 5 5 0 6 5 8 9 8 8 10 e32 3 1 6 0 6 3 7 6 7 8 e15 5 6 5 6 0 4 11 9 10 9 e30 4 4 8 3 4 0 9 8 8 8 e11 9 7 9 7 11 9 0 4 5 7 e44 9 7 8 6 9 8 4 0 1 4 e8 9 8 8 7 10 8 5 1 0 4 e49 10 9 10 8 9 8 7 4 4 0 e30,e49,ei15,e11 >=4 singleton ols {e44,e8} doubleton ols gap:(24,31) CLUS1<27.5 (50 versi, 49 virg) CLUS2>27.5 (50 set, 1 virg) Sparse hi end (checking [38,39] 38 38 39 39 s42 s36 s37 s1 s42 0 10 16 21 s36 10 0 6 11 s37 16 6 0 6 s15 21 11 6 0 s37, s1 outliers Thinning (8,13) Split in middle=10.5 CLUS_1.1<10.5 (21 virg, 2 ver) CLUS_1.2>10.5 (12 virg, 42 ver) CLUS1 Dotgp>=4 p=nnnn q=xxxx 0 1 1 2 2 2 3 1 4 2 5 1 6 6 7 2 8 3 9 1 10 2 11 2 12 2 13 6 14 6 15 7 16 2 17 2 18 3 19 3 20 2 21 2 22 3 23 4 24 2 25 1 26 2 27 3 28 1 29 1 Clus1 p=nnxn q=xxnx 0 2 1 1 2 5 3 8 4 9 5 6 6 9 7 14 8 11 9 7 10 4 11 2 13 2 Thinning (7,9) Split in middle=7.5 CLUS_1.2.1 < 7.5 (10 virg, 4 ver) CLUS_1.2.2 > 7.5 ( 1 virg, 38 ver) i15 gap>=4 outlier at F=0 Sparse hi end (checking [10,13] 10 10 10 10 11 11 13 13 e34i2 i14i43e41i20i7 i35 e34 0 4 5 4 10 5 13 6 i2 4 0 3 0 10 7 11 8 i14 5 3 0 3 10 7 10 9 i43 4 0 3 0 10 7 11 8 e41 10 10 10 10 0 9 8 14 i20 5 7 7 7 9 0 13 7 i7 13 11 10 11 8 13 0 17 i35 6 8 9 8 14 7 17 0 i7, i35 >=4 singleton outliers CLUS1 Dotgp>=4 p=nnnx q=xxxn 0 1 4 1 5 3 6 5 7 4 8 3 9 6 10 7 11 3 12 4 13 8 14 4 15 4 16 3 17 8 18 5 19 3 20 1 21 1 22 3 23 1 CLUS1.2 Dotgp>=4 p=aaan q=aaax 0 1 4 4 5 3 6 3 7 4 8 1 9 5 10 7 11 3 12 5 13 3 14 6 15 1 16 4 17 1 18 1 19 2 hi end gap outlier i30 CLUS1.2.1 
Dotgp>=4 p=anaa q=axaa 0 1 1 1 2 1 4 2 6 3 7 4 9 2 CLUS1.2.1 Dotgp>=4 p=aana q=aaxa 0 5 1 2 2 3 3 2 4 1 6 1 C.2.1 0 0 0 0 1 2 3 3 4 4 5 5 6 7 i24e7 i34i47i27i28e34e36e21i50i2 i43i14i22 i24 0 7 4 2 2 4 4 9 6 5 5 5 7 7 e7 7 0 6 9 6 5 8 4 5 7 9 9 11 10 i34 4 6 0 5 4 5 3 9 7 5 6 6 8 9 i47 2 9 5 0 4 6 5 11 8 7 5 5 6 8 i27 2 6 4 4 0 2 4 7 5 5 5 5 6 6 i28 4 5 5 6 2 0 4 6 3 3 5 5 7 6 e34 4 8 3 5 4 4 0 9 6 4 4 4 5 6 e36 9 4 9 11 7 6 9 0 4 8 10 10 11 9 e21 6 5 7 8 5 3 6 4 0 4 6 6 8 5 i50 5 7 5 7 5 3 4 8 4 0 3 3 6 5 i2 5 9 6 5 5 5 4 10 6 3 0 0 3 3 i43 5 9 6 5 5 5 4 10 6 3 0 0 3 3 i14 7 11 8 6 6 7 5 11 8 6 3 3 0 3 i22 7 10 9 8 6 6 6 9 5 5 3 3 3 0 CLUS1.2.1 p=naaa q=xaaa 0 4 1 1 2 1 3 2 4 2 5 2 6 1 7 1
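The two checks used in the DG-SE rounds above (gaps in the F values, then pairwise distances at a sparse end) can be sketched as follows. This is one reading of the rule, stated as an assumption: a point is a singleton outlier when its distance to every other point of the sparse end is at least the threshold. The function names are hypothetical; the distance matrix is copied from the [0,9] sparse low end table of slide 3.

```python
def gaps(F, min_gap=4):
    """Pairs (low, high) of consecutive distinct F values at least min_gap apart."""
    vals = sorted(set(F))
    return [(a, b) for a, b in zip(vals, vals[1:]) if b - a >= min_gap]

def singleton_outliers(names, D, min_dist=4):
    """Names whose distance to every other listed point is >= min_dist."""
    n = len(names)
    return [names[i] for i in range(n)
            if all(D[i][j] >= min_dist for j in range(n) if j != i)]

# Sparse low end of CLUS1, F in [0,9] (distances as listed on slide 3).
names = ["i23", "i6", "i36", "i8", "i31", "i3", "i26"]
D = [
    [0, 3, 7, 6, 7, 10, 10],
    [3, 0, 5, 5, 6, 9, 8],
    [7, 5, 0, 7, 5, 7, 7],
    [6, 5, 7, 0, 3, 5, 4],
    [7, 6, 5, 3, 0, 5, 5],
    [10, 9, 7, 5, 5, 0, 4],
    [10, 8, 7, 4, 5, 4, 0],
]
print(singleton_outliers(names, D))
```

Under this reading the check flags i36, i3 and i26, which agrees with the singleton outliers named on the slide; the doubleton sets {i23,i6} and {i8,i31} would need a further step that this sketch does not attempt.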

  4. Check Dotp,d(y) for thinnings. Calculate the AVG of each side of a thinning as p and q, then redo.

  Dot p=nnnn q=xxxx (value:count)
  0:2 3:2 4:1 5:1 7:1 9:2 10:1 11:1 12:1 13:2 14:1 15:3 16:4 17:3 18:2 19:8 20:2 21:3 22:1 23:4 24:5 25:4 26:5 27:5 28:4 29:3 30:2 31:2 32:2 33:4 34:5 36:3 37:2 38:4 40:1 42:1 43:1 44:2 45:5 47:6 48:3 49:4 50:6 51:4 52:3 53:5 54:3 55:5 56:1 57:1 58:2 59:1 60:1

  Dot p=AVG>22 q=AVG<22 (value:count)
  0:1 1:1 2:2 3:2 4:4 5:5 6:9 7:11 8:6 9:3 10:3 11:3 19:1 23:1 24:1 25:1 26:1 29:1 30:1 31:2 32:2 34:6 35:2 36:4 37:2 38:2 39:3 40:3 41:4 42:4 43:2 44:3 45:6 46:7 47:3 48:2 49:1 50:3 52:7 54:5 55:1 56:3 57:3 58:2 59:2 61:1 62:2 64:1 66:1 67:1 68:2 70:1
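The "AVG of each side" step might look like the sketch below. It assumes the thinning has already been located at some F value (22 on the slide) and simply averages the points on each side; the name `side_means` and the list-of-tuples data layout are illustrative assumptions, not part of the original method description.

```python
def side_means(Y, F, split):
    """Return (p, q): mean vectors of the rows with F > split and with F < split."""
    hi = [y for y, f in zip(Y, F) if f > split]
    lo = [y for y, f in zip(Y, F) if f < split]
    if not hi or not lo:
        raise ValueError("split must leave points on both sides")

    def mean(rows):
        # Column-wise average of a non-empty list of vectors.
        return [sum(col) / len(rows) for col in zip(*rows)]

    return mean(hi), mean(lo)
```

The returned p and q then define the next projection line, so the second round looks along the direction joining the two tentative cluster centers rather than along a corner-to-corner diagonal.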

  5. Furthest Point or Mean Point Barrel Clustering (this method attempts to build barrel-shaped gaps around clusters).

  Projection of y onto f: ((yof)/(fof)) f. Gaps in the dot product lengths [projections] on the line give the barrel cap gap width.

  Squared y on f Projection Distance
    = (y - ((yof)/(fof))f) o (y - ((yof)/(fof))f)
    = yoy - 2(yof)^2/(fof) + (yof)^2 (fof)/(fof)^2
    = yoy - (yof)^2/(fof).
  Gaps in this projection distance give the barrel radius gap width.

  Squared y-p on q-p Projection Distance
    = (y-p)o(y-p) - ((y-p)o(q-p))^2 / ((q-p)o(q-p))
    = yoy - 2yop + pop - ( yo(q-p)/|q-p| - po(q-p)/|q-p| )^2.

  For the dot product length projections (caps) we already needed:
    (y-p)o(M-p)/|M-p| = yo(M-p)/|M-p| - po(M-p)/|M-p|.

  Barrels allow a better fit around convex clusters that are elongated in one direction (not round).

  Exhaustive search for all barrel gaps: a pseudo-exhaustive search (exhaustive modulo a grid width) takes two parameters:
    1. A StartPoint, p (an n-vector, so n-dimensional).
    2. A UnitVector, d (an n-direction, so (n-1)-dimensional: a grid on the surface of the sphere in R^n).
  Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)), two functionals are used to enclose subclusters in barrel-shaped gaps:
    a. SquareBarrelRadius functional, BR(y) = (y-p)o(y-p) - ((y-p)od)^2
    b. BarrelLength functional, BL(y) = (y-p)od

  Given a p, do we need a full grid of ds (directions)? No! d and -d give the same BL-gaps. Given d, do we need a full grid of p starting points? No! All p' such that p'=p+cd give the same gaps. Hill climb the gap width from a good starting point and direction.

  MATH: We need the dot product projection length and the dot product projection distance (in red). That is, we need to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? Minimizing PTreeSet functional creations and PTreeSet operations.
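The two barrel functionals and the grid-reduction claims (d versus -d, and p versus p+cd) can be checked numerically with a small sketch. It works on plain Python lists rather than PTreeSets, assumes d is already a unit vector, and the names `barrel` and `gap_widths` are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def barrel(Y, p, d):
    """Return (BR, BL) lists for the rows of Y; d is assumed to be a unit vector."""
    BR, BL = [], []
    for y in Y:
        v = [a - b for a, b in zip(y, p)]   # y - p
        bl = dot(v, d)                      # BarrelLength, (y-p)od
        BL.append(bl)
        BR.append(dot(v, v) - bl * bl)      # SquareBarrelRadius
    return BR, BL

def gap_widths(values):
    """Differences between consecutive sorted values."""
    s = sorted(values)
    return [b - a for a, b in zip(s, s[1:])]

Y = [(0, 0), (1, 2), (5, 1), (9, 3)]
BR, BL = barrel(Y, (0, 0), (1, 0))
BR_shift, BL_shift = barrel(Y, (3, 0), (1, 0))    # p' = p + 3d
BR_neg, BL_neg = barrel(Y, (0, 0), (-1, 0))       # direction -d
```

In this toy data, shifting p along d leaves BR unchanged and moves every BL value by the same constant, while flipping d negates BL; in both cases the gap widths are the same up to order, which is why only half the directions and one start point per line need to be searched.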

  6. Four functionals in the dot product group of gap clusterers on a VectorSpace subset, Y (y ∈ Y):

  1. SLp(y) = (y-p)o(y-p), p a fixed vector. The Square Length functional, primarily for outlier identification and densities.

  2. Dotd(y) = yod (d a unit vector), the Dot-product functional. Using d=(q-p)/|q-p| and y-p: Dotp,q(y) = (y-p)o(q-p)/|q-p|.

  3. SPDd(y) = yoy - (yod)^2 (d a unit vector), the Square Projection Distance functional. Here (yod)d is the projection of y onto d and y - (yod)d is the remainder; squaring its length gives (y-(yod)d)o(y-(yod)d) = yoy - (yod)^2.
  E.g., if d ≡ (q-p)/|q-p| (the unit vector from vector p to vector q), then
    SPD(y) = (y-p)o(y-p) - ((y-p)o(q-p))^2 / ((q-p)o(q-p)).
  But to avoid creating an entirely new VectorPTreeSet(Y-p) for the space (with origin shifted to p), we think it useful to alter the expression to
    SPDpq(y) = yoy - 2yop + pop - ( yo(q-p)/|q-p| - po(q-p)/|q-p| )^2,
  where we might compute:
    1st the constant vector (q-p)/|q-p|
    2nd the ScalarPTreeSet yo(q-p)/|q-p|
    3rd the constant po(q-p)/|q-p|
    4th the SPTreeSet yo(q-p)/|q-p| - po(q-p)/|q-p|
    5th the SPTreeSet ( yo(q-p)/|q-p| - po(q-p)/|q-p| )^2
    6th the constant pop
    7th the SPTreeSets yoy, yop
    8th the SPTreeSet yoy - 2yop + pop - ( yo(q-p)/|q-p| - po(q-p)/|q-p| )^2
  Is it better to leave all the additions and subtractions for one mega-step at the end? Other efficiency thoughts? We note that Dot(y)=yod shares many construction steps with SPD.

  4. CAd(y) = yod/|y| (d a unit vector), the Cone Angle functional. Using d=(q-p)/|q-p| and shifting the origin to p:
    CAp,q(y) = (y-p)od/|y-p|
    SCAp,q(y) = ((y-p)od)^2/|y-p|^2 = ((y-p)od)^2/((y-p)o(y-p)), the Squared Cone Angle functional.
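Of the four functionals, the Cone Angle is the one not built directly from SL and Dot, so a short sketch may help. It computes CAp,q(y) as the cosine of the angle at p between y-p and q-p, and a mask of the points inside a cosine cone. The names `cone_angle` and `in_cone`, and the choice to raise an error when y coincides with p, are assumptions made for illustration.

```python
import math

def cone_angle(y, p, q):
    """CA_{p,q}(y) = (y-p) o d / |y-p| with d = (q-p)/|q-p| (cosine of the angle at p)."""
    v = [a - b for a, b in zip(y, p)]       # y - p
    w = [a - b for a, b in zip(q, p)]       # q - p
    lv = math.sqrt(sum(a * a for a in v))
    lw = math.sqrt(sum(a * a for a in w))
    if lv == 0 or lw == 0:
        raise ValueError("y and q must both differ from p")
    return sum(a * b for a, b in zip(v, w)) / (lv * lw)

def in_cone(Y, p, q, cos_threshold):
    """True for each row of Y whose cone angle cosine is at least cos_threshold."""
    return [cone_angle(y, p, q) >= cos_threshold for y in Y]
```

Squaring the returned value gives SCAp,q(y), which avoids the square root when only the squared form is needed.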

  7. SPD p 64 29 50 17 q 61 29 45 14 e14 V Ct 2 10 3 12 4 12 5 12 6 8 7 11 8 9 9 5 10 9 11 4 12 4 13 2 14 1 17 2 18 3 19 10 20 5 21 6 22 5 23 6 24 6 25 3 27 2 29 2 30 1 SPD on CLUS1 p 50 20 35 10 e11 q 58 31 37 12 =MN V Ct 2 3 3 4 4 5 5 7 6 2 7 2 8 6 9 6 10 3 11 4 12 2 13 4 14 4 15 3 16 2 17 1 18 5 19 1 20 2 22 2 23 1 24 1 25 1 26 1 29 1 SPD p 64 29 50 17 q 61 29 45 14 e14 V Ct 1 6 2 4 3 8 4 4 5 10 6 2 7 2 8 2 9 7 10 2 11 2 12 2 13 1 15 2 17 1 18 4 19 2 20 4 22 1 24 1 25 1 26 1 29 1 31 2 32 2 33 3 37 2 i15 i36 92 1 i32 SPD p 54 22 39 10 q 70 34 51 18 V Ct 2 8 3 10 4 10 5 10 6 5 7 10 8 6 9 8 10 6 11 1 mask: V<8.5 CTs 50 0 SMs CTe 50 50 SMe CTi 50 24 SMi CLUS1 mask: V<12.5 5 SMe 24 SMi CLUS1.1 thin gap mask: 8.5<V<15.5 CTs 50 1 SMs CTe 50 0 SMe CTi 50 24 SMi CLUS2 masking V>6: Total_e 37 2 Masked_e Total_i 37 29 Masked_i However I cheated a bit. I used p=MinVect(e) and q=MaxVect(e) which makes it somewhat supervised. START OVER WITH THE FULL 150-----------------> mask: V>12.5 45 SMe 0 SMi CLUS1.2 mask: V>15.5: CTs 50 49 SMs CTe 50 0 SMe CTi 50 2 SMi This tube contains 49 setosa + 2 virginica CLUS3 CLUS1.2 is pure Versicolor (45 of the 50). CLUS3 is almost pure Setosa (49 of the 50, plus 2 virginica) CLUS2 is almost purely [1/2 of] viriginica (24 of 50, plus 1 setosa). CLUS1.1 is the other 24 virginicas, plus the other 5 versicolors. So this method clusters IRIS quite well (albeit into 4 clusters, not three). Note that caps were not put on these tubes. Also, this was NOT unsupervised clustering! I took advantage of my knowledge of the classes to carefully chose the unit vector points, p and q E.g., p = MinVector(Versicolor) and q = MaxVector(Versicolor. True, if one sequenced thru a fine enough d-grid of all unit vectors [directions], one would happen upon a unit vector closely aligned to d=q-p/|q-p| but that would be a whole lot more work that I did here (would take much longer). In worst case though, for totally unsupervised clustering. 
there would be no other way than to sequence through a grid of unit vectors. However, a good heuristic might be to try all unit vectors "corner-to-corner" and "middle-of-face-TO-middle-of-opposite-face" first, etc. Another thought would be to try to introduce some sort of hill climbing to "work our way" toward a good combination of a radial gap plus two good linear cap gaps for that radial gap.

  8. SPD on CLUS1 p 60 34 60 25 C1US1axxx q 60 28 46 15C1US1aaaa V Ct . 1 3 2 5 3 9 4 13 5 18 6 12 7 4 8 1 9 2 11 3 no thinnings SPD on CLUS1 p 69 28 60 25 C1US1xaxx q 60 28 46 15C1US1aaaa V Ct . 1 4 2 13 3 7 4 19 5 9 6 7 7 9 8 2 SPD on CLUS1 p 69 34 46 25 C1US1xxax q 60 28 46 15C1US1aaaa V Ct . 1 1 2 4 3 3 4 9 5 9 6 14 7 9 8 4 9 6 10 3 11 3 12 1 14 2 15 1 16 1 no thinnings SPD on CLUS1 p 69 34 60 15 C1US1xxxa q 60 28 46 15C1US1aaaa V Ct . 1 1 2 3 3 10 4 15 5 16 6 12 7 7 8 3 9 1 10 1 11 1 no thinnings SPD p 58 44 69 25 axxx q 58 30 37 11 aaaa V Ct . 2 1 3 5 4 6 5 6 6 8 7 6 8 8 9 15 10 7 11 8 12 13 13 8 14 14 15 9 16 13 17 6 18 4 19 4 20 3 21 4 23 1 25 1 mask: V<3.5 14 SM versi 10 SM virgi CL1.1? mask: V>3.5 0 SM setosa 32 SM versi 14 SM virgi CLUS1.2? mask: V<11.5 0 SM setosa 46 SM versicolor 24 SM virginica CLUS1 SPD on CLUS2 p 56 44 69 25 C1US2axxx q 56 32 29 9C1US2aaaa V Ct . 6 2 7 2 8 6 9 13 10 7 11 7 12 4 13 5 14 11 15 9 16 2 18 4 21 2 22 1 23 3 25 1 26 1 mask: V>11.5 50 SM setosa 4 SM versicolor 26 SM virginica CLUS2 SPD on CLUS1 p 60 34 46 25 C1US1axax q 60 28 46 15C1US1aaaa V Ct . 1 1 2 3 3 4 4 2 5 12 6 13 7 9 8 7 9 2 10 7 11 4 13 2 14 1 17 2 18 1 SPD on CLUS1 p60 28 60 25C11aaxx q60 28 46 15C11aaaa V Ct . 1 1 2 7 3 10 4 13 5 13 6 13 7 6 8 2 9 2 11 1 12 2 no thinnings SPD on CLUS1 p60 34 60 15 C1US1axxa q60 28 46 15C1US1aaaa V Ct . 1 1 2 2 3 6 4 9 5 12 6 17 7 8 8 6 9 5 10 1 11 1 12 2 no thinnings mask: V<13.5 44 SM setosa 0 SM versicolor 02 SM virginica CLUS2.1 mask: V<9.5 37 SM vers 16 SM virg CL1.1? mask: 100>V>13.5 6 SM setosa 4 SM versicolor 24 SM virginica CLUS2.2 mask: V>9.5 9 SM vers 8 SM virg CL1.2? 
SPD on CLUS1 69 28 46 25 C11xaax 60 28 46 15 C11aaaa V Ct 1 2 2 3 3 4 4 8 5 8 6 14 7 8 8 4 9 5 10 6 11 1 12 3 14 1 15 2 17 1 no thins C11axaa C11aaaa V Ct 1 2 2 2 3 2 4 10 5 3 6 13 7 8 8 7 9 4 10 3 11 6 12 2 13 2 14 2 17 2 18 1 19 1 SPD on C1 C11aaax C11aaaa V Ct 1 3 2 1 3 3 4 4 5 12 6 15 7 4 8 5 9 4 10 7 11 4 12 2 13 1 14 1 15 1 17 1 18 1 19 1 C11xaaa C11aaaa V Ct 1 2 2 4 3 5 4 9 5 10 6 9 7 5 8 6 9 2 10 6 11 3 12 1 13 2 14 2 15 2 17 2 SPD on CLUS1 69 28 46 25 C11xxaa 60 28 46 15C11aaaa V Ct 1 1 2 4 3 6 4 9 5 10 6 7 7 9 8 5 9 3 10 4 11 2 12 4 13 1 14 3 17 2 C11aaxa C11aaaa V Ct 1 2 2 3 3 6 4 12 5 11 6 9 7 11 8 5 9 5 10 1 11 3 13 2 SPD on CLUS1 69 28 60 15 C11xaxa 60 28 46 15C11aaaa V Ct 1 2 2 3 3 12 4 12 5 10 6 15 7 7 8 4 9 1 10 2 11 1 12 1 no thins mask: V<5.5 16 ver 3 virCL1.1? mask: V<5.5 26 ver 4 vir CL1.1? mask: V>5.5 30 ver 21 virCL1.1? mask: V>5.5 20 ver 20 vir CL1.1?

  9. i18 77 38 67 22 p max 79 38 69 25 V Ct 0 2 1 2 2 2 3 5 4 3 5 3 6 4 7 4 8 7 9 2 10 3 11 1 12 4 13 5 14 4 15 7 16 2 17 5 18 3 19 1 20 1 21 4 23 2 24 2 25 4 26 1 27 2 28 1 29 2 30 1 32 1 {e4, e40} form a doubleton outlier set i7 and e10 are singleton outliers 45 remaining setosa. This is SubCluster 2 (may have additional outliers or sub-subclusters but we will not analyse further (would be done in practice tho 95 remaining versicolor and virginica=SubClus1. Continue outlier id rounds on SC1 (maxSL, maxSW, max PW) then do "capped tube" (further subclusters.) i32 79 38 64 20 p max 79 44 69 25 V Ct 0 2 2 6 3 3 4 4 5 4 6 2 7 6 8 9 9 2 10 2 11 2 12 5 13 7 14 2 15 6 16 2 17 5 19 3 20 2 22 3 23 2 24 3 25 2 26 1 27 1 28 1 29 3 30 1 31 2 32 1 e32 42 1 e11 43 2 e8,44 44 1 e49 51 1 i39 60 1 61 1 62 1 63 1 64 1 65 1 66 1 67 3 68 4 69 4 70 3 71 3 72 4 73 2 74 5 75 1 76 2 77 1 78 3 79 1 80 1 s3 83 1 s9 84 2 s39,43 85 1 s42 87 1 s23 91 1 s14 2 actual gap-ouliers, checking distances reveals 4 e-outlier (versicolor), 5 s-outliers (setosa). i19 77 26 69 23 p max 79 44 69 25 V Ct 0 2 1 1 2 3 3 3 4 4 5 2 6 6 7 3 8 5 9 4 10 4 11 2 12 3 13 4 14 6 15 4 16 1 17 7 18 2 19 3 20 2 22 2 23 1 24 2 25 4 26 4 27 1 28 2 29 2 30 1 32 2 33 1 34 1 35 1 No new outliers reviealed x=s15 58 40 12 2 (58=avg(y1) ) V Ct 0 3 s15, s17, s34 1 12 s 6,11,16,19,20,22,28,32,3337,47,49 2 12 s 1,10 13,18,21,27,29,40,41,44,45,50 3 7 s 2,12,23,24,35,36,38 4 10 s 2,3,7,13,25,26,30,31,46,48 5 2 s4, s43 6 2 s9,s39 7 1 s14 8 1 i39 9 1 s32 ^^all 50 setosa + i39 14 1 e49 16 2 17 2 19 1 20 2 21 5 22 4 23 3 24 4 25 1 27 8 28 2 29 2 30 4 31 1 32 4 34 2 35 2 36 2 37 3 38 2 39 2 40 4 41 1 43 2 44 4 45 2 46 1 47 2 48 1 50 4 52 2 53 2 54 2 56 2 57 1 58 1 i1 62 1 i31 vv 9 virginica 63 1 i10 64 1 i8 66 1 i36 69 1 i32 74 1 i16 76 1 i18 77 1 i23 85 1 i19 But here I mistakenly used the mean rather than the max corner. So I will redo - but note the high level of cluster and outlier revelation????? 1. 
(y-p)o(y-p) remove edge outliers ( thr>2*50) 2. lthin gaps in SPD: d, from an edge point to MN. 3 For each thin PL, do len gap anal of pts in " tube". e13 i7 e40 e4 e10 F e13 0 14 7 6 10 28 i7 14 0 9 9 8 29 e40 7 9 0 2 4 29 e4 6 9 2 0 5 30 e10 10 8 4 5 0 32 e32 e11 e8 e44 e49 e32 0 7 7 6 8 e11 7 0 5 4 7 e8 7 5 0 1 4 e44 6 4 1 0 4 e49 8 7 4 4 0 SPD(y) =(y-p)o(y-p)-(y-p)od2 d: mn-mx V Ct Next slide i1 63 33 60 25 p max 79 38 69 25 V Ct 0 2 1 10 2 11 3 6 4 15 5 4 6 8 7 9 8 4 9 5 10 2 11 7 13 4 14 2 15 2 16 1 17 1 18 1 19 1 e30, e15 outliers e20,e31,e32 form SC12 Declared tripleton outlier set? (But they are not singleton outliers.) s3 s9 s39 s43 s42 s23 s3 0 4 4 3 9 5 s9 4 0 1 3 6 8 s39 4 1 0 2 7 7 s43 3 3 2 0 9 5 s42 9 6 7 9 0 13 s23 5 8 7 5 13 0 e13 e20 e15 e31 e32 e30 F e13 0 5 9 6 6 7 15 e20 5 0 5 2 3 4 15 e15 9 5 0 6 6 4 16 e31 6 2 6 0 1 4 17 e32 6 3 6 1 0 3 18 e30 7 4 4 4 3 0 19

  10. Cone Clustering:(finding cone-shaped clusters) x=s2 cone=.1 39 2 40 1 41 1 44 1 45 1 46 1 47 1 52 1 i39 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 59 w maxs-to-mins cone=.939 14 1 i25 16 1 i40 18 2 i16 i42 19 2 i17 i38 20 2 i11 i48 22 2 23 1 24 4 i34 i50 25 3 i24 i28 26 3 i27 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 49 1 i39 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 114 14 i and 100 s/e. So picks i as 0 w naaa-xaaa cone=.95 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 23 6 i21 24 5 25 1 27 1 28 1 29 2 30 2 i7 41/43 e so picks e Cosine cone gap (over some  angle) Gap in dot product projections onto the cornerpoints line. Corner points x=s1 cone=1/√2 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 50 x=s2 cone=1/√2 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 51 x=s2 cone=.9 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 47 w maxs cone=.707 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 137 w maxs cone=.93 8 1 i10 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 26 1 e21 e34 27 2 29 2 37 1 i7 27/29 are i's F=(y-M)o(x-M)/|x-M|-mn restricted to a cosine cone on IRIS w aaan-aaax cone=.54 7 3 i27 i28 8 1 9 3 10 12 i20 i34 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6 100/104 s or e so 0 picks i x=i1 cone=.707 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 75 x=e1 cone=.707 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 60 Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths. 
Length of the fixed vector, x-M, is a one-time calculation. Length y-M changes with y so build the PTreeSet. w maxs cone=.925 8 1 i10 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 26 1 e21 e34 27 2 28 1 29 2 31 1 e35 37 1 i7 31/34 are i's w xnnn-nxxx cone=.95 8 2 i22 i50 10 2 11 2 i28 12 4 i24 i27 i34 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 34 1 i39 43/50 e so picks out e

  11. FxM(x,y)=yo(x-M)/|x-M|-min on XX≡{(x,y)|x,yX}, where X(x,y) is a Spaeth image tableCluster by splitting at all F_gaps > 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0Level0, stride=z1 PointSet (as a pTree mask) z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15 APPENDIX X xy \y=1 2 3 4 5 6 7 8 9 a b 1 1 x=1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 The 15 Value_Arrays (one for each x) z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 x y FxM z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5  M (=MeanVector) The 15 Count_Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 gap: 10-6 gap: 5-2 pTree masks of the 3 z1_clusters (obtained by ORing) z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 The FAUST algorithm: 1. project onto each Mx line using the dot product with the unit vector from M to x. (only x=z1 is shown) 2. Generate each Value Array, F[x0]|(y), xX (also generate the Count_Arrays and the mask pTrees). 3. 
Analyze all gaps and create sub-cluster pTree Masks.
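Steps 2 and 3 of the algorithm can be illustrated with the F values listed above for x=z1. The sketch splits wherever consecutive sorted F values differ by more than 2 and returns one 0/1 mask per sub-cluster; plain lists stand in for pTree masks, and the function names are hypothetical.

```python
def gap_clusters(F, min_gap=2):
    """Group indices, splitting where consecutive sorted F values differ by more than min_gap."""
    order = sorted(range(len(F)), key=lambda i: F[i])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if F[cur] - F[prev] > min_gap:
            clusters.append([])             # a gap starts a new sub-cluster
        clusters[-1].append(cur)
    return clusters

def to_masks(clusters, n):
    """One 0/1 list per cluster, with position i set when point i belongs to it."""
    return [[1 if i in c else 0 for i in range(n)] for c in clusters]

# F values for x = z1 as listed on the slide (y = z1 .. z15).
F_z1 = [14, 12, 12, 11, 10, 6, 1, 2, 0, 2, 2, 1, 2, 0, 5]
masks = to_masks(gap_clusters(F_z1), len(F_z1))
for m in masks:
    print(m)
```

The three masks produced correspond to the z11, z12 and z13 masks on the slide, with the splits falling at the 2-to-5 and 6-to-10 gaps.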

  12. yo(z7-M)/|z7-M| ValueArrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 yo(z7-M)/|z7-M| CountArrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 gap: 6-9 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 In Step_3 of the algorithm we can: Analyze one of the gap arrays (e.g., As done for z1. Subclusters is shown above) then start over on each subcluster. Or we can analyze all gap arrays concurrently (in parallel using the same F - saving the [substantial?] re-compute costs?) and then intersect the subcluster partitions we get from each x_ValueArray gap analysis, forthe final subclustering. Here we use the second alternative, judiciously choosing only the x's that are likely to be productive (choosing z7 next). Many are likely to produce redundant partitions - e.g., z1, z2, z3, z4, z6 - as their projection lines will be nearly coincident. How should we choose the sequence of "productive" strides? 
One way would be to always choose the remaining stride with the shortest ValueArray, so that the chance of finding decent-sized gaps is maximized. Other ways of choosing?

  13. yo(x-M)/|x-M| Value Arrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean yo(x-M)/|x-M| Count Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 gap: 3-7 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 We choose zd=z13 next (Should have been first? Since it's ValueArray is shortest?) Note, z8, z9, za projection lines will be nearly coincident with that of z7.

  14. yo(x-M)/|x-M| Value Arrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean yo(x-M)/|x-M| Count Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 AND each red with each blue with each green, to get the subcluster masks (12 ANDs producing 5 sub-clusters.
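The final intersection step can be sketched directly from the masks listed on this slide. The code below ANDs one mask from each of the three partitions (z1, z7, zd) in every combination and keeps the non-empty results; lists of 0/1 stand in for pTree masks, and `intersect_partitions` is a hypothetical name.

```python
from itertools import product

# Masks as listed on the slide, one partition per projection line.
z1_masks = [
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
z7_masks = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
]
zd_masks = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
]

def intersect_partitions(*partitions):
    """AND one mask from each partition in every combination; keep non-empty results."""
    result = []
    for combo in product(*partitions):
        mask = [int(all(bits)) for bits in zip(*combo)]
        if any(mask):
            result.append(mask)
    return result

subclusters = intersect_partitions(z1_masks, z7_masks, zd_masks)
print(len(subclusters))
```

There are 3 x 2 x 2 = 12 combinations, and five of them are non-empty: {z1..z5}, {z6}, {z7..za}, {zb..ze} and {zf}.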

  15. F1(x,y) = L1Distance(x,y) = (|x1-y1|+|x2-y2|) on XX ≡ {(x,y) | x,y ∈ X}. Cluster by splitting at all F1_gaps.

      L1(x,y) ValueArray

      z1   0 2 4 5 10 13 14 15 16 17 18 19 20
      z2   0 2 3 8 11 12 13 14 15 16 17 18
      z3   0 2 3 8 11 12 13 14 15 16 17 18
      z4   0 2 3 4 6 9 11 12 13 14 15 16
      z5   0 3 5 8 9 10 11 12 13 14 15
      z6   0 5 6 7 8 9 10
      z7   0 2 5 8 11 12 13 14 15 16
      z8   0 2 3 6 9 11 12 13 14
      z9   0 2 3 6 11 12 13 14 16
      z10  0 3 5 8 9 10 11 13 15
      z11  0 2 3 4 7 8 11 12 13 15 17
      z12  0 1 2 3 6 8 9 11 13 14 15 17 19
      z13  0 2 3 5 8 11 13 14 16 18
      z14  0 1 2 3 7 9 10 12 14 15 16 18 20
      z15  0 4 5 6 7 8 9 10 11 13 15

      L1(x,y) Count Array

      z1   1 2 1 1 1 1 2 1 1 1 1 1 1
      z2   1 3 1 1 1 2 1 1 1 1 1 1
      z3   1 3 1 1 1 1 1 2 1 1 1 1
      z4   1 2 1 1 1 1 1 2 1 2 1 1
      z5   1 3 2 1 1 1 2 1 1 1 1
      z6   1 2 3 2 4 1 2
      z7   1 2 1 1 1 1 2 4 1 1
      z8   1 2 1 1 1 2 4 1 2
      z9   1 2 1 1 3 2 1 3 1
      z10  1 2 2 2 1 2 2 2 1
      z11  1 1 2 1 1 1 2 1 2 2 1
      z12  1 1 1 1 1 1 1 2 1 1 1 2 1
      z13  1 1 2 1 1 1 1 3 3 1
      z14  1 1 1 1 1 1 1 2 1 1 1 2 1
      z15  1 1 1 1 2 1 1 1 2 3 1

      [Figure: scatter plot of the same 15 points z1..zf on an x-y grid (axes labeled 1..b).]

      gap: 10-5 (redundant subclustering). There is a z1-gap, but it produces a subclustering that was already discovered by a previous round. Which z values will give new subclusterings?
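      As a rough illustration of how one row of the L1 ValueArray and Count Array arises, the sketch below computes both for a chosen point. The coordinates are an assumption: they are read from the point listing that accompanies the scatter plot, and they do reproduce the slide's rows for z1 and z6. The helper names are hypothetical.

```python
from collections import Counter

# Coordinates as read from the slide's point listing (an assumption about the
# garbled plot); keys follow the z1..z15 naming used by the value arrays.
POINTS = {
    "z1": (1, 1), "z2": (3, 1), "z3": (2, 2), "z4": (3, 3), "z5": (5, 2),
    "z6": (9, 3), "z7": (15, 1), "z8": (14, 2), "z9": (15, 3), "z10": (13, 4),
    "z11": (10, 9), "z12": (11, 10), "z13": (9, 11), "z14": (11, 11), "z15": (7, 8),
}


def l1(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def value_count_arrays(name):
    """Distinct L1 distances from POINTS[name] to every point (itself
    included), in ascending order, with how often each distance occurs."""
    counts = Counter(l1(POINTS[name], q) for q in POINTS.values())
    values = sorted(counts)
    return values, [counts[v] for v in values]


for name in ("z1", "z6"):
    values, counts = value_count_arrays(name)
    print(name, "values", values)
    print(name, "counts", counts)
```

      Splitting a row at its gaps then works exactly as for the dot-product functional; only the functional F has changed.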

  16. L1(x,y) ValueArray and Count Array: same as on slide 15 (same 15 points, same scatter plot with the mean M marked).

      This re-confirms z6 as an anomaly (outlier), since it was already declared one during the linear gap analysis. It also re-confirms zf as an anomaly.

      After subclustering with linear gap analysis, which is best for determining the larger subclusters, we run this round gap algorithm out only 2 steps to determine whether there are any single-F-value gaps > 2 (the points in the singleFvalueGapped set are then declared anomalies). That is, for each point we take only the first two values of its ValueArray and check whether the one initial gap they determine is large enough to declare the point an outlier. Doing that here, we reconfirm the outlierness of z6 and zf and find two new outliers, z5 and za.
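      The two-step check amounts to comparing each point's nearest-neighbor L1 distance (the second entry of its ValueArray, after the 0 for the point itself) against the gap threshold. A minimal sketch, assuming the coordinates read from the scatter-plot listing and a hypothetical helper `first_gap`:

```python
# Coordinates as read from the slide's point listing (an assumption).
POINTS = {
    "z1": (1, 1), "z2": (3, 1), "z3": (2, 2), "z4": (3, 3), "z5": (5, 2),
    "z6": (9, 3), "z7": (15, 1), "z8": (14, 2), "z9": (15, 3), "z10": (13, 4),
    "z11": (10, 9), "z12": (11, 10), "z13": (9, 11), "z14": (11, 11), "z15": (7, 8),
}


def first_gap(name):
    """Gap between the first two ValueArray entries of a point: 0 (the point
    itself) and the L1 distance to its nearest other point."""
    px, py = POINTS[name]
    return min(abs(px - qx) + abs(py - qy)
               for other, (qx, qy) in POINTS.items() if other != name)


# A point is declared an outlier when that single initial gap exceeds 2.
outliers = [name for name in POINTS if first_gap(name) > 2]
print(outliers)
```

      With these coordinates the flagged points are z5, z6, z10 (za) and z15 (zf), in line with the slide's conclusion.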

  17. Using F=yo(x-M)/|x-M|-MIN on IRIS, one stride at a time (s1=setosa1 first). Each table lists Val:Ct pairs; a line break marks a cut used on the slide.

      For virginica1:
        0:1 1:1 2:2 3:5 4:6 5:11 6:12 7:4 8:2 9:5 10:1
        17:1
        22:1 24:2 25:1 27:1 28:1 29:2 30:1 31:3 32:4 33:1 34:4 35:2 36:2 37:4 38:4 39:5 40:4 42:6 43:2 44:7 45:5 47:2 48:3 49:3 50:3 51:4 52:3 53:2 54:2 55:4 56:2 57:1 58:1 59:1 60:1 61:1 62:1 63:1 64:1 66:1
        F(i39)=17. F<17 (50 Setosa).

      For vers1:
        0:1 2:4 3:1 4:1 5:3 6:3 7:8 8:3 9:7 10:6 11:4 12:4 13:3 15:2
        19:2 20:2 21:1
        26:2 27:3 28:4 30:2 31:5 32:4 33:3 34:1 36:3 37:5 38:4 39:5 40:7 41:4 42:2 43:2 44:1 45:6 46:4 47:5 48:1 49:2 50:5 51:1 52:2 54:2 55:1 57:2 58:1 60:1 62:1 63:1 64:1 65:2
        F<19 (50 setosa); 19<F<22 {vers8,12,39,44,49}; 22<F.

      For s1, i.e., yo(s1-M)/|s1-M|-69:
        0:1 3:1 4:2 7:1 8:1 9:2 10:1 12:4 14:5 15:2 16:4 17:1 18:4 19:5 20:1 21:2 22:2 23:8 24:4 25:3 26:2 27:5 28:3 29:4 30:4 31:3 32:2 33:2 34:4 35:5 36:2 37:2 38:1 39:1 40:1 41:1 43:1 44:1 45:1
        52:1
        60:3 61:4 62:3 63:10 64:15 65:9 66:3 67:1 69:2
        F(i39)=52: virginica39 is an outlier. 2 clusters: F<52 (ct=99) and F>52 (50 Setosa).

      For virgini39:
        0:1 1:2 2:1 4:2 6:1 7:1 8:7 9:2 10:2 11:7 12:2 13:3 14:7 15:4 16:10 17:4 18:6 19:9 20:3 21:6 22:3 23:6 24:3 25:1 27:3 28:2
        32:1
        39:1 40:1 41:1 42:8 43:13 44:17 45:4 46:5 47:1
        F=32: vers49 outlier. 32<F (50 Setosa, vir39).

      For AVG(ver8,12,39,44,49):
        0:1 1:1 7:5 10:3 12:2 13:2 14:3 15:5 16:2 17:5 18:8 19:4 20:3 21:4 22:3 23:8 24:4 25:4 26:3 27:7 28:7 29:4 30:5 31:4 32:5 33:8 34:2 35:6 36:5 37:3 38:2 39:8 40:6 41:3 43:1 44:2 45:1 47:1
        F=0: vir32 outlier. F=1: vir18 outlier. F=7: vir6,10,19,23,36 subcluster?
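      Reading clusters off a Val/Ct table can be sketched as follows, using the virginica1 table from this slide. The cut rule (split wherever consecutive values differ by at least `min_gap`, here 5) is an assumption chosen for illustration; the slide does not state the threshold it used on IRIS.

```python
# (Val, Ct) pairs for the virginica1 stride, copied from the slide.
VIRGINICA1 = [
    (0, 1), (1, 1), (2, 2), (3, 5), (4, 6), (5, 11), (6, 12), (7, 4), (8, 2),
    (9, 5), (10, 1), (17, 1), (22, 1), (24, 2), (25, 1), (27, 1), (28, 1),
    (29, 2), (30, 1), (31, 3), (32, 4), (33, 1), (34, 4), (35, 2), (36, 2),
    (37, 4), (38, 4), (39, 5), (40, 4), (42, 6), (43, 2), (44, 7), (45, 5),
    (47, 2), (48, 3), (49, 3), (50, 3), (51, 4), (52, 3), (53, 2), (54, 2),
    (55, 4), (56, 2), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1),
    (63, 1), (64, 1), (66, 1),
]


def clusters_from_histogram(pairs, min_gap=5):
    """Return (low, high, point_count) per cluster, cutting wherever two
    consecutive values differ by at least `min_gap` (assumed threshold)."""
    result = []
    low, high, total = pairs[0][0], pairs[0][0], pairs[0][1]
    for value, count in pairs[1:]:
        if value - high >= min_gap:
            result.append((low, high, total))
            low, total = value, 0
        high = value
        total += count
    result.append((low, high, total))
    return result


summary = clusters_from_histogram(VIRGINICA1)
print(summary)
```

      Under that assumed threshold the table splits into a 50-point group (F from 0 to 10), the single point at F=17, and the remaining 99 points, which matches the slide's "F<17 (50 Setosa)" and F(i39)=17 annotations.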

  18. F=yo(x-M)/|x-M|-MIN on IRIS, subclustering as we go. Each table lists Val:Ct pairs.

      On Remaining, mx mn mx mx:
        0:3 1:4 2:11 3:14 4:14 5:9 6:10 7:2 8:6 9:2 11:2

      On Clus(F<52), ver1:
        0:1 4:1 5:5 6:3 7:5 8:3 9:8 10:11 11:14 12:8 13:8 14:5 15:3 16:7 17:5 18:6 19:2 20:1 21:1 22:1 25:1
        F(virg7)=0 outlier. F(virg32)=25 outlier.

      For s1 (i.e., yo(s1-M)/|s1-M|-69):
        0:1 3:1 4:2 7:1 8:1 9:2 10:1 12:4 14:5 15:2 16:4 17:1 18:4 19:5 20:1 21:2 22:2 23:8 24:4 25:3 26:2 27:5 28:3 29:4 30:4 31:3 32:2 33:2 34:4 35:5 36:2 37:2 38:1 39:1 40:1 41:1 43:1 44:1 45:1
        52:1 (outlier)
        60:3 61:4 62:3 63:10 64:15 65:9 66:3 67:1 69:2
        F(i39)=52: i39=virgi39 outlier. Clusters: F<52 (ct=99) and F>52 (50 Setosa).

      On Remaining, max's:
        0:2 (e8 outlier) 1:2 (e11 outlier) 7:2 8:1 9:4 10:1 11:2 12:2 13:4 14:3 15:1 16:4 17:2 18:2 19:3 20:4 21:6 22:5 23:5 24:4 25:2 26:2 27:1 28:2 29:4 30:5 31:1 32:3 33:2 34:2 35:3 36:2 37:1 38:1 39:1 (i8) 41:2 (i10 i36) 44:2 (i6 i23) 46:1 (i19) 48:1 (i18)

             i6  i8 i10 i19 i23 i35
        i6    0   5  10   5   3  20
        i8    5   0  10   9   6  15
        i10  10  10   0  14  12  19
        i19   5   9  14   0   4  22
        i23   3   6  12   4   0  20
        i35  20  15  19  22  20   0
        i6 i10 i18 i19 i23 i35 all declared outliers.

             e4 e38 e19 i20   F
        e4    0   9   9  11   9
        e38   9   0   3   7   9
        e19   9   3   0   5  11  outlier
        i20  11   7   5   0  11  outlier

      On Remaining, max's:
        0:2 (e44 outlier) 6:1 7:2 8:1 9:3 10:1 11:3 12:5 13:2 14:2 15:3 17:3 18:3 19:5 20:1 21:9 22:5 23:4 24:2 26:4 27:2 28:2 29:4 30:2 31:3 32:3 33:2 34:3 35:2 36:1 37:1 38:1 39:1 42:1 (e36 outlier?)

      On Remaining, mx mx mx mn:
        0:1 1:2 2:3 3:1 5:5 6:4 7:5 8:2 9:3 10:5 11:4 12:7 13:5 14:2 15:4 16:4 17:7 18:4 19:4 20:2 21:2 22:1 24:1 25:1 27:2 29:2

      On Remaining, mn mn mx mx:
        0:1 1:3 2:3 3:7 4:7 5:7 6:5 7:5 8:3 9:8 10:4 11:4 12:11 13:4 14:8 15:4 16:1 18:1

      On Remaining, mn mx mx mx:
        0:1 2:1 3:4 4:3 5:5 6:4 7:5 8:7 9:8 10:3 11:5 12:2 13:4 14:5 15:7 16:5 17:4 18:1 20:1

      On Remaining w e35:
        0:1 (i26 outlier) 3:2

      On remaining, vir1:
        0:1 0:1 0:1 1:2 2:1 4:1 5:1 6:2 7:2 8:2 9:4 10:1 11:4 12:3 13:4 14:2 15:6 16:4 17:6 19:4 20:5 21:5 22:2 23:1 24:2 25:5 26:4 27:4 28:1 29:2 30:6 31:2 32:1 33:1 34:1 35:2 36:1 38:1 39:1

             e35 e10
        e35    0   7
        e10    7   0  outlier

             i44  i3
        i44    0   4
        i3     4   0  ^^outlier

              i3 i30 i31 i26  i8 i36
        i3     0   5   5   4   5   7
        i30    5   0   5   3   6   9  outlier
        i31    5   5   0   5   3   5  outlier
        i26    4   3   5   0   4   7  outlier
        i8     5   6   3   4   0   7  outlier
        i36    7   9   5   7   7   0  outlier

      Rem mn mx mn mx:
        0:1 1:1 2:1 3:1 4:1 5:1 6:1 8:1 9:3 10:5 11:5 12:3 13:7 14:6 15:4 16:6 17:7 18:5 19:4 20:2 21:3 22:7 23:4 24:3 25:1 26:1 27:2 33:1 (e49 outlier)

      On Remaining, mn mx mx mx:
        0:1 1:1 2:1 3:5 4:6 5:5 6:4 7:9 8:4 9:4 10:4 11:3 12:5 13:6 14:6 15:7 16:5 17:4 18:4 20:1 22:1
        Could look at distances for 0,1 and 20,22?

             e13 e30 e32
        e13    0   7   3  outlier
        e30    7   0   6  outlier
        e32    3   6   0

             i44 i45 i49  i5 i37  i1
        i44    0   3   8   4   6   6
        i45    3   0   6   5   4   5
        i49    8   6   0   6   2   6
        i5     4   5   6   0   5   5
        i37    6   4   2   5   0   4  not outlier
        i1     6   5   6   5   4   0  outlier

  19. outliers gap>L1=32.1 s6 s14 s15 s16 s17 s19 s21 s23 s24 s32 s33 s34 s37 s42 s45 e1 e2 e3 e5 e6 e7 e9 e10 e11 e12 e13 e15 e18 e19 e21 e22 e23 e27 e28 e29 e30 e34 e36 e37 e38 e41 e49 i1 i3 i4 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i25 i26 i28 i30 i31 i32 i34 i35 i36 i37 i39 i41 i42 i45 i46 i47 i49 i50 outliers gap>L1=42.8 s15 s16 s19 s23 s33 s34 s37 s42 s45 e1 e2 e7 e10 e11 e12 e13 e15 e19 e21 e22 e23 e27 e28 e30 e34 e36 e38 e41 e49 i1 i3 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i26 i30 i31 i32 i34 i35 i36 i39 outliers gp>L1=53.5 s15 s16 s23 s33 s34 s42 e10 e13 e15 e27 e28 e30 e36 e49 i1 i3 i7 i9 i10 i12 i15 i18 i19 i20 i26 i30 i32 i35 i36 i39 F=L1(x,y) on IRIS, masking to subclusters (go right down the table). Two rounds only If we use L1gap=6, remove those outliers, then use linear gap analysis for larger subcluster revalation, let's see if we can separate Versicolor (e) from virginica (i). outliers gap>L1=64.3 s15 s16 s23 s42 e10 e13 e49 i3 i7 i9 i10 i18 i19 i20 i32 i35 i36 i39 outliers gap>L1=74.95 L1gap s42 9 e13 8 i7 10 i9 12 i10 12 i35 9 i36 9 i39 26

  20. Val=0; p=K; c=0; P=Pure1;
      For i=n to 0 {
        c=Ct(P&Pi);
        If (c>=p) {Val=Val+2^i; P=P&Pi}
        else {p=p-c; P=P&P'i}
      };
      return Val, P;

      Need Rank(n-1) applied to each stride instead of the entire pTree. The result from stride=j gives the jth entry of SpS(X,d(x,X-x)). Parallelize over a large cluster?
      Ct(P&Pi): revise the Count proc to kick out a count for each stride (involves looping down the pTree by register-lengths?). What does P represent after each step? How does the alg go on 2pDoop (w 2 level pTrees), where each stride is separate?
      Note: using d, not d^2 (fewer pTrees). Can we estimate d (using a truncated Maclaurin series)?

      XX table, one stride of 15 (x,y) pairs per x. Strides z1 and z2 are shown, ":" marks the elided strides, then ze and zf; "|" separates strides.

      IDX  z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 | z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 : ze ze ze ze ze ze ze ze ze ze ze ze ze ze ze | zf zf zf zf zf zf zf zf zf zf zf zf zf zf zf
      X1   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 : 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 | 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
      X2   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 : 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 | 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

      Repeated in every stride:
      IDY  z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf
      X3   1 3 2 3 6 9 15 14 15 13 10 11 9 11 7
      X4   1 1 2 3 2 3 1 2 3 4 9 10 11 11 8

      d(xy) 0 2 1 3 4 8 14 13 14 12 12 13 13 14 9 | 2 0 1 2 2 6 12 11 12 10 11 12 12 13 8 : 14 13 13 11 11 8 11 9 9 7 2 1 2 0 5 | 9 8 8 6 6 5 11 9 9 7 3 4 4 5 0

      P3   0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 : 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 | 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
      P'3  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 : 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 | 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1
      P2   0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 | 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 : 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 | 0 0 0 1 1 1 0 0 0 1 0 1 1 1 0
      P'2  1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 | 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 : 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 | 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1
      P1   0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 | 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 : 1 0 0 1 1 0 1 0 0 1 1 0 1 0 0 | 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0
      P'1  1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 | 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 : 0 1 1 0 0 1 0 1 1 0 0 1 0 1 1 | 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1
      P0   0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 | 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 : 0 1 1 1 1 0 1 1 1 1 0 1 0 0 1 | 1 0 0 0 0 1 1 1 1 1 1 0 0 1 0
      P'0  1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 | 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 : 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 | 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1

      Worked traces with K=14, one per stride shown:

      Stride z1:
        n=3: c=Ct(P&P3)=10 < 14, p=14-10=4; P=P&P'3 (elim 10 val>=8)
        n=2: c=Ct(P&P2)=1 < 4, p=4-1=3; P=P&P'2 (elim 1 val>=4)
        n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P'1 (elim 2 val>=2)
        n=0: c=Ct(P&P0)=2 >= 1; P=P&P0 (elim 1 val<1)
        Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1

      Stride z2:
        n=3: c=Ct(P&P3)=9 < 14, p=14-9=5; P=P&P'3 (elim 9 val>=8)
        n=2: c=Ct(P&P2)=0 < 5, p=5-0=5; P=P&P'2 (elim 0 val>=4)
        n=1: c=Ct(P&P1)=4 < 5, p=5-4=1; P=P&P'1 (elim 4 val>=2)
        n=0: c=Ct(P&P0)=1 >= 1; P=P&P0 (elim 1 val<1)
        Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1

      Stride ze:
        n=3: c=Ct(P&P3)=9 < 14, p=14-9=5; P=P&P'3 (elim 9 val>=8)
        n=2: c=Ct(P&P2)=2 < 5, p=5-2=3; P=P&P'2 (elim 2 val>=4)
        n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P'1 (elim 2 val>=2)
        n=0: c=Ct(P&P0)=2 >= 1; P=P&P0 (elim 1 val<1)
        Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1

      Stride zf:
        n=3: c=Ct(P&P3)=6 < 14, p=14-6=8; P=P&P'3 (elim 6 val>=8)
        n=2: c=Ct(P&P2)=7 < 8, p=8-7=1; P=P&P'2 (elim 7 val>=4)
        n=1: c=Ct(P&P1)=1 >= 1, p=1-1=0; P=P&P1 (elim 0 val>=2)
        n=0: c=Ct(P&P0)=1 >= 0; P=P&P0 (elim 0)
        Val = 2^3*0 + 2^2*0 + 2^1*1 + 2^0*1 = 3
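      The loop at the top of this slide can be tried out with plain Python integers standing in for pTrees. This is only a sketch under that assumption: each bit slice is an int whose bit j is set when that bit of the j-th value is 1, Count is a popcount, and the complement P'i is taken within the stride. The names `bit_slices` and `rank_k` are hypothetical.

```python
def bit_slices(values, nbits):
    """One int per bit position i: bit j is set when bit i of values[j] is 1.
    Stands in for the vertical bit-slice pTrees P0..P(nbits-1)."""
    return [sum(((v >> i) & 1) << j for j, v in enumerate(values))
            for i in range(nbits)]


def rank_k(values, k, nbits):
    """Return (Val, P): the k-th largest value, and a bitmask of the
    positions that hold it, using only ANDs, complements and counts."""
    slices = bit_slices(values, nbits)
    full = (1 << len(values)) - 1        # Pure1: every position selected
    P, p, val = full, k, 0
    for i in range(nbits - 1, -1, -1):
        c = bin(P & slices[i]).count("1")  # Ct(P & Pi)
        if c >= p:
            val += 1 << i                  # bit i of the answer is 1
            P &= slices[i]
        else:
            p -= c                         # skip the c larger candidates
            P &= full & ~slices[i]         # P & P'i
    return val, P


# d(xy) rows for strides z1 and zf, copied from the slide.
D_Z1 = [0, 2, 1, 3, 4, 8, 14, 13, 14, 12, 12, 13, 13, 14, 9]
D_ZF = [9, 8, 8, 6, 6, 5, 11, 9, 9, 7, 3, 4, 4, 5, 0]

print(rank_k(D_Z1, 14, 4))  # nearest-neighbor distance from z1 and where it occurs
print(rank_k(D_ZF, 14, 4))
```

      With K = n-1 = 14 out of 15 distances, the zero self-distance is the only smaller entry, so Val is the distance to the nearest other point: 1 for the z1 stride and 3 for the zf stride, agreeing with the Val results of the traces.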

  21. Level-1 key map Red=pure stride (so no Level-0) 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 e 2 34 0 0 0 0 f5 6 0 0 0 g7 0 0 h 0 i 8 9 a 0 0 0 0 j b c 0 0 0 k d 0 0 m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 33 32 31 30 43 42 41 40 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 e 2 3 4 f 5 6 g 7 h i 8 9 a j b c k d m 33 32 31 30 43 42 41 40 2 3 4 5 6 7 8 9 a b c d e f g h i j k m 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 Level-0: key map (6-e) = f else pur0 (6-e) = g else pur0 (6-e) = h else pur0 (6-e) = e else pur0 In this 2pDoop KEY-VALUE DB, we list keys. Should we bitmap? Each bitmap is a pTree in the KVDB. 
Each of these is existing, e.g., e here 5,7-a,f=f else pur0 5,7-a,f=g else pur0 5,7-a,f=h else pur0 234789bcefh else pr0 234789bcefg els pr0 124-79c-f h else pr0 (b-f) = j else pur0 (b-f) = k else pur0 (b-f) = i else pur0 (b-f) = m else pur0 (a) = j else pur0 (a) = k else pur0 (a) = m else pur0 p23p43 p23p42 p13p33 + p13p32 + -27( =SpS(XX, (3-6,8,9) k, els pr0 (3-6,8,9) m els pr0 + p13p31 + p23p41 p13+p23+p33+p43 +p13p12+ p23p22+ p33p32 + +p43p42 ) -26( 26( 124679bd m els pr0 p23p21 + p33p31 + p43p41 ) p13p11+ -25( p13p30 +p23p40 25( +p12p31 +p22p41 +p12p32 +p22p42 p12+p22+p32+p42 +p23p20 +p13p10+ +p33p30 +p43p40 24( -24(p12p30 +p22p40 +p32p31 +p42p41 ) +p12p11+ +p22p21 p22p20 + p32p30 + p42p40 ) p12p10+ -23(p11p31 +p11p30 +p21p41 +p21p40 23( p10+p20+p30+p40) p11+p21+p31+p41 +p11p10 + +p21p20 + +p31p30 -22(p10p30 +p20p40 +p41p40 ) 22(
