Email request for information including name, email, thesis advisor, thesis topic, current position, and reasons for requesting information for data mining thesis.
Please email the following information to william.perrizo@ndsu.edu:
1. Name and email address
2. Thesis advisor (if you have one currently)
3. Thesis topic or topic area (if you have one currently)
4. Are you currently a
   a. TA
   b. RA (whose grant funds you?)
   c. C.S. Department Paper Grader
   d. Working for an external (to NDSU) entity (which one? How many hours/week?)
The reasons for requesting this information are:
1. I want to get to know who you are, what group you're in, and your interests and situation.
2. Since my data mining technology has been patented by NDSURF and has been licensed to Treeminer Inc. of Maryland, I have to be somewhat careful with it. If you should use it in a thesis or paper, please let me know ahead of time.
3. So that problems do not arise, I ask that you trust me to decide degree of authorship etc. in all publications involving this work. Why can you trust me to be fair? I have some 250 refereed publications and don't need any more. I have done this with hundreds of students, all of whom will tell you that I have always been completely fair. It is important that attributions are correct.
This is my research group meeting. Remember that these are not lectures for a class and are not required for anyone.
First, what do I do? I data mine Big Data in human time. (Big data = zillions of rows, and sometimes also thousands of columns, which can complicate the data mining of a zillion rows.)
What is data mining? Roughly, it's CLASSIFICATION (assigning a [class] label to an unclassified row based on a training set of already classified rows). What about clustering and ARM? They are important and related! Roughly, the purpose of clustering is to create/improve a training set. Roughly, the purpose of ARM is to data mine complex data (relationships).
CLASSIFICATION is case-based reasoning. To make a decision, we tend to search our memory for similar cases (near neighbor cases) and base our current decision on those cases: we do what worked before (for us or for others). We let those near neighbor cases vote. What does near mean? How many? How near? How do they vote? These are all questions to be answered for the particular decision we are making.
"The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology. It was published in 1956 by the cognitive psychologist George A. Miller of Princeton University's Department of Psychology in Psychological Review. It is often interpreted to argue that the number of objects an average human can hold in working memory is 7 ± 2. This is referred to as Miller's Law. We can think of classification as providing a better 7 (decision support, not decision making).
Some current Data Mining research projects:
1. MapReduce FAUST: Current_Relevancy_Score=10, Killer_Idea_Score=5. MapReduce and Hadoop are key-value approaches to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. FAUST = Fast, Accurate, Unsupervised and Supervised machine Teaching.
2. pTree Text Mining: Current_Relevancy_Score=10, Killer_Idea_Score=9. I think FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture) of a corpus. Preliminary work suggests that attribute selection via simple Standard Deviations really helps (select those columns, or more generally the functionals, with the highest StDev because they are the ones with the highest potential for large gaps).
3. FAUST CLUSTER/ANOMALIZER: Current_Relevancy_Score=10, Killer_Idea_Score=10. No one has taken up the proof that this is a breakthrough method. The applications are unlimited!
4. Secure pTreeBases: Current_Relevancy_Score=9, Killer_Idea_Score=8. This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!
5. FAUST PREDICTOR/CLASSIFIER: Current_Relevancy_Score=10, Killer_Idea_Score=10. No one has done a complete analysis to show that this is a breakthrough method. The applications are unlimited here too!
6. pTree Algorithmic Tools: Current_Relevancy_Score=10, Killer_Idea_Score=10. Mohammad Hossain has expanded the algorithmic tool set to include quadratic tools and even higher degree tools, and now division is added to the arithmetic tools.
7. pTree Alternative Algorithm Impl: Current_Relevancy_Score=9, Killer_Idea_Score=8. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs): orders of magnitude performance improvement?
8. pTree O/S Infrastructure: Current_Relevancy_Score=10, Killer_Idea_Score=10. Matt Piehl finished his M.S. on this (available in the library).
Functional-Gap-based clustering methods (the unsupervised part of FAUST)
This class of partitioning or clustering methods relies on choosing a functional (a mapping of each row to a real number) which is distance dominated (i.e., the difference between two functional values, F(x) and F(y), is never more than the distance between x and y; so if we find a gap in the F-values, we know that points on opposite sides of that gap are at least as far apart as the gap width).
Here are some of the functionals that we have already used productively (in each, we can check actual pair-wise distances at the extreme ends for outliers):
Coordinate Projection (e_j): Check gaps in e_j(y) ≡ y_j.
Square Distance (SD): Check gaps in SD_p(y) ≡ (y-p)o(y-p) (parameterized over a p grid).
Dot Product Projection (DPP): Check gaps in DPP_d(y) or DPP_pq(y) ≡ (y-p)o(p-q)/|p-q| (parameterized over a grid of p values and d=(p-q)/|p-q| values?).
Square Dot Product Radius (SDPR): 1. Check gaps in SDPR_pq(y) ≡ SD_p(y) - DPP_pq(y)^2.
DPP-KM: 1. Check DPP_p,d(y) gaps (grids of p and d?). 1.1 Check distances at sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 1. Check DPP_p,d(y) gaps (grids of p and d?) against density of subcluster. 1.1 Check distances at sparse extremes against subcluster density. 2. Apply other methods once Dot ceases to be effective.
DPP-SD: 1. Check DPP_p,d(y) (over a p-grid and a d-grid) and SD_p(y) (over a p-grid). 1.1 Check sparse ends distance with subcluster density. (DPP_pd and SD_p share construction steps!)
SD-DPP-SDPR: (DPP_pq, SD_p and SDPR_pq share construction steps!)
SD_p(y) ≡ (y-p)o(y-p) = yoy - 2yop + pop
DPP_pq(y) ≡ (y-p)od = yod - pod = (1/|p-q|)yop - (1/|p-q|)yoq - pod (pod is a constant, so it shifts the F-values but does not change the gaps)
Calc yoy, yop, yoq concurrently? Then do the constant multiplies 2*yop and (1/|p-q|)*yop concurrently. Then add | subtract. Calculate DPP_pq(y)^2. Then subtract it from SD_p(y).
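The gap check for the dot product projection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses ordinary lists rather than pTrees, the toy points and the helper names (dpp, find_gaps) are hypothetical, and the direction is taken from p toward q, i.e. F(y)=(y-p)o(q-p)/|q-p|.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dpp(y, p, q):
    """(y-p)o(q-p)/|q-p|: signed length of y-p along the unit vector from p to q."""
    d = [qi - pi for pi, qi in zip(p, q)]
    w = [yi - pi for yi, pi in zip(y, p)]
    return dot(w, d) / math.sqrt(dot(d, d))

def find_gaps(values, min_width):
    """Pairs of consecutive sorted F-values that are at least min_width apart."""
    s = sorted(values)
    return [(lo, hi) for lo, hi in zip(s, s[1:]) if hi - lo >= min_width]

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0, 0), (1, 0), (1, 1), (9, 9), (10, 9), (10, 10)]
p, q = (0, 0), (10, 10)

f_values = [dpp(y, p, q) for y in points]
gaps = find_gaps(f_values, min_width=4)
print(gaps)  # one gap, between the two groups
```

Because this functional is distance dominated, any two points on opposite sides of the reported gap are at least as far apart as the gap is wide, so the gap is a safe place to split.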
Dot Product Projection (DPP)
1. Check F(y)=(y-p)o(q-p)/|q-p| gaps or thin intervals.
1.1 Check actual distances at sparse ends.
Here we apply DPP to the IRIS data set: 150 iris samples (rows) and 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). We assume we don't know ahead of time that the first 50 are the Setosa class, the next 50 are the Versicolor class and the final 50 are the Virginica class. We cluster with DPP and then see how close it comes to separating into the 3 known classes (s=setosa, e=versicolor, i=virginica).

gap>=4, p=nnnn q=xxxx
F:Count  0:1, 1:1, 2:1, 3:3, 4:1, 5:6, 6:4, 7:5, 8:7, 9:3, 10:8, 11:5, 12:1, 13:2, 14:1, 15:1, 19:1, 20:1, 21:3, 26:2, 28:1, 29:4, 30:2, 31:2, 32:2, 33:4, 34:3, 36:5, 37:2, 38:2, 39:2, 40:5, 41:6, 42:5, 43:7, 44:2, 45:1, 46:3, 47:2, 48:1, 49:5, 50:4, 51:1, 52:3, 53:2, 54:2, 55:3, 56:2, 57:1, 58:1, 59:1, 61:2, 64:2, 66:2, 68:1
Sparse lower end: checking [0,4] distances
F:    0   1   2   3   3   3   4
      s14 s42 s45 s23 s16 s43 s3
s14   0   8   14  7   20  3   5
s42   8   0   17  13  24  9   9
s45   14  17  0   11  9   11  10
s23   7   13  11  0   15  5   5
s16   20  24  9   15  0   18  16
s43   3   9   11  5   18  0   3
s3    5   9   10  5   16  3   0
s42 is revealed as an outlier because F(s42)=1 is >=4 from 5,6,... and it's >=4 from the others in [0,4].

CLUS3 (outliers removed), p=aaax q=aaan
F:Cnt  0:4, 1:2, 2:5, 3:13, 4:8, 5:12, 6:4, 7:2, 8:11, 9:5, 10:4, 11:5, 12:2, 13:7, 14:3, 15:2
Thinning=[6,7]. CLUS3.1 <6.5: 44 ver, 4 vir. CLUS3.2 >6.5: 2 ver, 39 vir. No sparse ends.

CLUS3.1, p=anxa q=axna
F:Cnt  0:2, 3:1, 5:2, 6:1, 8:2, 9:4, 10:3, 11:6, 12:6, 13:7, 14:7, 15:4, 16:3, 19:2
No thinning.
Sparse lower end: check [0,8] distances
F:    0   0   3   5   5   6   8   8
      i30 i35 i20 e34 i34 e23 e19 e27
i30   0   12  17  14  12  14  18  11
i35   12  0   7   6   6   7   12  11
i20   17  7   0   5   7   4   5   10
e34   14  6   5   0   3   4   8   9
i34   12  6   7   3   0   4   9   6
e23   14  7   4   4   4   0   5   6
e19   18  12  5   8   9   5   0   9
e27   11  11  10  9   6   6   9   0
i30, i35, i20 are outliers because F<=3 and they are >=4 from 5,6,7,8. {e34,i34} is a doubleton outlier set.
Sparse upper end: check [16,19] distances
F:    16  16  16  19  19
      e7  e32 e33 e30 e15
e7    0   17  12  16  14
e32   17  0   5   3   6
e33   12  5   0   5   4
e30   16  3   5   0   4
e15   14  6   4   4   0
e15 is an outlier.
So CLUS3.1 = 42 versicolor. CLUS3.2 = 39 virg, 2 vers (unable to separate the 2 vers from the 39 virg).

p=nnnn q=xxxx: Gaps=[15,19] and [21,26]. Check distances in [12,28] to see if s16, i39, e49, e8, e11, e44 are outliers.
F:    12  13  13  14  15  19  20  21  21  21  26  26  28
      s34 s6  s45 s19 s16 i39 e49 e8  e11 e44 e32 e30 e31
s34   0   5   8   5   4   21  25  28  32  28  30  28  31
s6    5   0   4   3   6   18  21  23  27  24  26  23  27
s45   8   4   0   6   9   18  18  21  25  21  24  22  25
s19   5   3   6   0   6   17  21  24  27  24  25  23  27
s16   4   6   9   6   0   20  26  29  33  29  30  28  31
i39   21  18  18  17  20  0   17  21  24  21  22  19  23
e49   25  21  18  21  26  17  0   4   7   4   8   8   9
e8    28  23  21  24  29  21  4   0   5   1   7   8   8
e11   32  27  25  27  33  24  7   5   0   4   7   9   7
e44   28  24  21  24  29  21  4   1   4   0   6   8   7
e32   30  26  24  25  30  22  8   7   7   6   0   3   1
e30   28  23  22  23  28  19  8   8   9   8   3   0   4
e31   31  27  25  27  31  23  9   8   7   7   1   4   0
So s16, i39, e49, e11 are outliers. {e8,e44} is a doubleton outlier set.
Separate at 17 and 23, giving:
CLUS1: F<17 (CLUS1 = 50 Setosa, with s16, s42 declared as outliers)
CLUS2: 17<F<23 (e8, e11, e44, e49, i39 are all already declared outliers)
CLUS3: 23<F (46 vers, 49 virg, with i6, i10, i18, i19, i23, i32 declared as outliers)

Sparse upper end: checking [57,68] distances
F:    57  58  59  61  61  64  64  66  66  68
      i26 i31 i8  i10 i36 i6  i23 i19 i32 i18
i26   0   5   4   8   7   8   10  13  10  11
i31   5   0   3   10  5   6   7   10  12  12
i8    4   3   0   10  7   5   6   9   11  11
i10   8   10  10  0   8   10  12  14  9   9
i36   7   5   7   8   0   5   7   9   9   10
i6    8   6   5   10  5   0   3   5   9   8
i23   10  7   6   12  7   3   0   4   11  10
i19   13  10  9   14  9   5   4   0   13  12
i32   10  12  11  9   9   9   11  13  0   4
i18   11  12  11  9   10  8   10  12  4   0
i10, i36, i19, i32, i18 are singleton outliers because their F-values are >=4 from 56 and they are >=4 from each other. {i6,i23} is a doubleton outlier set.
CLUS1 p=nxnn q=xnxx 0 1 2 1 4 1 6 2 9 1 10 1 11 2 12 2 13 3 14 3 15 2 16 2 17 4 18 3 19 3 20 2 21 5 22 6 23 5 24 2 25 7 26 3 27 2 28 2 29 1 30 3 31 3 32 7 33 4 34 1 35 1 36 2 37 2 39 1 41 1 42 1 43 1 DPP (other corners) Check Dotp,d(y) gaps>=4 Check sparse ends. Sparse low end (check [0,9] 0 2 4 6 6 9 10 i23 i6 i36 i8 i31 i3 i26 i23 0 3 7 6 7 10 10 i6 3 0 5 5 6 9 8 i36 7 5 0 7 5 7 7 i8 6 5 7 0 3 5 4 i31 7 6 5 3 0 5 5 i3 10 9 7 5 5 0 4 i26 10 8 7 4 5 4 0 i3, i26, i36 >=4 singleton outliers {i23,i6}, {i8,i31} doubleton ols Sparse low end (checking [0,7] 0 1 2 3 3 4 4 4 4 4 4 4 5 6 6 6 6 6 6 6 7 7 i1 i18i19i10i37i5 i6 i23i32i44i45i49i25i8 i15i41i21i33i29i4 i3 i16 i1 0 17 18 10 4 5 15 17 18 6 5 6 6 13 11 6 7 7 8 9 9 7 i18 17 0 12 9 18 17 8 10 4 13 15 20 15 11 27 17 14 20 20 20 13 20 i19 18 12 0 14 21 17 5 4 13 15 17 23 17 9 26 17 16 19 19 20 12 21 i10 10 9 14 0 11 10 10 12 9 6 7 13 8 10 19 9 7 13 13 14 8 12 i37 4 18 21 11 0 5 17 19 19 6 4 2 5 14 9 5 6 6 7 8 10 4 i5 5 17 17 10 5 0 14 15 17 4 5 6 4 10 10 4 5 3 3 5 6 6 i6 15 8 5 10 17 14 0 3 9 11 14 19 13 5 24 14 12 16 16 17 9 18 i23 17 10 4 12 19 15 3 0 11 13 16 21 15 6 25 16 14 17 17 18 10 20 i32 18 4 13 9 19 17 9 11 0 14 16 20 15 11 27 17 14 20 20 20 12 20 i44 6 13 15 6 6 4 11 13 14 0 3 8 3 9 13 3 2 6 7 8 4 7 i45 5 15 17 7 4 5 14 16 16 3 0 6 4 12 12 2 3 7 7 9 7 5 i49 6 20 23 13 2 6 19 21 20 8 6 0 6 16 8 6 8 7 7 7 11 3 i25 6 15 17 8 5 4 13 15 15 3 4 6 0 10 12 4 3 6 6 6 5 5 i8 13 11 9 10 14 10 5 6 11 9 12 16 10 0 20 11 9 12 12 12 5 15 i15 11 27 26 19 9 10 24 25 27 13 12 8 12 20 0 11 13 8 8 9 16 8 i41 6 17 17 9 5 4 14 16 17 3 2 6 4 11 11 0 3 5 5 7 6 4 i21 7 14 16 7 6 5 12 14 14 2 3 8 3 9 13 3 0 7 7 8 4 6 i33 7 20 19 13 6 3 16 17 20 6 7 7 6 12 8 5 7 0 1 4 8 5 i29 8 20 19 13 7 3 16 17 20 7 7 7 6 12 8 5 7 1 0 3 8 5 i4 9 20 20 14 8 5 17 18 20 8 9 7 6 12 9 7 8 4 3 0 9 7 i3 9 13 12 8 10 6 9 10 12 4 7 11 5 5 16 6 4 8 8 9 0 10 i16 7 20 21 12 4 6 18 20 20 7 5 3 5 15 8 4 6 5 5 7 10 0 i26 11 11 13 8 12 9 8 10 10 6 9 13 7 4 18 9 7 11 
10 10 4 12 i36 14 10 9 8 15 12 5 7 9 9 11 17 11 7 22 11 9 14 14 16 7 15 i38 9 19 20 13 7 5 17 18 19 8 8 6 5 12 10 7 7 5 4 2 9 5 i1, i18, i19, i10, i37, i32 >=4 outliers Dotgp>=4 p=xnnn q=nxxx 0 1 1 1 2 1 3 2 4 7 5 1 6 7 7 5 8 9 9 3 10 7 11 3 12 5 13 4 14 5 15 4 16 8 17 4 18 7 19 3 20 5 21 1 22 4 23 1 24 1 31 2 33 2 34 12 35 8 36 17 37 6 38 2 39 2 Sparse hi end (checking [34,43] 34 35 36 36 37 37 39 41 42 43 e20e31e10e32e15e30e11e44e8 e49 e20 0 2 5 3 5 4 9 9 9 10 e31 2 0 5 1 6 4 7 7 8 9 e10 5 5 0 6 5 8 9 8 8 10 e32 3 1 6 0 6 3 7 6 7 8 e15 5 6 5 6 0 4 11 9 10 9 e30 4 4 8 3 4 0 9 8 8 8 e11 9 7 9 7 11 9 0 4 5 7 e44 9 7 8 6 9 8 4 0 1 4 e8 9 8 8 7 10 8 5 1 0 4 e49 10 9 10 8 9 8 7 4 4 0 e30,e49,ei15,e11 >=4 singleton ols {e44,e8} doubleton ols gap:(24,31) CLUS1<27.5 (50 versi, 49 virg) CLUS2>27.5 (50 set, 1 virg) Sparse hi end (checking [38,39] 38 38 39 39 s42 s36 s37 s1 s42 0 10 16 21 s36 10 0 6 11 s37 16 6 0 6 s15 21 11 6 0 s37, s1 outliers Thinning (8,13) Split in middle=10.5 CLUS_1.1<10.5 (21 virg, 2 ver) CLUS_1.2>10.5 (12 virg, 42 ver) CLUS1 Dotgp>=4 p=nnnn q=xxxx 0 1 1 2 2 2 3 1 4 2 5 1 6 6 7 2 8 3 9 1 10 2 11 2 12 2 13 6 14 6 15 7 16 2 17 2 18 3 19 3 20 2 21 2 22 3 23 4 24 2 25 1 26 2 27 3 28 1 29 1 Clus1 p=nnxn q=xxnx 0 2 1 1 2 5 3 8 4 9 5 6 6 9 7 14 8 11 9 7 10 4 11 2 13 2 Thinning (7,9) Split in middle=7.5 CLUS_1.2.1 < 7.5 (10 virg, 4 ver) CLUS_1.2.2 > 7.5 ( 1 virg, 38 ver) i15 gap>=4 outlier at F=0 Sparse hi end (checking [10,13] 10 10 10 10 11 11 13 13 e34i2 i14i43e41i20i7 i35 e34 0 4 5 4 10 5 13 6 i2 4 0 3 0 10 7 11 8 i14 5 3 0 3 10 7 10 9 i43 4 0 3 0 10 7 11 8 e41 10 10 10 10 0 9 8 14 i20 5 7 7 7 9 0 13 7 i7 13 11 10 11 8 13 0 17 i35 6 8 9 8 14 7 17 0 i7, i35 >=4 singleton outliers CLUS1 Dotgp>=4 p=nnnx q=xxxn 0 1 4 1 5 3 6 5 7 4 8 3 9 6 10 7 11 3 12 4 13 8 14 4 15 4 16 3 17 8 18 5 19 3 20 1 21 1 22 3 23 1 CLUS1.2 Dotgp>=4 p=aaan q=aaax 0 1 4 4 5 3 6 3 7 4 8 1 9 5 10 7 11 3 12 5 13 3 14 6 15 1 16 4 17 1 18 1 19 2 hi end gap outlier i30 CLUS1.2.1 Dotgp>=4 
p=anaa q=axaa 0 1 1 1 2 1 4 2 6 3 7 4 9 2 CLUS1.2.1 Dotgp>=4 p=aana q=aaxa 0 5 1 2 2 3 3 2 4 1 6 1 C.2.1 0 0 0 0 1 2 3 3 4 4 5 5 6 7 i24e7 i34i47i27i28e34e36e21i50i2 i43i14i22 i24 0 7 4 2 2 4 4 9 6 5 5 5 7 7 e7 7 0 6 9 6 5 8 4 5 7 9 9 11 10 i34 4 6 0 5 4 5 3 9 7 5 6 6 8 9 i47 2 9 5 0 4 6 5 11 8 7 5 5 6 8 i27 2 6 4 4 0 2 4 7 5 5 5 5 6 6 i28 4 5 5 6 2 0 4 6 3 3 5 5 7 6 e34 4 8 3 5 4 4 0 9 6 4 4 4 5 6 e36 9 4 9 11 7 6 9 0 4 8 10 10 11 9 e21 6 5 7 8 5 3 6 4 0 4 6 6 8 5 i50 5 7 5 7 5 3 4 8 4 0 3 3 6 5 i2 5 9 6 5 5 5 4 10 6 3 0 0 3 3 i43 5 9 6 5 5 5 4 10 6 3 0 0 3 3 i14 7 11 8 6 6 7 5 11 8 6 3 3 0 3 i22 7 10 9 8 6 6 6 9 5 5 3 3 3 0 CLUS1.2.1 p=naaa q=xaaa 0 4 1 1 2 1 3 2 4 2 5 2 6 1 7 1
HILL CLIMBING GAP WIDTH
Check Dot_p,d(y) for thinnings. Use the AVG of each side of the thinning for p and q. Redo.

Dot F, p=aaan q=aaax
F:Cnt  0:3, 1:3, 2:8, 3:3, 4:6, 5:6, 6:5, 7:12, 8:2, 9:4, 10:12, 11:8, 12:13, 13:5, 14:3, 15:7, 19:1, 20:1, 21:7, 22:7, 23:28, 24:6

p=avg<12 q=avg>12
F:Cnt  0:2, 2:1, 3:2, 5:1, 6:1, 8:1, 9:1, 10:3, 11:2, 12:1, 13:4, 14:1, 15:3, 16:5, 17:2, 18:2, 19:3, 21:4, 22:1, 23:6, 24:5, 25:5, 26:4, 27:4, 28:2, 29:3, 30:3, 31:3, 33:4, 34:4, 35:2, 36:3, 37:3, 38:1, 39:1, 40:2, 44:1, 45:1, 46:2, 47:1
Inconclusive! There isn't a more prominent gap than before.

p=aaan+.005*avg<12 q=aaax+.005*avg>12
F:Cnt  0:3, 1:3, 2:8, 3:3, 4:6, 5:6, 6:5, 7:12, 8:2, 9:4, 10:12, 11:8, 12:13, 13:5, 14:3, 15:7
Here we tweak d just a little toward the means and get a more prominent gap??
Cut=8: CLUS_1.1<8 (45 Virg, 1 Vers); 8<CLUS_1.2<17 (5 Virg, 49 Vers)
Cut=9: CLUS_1.1<9 (46 Virg, 2 Vers); CLUS_1.2>9 (4 Virg, 48 Vers)
Cut=17: CLUS_1<17; CLUS_2>17 (50 Set)

These are attempts at "hill-climbing" the gaps to make them more prominent, to see if they are wider than they show up to be via the choice of F (in the case that the projection line cuts the gap at a severe angle and therefore reports a much narrower gap than actually exists). The next slide attempts to analyze "gap climbing" mathematically.

"Gap Hill Climbing": mathematical analysis
[Figure: two grid plots (both axes labeled 0 through f) of a small lettered point set, showing p, q, the projection directions d1 and d2, and the resulting d1-gap and d2-gap.]
One way to increase the size of the functional gaps is to hill climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood that gaps would be larger, since more dispersion allows for more and/or larger gaps). This is very general. We are more interested in growing the one particular gap of interest (largest gap or largest thinning). To do this we can do as follows: F-slices are hyperplanes (assuming F=dot_d), so it would make sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points that give us those values that we use to find large gaps in the first place).
The d2-gap is much larger than the d1-gap. It is still not the optimal gap though. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, weighted by the d-barrel radius, from the center of the gap, on which each point lies)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius^2)? (Zero weighting after the first gap is identical to the previous.)
Also, we really want to identify the support vector pair of the gap (the pair, one point from each side of the gap, that are closest together) and use it as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q?
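One round of the simplest variant discussed above (take the means of the two sides of the widest gap as the new p and q, then re-project) might look like the following sketch. The toy points are hypothetical and deliberately chosen so that the first projection line crosses the gap at an angle; the IRIS attempt reported above was inconclusive, so this illustrates the idea and is not evidence that it always helps. The function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def project(points, p, q):
    """F(y) = (y-p)o(q-p)/|q-p| for every y."""
    d = [qi - pi for pi, qi in zip(p, q)]
    norm = math.sqrt(dot(d, d))
    return [dot([yi - pi for yi, pi in zip(y, p)], d) / norm for y in points]

def widest_gap(values):
    """Width of the largest gap between consecutive F-values, and its midpoint."""
    s = sorted(values)
    width, lo, hi = max((b - a, a, b) for a, b in zip(s, s[1:]))
    return width, (lo + hi) / 2

def climb_once(points, p, q):
    """Project, cut at the widest gap, and return the two side means as the new p, q."""
    f = project(points, p, q)
    width, cut = widest_gap(f)
    low = [y for y, v in zip(points, f) if v < cut]
    high = [y for y, v in zip(points, f) if v > cut]
    return mean(low), mean(high), width

# Hypothetical data: two vertical strips, first projected along a slanted line.
points = [(0, 0), (0, 1), (0, 2), (4, 0), (4, 1), (4, 2)]
p0, q0 = (0, 0), (4, 2)

p1, q1, width0 = climb_once(points, p0, q0)   # gap seen along the slanted line
_, _, width1 = climb_once(points, p1, q1)     # gap seen after re-orienting d
print(width0, width1)
```

Replacing mean(low) and mean(high) with the means of only the two F-slices that bound the gap, or with the closest cross-gap pair, would give the other variants mentioned in the text.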
Furthest Point or Mean Point Barrel Clustering (This method attempts to build barrel-shaped gaps around clusters.)
[Figure: a point y and its dot product projection onto the line from p toward q (or M), with the barrel radius gap width and the barrel cap gap width marked.]
Gaps in dot product lengths [projections] on the line.
Squared y on f Projection Distance:
(y - (yof/fof)f) o (y - (yof/fof)f) = yoy - 2(yof)^2/(fof) + (yof)^2 (fof)/(fof)^2 = yoy - (yof)^2/(fof)
Squared y-p on q-p Projection Distance:
(y-p)o(y-p) - ((y-p)o(q-p))^2/((q-p)o(q-p)) = yoy - 2yop + pop - (yo(q-p)/|q-p| - po(q-p)/|q-p|)^2
For the dot product length projections (caps) we already needed:
(y-p)o(M-p)/|M-p| = yo(M-p)/|M-p| - po(M-p)/|M-p|
This allows for a better fit around convex clusters that are elongated in one direction (not round).
Exhaustive search for all barrel gaps: It takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width).
1. A StartPoint, p (an n-vector, so n dimensional).
2. A UnitVector, d (an n-direction, so n-1 dimensional: a grid on the surface of the sphere in R^n).
Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)) two functionals are used to enclose subclusters in barrel-shaped gaps:
a. SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)^2
b. BarrelLength functional, BL(y) = (y-p)od
Given a p, do we need a full grid of ds (directions)? No! d and -d give the same BL-gaps. Given d, do we need a full grid of p starting points? No! All p' s.t. p'=p+cd give the same gaps. Hill climb gap width from a good starting point and direction.
MATH: We need the dot product projection length and the dot product projection distance (in red). That is, we need to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)
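The two barrel functionals, and the two "do we need a full grid?" observations, can be checked on a single point. The sketch below uses plain tuples and hypothetical values, and the function names are assumptions; d must be a unit vector for SBR to be a squared radius.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def barrel_length(y, p, d):
    """BL(y) = (y-p)od: position of y along the barrel axis."""
    return dot(sub(y, p), d)

def square_barrel_radius(y, p, d):
    """SBR(y) = (y-p)o(y-p) - ((y-p)od)^2: squared distance of y from the axis."""
    w = sub(y, p)
    return dot(w, w) - dot(w, d) ** 2

# Hypothetical point, start point and unit direction.
y = (3, 4, 0)
p = (0, 0, 0)
d = (1, 0, 0)

bl = barrel_length(y, p, d)            # 3
sbr = square_barrel_radius(y, p, d)    # 25 - 9 = 16

# Reversing d only flips the sign of BL, so the gaps are the same.
neg_d = (-1, 0, 0)
# Sliding p along the axis (p' = p + 2d) only shifts BL by a constant.
p_shift = (2, 0, 0)
print(bl, sbr, barrel_length(y, p, neg_d), barrel_length(y, p_shift, d))
```

In both variations SBR is unchanged, which is why only half of the directions and one start point per axis line need to be visited.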
4 functionals in the dot product group of gap clusterers on a VectorSpace subset, Y (y in Y):
1. SL_p(y) = (y-p)o(y-p), p a fixed vector. The Square Length functional, primarily for outlier identification and densities.
2. Dot_d(y) = yod (d is a unit vector). The Dot-product functional. Using d=(q-p)/|q-p| and y-p: Dot_p,q(y) = (y-p)o(q-p)/|q-p|.
3. SPD_d(y) = yoy - (yod)^2 (d a unit vector). The Square Projection Distance functional. The vector y - (yod)d runs from the projection of y on the d-line to y. Squaring its length: (y-(yod)d)o(y-(yod)d) = yoy - (yod)^2.
E.g., if d ≡ (q-p)/|q-p| (d = unit vector from vector p to vector q), then
SPD(y) = (y-p)o(y-p) - ((y-p)o(q-p))^2/((q-p)o(q-p))
But to avoid creating an entirely new VectorPTreeSet(Y-p) for the space (with origin shifted to p), we think it useful to alter the expression to:
SPD_pq(y) = yoy - 2yop + pop - (yo(q-p)/|q-p| - po(q-p)/|q-p|)^2
where we might compute:
1st the constant vector (q-p)/|q-p|
2nd the ScalarPTreeSet yo(q-p)/|q-p|
3rd the constant po(q-p)/|q-p|
4th the SPTreeSet yo(q-p)/|q-p| - po(q-p)/|q-p|
5th the SPTreeSet (yo(q-p)/|q-p| - po(q-p)/|q-p|)^2
6th the constant pop
7th the SPTreeSets yoy and yop
8th the SPTreeSet yoy - 2yop + pop - (yo(q-p)/|q-p| - po(q-p)/|q-p|)^2
Is it better to leave all the additions and subtractions for one mega-step at the end? Other efficiency thoughts? We note that Dot(y)=yod shares many construction steps with SPD.
4. CA_d(y) = yod/|y| (d a unit vector). The Cone Angle functional. Using d=(q-p)/|q-p| and y=x-p: CA_p,q(y) = (y-p)od/|y-p|.
SCA_p,q(y) = ((y-p)od)^2/|y-p|^2 = ((y-p)od)^2/((y-p)o(y-p)), the Squared Cone Angle functional.
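As a sanity check on the rearranged SPD expression, the sketch below computes the Square Projection Distance both directly (with the origin shifted to p) and through the expanded form that works on the unshifted y. The step comments follow one plausible reading of the slide's eight-step ordering, the sample vectors are only illustrative, and ordinary floats stand in for PTreeSets.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def spd_direct(y, p, q):
    """(y-p)o(y-p) - ((y-p)o(q-p))^2 / ((q-p)o(q-p))."""
    w, v = sub(y, p), sub(q, p)
    return dot(w, w) - dot(w, v) ** 2 / dot(v, v)

def spd_expanded(y, p, q):
    """yoy - 2yop + pop - (you - pou)^2 with u = (q-p)/|q-p|; never forms y-p."""
    v = sub(q, p)
    norm = math.sqrt(dot(v, v))
    u = [c / norm for c in v]          # 1st: constant unit vector
    y_u = dot(y, u)                    # 2nd: scalar functional of y
    p_u = dot(p, u)                    # 3rd: constant
    diff = y_u - p_u                   # 4th
    diff_sq = diff ** 2                # 5th
    pp = dot(p, p)                     # 6th: constant
    yy, yp = dot(y, y), dot(y, p)      # 7th: scalar functionals of y
    return yy - 2 * yp + pp - diff_sq  # 8th

# Illustrative 4-column vectors (values scaled by 10, as on the slides).
y = (51, 35, 14, 2)
p = (50, 20, 35, 10)
q = (58, 31, 37, 12)

direct = spd_direct(y, p, q)
expanded = spd_expanded(y, p, q)
print(direct, expanded)
```

The two results agree up to floating point rounding, which is the property the rearrangement relies on.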
SPD p 64 29 50 17 q 61 29 45 14 e14 V Ct 2 10 3 12 4 12 5 12 6 8 7 11 8 9 9 5 10 9 11 4 12 4 13 2 14 1 17 2 18 3 19 10 20 5 21 6 22 5 23 6 24 6 25 3 27 2 29 2 30 1
SPD on CLUS1 p 50 20 35 10 e11 q 58 31 37 12 =MN V Ct 2 3 3 4 4 5 5 7 6 2 7 2 8 6 9 6 10 3 11 4 12 2 13 4 14 4 15 3 16 2 17 1 18 5 19 1 20 2 22 2 23 1 24 1 25 1 26 1 29 1
SPD p 64 29 50 17 q 61 29 45 14 e14 V Ct 1 6 2 4 3 8 4 4 5 10 6 2 7 2 8 2 9 7 10 2 11 2 12 2 13 1 15 2 17 1 18 4 19 2 20 4 22 1 24 1 25 1 26 1 29 1 31 2 32 2 33 3 37 2 i15 i36 92 1 i32
SPD p 54 22 39 10 q 70 34 51 18 V Ct 2 8 3 10 4 10 5 10 6 5 7 10 8 6 9 8 10 6 11 1
mask: V<8.5 CTs 50 0 SMs CTe 50 50 SMe CTi 50 24 SMi CLUS1 mask: V<12.5 5 SMe 24 SMi CLUS1.1 thin gap mask: 8.5<V<15.5 CTs 50 1 SMs CTe 50 0 SMe CTi 50 24 SMi CLUS2 masking V>6: Total_e 37 2 Masked_e Total_i 37 29 Masked_i
However, I cheated a bit. I used p=MinVect(e) and q=MaxVect(e), which makes it somewhat supervised. START OVER WITH THE FULL 150.
mask: V>12.5 45 SMe 0 SMi CLUS1.2 mask: V>15.5: CTs 50 49 SMs CTe 50 0 SMe CTi 50 2 SMi This tube contains 49 setosa + 2 virginica CLUS3
CLUS1.2 is pure Versicolor (45 of the 50). CLUS3 is almost pure Setosa (49 of the 50, plus 2 virginica). CLUS2 is almost purely [1/2 of] virginica (24 of 50, plus 1 setosa). CLUS1.1 is the other 24 virginicas, plus the other 5 versicolors. So this method clusters IRIS quite well (albeit into 4 clusters, not three). Note that caps were not put on these tubes.
Also, this was NOT unsupervised clustering! I took advantage of my knowledge of the classes to carefully choose the unit vector points, p and q (e.g., p = MinVector(Versicolor) and q = MaxVector(Versicolor)). True, if one sequenced through a fine enough d-grid of all unit vectors [directions], one would happen upon a unit vector closely aligned to d=(q-p)/|q-p|, but that would be a whole lot more work than I did here (it would take much longer). In the worst case though, for totally unsupervised clustering,
there would be no other way than to sequence through a grid of unit vectors. However, a good heuristic might be to try all unit vectors "corner-to-corner" and "middle-of-face-TO-middle-of-opposite-face" first, etc. Another thought would be to try to introduce some sort of hill climbing to "work our way" toward a good combination of a radial gap plus two good linear cap gaps for that radial gap.
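The "corner-to-corner" and "middle-of-face-to-middle-of-opposite-face" heuristic can be sketched as a generator of candidate (p,q) pairs from the bounding box of the data. This assumes that in the slide labels such as p=nnnn q=xxxx or p=aaan q=aaax, the letters n, x and a stand for the per-column minimum, maximum and average; that reading, the function names and the three sample rows are assumptions made for illustration.

```python
from itertools import product

def box_stats(points):
    """Per-column minimum, maximum and average."""
    cols = list(zip(*points))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    avgs = [sum(c) / len(c) for c in cols]
    return mins, maxs, avgs

def corner_to_corner(mins, maxs):
    """One (label, p, q) per box diagonal; p is a corner, q the opposite corner."""
    pairs = []
    for pattern in product("nx", repeat=len(mins)):
        if pattern[0] == "x":
            continue  # mirror of a pattern starting with n: d and -d give the same gaps
        p = [lo if c == "n" else hi for c, lo, hi in zip(pattern, mins, maxs)]
        q = [hi if c == "n" else lo for c, lo, hi in zip(pattern, mins, maxs)]
        pairs.append(("".join(pattern), p, q))
    return pairs

def face_to_face(mins, maxs, avgs):
    """One (column, p, q) per axis: middle of the min face to middle of the max face."""
    pairs = []
    for j in range(len(mins)):
        p, q = list(avgs), list(avgs)
        p[j], q[j] = mins[j], maxs[j]
        pairs.append((j, p, q))
    return pairs

# Illustrative rows with 4 columns (values scaled by 10).
points = [(51, 35, 14, 2), (70, 32, 47, 14), (63, 33, 60, 25)]
mins, maxs, avgs = box_stats(points)
corners = corner_to_corner(mins, maxs)
faces = face_to_face(mins, maxs, avgs)
print(len(corners), len(faces))
```

For 4 columns this yields 8 diagonals and 4 axis directions, a far smaller set to try first than a fine grid over all unit vectors.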
SPD on CLUS1 p 60 34 60 25 C1US1axxx q 60 28 46 15C1US1aaaa V Ct . 1 3 2 5 3 9 4 13 5 18 6 12 7 4 8 1 9 2 11 3 no thinnings SPD on CLUS1 p 69 28 60 25 C1US1xaxx q 60 28 46 15C1US1aaaa V Ct . 1 4 2 13 3 7 4 19 5 9 6 7 7 9 8 2 SPD on CLUS1 p 69 34 46 25 C1US1xxax q 60 28 46 15C1US1aaaa V Ct . 1 1 2 4 3 3 4 9 5 9 6 14 7 9 8 4 9 6 10 3 11 3 12 1 14 2 15 1 16 1 no thinnings SPD on CLUS1 p 69 34 60 15 C1US1xxxa q 60 28 46 15C1US1aaaa V Ct . 1 1 2 3 3 10 4 15 5 16 6 12 7 7 8 3 9 1 10 1 11 1 no thinnings SPD p 58 44 69 25 axxx q 58 30 37 11 aaaa V Ct . 2 1 3 5 4 6 5 6 6 8 7 6 8 8 9 15 10 7 11 8 12 13 13 8 14 14 15 9 16 13 17 6 18 4 19 4 20 3 21 4 23 1 25 1 mask: V<3.5 14 SM versi 10 SM virgi CL1.1? mask: V>3.5 0 SM setosa 32 SM versi 14 SM virgi CLUS1.2? mask: V<11.5 0 SM setosa 46 SM versicolor 24 SM virginica CLUS1 SPD on CLUS2 p 56 44 69 25 C1US2axxx q 56 32 29 9C1US2aaaa V Ct . 6 2 7 2 8 6 9 13 10 7 11 7 12 4 13 5 14 11 15 9 16 2 18 4 21 2 22 1 23 3 25 1 26 1 mask: V>11.5 50 SM setosa 4 SM versicolor 26 SM virginica CLUS2 SPD on CLUS1 p 60 34 46 25 C1US1axax q 60 28 46 15C1US1aaaa V Ct . 1 1 2 3 3 4 4 2 5 12 6 13 7 9 8 7 9 2 10 7 11 4 13 2 14 1 17 2 18 1 SPD on CLUS1 p60 28 60 25C11aaxx q60 28 46 15C11aaaa V Ct . 1 1 2 7 3 10 4 13 5 13 6 13 7 6 8 2 9 2 11 1 12 2 no thinnings SPD on CLUS1 p60 34 60 15 C1US1axxa q60 28 46 15C1US1aaaa V Ct . 1 1 2 2 3 6 4 9 5 12 6 17 7 8 8 6 9 5 10 1 11 1 12 2 no thinnings mask: V<13.5 44 SM setosa 0 SM versicolor 02 SM virginica CLUS2.1 mask: V<9.5 37 SM vers 16 SM virg CL1.1? mask: 100>V>13.5 6 SM setosa 4 SM versicolor 24 SM virginica CLUS2.2 mask: V>9.5 9 SM vers 8 SM virg CL1.2? 
SPD on CLUS1 69 28 46 25 C11xaax 60 28 46 15 C11aaaa V Ct 1 2 2 3 3 4 4 8 5 8 6 14 7 8 8 4 9 5 10 6 11 1 12 3 14 1 15 2 17 1 no thins C11axaa C11aaaa V Ct 1 2 2 2 3 2 4 10 5 3 6 13 7 8 8 7 9 4 10 3 11 6 12 2 13 2 14 2 17 2 18 1 19 1 SPD on C1 C11aaax C11aaaa V Ct 1 3 2 1 3 3 4 4 5 12 6 15 7 4 8 5 9 4 10 7 11 4 12 2 13 1 14 1 15 1 17 1 18 1 19 1 C11xaaa C11aaaa V Ct 1 2 2 4 3 5 4 9 5 10 6 9 7 5 8 6 9 2 10 6 11 3 12 1 13 2 14 2 15 2 17 2 SPD on CLUS1 69 28 46 25 C11xxaa 60 28 46 15C11aaaa V Ct 1 1 2 4 3 6 4 9 5 10 6 7 7 9 8 5 9 3 10 4 11 2 12 4 13 1 14 3 17 2 C11aaxa C11aaaa V Ct 1 2 2 3 3 6 4 12 5 11 6 9 7 11 8 5 9 5 10 1 11 3 13 2 SPD on CLUS1 69 28 60 15 C11xaxa 60 28 46 15C11aaaa V Ct 1 2 2 3 3 12 4 12 5 10 6 15 7 7 8 4 9 1 10 2 11 1 12 1 no thins mask: V<5.5 16 ver 3 virCL1.1? mask: V<5.5 26 ver 4 vir CL1.1? mask: V>5.5 30 ver 21 virCL1.1? mask: V>5.5 20 ver 20 vir CL1.1?
i18 77 38 67 22 p max 79 38 69 25 V Ct 0 2 1 2 2 2 3 5 4 3 5 3 6 4 7 4 8 7 9 2 10 3 11 1 12 4 13 5 14 4 15 7 16 2 17 5 18 3 19 1 20 1 21 4 23 2 24 2 25 4 26 1 27 2 28 1 29 2 30 1 32 1 {e4, e40} form a doubleton outlier set i7 and e10 are singleton outliers 45 remaining setosa. This is SubCluster 2 (may have additional outliers or sub-subclusters but we will not analyse further (would be done in practice tho 95 remaining versicolor and virginica=SubClus1. Continue outlier id rounds on SC1 (maxSL, maxSW, max PW) then do "capped tube" (further subclusters.) i32 79 38 64 20 p max 79 44 69 25 V Ct 0 2 2 6 3 3 4 4 5 4 6 2 7 6 8 9 9 2 10 2 11 2 12 5 13 7 14 2 15 6 16 2 17 5 19 3 20 2 22 3 23 2 24 3 25 2 26 1 27 1 28 1 29 3 30 1 31 2 32 1 e32 42 1 e11 43 2 e8,44 44 1 e49 51 1 i39 60 1 61 1 62 1 63 1 64 1 65 1 66 1 67 3 68 4 69 4 70 3 71 3 72 4 73 2 74 5 75 1 76 2 77 1 78 3 79 1 80 1 s3 83 1 s9 84 2 s39,43 85 1 s42 87 1 s23 91 1 s14 2 actual gap-ouliers, checking distances reveals 4 e-outlier (versicolor), 5 s-outliers (setosa). i19 77 26 69 23 p max 79 44 69 25 V Ct 0 2 1 1 2 3 3 3 4 4 5 2 6 6 7 3 8 5 9 4 10 4 11 2 12 3 13 4 14 6 15 4 16 1 17 7 18 2 19 3 20 2 22 2 23 1 24 2 25 4 26 4 27 1 28 2 29 2 30 1 32 2 33 1 34 1 35 1 No new outliers reviealed x=s15 58 40 12 2 (58=avg(y1) ) V Ct 0 3 s15, s17, s34 1 12 s 6,11,16,19,20,22,28,32,3337,47,49 2 12 s 1,10 13,18,21,27,29,40,41,44,45,50 3 7 s 2,12,23,24,35,36,38 4 10 s 2,3,7,13,25,26,30,31,46,48 5 2 s4, s43 6 2 s9,s39 7 1 s14 8 1 i39 9 1 s32 ^^all 50 setosa + i39 14 1 e49 16 2 17 2 19 1 20 2 21 5 22 4 23 3 24 4 25 1 27 8 28 2 29 2 30 4 31 1 32 4 34 2 35 2 36 2 37 3 38 2 39 2 40 4 41 1 43 2 44 4 45 2 46 1 47 2 48 1 50 4 52 2 53 2 54 2 56 2 57 1 58 1 i1 62 1 i31 vv 9 virginica 63 1 i10 64 1 i8 66 1 i36 69 1 i32 74 1 i16 76 1 i18 77 1 i23 85 1 i19 But here I mistakenly used the mean rather than the max corner. So I will redo - but note the high level of cluster and outlier revelation????? 1. 
(y-p)o(y-p) remove edge outliers ( thr>2*50) 2. lthin gaps in SPD: d, from an edge point to MN. 3 For each thin PL, do len gap anal of pts in " tube". e13 i7 e40 e4 e10 F e13 0 14 7 6 10 28 i7 14 0 9 9 8 29 e40 7 9 0 2 4 29 e4 6 9 2 0 5 30 e10 10 8 4 5 0 32 e32 e11 e8 e44 e49 e32 0 7 7 6 8 e11 7 0 5 4 7 e8 7 5 0 1 4 e44 6 4 1 0 4 e49 8 7 4 4 0 SPD(y) =(y-p)o(y-p)-(y-p)od2 d: mn-mx V Ct Next slide i1 63 33 60 25 p max 79 38 69 25 V Ct 0 2 1 10 2 11 3 6 4 15 5 4 6 8 7 9 8 4 9 5 10 2 11 7 13 4 14 2 15 2 16 1 17 1 18 1 19 1 e30, e15 outliers e20,e31,e32 form SC12 Declared tripleton outlier set? (But they are not singleton outliers.) s3 s9 s39 s43 s42 s23 s3 0 4 4 3 9 5 s9 4 0 1 3 6 8 s39 4 1 0 2 7 7 s43 3 3 2 0 9 5 s42 9 6 7 9 0 13 s23 5 8 7 5 13 0 e13 e20 e15 e31 e32 e30 F e13 0 5 9 6 6 7 15 e20 5 0 5 2 3 4 15 e15 9 5 0 6 6 4 16 e31 6 2 6 0 1 4 17 e32 6 3 6 1 0 3 18 e30 7 4 4 4 3 0 19
Cone Clustering:(finding cone-shaped clusters) x=s2 cone=.1 39 2 40 1 41 1 44 1 45 1 46 1 47 1 52 1 i39 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 59 w maxs-to-mins cone=.939 14 1 i25 16 1 i40 18 2 i16 i42 19 2 i17 i38 20 2 i11 i48 22 2 23 1 24 4 i34 i50 25 3 i24 i28 26 3 i27 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 49 1 i39 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 114 14 i and 100 s/e. So picks i as 0 w naaa-xaaa cone=.95 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 23 6 i21 24 5 25 1 27 1 28 1 29 2 30 2 i7 41/43 e so picks e Cosine cone gap (over some angle) Gap in dot product projections onto the cornerpoints line. Corner points x=s1 cone=1/√2 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 50 x=s2 cone=1/√2 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 51 x=s2 cone=.9 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 47 w maxs cone=.707 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 137 w maxs cone=.93 8 1 i10 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 26 1 e21 e34 27 2 29 2 37 1 i7 27/29 are i's F=(y-M)o(x-M)/|x-M|-mn restricted to a cosine cone on IRIS w aaan-aaax cone=.54 7 3 i27 i28 8 1 9 3 10 12 i20 i34 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6 100/104 s or e so 0 picks i x=i1 cone=.707 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 75 x=e1 cone=.707 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 60 Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths. Length of the fixed vector, x-M, is a one-time calculation. 
Length y-M changes with y so build the PTreeSet. w maxs cone=.925 8 1 i10 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 26 1 e21 e34 27 2 28 1 29 2 31 1 e35 37 1 i7 31/34 are i's w xnnn-nxxx cone=.95 8 2 i22 i50 10 2 11 2 i28 12 4 i24 i27 i34 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 34 1 i39 43/50 e so picks out e
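Restricting the projection to a cosine cone can be sketched as follows: a point y is kept only if the cosine of the angle between y-M and x-M is at least the cone threshold, and the kept points are then projected onto the M-to-x line. The sketch omits the "-mn" (subtract the minimum) shift in the slide formula, uses hypothetical 2-D points, and works on plain lists rather than a PTreeSet; the function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine_to_axis(y, m, x):
    """Cosine of the angle between y-M and x-M, or None when y coincides with M."""
    w, a = sub(y, m), sub(x, m)
    len_w = math.sqrt(dot(w, w))
    if len_w == 0:
        return None
    return dot(w, a) / (len_w * math.sqrt(dot(a, a)))

def cone_projections(points, m, x, min_cos):
    """(y, F(y)) for points inside the cone, with F(y) = (y-M)o(x-M)/|x-M|."""
    a = sub(x, m)
    len_a = math.sqrt(dot(a, a))  # one-time calculation for the fixed vector x-M
    kept = []
    for y in points:
        c = cosine_to_axis(y, m, x)
        if c is not None and c >= min_cos:
            kept.append((y, dot(sub(y, m), a) / len_a))
    return kept

# Hypothetical data: cone apex M at the origin, axis pointing along x.
m, x = (0, 0), (1, 0)
points = [(2, 0), (1, 1), (0, 3), (-1, 0)]

narrow = cone_projections(points, m, x, min_cos=0.9)
wide = cone_projections(points, m, x, min_cos=0.7)
print(narrow, wide)
```

Gaps would then be searched in the F-values of the kept points only, in the same way as for the unrestricted dot product projection.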
FxM(x,y)=yo(x-M)/|x-M|-min on XX≡{(x,y)|x,yX}, where X(x,y) is a Spaeth image tableCluster by splitting at all F_gaps > 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0Level0, stride=z1 PointSet (as a pTree mask) z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15 APPENDIX X xy \y=1 2 3 4 5 6 7 8 9 a b 1 1 x=1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 The 15 Value_Arrays (one for each x) z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 x y FxM z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 M (=MeanVector) The 15 Count_Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 gap: 10-6 gap: 5-2 pTree masks of the 3 z1_clusters (obtained by ORing) z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 The FAUST algorithm: 1. project onto each Mx line using the dot product with the unit vector from M to x. (only x=z1 is shown) 2. Generate each Value Array, F[x0]|(y), xX (also generate the Count_Arrays and the mask pTrees). 3. 
Analyze all gaps and create sub-cluster pTree Masks.
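The three steps can be sketched as follows for a single projection line. This is a minimal illustration under stated assumptions: points are plain tuples rather than pTrees, masks are 0/1 lists, and the names `project_onto_mx` and `gap_masks` as well as the toy data are hypothetical.

```python
import math

def project_onto_mx(points, x):
    """F(y) = y o (x - M)/|x - M| - min, with M the mean of the points."""
    n = len(points)
    mean = [sum(col) / n for col in zip(*points)]
    d = [a - b for a, b in zip(x, mean)]
    norm = math.sqrt(sum(c * c for c in d))
    raw = [sum(a * b for a, b in zip(y, d)) / norm for y in points]
    low = min(raw)
    return [r - low for r in raw]

def gap_masks(f_values, min_gap):
    """Split the sorted F values wherever consecutive values differ by more than min_gap."""
    order = sorted(range(len(f_values)), key=lambda i: f_values[i])
    masks = []
    previous = None
    for i in order:
        if previous is None or f_values[i] - previous > min_gap:
            masks.append([0] * len(f_values))  # start a new sub-cluster mask
        masks[-1][i] = 1
        previous = f_values[i]
    return masks

# Hypothetical toy data: three well-separated groups.
toy = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10), (20, 20), (21, 20)]
f = project_onto_mx(toy, x=toy[0])
masks = gap_masks(f, min_gap=2)
```

On the toy data this yields three masks, one per group. Repeating the projection for other choices of x gives further partitions of the same points.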
yo(z7-M)/|z7-M| ValueArrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 yo(z7-M)/|z7-M| CountArrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 gap: 6-9 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 In Step_3 of the algorithm we can: Analyze one of the gap arrays (e.g., As done for z1. Subclusters is shown above) then start over on each subcluster. Or we can analyze all gap arrays concurrently (in parallel using the same F - saving the [substantial?] re-compute costs?) and then intersect the subcluster partitions we get from each x_ValueArray gap analysis, forthe final subclustering. Here we use the second alternative, judiciously choosing only the x's that are likely to be productive (choosing z7 next). Many are likely to produce redundant partitions - e.g., z1, z2, z3, z4, z6 - as their projection lines will be nearly coincident. How should we choose the sequence of "productive" strides? 
One way would be to always choose the remaining stride with the shortest ValueArray, so that the chance of finding decent-sized gaps is maximized. Are there other ways of choosing?
yo(x-M)/|x-M| Value Arrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean yo(x-M)/|x-M| Count Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 gap: 3-7 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 We choose zd=z13 next (Should have been first? Since it's ValueArray is shortest?) Note, z8, z9, za projection lines will be nearly coincident with that of z7.
yo(x-M)/|x-M| Value Arrays z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2 3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9 11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2 3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10 11 Cluster by splitting at gaps > 2 x y F z1 z1 14 z1 z2 12 z1 z3 12 z1 z4 11 z1 z5 10 z1 z6 6 z1 z7 1 z1 z8 2 z1 z9 0 z1 z10 2 z1 z11 2 z1 z12 1 z1 z13 2 z1 z14 0 z1 z15 5 9 5 Mean yo(x-M)/|x-M| Count Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1 3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1 1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1 3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 AND each red with each blue with each green, to get the subcluster masks (12 ANDs producing 5 sub-clusters.
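The intersection step can be checked with a short sketch. It stores each mask as a plain list of 0/1 values instead of a compressed pTree, and the function name `intersect_partitions` is hypothetical. The masks are the z1, z7 and zd masks listed above.

```python
from itertools import product

def intersect_partitions(*partitions):
    """AND one mask from each partition and keep the non-empty results."""
    subclusters = []
    combinations = 0
    for combo in product(*partitions):
        combinations += 1
        mask = [int(all(bits)) for bits in zip(*combo)]
        if any(mask):
            subclusters.append(mask)
    return subclusters, combinations

z1_masks = [
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # z11
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # z12
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # z13
]
z7_masks = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # z71
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # z72
]
zd_masks = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # zd1
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # zd2
]

subclusters, combinations = intersect_partitions(z1_masks, z7_masks, zd_masks)
```

With these masks there are 3 x 2 x 2 = 12 combinations, and 5 of them are non-empty: {z1..z5}, {z6}, {z7..za}, {zb..ze} and {zf}.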
F1(x,y) = L1Distance(x,y) = (|x1-y1|+|x2-y2|) on XX≡{(x,y)|x,yX}, Cluster by splitting at all F1_gaps L1(x,y) ValueArray z1 0 2 4 5 10 13 14 15 16 17 18 19 20 z2 0 2 3 8 11 12 13 14 15 16 17 18 z3 0 2 3 8 11 12 13 14 15 16 17 18 z4 0 2 3 4 6 9 11 12 13 14 15 16 z5 0 3 5 8 9 10 11 12 13 14 15 z6 0 5 6 7 8 9 10 z7 0 2 5 8 11 12 13 14 15 16 z8 0 2 3 6 9 11 12 13 14 z9 0 2 3 6 11 12 13 14 16 z10 0 3 5 8 9 10 11 13 15 z11 0 2 3 4 7 8 11 12 13 15 17 z12 0 1 2 3 6 8 9 11 13 14 15 17 19 z13 0 2 3 5 8 11 13 14 16 18 z14 0 1 2 3 7 9 10 12 14 15 16 18 20 z15 0 4 5 6 7 8 9 10 11 13 15 L1(x,y) Count Array z1 1 2 1 1 1 1 2 1 1 1 1 1 1 z2 1 3 1 1 1 2 1 1 1 1 1 1 z3 1 3 1 1 1 1 1 2 1 1 1 1 z4 1 2 1 1 1 1 1 2 1 2 1 1 z5 1 3 2 1 1 1 2 1 1 1 1 z6 1 2 3 2 4 1 2 z7 1 2 1 1 1 1 2 4 1 1 z8 1 2 1 1 1 2 4 1 2 z9 1 2 1 1 3 2 1 3 1 z10 1 2 2 2 1 2 2 2 1 z11 1 1 2 1 1 1 2 1 2 2 1 z12 1 1 1 1 1 1 1 2 1 1 1 2 1 z13 1 1 2 1 1 1 1 3 3 1 z14 1 1 1 1 1 1 1 2 1 1 1 2 1 z15 1 1 1 1 2 1 1 1 2 3 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 gap: 10-5 (redundant subclustering) There is a z1-gap, but it produces a subclustering that was already discovered by a previous round. Which z values will give new subclusterings?
L1(x,y) ValueArray z1 0 2 4 5 10 13 14 15 16 17 18 19 20 z2 0 2 3 8 11 12 13 14 15 16 17 18 z3 0 2 3 8 11 12 13 14 15 16 17 18 z4 0 2 3 4 6 9 11 12 13 14 15 16 z5 0 3 5 8 9 10 11 12 13 14 15 z6 0 5 6 7 8 9 10 z7 0 2 5 8 11 12 13 14 15 16 z8 0 2 3 6 9 11 12 13 14 z9 0 2 3 6 11 12 13 14 16 z10 0 3 5 8 9 10 11 13 15 z11 0 2 3 4 7 8 11 12 13 15 17 z12 0 1 2 3 6 8 9 11 13 14 15 17 19 z13 0 2 3 5 8 11 13 14 16 18 z14 0 1 2 3 7 9 10 12 14 15 16 18 20 z15 0 4 5 6 7 8 9 10 11 13 15 L1(x,y) Count Array z1 1 2 1 1 1 1 2 1 1 1 1 1 1 z2 1 3 1 1 1 2 1 1 1 1 1 1 z3 1 3 1 1 1 1 1 2 1 1 1 1 z4 1 2 1 1 1 1 1 2 1 2 1 1 z5 1 3 2 1 1 1 2 1 1 1 1 z6 1 2 3 2 4 1 2 z7 1 2 1 1 1 1 2 4 1 1 z8 1 2 1 1 1 2 4 1 2 z9 1 2 1 1 3 2 1 3 1 z10 1 2 2 2 1 2 2 2 1 z11 1 1 2 1 1 1 2 1 2 2 1 z12 1 1 1 1 1 1 1 2 1 1 1 2 1 z13 1 1 2 1 1 1 1 3 3 1 z14 1 1 1 1 1 1 1 2 1 1 1 2 1 z15 1 1 1 1 2 1 1 1 2 3 1 xyx\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 M d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 This re-confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. Re-confirms zf an anomaly. After having subclustered with linear gap analysis, which is best for determining larger subclusters, we run this round gap algorithm out only 2 steps to determine if there are any singleFvalue gaps>2 (the points in the singleFvalueGapped set are then declared anomalies). So we run it out two steps only, then find those points for which the one initial gap determined by those first two values is sufficient to declare outlierness. Doing that here, we reconfirm the outlierness of z6 and zf, while finding new outliers, z5 and za.
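The two-step check amounts to looking at the first gap of each ValueArray, that is, the L1 distance from a point to its nearest neighbor. A minimal sketch, assuming plain tuples instead of pTrees; the function name `l1_outliers` and the toy points are hypothetical.

```python
def l1_outliers(points, min_gap):
    """Return indices of points whose nearest L1 neighbor is more than min_gap away."""
    flagged = []
    for i, x in enumerate(points):
        nearest = min(
            sum(abs(a - b) for a, b in zip(x, y))
            for j, y in enumerate(points)
            if j != i
        )
        # The ValueArray of x starts 0, nearest, ...; the first gap is 'nearest'.
        if nearest > min_gap:
            flagged.append(i)
    return flagged

# Hypothetical toy data: points 3 and 5 sit far from everything else.
toy = [(0, 0), (1, 0), (0, 1), (10, 10), (1, 1), (5, 5)]
outliers = l1_outliers(toy, min_gap=2)
```

Applied to the ValueArrays listed above with a gap threshold of 2, the same rule flags the rows whose second value exceeds 2: z5 (0 3), z6 (0 5), za (0 3) and zf (0 4).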
Using F=yo(x-M)/|x-M|-MIN on IRIS, one stride at a time (s1=setosa1 first) For virginica1 Val Ct 0 1 1 1 2 2 3 5 4 6 5 11 6 12 7 4 8 2 9 5 10 1 17 1 22 1 24 2 25 1 27 1 28 1 29 2 30 1 31 3 32 4 33 1 34 4 35 2 36 2 37 4 38 4 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 F(i39)=17 F<17 (50 Setosa) vers1 Val Ct 0 1 2 4 3 1 4 1 5 3 6 3 7 8 8 3 9 7 10 6 11 4 12 4 13 3 15 2 19 2 20 2 21 1 26 2 27 3 28 4 30 2 31 5 32 4 33 3 34 1 36 3 37 5 38 4 39 5 40 7 41 4 42 2 43 2 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 F<19 (50 setosa) 19<F<22 {vers8,12,39,44,49} 22<F yo(s1-M)/|s1-M|-69) ValCt 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 52 1 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 F(i39)=52 virginica39 is an outlier. 2 clusters, F<52 (ct=99) and F>52 (50 Setosa) virgini39 Val Ct 0 1 1 2 2 1 4 2 6 1 7 1 8 7 9 2 10 2 11 7 12 2 13 3 14 7 15 4 16 10 17 4 18 6 19 9 20 3 21 6 22 3 23 6 24 3 25 1 27 3 28 2 32 1 39 1 40 1 41 1 42 8 43 13 44 17 45 4 46 5 47 1 F=32 vers49 outlier. 32<F (50 Setosa, vir39) AVG(ver8,12,39,44,49) Val Ct 0 1 1 1 7 5 10 3 12 2 13 2 14 3 15 5 16 2 17 5 18 8 19 4 20 3 21 4 22 3 23 8 24 4 25 4 26 3 27 7 28 7 29 4 30 5 31 4 32 5 33 8 34 2 35 6 36 5 37 3 38 2 39 8 40 6 41 3 43 1 44 2 45 1 47 1 F=0 vir32 outlier F=1 vir18 outlier F=7 vir6,10,19,23,36 subcluster?
On Remaining, mx mn mx mx ValCt 0 3 1 4 2 11 3 14 4 14 5 9 6 10 7 2 8 6 9 2 11 2 On Clus(F<52) ver1 F(virg7)=0 outlier F(virg32)=25 outlier ValCt 0 1 4 1 5 5 6 3 7 5 8 3 9 8 10 11 11 14 12 8 13 8 14 5 15 3 16 7 17 5 18 6 19 2 20 1 21 1 22 1 25 1 F=yo(x-M)/|x-M|-MIN on IRIS, subclustering as we go. For s1 (i.e., yo(s1-M)/|s1-M|-69) ValCt 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 52 1 outlier 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 F(i39)=52 i39=virgi39 outlier. Clusters, F<52 (ct=99) and F>52 (50 Setosa) On Remaining, max's ValCt 0 2 e8 outlier 1 2 e11 outlier 7 2 8 1 9 4 10 1 11 2 12 2 13 4 14 3 15 1 16 4 17 2 18 2 19 3 20 4 21 6 22 5 23 5 24 4 25 2 26 2 27 1 28 2 29 4 30 5 31 1 32 3 33 2 34 2 35 3 36 2 37 1 38 1 39 1 i8 41 2 i10 i36 44 2 i6 i23 46 1 i19 48 1 i18 i6 i8 i10 i19 i23 i35 i6 0 5 10 5 3 20 i8 5 0 10 9 6 15 i10 10 10 0 14 12 19 i19 5 9 14 0 4 22 i23 3 6 12 4 0 20 i35 20 15 19 22 20 0 i6 i10 i18 i19 i23 i35 all declared outliers e4 e38 e19 i20 F e4 0 9 9 11 9 e38 9 0 3 7 9 e19 9 3 0 5 11 outlier i20 11 7 5 0 11 outlier On Remaining, max's ValCt 0 2 e44 outlier 6 1 7 2 8 1 9 3 10 1 11 3 12 5 13 2 14 2 15 3 17 3 18 3 19 5 20 1 21 9 22 5 23 4 24 2 26 4 27 2 28 2 29 4 30 2 31 3 32 3 33 2 34 3 35 2 36 1 37 1 38 1 39 1 42 1 e36 outlier? 
On Remaining, mx mx mx mn ValCt 0 1 1 2 2 3 3 1 5 5 6 4 7 5 8 2 9 3 10 5 11 4 12 7 13 5 14 2 15 4 16 4 17 7 18 4 19 4 20 2 21 2 22 1 24 1 25 1 27 2 29 2 On Remaining, mn mn mx mx ValCt 0 1 1 3 2 3 3 7 4 7 5 7 6 5 7 5 8 3 9 8 10 4 11 4 12 11 13 4 14 8 15 4 16 1 18 1 On Remaining, mn mx mx mx ValCt 0 1 2 1 3 4 4 3 5 5 6 4 7 5 8 7 9 8 10 3 11 5 12 2 13 4 14 5 15 7 16 5 17 4 18 1 20 1 On Remaining w e35 ValCt 0 1 i26 outlier 3 2 On remaining vir1 ValCt 0 1 0 1 0 1 1 2 2 1 4 1 5 1 6 2 7 2 8 2 9 4 10 1 11 4 12 3 13 4 14 2 15 6 16 4 17 6 19 4 20 5 21 5 22 2 23 1 24 2 25 5 26 4 27 4 28 1 29 2 30 6 31 2 32 1 33 1 34 1 35 2 36 1 38 1 39 1 e35 e10 e35 0 7 e10 7 0 outlier i44 i3 i44 0 4 i3 4 0 ^^outlier i3 i30 i31 i26 i8 i36 i3 0 5 5 4 5 7 i30 5 0 5 3 6 9 outlier i31 5 5 0 5 3 5 outlier i26 4 3 5 0 4 7 outlier i8 5 6 3 4 0 7 outlier i36 7 9 5 7 7 0 outlier Rem mn mx mn mx ValCt 0 1 1 1 2 1 3 1 4 1 5 1 6 1 8 1 9 3 10 5 11 5 12 3 13 7 14 6 15 4 16 6 17 7 18 5 19 4 20 2 21 3 22 7 23 4 24 3 25 1 26 1 27 2 33 1 e49 outlier On Remaining, mn mx mx mx ValCt 0 1 1 1 2 1 3 5 4 6 5 5 6 4 7 9 8 4 9 4 10 4 11 3 12 5 13 6 14 6 15 7 16 5 17 4 18 4 20 1 22 1 Could look at distances for 0,1 and 20,22? e13 e30 e32 e13 0 7 3 outlier e30 7 0 6 outlier e32 3 6 0 i44 i45 i49 i5 i37 i1 i44 0 3 8 4 6 6 i45 3 0 6 5 4 5 i49 8 6 0 6 2 6 i5 4 5 6 0 5 5 i37 6 4 2 5 0 4 not outlier i1 6 5 6 5 4 0 outlier
F=L1(x,y) on IRIS, masking to subclusters (go right down the table). Two rounds only.

outliers gap>L1=32.1:
s6 s14 s15 s16 s17 s19 s21 s23 s24 s32 s33 s34 s37 s42 s45 e1 e2 e3 e5 e6 e7 e9 e10 e11 e12 e13 e15 e18 e19 e21 e22 e23 e27 e28 e29 e30 e34 e36 e37 e38 e41 e49 i1 i3 i4 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i25 i26 i28 i30 i31 i32 i34 i35 i36 i37 i39 i41 i42 i45 i46 i47 i49 i50

outliers gap>L1=42.8:
s15 s16 s19 s23 s33 s34 s37 s42 s45 e1 e2 e7 e10 e11 e12 e13 e15 e19 e21 e22 e23 e27 e28 e30 e34 e36 e38 e41 e49 i1 i3 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i26 i30 i31 i32 i34 i35 i36 i39

outliers gap>L1=53.5:
s15 s16 s23 s33 s34 s42 e10 e13 e15 e27 e28 e30 e36 e49 i1 i3 i7 i9 i10 i12 i15 i18 i19 i20 i26 i30 i32 i35 i36 i39

outliers gap>L1=64.3:
s15 s16 s23 s42 e10 e13 e49 i3 i7 i9 i10 i18 i19 i20 i32 i35 i36 i39

outliers gap>L1=74.95 (point, L1gap):
s42 9
e13 8
i7 10
i9 12
i10 12
i35 9
i36 9
i39 26

If we use L1gap=6, remove those outliers, and then use linear gap analysis to reveal the larger subclusters, let's see if we can separate versicolor (e) from virginica (i).
Val=0;p=K;c=0;P=Pure1; For i=n to 0 {c=Ct(P&Pi); If (c>=p){Val=Val+2i;P=P&Pi}; else{p=p-c;P=P&P'i }; return Val, P; X3 X4 IDY z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf 1 3 2 3 6 9 15 14 15 13 10 11 9 11 7 1 1 2 3 2 3 1 2 3 4 9 10 11 11 8 1 1 2 3 2 3 1 2 3 4 9 10 11 11 8 : 1 3 2 3 6 9 15 14 15 13 10 11 9 11 7 : z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf : 1 3 2 3 6 9 15 14 15 13 10 11 9 11 7 1 1 2 3 2 3 1 2 3 4 9 10 11 11 8 z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf 1 3 2 3 6 9 15 14 15 13 10 11 9 11 7 z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf 1 1 2 3 2 3 1 2 3 4 9 10 11 11 8 IDX z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z1 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 z2 : ze ze ze ze ze ze ze ze ze ze ze ze ze ze ze zf zf zf zf zf zf zf zf zf zf zf zf zf zf zf X1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 : 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 X2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 : 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Need Rank(n-1) applied toeach stride instead of the entire pTree. The result from stride=j gives the jth entry of SpS(X,d(x,X-x)) Parallelize over a large cluster? Ct(P&Pi): revise the Count proc to kick out count for each stride (involves loop down pTree by register-lengths? What does P represent after each step?? How does alg go on 2pDoop (w 2 level pTrees) where each stride is separate Note: using d, not d2 (fewer pTrees). Can we estimate d? 
(using truncated McClarin series) P3 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 : 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 P2 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 1 0 P0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 : 0 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 0 d(xy) 0 2 1 3 4 8 14 13 14 12 12 13 13 14 9 2 0 1 2 2 6 12 11 12 10 11 12 12 13 8 : 14 13 13 11 11 8 11 9 9 7 2 1 2 0 5 9 8 8 6 6 5 11 9 9 7 3 4 4 5 0 P'3 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 : 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 P'2 1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 : 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 P'0 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 : 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 P1 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 : 1 0 0 1 1 0 1 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 P'1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 : 0 1 1 0 0 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 23 * + 22 * + 21 * + 20 * 0 0 1 = 1 0 n=3: c=Ct(P&P3)=10< 14, p=14–10=4; P=P&P' (elim 10 val8) n=2: c=Ct(P&P2)= 1 < 4, p=4-1=3; P=P&P' (elim 1 val4) n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P' (elim 2 val2) n=0: c=Ct(P&P0 )=2>=1 P=P&P0 (elim 1 val<1) 23 * + 22 * + 21 * + 20 * 0 0 1 = 1 0 n=3: c=Ct(P&P3)=9< 14, p=14–9=5; P=P&P' (elim 9 val8) n=2: c=Ct(P&P2)= 0 < 5, p=5-0=5; P=P&P' (elim 0 val4) n=1: c=Ct(P&P1)=4 < 5, p=5-4=1; P=P&P' (elim 4 val2) n=0: c=Ct(P&P0 )=1>=1 P=P&P0 (elim 1 val<1 23 * + 22 * + 21 * + 20 * 0 0 1 = 1 0 n=3: c=Ct(P&P3)= 9 < 14, p=14–9=5; P=P&P' (elim 9 val8) n=2: c=Ct(P&P2)= 2 < 5, p=5-2=3; P=P&P' (elim 2 val4)2 n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P' (elim 2 val2) n=0: c=Ct(P&P0 )=2>=1 P=P&P0 (elim 1 val<1) 23 * + 22 * + 21 * + 20 * 0 0 1 = 3 1 n=3: c=Ct(P&P3)= 6 < 14, p=14–6=8; 
P=P&P' (elim 6 val8) n=2: c=Ct(P&P2)= 7 < 8, p=8-7=1; P=P&P' (elim 7 val4)2 n=1: c=Ct(P&P1)=11, p=1-1=0; P=P&P (elim 0 val2) n=0: c=Ct(P&P0 )=1 0 P=P&P0 (elim 0)
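The rank loop above can be sketched in runnable form. This is a minimal illustration that uses Python integers as uncompressed bit masks in place of pTrees; the names `bit_slices` and `rank_k` are hypothetical. Following the pseudocode, it returns the K-th largest value together with the mask P of the positions holding that value.

```python
def bit_slices(values, nbits):
    """Slice i is an int mask: bit j is set iff bit i of values[j] is set."""
    return [
        sum(((v >> i) & 1) << j for j, v in enumerate(values))
        for i in range(nbits)
    ]

def rank_k(values, k, nbits):
    """Return (K-th largest value, mask of positions with that value)."""
    slices = bit_slices(values, nbits)
    full = (1 << len(values)) - 1  # the Pure1 mask
    P, p, val = full, k, 0
    for i in range(nbits - 1, -1, -1):
        with_bit = P & slices[i]
        c = bin(with_bit).count("1")  # Ct(P & Pi)
        if c >= p:
            val += 1 << i  # bit i of the answer is 1
            P = with_bit
        else:
            p -= c  # skip the c larger values
            P = P & ~slices[i] & full  # P & P'i
    return val, P

# The d(z1, y) row from the table above; rank n-1 = 14 of 15 values.
row = [0, 2, 1, 3, 4, 8, 14, 13, 14, 12, 12, 13, 13, 14, 9]
nearest, where = rank_k(row, k=14, nbits=4)
```

For this row the counts are 10, 1, 2 and 1 going from bit 3 down to bit 0, and the result is 1, the smallest nonzero distance from z1. After each step P marks the candidates that can still hold the K-th largest value.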
Level-1 key map Red=pure stride (so no Level-0) 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 e 2 34 0 0 0 0 f5 6 0 0 0 g7 0 0 h 0 i 8 9 a 0 0 0 0 j b c 0 0 0 k d 0 0 m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 33 32 31 30 43 42 41 40 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 e 2 3 4 f 5 6 g 7 h i 8 9 a j b c k d m 33 32 31 30 43 42 41 40 2 3 4 5 6 7 8 9 a b c d e f g h i j k m 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 Level-0: key map (6-e) = f else pur0 (6-e) = g else pur0 (6-e) = h else pur0 (6-e) = e else pur0 In this 2pDoop KEY-VALUE DB, we list keys. Should we bitmap? Each bitmap is a pTree in the KVDB. 
Each of these is existing, e.g., e here 5,7-a,f=f else pur0 5,7-a,f=g else pur0 5,7-a,f=h else pur0 234789bcefh else pr0 234789bcefg els pr0 124-79c-f h else pr0 (b-f) = j else pur0 (b-f) = k else pur0 (b-f) = i else pur0 (b-f) = m else pur0 (a) = j else pur0 (a) = k else pur0 (a) = m else pur0 p23p43 p23p42 p13p33 + p13p32 + -27( =SpS(XX, (3-6,8,9) k, els pr0 (3-6,8,9) m els pr0 + p13p31 + p23p41 p13+p23+p33+p43 +p13p12+ p23p22+ p33p32 + +p43p42 ) -26( 26( 124679bd m els pr0 p23p21 + p33p31 + p43p41 ) p13p11+ -25( p13p30 +p23p40 25( +p12p31 +p22p41 +p12p32 +p22p42 p12+p22+p32+p42 +p23p20 +p13p10+ +p33p30 +p43p40 24( -24(p12p30 +p22p40 +p32p31 +p42p41 ) +p12p11+ +p22p21 p22p20 + p32p30 + p42p40 ) p12p10+ -23(p11p31 +p11p30 +p21p41 +p21p40 23( p10+p20+p30+p40) p11+p21+p31+p41 +p11p10 + +p21p20 + +p31p30 -22(p10p30 +p20p40 +p41p40 ) 22(
Dot Product Projection (DPP)
1. Check F=Dotp,d(y) for gaps or thin intervals.
1.1 Check actual distances at sparse ends.

Here we apply DPP to the IRIS data set: 150 iris samples (rows) and 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). We assume we don't know ahead of time that the first 50 are the Setosa class, the next 50 are the Versicolor class and the final 50 are the Virginica class. We cluster with DPP and then see how close we came to separating the 3 classes (s=setosa, e=versicolor, i=virginica).

Analyzing the thin interval [8,9]: pairwise distances among the points with F=7 (e21 through i50) and F=10 (e2 through i34).

F 7 7 7 7 7 7 7 7 7 7 7 10 10 10 10 10 10 10 10 10 10 10 10
- e21 i4 i8 i9 i17 i24 i26 i27 i28 i38 i50 e2 e3 e12 e5 e17 e19 e23 e29 e35 e37 i20 i34
e21 0 9 21 15 9 6 18 5 3 9 4 7 11 7 8 6 11 9 5 7 9 11 7
i4 9 0 12 6 2 7 10 8 7 2 6 12 10 15 11 13 13 9 12 15 10 10 6
i8 21 12 0 9 11 17 4 19 18 12 18 21 15 25 19 25 22 18 22 26 17 20 16
i9 15 6 9 0 6 10 9 12 12 7 12 15 11 19 13 18 15 10 16 19 13 11 9
i17 9 2 11 6 0 7 9 8 7 1 7 11 8 15 10 14 13 9 12 15 9 11 6
i24 6 7 17 10 7 0 15 2 4 7 5 7 8 9 5 9 7 4 6 11 7 7 4
i26 18 10 4 9 9 15 0 16 16 9 16 17 12 22 16 22 21 16 20 24 14 19 14
i27 5 8 19 12 8 2 16 0 2 8 5 6 8 8 5 8 7 4 5 9 7 7 4
i28 3 7 18 12 7 4 16 2 0 7 3 6 9 8 6 7 9 6 5 9 7 9 5
i38 9 2 12 7 1 7 9 8 7 0 6 10 8 14 10 13 14 9 11 14 9 11 6
i50 4 6 18 12 7 5 16 5 3 6 0 9 11 9 9 7 11 7 7 8 9 9 5
e2 7 12 21 15 11 7 17 6 6 10 9 0 6 6 4 8 10 8 5 10 4 12 7
e3 11 10 15 11 8 8 12 8 9 8 11 6 0 12 6 14 12 8 10 16 3 13 7
e12 7 15 25 19 15 9 22 8 8 14 9 6 12 0 7 4 9 9 3 6 9 11 10
e5 8 11 19 13 10 5 16 5 6 10 9 4 6 7 0 9 7 5 5 11 4 9 5
e17 6 13 25 18 14 9 22 8 7 13 7 8 14 4 9 0 10 9 4 2 11 10 9
e19 11 13 22 15 13 7 21 7 9 14 11 10 12 9 7 10 0 5 7 11 10 5 9
e23 9 9 18 10 9 4 16 4 6 9 7 8 8 9 5 9 5 0 6 11 7 4 4
e29 5 12 22 16 12 6 20 5 5 11 7 5 10 3 5 4 7 6 0 6 8 9 7
e35 7 15 26 19 15 11 24 9 9 14 8 10 16 6 11 2 11 11 6 0 13 11 11
e37 9 10 17 13 9 7 14 7 7 9 9 4 3 9 4 11 10 7 8 13 0 12 6
i20 11 10 20 11 11 7 19 7 9 11 9 12 13 11 9 10 5 4 9 11 12 0 7
i34 7 6 16 9 6 4 14 4 5 6 5 7 7 10 5 9 9 4 7 11 6 7 0

These are the actual distances. The distance from each F=7 point to each F=10 point is >=4. F-gap from F=6 to F=11 >=4. F-gap from F=6 to F=10 >=4.
Separate at F=8.5 into CLUS2.1<8.5 (2 ver, 43 vir) and CLUS2.2>8.5 (44 ver, 4 vir).

gap>=4, CLUS2, p=aaan q=aaax (F Count):
0 3 1 3 2 8 3 2 4 6 5 5 6 5 7 11 8 2 9 4 10 12 11 8 12 13 13 5 14 3 15 7

gap>=4, p=nnnn q=xxxx (F Count):
0 2 3 2 4 1 5 1 7 1 9 2 10 1 11 1 12 1 13 2 14 1 15 3 16 4 17 3 18 2 19 8 20 2 21 3 22 1 23 4 24 5 25 4 26 5 27 5 28 4 29 3 30 2 31 2 32 2 33 4 34 5 36 3 37 2 38 4 40 1 42 1 43 1 44 2 45 5 47 6 48 3 49 4 50 6 51 4 52 3 53 5 54 3 55 5 56 1 57 1 58 2 59 1 60 1

Sparse lower end (pairwise distances, F in the last column):
- i32 i18 i19 i23 i6 i36 F
i32 0 4 13 11 9 9 0
i18 4 0 12 10 8 10 0
i19 13 12 0 4 5 9 3
i23 11 10 4 0 3 7 3
i6 9 8 5 3 0 5 4
i36 9 10 9 7 5 0 5
i32, i18, i19 are gap>=4 outliers.

So, two rounds of Dotp,d(y) gap analysis yield CLUS1 (50 Setosa, plus 4 Versicolor), CLUS2.1 (43 Virginica, plus 2 Versicolor) and CLUS2.2 (44 Versicolor, plus 4 Virginica), and pick out 3 Virginica, 5 Setosa as outliers. (Would more outliers result from applying 1.1 to the sparse ends of the 2nd round?)
Round 1: p=nnnn (n=min) and q=xxxx (x=max). Round 2: p=aaan (a=avg) and q=aaax.

Thin interval (40, 44): pairwise distances, F in the last column.
- e4 i7 e10 e31 e32 s14 i39 s16 s19 e49 s15 e44 e11 e8 s6 s34 F
e4 0 9 5 3 4 34 24 34 29 11 35 9 8 10 29 34 37
i7 9 0 8 11 12 38 30 39 35 16 40 14 13 14 34 39 37
e10 5 8 0 5 6 32 23 31 27 10 33 8 9 8 27 32 38
e31 3 11 5 0 1 32 23 31 27 9 32 7 7 8 27 31 38
e32 4 12 6 1 0 31 22 30 25 8 31 6 7 7 26 30 38
s14 34 38 32 32 31 0 25 20 17 23 18 26 28 25 16 17 38
i39 24 30 23 23 22 25 0 20 17 17 20 21 24 21 18 21 40
s16 34 39 31 31 30 20 20 0 6 26 5 29 33 29 6 4 42
s19 29 35 27 27 25 17 17 6 0 21 6 24 27 24 3 5 43
e49 11 16 10 9 8 23 17 26 21 0 26 4 7 4 21 25 44
s15 35 40 33 32 31 18 20 5 6 26 0 29 33 29 7 4 44
e44 9 14 8 7 6 26 21 29 24 4 29 0 4 1 24 28 45
e11 8 13 9 7 7 28 24 33 27 7 33 4 0 5 27 32 45
e8 10 14 8 8 7 25 21 29 24 4 29 1 5 0 23 28 45
s6 29 34 27 27 26 16 18 6 3 21 7 24 27 23 0 5 45
s34 34 39 32 31 30 17 21 4 5 25 4 28 32 28 5 0 45

So i39, s16, s49, s15 are "thin area" outliers. Separate at 41, giving CLUS1<41 (50 Setosa, 4 Versicolor: e8, e11, e44, e49) and CLUS2>=41.

Sparse Ends analysis should accomplish the same outlier detection that a few steps of SL accomplish. If an outlier is surrounded at a fixed distance, then those neighbors will show up as sparse-end neighbors, and the outlier-ness of the point will be detected by looking at the pairwise distances of that sparse end.

Sparse upper end (pairwise distances, F in the last column):
- s23 s43 s9 s39 s42 s14 F
s23 0 5 8 7 13 7 56
s43 5 0 3 2 9 3 57
s9 8 3 0 1 6 3 58
s39 7 2 1 0 7 2 58
s42 13 9 6 7 0 8 59
s14 7 3 3 2 8 0 60
No gap>4 outliers.
Check Dotp,d(y) for thinnings. Calculate the AVG of each side of the thinning as p and q, then redo.

Dot, p=nnnn q=xxxx (F Count):
0 2 3 2 4 1 5 1 7 1 9 2 10 1 11 1 12 1 13 2 14 1 15 3 16 4 17 3 18 2 19 8 20 2 21 3 22 1 23 4 24 5 25 4 26 5 27 5 28 4 29 3 30 2 31 2 32 2 33 4 34 5 36 3 37 2 38 4 40 1 42 1 43 1 44 2 45 5 47 6 48 3 49 4 50 6 51 4 52 3 53 5 54 3 55 5 56 1 57 1 58 2 59 1 60 1

Dot, p=AVG>22 q=AVG<22 (F Count):
0 1 1 1 2 2 3 2 4 4 5 5 6 9 7 11 8 6 9 3 10 3 11 3 19 1 23 1 24 1 25 1 26 1 29 1 30 1 31 2 32 2 34 6 35 2 36 4 37 2 38 2 39 3 40 3 41 4 42 4 43 2 44 3 45 6 46 7 47 3 48 2 49 1 50 3 52 7 54 5 55 1 56 3 57 3 58 2 59 2 61 1 62 2 64 1 66 1 67 1 68 2 70 1

Cut=15: CLUS_1<15 (50 Set), CLUS_2>15
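The "average each side, then redo" step can be sketched as follows. This is a minimal illustration on plain tuples with a hypothetical toy data set; the names `dot_projection`, `mean_vector` and `refine_line` are assumptions, and the choice of which side becomes p and which becomes q is left to the caller.

```python
import math

def dot_projection(points, p, q):
    """Dot_{p,d}(y) = (y - p) o d, with d the unit vector from p to q."""
    d = [b - a for a, b in zip(p, q)]
    norm = math.sqrt(sum(c * c for c in d))
    return [
        sum((yk - pk) * dk for yk, pk, dk in zip(y, p, d)) / norm
        for y in points
    ]

def mean_vector(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def refine_line(points, f_values, cut):
    """Return the mean of the points below the cut and the mean of those at or above it."""
    low = [y for y, f in zip(points, f_values) if f < cut]
    high = [y for y, f in zip(points, f_values) if f >= cut]
    return mean_vector(low), mean_vector(high)

# Hypothetical toy data. Round 1 uses p = column minima, q = column maxima.
toy = [(0, 0), (2, 0), (0, 2), (10, 10), (12, 10), (10, 12)]
f1 = dot_projection(toy, p=(0, 0), q=(12, 12))
p2, q2 = refine_line(toy, f1, cut=8)
f2 = dot_projection(toy, p2, q2)  # round 2 along the line through both averages
```

The second projection runs along the line joining the two side averages, which is the direction that a thinning between two real clusters suggests.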