UpS = the Universe of all pTree Sets = {all vectors of SpSs (formerly known as SPTSs)}. V = n-dimensional vector space.

Code all operations as n-ary operations on UpS (1-level or multi-level):

n-ary operations on UpS: UpS × ... × UpS → UpS (result)

Addition is row-wise addition in each column:
$(SpS_{i1}, SpS_{i2}, \ldots, SpS_{in_i}) + \cdots + (SpS_{z1}, SpS_{z2}, \ldots, SpS_{zn_z}) \to (SpS_1, SpS_2, \ldots, SpS_n)$, where $SpS_h = SpS_{ih} + \cdots + SpS_{zh}$.
Dimension_of_result $= n = \min\{n_i, \ldots, n_z\}$, $|SpS| \equiv$ depth_of_SpS = cardinality_of_SpS = #_of_rows_SpS, and $|SpS_h| = \min\{|SpS_{ih}|, \ldots, |SpS_{zh}|\}$.

-, /, * are binary on SpSs, also row-wise.

If op is any of =, >, <, ≤, ≥, ≠, op produces one mask pTree of the truth of $(SpS_{i1} \;op\; SpS_{z1})$ AND ... AND $(SpS_{in} \;op\; SpS_{zn})$.

SPc = Scalar Product (multiplying by a constant, c). Usually unary. Even more efficient is to use c's bit pattern!
DPv = Dot Product with a fixed real vector, $v \in V$. ... use v's bit pattern?
SDv = Square Distance from a fixed real vector, $v \in V$. ... use v's bit pattern only?
ERa = FP's EinRings. Result = pTree mask of rows < a. Apply < above? Better, use a's bit pattern only?
AGa = YC's Aggregates: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries. (Here, the result is a number, but a number is a depth=1, width=1 SpS.)

Note, UpS includes SpSs of all cardinalities (= depths = # of rows). It seems best to code on UpS rather than on SpS^n (card(SpS) = n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SpS operands are of different depths, the result SpS's depth = depth of the shallowest operand (operating from the top of each).

Note, in HORIZONTAL structuring, a dataset = file = table is one column of rows. In VERTICAL structuring, a dataset is one row of columns. Our advantage: the column count is typically small and fixed, whereas the row count can be very large and variable.
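A minimal sketch of the row-wise n-ary addition and the comparison mask described above, assuming plain Python lists stand in for SpSs (no actual pTree compression). The names `nary_add` and `compare_mask` are hypothetical, and the choice of truncating the mask to the shallowest column is an assumption carried over from the depth rule for addition.

```python
import operator
from typing import Callable, List

SpS = List[float]       # one column of numbers (stand-in for a pTree-backed SpS)
SpSVector = List[SpS]   # an element of UpS: a vector of columns


def nary_add(*operands: SpSVector) -> SpSVector:
    """Row-wise addition, column by column, truncated to the shallowest operand."""
    if not operands:
        return []
    n = min(len(op) for op in operands)  # Dimension_of_result = min{n_i..n_z}
    result = []
    for h in range(n):
        depth = min(len(op[h]) for op in operands)  # |SpS_h|
        result.append([sum(op[h][r] for op in operands) for r in range(depth)])
    return result


def compare_mask(a: SpSVector, b: SpSVector,
                 op: Callable[[float, float], bool]) -> List[int]:
    """One mask: bit r is 1 only if op holds in every column for row r."""
    n = min(len(a), len(b))
    if n == 0:
        return []
    depth = min(min(len(a[h]), len(b[h])) for h in range(n))
    return [int(all(op(a[h][r], b[h][r]) for h in range(n)))
            for r in range(depth)]


x = [[1, 2, 3], [10, 20, 30]]
y = [[1, 1], [1, 1], [5, 5]]
total = nary_add(x, y)                      # two columns, two rows each
mask = compare_mask(x, y, operator.gt)      # x > y in every column?
```

Here `x` has 2 columns of depth 3 and `y` has 3 columns of depth 2, so the sum has 2 columns of depth 2.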
The advantage of defining all operations as n-ary operations on UpS is that we can then code various implementations independently and test them one against the other for speed. I.e., we can do the engineering on the operations.
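As an illustration of testing implementations against each other, here is a hedged sketch of SPc coded two ways: a direct row-wise multiply, and a shift-and-add version that uses only c's bit pattern. Both function names are hypothetical, and the sketch assumes integer rows and a non-negative integer c.

```python
from typing import List


def scalar_product_direct(column: List[int], c: int) -> List[int]:
    """SPc the obvious way: multiply every row by c."""
    return [c * x for x in column]


def scalar_product_bits(column: List[int], c: int) -> List[int]:
    """SPc using only c's bit pattern: add a shifted copy for each 1-bit of c."""
    if c < 0:
        raise ValueError("this sketch assumes a non-negative integer constant")
    result = [0] * len(column)
    shift = 0
    while c >> shift:                      # stop once all remaining bits are 0
        if (c >> shift) & 1:
            result = [acc + (x << shift) for acc, x in zip(result, column)]
        shift += 1
    return result


column = [3, 0, 7, 12]
direct = scalar_product_direct(column, 10)
by_bits = scalar_product_bits(column, 10)  # 10 = 1010b, so two shifted adds
```

Because both functions share one signature, either can be swapped in and timed without touching the calling code.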
Oblique FAUST (OF) Clustering: Linear (default, OFL), Spherical (OFS), Barrel (OFB), Conical (OFC)

Assume a real number table, TBL(C1..Cn) (= n-dim vector space; for categorical columns, either code to real numbers or bitmap, e.g., a Month column can be coded as {1,...,12} and a Color column can be bitmapped by Red(yes=1|no=0) ... Violet(yes=1|no=0)). TBL is converted to a PTreeSet.

Define the distance function $ds(x,y): TBL \times TBL \to R$,
$ds(x,y) = \sum_{k \in CR} r_k |x_k - y_k|^2 + \sum_{k \in CC} c_k |x_k - y_k|$,
where CR is the set of real columns, CC is the set of categorical columns (consider coded columns as real), and $r_k$, $c_k$ are real coefficients.

Each method uses a real-valued functional from X to R, and all methods are completely data parallel: the data can be distributed over a cluster and processed in parallel (dot product), then the partial results are sent home to be added.

Oblique FAUST Linear (OFL) clustering (enclose clusters between (n-1)-dimensional hyperplanar gaps):
$L_{p,d}: X \to R$, $L_{p,d}(x) = (x-p) \circ d$.
Find $a_1 < a_2$ such that $\emptyset = GapLower = \{x \mid a_1 < L_{p,d}(x) < a_1 + T\}$ and $\emptyset = GapUpper = \{x \mid a_2 < L_{p,d}(x) < a_2 + T\}$, and $C = \{x \mid a_1 + T < L_{p,d}(x) < a_2\}$.

Oblique FAUST Spherical (OFS) (enclose clusters with spherical gaps):
$S_p(x) = (x-p) \circ (x-p)$.
Search $S_p$ for a spherical gap, $\{x \mid r^2 \le S_p(x) < (r+T)^2\} = \emptyset$, so that the interior of the r-sphere about p encloses a sub-cluster.

Oblique FAUST Barrel (OFB) (enclose clusters with barrel gaps):
$B_{p,d}(x) = (x-p) \circ (x-p) - ((x-p) \circ d)^2$.
Search for GapLower > T, GapUpper > T and GapBarrel > $T^2$ (BR ≡ Barrel_Radius).
Note: $B_{p,d}(x) = S_p(x) - L_{p,d}^2(x)$.

Oblique FAUST Cone (OFC) (enclose clusters with cone gaps):
$C_{p,d}(x) = (x-p) \circ d \,/\, \sqrt{(x-p) \circ (x-p)}$.
Note: $C_{p,d}^2(x) = L_{p,d}^2(x) / S_p(x)$.

[Figure: a point x with base point p and direction d, cut points a1 and a2, and the barrel radius B_pd(x); GapLower, GapUpper and gapBarrel are marked. No gaps show on the red, blue or green projection lines.]
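The functionals above can be sketched in a few lines. This is an illustration only: it assumes d is a unit vector (so that B is the squared distance to the d-line and C is a cosine), uses plain tuples rather than pTrees, and the names `lin`, `sph`, `bar`, `cone` and `ofl_clusters` are hypothetical. `ofl_clusters` splits the data wherever consecutive projections differ by more than T, which is one simple reading of the OFL gap search.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def _dot(u: Point, v: Point) -> float:
    return sum(a * b for a, b in zip(u, v))


def _sub(u: Point, v: Point) -> List[float]:
    return [a - b for a, b in zip(u, v)]


def lin(x: Point, p: Point, d: Point) -> float:
    """L_{p,d}(x) = (x-p) o d"""
    return _dot(_sub(x, p), d)


def sph(x: Point, p: Point) -> float:
    """S_p(x) = (x-p) o (x-p)"""
    w = _sub(x, p)
    return _dot(w, w)


def bar(x: Point, p: Point, d: Point) -> float:
    """B_{p,d}(x) = S_p(x) - L_{p,d}(x)^2 (squared barrel radius for unit d)"""
    return sph(x, p) - lin(x, p, d) ** 2


def cone(x: Point, p: Point, d: Point) -> float:
    """C_{p,d}(x) = L_{p,d}(x) / sqrt(S_p(x)); defined as 0 at x = p here."""
    s = sph(x, p)
    return lin(x, p, d) / math.sqrt(s) if s > 0 else 0.0


def ofl_clusters(points: List[Point], p: Point, d: Point, T: float) -> List[List[int]]:
    """Group point indices; start a new group at every projection gap > T."""
    if not points:
        return []
    order = sorted(range(len(points)), key=lambda i: lin(points[i], p, d))
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if lin(points[nxt], p, d) - lin(points[prev], p, d) > T:
            clusters.append(current)
            current = []
        current.append(nxt)
    clusters.append(current)
    return clusters


pts = [(0, 0), (1, 0), (2, 1), (10, 0), (11, 1)]
groups = ofl_clusters(pts, p=(0, 0), d=(1, 0), T=3)
```

Each functional is a sum of per-column terms, so the partial sums could be computed on separate nodes and added at home, in line with the data-parallel remark above.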
OF LB...LB Clustering on Concrete(STrength, ConcreteMix, WAter, FineAggregate, AGgregate). Assess ST error with classes L<40, M<60, H.
If 1st B radius >> 0, use p = min_radius_pt.
p = (140, 192, 807, 3), T = MGW = 12, d = x-n = (.58, .15, .58, .53).

[Spreadsheet residue: the CONCRETE table (columns ST, CM, WA, FA, AG), the per-cluster tables labelled "(x-p)od/4 gp3" and "Br/4 gp3" (value, count and gap columns) for clusters C0 through C411, and the per-cluster L/M/H counts.]

Cluster dendrogram (w/o purity): c0 c1 c2 c3 c4 c31 c32 c33 c11 c12 c21 c22 c23 c24 c25 c26 c27 c251 c211 c231 c241 c2411

OF LB...LB Clustering on SEEDS(cls, area, lnker, acoef, lnkrgv).
If 1st B radius >> 0, use p = min_radius_pt.
p = (15, 6, 2, 5), d = (0.5, 0.2, 0.8, 0).

[Spreadsheet residue: the SEEDS table (columns Cls, Area, Lnke, Acoe, Lnk; classes 1, 2, 3), the tables labelled "(x-p)od*10 Ct Gp3", "Br*10 gp3" and "10(x-p)od g3" for clusters c1 and c2, and the per-cluster class counts.]
Oblique FAUST Pipe Clustering on SEEDS(cls, area, lnker, acoef, lnkrgv), p = avg, q = vom

Always start with linear analysis, then:
1. Project the inside of a pipe (small radius) on the d-line.
2. In a linear gapped region, increase the radius until a radial gap appears.
3. Increase the linear region width until cap gaps appear.
4. Mask off that cluster.
5. GOTO 1 (and revise p, d) here, or if either 2 or 3 fails to materialize gaps.

Pipe radius = 1.

R    Ct   gp
0     5    1
1    22    1
2    41    1
3    48    1
4    22    1
5     6    1
6     6

Find LinGaps:

xod  Ct   gp
4    15    1
5     9    1
6     3

The fact that there are no good pipe gaps may mean exit. Start over with First last.

First last:

xod  Ct   gp
2     3    1
3     5    2
5     4    2
7     1    1
8     1    2
10    1

This is unfinished (ran out of time). I also tried Spherical when it appeared from the pipe analysis that we were at the center of a cluster. So far this didn't work out.
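Step 1 and the value/count/gap tables can be sketched as follows. This is a hedged illustration, assuming d is a unit vector, projections are truncated to integers after an optional scale factor, and a point is "inside the pipe" when its squared distance to the d-line is at most radius squared. The names `pipe_gap_table` and `gaps_over` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Row = Tuple[int, int, Optional[int]]   # (value, count, gap to next value)


def _proj(x: Point, p: Point, d: Point) -> float:
    return sum((a - b) * c for a, b, c in zip(x, p, d))


def _radius_sq(x: Point, p: Point, d: Point) -> float:
    s = sum((a - b) ** 2 for a, b in zip(x, p))
    return s - _proj(x, p, d) ** 2


def pipe_gap_table(points: List[Point], p: Point, d: Point,
                   radius: float, scale: float = 1.0) -> List[Row]:
    """Project the points inside the pipe onto the d-line and tabulate gaps."""
    inside = [x for x in points if _radius_sq(x, p, d) <= radius ** 2]
    # int() truncates toward zero, mirroring "we don't round"
    counts = Counter(int(scale * _proj(x, p, d)) for x in inside)
    values = sorted(counts)
    rows: List[Row] = []
    for i, v in enumerate(values):
        gap = values[i + 1] - v if i + 1 < len(values) else None
        rows.append((v, counts[v], gap))
    return rows


def gaps_over(rows: List[Row], T: int) -> List[Tuple[int, int]]:
    """(low, high) pairs of consecutive values whose gap exceeds T."""
    return [(v, v + g) for v, _, g in rows if g is not None and g > T]


pts = [(0, 0.1), (0.5, 0), (1, 0), (5, 0.2), (5.5, 3), (9, 0)]
table = pipe_gap_table(pts, p=(0, 0), d=(1, 0), radius=1)
wide = gaps_over(table, T=3)
```

Step 2 would then rerun the same table with a larger `radius` inside one of the gapped regions; that loop is left out of the sketch.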
Q&A

f = distance-dominated functional. avgGap = (f_max - f_min) / |f(X)| may be a good measurement for setting thresholds, e.g., x is an outlier (= anomaly) if the gap around {x} > 3*avgGap?

If the minimum barrel radii >> 0, we have chosen a d-line far from the data. It may be advisable to pick p to be an actual data point.

Here are the formulas from the spreadsheet:
  G = (B12-B$6)*B$9+(C12-C$6)*C$9+(D12-D$6)*D$9+(E12-E$6)*E$9
  H = G12-$G$9                                      L = (x-p)od - min
  I = (B12-B$6)^2+(C12-C$6)^2+(D12-D$6)^2+(E12-E$6)^2
  J = @SQRT(I12-G12^2)                              B = SQRT[(x-p)o(x-p) - ((x-p)od)^2]

Note we don't round, so we are calculating pTree bitslices by truncating. We don't even need to do that! For fixed point, here are the bitslice formulas:
  @MOD(@INT(F/2^6),2)
  @MOD(@INT(F/2^5),2)
  @MOD(@INT(F/2^4),2)
  @MOD(@INT(F/2^3),2)
  @MOD(@INT(F/2^2),2)
  @MOD(@INT(F/2^1),2)
  @MOD(@INT(F/2^0),2)
Keep going (take bitslices to the right of the decimal point):
  @MOD(@INT(F/2^-1),2)
  @MOD(@INT(F/2^-2),2)
  ...

Floating point? Bitslice the mantissa. The exponent shifts the slice name. E.g., .1011 × 2^5, .0010 × 2^4, .1010 × 2^-1.

If d and t are trained over DocumentTerm (DT), Gradient(F) = G = (G_d, G_t). Instead of a LineSearch using F(s) = f + sG, always use a 2D-RectangleSearch, F(s_d, s_t) = F(f + s_d*G_d + s_t*G_t). Set ∂F/∂s_d = 0 and ∂F/∂s_t = 0.

It may be a better approach to find dense cells (sphere, barrel, cone) and then fuse them, because it is difficult to position them around clusters (due to bumps, protrusions, etc.). (Not true for outlier clusters (singleton\doubleton).)

An Alg: Start with a line and a small-radius barrel around it. Find dense regions between 2 consecutive gaps in this pipe. This should identify a portion of a dense cluster. Lots of ways to go from there:
a. Use the centroid of the dense pipe piece as the sphere|barrel center.
b. Move to a better centroid for that cluster by a gradient asc/desc process.
c. In a "GA mutation" fashion, jump to a nearby centroid, governed by some fitness function (e.g., count in dense pipe piece).
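The spreadsheet bitslice formula translates directly into code. A minimal sketch, assuming non-negative values (so that truncation and floor agree) and using the hypothetical names `bitslice`, `bitslices` and `from_bitslices`; the sample values are the three floating point examples above written out as 22, 2 and 5/16.

```python
import math
from typing import Dict, List


def bitslice(values: List[float], k: int) -> List[int]:
    """Bit for 2**k of every row: MOD(INT(F / 2**k), 2); k may be negative."""
    return [int(math.floor(f / 2.0 ** k)) % 2 for f in values]


def bitslices(values: List[float], high: int, low: int) -> Dict[int, List[int]]:
    """All slices from 2**high down to 2**low, keyed by exponent."""
    return {k: bitslice(values, k) for k in range(high, low - 1, -1)}


def from_bitslices(slices: Dict[int, List[int]]) -> List[float]:
    """Rebuild the (truncated) values from their slices, as a sanity check."""
    depth = len(next(iter(slices.values())))
    return [sum(bits[r] * 2.0 ** k for k, bits in slices.items())
            for r in range(depth)]


# .1011 x 2^5 = 22, .0010 x 2^4 = 2, .1010 x 2^-1 = 5/16
vals = [22.0, 2.0, 0.3125]
slices = bitslices(vals, high=4, low=-4)
rebuilt = from_bitslices(slices)
```

Slices below the lowest exponent are simply dropped, which is the truncation (rather than rounding) the note refers to.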
Bitslices of the three example values 22 = 10110., 2 = 10., 5/16 = .01010:

slice   22   2   5/16
2^4      1   0    0
2^3      0   0    0
2^2      1   0    0
2^1      1   1    0
2^0      0   0    0
2^-1     0   0    0
2^-2     0   0    1
2^-3     0   0    0
2^-4     0   0    1

SSPTS = set of all SPTSs (columns of reals); V = n-dim vector space. Code operations on SSPTS (both 1-level and multi-level):

SSPTS × SSPTS → SSPTS (Binary Algebraic Operations), including: +, -, /, RWP = Row_Wise_Product.

{SPTS_k}_{k=1..n} → SSPTS (Unary operations; typically SPTS_k = V_k), including:
  SDv (Square Distance from a fixed vector, $v \in V$)
  DPv (Dot Product with a fixed vector, $v \in V$)
  ERa = FP's EinRings (n=1, $r \in R$): result masks rows s.t. row < a

SPTS → R includes AGa = YC's Aggregates and iceberg queries: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries.

SSPTS → SSPTS (Unary Operations), including: SPc = Scalar_Product (multiply each SPTS row by the same constant, c). Use a const SPTS (all rows = c), then RWP? More efficient without forming the const SPTS? Use c's bit pattern only? (Subset of the previous with n = |SSPTS|?)

Note, SSPTS includes SPTSs of all cardinalities (= depths = # of rows). It seems best to code on SSPTS rather than on SSPTS^n (card(SPTS) = n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SPTS operands are of different depths, the result SPTS's depth = depth of the shallowest operand (operate from the top of each).

Gap analytic tools: $L(x) = x \circ d$, $S(x) = (x-p) \circ (x-p)$, and then from those, $B(x) = S(x) - L^2(x)$. (If T is the minimum gap threshold, use $T^2$ for S and B.)

Oblique FAUST, Barrel (OFLB): Alternate L_pq(x), B_pq(x) to get a cluster dendrogram (top-down). Take p = 1st_TR pt? d=vomavg

Defining Avg Density? $AvD = count / \prod_{k=1..dim} (max_k - min_k)$? This is for choosing good thresholds.
MinGapThres $= T_{b,AvD} \equiv b \cdot (1/AvD)^{1/dim}$, b = adjustable parameter.

If we're given a TrainingSet, TR, with K classes, is $avg_{k=1..K}\, vom_k$ a better medoid than VoM? Is taking p = MinCorner, q = MaxCorner of the box circumscribing $\{VoM_k\}_{k=1..K}$ better than the non-circumscribing box of TR?
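The average-density threshold can be worked through numerically. A small sketch, assuming AvD is the row count divided by the volume of the bounding box (the product of the per-column ranges) and that every column has a nonzero range; `average_density` and `min_gap_threshold` are hypothetical names.

```python
from typing import List, Sequence

Point = Sequence[float]


def average_density(points: List[Point]) -> float:
    """AvD = count / product over columns of (max_k - min_k)."""
    dim = len(points[0])
    volume = 1.0
    for k in range(dim):
        column = [x[k] for x in points]
        extent = max(column) - min(column)
        if extent == 0:
            raise ValueError("column %d has zero range; AvD is undefined" % k)
        volume *= extent
    return len(points) / volume


def min_gap_threshold(points: List[Point], b: float = 1.0) -> float:
    """T_{b,AvD} = b * (1 / AvD) ** (1 / dim), with b an adjustable parameter."""
    dim = len(points[0])
    return b * (1.0 / average_density(points)) ** (1.0 / dim)


corners = [(0, 0), (4, 0), (0, 4), (4, 4)]   # 4 points in a 4 x 4 box
avd = average_density(corners)               # 4 / 16
threshold = min_gap_threshold(corners, b=1.0)
```

With four points in a 4 × 4 box, 1/AvD is the area per point (4), and its square root (2) is the side of the square each point would occupy if the data were spread evenly, which is why it is a natural scale for a minimum gap.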