y is declared to be class=k iff $y \in Hull_k$, where $Hull_k = \{ z \mid l_{D,k} \le D \circ z \le h_{D,k} \text{ for all } D \}$.

FAUST Oblique Analytics (based on the dot product, $\circ$). Given a table X(X1..Xn) with |X| = N and vectors D = (D1..Dn), FAUST Oblique employs the ScalarPTreeSets (SPTS) of the valueTrees $X \circ D \equiv \sum_{k=1..n} X_k D_k$.

FP (FAUST Polygon, for k-class classification, k=1..): $X_{n+1}$ = ClassLabel. For every $D \in Dset$, let $l_{D,k} \equiv \min(C_k \circ D)$ (1st PCI?) and $h_{D,k} \equiv \max(C_k \circ D)$ (last PCD?). y is declared to be class=k iff $y \in Hull_k$, where $Hull_k = \{ z \mid l_{D,k} \le D \circ z \le h_{D,k} \text{ for all } D \}$. (If y is in multiple hulls, $H_{i_1} .. H_{i_h}$, then y isa $C_k$ for the k maximizing OneCount$\{P_{C_k} \,\&\, P_{H_{i_1}} .. \&\, P_{H_{i_h}}\}$, or fuzzy classify using those OneCounts as k-weights.)

NextD is a sequence of D's, used when recursively partitioning X into clusters (constructing a Cluster Dendrogram for X), e.g.
a. recursively, take the diagonal maximizing Standard Deviation, STD($C \circ D$) [or maximizing STD($C \circ D$)/Spread($C \circ D$)];
b. recursively, take AM($C \circ D$) $\equiv$ Avg-to-Median; AFFA($C \circ D$) $\equiv$ Avg-FurthestFromAvg; FFAFFFFA($C \circ D$) $\equiv$ FFA-FurthestFromFFA;
c. recursively cycle through the diagonals e1,...,en, e1e2.., or cycle through AM, AFFA, FFAFFFFA, or cycle through both sets.

FC (FAUST Count Change clusterer): Choose Density (DT), DensityUniformity (DUT) and PrecipitousCountChange (PCCT) thresholds. If DT (and DUT) are not exceeded at a cluster C, partition C by cutting at each gap and/or PCC in $C \circ D$, using nextD.
FCG cuts in the middle of $C \circ D$ gaps (only). (This is the old version. It might be faster, but it usually chokes on big data.)
FCP cuts at PCCs (gaps are PCC-cuts, of course).

Outlier Mining: Find the top k objects most dissimilar from the rest of the objects. This might mean:
1.a Find $\{x_h \mid h=1..k\}$ such that $x_h$ maximizes distance($x_h$, $X - \{x_j \mid j \le h\}$).
1.b Find the top set of k objects, $S_k$, that maximizes distance($X - S_k$, $S_k$).
2. Given a Training Set, X, identify outliers in each class (correctly classified but noticeably dissimilar from classmates), or fuzzy cluster X, i.e., assign a weight to each (object, cluster) pair. Then x isa outlier iff w(x,k) < OutlierThreshold for all k.
3. Examine individual new samples for outlierhood, assuming they come in after normalcy has been established by 1 or 2.

FCO uses FC as an outlier miner. It identifies and removes large clusters using FCP, so outliers reveal themselves.

FDO (FAUST Distance-based Outlier Miner) uses D2NN(x) = SquareDistance(x, X-{x}) = $rank_{N-1}\big((x-X) \circ (x-X)\big)$, i.e. the smallest squared distance from x to any other row (rank N over all of X would be the zero distance from x to itself). D2NN provides an instantaneous k-slider for 1.a (useful for the others too). Instantaneous? UDR on D2NN takes $\log_2 n$ time (and is a 1-time calculation); a k-slider then works instantaneously off that distribution. There is no need to sort D2NN.

Dset is a set of D's used to build a model for fast classification (1-class or k-class) by circumscribing each class with a hull. The larger the Dset the better (for accuracy). For every D there is, however, the 1-time construction cost of $l_{D,k}$ and $h_{D,k}$. Dset should include $D_{A_{i,j}} \equiv$ the connector from Avg($C_i$) to Avg($C_j$), i>j=1..k [and also the Median connectors?]. Should Dset include all $D \in$ nextD? (Note: The old version used Dset $\equiv \{D_{A_{i,j}} \mid i>j=1..k\}$ only.)
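The FP hull rule can be illustrated with a minimal sketch, assuming plain Python lists stand in for the pTree-based SPTS columns. The names `fit_hulls` and `classify`, and the toy Dset in the usage note, are hypothetical and not part of FAUST itself. The sketch stores $l_{D,k}$ and $h_{D,k}$ for every D in Dset and returns every class whose hull contains y (an empty list means y falls in no hull).

```python
def dot(u, v):
    """Plain dot product; stands in for the pTree-based X o D."""
    return sum(a * b for a, b in zip(u, v))


def fit_hulls(rows, labels, dset):
    """For each class k and each D in dset, record (l_{D,k}, h_{D,k}),
    the min and max of C_k o D over the training rows of class k."""
    hulls = {}
    for x, k in zip(rows, labels):
        bounds = hulls.setdefault(k, {})
        for i, d in enumerate(dset):
            v = dot(x, d)
            lo, hi = bounds.get(i, (v, v))
            bounds[i] = (min(lo, v), max(hi, v))
    return hulls


def classify(y, hulls, dset):
    """Return all classes k with l_{D,k} <= D o y <= h_{D,k} for all D."""
    projections = [dot(y, d) for d in dset]
    return [
        k
        for k, bounds in hulls.items()
        if all(bounds[i][0] <= p <= bounds[i][1]
               for i, p in enumerate(projections))
    ]
```

For example, with two toy classes `a = [(1, 1), (2, 2), (1, 2)]` and `b = [(8, 8), (9, 9), (8, 9)]` and an assumed Dset of `[(1, 0), (0, 1), (1, 1)]`, the point `(1.5, 1.5)` lands only in the hull of `a`, while `(5, 5)` lands in neither. Tie-breaking by OneCount when several hulls match is left out of this sketch.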

$X \circ D = \sum_{k=1..n} X_k * D_k$, computed on bit slices. Write $X_k = \sum_{b=0..B} 2^b p_{k,b}$ (the pTrees of column k) and $D_k = \sum_{b=0..B} 2^b D_{k,b}$ (the bits of $D_k$). Then

$$X_k * D_k = D_k \sum_{b} 2^b p_{k,b} = 2^B D_k p_{k,B} + \dots + 2^0 D_k p_{k,0} = (2^B D_{k,B} + \dots + 2^0 D_{k,0})(2^B p_{k,B} + \dots + 2^0 p_{k,0})$$

$$\begin{aligned}
X \circ D = \sum_{k=1..n} \Big[\; & 2^{2B}\, D_{k,B}\, p_{k,B} \\
 +\; & 2^{2B-1} ( D_{k,B}\, p_{k,B-1} + D_{k,B-1}\, p_{k,B} ) \\
 +\; & 2^{2B-2} ( D_{k,B}\, p_{k,B-2} + D_{k,B-1}\, p_{k,B-1} + D_{k,B-2}\, p_{k,B} ) \\
 +\; & 2^{2B-3} ( D_{k,B}\, p_{k,B-3} + D_{k,B-1}\, p_{k,B-2} + D_{k,B-2}\, p_{k,B-1} + D_{k,B-3}\, p_{k,B} ) \\
 +\; & \dots \\
 +\; & 2^{3} ( D_{k,3}\, p_{k,0} + D_{k,2}\, p_{k,1} + D_{k,1}\, p_{k,2} + D_{k,0}\, p_{k,3} ) \\
 +\; & 2^{2} ( D_{k,2}\, p_{k,0} + D_{k,1}\, p_{k,1} + D_{k,0}\, p_{k,2} ) \\
 +\; & 2^{1} ( D_{k,1}\, p_{k,0} + D_{k,0}\, p_{k,1} ) \\
 +\; & 2^{0}\, D_{k,0}\, p_{k,0} \;\Big]
\end{aligned}$$

The result is a set of pTrees $q_N, \dots, q_0$, where $N$ grows as $2B + \lceil \log_2 n \rceil$ plus a small constant.

Example with B=1 and n=2:

$$X \circ D = 2^2 ( D_{1,1} p_{1,1} + D_{2,1} p_{2,1} ) + 2^1 ( D_{1,1} p_{1,0} + D_{1,0} p_{1,1} + D_{2,1} p_{2,0} + D_{2,0} p_{2,1} ) + 2^0 ( D_{1,0} p_{1,0} + D_{2,0} p_{2,0} )$$

With every bit of D equal to 1, i.e. D = (3, 3), this becomes

$$X \circ D = 2^2 ( p_{1,1} + p_{2,1} ) + 2^1 ( p_{1,0} + p_{1,1} + p_{2,0} + p_{2,1} ) + 2^0 ( p_{1,0} + p_{2,0} )$$

The result pTrees are built level by level, low order first:
q0 = low bit of raw0 (no carry in); carry0 = the rest.
q1 = low bit of carry0 + raw1; carry1 = the rest.
q2 = low bit of carry1 + raw2; carry2 = the rest.
q3 = carry2; carry3 = the rest.

[Figure: bit columns of X, of the pTrees $p_{k,b}$, and of q0..q3 with their carries for the B=1 example.]

A carryTree is a valueTree (vTree), as is the rawTree at each level (rawTree = valueTree before the carry is included). In what form is it best to carry the carryTree over (for speediest processing)?
1. As multiple pTrees added in at the next level? (The pTrees at the next level are already in that form and need to be added.)
2. As an SPTS, s1? (The next-level rawTree is an SPTS, s2. Are $q_{next\_level}$ and $carry_{next\_level}$ then obtained from the low-order slices $s1_0$ and $s2_0$?)
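The level-by-level addition with carries can be sketched as follows. This is a minimal illustration, assuming each pTree is modelled as an uncompressed bit vector held in a Python int (bit r belongs to row r) and D has non-negative integer components; the helper names `slices_of`, `add_slice`, `dot_product_slices` and `decode` are hypothetical. Each partial product $D_{k,a} p_{k,b}$ is added at level a+b using only AND and XOR, with the carry rippling upward.

```python
def slices_of(column, nbits):
    """Bit slices p_0..p_{nbits-1} of a column; bit r of slice b is bit b of row r."""
    out = []
    for b in range(nbits):
        mask = 0
        for r, v in enumerate(column):
            mask |= ((v >> b) & 1) << r
        out.append(mask)
    return out


def add_slice(acc, operand, level):
    """Add bit vector `operand`, weighted 2**level, into the result slices `acc`.
    XOR gives the sum bit, AND gives the carry passed to the next level."""
    carry = operand
    while carry:
        while len(acc) <= level:
            acc.append(0)
        acc[level], carry = acc[level] ^ carry, acc[level] & carry
        level += 1


def dot_product_slices(columns, d, nbits):
    """Return result slices q_0, q_1, ... of X o D (no integer multiplications)."""
    acc = []
    for column, dk in zip(columns, d):
        p = slices_of(column, nbits)
        for a in range(nbits):
            if (dk >> a) & 1:          # D_{k,a} = 1 selects p_{k,b} at level a+b
                for b in range(nbits):
                    add_slice(acc, p[b], a + b)
    return acc


def decode(acc, nrows):
    """Rebuild the per-row integers from the result slices (for checking only)."""
    return [sum(((s >> r) & 1) << j for j, s in enumerate(acc))
            for r in range(nrows)]
```

With `nbits=2` (that is, B=1), columns `[[1, 2, 0], [3, 1, 1]]` and `d=(3, 3)`, decoding the result slices gives the same values as the ordinary dot product of each row with D. The sketch adds one operand at a time, which corresponds to option 1 above; it does not address which carry representation is fastest on real pTrees.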
FC Clusterer: If DT (and/or DUT) are not exceeded at C, partition C further by cutting at each gap and PCC in $C \circ D$.

For a table X(X1...Xn), the SPTS $X_k * D_k$ is the column of numbers $x_k * D_k$. $X \circ D$ is the sum of those SPTSs, $\sum_{k=1..n} X_k * D_k$. So DotProduct involves just multi-operand pTree addition (no SPTSs and no multiplications). Engineering shortcut tricks would be huge!
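The gap-cutting step of the FC clusterer (the FCG variant, which cuts in the middle of gaps only) can be sketched as below. This is an illustration under stated assumptions: a "gap" is taken to be a difference of at least `min_gap` between consecutive distinct projection values, and `min_gap`, `gap_cut_points` and `partition_by_gaps` are hypothetical names. The density thresholds and PCC cuts are not modelled.

```python
def gap_cut_points(projections, min_gap):
    """Midpoints of every gap of width >= min_gap between
    consecutive distinct values of C o D."""
    vals = sorted(set(projections))
    return [(lo + hi) / 2
            for lo, hi in zip(vals, vals[1:])
            if hi - lo >= min_gap]


def partition_by_gaps(points, d, min_gap):
    """Split row indices of `points` into sub-clusters by cutting
    the projection onto d at each gap midpoint."""
    proj = [sum(x * w for x, w in zip(pt, d)) for pt in points]
    cuts = gap_cut_points(proj, min_gap)
    clusters = {}
    for idx, v in enumerate(proj):
        key = sum(v > c for c in cuts)   # number of cut points below v
        clusters.setdefault(key, []).append(idx)
    return [clusters[key] for key in sorted(clusters)]
```

For the points `[(1, 1), (2, 1), (1, 2), (9, 9), (10, 9)]` projected onto `d = (1, 1)` with `min_gap = 3`, the single cut falls at 10.5 and the rows split into `[0, 1, 2]` and `[3, 4]`. A recursive clusterer would call this again on each part with the next D from nextD.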

FDO: the D2NN calculation. For a table X(X1...Xn), with $a_{k,b} \equiv x_{k,b} - p_{k,b}$:

$$D2NN(x) = \sum_{k=1..n} (x_k - X_k)(x_k - X_k) = \sum_{k=1..n} \Big( \sum_{b=B..0} (2^b x_{k,b} - 2^b p_{k,b}) \Big)^2 = \sum_{k=1..n} \Big( \sum_{b=B..0} 2^b a_{k,b} \Big)^2$$

$$\begin{aligned}
(2^B a_{k,B} + 2^{B-1} a_{k,B-1} + \dots + 2^0 a_{k,0})^2 =\; & 2^{2B}\, a_{k,B} a_{k,B} \\
 +\; & 2^{2B-1} ( a_{k,B} a_{k,B-1} + a_{k,B-1} a_{k,B} ) && \{ = 2^{2B} a_{k,B} a_{k,B-1} \} \\
 +\; & 2^{2B-2} ( a_{k,B} a_{k,B-2} + a_{k,B-1} a_{k,B-1} + a_{k,B-2} a_{k,B} ) && \{ = 2^{2B-1} a_{k,B} a_{k,B-2} + 2^{2B-2} a_{k,B-1}^2 \} \\
 +\; & 2^{2B-3} ( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} + a_{k,B-2} a_{k,B-1} + a_{k,B-3} a_{k,B} ) && \{ = 2^{2B-2} ( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} ) \} \\
 +\; & 2^{2B-4} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} + a_{k,B-2} a_{k,B-2} + a_{k,B-3} a_{k,B-1} + a_{k,B-4} a_{k,B} ) && \{ = 2^{2B-3} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} ) + 2^{2B-4} a_{k,B-2}^2 \} \\
 +\; & \dots
\end{aligned}$$

Collecting like powers of 2:

$$\begin{aligned}
=\; & 2^{2B} ( a_{k,B}^2 + a_{k,B} a_{k,B-1} ) + 2^{2B-1} ( a_{k,B} a_{k,B-2} ) + 2^{2B-2} ( a_{k,B-1}^2 + a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} ) \\
 & + 2^{2B-3} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} ) + 2^{2B-4} a_{k,B-2}^2 + \dots
\end{aligned}$$

D2NN yields a 1.a-type outlier detector (top k objects, x, by dissimilarity from X-{x}). We install in D2NN each min[D2NN(x)]. (It is a one-time construction, but for a trillion x's it is slow. Parallelization?)

Does D2NN involve just multi-operand pTree addition (or SPTSs, multiplication)? Notes: when $x_{k,b}=1$, $a_{k,b} = p'_{k,b}$, and when $x_{k,b}=0$, $a_{k,b} = -p_{k,b}$. So D2NN has just multi-operand pTree multiplications/additions/subtractions! Of course, each entry in D2NN (each $x \in X$) is a separate [parallelizable] calculation. Should we pre-compute all $p_{k,i} * p_{k,j}$, $p'_{k,i} * p'_{k,j}$ and $p_{k,i} * p'_{k,j}$?

Data volumes to keep in mind:
U.S. Library of Congress is archiving all tweets sent since 2006. USLOCTweetTable may have 1 million trillion rows and 50 columns. Volume: 172 billion tweets in 2013 alone (~300 each from 500 million tweeters). Currently > 20 million tweets/hour, 24 hours/day, seven days/week. A tweet is 140 characters. There are 50 fields (Who wrote it, Where, When, To Whom, ...).
Enron Dataset: Volume 16GB. 1,000,000 rows. 100,000 columns (terms).
Drone data? Maybe just RGB (3 columns) and trillions of rows (one for each pixel each hour for 10 years). Each pixel is GPS located (would we sort by location before pTree-izing?).

Is subtraction just a matter of flipping the sign bit and adding, Md?
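The D2NN score and the bit-level identity for $a_{k,b}$ can be checked with a small brute-force sketch. It assumes non-negative integer coordinates that fit in `nbits` bits and uses plain Python loops in place of pTree operations; `diff_from_bits`, `d2nn` and `top_k_outliers` are hypothetical names. The sort in `top_k_outliers` is a stand-in for the UDR-based k-slider, which the notes say needs no sort.

```python
def diff_from_bits(x, v, nbits):
    """x - v computed bit by bit as sum_b 2^b * a_b, where
    a_b = 1 - p_b (the complement p'_b) if x_b = 1, and a_b = -p_b if x_b = 0."""
    total = 0
    for b in range(nbits):
        x_b = (x >> b) & 1
        p_b = (v >> b) & 1
        a_b = (1 - p_b) if x_b else -p_b
        total += a_b * (1 << b)
    return total


def d2nn(points, nbits):
    """Squared distance from each point to its nearest other point."""
    scores = []
    for i, x in enumerate(points):
        scores.append(min(
            sum(diff_from_bits(xk, yk, nbits) ** 2 for xk, yk in zip(x, y))
            for j, y in enumerate(points) if j != i
        ))
    return scores


def top_k_outliers(points, k, nbits):
    """Indices of the k points with the largest D2NN (type 1.a outliers)."""
    scores = d2nn(points, nbits)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

For `points = [(0, 0), (1, 0), (0, 1), (10, 10)]` with `nbits=4`, the three clustered points each have D2NN equal to 1 and the isolated point has 181, so it is the top-1 outlier. Each row's score is computed independently, which is the parallelizable structure noted above.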

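pTree Rank(K) computation. Rank(K) returns the K-th largest value in a column. With N rows, Rank(N-1) gives the 2nd smallest value, which is very useful in outlier analysis. In the procedure below, i runs over bit positions from the high-order bit B down to 0, and $P_i$ is the pTree of bit i.

    RankVal = 0; p = K; c = 0; P = Pure1;   /* RankPts are also returned, as the resulting pTree P */
    For i = B to 0 {
        c = Count(P & P_i);
        If (c >= p) { RankVal = RankVal + 2^i;  P = P & P_i }
        else        { p = p - c;                P = P & P'_i }
    };
    return RankVal, P;

Example: N=7 rows, 4 bits (B=3), K = N-1 = 7-1 = 6 (looking for the 6th highest = 2nd lowest value). Cross out the 0-positions of P at each step.

    row    X    P4,3  P4,2  P4,1  P4,0
    #1    10     1     0     1     0
    #2     5     0     1     0     1
    #3     6     0     1     1     0
    #4     7     0     1     1     1
    #5    11     1     0     1     1
    #6     9     1     0     0     1
    #7     3     0     0     1     1

(i=3) c = Count(P & P4,3) = 3 < 6, so p = 6-3 = 3 and P = P & P'4,3, which masks off the highest 3 values (bit value 8). Result bit {0}.
(i=2) c = Count(P & P4,2) = 3 >= 3, so P = P & P4,2, which masks off the lowest 1 value (bit value 4). Result bit {1}.
(i=1) c = Count(P & P4,1) = 2 < 3, so p = 3-2 = 1 and P = P & P'4,1, which masks off the highest 2 remaining values (6 and 7). Result bit {0}.
(i=0) c = Count(P & P4,0) = 1 >= 1, so P = P & P4,0. Result bit {1}.

RankVal = $2^3 * 0 + 2^2 * 1 + 2^1 * 0 + 2^0 * 1 = 5$. P = MapRankKPts = 0 1 0 0 0 0 0; ListRankKPts = {2}.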

Suppose MinVal is duplicated (occurs at two points). What does the algorithm return? The same Rank(K) procedure is applied, again with K=6.

    row    X    P4,3  P4,2  P4,1  P4,0
    #1    10     1     0     1     0
    #2     5     0     1     0     1
    #3     6     0     1     1     0
    #4     3     0     0     1     1
    #5    11     1     0     1     1
    #6     9     1     0     0     1
    #7     3     0     0     1     1

1. Ct(P & P4,3) = 3 < 6, so P = P & P'4,3 (masks off the highest 3, bit value 8) and p = 6-3 = 3. Result bit {0}.
2. Ct(P & P4,2) = 2 < 3, so P = P & P'4,2 (masks off the highest 2, bit value 4) and p = 3-2 = 1. Result bit {0}.
3. Ct(P & P4,1) = 2 >= 1, so P = P & P4,1. Result bit {1}.
4. Ct(P & P4,0) = 2 >= 1, so P = P & P4,0. Result bit {1}.

RankVal = $2^3 * 0 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 3$ = MinVal = rank(N-1)Val. The mask P gives MinPts = rank(N-1)Pts = {#4, #7}: both occurrences are returned.

Suppose MinVal is triplicated (occurs at three points). What does the algorithm return? The same Rank(K) procedure is applied, again with K=6.

    row    X    P4,3  P4,2  P4,1  P4,0
    #1    10     1     0     1     0
    #2     3     0     0     1     1
    #3     6     0     1     1     0
    #4     3     0     0     1     1
    #5    11     1     0     1     1
    #6     9     1     0     0     1
    #7     3     0     0     1     1

1. Ct(P & P4,3) = 3 < 6, so P = P & P'4,3 (masks off the highest 3, bit value 8) and p = 6-3 = 3. Result bit {0}.
2. Ct(P & P4,2) = 1 < 3, so P = P & P'4,2 (masks off the highest 1, bit value 4) and p = 3-1 = 2. Result bit {0}.
3. Ct(P & P4,1) = 3 >= 2, so P = P & P4,1. Result bit {1}.
4. Ct(P & P4,0) = 3 >= 2, so P = P & P4,0. Result bit {1}.

RankVal = $2^3 * 0 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 3$ = MinVal. The mask P gives MinPts = {#2, #4, #7}, the three rows holding the value 3.
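The Rank(K) procedure and the three worked examples can be checked with a short sketch. It assumes each pTree is modelled as an uncompressed bit vector stored in a Python int (bit r belongs to row r+1), with Count implemented as a population count; `bit_slices` and `rank_k` are hypothetical names, not an existing pTree API.

```python
def bit_slices(values, nbits):
    """slices[b] is a bit vector whose bit r is bit b of values[r]."""
    slices = []
    for b in range(nbits):
        mask = 0
        for r, v in enumerate(values):
            mask |= ((v >> b) & 1) << r
        slices.append(mask)
    return slices


def rank_k(values, k, nbits):
    """Return (K-th largest value, 1-based row numbers holding it)."""
    slices = bit_slices(values, nbits)
    P = (1 << len(values)) - 1          # Pure1: every row still a candidate
    p, rank_val = k, 0
    for b in range(nbits - 1, -1, -1):  # high-order bit first
        hit = P & slices[b]
        c = bin(hit).count("1")         # Count(P & P_b)
        if c >= p:
            rank_val += 1 << b
            P = hit                     # keep rows with bit b set
        else:
            p -= c
            P &= ~slices[b]             # keep rows with bit b clear
    rows = [r + 1 for r in range(len(values)) if (P >> r) & 1]
    return rank_val, rows


print(rank_k([10, 5, 6, 7, 11, 9, 3], 6, 4))   # (5, [2])
print(rank_k([10, 5, 6, 3, 11, 9, 3], 6, 4))   # (3, [4, 7])
print(rank_k([10, 3, 6, 3, 11, 9, 3], 6, 4))   # (3, [2, 4, 7])
```

When the rank value is duplicated, the final mask P marks every row holding that value, so ties come back together without any extra pass.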
