Standard k-means Clustering: Assume X = {x1, x2, ..., xn} is to be divided into k clusters. Select k centroids. Assign each point to the closest centroid. Calculate the mean (= centroid) of each new cluster. Iterate until stop_condition = true.

pk-means: Same as above, but the means are calculated without scanning (using one pTree formula).

mpk-means: Mohammad's pk-means. Below, i = 1...k.
1. Pick k centroids, {Ci}, i = 1..k.
2. Calculate distances, Di = D(X, Ci); the result of each is a PTreeSet of d pTrees. (d = bitwidth of diameter(X)? or?) (Md creates one width=d Distances_PTreeSet (for each i) for all of X without a scan. Md: does d produce itself automatically?)
3. Calculate P(Di ≤ Dj), i < j (the mask pTree whose bit is 1 iff distance(x, Ci) ≤ distance(x, Cj)). (Md: ≤ instead of < ?)
4. Calculate cluster mask pTrees:
   PC1 = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤Dk)
   PC2 = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤Dk) & ~PC1
   PC3 = P(D3≤D4) & ... & P(D3≤Dk) & ~PC1 & ~PC2
   ...
   PCk = ~PC1 & ~PC2 & ... & ~PCk-1
5. Calculate new centroids, Ci = Sum(X & PCi) / count(PCi).
6. If stop_condition = false, start the next iteration with these new centroids.
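The mask construction in steps 3-5 can be tried out on a toy scale. The sketch below is only an illustration under loud assumptions: plain Python ints stand in for uncompressed mask pTrees (bit r = row r of X), the distances and the centroid sums are computed by an ordinary scan (which is exactly what pk-means is meant to avoid), and the stop condition "centroids unchanged" is assumed because the slide leaves stop_condition open. All function names are hypothetical.

```python
# Toy mpk-means: Python ints stand in for (uncompressed) mask pTrees.
# Bit r of a mask corresponds to row r of X.

def dist2(p, q):
    """Squared Euclidean distance (monotone in distance, so <= is preserved)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def le_mask(Di, Dj):
    """P(Di <= Dj): bit r is 1 iff row r is at least as close to Ci as to Cj."""
    m = 0
    for r, (a, b) in enumerate(zip(Di, Dj)):
        if a <= b:
            m |= 1 << r
    return m

def cluster_masks(X, C):
    """Steps 2-4: distance columns, comparison masks, then PC1..PCk."""
    D = [[dist2(x, c) for x in X] for c in C]
    all_rows = (1 << len(X)) - 1
    taken, masks = 0, []
    for i in range(len(C)):
        pc = all_rows & ~taken              # exclude rows already in PC1..PC(i-1)
        for j in range(i + 1, len(C)):
            pc &= le_mask(D[i], D[j])       # AND in P(Di <= Dj) for each j > i
        masks.append(pc)
        taken |= pc
    return masks

def centroid(X, mask):
    """Step 5: Sum(X & PCi) / count(PCi), done here with an explicit scan."""
    rows = [x for r, x in enumerate(X) if (mask >> r) & 1]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def mpk_means(X, C, max_iter=100):
    """Step 6 with an assumed stop condition: centroids no longer change."""
    masks = cluster_masks(X, C)
    for _ in range(max_iter):
        new_C = [centroid(X, m) if m else c for m, c in zip(masks, C)]
        if new_C == list(C):
            break
        C = new_C
        masks = cluster_masks(X, C)
    return C, masks

X = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, masks = mpk_means(X, [(0, 0), (10, 10)])
```

Because each PCi is ANDed with the complement of the masks already assigned, every row ends up in exactly one cluster mask even when distances tie.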
pkl-means: p(k-less)-means (pronounced "pickle-means"). If n1 is the least value you could possibly want for k (e.g., n1 = 2) and n2 is the greatest, then for each k, n1 ≤ k ≤ n2, calculate:

4'. Calculate cluster mask pTrees. For k = 2..n:
   PC1k = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤Dk)
   PC2k = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤Dk) & ~PC1
   PC3k = P(D3≤D4) & ... & P(D3≤Dk) & ~PC1 & ~PC2
   ...
   PCk = P(X) & ~PC1 & ... & ~PCk-1
6'. If ∃ k s.t. stop_cond = true, stop and choose that k; else start the next iteration with these new centroids.
3.5'. At each iteration, only continue with those k's that meet a requirement, e.g., only continue with the top t of them each iteration. That would require a comparator that is easy to calculate (using pTrees) with which to compare two k's. The comparator could involve, e.g.:
   a. Sum of the cluster diameters (which should be easy using max and min of D(Clusteri, Cj), or D(Clusteri, Clusterj)?).
   b. Sum of the diameters of the gaps between clusters (should be easy using D(listPCi, Cj) or D(listPCi, listPCj)).
   c. Other?

In 4', first round, can we pick all n2 centroids and then find all PC1k at once (e.g., by doing it for k = 2 (find PCh2, PCh2, h = n1..n2), then on the PCh2's do it again...)? That is, can we combine to save time and complexity? I believe we can do that on the first round, but how about the subsequent rounds? I don't believe so, but that would be significant! Other ways to make this less exponential? That's the challenge.
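To make the pruning idea in 3.5' concrete, here is a minimal sketch of comparator (a), the sum of cluster diameters, used to keep the best t values of k after one assignment pass. Several things are assumptions for illustration only: the first k entries of a shared seed list serve as the centroids for each k, "lower sum is better" is taken as the ranking rule, and the diameters are computed by brute force over horizontal data rather than with pTree formulas. The names `assign`, `sum_of_diameters` and `prune_ks` are hypothetical.

```python
from itertools import combinations

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def assign(X, C):
    """One assignment pass: each point goes to its nearest centroid."""
    clusters = [[] for _ in C]
    for x in X:
        i = min(range(len(C)), key=lambda i: dist(x, C[i]))
        clusters[i].append(x)
    return clusters

def sum_of_diameters(clusters):
    """Comparator (a): sum over clusters of the largest pairwise distance."""
    return sum(max((dist(p, q) for p, q in combinations(c, 2)), default=0.0)
               for c in clusters)

def prune_ks(X, seeds, n1, n2, t):
    """Score every k in n1..n2 (using the first k seeds) and keep the best t."""
    scores = {k: sum_of_diameters(assign(X, seeds[:k]))
              for k in range(n1, n2 + 1)}
    kept = sorted(scores, key=scores.get)[:t]   # assumed: smaller is better
    return kept, scores

X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
seeds = [(0, 0), (5, 5), (10, 0)]
kept, scores = prune_ks(X, seeds, n1=2, n2=3, t=1)
```

Note that the sum of diameters tends to shrink as k grows, so a practical comparator would probably need to combine it with something like (b); the source leaves that choice open.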
From: NDSU Faculty Caucus On Behalf Of Anderson, Sheri
Sent: Friday, April 27, 2012
Subject: NASA EPSCoR Program - Message from ND-NASA EPSCoR Dir

The National Aeronautics and Space Administration (NASA) usually solicits proposals under the NASA EPSCoR Cooperative Agreement Notice (CAN) program every year around November/December. The announcement for 2011 was delayed and is expected to come anytime soon (according to sources from NASA). Two proposals from NASA can be submitted from North Dakota. Last year, both proposals from ND totaling $1.5 million were successful. When the NASA CAN announcements are released, we may have little time to prepare ourselves to submit the proposals. As we are nearing the end of the semester and given the time required for internal selections, it is necessary to be ready so that we can react quickly to the announcement when it comes.

Successful proposals are expected to establish research activities that will make significant contributions to the strategic research and technology development priorities of one or more of NASA Mission Directorates or the Office of the Chief Technologist, and contribute to the overall research infrastructure, science and technology capabilities, higher education, and economic development in North Dakota.

Interested applicants are strongly encouraged to start discussions with the ND NASA EPSCoR Director and be ready to submit a preliminary proposal (3 to 4 pages) at short notice, outlining the central idea and why NASA should be interested in their proposal. The ND NASA EPSCoR technical advisory committee will then select two proposals for final submission. Once a preliminary proposal is selected by ND NASA EPSCoR, the PI will be asked to submit a full proposal.

Santhosh K. Seelan, Ph.D.
Professor and Chair, Department of Space Studies
Director, ND Space Grant Consortium, & Director, ND-NASA EPSCoR
UND Tel: 701 777 2355 e-mail: seelan@space.edu Web address: www.space.edu

Sheri L. Anderson
Interim Co-Project Director / ND EPSCoR
NDSUR1 phone: 701.231.6573 Email Address: sheri.anderson@ndsu.edu www.ndsu.edu/research
MASTER MINE (My Affinity Set TEaseR and MINEr)

[Figure: 2-D scatter plot (axes dim1, dim2) of the points (11,10), (4,9), (2,8), (5,8), (4,6), (3,4), (10,5), (9,4), (8,3), (7,2), with the mean (6.3,5.9) and the vector of medians (6,5.5) marked. A second panel shows the points labelled o, 0, r1, r2, r3, v1, v2, v3, v4, with the successive mean and median positions marked along dim1.]

Algorithm-1: Look for the dimension where clustering is best. Below, dimension = 1 (3 clusters: {r1,r2,r3,o}, {v1,v2,v3,v4} and {0}). How to determine?
1.a: Take each dimension in turn, working left to right; when d(mean, median) > ¼ width, declare a cluster.
1.b: Next, take those clusters one at a time to the next dimension for further sub-clustering via the same algorithm.

Walking through the figure: At this point we declare {r1,r2,r3,o} a cluster and start over. At this point we need to declare a cluster, but which one, {0,v1} or {v1,v2}? We will always take the one on the median side of the mean, in this case {v1,v2}. And that makes {0} a cluster (actually an outlier, since it is a singleton). Continuing with {v1,v2}: declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than each single projection, delta can be the next m projections if they are close. Next we would take one of the clusters and go to the best dimension to subcluster... Can skip doubletons, since the mean is always the same as the median.

Algorithm-2:
2.a: Take each dimension in turn, working left to right; when density > Density_Threshold, declare a cluster (density ≡ count/size).
2.b = 1.b.

Oblique version: Take a grid of oblique direction vectors, e.g., for a 3D dataset, a DirVect pointing to the center of each PTM triangle. With projections onto those lines, do 1 or 2 above. Ordering = any sphere surface grid: Sn ≡ {x ≡ (x1...xn) ∈ R^n | Σ xi² = 1}; in polar coords, {p ≡ (θ1...θn-1) | 0 ≤ θi ≤ 179}. Use lexicographical polar coords? 180^n too many? Use, e.g., 30-degree units, giving 6^n vectors for dim = n. Attribute relevance is important.

Algorithm-3: Another variation of this is to calculate the dataset mean and the vector of medians (vom). Then, on the projections of the dataset onto the line connecting the two, do 1.a or 1.b. Then repeat on each declared cluster, but use a projection line other than the one through the mean and vom this second time (since the mean-vom line would likely be in approximately the same direction as in the first round). Do until no new clusters? Adjust, e.g., projection lines and stop condition, ...

Algorithm-4: Project onto the line through the dataset mean and vom: mn = (6.3, 5.9), vom = (6, 5.5) ((11,10) = outlier).
4.b: Repeat on any perpendicular line through the mean. (mn, vom far apart ⇒ multi-modality.)
Algorithm-4.1 (4.b.1): In each cluster, find the 2 points furthest from the line? (Does this require the projection to be done one point at a time? Or can we determine those 2 points in one pTree formula?)
Algorithm-4.2 (4.b.2): Use a grid of unit direction lines, {dvi | i = 1..m}. For each, calculate mn and vom of the projections of each cluster (except singletons). Take the one for which the separation is max.
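As a quick check of the numbers quoted for Algorithm-4, the following sketch recomputes the mean and the vector of medians for the ten plotted points and projects every point onto the line through them. It works on ordinary horizontal data, not pTrees, and the helper names (`mean_vector`, `vom`, `project_on_line`) are made up for this illustration.

```python
from statistics import median

points = [(11, 10), (4, 9), (2, 8), (5, 8), (4, 6),
          (3, 4), (10, 5), (9, 4), (8, 3), (7, 2)]

def mean_vector(pts):
    n = len(pts)
    return tuple(sum(col) / n for col in zip(*pts))

def vom(pts):
    """Vector of medians: the per-dimension median."""
    return tuple(median(col) for col in zip(*pts))

def project_on_line(pts, p0, p1):
    """Signed position of each point along the line from p0 towards p1."""
    d = [b - a for a, b in zip(p0, p1)]
    norm = sum(c * c for c in d) ** 0.5
    u = [c / norm for c in d]                      # unit direction vector
    return [sum((x - a) * c for x, a, c in zip(p, p0, u)) for p in pts]

mn, vm = mean_vector(points), vom(points)
proj = project_on_line(points, mn, vm)
```

On this data the mean comes out at about (6.3, 5.9) and the vom at (6, 5.5), matching the slide, and (11,10) lands far from every other projection on the mean-vom line, which is consistent with it being flagged as an outlier.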
mean = (8.18, 3.27, 3.73), vom = (7, 4, 3)

[Figure: 3-D scatter plot of the points labelled 435, 524, 504, 545, 323, 924, b43, e43, c63, 752, f72 (labels as printed on the slide; apparently one hex digit per coordinate, e.g., 924 = (9,2,4)), with the projection lines used in the steps below.]

1. No clusters determined yet.
2. (9,2,4) determined as an outlier cluster.
3. Using the red dim line, (7,5,2) is determined as an outlier cluster. Maroon points determined as a cluster, purple points too.
3.a However, continuing to use the line connecting the (new) mean and vom of the projections onto this plane, would the same be determined?

Other option? Use (at some judicious point) a p-Kmeans type approach. This could be done using K = 2 and a divisive top-down approach (using a GA mutation at various times to get us off a non-convergent track)?

Notes: Each round, reduce dim by one (low bound on the loop). Each round, we just need a good line (in the remaining hyperplane) on which to project the cluster (so far).
1. Pick the line through the projected mean and vom (vom is dependent on the basis used; better way?).
2. Pick the line through the longest diameter? (Or diam ≥ 1/2 previous diam?)
3. Try a direction vector. Then hill-climb it in the direction of increase in the diameter of the projected set.

From: Mark Silverman [mailto:msilverman@treeminer.com]
April 21, 2012 8:22 AM
Subject: RE: oblique faust

I’ve been doing some tests, so far not so accurate (I’m still validating the code – I “unhardcoded” it so I can deal with arbitrary datasets and it’s possible there’s a bug, so far I think it’s ok). Something rather unique about the test data I am using is that it has four attributes, but for all of the class decisions it is really one of the attributes driving the classification decision (e.g. for classes 2-10, attribute 2 is dominant decision, class 11 attribute 1 is dominant, etc). I have very wide variability in std deviation in the test data (some very tight, some wider). Thus, I think that placing “a” on the basis of relative deviation makes a lot of sense in my case (and probably in general). My assumption is that all I need to do is to modify as follows:

Now: a[r][v] = (Mr + Mv) * d / 2
Changes to: a[r][v] = (Mr + Mv) * d * std(r) / (std(r) + std(s))

Is this correct?
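The two cut-point formulas in the message can be compared numerically. The sketch below evaluates both exactly as written, reading std(s) as the standard deviation of class v (an assumption; the message does not define s). For contrast it also includes an interpolating variant that slides the cut between the two projected means; that third function is not from the source and is included only to show how the two readings differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cut_midpoint(Mr, Mv, d):
    """Current rule from the message: a = (Mr + Mv) o d / 2."""
    s = [a + b for a, b in zip(Mr, Mv)]
    return dot(s, d) / 2

def cut_std_scaled(Mr, Mv, d, std_r, std_v):
    """Proposed rule as written: a = (Mr + Mv) o d * std_r / (std_r + std_v)."""
    s = [a + b for a, b in zip(Mr, Mv)]
    return dot(s, d) * std_r / (std_r + std_v)

def cut_interpolated(Mr, Mv, d, std_r, std_v):
    """Alternative (NOT from the source): move the cut between the projected
    means in proportion to the std ratio."""
    pr, pv = dot(Mr, d), dot(Mv, d)
    return pr + (pv - pr) * std_r / (std_r + std_v)

# Hypothetical class means on the d-line, chosen so the arithmetic is exact.
Mr, Mv, d = (2.0, 0.0), (6.0, 0.0), (1.0, 0.0)
equal = (cut_midpoint(Mr, Mv, d),
         cut_std_scaled(Mr, Mv, d, 1.0, 1.0),
         cut_interpolated(Mr, Mv, d, 1.0, 1.0))
unequal = (cut_std_scaled(Mr, Mv, d, 1.0, 3.0),
           cut_interpolated(Mr, Mv, d, 1.0, 3.0))
```

With equal deviations all three rules give the midpoint. With deviations 1 and 3, the rule as written puts the cut at 2.0, which is the projection of Mr itself, while the interpolating variant gives 3.0. Which behaviour was intended is not settled by the source.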
FAUST Oblique (our best classifier?)

Separate class R using the midpoint of means (mom) method: calculate a.
   D ≡ mR→mV (the vector from mR to mV), d = D/|D|
   a = (mR + (mV−mR)/2)∘d = ((mR+mV)/2)∘d (works also if D = mV→mR)
   PR = P(X∘d < a): 1 pass gives the class R pTree.

[Figure: 2-D scatter plot (axes dim 1, dim 2) of class r and class v points, with the means mR and mV, the vectors of medians vomR and vomV, the d-line with direction d, and the std of distances (vod) from the origin along the d-line.]

Training ≡ placing cut-hyperplane(s) (CHP) (= (n−1)-dimensional hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at a time).

Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use the
1. vectors_of_medians, vom, to represent each class, not the mean mV, where vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...).
2. mom_std, vom_std methods: project each class on the d-line; then calculate the std (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).

Note: training (finding a and d) is a one-time process. If we don’t have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees).
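A minimal sketch of the midpoint-of-means training and the single-pass classification described above, assuming two classes and ordinary horizontal rows. The class R mask is returned as a Python int (bit r set iff row r is predicted R), which only imitates a mask pTree; the function names `train` and `class_R_mask` are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train(R, V):
    """Return (d, a): unit vector from mR towards mV, and the midpoint cut."""
    mR, mV = mean(R), mean(V)
    D = [v - r for r, v in zip(mR, mV)]
    norm = dot(D, D) ** 0.5
    d = [c / norm for c in D]
    a = dot([(r + v) / 2 for r, v in zip(mR, mV)], d)   # ((mR+mV)/2) o d
    return d, a

def class_R_mask(X, d, a):
    """PR = P(X o d < a): bit r is 1 iff row r falls on the R side of the cut."""
    mask = 0
    for r, x in enumerate(X):
        if dot(x, d) < a:
            mask |= 1 << r
    return mask

# Hypothetical training and test rows.
R = [(1, 1), (2, 1), (1, 2)]
V = [(8, 8), (9, 8), (8, 9)]
d, a = train(R, V)
PR = class_R_mask([(0, 0), (10, 10), (2, 2)], d, a)
```

Here rows 0 and 2 of the test set fall on the R side, so PR is 0b101. Swapping in the vom or a std-weighted cut would only change how `a` is computed in `train`.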
The PTreeSet Genius for Big Data

[Figure: bit-matrix layout of a PTreeSet: rows 1..N; columns A1,bw1, A1,bw1-1, ..., A1,0, A2,bw2, ..., Ak+1,c1, ..., An,ccn. A second panel shows AHG(P,bpp): people rows (up to 7B) with features pc, bc, lc, cc, pe, age, ht, wt, against base-pair positions (bpp) 1..3B ordered by chromosome and gene, with level-1 interval numbers 1..roof(N/64).]

Big Vertical Data: PTreeSet (Dr. G. Wettstein's) is perfect for BVD! (pTrees both horizontal and vertical.) PTreeSets include methods for horizontal querying and vertical DM, multihop Query/DM, and XML.

T(A1...An) is a PTreeSet data structure = bit matrix with (typically) each numeric attribute converted to fixed point(?) (negatives?) and bitsliced (pt_posschema), and each category attribute bitmapped; or coded then bitmapped; or numerically coded then bitsliced (or left as is, i.e., a char(25) NAME column stored outside the PTreeSet?). A1..Ak are numeric with bitwidths bw1..bwk; Ak+1..An are categorical with counts cck+1...ccn.

Methods for this data structure can provide fast horizontal row access, e.g., an FPGA could (with zero delay) convert each bit-row back to the original data row. Methods already exist to provide vertical (level-0 or raw pTree) access. Any Level-1 PTreeSet can be added, given any row partition (e.g., equiwidth = 64-row intervalization) and a row predicate (e.g., 50% 1-bits). Add "level-1 only" DM methods, e.g., an FPGA converts unclassified rowsets to equiwidth=64, 50% level-1 pTrees, then the entire batch would be FAUST classified in one horizontal program. Or level-1 pCKNN.

pDGP (pTree Darn Good Protection) by permuting column order (permutation = key). A random pre-pad for each bit-column would make it impossible to break the code by simply focusing on the first bit row. More security?: all pTrees the same (max) depth, and intron-like pads randomly interspersed...
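To illustrate the bit-slicing idea for a numeric attribute, the sketch below turns one non-negative integer column into vertical bit slices and rebuilds a row from them. It is a simplified stand-in: each slice is an uncompressed Python int (bit r = row r), there is no tree compression, and the names `bit_slice`, `row_value` and `count_ones` are hypothetical.

```python
def bit_slice(column, bitwidth):
    """slices[b] is a mask whose bit r equals bit b of column[r]."""
    slices = [0] * bitwidth
    for r, value in enumerate(column):
        for b in range(bitwidth):
            if (value >> b) & 1:
                slices[b] |= 1 << r
    return slices

def row_value(slices, r):
    """Horizontal access: rebuild the original value of row r from the slices."""
    return sum(((s >> r) & 1) << b for b, s in enumerate(slices))

def count_ones(mask):
    """Root count of an (uncompressed) mask."""
    return bin(mask).count("1")

column = [5, 3, 7, 0]                 # hypothetical attribute A1 with bitwidth 3
slices = bit_slice(column, 3)
rebuilt = [row_value(slices, r) for r in range(len(column))]
rows_ge_4 = count_ones(slices[2])     # rows with the high bit set, i.e. value >= 4
```

A vertical question such as "how many rows have A1 >= 4" touches only the high-order slice, while a horizontal read of one row gathers one bit from each slice.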
Relationships (rolodex cards) are 2 PTreeSets: AHGPeoplePTreeSet (shown) and AHGBasePairPositionPTreeSet (rotation of shown). Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods apply (viewing AHG as either a People table (cols = BPPs) or as a BPP table (cols = People)). MRM and Classification done in combination?

Any table is a relationship between row and column entities (heterogeneous entity), e.g., an image = [reflect. labelled] relationship between pixel entity and wavelength interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).

Most bioinformatics done so far is not really data mining but is more toward the database querying side (e.g., a BLAST search).

A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair positions (ordered by chromosome first, then gene region?). AHG [THG/GHG/CHG] is the relationship between People and adenine (A) [thymine (T)/guanine (G)/cytosine (C)] (1/0 for yes/no).

Order bpp? By chromosome and by gene or region (level 2 is chromosome, level 1 is gene within chromosome). Do it to facilitate cross-organism bioinformatics data mining? Create both People and BPP PTreeSets with a human health records feature table (training set for classification and multi-hop ARM): a comprehensive decomposition (ordering of bpps) for cross-species genomic DM. If there are separate PTreeSets for each chromosome (even each region: gene, intron, exon, ...), then we may be able to data mine horizontally across all of these vertical pTrees.

The red person features are used to define classes. AHGp pTrees are for data mining. We can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, overall, or anything else.
Facebook-Buys: A Facebook member, m, purchases item x and tells all friends. Let's make everyone a friend of him/herself. Each friend responds back with the items, y, she/he bought and liked.

[Figure: bit matrices for F ≡ Friends(M,M) and P ≡ Purchase(M,I) over Members and I ≡ Items; a variant with F ≡ Friends(K,B) and P ≡ Purchase(B,I) over Kiddos and Buddies; and a variant that adds Others(G,K) / Compatriots(G,K) over Groupies and Kiddos.]

For X ⊆ I: MX ≡ &_{x∈X} Px = people that purchased everything in X. FX ≡ OR_{m∈MX} Fm = friends of an MX person.

So, for X = {x}, is "Mx purchases x" strong?
Mx = OR_{m∈Px} Fm is frequent if Mx is large. This is a tractable calculation: take one x at a time and do the OR.
Mx = OR_{m∈Px} Fm is confident if ct(Mx & Px) / ct(Mx) > minconf, i.e., ct(OR_{m∈Px} Fm & Px) / ct(OR_{m∈Px} Fm) > mncnf.
Example: K2 = {1,2,4}, P2 = {2,4}, ct(K2) = 3, ct(K2&P2)/ct(K2) = 2/3.

To mine X, start with X = {x}. If not confident, then no superset is. Closure: X = {x,y} for x and y forming confident rules themselves....

With the third entity: Kx = OR_{g ∈ OR_{b∈Px} Fb} Og is frequent if Kx is large (tractable: one x at a time and OR).

Facebook buddy, b, purchases x and tells friends. Each friend tells all friends. Strong purchase possibility? Intersect rather than union (AND rather than OR). Advertise to friends of friends.
K2 = {2,4}, P2 = {2,4}, ct(K2) = 2, ct(K2&P2)/ct(K2) = 2/2.
K2 = {1,2,3,4}, P2 = {2,4}, ct(K2) = 4, ct(K2&P2)/ct(K2) = 2/4.
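The confidence ratios quoted above (2/3, 2/2 and 2/4) can be reproduced with a few bitwise operations. In the sketch below the friendship data is hypothetical: it was chosen so that the OR, AND and friends-of-friends constructions yield the three K2 sets listed on the slide, and it is not claimed to be the slide's actual matrix. Python ints again stand in for mask pTrees (bit m = member m), and all helper names are made up.

```python
def mask(members):
    """Bit m is 1 iff member m is in the set."""
    out = 0
    for m in members:
        out |= 1 << m
    return out

def members_of(msk):
    return {m for m in range(msk.bit_length()) if (msk >> m) & 1}

def count(msk):
    return bin(msk).count("1")

# Hypothetical Friends relationship (everyone is their own friend) and Purchase.
F = {1: mask({1, 2, 3}), 2: mask({1, 2, 4}), 3: mask({1, 3}), 4: mask({2, 4})}
P = {2: mask({2, 4})}                       # members who purchased item 2

def or_friends(msk):
    out = 0
    for m in members_of(msk):
        out |= F[m]
    return out

def and_friends(msk):
    out = mask(F)                           # start from all members
    for m in members_of(msk):
        out &= F[m]
    return out

def confidence(K, Px):
    return count(K & Px) / count(K)

K_or = or_friends(P[2])                     # friends of any purchaser
K_and = and_friends(P[2])                   # friends of every purchaser
K_fof = or_friends(K_or)                    # friends of friends
```

On this data the three constructions give {1,2,4}, {2,4} and {1,2,3,4}, with confidences 2/3, 1 and 1/2 respectively, matching the figures on the slide.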
The Multi-hop Closure Theorem

A hop is a relationship, R, hopping from entities E to F.

A condition is downward [upward] closed if, when it is true of A, it is true for all subsets [supersets], D, of A.

Theorem: Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies to frequency and confidence.

A pTree, X, is said to be "covered by" a pTree, Y, if for every one-bit in X there is a one-bit at that same position in Y.

Lemma-0: For any two pTrees, X and Y, X&Y is covered by X, and thus ct(X&Y) ≤ ct(X) and list(X&Y) ⊆ list(X).
Proof-0: ANDing with Y may zero some of X's ones, but it will never change any zeros to ones.

Lemma-1: Let A ⊆ D. Then &_{a∈A} Xa covers &_{a∈D} Xa.
Lemma-2: Let A ⊆ D. Then &_{c∈list(&_{a∈D} Xa)} Yc covers &_{c∈list(&_{a∈A} Xa)} Yc.
Proof-1&2: Let Z = &_{a∈D−A} Xa; then &_{a∈D} Xa = Z & (&_{a∈A} Xa). Lemma-1 now follows from Lemma-0. Lemma-2 follows as well: with A' = list(&_{a∈D} Xa) and D' = list(&_{a∈A} Xa) we have A' ⊆ D', so by Lemma-1 we get Lemma-2.

Lemma-3: Let A ⊆ D. Then &_{e∈list(&_{c∈list(&_{a∈A} Xa)} Yc)} We covers &_{e∈list(&_{c∈list(&_{a∈D} Xa)} Yc)} We.
Proof-3: Lemma-3 follows in the same way from Lemma-1 and Lemma-2.

Continuing this establishes: if there is an odd number of nested &'s, then the expression with D is covered by the expression with A. Therefore the count with D ≤ the count with A. Thus, if the frequency expression and the confidence expression are > threshold for D, then the same is true for its subset A. This establishes downward closure. Exactly analogously, if there is an even number of nested &'s we get upward closure.
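Lemma-0 and Lemma-1 are easy to spot-check on small bit masks. The following sketch does so with Python ints standing in for pTrees; the pTrees X1..X3 are hypothetical test data and the helper names (`covers`, `and_over`, `ct`) are invented for this illustration. It is a sanity check on one example, not a proof.

```python
NBITS = 4                                   # every toy pTree has 4 bit positions

def covers(Y, X):
    """True iff X is covered by Y: every 1-bit of X is also a 1-bit of Y."""
    return X & ~Y == 0

def ct(X):
    """Count of 1-bits."""
    return bin(X).count("1")

def and_over(ptrees, index_set):
    """&_{a in index_set} ptrees[a]; the empty AND is the all-ones pTree."""
    out = (1 << NBITS) - 1
    for a in index_set:
        out &= ptrees[a]
    return out

X = {1: 0b1110, 2: 0b0111, 3: 0b1101}       # hypothetical pTrees X1, X2, X3
A, D = {1}, {1, 2, 3}                       # A is a subset of D

lemma0 = covers(X[1], X[1] & X[2]) and ct(X[1] & X[2]) <= ct(X[1])
and_A, and_D = and_over(X, A), and_over(X, D)
lemma1 = covers(and_A, and_D)               # the AND over A covers the AND over D
```

Here the AND over A is 0b1110 and the AND over D shrinks to 0b0100, so enlarging the index set can only remove one-bits, which is the content of Lemma-1.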
APPENDIX: Multi-hop Closure Theorem

A hop is a relationship, R, hopping from entities E to F.

[Figure: chain of four relationship bit matrices, R(E,F), S(F,G), T(G,H) and U(H,I), with an antecedent set A marked on E and a consequent set C marked on I.]

Downward closure: if a condition is true of A, then it is true for all subsets D of A.
Upward closure: if a condition is true of A, then it is true of all supersets D of A.

Theorem: For a transitive (a+c)-hop strong rule mine, where the focus or count entity is a hops from the antecedent and c hops from the consequent, if a (or c) is odd/even then downward/upward closure applies to frequency (confidence). Odd ⇒ downward; even ⇒ upward.

The proof of the theorem: a pTree, X, is said to be "covered by" a pTree, Y, if for every 1-bit in X there is a 1-bit at that same position in Y.

Lemma-0: For any two pTrees, X and Y, X&Y is covered by X and ct(X) ≥ ct(X&Y).
Proof-0: ANDing with Y may zero some of X's 1-positions but never ones any of X's 0-positions.

Lemma-1: Let A ⊆ B. Then &_{a∈B} Xa is covered by &_{a∈A} Xa.
Proof-1: Let Z = &_{a∈B−A} Xa; then &_{a∈B} Xa = Z & (&_{a∈A} Xa), so the result follows from Lemma-0.

Lemma-2: For a (or c) = 0, frequency and confidence are upward closed.
Proof-2: For A ⊆ B, ct(B) ≥ ct(A), so ct(A) > mnsp ⇒ ct(B) > mnsp, and ct(C&A)/ct(C) > mncf ⇒ ct(C&B)/ct(C) > mncf.

Lemma-3: If for a (or c) we have upward/downward closure of frequency or confidence, then for a+1 (or c+1) we have downward/upward closure.
Proof-3: Taking the case of a and upward closure, and going to a+1 and D ⊆ A, we are removing ANDs in the numerator for both frequency and confidence, so by Lemma-1 the a+1 numerator covers the a numerator and therefore the a+1 count ≥ the a count. Therefore, the condition (frequency or confidence) holds in the a+1 case and we have downward closure.
The Multi-hop Closure Theorem

A hop is a relationship, R, hopping from entities E to F.

A condition is downward [upward] closed if, when it is true of A, it is true for all subsets [supersets], D, of A.

Theorem: Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies to frequency and confidence.

A pTree, X, is said to be "covered by" a pTree, Y, if for every one-bit in X there is a one-bit at that same position in Y.

Lemma-0: For any two pTrees, X and Y, X&Y is covered by X and therefore ct(X) ≥ ct(X&Y).
Proof-0: ANDing with Y may zero some of X's ones, but it will never change any zeros to ones.

Lemma-1.a: Let A ⊆ D. Then &_{a∈D} Xa is covered by &_{a∈A} Xa.
Proof-1.a: Let Z = &_{a∈D−A} Xa; then &_{a∈D} Xa = Z & (&_{a∈A} Xa), so the result follows from Lemma-0.

Lemma-1.b: Let A ⊆ D. Then &_{c∈(&_{a∈D} Xa)} Xc covers &_{c∈(&_{a∈A} Xa)} Xc.
Proof-1.b: Let Z = &_{a∈D−A} Xa; then &_{a∈D} Xa = Z & (&_{a∈A} Xa), so the result follows from Lemma-0.

Lemma-2: If n is even/odd, then ct(&_{a1∈(&...)Sa2} Ta1) > Threshold is upward/downward closed on A.
Proof-2: Let A ⊆ D. Below, ⊆ between pTrees means "is covered by" and ⊇ means "covers". Then:
   &_{an∈D} Ran ⊆ &_{an∈A} Ran (we are ANDing over additional pTrees on the left);
   &_{a(n-1)∈(&_{an∈D} Ran)} ⊇ &_{a(n-1)∈(&_{an∈A} Ran)} (we are ANDing over additional pTrees on the right);
   &_{a(n-2)∈}&_{a(n-1)∈(&_{an∈D} Ran)} ⊆ &_{a(n-2)∈}&_{a(n-1)∈(&_{an∈A} Ran)} (we are ANDing over additional pTrees on the left);
   &_{a(n-3)∈}&_{a(n-2)∈}&_{a(n-1)∈(&_{an∈D} Ran)} ⊇ &_{a(n-3)∈}&_{a(n-2)∈}&_{a(n-1)∈(&_{an∈A} Ran)} (we are ANDing over additional pTrees on the right);
   ...
   &_{a1∈(&...)Sa2} Ta1 built from &_{a(n-1)∈(&_{an∈D} Ran)} ⊇ (if n is even) or ⊆ (if n is odd) &_{a1∈(&...)Sa2} Ta1 built from &_{a(n-1)∈(&_{an∈A} Ran)}.
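The alternation between "covered by" and "covers" as &'s are nested can be seen on a two-hop toy example. In this sketch R maps each antecedent entity to a pTree over F and S maps each F entity to a pTree over G; both relationships are hypothetical data, Python ints stand in for pTrees, and the helper names are invented. It checks one instance of the odd/even pattern and is not a proof.

```python
def covers(Y, X):
    """True iff every 1-bit of X is also a 1-bit of Y."""
    return X & ~Y == 0

def ct(X):
    return bin(X).count("1")

def bit_list(X):
    """list(X): the positions of the 1-bits of pTree X."""
    return [i for i in range(X.bit_length()) if (X >> i) & 1]

def and_over(ptrees, index_set, nbits):
    """AND of ptrees[i] for i in index_set; the empty AND is all ones."""
    out = (1 << nbits) - 1
    for i in index_set:
        out &= ptrees[i]
    return out

# Hypothetical hops: R from E to F (3 F-entities), S from F to G (4 G-entities).
R = {0: 0b011, 1: 0b110, 2: 0b111}
S = {0: 0b1100, 1: 0b0110, 2: 0b0011}

def one_hop(A):
    return and_over(R, A, 3)                         # one nested & (odd)

def two_hop(A):
    return and_over(S, bit_list(one_hop(A)), 4)      # two nested &'s (even)

A, D = {0}, {0, 1}                                   # A is a subset of D
odd_case = covers(one_hop(A), one_hop(D))            # D-expression covered by A-expression
even_case = covers(two_hop(D), two_hop(A))           # D-expression covers A-expression
```

With one & the count can only fall as the antecedent set grows (downward closure of a count threshold); with two nested &'s the inner list shrinks, so the outer AND has fewer operands and its count can only rise (upward closure).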