Netflix Recommender System Implementation Description

Learn about the Hadoop implementation and the effectiveness of vertical methods for the Netflix Recommender System. Understand the pTree advantage in processing very big data. Gain insights into the competitive strategies and techniques used in the Netflix Contest. Explore the nearest neighbor voting and association rule mining methodologies applied in the recommender system design. Discover the similarities between recommender and text classification environments, and the importance of speed in handling dynamic data updates.

dinac

Presentation Transcript


  1. From: Mark Silverman Sent: Saturday, April 27, 10:21 AM To: Perrizo, William
I've been knee deep in getting the Hadoop implementation running and making the product more robust, as well as responding to some RFPs. I'm very close. Do you have any PowerPoints etc. on what you did for Netflix? I have some "recommend engine" type opportunities. Never heard from Jonathan on code. Will push him. pen out what we discussed.

From: Perrizo, William Sent: Saturday, April 27, 2013 2:04 PM To: 'Mark Silverman'
When you get it to a describable point, I would love to hear about the Hadoop implementation. I have attached some slides on our Netflix work (probably overload?)

From: Mark Silverman Sent: Monday, April 29, 8:20 AM
Is there anything in particular about the Netflix approach that potentially makes vertical methods more effective? If datasets start getting large I assume we have a nice performance advantage, possibly also an advantage that new “movies” could be added to the collection dynamically rather than requiring remodeling?

From: Perrizo, William Sent: Monday, April 29, 10:35 AM
In the Netflix Contest, the task was to beat Cinematch (Netflix’s recommender) by 10%. Contestants were given 5 yrs so it was not a speed contest (accuracy only). The “pTree advantage” is mostly a speed advantage. My main pTree sales pitch has always been “get information from your Very Big Data in human time!”. Most recommenders have Very Big Training Sets (which are getting ever bigger - and the bigger, the better!). Therefore, the rubber meets the recommender road on speed, not accuracy. It is difficult to devise a recommender-speed contest, so Netflix didn’t.
We used Nearest Neighbor Voting in two ways (and combined the two votes into one at the end).
1. We made a PTreeSet of the User Rating-History Table, UT(User, Movie, Rating), in which each row is a user and each column is a movie.
We used pTree horizontal processing (ANDs, ORs …) of UT to get a “Near neighbor user vote” which predicted a rating for each Test pair, (u,m): near neighbor users, v, close to u in terms of their rating history, voted on which rating u might give to m. “Near” was defined in terms of the ratings correlation of v and u (over a pruned set of the movies rated by both v and u).
2. We made a PTreeSet of the Movie Rated History Table, MT(Movie, User, Rating), in which each row is a movie and each column is a user. We used pTree horizontal processing (ANDs, ORs …) of MT to get a “Near neighbor movie vote” which predicted the rating for each Test pair, (m,u): near neighbor movies, n, close to m in terms of their rated history, voted on which rating m might be given by u. “Near” was defined in terms of the ratings correlation of n and m (over a pruned-down set of the users who rated both n and m).
We also tried bringing in Association Rule Mining as a third contributor to the predictions, but without much success. I’m going to spend a little time trying to apply our newer FAUST methods to it.
In lots of ways, the recommender environment is similar to the text classification environment - the main training object is a real_number_labeled relationship between two entities (users and movies entities with ratings labels in the recommender case, and documents and terms entities with frequency labels in the text mining case). In both we have to deal with very high dimensions as well as very high cardinality. Text mining is easier since label=0 means term_freq=0, while in the recommender case label=0 does not mean rating=0 (hated it!) but means “didn’t rate it” (one has to be very careful not to allow label=0 to be interpreted in the code as “absolutely hated that movie!”).
Hmm. It occurs to me now that we calculated a weighted average decimal rating (e.g., if it turned out to be 4.2546 we predicted 4).
Using the FAUST methods, we will treat each rating as a categorical class (non-numeric). Maybe the “rating=0 problem” will not rear its ugly head??
On the “add to Training Set dynamically” issue, speed seems like the solution here too. If your recommender is slow (as most are) then you are pretty much forced to treat new training data separately (rather than rebuilding your predictor model). Off the top of my head, I would think we would just take the entire training set (with the new ratings added) and go. Remember, a nearest neighbor classifier is a lazy classifier in the sense that it does not build a model of the training set during a slow “build phase” and then use that model swiftly during a classify phase, but it uses the entire training set for each new prediction. So all we would do is extend our two PTreeSets to include the new training data, not rebuild a training model. (Of course, FAUST builds a decision tree, so I will have to think about that issue for FAUST.)

To: Perrizo, William
Question: Did you deal at all with the issue of recommendation normalization? What if a 3 or 4 to one user is a 5 to another….

To: 'info@treeminer.com'
Usually families had Netflix accounts, and parents rate differently than kids (opposite?). We did not come up with a solution for that one. Also, many new users rated lots of movies they had never seen just to prime the pump. That one could be ameliorated somewhat by noting the date of rating (e.g., a user who rates 1000 movies in his/her first few days is probably priming the pump). Using a rating vector neighborhood of voters, we would include in the vote only users who also considered a rating of 3 or 4 as a high rating (and exclude those who consider 5 as high, unfortunately). I remember trying to normalize out that difference by dividing by STD (or max minus min) of a user’s ratings.
If your document recommender environment is one in which a document is recommended (more highly) to a user iff the neighborhood of its tf (or tf*idf) vector contains lots of documents already read by that user, then there may not be a huge normalization problem (unless you want to try to normalize out differences in author style?)

  2. From: Mark Silverman [mailto:msilverman@treeminer.com]
So I am also curious whether there is a way to account for inactivity, for example, a recommendation that is not taken?

From: Perrizo, William
In the Netflix contest there was no inactivity information given and therefore it did not play a part in the contest. But I take the “is” in your question to be a probe independent of the Netflix Contest? It’s a very interesting question. Since recommendations are issued for those items that classify as likely to receive a top rating, t, a recommendation not taken could be recorded as a reduced rating, x, where x could be adjusted according to the %, p, of recommendations ignored by that user (e.g., p*t).

From: Mark Silverman Sent: Wednesday, May 01, 2013 7:34 PM To: Perrizo, William
Bingo. It occurs to me further in thinking about this that the concept of a “user” is also a bit fuzzy. As I remember, Greg mentioned to me there were concerns about the “solvability” of Netflix due to the fact that multiple family members might be interested in different movies; for example, my son is not interested in the same movies as me but uses the same account. Similarly, I as a subscriber to, let’s say, articles of “interest” might be interested in both the New York Knicks (yes, I am a long suffering fan) and the economy of the Congo. If we average these term-weights together, I get mush. Perhaps we have “user-profiles”: how close is my recommendation to previous recommendations I’ve made? I could be interested in 5 different topics, or let’s say 5 different movie genres, and by considering me as one person with one set of average term-weights I lose this. Time to cut the cord and finish, I have 18 hours to go.

From: Perrizo
You are right about the “family=user” point. Several people have written about it but I don’t think anyone has come up with a true solution to it.
It might help somewhat to know that a user is a family and then only use near neighbor voters that also display characteristics of being a family (high ratings for “transformer” type movies and for sophisticated dramas…). There is a paper http://dl.acm.org/citation.cfm?id=372071 on movie-based classification in which they look for movies of a similar genre, type, director, actors, … to the movie in question (and that the user in question has rated). Then the average rating that the user gave to those “similar” movies is the prediction. We also took another movie-based approach in which we took near neighbor movies (voters) to be movies rated similarly to the user's rating pattern on other movies by a set of other users (who rated the movie whose rating was to be predicted).

From: Mark Silverman Sent: Thursday, May 02, 2013 2:01 PM
Well, it’s submitted: sufficiently vague, but hopefully sounding good. So my thinking was more along the lines of a user having a set of weighted term frequencies that essentially map to a subject matter interest. Thus, if I am interested in Uganda, I have one set of term-freqs. However, if I am interested in, say, Uganda and the NY Knicks, then tracking interests by user becomes difficult because I am essentially looking for neighbors against things that are rare. So where I was going was that we are tracking not by user but by user-topic, such that we determine how close a user's feedback is to his existing selections, and start a new topic if it is sufficiently far. Thus, I am matching his likes against other NY Knick fans, not other Ugandan NY Knick fans. That’s sort of what I put in, with a lot of caveats given it’s a long project and there is plenty of discovery needed.

From: Perrizo, William Sent: Thursday, May 02, 2013 2:30 PM
This may be too late and it’s not a biggy, but our solution to the “u = a family of users” issue is that we took as our voter set other users who rated almost the same set of other movies as u did, which meant that the voters probably consisted of a similar family mix (e.g., some “transformer” type movies rated by the young males, “Sixteen Candles” type movies rated by the young females, “The Bridges of Madison County” type rated by the adult female and “A Few Good Men” type rated by the adult men…)
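Choosing voters who “rated almost the same set of other movies as u did” could be sketched with bit vectors, in the spirit of the AND/OR processing mentioned earlier in the thread. Everything specific here is an assumption for illustration: the emails do not say how “almost the same” was measured, so this sketch uses a Jaccard overlap and a hypothetical `min_overlap` cutoff.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 64-bit word (portable, no compiler builtins). */
int popcount64(uint64_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Jaccard overlap |A AND B| / |A OR B| of two "has rated" bit vectors,
   one bit per movie, packed 64 movies per word. */
double rated_set_overlap(const uint64_t *a, const uint64_t *b, size_t n_words)
{
    long both = 0, either = 0;
    for (size_t i = 0; i < n_words; i++) {
        both += popcount64(a[i] & b[i]);
        either += popcount64(a[i] | b[i]);
    }
    return either ? (double)both / (double)either : 0.0;
}

/* Keep v as a voter for u only if their rated sets overlap enough.
   min_overlap is a hypothetical tuning knob, not a value from the thread. */
int is_voter(const uint64_t *u, const uint64_t *v, size_t n_words,
             double min_overlap)
{
    return rated_set_overlap(u, v, n_words) >= min_overlap;
}
```

A user who rated movies 0-3 and one who rated movies 0-2 overlap at 3/4, so the second would vote for the first at any cutoff up to 0.75.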

  3. Netflix provided 100M ratings (from 1 to 5) of 17K movies by 500K users. These essentially arrive in the form of a triplet of numbers: (User,Movie,Rating). In particular, for (User,Movie,?) not in the database, tell me what the Rating would be--that is, predict how the given User would rate the given Movie. For visualizing the problem, it makes sense to think of the data as a big sparsely filled matrix, with users across the top and movies down the side (or vice versa if you feel like transposing everything I say henceforth), and each cell in the matrix either contains an observed rating (1-5) for that movie (row) by that user (column), or is blank meaning you don't know. To quantify "big", sticking with the round numbers, this matrix would have about 8.5 billion entries (number of users times number of movies). Note also that this means you are only given values for one in eighty five of the cells. The rest are all blank. Netflix has then posed a "quiz" which consists of a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. They have chosen mean squared error as the measure of accuracy, which means if you guess 1.5 and the actual rating was 2, you get docked for (2-1.5)^2 points, or 0.25. (they specify root mean squared error, referred to as rmse, but since they're monotonically related it's all the same and thus it will simply hurt your head less if you ignore the square root at the end.) They also provide a date for both the ratings and the question marks, which implies that any cell in the matrix can potentially have more than one rating in it. Imagine for a moment that we have the whole shebang--8.5 billion ratings and a lot of weary users. Presumably there are some generalities to be found in there, something more concise and descriptive than 8.5 billion completely independent and unrelated ratings. 
For instance, any given movie can, to a rough degree of approximation, be described in terms of some basic attributes such as overall quality, whether it's an action movie or a comedy, what stars are in it, and so on. And every user's preferences can likewise be roughly described in terms of whether they tend to rate high or low, whether they prefer action movies or comedies, what stars they like, and so on. And if those basic assumptions are true, then a lot of the 8.5 billion ratings ought to be explainable by a lot less than 8.5 billion numbers, since, for instance, a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie. A fun property of machine learning is that this reasoning works in reverse too: If meaningful generalities can help you represent your data with fewer numbers, finding a way to represent your data in fewer numbers can often help you find meaningful generalities. Compression is akin to understanding and all that. In practice this means defining a model of how the data is put together from a smaller number of parameters, and then deriving a method of automatically inferring from the data what those parameters should actually be. In today's foray, that model is called singular value decomposition, which is just saying what I've already alluded to: We'll assume that a user's rating of a movie is composed of a sum of preferences about the various aspects of that movie. For example, imagine that we limit it to forty aspects, such that each movie is described only by forty values saying how much that movie exemplifies each aspect, and correspondingly each user is described by forty values saying how much they prefer each aspect. To combine these all together into a rating, we just multiply each user preference by the corresponding movie aspect, and then add those forty leanings up into a final opinion of how much that user likes that movie.
E.g., Terminator might be (action=1.2,chickflick=-1,...), and user Joe might be (action=3,chickflick=-1,...), and when you combine the two you get Joe likes Terminator with 3*1.2 + -1*-1 + ... = 4.6+... . Note here that Terminator is tagged as an anti-chickflick, and Joe likewise as someone with an aversion to chickflicks, so Terminator actively scores positive points with Joe for being decidedly un-chickflicky. (Point being: negative numbers are ok.) Anyway, all told that model requires 40*(17K+500K) values, or about 20M -- 400 times less than the original 8.5B.

    ratingsMatrix[user][movie] = sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

In matrix terms, the original matrix has been decomposed into two very oblong matrices: the 17,000 x 40 movie aspect matrix, and the 500,000 x 40 user preference matrix. Multiplying those together just performs the products and sums described above, resulting in our approximation to the 17,000 x 500,000 original rating matrix. Singular value decomposition is just a mathematical trick for finding those two smaller matrices which minimize the resulting approximation error--specifically the mean squared error (rather convenient!). So, in other words, if we take the rank-40 singular value decomposition of the 8.5B matrix, we have the best (least error) approximation we can within the limits of our user-movie-rating model. I.e., the SVD has found our "best" generalizations for us. Pretty neat, eh?

Only problem is, we don't have 8.5B entries, we have 100M entries and 8.4B empty cells. Ok, there's another problem too, which is that computing the SVD of ginormous matrices is... well, no fun. But, just because there are five hundred really complicated ways of computing singular value decompositions in the literature doesn't mean there isn't a really simple way too: Just take the derivative of the approximation error and follow it. This has the added bonus that we can ignore the unknown error on the 8.4B empty slots.
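The Joe/Terminator arithmetic and the parameter count can be checked with a few lines of C. This is only a sketch: the function names are invented here, and the two-feature vectors are the truncated example from the text rather than real trained features.

```c
#include <stddef.h>

/* Predicted rating: sum over features of user preference times movie aspect. */
double predict_rating(const double *user_pref, const double *movie_aspect,
                      size_t n_features)
{
    double sum = 0.0;
    for (size_t f = 0; f < n_features; f++)
        sum += user_pref[f] * movie_aspect[f];
    return sum;
}

/* Values the model stores: one per feature for every movie and every user. */
long long model_size(long long n_features, long long n_movies,
                     long long n_users)
{
    return n_features * (n_movies + n_users);
}
```

With Joe as {3, -1} and Terminator as {1.2, -1}, `predict_rating` gives 3*1.2 + (-1)*(-1) = 4.6. With 40 features, 17,000 movies and 500,000 users, `model_size` gives 20,680,000 values, roughly 400 times fewer than the 8.5 billion cells.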

  4. If you write out the equations for the error between the SVD-like model and the original data--just the given values, not the empties--and then take the derivative with respect to the parameters we're trying to infer, you get a rather simple result which I'll give here in C code to save myself the trouble of formatting the math:

    userValue[user] += lrate*err*movieValue[movie];
    movieValue[movie] += lrate*err*userValue[user];

The above code is evaluated for each rating in the training database. Lrate is the learning rate, a rather arbitrary number which I fortuitously set to 0.001 on day one and regretted it every time I tried anything else after that. Err is the residual error from the current prediction. So, the whole routine to train one sample might look like:

    /*
     * Where:
     *  real *userValue = userFeature[featureBeingTrained];
     *  real *movieValue = movieFeature[featureBeingTrained];
     *  real lrate = 0.001;
     */
    static inline
    void train(int user, int movie, real rating)
    {
        real err = lrate * (rating - predictRating(movie, user));
        userValue[user] += err * movieValue[movie];
        movieValue[movie] += err * userValue[user];
    }

predictRating() here would also use userValue and movieValue to do its work, so there's a tight feedback loop. I mention the "static inline" and cram the lrate into err just to make the point that: this is the inside of the inner loop, and every clock cycle counts. My wee laptop is able to do a training pass through the entire data set of 100 million ratings in about seven and a half seconds.

Slightly uglier but more correct, unless you're using an atemporal programming language you will want to do:

    uv = userValue[user];
    userValue[user] += err * movieValue[movie];
    movieValue[movie] += err * uv;

Anyway, this will train one feature (aspect), and in particular will find the most prominent feature remaining (the one that will most reduce the error that's left over after previously trained features have done their best).
When it's as good as it's going to get, shift it onto the pile of done features, and start a new one. For efficiency's sake, cache the residuals (all 100 million of them) so when you're training feature 72 you don't have to wait for predictRating() to re-compute the contributions of the previous 71 features. You will need 2 Gig of RAM, a C compiler, and good programming habits to do this.

There remains the question of what to initialize a new feature to. Unlike backprop and many other gradient descent algorithms, this one isn't really subject to local minima that I'm aware of, which means it doesn't really matter. I initialize both vectors to 0.1, 0.1, 0.1, 0.1, .... Profound, no? (How it's initialized actually does matter a bit later, but not yet...)

The end result, it's worth noting, is exactly an SVD if the training set perfectly covers the matrix. Call it what you will when it doesn't. (If you're wondering where the diagonal scaling matrix is, it gets arbitrarily rolled in to the two side matrices, but could be trivially extracted if needed.)

A host of refinements: Prior to even starting with the SVD, one can get a good head start by noting the average rating for every movie, as well as the average offset between a user's rating and the movie's average rating, for every user. I.e., the prediction method for this baseline model is:

    static inline
    real predictRating_Baseline(int movie, int user)
    {
        return averageRating[movie] + averageOffset[user];
    }

So, that's the return value of predictRating before the first SVD feature even starts training. You would think the average rating for a movie would just be... its average rating! Alas, Occam's razor was a little rusty that day.
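The feature-at-a-time loop with cached residuals might look roughly like the sketch below. The `Rating` struct, the function names, and the epoch count are assumptions made for the illustration; the update itself is the temporary-variable form given in the text, and the toy learning rate in the usage note is chosen for a four-rating example, not the 0.001 used on the real data.

```c
#include <stddef.h>

typedef struct {
    int user;
    int movie;
    double rating;
} Rating;

/* Train one feature by gradient descent against cached residuals.
   residual[i] is rating i minus the prediction of all earlier features,
   so earlier features never need to be recomputed here. */
void train_feature(const Rating *data, const double *residual, size_t n,
                   double *user_value, double *movie_value,
                   double lrate, int epochs)
{
    for (int e = 0; e < epochs; e++) {
        for (size_t i = 0; i < n; i++) {
            int u = data[i].user, m = data[i].movie;
            double err = lrate * (residual[i] - user_value[u] * movie_value[m]);
            double uv = user_value[u];   /* old value, needed for the movie update */
            user_value[u] += err * movie_value[m];
            movie_value[m] += err * uv;
        }
    }
}

/* Once a feature is done, fold its contribution into the residual cache. */
void fold_into_residuals(const Rating *data, double *residual, size_t n,
                         const double *user_value, const double *movie_value)
{
    for (size_t i = 0; i < n; i++)
        residual[i] -= user_value[data[i].user] * movie_value[data[i].movie];
}

/* Sum of squared residuals, handy for watching the error fall. */
double sum_squared(const double *residual, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += residual[i] * residual[i];
    return sum;
}
```

On a toy 2 x 2 table whose ratings are 1, 2, 2, 4 (exactly rank one), starting both vectors at 0.1 and training with a learning rate of 0.01 should drive the squared residual from 25 down to essentially zero after one feature. In a fuller version the residuals would start from the baseline prediction rather than from the raw ratings.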

  5. Trouble is, what if there's a movie which only appears in the training set once, say with a rating of 1. Does it have an average rating of 1? Probably not! In fact you can view that single observation as a draw from a true probability distribution whose average you want... and you can view that true average itself as having been drawn from a probability distribution of averages--the histogram of average movie ratings essentially. If we assume both distributions are Gaussian, then according to my shoddy math the actual best-guess mean should be a linear blend between the observed mean and the a priori mean, with a blending ratio equal to the ratio of variances. That is: If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean--e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the average variance is high, ratings tend to be more random and less indicative) then, with GlobalAverage being Ra:

    BogusMean = sum(ObservedRatings)/count(ObservedRatings)
    K = Vb/Va
    BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

But in fact K=25 seems to work well so I used that instead. :)

The same principle applies to computing the user offsets. The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the a priori average than the sparsely observed average. Note if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating as one would expect.

Moving on: 20 million free parameters is still rather a lot for a training set with only 100 million examples.
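Returning briefly to the blended movie mean: the BetterMean formula is short enough to write out directly. In this sketch the function name is invented, and the global average of 3.6 used in the usage note is an illustrative number, not one taken from the text.

```c
/* Blend a movie's observed ratings with the global average.
   k acts like k "phantom" ratings at the global average, so sparsely
   rated movies are pulled toward it and well-rated movies barely move. */
double better_mean(double sum_ratings, int count, double global_average,
                   double k)
{
    return (global_average * k + sum_ratings) / (k + count);
}
```

With K=25 and a global average of 3.6, a movie seen once with a rating of 1 gets (90 + 1) / 26 = 3.5 rather than 1, a movie with no ratings gets 3.6, and a movie with a thousand ratings of 1 ends up at about 1.06.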
While it seems like a neat idea to just ignore all those blank spaces in the implicit ratings matrix, the truth is we have some expectations about what's in them, and we can use that to our advantage. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. To give an example, imagine you have a user who has only rated one movie, say American Beauty. Let's say they give it a 2 while the average is (just making something up) 4.5, and further that their offset is only -1, so we would, prior to even employing the SVD, expect them to rate it 3.5. So the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). Now imagine that the current movie-side feature, based on broader context, is training up to measure the amount of Action, and let's say that's a paltry 0.01 for American Beauty (meaning it's just slightly more than average). The SVD, recall, is trying to optimize our predictions, which it can do by eventually setting our user's preference for Action to a huge -150.0. I.e., the algorithm naively looks at the one and only example it has of this user's preferences, in the context of the one and only feature it knows about so far (Action), and determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for because those random apparent correlations average out and the true trends dominate. So, once again, we need to account for priors. As with the average movie ratings, it would be nice to be able to blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. 
But if you look at where the incremental algorithm theoretically converges, you get:

    userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2)]

The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which through various gyrations I won't bore you with leads to:

    userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

And finally back to:

    userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]);
    movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

This is essentially equivalent to penalizing the magnitude of the features, and so is probably related to Tikhonov regularization. The point: to try to cut down on over fitting, ultimately allowing use of more features. Last, Vincent liked K=0.02 or so, with well over 100 features (singular vector pairs--if you can still call them that).

Moving on: As I mentioned a few entries ago, linear models are pretty limiting. Fortunately, we've bastardized the whole matrix analogy so much by now that we aren't really restricted to linear models any more: We can add non-linear outputs such that instead of predicting with:

    sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

  6. We can use:

    sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

Two choices for G proved useful. One is to simply clip the prediction to the range 1-5 after each component is added in. That is, each feature is limited to only swaying the rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4 because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our performance, and having trained a stage with clipping on we should use it with clipping on. However, I did not really play with this extensively enough to determine there wasn't a better strategy.

A second choice for G is to introduce some functional non-linearity such as a sigmoid. I.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward. The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general and the one that worked the best was to simply fit a piecewise linear approximation to the true output vs. target-output curve. That is, if you plot the true output of a given stage vs the average target output, the linear model assumes this is a nice 45 degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones.
That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality. Or put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity and so on... This introduces new free parameters and encourages over fitting especially for the later features which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that. Moving on: Despite the regularization term in the final incremental law above, over fitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started over fitting. Note that now it does matter how you initialize the vectors: Since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point. I wonder if a better regularization couldn't eliminate overfitting altogether, something like Dirichlet priors in an EM approach--but I tried that and a few others and none worked as well as the above. 
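Two of the pieces discussed above, clipping after each feature and the regularized update with constant K, can be sketched as follows. The function names and the pointer-based interface are assumptions for the example; unlike the earlier train() routine, `err` here is the plain residual (rating minus prediction), with lrate applied inside, matching the final update rule in the text.

```c
#include <stddef.h>

/* Clamp a running prediction to the valid rating range 1..5. */
double clip_rating(double x)
{
    return x < 1.0 ? 1.0 : (x > 5.0 ? 5.0 : x);
}

/* Add feature contributions one at a time, clipping after each stage,
   so excess beyond the range is lost rather than carried over. */
double predict_clipped(double baseline, const double *user_pref,
                       const double *movie_aspect, size_t n_features)
{
    double sum = clip_rating(baseline);
    for (size_t f = 0; f < n_features; f++)
        sum = clip_rating(sum + user_pref[f] * movie_aspect[f]);
    return sum;
}

/* One regularized gradient step for a single rating.
   The -k terms shrink both values toward zero on every update. */
void train_regularized(double *user_value, double *movie_value,
                       double err, double lrate, double k)
{
    double uv = *user_value;   /* old value, needed for the movie update */
    *user_value += lrate * (err * *movie_value - k * uv);
    *movie_value += lrate * (err * uv - k * *movie_value);
}
```

Starting from a baseline of 3, a first feature worth +10 and a second worth -1 give 4 rather than 5, as in the text's example. Calling `train_regularized` a fixed number of times per rating (the text mentions about 120 epochs per feature) is the early-stopping scheme described above.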
Here is the probe and training rmse for the first few features with and w/o regularization term "decay" enabled.

[Figure: probe and training rmse for the first few features, with and without the regularization term]

Same thing, just the probe set rmse, further along where you can see the regularized version pulling ahead:

[Figure: probe set rmse over later features, with and without the regularization term]

Same plot again, but this time showing probe rmse (vertical) against train rmse (horizontal). Note how the regularized version has better probe performance relative to the training performance:

[Figure: probe rmse (vertical) against train rmse (horizontal)]

Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and while many of them have worked well up front, none held their advantage long enough to actually improve the final result. If you notice any obvious errors or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me know. And of course, I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share,

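  7. The ratings matrix R (rows u_1 ... u_TestSizeU, columns i_1 ... i_TestSizeI) is written as the product of the user-feature matrix U^T (rows u_1 ... u_TestSizeU, columns f_1 ... f_F) and the feature-item matrix I (rows f_1 ... f_F, columns i_1 ... i_TestSizeI):

$$R = U^T \circ I$$

$$\hat{r}_{u,i} = u \circ i = \sum_{f=1}^{F} r_{u,f}\, r_{f,i}$$

$$u \mathrel{+}= \mathrm{lrate}\,\left(\varepsilon_{u,i}\, i^T - \lambda\, u\right) \quad \text{where } \varepsilon_{u,i} = r_{u,i} - \hat{r}_{u,i}$$

where r_{u,i} = actual rating value.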

  8. Maximizing the Variance

Given any table, X(X1, ..., Xn), and any unit vector, d, in n-space, let

$X \circ d = F_d(X) = DPP_d(X) = (x_1 \circ d, x_2 \circ d, \dots, x_N \circ d)^T$

$V(d) \equiv Variance(X \circ d) = \overline{(X \circ d)^2} - (\overline{X \circ d})^2$
$= \frac{1}{N} \sum_{i=1..N} (\sum_{j=1..n} x_{i,j} d_j)^2 - (\sum_{j=1..n} \overline{X_j} d_j)^2$
$= \frac{1}{N} \sum_i (\sum_j x_{i,j}^2 d_j^2 + 2 \sum_{j<k} x_{i,j} x_{i,k} d_j d_k) - (\sum_j \overline{X_j}^2 d_j^2 + 2 \sum_{j<k} \overline{X_j}\,\overline{X_k} d_j d_k)$
$= \sum_{j=1..n} (\overline{X_j^2} - \overline{X_j}^2) d_j^2 + 2 \sum_{j<k} (\overline{X_j X_k} - \overline{X_j}\,\overline{X_k}) d_j d_k$

With $a_{ij} = \overline{X_i X_j} - \overline{X_i}\,\overline{X_j}$ and $A = (a_{ij})$, we can separate out the diagonal or not:

$V(d) = \sum_j a_{jj} d_j^2 + \sum_{j \ne k} a_{jk} d_j d_k = d^T \circ A \circ d$, subject to $\sum_{i=1..n} d_i^2 = 1$

$\nabla V(d) \equiv Gradient(V) = 2 A \circ d$, where $2A$ is the n-by-n matrix with entries $2 a_{ij}$.

For any d0, one can hill-climb it to locally maximize the variance, V, as follows: d1 ≡ ∇V(d0); d2 ≡ ∇V(d1); ..., keeping each dk a unit vector per the constraint above.

Ubhaya Theorem1: ∃k∈{1,...,n} s.t. d=ek will hill-climb V to its global maximum.
Theorem2 (working on it): Let d=ek s.t. akk is a maximal diagonal element of A; then d=ek will hill-climb V to its global maximum.

How do we use this theory?

For Dot Product gap based Clustering, we can hill-climb (starting from akk, as in the theorems above) to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

For Dot Product Gap based Classification, we can start with X = the table of the C Training Set Class Means (rows M1, M2, ..., MC), where Mk ≡ MeanVectorOfClassk. Then $\overline{X_i}$ = Mean(X)i and $\overline{X_i X_j}$ = Mean(Mi1 Mj1, ..., MiC MjC). These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot product projections of the class means.

FAUST Classifier MVDI (Maximized Variance Definite Indefinite): Build a Decision tree.
1. Find d that maximizes variance of dot product projections of class means each round.
2. Apply DI each round.

FAUST technology relies on:
1. A distance dominating functional, F.
2. Use of gaps in range(F) to separate.

For Unsupervised (Clustering): Hierarchical Divisive? Piecewise Linear? Other? Perf Anal (which approach is best for which type of table?)
For Supervised (Classification): Decision Tree? Nearest Nbr? Piecewise Linear? Perf Anal (which is best for training set?)

White papers: Terabyte Head Wall. The Only Good Data is Data in Motion.

Multilevel pTrees: k=0,1 suffices! A PTreeSet is defined by specifying a table, an array of stride_lengths (usually equi-length, so just that one length is specified) and a stride_predicate (a T/F condition on a stride; stride = bag [or array?] of bits). So the metadata of PTreeSet(T,sl,sp) specifies T, sl and sp. A “raw” PTreeSet has sl=1 and the identity predicate (sl and sp not used). A “cooked” PTreeSet (AKA Level-1 PTreeSet) is one for a table with sl > 1 (main purpose: provide compact summary information on the table). Let PTS(T) be a raw PTreeSet; then it, plus PTS(T,64,p), ..., PTS(T,64^k,p), form a tree of vertical summarizations of T. Note that P(T, 64*64, p) is different from P(P(T,64,p), 64, p), but both make sense since P(T, 64, p) is a table and P(P(T, 64, p), 64, p) is just a cooked pTree on it.
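The variance hill-climb described on this slide can be sketched as below. This is a minimal sketch, assuming NumPy; the names `covariance_matrix` and `hill_climb_variance`, the stopping tolerance, and the toy table are illustrative. Renormalizing the gradient to a unit vector at each step is an assumption taken from the unit-vector constraint, and the start at d = ek with maximal akk follows Theorem2.

```python
import numpy as np

def covariance_matrix(X):
    """A with a_jk = mean(X_j * X_k) - mean(X_j) * mean(X_k)."""
    X = np.asarray(X, dtype=float)
    col_means = X.mean(axis=0)
    return X.T @ X / X.shape[0] - np.outer(col_means, col_means)

def hill_climb_variance(A, max_iter=1000, tol=1e-12):
    """Climb V(d) = d^T A d over unit vectors, starting at e_k with max a_kk."""
    k = int(np.argmax(np.diag(A)))
    d = np.zeros(A.shape[0])
    d[k] = 1.0
    v = float(d @ A @ d)
    for _ in range(max_iter):
        grad = 2.0 * (A @ d)            # Gradient(V) = 2 A o d
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        d_next = grad / norm            # rescale back to a unit vector
        v_next = float(d_next @ A @ d_next)
        if v_next <= v + tol:           # stop when the variance stops improving
            break
        d, v = d_next, v_next
    return d, v

# Toy table with two strongly correlated columns (illustrative data)
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
A = covariance_matrix(X)
d, v = hill_climb_variance(A)
print(d, v)
```

Each step is a normalized multiplication by A, so for a covariance matrix the variance never decreases from one step to the next.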

  9. FAUST MVDI (-1, 16.5=avg{23,10})s sCt=50 (16.5, 38)e eCt=24 (48.128)i iCt=39 d=(.33, -.1, .86, .38) (-1,8)e Ct=21 (10,128)i Ct=9 indef[38, 48]se_i seCt=26 iCt=13 indef[8,10]e_i eCt=5 iCt=4 Definite Indefinite i-Mean 62.8 29.2 46.1 14.5 i -1 8 e-Mean 59 26.9 49.6 18.4 e 10 17 i_e 8 10 empty d=(-.55, -.33, .51, .57) d0=(.33, -.1, .86,.38) 16.5  xod0 < 38 xod0 < 16.5 38  xod0 48 48 < xod0 Setosa Virginica Versicolor d1=(-.55, -.33, .51, .57) xod1 < 9 xod1 9 Virginica Versicolor on IRIS 15 records from each Class for Testing (Virg39 was removed as an outlier.) Definite_____ Indefinite s-Mean 50.49 34.74 14.74 2.43 s -1 10 e-Mean 63.50 30.00 44.00 13.50 e 23 48 s_ei 23 10 empty i-Mean 61.00 31.50 55.50 21.50 i 38 70 se_i 38 48 In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals; resulting in decision tree:

  10. FAUST MVDI SatLog 413train 4atr 6cls 127test Using class means: FoMN Ct min max max+1 mn4 83 101 104 82 113 8 110 121 122 mn3 85 103 108 85 117 79 105 128 129 mn1 69 106 115 94 133 12 123 148 149 Using full data: (much better!) mn4 83 101 104 82 59 8 56 65 66 mn3 85 103 108 85 62 79 52 74 75 mn1 69 106 115 94 81 12 73 95 96 d=(0.39 0.89 0.35 0.10 ) F[a,b) 0 92 104 118 127 146 156 157 161 179 190 Class 2 2 2 2 2 2 5 5 5 5 7 7 7 7 7 7 1 1 1 1 1 1 1 4 4 4 4 4 3 3 3 3 d=(-.11 -.22 .54 .81) F[a,b) 89 102 Class 5 2 d=(-.15 -.29 .56 .76) F[a,b) 47 65 81 101 Class 7 5 5 2 2 d=(-.81, .17, .45, .33) F[a,b) 21 3541 59 Class 3 1 d=(-.01, -.19, .7, .69) d=(-.66, .19, .47, .56) F[a,b) 57 6169 87 Class 5 7 F[a,b) 5256667375 Class 333 3 4 11 cl=4 cl=7 Cl=7 Gradient Hill Climb of Variance(d) d1 d2 d3 d4 Vd) 0.00 0.00 1.00 0.00 282 0.13 0.38 0.64 0.65 700 0.20 0.51 0.62 0.57 742 0.26 0.62 0.57 0.47 781 0.30 0.70 0.53 0.38 810 0.34 0.76 0.48 0.30 830 0.36 0.79 0.44 0.23 841 0.37 0.81 0.40 0.18 847 0.38 0.83 0.38 0.15 850 0.39 0.84 0.36 0.12 852 0.39 0.84 0.35 0.10 853 Fomn Ct min max max+1 mn2 49 40 115 119 106 108 91 155 156 mn5 58 58 76 64 108 61 92 145 146 mn7 69 77 81 64 131 154 104 160 161 mn4 78 91 96 74 152 60 127 178 179 mn1 67 103 114 94 167 27 118 189 190 mn3 89 107 112 88 178 155 157 206 207 Gradient Hill Climb of Var(d)on t25 d1 d2 d3 d4 Vd) 0.00 0.00 0.00 1.00 1137 -0.11 -0.22 0.54 0.81 1747 MNod Ct ClMn ClMx ClMx+1 mn2 45 33 115 124 150 54 102 177 178 mn5 55 52 72 59 69 33 45 88 89 Gradient Hill Climb of Var(d)on t257 0.00 0.00 1.00 0.00 496 -0.15 -0.29 0.56 0.76 1595 Same using class means or training subset. Gradient Hill Climb of Var(d)on t75 0.00 0.00 1.00 0.00 12 0.04 -0.09 0.83 0.55 20 -0.01 -0.19 0.70 0.69 21 Gradient Hill Climb of Var(d)on t13 0.00 0.00 1.00 0.00 29 -0.83 0.17 0.42 0.34 166 0.00 0.00 1.00 0.00 25 -0.66 0.14 0.65 0.36 81 -0.81 0.17 0.45 0.33 88 On the 127 sample SatLog TestSet: 4 errors or 96.8% accuracy. speed? 
With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree Decision Tree, we take the entire TestSet (a PTreeSet), create the various dot product SPTS (one for each inode), create ut SPTS Masks. These masks mask the results for the entire TestSet. Gradient Hill Climb of Var(d)on t143 0.00 0.00 1.00 0.00 19 -0.66 0.19 0.47 0.56 95 0.00 0.00 1.00 0.00 27 -0.17 0.35 0.75 0.53 54 -0.32 0.36 0.65 0.58 57 -0.41 0.34 0.62 0.58 58 For WINE: min max+1 8.40 10.33 27.00 9.63 28.65 9.9 53.4 7.56 11.19 32.61 10.38 34.32 7.7 111.8 8.57 12.84 30.55 11.65 32.72 8.7 108.4 8.91 13.64 34.93 11.97 37.16 13.1 92.2 Awful results! Gradient Hill Climb of Var t156161 0.00 0.00 1.00 0.00 5 -0.23 -0.28 0.89 0.28 19 -0.02 -0.06 0.12 0.99 157 0.02 -0.02 0.02 1.00 159 0.00 0.00 1.00 0.00 1 -0.46 -0.53 0.57 0.43 2 Inconclusive both ways so predict purality=4(17) (3ct=3 tct=6 Gradient Hill Climb of Var t146156 0.00 0.00 1.00 0.00 0 0.03 -0.08 0.81 -0.58 1 0.00 0.00 1.00 0.00 13 0.02 0.20 0.92 0.34 16 0.02 0.25 0.86 0.45 17 Inconclusive both ways so predict purality=4(17) (7ct=15 2ct=2 Gradient Hill Climb of Var t127 0.00 0.00 1.00 0.00 41 -0.01 -0.01 0.70 0.71 90 -0.04 -0.04 0.65 0.75 91 0.00 0.00 1.00 0.00 35 -0.32 -0.14 0.59 0.73 105 Inconclusive predict purality=7(62 4(15) 1(5) 2(8) 5(7)

  11. FAUST MVDI Concrete d0= -0.34 -0.16 0.81 -0.45 xod3<969 xod0<320 xod2<28 xod>=19.3 xod2>=662 xod2>=92 xod0>=634 xod>=18.6 d1= .85 -.03 .52 -.02 d2= .85 -.00 .53 .05 Class=m (test:1/1) Class= l or m Cl=l *test 6/9) Class=m errs0/1) Class=m errs8/12) Cl=h (test:11/12) Class=m errs0/4) Class=m errs0/0) Class=l (test:1/1) Class=m (test:2/2) xod<13.2 xod<13.2 .00 .00 1.00 .00 1.0 8.0 6 4 l 4.0 5.0 0 0 m 2.0 9.0 0 0 h 0 2 2 99 .97 .19 .08 .16 d1 13.4 19.6 0 0 l 16.9 19.9 4 3 m 13.5 16.0 0 0 h 0 13.45 18.6 99 0.97 0.19 0.06 0.15 14.4 19.6 0 0 l 16.8 18.8 0 0 m 13.5 15.8 11 1 h 0 14.366 17.816 99 Class=l errs:0/4) Class=h errs:0/5) Class=h errs:0/5) Class=h errs:0/1) d3= .81 .04 .58 .01 xod4>=681 xod3>=868 Cl=m (test:1/1) Cl=l (test:0/3) d4 = .79 .14 .60 .03 xod4<640 Cl=l *test 2/2) xod3<544 Cl=m *test 0/0) 7 test errors / 30 = 77% For Concrete min max+1 train 335.3 657.1 0 l 120.5 611.6 12 m 321.1 633.5 0 h Test 0 l ****** 1 m ****** 0 h ****** 0 321 3.0 57.0 0 l 3.0 361.0 11 m 28.0 92.0 0 h 0 l ***** 2 m ***** 0 h 92 ***** 999 .97 .17 -.02 .15 d0 13.3 19.3 0 0 l 16.4 23.5 0 0 m 12.2 15.2 25 5 h 0 13.2 19.3 23.5 Seeds d3 547.9 860.9 4 l 617.1 957.3 0 m 762.5 867.7 0 h 0 l ******* 0 m ******* 0 h . 0 ******* 617 8 test errors / 32 = 75% d2 544.2 651.5 0 l 515.7 661.1 0 m 591.0 847.4 40 h 1 l ****** 0 m ****** 11 h 662 ****** 999

  12. FAUST Classifier

0. Cut in the middle of the means: a = (mR+(mV-mR)/2)od = ((mR+mV)/2)od, where D ≡ mR→mV and d = D/|D|. PR = P(xod < a), PV = P(xod ≥ a).
1. Cut in the middle of the VectorOfMedians (VOM), not the means. Use the stdev ratio, not the middle, for even better cut placement?
2. Cut in the middle of {Max{Rod}, Min{Vod}} (assuming mRod ≤ mVod). If there is no gap, move the cut to minimize Rerrors + Verrors.
3. Hill-climb d to maximize the gap, or to minimize training set errors, or (simplest) to minimize dis(max{rod}, min{vod}).
4. Replace mR, mV with the avg of the margin points?
5. PR = P(xod < CutR), PV = P(xod > CutV). If Min{Vod} ≥ Max{Rod} (a gap exists), CutR = CutV = avg{Min{Vod}, Max{Rod}}; else CutR ≡ Min{Vod}, CutV ≡ Max{Rod}.

y∈PR or y∈PV are Definite classifications; else re-do on the Indefinite region, P(CutR ≤ xod ≤ CutV), until there is an actual gap (AND with a certain stop condition? E.g., "On the nth round, use definite only (cut at midpt(mR,mV))").

Another way to view FAUST DI is that it is a Decision Tree Method. With each non-empty indefinite set, descend down the tree to a new level. For each definite set, terminate the descent and make the classification.

Each round, it may be advisable to go through an outlier removal process on each class before setting Min{Vod} and Max{Rod} (e.g., iteratively check if F^-1(Min{Vod}) consists of V-outliers).

[Figure: scatter of r and v points over dim 1 and dim 2, showing mR, mV, vomR, vomV, the d-line and d2-line, the cut point a, MnVod and MaxRod.]
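Step 0 above (cut in the middle of the means) can be sketched as follows. This is a minimal sketch on horizontal NumPy arrays rather than pTrees; the function names `fit_midpoint_cut` and `classify` and the toy points are illustrative.

```python
import numpy as np

def fit_midpoint_cut(R, V):
    """Return the unit vector d from mR toward mV and the cut point a."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR
    d = D / np.linalg.norm(D)
    a = ((mR + mV) / 2.0) @ d       # projection of the midpoint of the means
    return d, a

def classify(X, d, a):
    """x o d < a goes to class R, otherwise to class V."""
    return np.where(X @ d < a, "R", "V")

# Toy training classes (illustrative data)
R = np.array([[0.0, 0.0], [1.0, 0.0]])
V = np.array([[4.0, 0.0], [5.0, 0.0]])
d, a = fit_midpoint_cut(R, V)
labels = classify(np.array([[2.0, 0.0], [3.0, 0.0]]), d, a)
print(d, a, labels)
```

The later steps on the slide (medians instead of means, cutting between Max{Rod} and Min{Vod}) change only how `a` is chosen; the projection and comparison stay the same.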

  13. FAUST DI

K-class training set, TK, and a given d (e.g., from D ≡ MeanTK→MedTK):
Let mi ≡ meanCi s.t. dom1 ≤ dom2 ≤ ... ≤ domK
Mni ≡ Min{doCi}, Mxi ≡ Max{doCi}
Mn>i ≡ Min j>i {Mnj}, Mx<i ≡ Max j<i {Mxj}
Definitei = ( Mx<i, Mn>i )
Indefinitei,i+1 = [ Mn>i, Mx<i+1 ]
Then recurse on each Indefinite.

For IRIS, 15 records were extracted from each Class for Testing. The rest are the Training Set, TK. D = MEANs→MEANe

          Class mean                    Definite      Indefinite
s-Mean    50.49 34.74 14.74  2.43      s   -1  25
e-Mean    63.50 30.00 44.00 13.50      e   10  37    se  25 10  empty
i-Mean    61.00 31.50 55.50 21.50      i   48 128    ei  37 48

Training:
1ST ROUND, D = Means→Meane:
  F < 18 → setosa (35 seto)
  18 < F < 37 → versicolor (15 vers)
  37 ≤ F ≤ 48 → IndefiniteSet2 (20 vers, 10 virg)
  48 < F → virginica (25 virg)
IndefSet2 ROUND, D = Meane→Meani:
  F < 7 → versicolor (17 vers, 0 virg)
  7 ≤ F ≤ 10 → IndefSet3 (3 vers, 5 virg)
  10 < F → virginica (0 vers, 5 virg)
IndefSet3 ROUND, D = Meane→Meani:
  F < 3 → versicolor (2 vers, 0 virg)
  3 ≤ F ≤ 7 → IndefSet4 (2 vers, 1 virg)
  7 < F → virginica (0 vers, 3 virg)
  Here we will assign 0 ≤ F ≤ 7 versicolor and 7 < F virginica.

Test:
1ST ROUND, D = Means→Meane:
  F < 15 → setosa (15 seto)
  15 < F < 15 → versicolor (0 vers, 0 virg)
  15 ≤ F ≤ 41 → IndefiniteSet2 (15 vers, 1 virg)
  41 < F → virginica (14 virg)
IndefSet2 ROUND, D = Meane→Meani:
  F < 20 → versicolor (15 vers, 0 virg)
  20 < F → virginica (0 vers, 1 virg)
100% accuracy.

Option-1: The sequence of D's is: Mean(Classk)→Mean(Classk+1), k=1... (and Mean could be replaced by VOM or?)
Option-2: The sequence of D's is: Mean(Classk)→Mean(∪h=k+1..n Classh), k=1... (and Mean could be replaced by VOM or?)
Option-3: D seq: Mean(Classk)→Mean(∪h not used yet Classh), where k is the Class with max count in subcluster (VoM instead?)
Option-2: D seq: Mean(Classk)→Mean(∪h=k+1..n Classh) (VOM?), where k is the Class with max count in subcluster.
Option-4: D seq: always pick the means pair which are furthest separated from each other.
Option-5: D: Start with Median-to-Mean of IndefiniteSet, then the means pair corresponding to max separation of F(meani), F(meanj).
Option-6: D: Always use Median-to-Mean of IndefiniteSet, IS. (initially, IS=X)
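The Definite and Indefinite interval definitions on this slide can be sketched as follows. This is a minimal sketch in plain Python; the function name `definite_indefinite` and the toy projection values are illustrative, and the classes are assumed to be passed already ordered by their mean projection (dom1 ≤ dom2 ≤ ...).

```python
def definite_indefinite(projections):
    """projections: one list of d o x values per class, ordered by class mean.

    Returns (definite, indefinite): definite[i] is the open interval
    (Mx<i, Mn>i); indefinite[i] is the closed interval [Mn>i, Mx<i+1]
    between class i and class i+1. An interval with low > high is empty.
    """
    K = len(projections)
    mn = [float(min(p)) for p in projections]   # Mn_i
    mx = [float(max(p)) for p in projections]   # Mx_i
    definite, indefinite = [], []
    for i in range(K):
        low = max(mx[:i], default=float("-inf"))      # Mx<i
        high = min(mn[i + 1:], default=float("inf"))  # Mn>i
        definite.append((low, high))
        if i < K - 1:
            indefinite.append((high, max(mx[:i + 1])))  # [Mn>i, Mx<i+1]
    return definite, indefinite

# Toy projections for three classes (illustrative values, not IRIS)
classes = [[0, 1, 2, 5], [4, 6, 8], [7, 9, 10]]
definite, indefinite = definite_indefinite(classes)
print(definite)
print(indefinite)
```

Recursing on each non-empty indefinite interval with a new d, as the slide describes, would reapply the same function to the training points whose projection falls inside that interval.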

  14. FAUST DI sequential For SEEDS 15 records were extracted from each Class for Testing. Cl=1 2 3 0 0 0 0 0 1 Cls3 outlier (F=0) Cl=1 2 3 0 0 0 1 0 0 Cls1 outlier (F=29) Cl=1 2 3 0 0 0 0 0 0 done! declare Class=1 Cl=1 2 3 0 0 0 1 0 0 Cls1 outlier(F=54) Cl=1 2 3 5 0 2 Cl=1 2 3 5 0 3 Cl=1 2 3 6 0 3 Cl=1 2 3 5 0 2 m1 13.2 5.2 4.0 5.0 9 avF1 DEFINITE INDEFINITE def3[ -inf 0 ) m3 13.0 5.0 4.0 5.0 6 avF3 def1[ 13 inf ) in11[ 0 13 ) m1 13.0 5.2 3.6 5.0 13 avF1 DEFINITE INDEFINITE def3[ -inf 9 ) m3 13.0 5.0 4.0 5.0 9 avF3 def1[ 19 inf ) in1111[ 9 19 ) m1 13.0 5.1 3.7 5.0 30 avF1 DEFINITE INDEFINITE def3[ -inf 0 ) m3 13.0 5.0 4.0 5.0 27 avF3 def1[ 37 inf ) in11[ 0 37 ) m1 13.0 5.2 3.6 5.0 13 avF1 DEFINITE INDEFINITE def3[ -inf 9 ) m3 13.0 5.0 4.0 5.0 9 avF3 def1[ 19 inf ) in111[ 9 19 ) On Indef-11 On Indef-111 On Indef-1111 On Indef-1 Option-4, means pair most separated in X. m1 14.4 5.6 2.7 5.1 4.4 d(m1,m2) DEFINITE INDEFINITE m2 18.6 6.2 3.7 6.0 3.4 d(m1,m3) 2 -inf 0 m3 11.8 5.0 4.7 5.0 7.0 d(m2,m3) 1 106 0 12 0 106 0  F  106, 3 106 inf 23 0 106 so totally non-productive! Option-6: D Median-to-Mean of IndefSet (initially IS=X) m1 14.4 5.6 2.7 5.1 37.3 meanF1 DEFINITE Cl=1 2 3 INDEFINITE m2 18.6 6.2 3.7 6.0 71.2 meanF2 def3[ -inf 21) 0 0 32 m3 11.8 5.0 4.7 5.0 `2.0 meanF3 def1[ 28 49) 22 0 0 ind1[ 21 28 ) On whole TR def2[ 58 inf) 0 30 0 ind2[ 49 58 )

  15. FAUST DI sequential For SEEDS 15 records were extracted from each Class for Testing. D Mean(loF)-to-Mean(hiF) of IndefSet12 D Mean(loF)-to-Mean(hiF) of IndefSet313131 (d repeats after this so=C1 D Mean(loF)-to-Mean(hiF) of IndefSet31 D Mean(loF)-to-Mean(hiF) of IndefSet1313 Cl=1 2 3 5 0 0 0 5 0 Cl=1 2 3 0 0 1 1 0 0 Cl=1 2 3 1 0 0 0 0 0 Cl=1 2 3 0 0 0 1 0 0 The rest, Class=1 Cl=1 2 3 . 5 0 2 Cl=1 2 3 . 4 0 2 Cl=1 2 3 . 0 0 0 Cl=1 2 3 . 6 0 3 m1 16.2 6.0 1.8 5.2 5.8 avF1 DEFINITE INDEFINITE m2 16.6 6.0 4.6 6.0 6.2 avF2 def1[ -inf 2 ) def2[ 15 inf ) in1212[ 2 15 ) m1 12.8 5.2 3.2 5.0 18 avF1 DEFINITE INDEFINITE m3 13.0 5.0 4.0 5.0 10 avF3 def3[ -inf 10 ) . def1[ 20 inf ) in313131[ 10 20 ) . m1 13.0 5.1 3.7 5.0 30 avF1 DEFINITE INDEFINITE m3 13.0 5.0 4.0 5.0 27 avF3 def1[-inf 18 ) . def3[ 55 inf ) in1313[ 18 55 ) m1 13.0 5.2 3.6 5.0 4 avF1 DEFINITE INDEFINITE m3 13.0 5.0 3.5 5.0 2 avF3 def1[ -inf 0 ) def3[ 5 inf ) C1= [ 0 5 ) Option-6: D Median-to-Mean of X m1 14.4 5.6 2.7 5.1 37.3 meanF1 DEFINITE Cl=1 2 3 INDEFINITE m2 18.6 6.2 3.7 6.0 71.2 meanF2 def3[ -inf 21) 0 0 32 m3 11.8 5.0 4.7 5.0 `2.0 meanF3 def1[ 28 49) 22 0 0 ind31[ 21 28 ) On whole TR def2[ 58 inf) 0 30 0 ind12[ 49 58 ) [-inf, 21)class=3 [28, 49)class=2 [58.inf) class=3 d=(.,9, -,1, -.2, -.2) [21,28)ind31 d=(-.9, -.1, .14, -.1)[49, 58)ind12 d=(0, .31, -.9, 0) [-inf,18)def[49, 58)ind23

  16. Finding a good unit vector, d, for the Dot Prod functional, DPP, to maximize gaps

$X \circ d = F_d(X) = DPP_d(X) = (x_1 \circ d, x_2 \circ d, \dots, x_N \circ d)^T$

$V(d) \equiv Var(DPP_d(X)) = \overline{(X \circ d)^2} - (\overline{X \circ d})^2$
$= \frac{1}{N} \sum_{i=1..N} (\sum_{j=1..n} x_{i,j} d_j)^2 - (\sum_{j=1..n} \overline{X_j} d_j)^2$
$= \sum_{j=1..n} (\overline{X_j^2} - \overline{X_j}^2) d_j^2 + 2 \sum_{j<k} (\overline{X_j X_k} - \overline{X_j}\,\overline{X_k}) d_j d_k$
$= \sum_j a_{jj} d_j^2 + \sum_{j \ne k} a_{jk} d_j d_k = d^T \circ VX \circ d$, subject to $\sum_{i=1..n} d_i^2 = 1$,

where $VX = A = (a_{ij})$ and $a_{ij} = \overline{X_i X_j} - \overline{X_i}\,\overline{X_j}$.

GRADIENT(V) = $2 A \circ d$, where $2A$ is the n-by-n matrix with entries $2 a_{ij}$.

Hill-climb: d0 = ek s.t. akk is max (or d0k = akk); d1 ≡ ∇V(d0); d2 ≡ ∇V(d1); ... til F(dk)

Maximize wrt d: |Mean(DPPd(X)) - Median(DPPd(X))|

$Mean(DPP_d(X)) = \frac{1}{N} \sum_{i=1..N} \sum_{j=1..n} x_{i,j} d_j = \sum_{j=1..n} (\frac{1}{N} \sum_i x_{i,j}) d_j = \sum_{j=1..n} \overline{X_j} d_j$

Compute Median(DPPd(X))? Want to use only pTree processing. Want a formula in d and numbers only (like the one above for the mean, which involves only the vector d and the numbers $\overline{X_1}, \dots, \overline{X_n}$).

FAUST CLUSTERING: Use DPPd(x), but which unit vector, d*, provides the best gap(s)?
1. DPPd exhaustively searches a grid of d's for the best gap provider.
2. Use some heuristic to choose a good d?
   GV: Gradient-optimized Variance.
   MM: Use the d that maximizes |Median(F(X)) - Mean(F(X))|. We have Avg as a function of d. Median? (Can you do it?)
   HMM: Use a heuristic for Median(F(X)): F(VectorOfMedians) = VOMod.
   MVM: Use D = MEAN(X)→VOM(X), d = D/|D|.

Maximize variance - is it wise?

Eight example sequences of 11 values each (one sequence per column; the middle row is the median):

              seq1   seq2   seq3   seq4   seq5   seq6   seq7   seq8
                 0      0      0      0      0      0      0      0
                 1      0      5      0      0      0      0      0
                 2      0      5      2      0      0      0      0
                 3      0      5      2      3      0      0      0
                 4      0      5      4      3      6      0      0
median           5      0      5      4      3      6      9      0
                 6      0      5      6      6      6      9     10
                 7      0      5      6      6      6      9     10
                 8      0      5      8      6      9      9     10
                 9      0      5      8      9      9      9     10
                10     10     10     10     10     10     10     10

std           3.16   2.87   2.13   3.20   3.35   3.82   4.57   4.98
variance      10.0    8.3    4.5   10.2   11.2   14.6   20.9   24.8
Avg           5.00   0.91   5.00   4.55   4.18   4.73   5.00   4.55

consecutive differences:
                 1      0      5      0      0      0      0      0
                 1      0      0      2      0      0      0      0
                 1      0      0      0      3      0      0      0
                 1      0      0      2      0      6      0      0
                 1      0      0      0      0      0      9      0
                 1      0      0      2      3      0      0     10
                 1      0      0      0      0      0      0      0
                 1      0      0      2      0      3      0      0
                 1      0      0      0      3      0      0      0
                 1     10      5      2      1      1      1      0

avgCD         1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
maxCD         1.00  10.00   5.00   2.00   3.00   6.00   9.00  10.00
|mean-VOM|    0.00   0.91   0.00   0.55   1.18   1.27   4.00   4.55

MEDIAN picks out the last 2 sequences, which have the best gaps (discounting outlier gaps at the extremes), and it discards 1, 3, 4, which are not so good.
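The summary rows of the table above (variance, maxCD, and |mean - median|) can be recomputed with a short script. This is a minimal sketch using only the Python standard library; the function name `gap_stats` is illustrative, and only three of the eight sequences are typed in here.

```python
import statistics

# Three of the example sequences from the table (columns 1, 7 and 8)
sequences = {
    1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    7: [0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 10],
    8: [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
}

def gap_stats(values):
    """Population variance, largest consecutive difference, |mean - median|."""
    ordered = sorted(float(v) for v in values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = statistics.mean(ordered)
    return {
        "variance": statistics.pvariance(ordered),
        "max_gap": max(diffs),
        "mean_median": abs(mean - statistics.median(ordered)),
    }

for name, values in sequences.items():
    print(name, gap_stats(values))
```

On these inputs the evenly spread sequence 1 gets a |mean - median| of zero, while sequences 7 and 8, which contain one wide interior gap, get the largest values.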

  17. FAUST Clustering, simple example: Gd(x)=xod Fd(x)=Gd(x)-MinG on a dataset of 15 image points 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0Level0, stride=z1 PointSet (as a pTree mask) z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf Fp=MN,q=z1=0 F=1 F=2 X x1 x21 2 3 4 5 6 7 8 9 a b 1 1 1 1=q 3 1 2 3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15 1 7 f 14 2 8 15 3 9 6 p d 13 4 a b 10 9 b c e 1110 c 9 11 d a 1111 e 8 7 8 f 7 9 The 15 Value_Arrays (one for each q=z1,z2,z3,...) z1 0 1 2 5 6 10 11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1 2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5 0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7 8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2 3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12 13 za 0 1 2 3 4 5 7 11 12 13 zb 0 1 2 3 4 6 8 10 11 12 zc 0 1 2 3 5 6 7 8 9 11 12 13 zd 0 1 2 3 7 8 9 10 ze 0 1 2 3 5 7 9 11 12 13 zf 0 1 3 5 6 7 8 9 10 11 The 15 Count_Arrays z1 2 2 4 1 1 1 1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2 1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5 2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3 3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3 1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2 1 za 2 1 1 1 1 1 4 1 1 2 zb 1 2 1 1 3 2 1 1 1 2 zc 1 1 1 2 2 1 1 1 1 1 1 2 zd 3 3 3 1 1 1 1 2 ze 1 1 2 1 3 2 1 1 2 1 zf 1 2 1 1 2 1 2 2 2 1 gap: [F=6, F=10] gap: [F=2, F=5] pTree masks of the 3 z1_clusters (obtained by ORing) z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
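The gap-based split for q=z1 can be reproduced from its Value_Array and Count_Array. This is a minimal sketch in plain Python; the function name `split_on_gaps` is illustrative, and the gap threshold of 3 is an assumption chosen so that the two gaps listed on the slide ([F=2, F=5] and [F=6, F=10]) are the ones that split.

```python
def split_on_gaps(values, counts, min_gap):
    """Split sorted distinct F values wherever consecutive values differ by
    at least min_gap. Returns (low F, high F, point count) per cluster."""
    clusters = []
    start = 0
    for idx in range(1, len(values)):
        if values[idx] - values[idx - 1] >= min_gap:
            clusters.append((values[start], values[idx - 1], sum(counts[start:idx])))
            start = idx
    clusters.append((values[start], values[-1], sum(counts[start:])))
    return clusters

# Value_Array and Count_Array for q = z1, copied from the slide
z1_values = [0, 1, 2, 5, 6, 10, 11, 12, 14]
z1_counts = [2, 2, 4, 1, 1, 1, 1, 2, 1]
clusters = split_on_gaps(z1_values, z1_counts, min_gap=3)
print(clusters)
```

The three cluster sizes (8, 2 and 5 points) agree with the number of 1-bits in the z11, z12 and z13 masks on the slide.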

  18. What have we learned? What is the DPPd FAUST CLUSTER algorithm?

D = Median→Mean, d1 ≡ D/|D| is a good start. But first, Variance-Gradient hill-climb it. (Median means Vector of Medians.)

For X2 = SubCluster2, use a d2 which is perpendicular to d1? In high dimensions, there are many perpendicular directions. GV hill-climb d2 = D2/|D2| (D2 = MedianX2 - MeanX2) constrained to be ⊥ to d1, i.e., constrained to d2od1 = 0 (in addition to d2od2 = 1).

We may not want to constrain this second hill-climb to unit vectors perpendicular to d1. It might be the case that the gap gets wider using a d2 which is not perpendicular to d1.

GMP: Gradient hill-climb (wrt d) VarianceDPPd starting at d2 = D2/|D2|, where d2 ≡ Unitized( Vom{x-xod1 | x∈X2} - Mean{x-xod1 | x∈X2} ), Variance-Gradient hill-climbed subject only to dod = 1. (We shouldn't constrain the 2nd hill-climb to d1od2 = 0, nor subsequent hill-climbs to dkodh = 0, h = 2...k-1, since the gap could be larger without the constraint.)

GCCP: Gradient hill-climb (wrt d) VarianceDPPd starting at d2 = D2/|D2|, where D2 = CCi(X2) - CCj(X2), and hill-climb subject to dod = 1, where the CCs are two of the Circumscribing rectangle's Corners (the CCs may be faster to calculate than Mean and Vom).

Taking all edges and diagonals of CCR(X) (the Coordinate-wise Circumscribing Rectangle of X) provides a grid of unit vectors. It is an equi-spaced grid iff we use a CCC(X) (Coordinate-wise Circumscribing Cube of X). Note that there may be many CCC(X)s. A canonical one is the one that is furthest from the origin (take the longest side first, then extend each other side the same distance from the origin side of that edge).

A good choice may be to always take the longest side of CR(X) as D, D ≡ LSCR(X). Should outliers on the (n-1)-dim-faces at the ends of LSCR(X) be removed first? So remove all LSCR(X)-endface outliers until, after removal, the same side is still the LSCR(X). Then use that LSCR(X) as D.
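Two of the starting directions discussed on this slide can be sketched as follows. This is a minimal sketch, assuming NumPy and horizontal arrays rather than pTrees; the function names `unitized`, `median_mean_direction` and `lscr_direction` and the toy points are illustrative. The first uses D = VectorOfMedians - Mean; the second takes the longest side of the coordinate-wise circumscribing rectangle, which reduces to a coordinate unit vector, and does no outlier removal.

```python
import numpy as np

def unitized(v):
    """Scale v to unit length."""
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return v / norm

def median_mean_direction(X):
    """d = D/|D| with D = VectorOfMedians(X) - Mean(X)."""
    X = np.asarray(X, dtype=float)
    return unitized(np.median(X, axis=0) - X.mean(axis=0))

def lscr_direction(X):
    """Unit vector along the longest side of the circumscribing rectangle."""
    X = np.asarray(X, dtype=float)
    side_lengths = X.max(axis=0) - X.min(axis=0)
    d = np.zeros(X.shape[1])
    d[int(np.argmax(side_lengths))] = 1.0
    return d

# Toy subcluster with one far point pulling the mean away from the medians
X2 = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 1.0]])
d_mm = median_mean_direction(X2)
d_ls = lscr_direction(X2)
print(d_mm, d_ls)
```

Either vector would then serve only as the starting point for the variance hill-climb; which start gives wider gaps is left open on the slide.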

  19. MVM C11 F-MN gp2 0 1 1 1 1 1 2 3 1 3 3 2 5 2 1 6 1 2 8 2 2 10 2 1 11 1 1 12 4 1 13 1 2 15 2 WINE GV GM ACCURACY WINE GV 62.7 MVM 66.7 GM 81.3 .11 .19 .96 .19 209 -.02 .41 .91 0 232 C1(F-MN) gp3 0 1 1 1 6 1 2 5 1 3 2 1 4 4 1 5 8 1 6 8 1 7 4 1 8 3 1 9 7 1 10 1 1 11 4 1 12 6 1 13 4 1 14 2 1 15 3 1 16 3 1 17 2 1 18 2 1 19 3 1 20 4 1 21 6 1 22 4 1 23 1 1 24 2 1 25 4 1 26 1 1 27 1 2 29 2 1 30 2 2 32 1 3 35 1 1 36 1 1 37 1 1 38 1 1 39 4 1 40 2 2 42 2 2 44 1 1 45 2 2 47 4 1 48 2 1 49 1 1 50 1 3 53 1 1 54 2 1 55 2 [0.12) 1L 0H F-MN Ct gp8 0 1 12 12 1 3 15 2 13 28 1 2 30 1 2 32 2 2 34 1 1 35 2 3 38 1 8 46 1 1 47 3 10 57 1 1 58 1 1 59 1 1 60 1 2 62 1 2 64 1 1 65 1 1 66 1 1 67 4 1 68 2 1 69 1 1 70 1 2 72 3 1 73 1 1 74 3 1 75 2 1 76 1 1 77 1 2 79 1 3 82 1 1 83 1 1 84 2 1 85 1 1 86 1 2 88 2 1 89 4 1 90 2 1 91 1 1 92 6 1 93 3 1 94 5 1 95 4 2 97 5 1 98 2 1 99 1 1 100 4 1 101 7 1 102 4 1 103 2 1 104 3 1 105 6 1 106 3 1 107 8 1 108 10 1 109 2 1 110 4 1 111 5 1 112 2 1 113 4 1 114 1 .07 .15 .98 .12 588 -.01 .26 .97 .00 608 (F-MN) gp8 0 1 1 1 4 1 2 4 1 3 5 1 4 4 1 5 6 1 6 8 1 7 6 1 8 4 1 9 5 1 10 2 1 11 3 1 12 7 1 13 4 1 14 3 1 15 2 1 16 2 1 17 3 1 18 4 1 19 3 1 20 4 1 21 1 1 22 7 1 23 2 1 24 4 1 25 1 1 26 1 1 27 1 1 28 1 1 29 1 1 30 1 1 31 1 1 32 1 3 35 1 2 37 3 1 38 1 1 39 1 1 40 3 1 41 3 3 44 2 1 45 2 1 46 4 1 47 2 2 49 1 2 51 1 1 52 1 3 55 1 1 56 1 1 57 1 9 66 2 1 67 2 8 75 1 4 79 2 1 80 1 2 82 2 1 83 1 2 85 1 13 98 1 2 100 1 3 103 1 11 114 1 -.05 -.31 -.95 -.01 605 .01 -.27 -.96 -.0 608 XF-M gp3 0 1 11 11 1 4 15 1 1 16 1 13 29 1 1 30 1 2 32 2 2 34 1 1 35 2 4 39 1 8 47 2 1 48 2 9 57 1 1 58 1 1 59 1 2 61 1 2 63 1 2 65 1 1 66 1 1 67 1 1 68 5 1 69 2 1 70 1 3 73 3 1 74 3 1 75 1 1 76 2 1 77 2 2 79 1 3 82 1 1 83 1 1 84 1 1 85 1 1 86 1 1 87 1 1 88 1 1 89 1 1 90 4 1 91 2 1 92 7 1 93 1 1 94 5 1 95 4 1 96 2 1 97 3 1 98 2 1 99 2 1 100 3 1 101 4 1 102 7 1 103 3 1 104 2 1 105 6 1 106 3 1 107 5 1 108 9 1 109 6 1 110 4 1 111 5 1 112 4 1 113 4 1 114 1 _4L2H ___ _ [12,28) 1L2H _2L1H 2L 
0H C1 -.11 -.02 -.86 .5 43 -.05 -.4 -.92 .01 68 C7F-M*3 g3 0 1 3 3 1 2 5 1 4 9 2 6 15 1 3 18 1 2 20 1 1 21 1 1 22 3 3 25 2 3 28 1 2 30 3 1 31 2 1 32 1 1 33 1 1 34 2 1 35 1 1 36 3 3 39 2 1 40 1 1 41 2 3 44 1 2 46 3 2 48 1 2 50 2 2 52 1 1 53 1 1 54 1 1 55 1 1 56 2 1 57 1 1 58 2 1 59 2 1 60 2 1 61 1 1 62 1 1 63 2 2 65 1 1 66 1 1 67 1 1 68 1 1 69 3 1 70 1 1 71 1 1 72 1 1 73 1 1 74 2 1 75 2 1 76 1 1 77 3 1 78 4 1 79 3 1 80 4 1 81 1 1 82 1 1 83 1 2 85 2 1 86 2 2 88 3 1 89 2 2 91 2 2 93 4 3 96 1 _0L 2H _0L 2H C2 3L5H -.08 .59 -.8 -.07 80 .08 .83 -.56 -.01 95 C5 g3 0 1 4 4 1 8 12 1 3 15 1 2 17 1 2 19 1 4 23 1 1 24 1 2 26 1 1 27 1 2 29 3 2 31 1 1 32 1 1 33 1 ___ _ [28,46) 2L6H 1L1H ___ _ [46,57) 2L2H .05 .59 -.293 .75 18 -.1 .9 -.3 .1 34 C6*8 16 0 1 4 4 2 16 20 1 11 31 1 37 68 1 15 83 1 15 98 1 8 106 1 11 117 1 1 118 2 _2L4H C71 C121 max thin 0 1 1 1 6 1 2 5 1 3 3 1 4 3 1 5 8 1 6 8 1 7 4 1 8 7 1 9 3 1 10 1 1 11 5 1 12 6 1 13 3 1 14 2 1 15 3 1 16 3 1 17 4 _2L5H C3 _0L 1H C4 _3L 0H C1 F-M Ct g3 0 1 1 1 2 1 2 2 3 5 1 1 6 1 1 7 4 1 8 2 2 10 2 1 11 1 2 13 2 1 14 1 1 15 1 1 16 5 2 18 1 2 20 2 3 23 1 1 24 1 1 25 1 1 26 2 2 28 1 1 29 1 1 30 5 1 31 2 1 32 1 1 33 4 1 34 5 1 35 4 1 36 4 1 37 2 1 38 3 1 39 3 1 40 2 1 41 4 1 42 3 1 43 5 1 44 3 1 45 4 1 46 5 1 47 4 1 48 3 1 49 11 1 50 5 1 51 3 1 52 5 1 53 4 1 54 4 1 55 1 _1L2H ___ 4L2H _0L 1H C4 _1L 2H ___ 0L 2H _2L 3L2H 23L 25H 6L 21H _1L6H 5L5H _2L 12H _9L 7H C5 .19 .8 -.54 .18 7 -.21 .7 -.7 -.09 9 C763F-M*8 g8 0 2 16 16 1 13 29 1 12 41 2 4 45 1 7 52 1 4 56 1 7 63 1 8 71 2 _1L4H .01 -.27 -.96 -.01 23 -.04 -.43 -.9 .03 24 C76*4 g3 0 1 31 31 1 3 34 1 1 35 2 2 37 1 2 39 2 2 41 1 2 43 1 1 44 1 2 46 3 3 49 1 1 50 1 1 51 2 1 52 1 1 53 2 2 55 2 2 57 2 3 60 1 2 62 2 1 63 1 2 65 3 1 66 2 3 69 1 1 70 1 1 71 2 2 73 2 1 74 1 1 75 2 1 76 3 1 77 2 1 78 2 1 79 3 1 80 3 2 82 1 2 84 1 2 86 2 1 87 1 1 88 1 2 90 2 1 91 2 3 94 2 2 96 2 1 97 2 C11 10L 13H C12 0L 2H ___ _1L 0H ___ _0L 1H _2L 0H [0.35) C11 38L68H C12 F-M gp2 0 1 1 1 8 1 2 3 1 3 2 1 4 
4 1 5 11 1 6 8 1 7 2 1 8 6 1 9 4 1 10 3 1 11 4 1 12 4 1 13 5 1 14 3 1 15 3 1 16 4 2 18 2 1 19 5 1 20 6 1 21 4 1 22 1 1 23 2 1 24 3 1 25 3 3 28 2 1 29 2 2 31 1 4L 8H C6 _2L4H 4L8H C763 _0L 2H -.21 .34 -.91 .9 8 C766 *16 g4 0 1 30 30 1 2 32 1 7 39 1 1 40 1 1 41 1 1 42 1 4 46 1 2 48 1 2 50 2 5 55 1 3 58 1 7 65 1 2 67 1 5 72 1 3 75 2 2 77 1 1 78 4 2 80 1 3 83 1 1 84 2 4 88 1 1 89 1 11 100 1 4 104 1 11 115 1 _0L 1H [35,53) C12 10L13H ___ _ 2L9H _2L 0H ___ [53,56) 3L 2H _3L 1H 29L 46H ___ _ 1L8H 51L 83H [0.66) C1 _1L 3H ___ _ [66,75) 2L2H _2L 0H _2L 0H 7L 19H 2L2H ___ _ [75,98) 2L6H 0L 1H ___ [57,115) 51L 83H C1 _4L 8H ___ _ [98,115) 2L2H 17L 15H C766 _2L 0H 38L 68H C7 _0L 2H ___ 28L 44H C76 1L _1L 0H ___ _ 3L3H

  20. SEEDS GV MVM 256 36 10 32 akk .98 .14 .04 .12 0 .00 -.00 .96 .29 3 C6 10(F-M) g12 0 3 10 10 1 12 22 3 10 32 3 9 41 2 7 48 1 ACCURACY SEEDS WINE GV 94 62.7 MVM 93.3 66.7 GM 96 81.3 219 31 14 29 akk d1 d2 d3 d4 V(d .98 .14 .06 .13 9 .98 .14 .06 .13 9 10(F-MN) gp6 0 2 1 1 10 1 2 5 1 3 1 6 9 3 1 10 10 1 11 10 1 12 2 6 18 2 1 19 3 1 20 7 1 21 2 1 22 1 1 23 3 6 29 6 1 30 4 1 31 7 1 32 1 6 38 1 1 39 2 1 40 6 1 41 5 1 42 1 7 49 3 1 50 1 2 52 7 1 53 2 7 60 1 2 62 4 1 63 3 8 71 5 1 72 2 2 74 1 6 80 5 1 81 8 1 82 5 1 83 3 9 92 2 10 102 1 1 103 2 1 104 1 10(F-MN)gp6 0 2 1 1 10 1 2 5 1 3 1 6 9 3 1 10 10 1 11 10 1 12 2 6 18 2 1 19 3 1 20 7 1 21 2 1 22 1 1 23 3 6 29 6 1 30 4 1 31 7 1 32 1 6 38 1 1 39 2 1 40 6 1 41 5 1 42 1 7 49 3 1 50 1 2 52 7 1 53 2 7 60 1 2 62 4 1 63 3 8 71 5 1 72 2 2 74 1 6 80 5 1 81 8 1 82 5 1 83 3 9 92 2 10 102 1 1 103 2 1 104 1 ___ ___ [0,9) 0k 0r 18c C1 ___ ___ [0,9) 0k 0r 18c C1 ___ ___ [9,18) 1k 0r 24c C2 GM ___ ___ [9,18) 1k 0r 24c C2 .794 -.403 -.304 .337 6 0.957 .156 -.205 .132 9 10(F-MN) gp3 0 1 2 2 1 2 4 4 2 6 3 2 8 7 2 10 2 2 12 1 2 14 1 2 16 10 2 18 10 1 19 2 3 22 2 1 23 2 1 24 1 1 25 1 2 27 4 2 29 4 2 31 4 2 33 2 5 38 3 1 39 3 2 41 7 2 43 2 2 45 2 1 46 1 2 48 1 1 49 1 1 50 4 2 52 5 1 53 1 1 54 3 3 57 2 2 59 3 2 61 3 1 62 1 2 64 3 2 66 3 3 69 5 7 76 1 2 78 2 2 80 2 2 82 4 2 84 1 2 86 1 2 88 4 1 89 1 1 90 8 2 92 5 11 103 2 1 104 1 1 105 1 1 106 1 2 108 1 ___ ___ [18,29) 10k 0r 8c C3 ___ ___ [18,29) 10k 0r 8c C3 ___ ___ [29,38) 18k 0r 0c C4 ___ ___ [29,38) 18k 0r 0c C4 ___ ___ [38,49) 13k2r 0c C5 -.577 .577 .577 .000 1 .119 .112 .986 .000 3 C2: 10(F-MN) gp10 0 1 10 10 2 1 11 3 10 21 3 10 31 5 10 41 1 10 51 1 11 62 1 1 63 1 ___ ___ [0,22) 0k 0r 42c C1 ___ ___ [38,49) 13k2r 0c C5 ___ ___ [49,60) 7k 6r 0c C6 ___ ___ [0,31) 9k 0r 0c C21 ___ ___ [49,60) 7k 6r 0c C6 ___ ___ [60,71) 1k 7r 0c C7 ___ ___ [31,41) 1k 0r 4c C22 ___ ___ [60,71) 1k 7r 0c C7 ___ ___ [71,80) 0k 8r 0c C8 ___ ___ [22,33) 10k 0r 8c C2 ___ ___ [41,64) 0k 0r 4c C23 ___ ___ 
[71,80) 0k 8r 0c C8 ___ ___ [80,92) 0k 21r 0c C9 ___ ___ [92,102) 0k 2r 0c Ca ___ ___ [80,92) 0k 21r 0c C9 ___ ___ [92,102) 0k 2r 0c Ca ___ ___ [102,105) 0k 4r 0c Cb C3 200(F-MN)gp12 0 2 12 12 3 12 24 3 12 36 5 12 48 1 12 60 1 12 72 1 40 112 2 ___ ___ [102,105) 0k 4r 0c Cb ___ ___ [33,57) 33k2r 0c C3 C3 .97 .15 .09 .14 0 0 .07 1 0 4 10F-M g9 0 2 10 10 3 10 20 3 10 30 4 1 31 1 9 40 1 10 50 1 11 61 1 9 70 2 ___ ___ [0,35) 8k 0r 0c ___ ___ [0,10) 2k 0r 0c ___ ___ [35,48) 2k 0r 3c ___ ___ [10,20) 2k 0r 1c -.832 -.282 .134 -.458 0 -.44 .00 -.87 -.22 2 C4: 10(F-MN) gp21 0 3 11 11 2 20 31 3 21 52 3 27 79 1 20 99 3 ___ ___ [20,30) 2k 0r 1c ___ ___ [48,72) 0k 0r 2c ___ ___ [57,69) 6k 9r 0c C4 ___ ___ [69,76) 1k4r 0c C6 ___ ___ [30,40) 4k 0r 1c ___ ___ [72,113) 0k 0r 3c ___ ___ [40,50) 0k 0r 1c ___ ___ [50,61) 0k 0r 1c ___ ___ [61,70) 0k 0r 1c ___ ___ [0,52) 1k7r C41 ___ ___ [70,71) 0k 0r 2c ___ ___ [52,79) 1k2r C42 C6 200(F-MN)gp12 0 3 12 12 1 38 50 3 10 60 1 2 62 3 12 74 2 ___ ___ [79100) 4k 0r C43 ___ ___ [0,50) 4k 0r 0c ___ ___ [50,60) 1k 0r 2c ___ ___ [76,103) 0k 26r 0c C7 ___ ___ [0,22) 4k 0r 0c ___ ___ [60,74) 1k 0r 3c ___ ___ [74,75) 1k 0r 1c ___ ___ [103,109) 0k 6r 0c C8 ___ ___ [22,49) 3k 6r 0c

  21. MVM C2 3762 808 2260 266 d1 d2 d3 d4 .84 .18 .51 .06 64 .57 .22 .71 .34 82 .51 .22 .74 .38 83 (F-MN)*3 Ct gp3 0 1 2 2 1 1 3 1 2 5 1 15 20 2 3 23 1 3 26 2 2 28 1 1 29 1 2 31 1 2 33 2 2 35 2 2 37 1 1 38 3 1 39 1 1 40 1 1 41 1 1 42 1 4 46 1 1 47 2 2 49 1 1 50 1 1 51 1 2 53 1 1 54 2 2 56 1 2 58 1 1 59 2 2 61 2 1 62 2 1 63 3 1 64 1 1 65 2 2 67 3 1 68 2 1 69 1 1 70 2 1 71 2 1 72 2 2 74 1 1 75 1 2 77 1 1 78 1 1 79 1 2 81 1 2 83 1 1 84 1 3 87 2 1 88 1 1 89 1 1 90 2 1 91 1 1 92 2 3 95 1 1 96 2 1 97 1 2 99 1 2 101 2 2 103 1 3 106 3 3 109 1 1 110 1 1 111 1 .81 .28 -.28 .42 13... .53 .23 .73 .37 39 C12 4*F-M g3 0 2 4 4 1 4 8 2 2 10 1 2 12 1 2 14 1 3 17 1 1 18 1 2 20 1 1 21 1 1 22 1 2 24 3 1 25 1 2 27 1 1 28 1 2 30 1 4 34 1 2 36 2 2 38 1 3 41 2 3 44 1 2 46 2 2 48 1 2 50 1 2 52 2 2 54 1 1 55 1 1 56 1 1 57 1 1 58 3 1 59 1 1 60 2 2 62 1 1 63 4 2 65 1 1 66 1 1 67 1 1 68 1 1 69 3 3 72 1 2 74 1 2 76 1 2 78 1 1 79 1 3 82 1 2 84 1 1 85 1 4 89 1 1 90 1 2 92 1 1 93 1 IRIS GM GV ACCURACY IRIS SEEDS WINE GV 82.7 94 62.7 MVM 94 93.3 66.7 GM 94.7 96 81.3 C23 F-M*3 g3 3847 818 2284 257 .96 .22 .06 -.14 15 0 1 6 6 1 2 8 1 4 12 1 3 15 1 1 16 1 2 18 2 8 26 1 2 28 1 1 29 1 1 30 1 2 32 1 1 33 1 3 36 1 3 39 2 1 40 1 1 41 2 1 42 2 2 44 2 2 46 1 1 47 2 5 52 1 1 53 1 3 56 1 1 57 1 3 60 1 1 61 1 1 62 1 2 64 1 6 70 2 5 75 1 2 77 2 3 80 1 9 89 1 8 97 1 F-MN gp8 0 2 3 3 5 1 4 5 1 5 14 1 6 11 1 7 6 1 8 1 1 9 5 1 10 1 5 15 1 8 23 1 2 25 2 2 27 1 2 29 1 1.. 
68 1 .88 .09 -.98 -.18 168 -.29 .13 -.88 -.36 417 -.36 .09 -.86 -.36 420 F-MN Ct gp5 0 1 3 3 2 1 4 1 2 6 1 1 7 1 2 9 2 1 10 1 2 12 3 1 13 1 1 14 3 1 15 4 1 16 2 1 17 3 1 18 1 1 19 6 1 20 3 1 21 1 1 22 2 1 23 2 1 24 6 1 25 7 1 26 2 1 27 3 1 28 2 1 29 6 1 30 3 1 31 2 1 32 3 1 33 3 1 34 3 1 35 3 1 36 5 1 37 1 1 38 2 1 39 1 1 40 2 1 41 1 2 43 1 2 45 1 1 46 1 1 47 1 5 52 1 8 60 2 1 61 3 1 62 4 1 63 3 1 64 13 1 65 12 1 66 4 1 67 5 1 68 2 2 70 2 .90 .24 .37 .04 180 .41 -.04 .84 .35 418 .36 -.08 .86 .36 420 F-MN Ct gp3 0 2 2 2 2 1 3 2 1 4 5 1 5 7 1 6 16 1 7 6 1 8 4 1 9 4 1 10 2 8 18 1 5 23 1 2 25 2 2 27 1 2 29 1 1 30 1 1 31 1 1 32 2 1 33 1 1 34 3 1 35 5 1 36 4 1 37 3 1 38 1 1 39 4 1 40 3 1 41 3 1 42 4 1 43 4 1 44 2 1 45 5 1 46 7 1 47 3 1 48 2 1 49 1 1 50 3 1 51 4 1 52 3 1 53 2 1 54 3 1 55 3 1 56 3 1 57 1 1 58 4 3 61 2 1 62 1 1 63 1 2 65 1 1 66 1 1 67 2 3 70 1 ___ 1e 0i -.36 .09 -.86 -.36 105 -.54 -0.17 -.76 -.33 118 C1 2*(F-M g3 0 2 4 4 1 1 5 1 1 6 1 5 11 1 2 13 1 3 16 1 2 18 1 3 21 1 1 22 1 1 23 1 2 25 2 1 26 2 2 28 2 1 29 1 2 31 3 1 32 1 2 34 1 1 35 2 1 36 2 1 37 4 3 40 2 1 41 1 2 43 3 2 45 1 2 47 4 1 48 1 1 49 2 1 50 4 1 51 3 2 53 5 1 54 2 1 55 2 1 56 1 1 57 3 2 59 3 2 61 2 2 63 1 1 64 1 1 65 2 2 67 1 1 68 1 1 69 2 1 70 2 1 71 3 1 72 1 1 73 2 2 75 1 1 76 1 1 77 1 1 78 1 1 79 1 1 80 1 2 82 2 10 92 1 2 94 2 2 96 1 __2e 5i 50s1i C1 C2 ___ 4e 1i C21 ___ 19e1i C22 4(F-) g4 0 1 6 6 1 4 10 1 2 12 1 4 ... 33 2 1 34 1 4 38 1 1 39 1 3 ... 
79 1 2 81 1 5 86 1 2 88 2 2 90 1 1 91 1 1 92 2 2 94 1 1 95 1 2 97 1 1 98 1 3 101 2 1 102 2 4 106 1 1 107 1 2 109 1 1 110 2 1 111 2 6 117 1 1 118 1 1 119 1 1 120 1 ___50s1i C1 ___ 6e 0i ___ 18e C221 29e 14i ___ ___ ___ 19e1i C22 ___28i C11 ___ 16e11i 18e 11i C123 ___ 6e ___ 2e ___ 3e2i C221 8F- g5 0 1 7 7 1 4 11 1 5 16 1 1 17 1 3 20 1 1 21 1 2 23 1 1 24 1 5 29 1 3 32 2 2 34 1 1 35 1 4 39 3 5 44 1 3 47 2 3 50 1 3 53 1 4 57 1 3 60 1 3 63 1 1 64 2 5 69 2 1 70 1 3 73 1 1 74 1 1 75 1 4 79 1 1 80 2 2 82 2 1 83 1 1 84 1 2 86 1 4 90 1 5 95 1 ___1e ___ 0e 3i ___ 2e ___ 26i ___ 0e 4i C221 8F-)g5 0 1 7 7 1 4 11 1 5 16 1 1 17 1 3 20 1 1 21 1 2 23 1 1 24 1 5 29 1 3 32 2 2 34 1 1 35 1 4 39 3 5 44 1 3 47 2 3 50 1 3 53 1 4 57 1 3 60 1 3 63 1 1 64 2 5 69 2 1 70 1 3 73 1 1 74 1 1 75 1 4 79 1 1 80 2 2 82 2 1 83 1 1 84 1 2 86 1 4 90 1 5 95 1 ___50e 49i C1 __ 4e1i ___ 3e . -.034 .37 -.31 .87 4 C123 12*F-M g4 0 1 6 6 1 10 16 1 2 18 1 3 21 1 1 22 1 1 23 1 6 29 1 3 32 1 3 35 1 5 40 2 5 45 1 4 49 1 1 50 2 4 54 1 2 56 1 5 61 2 1 62 1 2 64 1 1 65 1 2 67 1 3 70 1 1 71 1 12 83 1 1 84 1 1 85 1 __ 1i . ___ 50e 40i C2 9i C3 ___ 1e . _46e 21i C12 ___9e ___ 5e 1i ___ 4e C13 ___ 27e 16i C23 ___9e . ___ 50s1i C2 ___ 9e1i . __9e2i MVM C2 2(F-)g4 0 1 4 4 1 1 5 1 4 9 1 3 ... 69 1 4 73 1 1 74 1 2 76 2 4 80 1 4 84 1 2 86 2 5 91 1 ___ 9i C24 _ 4e . __9e2i ___ 3e __ 0e 2i . 47e 40i C22 ___ 8i ___ 3i ___ 2e6i . ___ 5e10i ___ _3i ___ 1i ___ 0e 11i ___ 2e1i ___ 5e11i

  22. CONCRETE GM MVM C11 F-/4 g4 0 4 2 2 1 2 4 4 2 6 25 2 8 2 1 9 7 1 10 4 1 11 9 2 13 3 1 14 6 1 15 4 1 16 1 3 19 5 4 23 2 3 26 5 1 27 4 1 28 9 1 29 5 2 31 6 1 32 5 3 35 6 5 40 2 C232 g2 F-M/8 0 1 1 1 1 1 2 2 1 3 1 2 5 2 1 6 1 1 7 2 1 8 2 1 9 1 7 16 1 1 17 3 1 18 2 2 20 7 1 21 8 1 22 7 1 23 1 2 25 2 1 26 3 1 27 2 1 28 3 1 29 1 1 30 1 1 31 2 2 33 1 1 34 4 1 35 3 3 38 3 1 39 8 11 50 2 1 51 1 MVM (F-)/4 gp4 C23 g3 F-M/8 0 2 2 2 1 1 3 3 1 4 3 1 5 1 1 6 1 1 7 6 1 8 1 1 9 8 1 10 2 1 11 6 1 12 2 1 13 5 1 14 2 1 15 2 3 18 1 1 19 7 1 20 1 1 21 3 1 22 1 1 23 2 1 24 4 1 25 1 2 27 8 1 28 9 1 29 4 2 31 2 1 32 1 1 33 3 1 34 3 2 36 7 1 37 12 2 39 1 1 40 1 1 41 1 1 42 6 6 48 1 2 50 2 0L 32M 13H 11L 13M 54H ACCURACY CONCRETE IRIS SEEDS WINE GV 76 82.7 94 62.7 MVM 78.8 94 93.3 66.7 GM 83 94.7 96 81.3 C2-.6 .2 -.07 .771 6882.. -.72 .19 -.40 .54 9251 .38 .14 -.79 .46 11781 F-m/8 g4 C2 0 1 2 2 1 1 3 1 2 5 2 3 8 1 2 10 1 1 11 1 5 16 1 2 18 1 5 23 1 1 24 1 1 25 2 1 26 2 1 27 1 2 29 4 1 30 2 1 31 2 1 32 1 1 33 3 2 ... 1s 65 1 X g4 (F-MN)/8 0 2 2 2 1 2 4 2 1 5 1 3 8 2 3 11 1 1 12 3 2 14 4 1 15 3 1 16 3 1 17 2 1 18 3 1 19 6 1 20 3 1 21 3 1 22 2 1 23 5 1 24 4 1 25 3 1 26 6 1 27 3 1 28 1 1 29 6 1 30 3 1 31 2 1 32 3 1 33 3 1 34 1 2 36 3 1 37 1 1 38 2 1 39 3 1 40 5 1 41 1 1 42 6 1 43 1 1 44 3 2 46 5 1 47 1 1 48 3 1 49 1 1 50 2 1 51 1 1 52 1 1 53 1 1 54 1 1 55 1 1 56 3 1 57 3 2 59 1 2 61 1 1 62 3 3 65 2 9 74 1 4 78 1 3 81 1 2 83 1 3 86 1 2 88 1 2 90 1 1 91 1 4 95 1 2 97 1 1 98 1 2 100 1 4 104 1 3 107 1 0 1 1 1 1 4 5 1 1 ... 
1s 46 4 3 49 1 7 56 1 2 58 1 3 61 1 4 65 1 1 66 1 3 69 1 2 71 1 6 77 1 3 80 1 3 83 1 3 86 1 14 100 1 3 103 1 2 105 1 3 108 2 4 112 1 ___ 2M C2 gp8 (F-MN)/5 0 2 2 2 1 2 4 2 1 5 1 3 8 2 3 11 1 1 12 2 2 14 4 1 15 3 1 16 3 1 17 2 1 18 3 1 19 6 1 20 3 1 21 3 1 22 1 1 23 5 1 24 3 1 25 3 1 26 6 1 27 3 1 28 1 1 29 6 1 30 3 1 31 2 1 32 1 1 33 3 1 34 1 2 36 3 2 38 2 1 39 2 1 40 5 1 41 1 1 42 6 1 43 1 1 44 3 2 46 5 1 47 1 1 48 1 1 49 1 1 50 2 1 51 1 1 52 1 1 53 1 1 54 1 1 55 1 1 56 3 1 57 2 2 59 1 2 61 1 1 62 3 3 65 2 9 74 1 4 78 1 8 86 1 2 88 1 2 90 1 5 95 1 2 97 1 1 98 1 2 100 1 4 104 1 C21 0L 8M 0H C1 43L 33M 55H C22 2M 0H C23 C211 g5 F-M)/4 0 1 6 6 2 1 7 2 5 12 1 1 13 4 1 14 1 1 15 4 2 17 1 1 18 2 1 19 2 2 21 2 1 22 3 1 23 1 1 24 3 4 28 1 14 42 1 2 44 1 1 45 1 3 48 2 2 50 1 5 55 1 2 57 1 1 58 1 5 63 1 1 64 1 7 71 1 11 82 1 16 98 2 g4 F-MN/8 0 1 2 2 1 2 4 1 2 6 1 1 7 1 1 8 1 2 10 1 1 11 1 1 12 1 1 13 1 3 16 2 3 19 1 2 21 1 5 26 1 1 27 1 1 28 2 1 29 2 1 30 1 2 32 5 1 33 2 1 34 2 1 35 1 1 36 3 1 37 3 1 38 3 1 39 5 1 40 3 1 41 7 1 42 6 1 43 3 1 44 5 1 45 1 1 46 3 1 47 3 1 48 4 1 49 7 1 50 4 1 51 6 1 52 10 1 53 3 1 54 4 1 55 8 1 56 5 1 57 3 1 58 7 1 59 2 1 60 2 1 61 1 1 62 2 2 64 1 1 65 2 1 66 1 1 67 2 C21 g4 F-M/4 0 1 1 1 1 3 4 1 3 7 2 1 8 2 1 9 1 2 11 1 2 13 4 1 14 2 1 15 4 1 16 1 2 18 2 1 19 3 1 20 1 1 21 2 1 22 6 2 24 2 1 25 3 1 26 1 2 28 2 2 30 1 1 31 1 2 33 1 4 37 1 1 38 2 1 39 2 1 40 1 1 41 1 1 42 1 1 43 2 1 44 1 1 45 2 1 46 1 1 47 1 1 48 1 1 49 2 2 51 2 4 55 1 1 56 8 1 57 4 1 58 4 1 59 2 1 60 1 1 61 1 2 63 5 2 65 1 2 67 2 1 68 1 3 71 1 1 72 4 1 73 8 1 74 5 1 75 1 8 83 3 1 84 3 1 85 2 1 86 1 99 3 GV ___5L . C111 3L 23M 49H ___ 7M C2 ___ 4M C3 ___6M C4 ___ 30L 1M 4H C231 g4 F-M/8 0 1 7 ... 1s 12 1 2 14 6 1 15 7 4 19 1 1 20 3 1 21 3 1 22 2 1 23 1 2 25 1 2 27 1 2 29 1 1 30 1 1 31 1 2 33 1 6 39 1 3 42 1 4 46 1 10 56 2 __20L5M . C1F-/4 g4 ___14M 0H C1 C2 0 1 1 1 1 7 8 1 4 12 1 4 16 1 2 18 1 2 20 2 1 21 2 2 ... 
1s+2s 71 2 2 73 1 1 74 1 2 76 2 2 78 2 4 82 2 2 84 1 6 90 2 8 98 1 9 107 1 16 123 1 ___ 5L1M . ___ 4M . ___ 2L1M . C211 32L 13M 0H ___5L1M C11 43L 23M 53H _30L8H_ . 3L2M C212 g5 F-M/3 0 1 20 20 1 8 28 1 1 29 2 9 38 1 11 49 1 5 54 1 11 65 1 10 75 2 3 78 1 11 89 1 7 96 1 2 98 1 2 100 1 11 111 2 1 112 1 ___6M2H C212 7L 3M 10H 2L2M1H __6L3M . __1L2H C111 F-/4 g4 0 1 16 16 3 1 17 2 1 18 9 1 19 3 2 21 5 6 27 3 1 28 5 1 29 14 1 30 1 8 38 2 2 40 15 1 41 3 4 45 3 2 47 2 19 66 3 21 87 1 ___1L4M3H ___ __1L ___1L 1M4H ___ 8H 43L 38M 55H C2 0L 14M 0H C1 ___ 3L 2M18H 1L 21M 43L 28M 55H C21 __ 1L 2M 20H C213 4L 7M 38H ___4L 2M8H ___ 8H C214 0L 5M 7H ___ 2M9H ___ ___ . 1H 2M 0L 10M 0H C22 ___ __ 31H ___1L2H

  23. ABALONE GV 0.11 0.09 0.03 0.14 2 0.27 0.86 0.33 0.27 73 1.00 0.00 0.00 0.00 5 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 0.00 1.00 0.00 0.00 56 0.25 0.88 0.31 0.25 73 0.00 0.00 1.00 0.00 8 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 0.00 0.00 0.00 1.00 5 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 1.00 1.00 0.00 0.00 93 0.26 0.87 0.32 0.26 73 1.00 0.00 1.00 0.00 27 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 1.00 0.00 0.00 1.00 22 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 1.00 1.00 1.00 0.00 154 0.27 0.87 0.33 0.27 73 1.00 1.00 0.00 1.00 141 0.26 0.87 0.33 0.26 73 1.00 0.00 1.00 1.00 57 0.29 0.84 0.36 0.29 72 0.26 0.87 0.32 0.26 73 0.00 1.00 1.00 1.00 154 0.27 0.87 0.33 0.27 73 1.00 1.00 1.00 1.00 216 0.27 0.86 0.33 0.27 73 GM MVM 1.00 0.00 0.00 0.00 23 0.71 0.23 0.66 0.01 47 C1 g3 400*F-M 0 1 1 1 1 6 7 1 3 10 2 2 12 3 2 14 3 1 15 1 3 18 1 2 20 1 2 22 3 4 26 1 3 29 1 3 32 1 1 33 1 2 35 1 2 37 2 2 39 1 1 40 1 5 45 2 2 47 1 1 48 2 1 49 1 2 51 1 1 52 2 1 53 2 1 54 2 2 56 1 2 58 3 1 59 1 1 60 1 2 62 2 1 63 1 1 64 2 3 67 4 1 68 1 1 69 2 1 70 1 3 73 1 2 75 2 1 76 2 2 78 1 1 79 2 2 81 1 1 82 1 1 83 1 1 ... 97 1 ACR CONC IRIS SEEDS WINE ABAL GV 76 83 94 63 73 MVM 79 94 93 67 79 GM 83 95 96 81 81 0.39 0.57 0.10 -0.72 0.21 0.57 0.44 0.09 -0.69 0.24 0.77 0.61 0.17 0.01 2.19 0.58 0.48 0.17 0.64 3.8 0.55 0.46 0.16 0.68 3.81 g3 200*F-M 0 1 11 11 1 14 25 1 17 42 1 1 43 1 5 48 1 3 51 1 2 ... 67 2 1 68 2 1 69 3 2 ... 
1s 92 1 1H 1M _ 1H X g2 100(F-M) 3 2 3 6 1 2 8 1 1 9 2 3 12 1 3 15 2 1 16 1 2 18 2 1 19 1 1 20 2 1 21 3 1 22 2 1 23 1 1 24 6 1 25 1 1 26 1 2 28 3 1 29 2 1 30 2 2 32 3 1 33 2 1 34 3 1 35 5 1 36 4 1 37 4 1 38 3 1 39 5 1 40 3 1 41 2 1 42 1 1 43 2 1 44 3 1 45 4 1 46 2 1 47 3 1 48 3 1 49 1 1 50 3 1 51 1 1 52 1 1 53 7 1 54 4 1 55 3 1 56 3 1 57 4 1 58 2 1 59 1 1 60 3 1 61 4 1 62 2 2 64 2 1 65 1 1 66 1 2 68 3 1 69 2 1 70 1 4 74 1 2 76 1 3 79 2 1 80 2 3 83 2 2 85 1 4 89 1 13 102 1 0.25 0.30 -0.20 -0.90 0.18 -0.44 -0.37 -0.19 -0.79 0.81 -0.52 -0.42 -0.19 -0.72 0.83 C1 g3 300(F-M) 0 1 1 1 1 2 3 2 1 4 1 1 5 1 1 6 2 1 7 1 3 10 1 1 11 1 3 14 3 2 16 2 1 17 1 1 18 2 2 20 1 2 22 1 1 23 2 1 24 1 1 25 2 1 26 3 1 27 1 1 28 2 1 29 1 2 31 1 1 32 1 3 35 1 1 36 1 2 38 1 3 41 1 3 44 3 1 45 1 1 46 2 2 48 1 1 49 1 1 50 2 2 52 2 1 53 1 1 54 1 1 55 1 4 59 2 1 60 1 4 64 1 1 65 1 1 66 1 1 67 2 2 69 2 1 70 2 1 71 2 2 73 1 1 74 1 1 75 2 1 76 2 1 77 1 1 78 3 2 80 1 1 81 3 2 83 2 1 84 1 1 85 1 1 86 1 2 88 1 1 89 1 1 90 1 2 92 1 2M 1H _ 5M12H _ 6L . 1M _ 3L . 30L 85M 12H C1 C1 g3 100*F-M 0 1 6 6 1 1 ... 1s 54 1 2 56 2 3... 71 2 7M 4H . 1H 20L 84M 11H C11 10L1M 0H 12L 7M _ 3L4M _ C11 g3 400*F-M 0 1 1 1 1 4 5 1 3 8 4 1 9 1 3 12 2 2 .. 81 2 3 84 2 1 85 1 2M 1H _ 4M 1H _ 2L 0M 0H _ 1L19M1H _ 16M 8H C11 17L 78M 9H C111 3L 1.0 .00 .00 .00 10 .62 .41 .13 .65 46 .33 .29 .13 .89 56 C2 g3 300*F-M 0 1 8 8 1 1 9 1 2 11 1 1 12 1 1 13 3 1 14 1 2 16 2 1 17 1 1 18 3 2 20 2 1 21 1 3 24 1 1 25 1 2 27 2 1 28 1 1 29 2 1 30 1 1 31 1 2 33 2 1 34 1 1 35 1 2 37 1 1 38 3 1 39 1 1 40 1 5 45 1 1 46 1 2 48 1 6 54 1 4 58 1 1 59 1 3 62 1 1 63 1 1 64 1 4 68 1 1 69 1 14 83 1 3 86 1 23 109 1 7L 3M 0H _ C111 g3 1500*F-M 0 1 15 15 1 5 20 1 4 24 1 1 25 1 1 26 1 3 29 1 1 30 1 1 31 2 1 32 1 1 33 2 3 36 1 2 38 3 1 39 2 2 41 2 1 42 1 1 43 2 2 45 1 2 47 3 1 48 1 2 50 1 1 51 1 4 55 2 1 56 3 2 58 1 2 60 3 1 61 2 1 62 2 2 64 1 1 65 2 3 68 2 1 ... 
112 1 4 116 2 .55 .43 .14 .27 .38 C11 g3 1000(F-M) 0 1 10 10 1 7 17 1 2 19 1 8 27 1 9 36 1 11 47 2 2 49 1 3 52 2 4 56 1 4 60 1 2 62 1 2 64 1 7 71 3 1 72 1 5 77 2 4 81 1 3 84 1 6 90 1 3L _ 3M_ 6L8M 0H _ 17M 2H . 13M 5H _ 1M 2H _ 4L 3M _ 0M 6H _ 1M 2H _ 4L 72M 15H C1 10L1M 0H 3M 1H _ 2L21M1H _ 12M 7H _ 3L13M2H 15H _ 1L7M _ 5M 10H _ 1M _ 4L 8M4H 1H 6M 5H _ 3L 30M1H 1M 1H _ 1H 3L 51M3H

  24. KOSblogs d=UnitSTDVec g>6*avg GV on 22 highest STD KOS wds d=(.46 .16 .03 .32 .71 .07 .06 .03 .09 .03 .10 .10 .19 .04 .16 .14 .01 .02 .04 .02 .00 .02) d=e841 (highest STD). gp=1 Ct=8 C16 . outliers. Some of them are substantial MVM gaps>6*avg DOC W=841 1716 0 ... ... 1379 C0 2427 0 Doc F=DPPd Gap 24=MxGp 2682 0 2749 7.574 0.038 0 3029 2983 8.436 0.079 0 42 3402 8.629 0.052 0 2 864 9.184 0.053 0 10 2293 9.462 0.106 1 4 2994 13.45 0.055 0 316 1445 13.66 0.029 0 4 3399 14.05 0.099 0 6 185 14.21 0.156 1 1 2731 14.35 0.143 1 1 2948 14.65 0.066 0 5 1495 14.99 0.014 0 2 804 15.20 0.205 1 1 3177 15.42 0.034 0 6 1316 15.61 0.024 0 2 1335 16.01 0.028 0 3 1637 16.35 0.330 1 1 880 16.86 0.039 0 3 1509 17.03 0.176 1 1 2885 17.21 0.177 1 1 446 18.07 0.863 1 1 1197 18.65 0.005 0 4 3189 19.30 0.644 1 1 1252 20.65 1.352 1 1 2750 13 54 13 2293 13 183 13 2870 13 1222 13 3217 13 1519 13 8 C13 1027 1 ... ... 3427 1 743 C1 2164 14 otlrs 1656 14 3244 14 1709 14 185 15 otlrs 401 15 414 15 893 15 2731 16 otlrs 1396 16 3220 16 3190 16 1832 17 otlr 2852 18 otlrs 3201 18 1234 18 3189 19 otlr 1524 22 otlr 1529 24 otlr 1197 25 otlr 201 27 otlr 1150 29 otlr 1335 34 otlr 1 2 ... ... 2519 2 470 C2 868 3 ... ... 3224 3 274 C3 1882 4 ... ... 3257 4 175 C4 1434 5 ... ... 910 5 127 C5 Cluster size: d=USTDMVM 10 7 11 8 15 8 16 9 17 11 21 11 27 12 42 30 48 45 68 87 422 502 2667 2613 GV 3 3 4 4 4 5 6 6 10 42 316 3029 2753 6 ... ... 549 6 75 C6 1186 7 ... ... 1015 7 79 C7 503 8 ... ... 3156 8 43 C8 2971 9 ... ... 2182 9 39 C9 2868 10 ... ... 1316 10 32 C10 2648 11 ... ... 336 11 18 C11 2983 12 ... ... 
3177 12 14 C12 3364 1804 185.38 0.56 0 3365 3399 186.38 1.00 1 3366 980 186.68 0.30 0 3367 1518 187.84 1.15 1 3368 2090 188.45 0.61 1 3369 890 189.10 0.65 1 3370 24 189.74 0.65 1 3371 2435 189.77 0.03 0 3372 804 190.14 0.36 0 3373 930 190.24 0.11 0 3374 1096 191.30 1.06 1 3375 1441 191.39 0.09 0 3376 2885 191.86 0.47 0 3377 2315 191.91 0.05 0 3378 699 192.04 0.13 0 3379 2108 194.34 2.30 1 3380 1316 195.58 1.24 1 3381 991 195.85 0.27 0 3382 1564 196.05 0.20 0 3383 2800 196.37 0.32 0 3384 880 196.62 0.25 0 3385 2038 196.75 0.13 0 3386 481 197.09 0.34 0 3387 480 197.85 0.76 1 3388 295 198.38 0.53 0 3389 1234 200.42 2.04 1 3390 2140 201.46 1.04 1 3391 3353 202.36 0.90 1 3392 3402 202.64 0.28 0 3393 45 202.86 0.21 0 3394 3017 204.63 1.77 1 3395 3365 207.54 2.91 1 3396 2436 207.77 0.24 0 3397 553 209.73 1.96 1 3398 2545 210.52 0.79 1 3399 54 213.63 3.11 1 3400 1933 214.58 0.95 1 3401 3201 216.16 1.57 1 3402 2895 217.18 1.02 1 3403 446 217.83 0.65 1 3404 2302 218.43 0.61 1 3405 2873 219.47 1.04 1 3406 3388 223.00 3.52 1 3407 1509 225.98 2.99 1 3408 32 229.46 3.48 1 3409 3189 231.30 1.84 1 3410 3228 231.43 0.13 0 3411 2107 232.39 0.96 1 3412 1150 232.79 0.40 0 3413 2279 236.69 3.90 1 3414 2289 237.43 0.74 1 3415 2385 238.03 0.60 0 3416 1037 245.93 7.90 1 3417 201 246.72 0.79 1 3418 1252 249.23 2.51 1 3419 1739 250.34 1.11 1 3420 2446 257.59 7.26 1 3421 1637 258.64 1.05 1 3422 3220 260.55 1.91 1 3423 1304 262.67 2.12 1 3424 2355 271.20 8.53 1 3425 232 293.86 22.66 1 3426 3411 299.23 5.37 1 3427 1955 303.42 4.19 1 3428 1832 328.03 24.61 1 3429 1197 335.83 7.81 1 3430 2852 364.01 28.18 1 AvgGp.0085 gp>6*avg ROW KOS F GAP CT 1 1791 0.2270 --- -- 2 1317 0.2920 0.065 1 2668 1602 6.6576 0.007 2667 3090 1390 9.8504 0.004 422 3132 1546 10.278 0.012 42 3148 2662 10.507 0.021 16 3216 505 11.289 0.019 68 3264 2219 11.994 0.027 48 3291 231 12.445 0.039 27 3302 710 12.631 0.038 11 3317 220 12.934 0.023 15 3338 405 13.315 0.028 21 3355 194 13.693 0.009 17 3368 12 14.151 0.078 8 3378 2731 
14.590 0.011 10 3392 1096 15.459 0.022 5 0.1=AvgGp 64=#gaps Row#Doc#F 28.2=MxGp .6=GapThreshold 1 1791 5.67 Gap 0 ... ... ... ... ... 8 3389 7.00 0.19 0 9 2397 7.65 0.65 1 10 2841 7.82 0.17 0 ... ... ... ... ... 2621 2334 89.40 0.06 0 2622 1122 90.00 0.60 1 2623 245 90.06 0.06 0 ... ... ... ... ... 3123 3169 132.06 0.00 0 3124 321 132.81 0.75 1 3125 2047 133.05 0.24 0 ... ... ... ... ... 3210 343 145.29 0.37 0 3211 2475 145.89 0.60 1 3212 458 146.10 0.21 0 ... ... ... ... ... 3240 542 151.15 0.09 0 3241 2569 151.76 0.61 1 3242 1143 151.92 0.15 0 ... ... ... ... ... 3285 1803 157.97 0.00 0 3286 2257 158.70 0.73 1 3287 2723 158.77 0.07 0 ... ... ... ... ... 3293 129 159.56 0.32 0 3294 2541 160.45 0.89 1 3295 2870 160.48 0.03 0 ... ... ... ... ... 3301 401 161.38 0.04 0 3302 2918 162.03 0.65 1 3303 100 162.07 0.04 0 ... ... ... ... ... 3312 1157 164.54 0.08 0 3313 185 165.26 0.72 1 3314 685 165.91 0.65 1 3315 2948 166.25 0.34 0 ... ... ... ... ... 3325 190 168.59 0.37 0 3326 2498 169.20 0.61 1 3327 264 169.31 0.11 0 3328 1611 169.64 0.33 0 3329 3052 169.96 0.32 0 3330 1002 170.43 0.47 0 3331 1628 170.64 0.20 0 3332 1241 171.80 1.16 1 3333 3155 172.00 0.20 0 ... ... ... ... ... 3342 861 173.84 0.15 0 3343 2509 174.98 1.13 1 3344 2293 175.65 0.67 1 3345 1257 175.67 0.02 0 3346 2776 176.04 0.37 0 3347 1422 177.15 1.11 1 3348 12 177.24 0.09 0 3349 183 177.26 0.02 0 3350 620 177.29 0.03 0 3351 679 179.08 1.79 1 3352 462 179.15 0.07 0 3353 3404 180.02 0.88 1 3354 1850 180.79 0.76 1 3355 3342 181.21 0.43 0 3356 1396 183.04 1.82 1 3357 2982 183.26 0.22 0 ___ ___ gap=.65 Ct=9 C1 ___ ___ gap=.6 Ct=2613 C2 ___ ___ gap=.75 Ct= 502 C3 ___ ___ gap=.6 Ct= 87 C4 ___ ___ gap=.61 Ct=30 C5 ___ ___ gap=.73 Ct=45 C6 ___ ___ gap=.89 Ct=8 C7 ___ ___ gap=.65 Ct=8 C8 ___ ___ gp=.72 Ct= 11 C9 ___ ___ gp=.65 Ct=1 outlr ___ ___ gp=.61 Ct=12 C11 ___ ___ gp=1.2 Ct=6 C12 ___ ___ gp=1.1 Ct=11 C13 ___ ___ gap=.67 Ct=1 utlr ___ ___ gp=1.1 Ct=3 C15 ___ ___ gp=1.8 Ct=4 C16 ___ ___ gp=1.8 Ct=5 otl;r

  25. GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians) UCUC(1101) UCUC(1011) UCUC(0111) 0.58 0.58 0.00 0.58 6756 0.69 0.10 -0.43 0.57 11945 0.65 0.10 -0.58 0.48 12599 0.60 0.11 -0.66 0.45 12784 0.55 0.12 -0.70 0.45 12864 0.51 0.13 -0.72 0.45 12908 0.49 0.13 -0.73 0.46 12933 0.46 0.14 -0.74 0.46 12947 0.45 0.14 -0.75 0.47 12956 0.43 0.14 -0.76 0.47 12960 0.42 0.14 -0.76 0.47 12963 0.42 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.58 0.00 0.58 0.58 6414 0.82 -0.10 0.46 0.33 8390 0.93 -0.12 0.32 0.12 9506 0.97 -0.11 0.20 0.02 9889 0.99 -0.10 0.11 -0.00 10069 1.00 -0.08 0.02 0.01 10254 0.99 -0.06 -0.08 0.05 10508 0.98 -0.04 -0.18 0.11 10851 0.94 -0.01 -0.29 0.18 11263 0.89 0.02 -0.40 0.24 11695 0.82 0.05 -0.49 0.30 12084 0.75 0.07 -0.56 0.35 12391 0.68 0.09 -0.62 0.38 12609 0.62 0.10 -0.66 0.41 12751 0.57 0.12 -0.69 0.43 12839 0.53 0.12 -0.71 0.44 12892 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.00 0.58 0.58 0.58 3102 -0.15 0.02 0.71 0.68 5237 -0.34 -0.08 0.86 0.37 7997 -0.46 -0.12 0.88 -0.09 11648 -0.47 -0.13 0.81 -0.33 12756 -0.45 -0.14 0.77 -0.42 12928 -0.44 -0.14 0.76 -0.45 12955 -0.43 -0.14 0.76 -0.47 12962 -0.42 -0.14 0.76 -0.47 12964 -0.41 -0.14 0.76 -0.48 12965 -0.41 -0.15 0.76 -0.48 12966 CONC d1 d2 d3 d4 VAR UCUC(1000) UCUC(0100) UCUC(0010) UCUC(0001) UCUC(1100) UCUC(1001) UCUC(0110) UCUC(0101) UCUC(0011) -0.06 -0.19 0.83 -0.52 12619 -0.14 -0.18 0.82 -0.52 12758 -0.20 -0.17 0.82 -0.51 12843 -0.25 -0.17 0.81 -0.51 12895 -0.28 -0.16 0.80 -0.50 12925 -0.31 -0.16 0.79 -0.50 12943 -0.33 -0.16 0.79 -0.49 12953 -0.35 -0.16 0.79 -0.49 12959 -0.36 -0.15 0.78 -0.49 12962 -0.37 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.48 12965 -0.38 -0.15 0.78 -0.48 12966 -0.38 -0.15 0.77 -0.48 12967 0.71 0.00 0.00 0.71 9105 0.78 0.05 -0.32 0.53 
11499 0.74 0.07 -0.50 0.44 12306 0.68 0.09 -0.60 0.42 12601 0.62 0.10 -0.65 0.42 12753 0.57 0.12 -0.69 0.43 12841 0.53 0.12 -0.71 0.45 12894 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.77 0.48 12966 0.40 0.15 -0.77 0.48 12967 0.00 0.71 0.71 0.00 3491 -0.19 -0.13 0.94 -0.25 12162 -0.25 -0.17 0.86 -0.41 12806 -0.28 -0.16 0.82 -0.47 12915 -0.31 -0.16 0.80 -0.49 12942 -0.33 -0.16 0.79 -0.49 12953 -0.35 -0.16 0.79 -0.49 12959 -0.36 -0.15 0.78 -0.49 12963 -0.37 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.48 12966 0.00 0.71 0.00 0.71 4926 0.01 0.20 -0.54 0.81 11209 0.09 0.18 -0.73 0.65 12473 0.16 0.18 -0.79 0.56 12765 0.22 0.17 -0.80 0.53 12861 0.26 0.17 -0.80 0.51 12907 0.29 0.16 -0.80 0.50 12932 0.32 0.16 -0.79 0.50 12947 0.34 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966 0.00 0.00 0.71 0.71 4951 -0.06 -0.09 0.89 0.45 6835 -0.16 -0.15 0.97 -0.02 10755 -0.23 -0.17 0.90 -0.33 12547 -0.28 -0.16 0.84 -0.44 12876 -0.31 -0.16 0.81 -0.48 12934 -0.33 -0.16 0.80 -0.49 12951 -0.34 -0.16 0.79 -0.49 12958 -0.35 -0.15 0.78 -0.49 12962 -0.36 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.49 12965 -0.38 -0.15 0.78 -0.48 12966 UCUC(1111) akk MVM 0.50 0.50 0.50 0.50 4385 0.83 -0.04 0.32 0.46 8393 0.95 -0.06 0.09 0.28 9943 0.97 -0.04 -0.09 0.20 10663 0.95 -0.01 -0.24 0.21 11151 0.90 0.01 -0.36 0.25 11601 0.83 0.04 -0.47 0.30 12007 0.76 0.07 -0.55 0.34 12334 0.69 0.09 -0.61 0.38 12569 0.63 0.10 -0.65 0.41 12726 0.58 0.11 -0.69 0.43 12824 0.54 0.12 -0.71 0.44 12883 0.50 0.13 -0.73 0.45 12918 0.48 0.13 -0.74 0.46 12939 0.46 0.14 -0.75 0.46 12951 0.44 0.14 -0.75 0.47 12958 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.17 0.05 0.98 0.01 9327 0.06 -0.19 0.93 -0.30 11888 -0.04 -0.19 0.88 -0.44 12502 
-0.12 -0.18 0.84 -0.49 12715 -0.19 -0.18 0.83 -0.50 12822 -0.24 -0.17 0.81 -0.50 12882 -0.27 -0.17 0.80 -0.50 12918 -0.30 -0.16 0.80 -0.50 12939 -0.32 -0.16 0.79 -0.49 12951 -0.34 -0.16 0.79 -0.49 12958 -0.35 -0.15 0.78 -0.49 12962 -0.36 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.49 12965 -0.38 -0.15 0.78 -0.48 12966 0.00 -0.00 0.00 -0.01 1 0.28 -0.19 0.49 -0.80 10378 0.18 -0.20 0.71 -0.65 11773 0.06 -0.20 0.79 -0.58 12296 -0.04 -0.19 0.82 -0.54 12563 -0.12 -0.18 0.82 -0.53 12724 -0.19 -0.18 0.82 -0.52 12823 -0.24 -0.17 0.81 -0.51 12883 -0.27 -0.17 0.80 -0.50 12918 -0.30 -0.16 0.80 -0.50 12939 -0.33 -0.16 0.79 -0.49 12951 -0.34 -0.16 0.79 -0.49 12958 -0.35 -0.15 0.78 -0.49 12962 -0.36 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.49 12965 -0.38 -0.15 0.78 -0.48 12966 1.00 0.00 0.00 0.00 10249 0.99 -0.05 -0.11 0.06 10585 0.97 -0.03 -0.21 0.13 10947 0.93 -0.00 -0.32 0.19 11370 0.87 0.03 -0.42 0.26 11796 0.80 0.05 -0.51 0.31 12168 0.73 0.08 -0.58 0.36 12453 0.66 0.09 -0.63 0.39 12649 0.61 0.11 -0.67 0.42 12776 0.56 0.12 -0.70 0.43 12855 0.52 0.13 -0.72 0.45 12902 0.49 0.13 -0.73 0.46 12929 0.47 0.14 -0.74 0.46 12945 0.45 0.14 -0.75 0.47 12954 0.44 0.14 -0.75 0.47 12960 0.43 0.14 -0.76 0.47 12963 0.42 0.14 -0.76 0.47 12965 0.41 0.14 -0.76 0.48 12966 0.00 1.00 0.00 0.00 795 -0.23 0.33 -0.78 0.49 11645 -0.12 0.21 -0.82 0.52 12191 -0.01 0.19 -0.83 0.52 12469 0.09 0.19 -0.83 0.52 12660 0.16 0.18 -0.82 0.52 12783 0.22 0.17 -0.81 0.51 12859 0.26 0.17 -0.81 0.50 12904 0.29 0.16 -0.80 0.50 12931 0.32 0.16 -0.79 0.50 12946 0.33 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966 0.00 0.00 1.00 0.00 9950 -0.10 -0.18 0.93 -0.31 12279 -0.17 -0.18 0.86 -0.44 12749 -0.23 -0.17 0.83 -0.48 12865 -0.27 -0.17 0.81 -0.49 12911 -0.30 -0.16 0.80 -0.50 12935 -0.32 -0.16 0.79 -0.49 12949 -0.34 -0.16 0.79 -0.49 12956 -0.35 -0.15 0.78 -0.49 12961 -0.36 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.49 12965 -0.37 -0.15 0.78 
-0.48 12966 0.00 0.00 0.00 1.00 6686 0.08 0.16 -0.44 0.88 10572 0.16 0.17 -0.69 0.68 12435 0.22 0.17 -0.77 0.57 12816 0.26 0.17 -0.79 0.53 12901 0.29 0.16 -0.79 0.51 12932 0.32 0.16 -0.79 0.50 12947 0.34 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966 0.71 0.71 0.00 0.00 4968 0.94 0.02 -0.29 0.18 11266 0.88 0.02 -0.40 0.24 11709 0.82 0.05 -0.49 0.30 12096 0.74 0.07 -0.57 0.35 12400 0.68 0.09 -0.62 0.38 12614 0.62 0.10 -0.66 0.41 12754 0.57 0.12 -0.69 0.43 12841 0.53 0.12 -0.71 0.44 12894 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.77 0.48 12966 0.40 0.15 -0.77 0.48 12967 UCUC(1110) 0.58 0.58 0.58 0.00 4647 0.76 -0.15 0.62 -0.14 9784 0.72 -0.19 0.61 -0.27 10149 0.65 -0.20 0.64 -0.36 10422 0.56 -0.20 0.69 -0.41 10750 0.44 -0.21 0.74 -0.46 11149 0.32 -0.21 0.78 -0.49 11582 0.19 -0.21 0.81 -0.51 11988 0.07 -0.20 0.83 -0.52 12319 -0.04 -0.19 0.83 -0.52 12559 -0.12 -0.18 0.83 -0.52 12719 -0.18 -0.18 0.82 -0.51 12820 -0.23 -0.17 0.81 -0.51 12881 -0.27 -0.17 0.80 -0.50 12917 -0.30 -0.16 0.80 -0.50 12938 -0.32 -0.16 0.79 -0.49 12950 -0.34 -0.16 0.79 -0.49 12957 -0.35 -0.15 0.78 -0.49 12961 -0.36 -0.15 0.78 -0.49 12964 -0.37 -0.15 0.78 -0.49 12965 -0.38 -0.15 0.78 -0.48 12966 UCUC(1010) 0.71 0.00 0.71 0.00 9007 0.69 -0.18 0.67 -0.21 10074 0.62 -0.20 0.68 -0.33 10486 0.52 -0.21 0.72 -0.41 10867 0.40 -0.21 0.76 -0.46 11289 0.27 -0.21 0.80 -0.50 11721 0.15 -0.20 0.82 -0.51 12106 0.03 -0.20 0.83 -0.52 12408 On these pages we display the variance hill-climb for each of the four datasets (Concrete, IRIS, Seeds, Wine) for a grid of starting unit vectors, d. I took the circumscribing unit non-negative cube and used all the Unitized diagonals. In low dimension (all dimension=4 here) this grid is very nearly a uniform grid. 
Note that this will work less and less well as the dimension grows. In all cases, the same local max and nearly the same unit vector are reached.
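The grid-of-starting-vectors experiment can be sketched as follows. This is a minimal illustration, not the slides' implementation: it assumes the GV hill-climb step is "move d to the normalized gradient of the variance d^T A d", which is a power-iteration step on the covariance matrix A. The names `unitized_cube_corners`, `covariance`, `variance_along` and `hill_climb`, and the tiny 2-dimensional data set, are hypothetical.

```python
import itertools
import math

def unitized_cube_corners(n):
    """Nonzero 0/1 corners of the unit cube in R^n, each scaled to unit length."""
    corners = []
    for bits in itertools.product((0, 1), repeat=n):
        k = sum(bits)
        if k:  # skip the all-zero corner, which has no direction
            corners.append([b / math.sqrt(k) for b in bits])
    return corners

def covariance(rows):
    """Population covariance matrix of a list of equal-length rows."""
    m, n = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / m for j in range(n)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / m
             for j in range(n)] for i in range(n)]

def variance_along(cov, d):
    """Variance of the data projected on unit vector d, i.e. d^T cov d."""
    n = len(d)
    return sum(d[i] * cov[i][j] * d[j] for i in range(n) for j in range(n))

def hill_climb(cov, d, max_steps=200, tol=1e-12):
    """Assumed climb step: replace d by the normalized gradient cov*d."""
    v = variance_along(cov, d)
    for _ in range(max_steps):
        grad = [sum(cov[i][j] * d[j] for j in range(len(d)))
                for i in range(len(d))]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        new_d = [g / norm for g in grad]
        new_v = variance_along(cov, new_d)
        if new_v - v <= tol:  # no further improvement
            break
        d, v = new_d, new_v
    return d, v

# Hypothetical toy data with one strongly elongated direction.
rows = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0), (5.0, 10.5)]
cov = covariance(rows)
results = [hill_climb(cov, d) for d in unitized_cube_corners(2)]
for d, v in results:
    print([round(c, 2) for c in d], round(v, 3))
```

On this toy data every starting corner climbs to the same variance, which mirrors the observation on the slide. A start that happens to be exactly orthogonal to the top direction would stay put under this step rule, which is one reason to use a grid rather than a single start.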

  26. GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians) 2 SEEDS d1 d2 d3 d4 VAR UCUC(1000) UCUC(0100) UCUC(0010) UCUC(0001) UCUC(1100) UCUC(1010) UCUC(1001) UCUC(0110) UCUC(0101) UCUC(0011) UCUC(1110) UCUC(1101) UCUC(1011) UCUC(0111) UCUC(1111) akk MVM WINE d1 d2 d3 d4 VAR UCUC(1000) UCUC(0100) UCUC(0010) UCUC(0001) UCUC(1100) UCUC(1010) UCUC(1001) UCUC(0110) UCUC(0101) UCUC(0011) UCUC(1110) UCUC(1101) UCUC(1011) UCUC(0111) UCUC(1111) akk MVM IRIS d1 d2 d3 d4 VAR UCUC(1000) UCUC(0100) UCUC(0010) UCUC(0001) UCUC(1100) UCUC(1010) UCUC(1001) UCUC(0110) UCUC(0101) UCUC(0011) UCUC(1110) UCUC(1101) UCUC(1011) UCUC(0111) UCUC(1111) akk MVM 1.00 0.00 0.00 0.00 8 0.97 0.16 -0.11 0.14 9 0.00 1.00 0.00 0.00 0 0.96 0.23 -0.14 0.13 9 0.00 0.00 1.00 0.00 2 -0.36 -0.07 0.93 -0.00 4 -0.82 -0.15 0.55 -0.09 8 -0.94 -0.16 0.27 -0.12 9 0.00 0.00 0.00 1.00 0 0.97 0.15 -0.00 0.19 9 0.71 0.71 0.00 0.00 6 0.97 0.17 -0.12 0.13 9 0.71 0.00 0.71 0.00 4 0.96 0.16 0.20 0.15 8 0.97 0.16 -0.05 0.14 9 0.71 0.00 0.00 0.71 5 0.97 0.16 -0.10 0.14 9 0.00 0.71 0.71 0.00 1 0.19 0.06 0.98 0.08 2 0.33 0.04 0.94 0.10 3 0.70 0.11 0.69 0.14 5 0.96 0.16 0.18 0.15 8 0.97 0.16 -0.06 0.14 9 0.00 0.71 0.00 0.71 0 0.97 0.20 -0.08 0.15 9 0.00 0.00 0.71 0.71 1 0.08 -0.01 0.99 0.09 2 -0.07 -0.03 1.00 0.05 3 -0.51 -0.10 0.86 -0.03 5 -0.88 -0.15 0.44 -0.10 8 -0.95 -0.16 0.23 -0.13 9 0.58 0.58 0.58 0.00 4 0.96 0.17 0.15 0.15 8 0.97 0.16 -0.07 0.14 9 0.58 0.58 0.00 0.58 5 0.97 0.17 -0.10 0.14 9 0.58 0.00 0.58 0.58 4 0.96 0.16 0.17 0.16 8 0.97 0.16 -0.06 0.14 9 0.00 0.58 0.58 0.58 1 0.56 0.11 0.80 0.14 4 0.92 0.15 0.31 0.15 8 0.98 0.16 -0.02 0.14 9 0.50 0.50 0.50 0.50 4 0.97 0.17 0.13 0.15 8 0.97 0.16 -0.07 0.14 9 0.98 0.14 0.06 0.13 9 -0.62 0.36 0.27 -0.30 4 -0.95 -0.15 0.22 -0.13 9 1.00 0.00 0.00 0.00 4 0.40 -0.06 -0.91 -0.07 497 0.02 -0.25 -0.97 -0.01 608 0.00 1.00 0.00 0.00 82 -0.00 0.49 0.87 0.00 577 -0.01 0.28 0.96 0.00 608 0.00 0.00 1.00 0.00 567 -0.01 
0.25 0.97 0.00 608 0.00 0.00 0.00 1.00 1 -0.20 0.17 0.84 0.47 455 -0.02 0.26 0.96 0.01 608 0.71 0.71 0.00 0.00 42 0.02 0.51 0.86 -0.00 570 -0.01 0.29 0.96 0.00 608 0.71 0.00 0.71 0.00 277 -0.01 0.25 0.97 0.00 608 0.71 0.00 0.00 0.71 2 0.46 0.00 -0.88 0.12 447 0.02 -0.25 -0.97 -0.00 608 0.00 0.71 0.71 0.00 472 -0.01 0.31 0.95 0.00 608 0.00 0.71 0.00 0.71 42 -0.01 0.48 0.88 0.01 578 -0.01 0.28 0.96 0.00 608 0.00 0.00 0.71 0.71 287 -0.02 0.25 0.97 0.01 608 0.58 0.58 0.58 0.00 310 -0.01 0.31 0.95 0.00 607 -0.01 0.27 0.96 0.00 608 0.58 0.58 0.00 0.58 29 0.02 0.50 0.86 0.01 572 -0.01 0.29 0.96 0.00 608 0.58 0.00 0.58 0.58 186 -0.01 0.25 0.97 0.01 608 0.00 0.58 0.58 0.58 317 -0.01 0.30 0.95 0.01 608 0.50 0.50 0.50 0.50 234 -0.01 0.31 0.95 0.01 607 -0.01 0.27 0.96 0.00 608 0.07 0.15 0.98 0.12 588 -0.01 0.26 0.97 0.00 608 -0.13 -1.00 -3.07 -0.03 6314 0.01 -0.27 -0.96 -0.00 608 1.00 0.00 0.00 0.00 68 0.45 -0.03 0.83 0.34 415 0.36 -0.08 0.86 0.36 420 0.00 1.00 0.00 0.00 19 -0.10 0.48 -0.82 -0.30 334 -0.34 0.10 -0.86 -0.36 420 0.00 0.00 1.00 0.00 311 0.35 -0.09 0.86 0.35 420 0.00 0.00 0.00 1.00 58 0.34 -0.08 0.85 0.39 420 0.71 0.71 0.00 0.00 39 0.53 0.12 0.78 0.33 390 0.37 -0.07 0.86 0.36 420 0.71 0.00 0.71 0.00 316 0.38 -0.07 0.85 0.35 420 0.71 0.00 0.00 0.71 114 0.40 -0.05 0.84 0.36 419 0.36 -0.08 0.86 0.36 420 0.00 0.71 0.71 0.00 133 0.37 -0.04 0.86 0.36 419 0.36 -0.08 0.86 0.36 420 0.00 0.71 0.00 0.71 27 0.41 0.06 0.82 0.40 410 0.37 -0.08 0.86 0.36 420 0.00 0.00 0.71 0.71 312 0.35 -0.09 0.86 0.36 420 0.58 0.58 0.58 0.00 193 0.40 -0.04 0.85 0.35 419 0.36 -0.08 0.86 0.36 420 0.58 0.58 0.00 0.58 72 0.43 0.01 0.83 0.36 414 0.37 -0.08 0.86 0.36 420 0.58 0.00 0.58 0.58 349 0.37 -0.07 0.85 0.36 420 0.00 0.58 0.58 0.58 185 0.36 -0.05 0.85 0.37 420 0.50 0.50 0.50 0.50 243 0.90 0.24 0.37 0.04 180 0.41 -0.04 0.84 0.35 418 0.36 -0.08 0.86 0.36 420 0.90 0.24 0.37 0.04 180 0.41 -0.04 0.84 0.35 418 0.36 -0.08 0.86 0.36 420 -0.00 -0.04 0.05 0.01 1 0.35 -0.09 0.86 0.36 420 As we all know, 
Dr. Ubhaya is the best mathematician on campus, and he is attempting to prove three things: 1. That a GV hill-climb that does not reach the global maximum variance is rare indeed. 2. That one is guaranteed to reach the global maximum from at least one of the coordinate unit vectors (so a 90 degree grid will always suffice). 3. That starting from akk will always reach the global max.

  27. Finding round clusters that aren't DPPd separable (no linear gap)? Find the golf ball? Suppose we have a white mask pTree. No linear gaps exist to reveal it. Search a grid of d-tubes until a DPPd gap is found in the interior of the tube. (Form a mask pTree for the interior of the d-tube, then apply DPPd to that mask to reveal interior gaps.) Look for conical gaps (fix the cone point at the middle of the tube) over all cone angles (look for an interval of angles with no points). Notice that this method includes DPPd, since a gap for a cone angle of 90 degrees is linear.
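The "golf ball" idea can be sketched on synthetic data. This is a hedged illustration only: plain Python lists stand in for pTree masks, and the data (a small ball surrounded by a ring), the tube radius 0.5 and the gap threshold 1.0 are all made-up values. The helper names `tube_length`, `squared_tube_radius` and `gaps` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tube_length(y, p, d):
    """Projection length of y - p on the unit vector d."""
    return dot([yi - pi for yi, pi in zip(y, p)], d)

def squared_tube_radius(y, p, d):
    """Squared distance of y from the line through p with unit direction d."""
    diff = [yi - pi for yi, pi in zip(y, p)]
    return dot(diff, diff) - dot(diff, d) ** 2

def gaps(values, min_gap):
    """Consecutive sorted values that are at least min_gap apart."""
    s = sorted(values)
    return [(a, b) for a, b in zip(s, s[1:]) if b - a >= min_gap]

# A ring of 60 points of radius 3 around (5, 5) hides a small ball at its center.
ring = [(5 + 3 * math.cos(2 * math.pi * k / 60),
         5 + 3 * math.sin(2 * math.pi * k / 60)) for k in range(60)]
ball = [(5.0, 5.0)] + [(5 + 0.3 * math.cos(2 * math.pi * k / 8),
                        5 + 0.3 * math.sin(2 * math.pi * k / 8)) for k in range(8)]
points = ring + ball
p, d = (5.0, 5.0), (1.0, 0.0)

# Projecting everything on d shows no gap of width 1 ...
all_proj = [tube_length(y, p, d) for y in points]
# ... but masking to the interior of the d-tube first reveals two gaps.
inside = [y for y in points if squared_tube_radius(y, p, d) <= 0.5 ** 2]
tube_proj = [tube_length(y, p, d) for y in inside]

print(gaps(all_proj, 1.0))
print(gaps(tube_proj, 1.0))
```

The first print is an empty list and the second lists one gap on each side of the ball, which is the effect the slide describes: the tube mask removes the points that fill in the linear projection.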

  28. FAUST Gap Revealer

Width ≥ 2^4, so compute all pTree combinations down to p4 and p4'. d = M-p.

[Figure: scatter plot of z1..zf and the mean M on a hexadecimal grid, and the tree of ANDs of the bit slices p6, p5, p4 with their complements p6', p5', p4', showing the 1-count C at each node: p6' C=5, p6 C=10, p6'&p5' C=3, p6'&p5 C=2, p6&p5' C=2, p6&p5 C=8.]

Z    x   y   F=zod
z1   1   1    11
z2   3   1    27
z3   2   2    23
z4   3   3    34
z5   6   2    53
z6   9   3    80
z7   15  1   118
z8   14  2   114
z9   15  3   125
za   13  4   114
zb   10  9   110
zc   11  10  121
zd   9   11  109
ze   11  11  125
zf   7   8    83

Bit slices of F (one column per point, in the order z1 ... zf) and their complements:

p6   0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
p5   0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
p4   0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
p3   1 1 0 0 0 0 0 0 1 0 1 1 1 1 0
p2   0 0 1 0 1 0 1 0 1 0 1 0 1 1 0
p1   1 1 1 1 0 0 1 1 0 1 1 0 0 0 1
p0   1 1 1 0 1 0 0 0 1 0 0 1 1 1 1
p6'  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
p5'  1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
p4'  1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
p3'  0 0 1 1 1 1 1 1 0 1 0 0 0 0 1
p2'  1 1 0 1 0 1 0 1 0 1 0 1 0 0 1
p1'  0 0 0 0 1 1 0 0 1 0 0 1 1 1 0
p0'  0 0 0 1 0 1 1 1 0 1 1 0 0 0 0

Walking the width-2^4 intervals from left to right:

[000 0000, 000 1111] = [0,16) has 1 point, z1 (C=1). This is a 2^4 thinning. z1od=11 is only 5 units from the right edge, so z1 is not yet declared an outlier. Next, we check the minimum distance from the left edge of the next interval to see whether z1's right-side gap is actually ≥ 2^4 (the calculation of the min is a pTree process, no x looping required!).

[001 0000, 001 1111] = [16,32) (C=2). The minimum, z3od=23, is 7 units from the left edge, 16, so z1 has only a 5+7=12 unit gap on its right (not a 2^4 gap). So z1 is not declared a 2^4 outlier (it is declared a 2^4 inlier).

[010 0000, 010 1111] = [32,48) (C=1). z4od=34 is within 2 of 32, so z4 is not declared an anomaly.

[011 0000, 011 1111] = [48,64) (C=1). z5od=53 is 19 from z4od=34 (> 2^4) but only 11 from 64. But the next interval, [64,80), is empty, so z5 is 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.

[100 0000, 100 1111] = [64,80) (C=0). This is clearly a 2^4 gap.

[101 0000, 101 1111] = [80,96) (C=2). z6od=80, zfod=83.

[110 0000, 110 1111] = [96,112) (C=2). zbod=110, zdod=109. So both of {z6, zf} are declared outliers (gap ≥ 16 on both sides).

[111 0000, 111 1111] = [112,128) (C=6). z7od=118, z8od=114, z9od=125, zaod=114, zcod=121, zeod=125. No 2^4 gaps. But we can consult SpS(d2(x,y)) for actual distances, which reveals that there are no 2^4 gaps in this subcluster. And, incidentally, it reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e}, but that analysis is messy and the gap would be revealed by the next xofM round on this subcluster anyway.

Pairwise distances dX1X2 (z10 ... z14 are za ... ze):

X1\X2  z8    z9    z10   z11   z12   z13   z14
z7     1.4   2.0   3.6   9.4   9.8   11.7  10.8
z8           1.4   2.2   8.1   8.5   10.3  9.5
z9                 2.2   7.8   8.1   10.0  8.9
z10                      5.8   6.3   8.1   7.3
z11                            1.4   2.2   2.2
z12                                  2.2   1.0
z13                                        2.0
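The interval counts in this walk-through can be reproduced from the F=zod values on the slide. The sketch below is an illustration under assumptions: each bit slice is emulated with a Python integer used as a bit mask (one bit per point), which only stands in for a real pTree, and the function names `bit_slices`, `interval_counts` and `wide_gaps` are hypothetical. The final sorted-value check is an ordinary loop added for comparison, not part of the pTree method.

```python
# F = z o d values for z1 ... zf, copied from the slide.
F = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]

def bit_slices(values, nbits):
    """slices[k] has bit i set when value i has bit k set (a stand-in for pTree p_k)."""
    slices = []
    for k in range(nbits):
        mask = 0
        for i, v in enumerate(values):
            if (v >> k) & 1:
                mask |= 1 << i
        slices.append(mask)
    return slices

def interval_counts(values, nbits, low_bits):
    """Count points per interval of width 2**low_bits by ANDing high-order slices."""
    full = (1 << len(values)) - 1
    slices = bit_slices(values, nbits)
    counts = []
    for prefix in range(1 << (nbits - low_bits)):
        mask = full
        for k in range(low_bits, nbits):
            wanted = (prefix >> (k - low_bits)) & 1
            # AND with p_k when the prefix bit is 1, else with its complement p_k'
            mask &= slices[k] if wanted else (full & ~slices[k])
        counts.append(bin(mask).count("1"))
    return counts

def wide_gaps(values, width):
    """Neighboring sorted values at least `width` apart (plain-loop cross-check)."""
    s = sorted(values)
    return [(a, b) for a, b in zip(s, s[1:]) if b - a >= width]

counts = interval_counts(F, nbits=7, low_bits=4)
print(counts)             # points in [0,16), [16,32), ..., [112,128)
print(wide_gaps(F, 16))   # gaps of width >= 2**4 between neighbors
```

The empty interval [64,80) shows up as the zero count, and the cross-check lists the three neighbor gaps of at least 16 that isolate z5 (F=53) and the pair {z6, zf} (F=80, 83).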

  29. FAUST Tube Clustering (this method attempts to build tubular-shaped gaps around clusters)

[Figure: a point y, the line through p and q (with f and M-p as direction vectors), the dot product projection length of y along the line, and the dot product projection distance of y from the line. Gaps in the projection lengths give the tube cap gap width; gaps in the projection distances give the tube radius gap width.]

In the formulas below, o (written \circ) is the dot product.

Squared y on f projection distance:

\[ \Bigl(y - \frac{y\circ f}{f\circ f}\,f\Bigr)\circ\Bigl(y - \frac{y\circ f}{f\circ f}\,f\Bigr) = y\circ y - 2\,\frac{(y\circ f)^2}{f\circ f} + \frac{(y\circ f)^2}{(f\circ f)^2}\,f\circ f = y\circ y - \frac{(y\circ f)^2}{f\circ f} \]

Squared y-p on q-p projection distance:

\[ (y-p)\circ(y-p) - \frac{\bigl((y-p)\circ(q-p)\bigr)^2}{(q-p)\circ(q-p)} = y\circ y - 2\,y\circ p + p\circ p - \frac{\bigl(y\circ(q-p) - p\circ(q-p)\bigr)^2}{|q-p|^2} \]

For the dot product length projections (caps) we already needed:

\[ (y-p)\circ\frac{M-p}{|M-p|} = \frac{y\circ(M-p) - p\circ(M-p)}{|M-p|} \]

Allows for a better fit around convex clusters that are elongated in one direction (not round).

Exhaustive search for all tubular gaps: it takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width).
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (an n-direction, so (n-1)-dimensional: a grid on the surface of the sphere in R^n).

Then for every choice of (p,d) (e.g., in a grid of points in R^{2n-1}) two functionals are used to enclose subclusters in tubular gaps.
a. SquareTubeRadius functional, \( STR(y) = (y-p)\circ(y-p) - ((y-p)\circ d)^2 \)
b. TubeLength functional, \( TL(y) = (y-p)\circ d \)

Given a p, do we need a full grid of ds (directions)? No! d and -d give the same TL-gaps. Given d, do we need a full grid of p starting points? No! All p' such that p'=p+cd give the same gaps. Hill climb the gap width from a good starting point and direction.

MATH: We need the dot product projection length and the dot product projection distance (in red). That is, we needed to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? Minimizing PTreeSet functional creations and PTreeSet operations.
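The two functionals, and the expansion into per-point dot products plus constants that depend only on p and d, can be checked numerically. This is a small sketch under the assumption that d is a unit vector; ordinary Python arithmetic stands in for PTreeSet operations, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tube_length(y, p, d):
    """TL(y) = (y - p) o d."""
    return dot([yi - pi for yi, pi in zip(y, p)], d)

def str_direct(y, p, d):
    """STR(y) = (y - p) o (y - p) - ((y - p) o d)^2, computed from the difference."""
    diff = [yi - pi for yi, pi in zip(y, p)]
    return dot(diff, diff) - dot(diff, d) ** 2

def str_expanded(y, p, d):
    """Same value from y o y, y o p, y o d and the constants p o p, p o d."""
    pp = dot(p, p)  # constant, computed once per (p, d)
    pd = dot(p, d)  # constant, computed once per (p, d)
    return dot(y, y) - 2 * dot(y, p) + pp - (dot(y, d) - pd) ** 2

p = (1.0, 1.0)
d = (0.6, 0.8)          # unit vector: 0.36 + 0.64 = 1
on_axis = (4.0, 5.0)    # p + 5 d, so it lies on the tube axis
off_axis = (1.0, 5.0)

for y in (on_axis, off_axis):
    print(tube_length(y, p, d), str_direct(y, p, d), str_expanded(y, p, d))
```

The expanded form is the one that matters for the "what is optimal?" question on the slide: only y o y, y o p and y o d vary with y, and the remaining terms are scalars. Negating d flips the sign of every TL value, so the gaps between sorted TL values are unchanged.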

  30. Cone Clustering: (finding cone-shaped clusters) Cosine cone gap (over some angle): a gap in the dot product projections onto the corner-points line. F=(y-M)o(x-M)/|x-M|-mn, restricted to a cosine cone, on IRIS.
[Figure: F-value count histograms, one per panel. Each panel below lists the cone axis, the cosine threshold, the number of points in the cone, and the result noted on the slide.]
Corner points as x:
x=s1 cone=1/√2: 50 points
x=s2 cone=1/√2: 51 points
x=s2 cone=.9: 47 points
x=s2 cone=.1: 59 points
x=i1 cone=.707: 75 points
x=e1 cone=.707: 60 points
w maxs cone=.707: 137 points
w maxs cone=.925: 31/34 are i's
w maxs cone=.93: 27/29 are i's
w maxs-to-mins cone=.939: 114 points, 14 i and 100 s/e, so picks i as 0
w naaa-xaaa cone=.95: 41/43 e, so picks e
w aaan-aaax cone=.54: 100/104 s or e, so 0 picks i
w xnnn-nxxx cone=.95: 43/50 e, so picks out e
Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths). The length of the fixed vector, x-M, is a one-time calculation. The length of y-M changes with y, so build the PTreeSet.
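
As a rough sketch of the cosine-cone restriction described on this slide, the following Python computes F = (y-M)o(x-M)/|x-M| - mn only for points whose cosine with the axis x-M meets the threshold, then counts the rounded F-values. The function name, the rounding to integers, and the sample data are assumptions made for illustration; the slides compute these quantities with PTreeSets.

```python
import math
from collections import Counter


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cone_projection_counts(points, M, x, cone):
    """Count rounded F-values for points inside the cosine cone around x-M."""
    axis = [xi - mi for xi, mi in zip(x, M)]
    axis_len = math.sqrt(dot(axis, axis))  # fixed vector: computed once
    projections = []
    for y in points:
        v = [yi - mi for yi, mi in zip(y, M)]
        v_len = math.sqrt(dot(v, v))  # changes with every y
        if v_len == 0:
            continue  # the cosine is undefined at M itself
        length = dot(v, axis) / axis_len
        if length / v_len >= cone:  # cosine = dot product / both lengths
            projections.append(length)
    if not projections:
        return {}
    mn = min(projections)
    return dict(sorted(Counter(round(f - mn) for f in projections).items()))


# Hypothetical 2-D data: two points fall outside the cone and are dropped.
pts = [(4, 0), (4, 1), (10, 0), (0, 5), (-3, 0)]
print(cone_projection_counts(pts, M=(0, 0), x=(1, 0), cone=0.707))
```

A gap then shows up as a run of missing keys in the returned counts, here between F=0 and F=6.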

  31. "Gap Hill Climbing": mathematical analysis. Rotate d toward a higher F-STD, or grow one gap using support pairs.
[Figure: 2-D scatter of points labeled 0-9 and a-s, with start points p and q, directions d1 and d2, and the resulting d1-gap and d2-gap.]
F-slices are hyperplanes (assuming F is the dot product with d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the (n-1)-dimensional F-slice hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask, the sequence of F-values, and the sequence of counts of points giving those values, which we use to find large gaps in the first place.
In the figure, the d2-gap >> the d1-gap (still not optimal). Weight the mean by the distance from the gap (d-barrel radius)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius^2)? (Zero weighting after the first gap is identical to the previous method.) Also, we really want to identify the support vector pair of the gap (the pair, one from each side, that are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q?
Dot F, p=aaan, q=aaax (F-value:count): 0:6 1:28 2:7 3:7 4:1 5:1 9:7 10:3 11:5 12:13 13:8 14:12 15:4 16:2 17:12 18:5 19:6 20:6 21:3 22:8 23:3 24:3
C1<7 (50 setosa); 7<C2<16 (4i, 48e); C3>16 (46i, 2e).
Hill-climb the gap at 16 with half-space averages: C2 ∪ C3, p=avg<16, q=avg>16 (F-value:count): 0:1 1:1 2:2 3:1 7:2 9:2 10:2 11:3 12:3 13:2 14:5 15:1 16:3 17:3 18:2 19:2 20:4 21:5 22:2 23:5 24:9 25:1 26:1 27:3 28:2 29:1 30:3 31:5 32:2 33:3 34:3 35:1 36:2 37:4 38:1 39:1 42:2 44:1 45:2 47:2
No conclusive gaps. There is a thinning at 22, and it is the same one, but it is not more prominent.
Sparse Lo end: check [0,9] distances
F     0    1    2    2    3    7    7    9    9
      i39  e49  e8   e44  e11  e32  e30  e15  e31
i39   0    17   21   21   24   22   19   19   23
e49   17   0    4    4    7    8    8    9    9
e8    21   4    0    1    5    7    8    10   8
e44   21   4    1    0    4    6    8    9    7
e11   24   7    5    4    0    7    9    11   7
e32   22   8    7    6    7    0    3    6    1
e30   19   8    8    8    9    3    0    4    4
e15   19   9    10   9    11   6    4    0    6
e31   23   9    8    7    7    1    4    6    0
i39, e49, e11 are singleton outliers. {e8,e44} is a doubleton outlier set.
Sparse Hi end: check [38,47] distances
F     38   39   42   42   44   45   45   47   47
      i31  i8   i36  i10  i6   i23  i32  i18  i19
i31   0    3    5    10   6    7    12   12   10
i8    3    0    7    10   5    6    11   11   9
i36   5    7    0    8    5    7    9    10   9
i10   10   10   8    0    10   12   9    9    14
i6    6    5    5    10   0    3    9    8    5
i23   7    6    7    12   3    0    11   10   4
i32   12   11   9    9    9    11   0    4    13
i18   12   11   10   9    8    10   4    0    12
i19   10   9    9    14   5    4    13   12   0
i10, i18, i19, i32, i36 are singleton outliers. {i6,i23} is a doubleton outlier set.
Next we attempt to hill-climb the gap at 16 using the mean of the half-space boundary (i.e., p is avg=14; q is avg=17). C123, p avg=14, q avg=17 (F-value:count): 0:1 2:3 3:2 4:4 5:7 6:4 7:8 8:2 9:11 10:4 12:3 13:1 20:1 21:1 22:2 23:1 27:2 28:1 29:1 30:2 31:4 32:2 33:3 34:4 35:1 36:3 37:4 38:2 39:2 40:5 41:3 42:3 43:6 44:8 45:1 46:2 47:1 48:3 49:3 51:7 52:2 53:2 54:3 55:1 56:3 57:3 58:1 61:2 63:2 64:1 66:1 67:1
Here, the gap between C1 and C2 is more pronounced. Why? Is the thinning between C2 and C3 more obscure? It did not grow the gap we wanted to grow (between C2 and C3).
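
One re-orientation step of the kind discussed on this slide might look like the following Python sketch: find the largest gap in the rounded F-values, take p and q to be the means of the two F-slices bounding that gap, and return the unit vector from p to q as the new direction. The function name, the rounding, and the toy data are assumptions for illustration, and no pTree masks are used.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def reorient_on_largest_gap(points, d):
    """Return (gap, p, q, new_d) for one F-slice hill-climbing step."""
    f = [round(dot(y, d)) for y in points]
    values = sorted(set(f))
    if len(values) < 2:
        raise ValueError("need at least two distinct F-values to have a gap")
    # Largest gap between consecutive occupied F-values.
    lo, hi = max(zip(values, values[1:]), key=lambda ab: ab[1] - ab[0])

    def slice_mean(slice_value):
        members = [y for y, fv in zip(points, f) if fv == slice_value]
        return [sum(coord) / len(members) for coord in zip(*members)]

    p, q = slice_mean(lo), slice_mean(hi)
    diff = [qi - pi for qi, pi in zip(q, p)]
    norm = math.sqrt(dot(diff, diff))
    return (lo, hi), p, q, [c / norm for c in diff]


# Hypothetical 2-D points projected on the first coordinate.
pts = [(0, 0), (1, 0), (1, 2), (5, 0), (5, 4), (6, 1)]
print(reorient_on_largest_gap(pts, d=(1, 0)))
```

Weighting the slice means by distance from the gap, or choosing the closest cross-gap pair instead of the means, would replace `slice_mean` in this sketch.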

  32. CAINE 2013 Call for Papers: 26th International Conference on Computer Applications in Industry and Engineering, September 25-27, 2013, Omni Hotel, Los Angeles, California, USA. Sponsored by the International Society for Computers and Their Applications (ISCA). CAINE-2013 will feature contributed papers as well as workshops and special sessions. Papers will be accepted into oral presentation sessions. The topics will include, but are not limited to, the following areas: Agent-Based Systems; Image/Signal Processing; Autonomous Systems; Information Assurance; Big Data Analytics; Information Systems/Databases; Bioinformatics, Biomedical Systems/Engineering; Internet and Web-Based Systems; Computer-Aided Design/Manufacturing; Knowledge-based Systems; Computer Architecture/VLSI; Mobile Computing; Computer Graphics and Animation; Multimedia Applications; Computer Modeling/Simulation; Neural Networks; Computer Security; Pattern Recognition/Computer Vision; Computers in Education; Rough Set and Fuzzy Logic; Computers in Healthcare; Robotics; Computer Networks; Fuzzy Logic; Control Systems; Sensor Networks; Data Communication; Scientific Computing; Data Mining; Software Engineering/CASE; Distributed Systems; Visualization; Embedded Systems; Wireless Networks and Communication. Important Dates: Workshop/special session proposal: May 25, 2013. Full Paper Submission: June 5, 2013. Notice of Acceptance: July 5, 2013. Pre-registration & Camera-Ready Paper Due: August 5, 2013. Event Dates: Sept 25-27, 2013.
SEDE 2013: The SEDE Conference is interested in gathering researchers and professionals in the domains of SE and DE to present and discuss high-quality research results and outcomes in their fields. SEDE 2013 aims at facilitating cross-fertilization of ideas in Software and Data Engineering. The conference topics include, but are not limited to: Requirements Engineering for Data Intensive Software Systems. Software Verification and Model Checking. Model-Based Methodologies. Software Quality and Software Metrics. Architecture and Design of Data Intensive Software Systems. Software Testing. Service- and Aspect-Oriented Techniques. Adaptive Software Systems. Information System Development. Software and Data Visualization. Development Tools for Data Intensive Software Systems. Software Processes. Software Project Management. Applications and Case Studies. Engineering Distributed, Parallel, and Peer-to-Peer Databases. Cloud infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management. Semi-Structured Data and XML Databases. Data Integration, Interoperability, and Metadata. Data Mining: Traditional, Large-Scale, and Parallel. Ubiquitous Data Management and Mobile Databases. Data Privacy and Security. Scientific and Biological Databases and Bioinformatics. Social networks, web, and personal information management. Data Grids, Data Warehousing, OLAP. Temporal, Spatial, Sensor, and Multimedia Databases. Taxonomy and Categorization. Pattern Recognition, Clustering, and Classification. Knowledge Management and Ontologies. Query Processing and Optimization. Database Applications and Experiences. Web Data Management and Deep Web. Important Dates: May 23, 2013: Paper Submission Deadline. June 30, 2013: Notification of Acceptance. July 20, 2013: Registration and Camera-Ready Manuscript. Conference Website: http://theory.utdallas.edu/SEDE2013/
ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems. Important Dates: May 5, 2013: Special Sessions Proposal. June 5, 2013: Full Paper Submission. July 5, 2013: Author Notification. Aug. 5, 2013: Advance Registration & Camera-Ready Paper Due.
CBR-MD 2013: International Workshop on Case-Based Reasoning, July 19, 2013, New York, USA. Topics of interest include (but are not limited to): CBR for signals, images, video, audio and text; Similarity assessment; Case representation and case mining; Retrieval and indexing; Conversational CBR; Meta-learning for model improvement and parameter setting for processing with CBR; Incremental model improvement by CBR; Case base maintenance for systems; Case authoring; Life-time of a CBR system; Measuring coverage of case bases; Ontology learning with CBR. Submission Deadline: March 20th, 2013. Notification Date: April 30th, 2013. Camera-Ready Deadline: May 12th, 2013.
DMLS: Workshop on Data Mining in Life Sciences. Topics: Discovery of high-level structures, including e.g. association networks; Text mining from biomedical literature; Medical images mining; Biomedical signals mining; Temporal and sequential data mining; Mining heterogeneous data; Mining data from molecular biology, genomics, proteomics, phylogenetic classification. With regard to different methodologies and case studies: Data mining project development methodology for biomedicine; Integration of data mining in the clinic; Ontology-driven data mining in life sciences; Methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples; Data mining for personal disease management; Utility considerations in DMLS, including e.g. cost-sensitive learning. Submission Deadline: March 20th, 2013. Notification Date: April 30th, 2013. Camera-Ready Deadline: May 12th, 2013. Workshop date: July 19th, 2013.
DMM'2013: Workshop on Data Mining in Marketing. In the business environment, data warehousing - the practice of creating huge, central stores of customer data that can be used throughout the enterprise - is becoming more and more common practice and, as a consequence, the importance of data mining is growing stronger. Topics: Applications in Marketing; Methods for User Profiling; Mining Insurance Data; E-Marketing with Data Mining; Logfile Analysis; Churn Management; Association Rules for Marketing Applications; Online Targeting and Controlling; Behavioral Targeting; Juridical Conditions of E-Marketing, Online Targeting and so on; Control of Online-Marketing Activities; New Trends in Online Marketing; Aspects of E-Mailing Activities and Newsletter Mailing. Submission Deadline: March 20th, 2013. Notification Date: April 30th, 2013. Camera-Ready Deadline: May 12th, 2013. Workshop date: July 19th, 2013.
DMA 2013: Workshop on Data Mining in Agriculture. Topics: Data Mining on Sensor and Spatial Data from Agricultural Applications; Analysis of Remote Sensor Data; Feature Selection on Agricultural Data; Evaluation of Data Mining Experiments; Spatial Autocorrelation in Agricultural Data. Submission Deadline: March 20th, 2013. Notification Date: April 30th, 2013. Camera-Ready Deadline: May 12th, 2013. Workshop date: July 19th, 2013.

  33. Hierarchical Clustering: Any maximal anti-chain (a maximal set of nodes such that no two are directly connected) is a clustering (a dendrogram offers many). But horizontal anti-chains are the clusterings produced by top-down (or bottom-up) methods. [Figure: dendrogram over leaves A, B, C, D, E, F, G with internal nodes BC, ABC, DE, FG, DEFG.]

  34. GV on Concrete(C, W, FA, A). Round 1: F=(DPP-MN)/4 (F-value:count): 0:1 1:1 5:1 6:1 7:1 8:4 9:1 10:1 11:2 12:1 13:5 14:1 15:3 16:3 17:4 18:1 19:3 20:9 21:4 22:3 23:7 24:2 25:4 26:8 27:7 28:7 29:10 30:3 31:1 32:3 33:6 34:4 35:5 37:2 38:2 40:1 42:3 43:1 44:1 45:1 46:4 49:1 56:1 58:1 61:1 65:1 66:1 69:1 71:1 77:1 80:1 83:1 86:1 100:1 103:1 105:1 108:2 112:1
Round 1 clusters C1 to C4: [0,90): 43L 46M 55H overall; gap=14 then [90,113): 0L 6M 0H, CLUS_1; gap=6 then [74,90): 0L 4M 0H, CLUS_2; gap=7 then [52,74): 0L 7M 0H, CLUS_3; the rest is CLUS 4. At this level, FinalClus1={17M}, 0 errors.
Round 2: CLUS 4 (F=(DPP-MN)/2, F-gap≥2) (F-value:count): 0:3 7:4 9:1 10:12 11:8 12:7 15:4 18:10 21:3 22:7 23:2 25:2 26:3 27:1 28:2 29:1 31:3 32:1 34:2 40:4 47:3 52:1 53:3 54:3 55:4 56:2 57:3 58:1 60:2 61:2 62:4 64:4 67:2 68:1 71:7 72:3 79:5 85:1 87:2
F range   L   M   H   Cluster      Gap  Median  Avg   Note
=0        0   0   3   CLUS 4.4.1   7    0       0
=7        0   0   4   CLUS 4.4.2   2    7       7
[8,14]    1   5   22  CLUS 4.4.3   3    11      10.7  1L+5M errs in H
=15       0   0   4   CLUS 4.3.1   3    15      15
=18       0   0   10  CLUS 4.3.2   3    18      18
[20,24)   0   10  2   CLUS 4.7.2   2    22      22    2H errs in L
[24,30)   10  0   0   CLUS 4.7.1   2    26      26
[30,33]   0   4   0   CLUS 4.2.1   2    31      32.3
=34       0   2   0   CLUS 4.2.2   6    34      34
=40       0   4   0   CLUS 4.2.3   7    40      40
=47       0   3   0   CLUS 4.2.4   5    47      47
[50,59)   12  1   4   CLUS 4.8.1   2    55      55    1M+4H errs in L
[59,63)   8   0   0   CLUS 4.8.2   2    61.5    61.3
=64       2   0   2   CLUS 4.6.1   3    64      64    2H errs in L
[66,70)   10  0   0   CLUS 4.6.2   3    67      67.3
[70,79)   10  0   0   CLUS 4.5     7    71      71.7
=79       5   0   0   CLUS 4.1.1   6    79      79
[74,90)   2   0   1   CLUS 4.1          87      86.3  1M err in L
Accuracy=90%.
[Figure: dendrogram of the CLUS 4 subclusters, with each node labeled by its median.]
Suppose we know that we want 3 strength clusters: Low, Medium and High. We can use an anti-chain that gives us exactly 3 subclusters in two ways, one shown in brown and the other in purple. Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133=72% accurate. Purple error count: Low 2, Medium 22, High 35, so 74/133=56% accurate.
What about agglomerating using single-link agglomeration (minimum pairwise distance)? Agglomerate (build the dendrogram) by iteratively gluing together the clusters with the minimum median separation. Should I have normalized the rounds? Should I have used the same F divisor and made sure the range of values was the same in the 2nd round as it was in the 1st round (on CLUS 4)? Can I normalize after the fact by multiplying the 1st round values by 88/50=1.76? Agglomerate the 1st round clusters and then independently agglomerate the 2nd round clusters? CONCRETE
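
The accuracy figures quoted for the brown and purple anti-chains follow directly from the per-cluster error counts. A small Python check, using only the numbers stated on this slide (the helper name is an illustrative assumption):

```python
def accuracy(errors_per_cluster, total):
    """Return (correctly clustered count, accuracy rounded to a whole percent)."""
    correct = total - sum(errors_per_cluster)
    return correct, round(100 * correct / total)


# Error counts are listed as (Low, Medium, High); CLUS 4 holds 133 points.
print("brown :", accuracy((11, 0, 26), 133))
print("purple:", accuracy((2, 22, 35), 133))
```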

  35. Agglomerating using single link (min pairwise distance = min gap size!): glue min-gap adjacent clusters first. GV CLUS 4 (F=(DPP-MN)/2, F-gap≥2): histogram and subcluster table as on slide 34.
The first thing we notice is that outliers mess up agglomerations that are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers by backing away from all gap≥5 agglomerations, then looking for a 3-subcluster maximal anti-chain. What we have done is to declare F<7 and F>84 as extreme tripleton outlier sets, and F=79, F=40 and F=47 as singleton outlier sets, because they are F-gapped by at least 5 (which is actually 10) on either side. The brown gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133=80% accurate.
The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) causes lots of error. C4.7.1 and C4.7.2 are problematic since they do separate out, but in increasing F order the pattern is H M L M L, so if we suspected this pattern we would look for 5 subclusters. The 5 orange errors in increasing F-order are 6, 2, 0, 0, 8, so 117/133=88% accurate. If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of HMLML is just bizarre! So we should expect errors. CONCRETE
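
On a single F axis, single-link agglomeration reduces to repeatedly gluing the two adjacent clusters separated by the smallest gap. The Python sketch below shows that idea on a short, made-up list of F-values; the function name and data are hypothetical, and ties are broken by taking the lowest-F pair, which is an arbitrary choice.

```python
def single_link_1d(values, k):
    """Merge adjacent clusters of F-values by minimum gap until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[v] for v in sorted(set(values))]
    while len(clusters) > k:
        # Gap between neighbors = first value of the right cluster
        # minus last value of the left cluster.
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][0] - clusters[j][-1])
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical F-values with one isolated low value.
print(single_link_1d([0, 7, 9, 10, 15, 18, 40, 46], k=3))
```

In this toy run the isolated value 0 ends up as its own cluster, which illustrates the slide's point that outliers distort agglomeration when the number of subclusters is fixed in advance.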
