
pTrees - Efficient Data Mining with Fast Horizontal Processing

pTrees, from Predicate Tree technologies, provide fast, accurate horizontal processing of compressed, data-mining-ready vertical data structures, supporting applications such as PINE, FAUST, MYRRH, PGP-D, ConCur, and DOVE. The presentation also covers vertical methods for string distances in unstructured data, distance metrics and weighting for scalar features, symmetric similarity measures, and how weighting scalar distances and correlations affects the accuracy of kNN classification.


Presentation Transcript


1. pTrees (Predicate Tree technologies) provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures. [Slide figure: bit-matrix diagram relating course, document, and person entities through Text, Buy, and Enroll relationships.] Applications:
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships.
PGP-D (Pretty Good Protection of Data) protects vertical pTree data; key = array(offset, pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
DOVE (DOmain VEctors) uses pTrees for database query processing.
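As a rough illustration of the vertical, bit-sliced representation described above, here is a minimal sketch in Python; the names (to_ptrees, p_and, p_not, root_count) are hypothetical and do not refer to the actual ptreelib API.

```python
# Minimal sketch of uncompressed (level-0) pTrees: each numeric column is
# decomposed into bit-slice vectors, and predicates are answered by ANDing
# slices and counting 1-bits ("root count"). Names are illustrative only.

def to_ptrees(column, bit_width):
    """Vertically decompose a column of ints into bit_width bit vectors.
    ptrees[j][i] is bit j (MSB first) of the i-th value."""
    return [[(v >> (bit_width - 1 - j)) & 1 for v in column]
            for j in range(bit_width)]

def p_and(*slices):
    return [int(all(bits)) for bits in zip(*slices)]

def p_not(pt):
    return [1 - b for b in pt]

def root_count(pt):
    return sum(pt)

# Example: count rows where a 3-bit attribute equals 0b101 (= 5).
col = [5, 3, 5, 7, 1, 5]
p = to_ptrees(col, 3)                     # p[0] = 2^2 slice, p[1] = 2^1, p[2] = 2^0
matches = p_and(p[0], p_not(p[1]), p[2])  # bit pattern 1,0,1
print(root_count(matches))                # -> 3
```

The point of the layout is that a predicate over one attribute touches only that attribute's bit slices, never whole records.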

2. From: msilverman@treeminer.com Sat, Nov 12, 2011 8:32 To: Perrizo, W Subject: question

Bill, I'm working toward a "generic" framework where we can handle multiple data types easily. Some of the test datasets I am working with have scalar types; from a match perspective, it seems that such features either match or they don't. The key issue for a classification method would be how much to weight the match or lack of match in the final classification determination. Is this correct? Now, having typed that, it seems you could construct scalar-specific distance measures. Where one feature is a text string, you could use a variety of string distance measures as a proxy for distance in that dimension (even as you use other, more relevant distance measures for other features in the same dataset, for example HOB distance for a continuous feature). Similarly, a day-of-the-week scalar could return a distance reflecting that Wednesday is closer to Thursday than Monday is. In the end, for a heterogeneous dataset, and with the goal of a commercially useful product, you could have a "library" of scalar distance functions, with a default of match / no match when there is no relevant distance metric (e.g., a cockpit switch is in position 1, 2, or 3). (Another project to park is how to leverage vertical methods for string distances, given the huge issues with unstructured data, if it can be done, but that's for later.) And this highlights how critical it is, in a product, to have feature-weight optimization working as part of what we do. It acts as a voting normalization process, weighting properly the relative importance of a scalar distance (if there is a way to calculate it) versus a HOB distance in the final class vote.

From: Perrizo, W Sent: Mon, Nov 14, 2011 10:05 AM To: 'Mark Silverman' Subject: RE: question

We can think of distance as a measure of dissimilarity and focus on similarities (we weight by similarity anyway). sim(a,b) requires:
sim(a,a) >= sim(a,b)  (nothing is more similar to a than a itself)
sim(a,b) = sim(b,a)   (symmetric).
Any correlation is a similarity, and any distance produces a similarity (via some inversion or reciprocation). Similarities are in 1-1 correspondence with "diagonally dominant" upper triangular matrices. E.g.,

     M  T  W  Th  F  Sa  Su
M    7  6  5  4   4  5   6
T       7  6  5   4  4   5
W          7  6   5  4   4
Th            7   6  5   4
F                 7  6   5
Sa                   7   6
Su                       7

Of course, the matrix representation (and the corresponding table-lookup code implementation) is most useful for "low cardinality" scalar attributes such as "day" or "month". But Amal did create a matrix for Pearson correlation of movies in the Netflix code (we didn't produce a table for renters since it would have been 500K x 500K). Did you get Amal's matrix as part of the software? If the scalar values just can't be coded usefully to numeric (the way "days" can be coded to 1,2,3,4,5,6,7), then Hamming distance may be best (1 for a match, 0 for a mismatch).
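A small sketch of the similarity conventions in the reply above, assuming the cyclic day-of-week matrix shown and one common (illustrative) way to invert a distance into a similarity; the function names are hypothetical.

```python
# Sketch: table-style similarity for a low-cardinality scalar ("day"),
# a generic distance -> similarity inversion, and the match / no-match
# fallback. Values follow the day-of-week matrix in the email above.

DAYS = ["M", "T", "W", "Th", "F", "Sa", "Su"]

def day_sim(a, b):
    """Cyclic day similarity: 7 for identical days, decreasing with
    cyclic distance, matching the upper-triangular matrix above."""
    d = abs(DAYS.index(a) - DAYS.index(b))
    return 7 - min(d, 7 - d)

def dist_to_sim(dist):
    """One common inversion: sim = 1 / (1 + dist). Then sim(a,a) = 1 is
    maximal and symmetry is inherited from the distance."""
    return 1.0 / (1.0 + dist)

def match_sim(a, b):
    """Default for scalars with no useful ordering: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0

assert day_sim("W", "Th") > day_sim("W", "M")   # Wed is closer to Thu than to Mon
assert day_sim("M", "M") == 7
```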

3. Mark: Why is kNN not O(n)? Couldn't you just track the lowest 7 distances and loop once through the data set? If a distance was less than the 7th lowest, substitute it in. I understand about ties and how picking an arbitrary point affects accuracy. We are exactly linear with training points.

WP: Yes, kNN and ckNN are O(n). But big-O analysis is not always revealing of true speed. If Nintendo, Sony and Microsoft had relied on big-O analysis, they might have concluded that video games are impossible. Say there are n records (so uncompressed pTrees are n bits deep) and d individual pTrees (d is constant). Then vertical processing of the n horizontal records is O(n) (each individual process step being unaffected by the addition of new records). Horizontal processing of the d vertical pTrees would be O(d) = O(1), except that adding records makes each pTree longer and slower to process. But by how much? Analysis should consider current processors (GPUs used? folding algorithms, etc.). Big-O analysis would make each of our steps O(n), but that doesn't reflect reality.

Mark: There's a minor fix in the ptreelib which dramatically improves performance; I'll send that separately. I am thinking we are losing a big advantage by structuring pTrees the way we are: we gain no advantage from being able to prune an entire subtree off, which would give us at least sublinear scaling as the training set increases.

WP: The reason we have not yet used "leveled" or compressed pTrees (actual trees) is that it has not been necessary (ANDing is so fast, why compress?). Now the reason I (and Sony, MS, Nintendo) don't like big-O gets smoked out. Even if we have the data as actual compressed leveled pTrees, it's still O(n), because the worst case is that there are no runs (i.e., 010101010101...), in which case the leaf level is size n and the accumulation of the other levels is also ~n, so we have O(2n) = O(n). Big-O analysis is too coarse to recognize the difference between retrieving d pTrees then ANDing/ORing (what we do) and retrieving n records then comparing (what they do). I suppose there is an argument for claiming sublinear, even log n, but in big-O we're supposed to assume the worst case. So theoretically it's a tie at O(n). Practically we win big.

Mark: Agreed. What I am seeing is about a 5-6x speed increase across the board. Because of the non-compressed implementation, adding to a pTree adds to each step precisely the amount it takes to AND/OR a 64-bit value. Thus, if we are 5x the speed with a 1M pTree, we are 5x the speed with a 2M pTree. From my analysis, the cost of doing a pass on a pTree is almost completely dependent on its size; there doesn't seem to be much of a fixed-cost component. I'm looking to push the limit: 5x and a couple of new CPUs get them there; 10-20x gets attention. As well, nobody cares about what you claim, they only care about what they see when they push go (they will not bring up the 1010101010... case and argue, so I agree with your larger point about O(n)). Images should be highly compressible, and at larger sizes these are advantages we are giving away. Showing a real-world image get more efficient as sizes increase is a huge psychological plus. Certainly a level-based approach is a bit more complex, but I think we can add to the library some methods to convert from non-compressed to compressed, and adjust the operator routines accordingly (and, to your point, O(n) at the lowest level is not the same as O(n) at the next level up, since n is smaller).
Will play with this over the weekend.

WP: Some thoughts regarding multilevel pTree implementations. Greg has created almost all of the code we need for multi-level pTrees. We are not trying to save storage space any more, so we would keep the current PTreeSet implementation for the uncompressed pTrees (e.g., for the 17,000 Netflix movies, the movie PTreeSet had 3*17,000 pTrees, each 500,000 bits deep, since there were 500,000 users). To implement a new set of level-1 pTrees with bit-stride = 1000 (recall, the uncompressed PTreeSet is called level-0), we would just create another PTreeSet which would have 3*17,000 pTrees, each 500 bits deep. Each of these 500 bits would be set according to the predicate. For the pure1 predicate, a bit would be set to 1 iff its stride of 1000 consecutive level-0 bits is entirely 1-bits, else set to 0. Another 3*17,000 by 500 level-1 PTreeSet could be created using the predicate "at least 50% 1-bits". This predicate provides blindingly fast "level-1 only" FAUST processing with almost the same accuracy. Each of these new level-1 PTreeSets could be built using the existing code. They could be used on their own (for level-1-only processing) or together with the level-0 uncompressed PTreeSet (we would access a non-pure leaf stride by simply calculating the proper offset into the corresponding level-0 pTree). We could even create level-2 PTreeSets, e.g., having 3*17,000 pTrees, each 64 bits deep (stride ≈ 16,000), so that each level-2 pTree would fit in a 64-bit register. I predict that FAUST using 50% level-1 pTrees will be 1000x faster and very nearly as accurate.
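A minimal sketch of the level-1 construction described above, assuming the level-0 pTree is held as a plain list of bits; the helper names and the exact PTreeSet mechanics are assumptions, not Greg's actual code.

```python
# Sketch: build a level-1 pTree from a level-0 (uncompressed) bit vector.
# Each level-1 bit summarizes one stride of level-0 bits under a predicate,
# e.g. pure1 or "at least 50% 1-bits". Illustrative only.

def level1(level0_bits, stride, predicate):
    """Return one level-1 bit per stride of level-0 bits."""
    return [1 if predicate(level0_bits[i:i + stride]) else 0
            for i in range(0, len(level0_bits), stride)]

pure1 = lambda seg: all(b == 1 for b in seg)          # pure-1 predicate
half_ones = lambda seg: 2 * sum(seg) >= len(seg)      # "at least 50% 1-bits"

# Small deterministic example with stride 4 (for Netflix scale, a 500,000-bit
# level-0 pTree with stride 1000 would yield a 500-bit level-1 pTree).
bits = [1, 1, 1, 1,   1, 0, 1, 1,   0, 0, 0, 0]
print(level1(bits, 4, pure1))      # -> [1, 0, 0]
print(level1(bits, 4, half_ones))  # -> [1, 1, 0]
```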

4. From: msilverman@treeminer.com Sent: Sat, Nov 12, 2011 8:32 To: Perrizo, W Subject: question

Bill, I'm working toward a "generic" framework where we can handle multiple data types easily. Some of the test datasets I am working with have scalar types; from a match perspective, it seems that such features either match or they don't. The key issue for a classification method would be how much to weight the match or lack of match in the final classification determination. Is this correct? Now, having typed that, it seems you could construct scalar-specific distance measures. Where one feature is a text string, you could use a variety of string distance measures as a proxy for distance in that dimension (even as you use other, more relevant distance measures for other features in the same dataset, for example HOB distance for a continuous feature). Similarly, a day-of-the-week scalar could return a distance reflecting that Wednesday is closer to Thursday than Monday is. In the end, then, for a heterogeneous dataset, and with the goal of a commercially useful product, you could possibly have a "library" of scalar distance functions, with a default of match / no match when there is no relevant distance metric (e.g., a cockpit switch is in position 1, 2, or 3). (Another project to park is how to leverage vertical methods for string distances, given the huge issues with unstructured data, if it can be done, but that's for later.) And this highlights how critical it is, in a product, to have feature-weight optimization working as part of what we do. It acts as a voting normalization process, weighting properly the relative importance of a scalar distance (if there is a way to calculate it) versus a HOB distance in the final class vote.

5. From: Perrizo, W Sent: Mon, Nov 14, 2011 10:05 AM To: 'Mark Silverman' Subject: RE: question

We can think of distance as a measure of dissimilarity, and then focus on similarities, because we end up weighting by similarity in the end anyway. The only requirements of a similarity sim(a,b) are that, for every a and b,
sim(a,a) >= sim(a,b)  (nothing is more similar to a than a itself)
sim(a,b) = sim(b,a)   (symmetric).
Any correlation is a similarity, and any distance produces a similarity (via some inversion or reciprocation). Similarities are in 1-1 correspondence with "diagonally dominant" upper triangular matrices. E.g.,

     M  T  W  Th  F  Sa  Su
M    7  6  5  4   4  5   6
T       7  6  5   4  4   5
W          7  6   5  4   4
Th            7   6  5   4
F                 7  6   5
Sa                   7   6
Su                       7

Of course, the matrix representation (and the corresponding table-lookup code implementation) is most useful for "low cardinality" scalar attributes such as "day" or "month". On the other hand, Amal did create a matrix for Pearson correlation of the movies in the Netflix code (we didn't produce a table for renters since it would have been 500K x 500K). Did you get Amal's matrix as part of the software? If the scalar values just can't be coded usefully to numeric (the way "days" can be coded to 1,2,3,4,5,6,7), then Hamming distance may be best (1 for a match, 0 for a mismatch).

6. Mark Silverman said: I'm working to complete the c-kNN as a test using my XML pTree description files, and have a question on HOB distance. What happens if I have a 16-bit attribute and an 8-bit attribute? Should I mask based upon the smallest field (e.g., in this case, pass 1 would be: 0xFFFF 0xFF)? If there are not sufficient matches, should pass 2 be: 0xFFFC 0xFE, so that the HOB mask is proportionally the same for all attributes, or should I make the mask for pass 2: 0xFFFE 0xFE? Have you considered this before? Thanks as always! Mark

WP said back: Your question is a very good one, but tough to give a definitive answer to (for me anyway). It goes to the "attribute normalization issue", which is an important but tough issue (few deal with it well, in my opinion). Consider HOBbit as a "quick look" L1 distance, using just one bit slice in each attribute, namely the highest-order bit-of-difference in that attribute. So, e.g., in 2-D, the question is: how do we [relatively] scale the two attributes so that a "sphere" is correctly round? (Of course an L1 or HOBbit sphere is a rectangle, and not round at all, but the idea is to relatively scale the 2 dimensions so that it is a "square" in terms of the semantics of the 2 attributes.) Take an image with Red and Green (e.g., for vegetation indexing). Suppose R is 16-bit and G is 8-bit (not likely, but let's assume that just for the purposes of this discussion). If we use 0xFFFE for R (horizontal dimension) and 0xFE for G (vertical dimension), then we are really constructing tall skinny rectangles as our neighborhoods, {x | distance(x, center) < epsilon}. In this case it would seem to make sense [to me] to use 0xFE00 for R and 0xFE for G, so that the neighborhood would look more like a square, since both dimensions count photons detected over the same interval of time, but maybe R counts in Fixed4.2 format and G counts in Fixed2.0? If R counts in Fixed4.0 and G counts in Fixed2.0, then 0xFFFE for R and 0xFE for G would give us a square neighborhood. What I'm trying to say is that the two deltas should be chosen based on the semantics of the two columns (what is being measured, and then what that suggests in terms of "moving out from the center" approximately the same amount in each ring). The more general question of normalizing attributes (columns) when the two measurements are not so clearly related is harder, but extremely important (if done poorly, one dimension gets radically favored in terms of which "neighboring points" get to vote during the classification, for instance).
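A sketch of the masking alternatives discussed above, for a 16-bit and an 8-bit attribute; the helper names are illustrative, and the Fixed4.2 case at the end is just the email's hypothetical.

```python
# Sketch: HOB ("highest-order bit") neighborhood masks for attributes of
# different bit widths. Dropping k low-order bits of an attribute makes its
# side of the neighborhood 2^k units wide; the question in the email is how
# many bits to drop per attribute so the neighborhood stays roughly "square"
# in the semantics of the two columns. Illustrative only.

def hob_mask(bit_width, bits_dropped):
    """Mask that keeps the top (bit_width - bits_dropped) bits."""
    return ((1 << bit_width) - 1) & ~((1 << bits_dropped) - 1)

def hob_match(a, b, mask):
    """True if a and b agree on all unmasked (high-order) bits."""
    return (a & mask) == (b & mask)

# Proportional widening (drop the same fraction of each field):
print(hex(hob_mask(16, 2)), hex(hob_mask(8, 1)))   # 0xfffc 0xfe
# Equal absolute widening (drop the same number of low bits):
print(hex(hob_mask(16, 1)), hex(hob_mask(8, 1)))   # 0xfffe 0xfe
# Semantics-driven widening from the email's Fixed4.2 vs Fixed2.0 example:
print(hex(hob_mask(16, 9)), hex(hob_mask(8, 1)))   # 0xfe00 0xfe
```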

7. Mark S.: I realize this is variable depending upon data type. I've had some nibbles from people who are asking to view the solution from a financial perspective, e.g., what is the cost of ownership vs. existing methods. Currently it seems everything is geared towards "in memory" data mining. What happens if the pTree doesn't fit into memory (it seems to me, in the case of classification for example, that the training set not fitting into memory is the bigger issue)? I'm working up a model geared toward: if I turbocharged your existing data warehouse with a vertical solution, how much would I save? Any guesses? Viewing it holistically, only parts are quantifiable (how much is it worth to get an answer 10x faster? quantifiable in some cases). Has there been any work done on the effectiveness of pTree compression (how much will pTrees compress a large dataset)?

WP: That's a tough one, since we really don't have an implementation other than for uncompressed pTrees yet. I can't say too much definitively, since we haven't yet produced an implementation with any compression (leveled pTrees). Images: "urban images" may not compress much, since they are replete with "edges" (many adjacent bit changes). "Non-urban images" (e.g., Mother Nature as creator) have massive patches of uniform color with a low percentage of boundary or edge pixels. Therefore even raster orderings produce long strings of 0's or long strings of 1's (which compress well in multi-leveled pTrees). With Z-ordering or Hilbert ordering, compression increases further in "Mother Nature" images (and the same can be said of man-made rural spaces such as crops). Aside: if Z or Hilbert helps, why not use run-length or LZW compression? Because we need to be able to process (at each level) without decompressing. How do you AND two run-length-compressed strings of bits? Lots of decisions in the code, not just AND/OR of massive strings.

A tangential thought: a quick predictor of, say, a "terrorist training site" might be a sudden increase in bit-change frequency (as you proceed down the pTree sequence, there is a lack of compression all of a sudden), because such sites are edge-busy (almost urban) and are usually placed in (surrounded by) uniform terrain such as desert or forest, not urban terrain. As a decision support machine, not a decision maker, the pTree classifier would point to a closer look at that area of sudden high bit-change. And this pTree phenomenon could be quickly identified by looking through the pTree segment ToC for inode subtrees with inordinately many pointers to [non-compressed] leaf segments (assuming no pure0 or pure1 segments are stored, since they do not need to be). This would be yet another potential answer which could come directly from the schema, without even retrieving the pTree data itself. E.g., if the segment ToC gets suddenly longer (many more leaf segments to point to), that would signal an area for closer scrutiny. (A small sketch of this idea appears after this slide.)

The main pTree turbocharger is horizontal traversal across (pertinency-selective) vertical pTrees (AND and OR), with few comparisons or code decision points. In this age of nearly infinite storage (and nearly infinite non-rotating memory possibilities), maybe the problem goes away to an extent? For pTree classification a critical step is "attribute relevance" (e.g., in Netflix we applied "pruning" based on relevance = high correlation, from pre-computed correlation tables; in the 2006 KDDcup case, GA-based relevant-attribute selection was a central part of the win).
If [when] we have multi-level pTrees implemented (with various sets of rough pTrees pre-computed and available), we will be able to work down from the top level to get the predicted class, possibly without needing to see the lower levels of the pTrees (i.e., the leaf level), or at least be able to do significant attribute relevance analysis at high levels in the pTrees so that we only access leaves of a select few pTrees. Pre-computed attribute correlation tables help in determining relevance before accessing the training dataset. Note that correlation calculations are "two columns at a time" calculations and therefore lend themselves to very efficient calculation via pTrees (or any vertical arrangement). Even if all the above fail or are difficult (which may be somewhat the case for pixel classification of large images?), we can code so that training pTrees are brought in in batches (and prefetched while ANDing/ORing the previous batch). Since we don't have to pull in entire rows (as in the horizontal data case) to process relevant column feature values, we get a distinct advantage over "vertical processing of horizontal data". Note that indexes help the VPHD people, but indexes ARE "vertical [partial] replications" of the dataset. Or, said another way, pTrees ARE comprehensive indexes that contain all the information in the dataset and therefore eliminate the need for the dataset itself altogether. In many classifications, only the relevant training pTrees go into memory at a time. That often saves us from the problem you point to. Bottom line: needing data from disk in real time is usually a bottleneck, but pTrees are a serious palliative medicine to treat that pain.
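A purely illustrative sketch of the "sudden increase in bit-change frequency" idea from the slide above, scanning a pTree stride by stride and flagging strides with unusually many 0/1 transitions; the threshold and the function names are assumptions.

```python
# Sketch: flag strides of a pTree whose 0/1 transition count is unusually
# high. Edge-busy regions compress poorly and would show up in a segment
# table of contents as many non-pure leaf segments. Illustrative only.

def bit_changes(bits):
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def busy_strides(ptree_bits, stride, factor=4.0):
    """Return indexes of strides whose transition count exceeds
    `factor` times the average over all strides."""
    counts = [bit_changes(ptree_bits[i:i + stride])
              for i in range(0, len(ptree_bits), stride)]
    avg = sum(counts) / max(len(counts), 1)
    return [i for i, c in enumerate(counts) if avg > 0 and c > factor * avg]

# Example: a mostly uniform pTree with one edge-busy patch.
bits = [0] * 4000 + [i % 2 for i in range(1000)] + [1] * 3000
print(busy_strides(bits, stride=1000))   # -> [4]
```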

8. Discrete Sequential Data? I've been going back over the last NASA submission in preparation for the new one, and am trying to improve on areas where reviewers' comments indicated weakness. One is the work plan, e.g., how we are going to solve the problem. Do you remember last year we were discussing discrete sequential data and anomaly detection? We had discussed back then turning the sequences into vertical columns, but had concerns (rightly so) that we could be growing a huge number of columns, eliminating our scalability advantage. The issue is as follows: e.g., a set of 3 switches on a plane (flaps, landing gear, fuel pump). Assume that we have an example of a correct sequence:

        Flaps  Gear  Pump
Time 0    0     0     0
Time 1    1     0     0
Time 2    1     1     0
Time 3    1     1     0
Time 4    1     1     1

"Vertical" is only relevant for a "table centric" view (each row is an instance of the entity to be data mined). There are 2 entities, switches and times; classification can focus on either, or both (the data is an on/off setting for each pair of a switch instance and a time instance). Aren't discrete time sequences already vertical datasets? Yes (see above). 2nd observation: assume an actual sequence from the pilot:

        Flaps  Gear  Pump
Time 0    0     0     0
Time 1    1     1     0
Time 2    1     1     0
Time 3    1     1     0
Time 4    1     1     1

Call this S(i). Assume there are n of these to classify. Can you not effectively determine a few things rather simply with pTree math? 1. An anomalous point in time is one in which the anomaly pTree, created as (training set symbol 1 XOR data point symbol 1) OR (training set symbol 2 XOR data point symbol 2) OR (training set symbol 3 XOR data point symbol 3), has the value 1 at that point (and maybe the count of 1's measures the extent of anomaly?). The RC (root count) of the anomaly pTree is a count of the relative "anomalousness" of the sequence with respect to the training sequence. If RC = 0 there is no difference. 2. Further, you can find the exact points in time where anomalies exist from the indexes of the 1's in the anomaly pTree. Wouldn't the point in time of the anomaly be indexed by the offset of the XOR 1-bit? (Greg has a function to return the index of the next 1-bit.) We have a temporal normalization problem, but so does every such method; this should work, and be relatively fast. Current art is O(n²). This should work even if the switches are more than binary (multibit). My response: Right. What does "temporal normalization" mean? In simulation-science terms, the choices are fixed-increment time advance and next-event time advance; my guess is next event. Try data mining both ways, or combine the ways as we did in Netflix. So we would create (a one-time cost) switch pTrees (as you have done, thinking of the data as rows of time instances with switch-value columns) and also time pTrees (rotating the table so that each row is a switch entity and each column is a time instance; each successive column is the time of the "next switch change"). VSSD outlier = a point with no neighbors (0NNC), or use the average pairwise distance in the set (apds) and define "outlier" as 10*apds, or a point whose VSSD with respect to the set of other points is large. With respect to this point: FAUST should work well also, I agree.

From Algorithms to Analytics: an integrated Ag Business / CS initiative to explore computation-facilitated commodity trading research. The strategy will be to highlight the potential to integrate and focus NDSU's leadership in vertical algorithm development and implementation on problems of interest to agricultural commodity trading. Highlight the opportunity the CTR provides to take CS into practical application by students and researchers (the CTR to serve as a crucible to focus multi-disciplinary research and education efforts across diverse university strength areas). Get copies of / pointers to research on stochastic-based risk modeling.

9. E.g., a set of 3 switches on a plane (flaps, landing gear, fuel pump). Assume that we have an example of a correct sequence:

        Flaps  Gear  Pump
Time 0    0     0     0
Time 1    1     0     0
Time 2    1     1     0
Time 3    1     1     0
Time 4    1     1     1

From: M Silverman Sent: 9/2 To: W Perrizo Subject: need your guidance.... remember discrete sequential data? I've been going back over the last NASA submission in preparation for the new one, and am trying to improve upon areas where reviewers' comments indicated weakness. One is the work plan, e.g., how we are going to solve the problem. Do you remember last year we were discussing discrete sequential data and anomaly detection? We had discussed back then turning the sequences into vertical columns, but had concerns (rightly so) that we could be growing a huge number of columns, eliminating our scalability advantage. The issue is as above.

My response: Yes, it looks good. This is the way I remember thinking of it. The word "vertical" is only relevant when one takes a "table centric" (horizontal, row-centric) view: each row is an instance of the entity to be studied or data mined. As in the Netflix case, where we had two entities, users and movies, here we have switches and times. We can focus the data mining (classification) on either, and typically we do both: data mining with respect to one entity at a time, but doing it both ways, rotating the table (swapping rows and columns) to get the other vertical view and producing a whole new set of pTrees. Recall that in Netflix we had movie pTrees and user pTrees; here we would have switch pTrees and time pTrees, coming from the same data, which is an on/off setting for each pair of a switch instance and a time instance.
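A tiny sketch of the two vertical views described above (switch pTrees vs. the rotated time pTrees), assuming the data arrives as rows of time instances with one on/off column per switch; the variable names are illustrative.

```python
# Sketch: the same on/off data viewed two ways. Switch pTrees treat each
# switch as a bit column over time instances; time pTrees are the rotated
# (transposed) view, one bit column per time instance. Illustrative only.

rows = [            # Flaps, Gear, Pump per time instance (correct sequence)
    (0, 0, 0),      # Time 0
    (1, 0, 0),      # Time 1
    (1, 1, 0),      # Time 2
    (1, 1, 0),      # Time 3
    (1, 1, 1),      # Time 4
]

switch_ptrees = {name: [r[i] for r in rows]
                 for i, name in enumerate(["Flaps", "Gear", "Pump"])}
time_ptrees = {f"Time {t}": list(r) for t, r in enumerate(rows)}

print(switch_ptrees["Gear"])   # -> [0, 0, 1, 1, 1]
print(time_ptrees["Time 2"])   # -> [1, 1, 0]
```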

10. Mark again: 1st observation: Aren't discrete time sequences already vertical datasets? Me: Yes (see comments above). 2nd observation: assuming for the moment that time intervals are normalized, and you have an actual sequence from the pilot:

        Flaps  Gear  Pump
Time 0    0     0     0
Time 1    1     1     0
Time 2    1     1     0
Time 3    1     1     0
Time 4    1     1     1

Call this S(i). Assume you have n of these to classify. Can you not effectively determine a few things rather simply with pTree math? 1. An anomalous point in time is one in which the anomaly pTree, created as follows: (training set symbol 1 XOR data point symbol 1) OR (training set symbol 2 XOR data point symbol 2) OR (training set symbol 3 XOR data point symbol 3), has the value 1 at that point (and maybe the count of 1's measures the extent of the anomaly?). The RC (root count) of the anomaly pTree is a count of the relative "anomalousness" of the sequence with respect to the training sequence. If this RC value is 0, there is no difference.
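A minimal sketch of the anomaly-pTree construction described above, using the correct and observed sequences from these slides; the helper names are illustrative, not the ptreelib API.

```python
# Sketch: anomaly pTree = OR over switches of (training column XOR observed
# column). Its root count measures how anomalous the observed sequence is,
# and the offsets of its 1-bits give the anomalous time instances.

def xor(p, q):
    return [a ^ b for a, b in zip(p, q)]

def p_or(*cols):
    return [int(any(bits)) for bits in zip(*cols)]

def root_count(pt):
    return sum(pt)

# Per-switch bit vectors over Time 0..4, taken from the slides above.
train = {"Flaps": [0, 1, 1, 1, 1], "Gear": [0, 0, 1, 1, 1], "Pump": [0, 0, 0, 0, 1]}
seen  = {"Flaps": [0, 1, 1, 1, 1], "Gear": [0, 1, 1, 1, 1], "Pump": [0, 0, 0, 0, 1]}

anomaly = p_or(*(xor(train[s], seen[s]) for s in train))
print(anomaly)                                   # -> [0, 1, 0, 0, 0]
print(root_count(anomaly))                       # extent of anomaly: 1
print([t for t, b in enumerate(anomaly) if b])   # anomalous time instances: [1]
```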

11. 2. Further, you can find the exact points in time where the anomalies exist from the indexes of the 1's in the anomaly pTree. Wouldn't the point in time of the anomaly be indexed by the offset of the XOR 1-bit? (Greg has a function to return the index of the next 1-bit.) Unless I'm missing something (we have a temporal normalization problem perhaps, but so does every such method), this seems like it should work, and be relatively fast. Current art is O(n²). This should work even if we have more than binary switches (multibit). Am I missing something?

My response: You're right, Mark. Comments: what does "temporal normalization" mean? I would understand it in simulation-science terms, namely that we have essentially two choices: fixed-increment time advance and next-event time advance. My guess is that the data comes to us as next-event; that is, a new time entity instance (time row) is produced whenever one or more of the switches changes, regardless of whether that change "event" occurs after an evenly spaced time increment (which it almost certainly would not). I think we would want to try data mining both ways, and in fact combine the two ways as we did in Netflix. So we would create (a one-time cost) switch pTrees (as you have done, thinking of the data as rows of time instances with switch-value columns) and also time pTrees (rotating the table so that each row is a switch entity and each column is a time instance; each successive column is probably the time of the "next switch change").
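A sketch of a "return the index of the next 1-bit" helper like the one mentioned above, assuming the pTree is packed into 64-bit words with position 0 at the most significant bit of word 0; the representation and the function name are assumptions, not Greg's actual routine.

```python
# Sketch: find the position of the next 1-bit at or after `start` in a pTree
# stored as a list of 64-bit words (bit 63 of word 0 = position 0).

def next_one(words, start):
    w, b = divmod(start, 64)
    for i in range(w, len(words)):
        word = words[i]
        if i == w:
            word &= (1 << (64 - b)) - 1   # clear bit positions before `start`
        if word:
            # the highest set bit of the masked word is the lowest position
            return i * 64 + (64 - word.bit_length())
    return -1                             # no 1-bit remaining

words = [0x0000_0000_0000_0000, 0x0010_0000_0000_0000]   # single 1-bit set
print(next_one(words, 0))    # -> 75
```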

12. From: Mark Silverman Sent: Wednesday, September 07 To: Perrizo, William Subject: VSSD: I am going through methods now to finish the NASA SBIR proposal, and was looking through some other work on outlier detection. I ran across the VSSD material. Was any work ever done in applying VSSD to outlier detection?

From: Perrizo, William Sent: Wednesday, September 07 To: 'Mark Silverman' Subject: RE: VSSD: The VSSD classifier method might be used for outlier detection, in the sense that an outlier might be defined as a sample point with no neighbors (so that outliers would be determined by applying 0NNC, or zero nearest neighbor classification). One might calculate the average pairwise distance separation in the set, apds, then decide upon a definition of "outlier" such as 10*apds, and use the VSSD classifier. But more pertinent to VSSD, one might define a point to be an outlier if its VSSD with respect to the set of other points is large. I actually like that because it is not so sensitive to a few other points being close while most points are far away (e.g., if a fumble-fingered pilot hits on/off/on by mistake several times, it still shows up as an anomaly even though it has several neighbors).

From: Perrizo, William Sent: Wednesday, September 07 To: 'Mark Silverman' Subject: RE: VSSD: With respect to this point: FAUST should work well also, I agree. One could analyze the training set for potential outlier anomaly classes (tiny classes that, when scrutinized, are determined to be "fumble fingering"). Then one could run FAUST on the training set with respect to that anomaly class to determine a good cutpoint with respect to each other good class. Then do the multi-class FAUST AND operation and voilà! Or, for real speed, just lump all good classes into one big multi-class and calculate a cutpoint between that multi-class and the anomaly class. Then apply the one inequality pTree function and voilà! Lumping all good classes into one multiclass might result in there not being a good cutpoint (no good oblique cut-plane). The lumping approach could be tried first (very quickly) and, if it fails, go with the ANDing of binary half-spaces. So if there is a hyperplane separating the anomaly class from the rest of the points, the lumping method will find it. If not, one looks for an "intersection of hyperplanes" that separates. Yes, then 2 fumble-fingered pilots would show up as anomalies even though they fumble the same way. This would work, I think, pretty generally, not only for discrete time-sequential data but for all data types.
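A small sketch of the average-pairwise-distance outlier rule mentioned above; the nearest-neighbor reading of the "10*apds" threshold and the Euclidean distance are assumptions, and this is not the actual VSSD computation.

```python
# Sketch: declare a point an outlier if its nearest neighbor is farther away
# than a multiple of the set's average pairwise distance (apds), loosely
# following the "outlier = 10 * apds" rule of thumb in the email above.

import random
from itertools import combinations

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def apds(points):
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def outliers(points, factor=10.0):
    """Points whose nearest neighbor is farther than factor * apds."""
    threshold = factor * apds(points)
    return [p for p in points
            if min(dist(p, q) for q in points if q is not p) > threshold]

random.seed(0)
cluster = [(random.random(), random.random()) for _ in range(100)]
data = cluster + [(50.0, 50.0)]          # one far-away point
print(outliers(data))                    # -> [(50.0, 50.0)]
```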
