
Classification Models and Neural Networks

This appendix provides an overview of classification models, including a Bayesian classifier and neural networks. It explains the concepts of training and distance functions. Examples and performance metrics are also discussed.

Presentation Transcript


  1. Section 10, Appendix_10_Datamining_Classification. (The following "eager" model is based on conditional probabilities; prediction is done by taking the highest conditionally probable class.) A Bayesian classifier is a statistical classifier based on the following result, known as Bayes' theorem: Let X be a data sample whose class is unknown, and let H be the hypothesis that X belongs to a particular class. P(H|X) is the conditional probability of H given X, and P(H) is the prior probability of H. Then P(H|X) = P(X|H)P(H) / P(X).
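
As an illustration (not part of the original slides), a minimal naive-Bayes-style sketch that estimates P(H) and P(X|H) from counts and predicts the highest conditionally probable class; the attribute-independence simplification and the toy data are our own assumptions:

```python
from collections import Counter, defaultdict

# Minimal sketch: estimate P(H) and P(X|H) from labeled samples (assuming
# conditional independence of attributes, a naive-Bayes simplification),
# then predict the class H that maximizes P(H|X), proportional to P(X|H) P(H).
def train(samples, labels):
    priors = Counter(labels)                       # counts for P(H)
    cond = defaultdict(Counter)                    # counts for P(attr_i = value | H)
    for x, h in zip(samples, labels):
        for i, v in enumerate(x):
            cond[h][(i, v)] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n):
    best, best_p = None, -1.0
    for h, ch in priors.items():
        p = ch / n                                 # P(H)
        for i, v in enumerate(x):
            p *= (cond[h][(i, v)] + 1) / (ch + 2)  # Laplace-smoothed P(x_i | H)
        if p > best_p:
            best, best_p = h, p
    return best

# Hypothetical toy data: two binary attributes, classes 'a' and 'b'.
samples = [(0, 0), (0, 1), (1, 1), (1, 0)]
labels  = ['a',   'a',    'b',    'b']
print(predict((0, 0), *train(samples, labels)))    # -> 'a'
```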

  2. Neural Network Classification • A neural network is trained to make the prediction. • Advantages • prediction accuracy is generally high • it is generally robust (works when training examples contain errors) • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes • it provides fast classification of unclassified samples • Criticisms • it is difficult to understand the learned function (it involves complex and almost magic weight adjustments) • it is difficult to incorporate domain knowledge • long training time (for large training sets, it is prohibitive!)

  3. A Neuron. (Diagram: input vector x = (x0..xn), weight vector w = (w0..wn), weighted sum, bias μk, activation function f, output y.) The input feature vector x = (x0..xn) is mapped to the output y by taking the scalar product of x with the weight vector w, subtracting the bias μk, and applying a nonlinear activation (damping) function f: y = f(Σi wi xi − μk).
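
A minimal sketch of this neuron, assuming a sigmoid for the activation function f; the inputs, weights and bias below are made-up illustration values:

```python
import math

# Sketch of the neuron on this slide: output y = f(sum_i w_i*x_i - mu_k),
# where f is a nonlinear activation ("damping") function and mu_k is the bias.
def neuron(x, w, mu_k, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return f(weighted_sum)

print(neuron(x=[0.5, 0.2, 0.9], w=[0.4, -0.3, 0.8], mu_k=0.1))  # ~0.68
```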

  4. Neural Network Training • The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classify correctly (usually using a time-consuming "back propagation" procedure which is based, ultimately, on Newton's method; see the literature under Other materials - 10datamining.html for examples and alternate training techniques). • Steps (sketched in code below) • Initialize the weights with random values • Feed the input tuples into the network • For each unit • Compute the net input to the unit as a linear combination of all the inputs to the unit • Compute the output value using the activation function • Compute the error • Update the weights and the bias
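
A minimal sketch of those steps for a single sigmoid unit (the delta rule); full back-propagation repeats the same pattern layer by layer. The learning rate, epoch count and toy data are made-up illustration values:

```python
import math, random

def sigmoid(s): return 1.0 / (1.0 + math.exp(-s))

# One-unit version of the training steps listed above.
def train_unit(tuples, targets, lr=0.5, epochs=2000):
    n = len(tuples[0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]   # 1. random initial weights
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in zip(tuples, targets):               # 2. feed the input tuples
            net = sum(wi * xi for wi, xi in zip(w, x)) + bias   # 3. net input
            out = sigmoid(net)                          # 4. output via activation
            err = (t - out) * out * (1 - out)           # 5. error (sigmoid derivative)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]    # 6. update weights
            bias += lr * err                            #    ... and the bias
    return w, bias

w, b = train_unit([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 0, 1])  # learn AND
print(round(sigmoid(sum(wi * xi for wi, xi in zip(w, (1, 1))) + b)))  # -> 1
```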

  5. Neural Multi-Layer Perceptron. (Diagram: input vector xi feeding the input nodes, weights wij into the hidden nodes, then into the output nodes producing the output vector.)

  6. These next slides use the concept of "distance". You may feel you don't need this much detail regarding distance; if so, skip what you feel you don't need. For Nearest Neighbor Classification a distance is needed (to make sense of "nearest"; other classifiers also use distance). A distance is a function d applied to two n-dimensional points X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) such that: d(X, Y) is positive definite: if X ≠ Y then d(X, Y) > 0, and if X = Y then d(X, Y) = 0; d(X, Y) is symmetric: d(X, Y) = d(Y, X); d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z). Examples include the Minkowski or Lp distance, the Manhattan distance (p = 1), the Euclidean distance (p = 2), the Max distance (p = ∞), the Canberra distance, the squared chi-squared distance, and the squared chord distance.
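
A small sketch of the Minkowski (Lp) family just listed; the sample points reproduce the 2-D example on the next slide, X = (2,1), Y = (6,4):

```python
# Minkowski (L_p) distance and its common special cases.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):                  # p = 1
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):                  # p = 2
    return minkowski(x, y, 2)

def max_dist(x, y):                   # p -> infinity
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)
print(manhattan(X, Y), euclidean(X, Y), max_dist(X, Y))   # 7, 5.0, 4
```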

  7. Two-dimensional distances, for X = (2,1), Y = (6,4) and the corner point Z = (6,1): Manhattan, d1(X,Y) = |XZ| + |ZY| = 4 + 3 = 7; Euclidean, d2(X,Y) = |XY| = 5; Max, d∞(X,Y) = max(|XZ|, |ZY|) = |XZ| = 4. In fact, d1 ≥ d2 ≥ d∞ always, and this holds for any positive integer p in between. A (disk) neighborhood of a point T is a set of points S such that X ∈ S iff d(T, X) ≤ r; if X is a point on the boundary, d(T, X) = r. (Diagram: the r-disks of diameter 2r under the Manhattan, Euclidean and Max distances.)

  8. These slides give great detail on the relative performance of kNN and CkNN, on the use of other distance functions, some examples, etc. Experiments were run on two sets of (aerial) remotely sensed images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: Red, Green, and Blue reflectance values, Soil Moisture, Nitrate, and Yield (the class label). Band values range from 0 to 255 (8 bits), and 8 classes or levels of yield values are considered.

  9. Performance – Accuracy, 1997 dataset (3 horizontal methods in the middle; 3 vertical methods: the 2 most accurate and the least accurate). (Chart: Accuracy (%), roughly 40 to 80, vs. Training Set Size (no. of pixels), 256 to 262144; methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.)

  10. Performance – Accuracy, 1998 dataset (3 horizontal methods in the middle; 3 vertical methods: the 2 most accurate and the least accurate). (Chart: Accuracy (%), roughly 20 to 65, vs. Training Set Size (no. of pixels), 256 to 262144; same six methods as above.)

  11. Performance – Speed, 1997 dataset (3 horizontal methods in the middle; 3 vertical methods: the 2 fastest (the same 2) and the slowest). Hint: NEVER use a log scale to show a WIN!!! (Chart: both axes logarithmic; per-sample classification time in seconds, 0.0001 to 1, vs. Training Set Size (no. of pixels), 256 to 262144; methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.)

  12. Performance – Speed, 1998 dataset (3 horizontal methods in the middle; 3 vertical methods: the 2 fastest (the same 2) and the slowest). A win-win situation!! (almost never happens): P-tree CkNN and CkNN-H are more accurate and much faster. kNN-H is not recommended because it is slower and less accurate (it doesn't use closed neighbor sets and it requires another step to get rid of ties; why do it?). Horizontal kNNs are not recommended because they are less accurate and slower! (Chart: both axes logarithmic; per-sample classification time in seconds, 0.0001 to 1, vs. Training Set Size (no. of pixels), 256 to 262144; same six methods as above.)

  13. Association for Computing Machinery KDD-Cup-02: NDSU Team

  14. Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity). The unclassified sample is (0,0,0,0,0,0) and the weight of each relevant attribute is its subscript; black is the complemented attribute, red is uncomplemented. (The slide shows the training table: keys t12 … t75, class bit C, and attribute bit columns a5, a6, a11, a12, a13, a14.) The vote is even simpler than the "equal" vote case: every tuple votes in accordance with its weighted similarity (if the ai value differs from that of (0,0,0,0,0,0), then the vote contribution is the subscript of that attribute, else zero). Thus we can just add up the root counts of each relevant attribute, weighted by its subscript. Class C=1 root counts: rc(PC^Pa5)=4, rc(PC^Pa6)=8, rc(PC^Pa11)=7, rc(PC^Pa12)=4, rc(PC^Pa13)=4, rc(PC^Pa14)=7, so the C=1 vote is 343 = 4*5 + 8*6 + 7*11 + 4*12 + 4*13 + 7*14. Similarly, the C=0 vote is 258 = 6*5 + 7*6 + 5*11 + 3*12 + 3*13 + 4*14 (checked in the sketch below).
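
The vote arithmetic on this slide can be reproduced directly from the listed root counts (the C=0 counts are read off the vote expression):

```python
# Weighted votes: each relevant attribute votes with weight equal to its
# subscript, scaled by the class-restricted root count.
c1_root_counts = {5: 4, 6: 8, 11: 7, 12: 4, 13: 4, 14: 7}   # rc(P_C ^ P_a_i), C=1
c0_root_counts = {5: 6, 6: 7, 11: 5, 12: 3, 13: 3, 14: 4}   # counts for C=0

vote = lambda rc: sum(count * weight for weight, count in rc.items())
print(vote(c1_root_counts), vote(c0_root_counts))           # 343 258
```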

  15. We note that the Closed Manhattan NN Classifier uses an influence function which is pyramidal. It would be much better to use a Gaussian influence function, but it is much harder to implement. One generalization of this method to the case of integer values rather than Boolean would be to weight each bit position in a more Gaussian shape (i.e., weight the bit positions b, b-1, ..., 0 (high order to low order) using Gaussian weights). By so doing, at least within each attribute, influences are Gaussian. We can call this method Closed Manhattan Gaussian NN Classification. Testing the performance of either CM NNC or CMG NNC would make a great paper for this course (thesis?). Improving it in some way would make an even better paper (thesis).

  16. GRIDs. (Diagram: a 2.lo grid and a 1.hi grid over coordinates with bitwidths 3 and 2; want square cells or a square pattern?) Given f: R → Y and a partition S = {Sk} of Y, {f⁻¹(Sk)} is the f-grid of R (grid cells = contours). If Y = Reals, the j.lo f-grid is produced by agglomerating over the j low-order bits of Y, for each fixed (b-j) high-order bit pattern; the j lo bits walk [isobars of] cells, and the b-j hi bits identify cells (lo = extension / hi = intension). Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S_(b-1,...,b-j)}, where S_(b-1,...,b-j) = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1], a partition of Y = Reals. If F = {fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (intersection of partitions). The canonical j.lo grid is the j.lo π-grid, where π = {πd: R → R[Ad] | πd = dth coordinate projection}. j.hi gridding is similar (the b-j lo bits walk cell contents / the j hi bits identify cells). The diagram assumes the horizontal and vertical dimensions have bitwidths 3 and 2 respectively.

  17. j.lo and j.hi gridding continued. If horizontal_bitwidth = vertical_bitwidth = b, then the j.lo grid = the (b-j).hi grid; e.g., for hb = vb = b = 3 and j = 2, the 2.lo grid equals the 1.hi grid. (Diagram: the two identical grids over 3-bit coordinates 000 to 111.)
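
A minimal sketch of cell identification in a j.lo grid, using the b = 3, j = 2 example above (the function name is ours, not from the slides):

```python
# In a j.lo grid the b-j high-order bits identify the cell and the j low-order
# bits walk within it, so the 2.lo grid of a 3-bit dimension is its 1.hi grid.
def cell_id(value, b, j):
    return value >> j            # keep the b-j high-order bits

b, j = 3, 2
for v in range(2 ** b):
    print(format(v, f'0{b}b'), '-> cell', cell_id(v, b, j))
# 000..011 fall in cell 0 and 100..111 fall in cell 1 (two cells = the 1.hi grid)
```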

  18. A distance d generates a similarity in many ways, e.g., s(x,y) = 1/(1 + d(x,y)) (or, if the relationship varies by location, s(x,y) = α(x,y)/(1 + d(x,y))); s(x,y) = a·e^(−b·d(x,y)²); or the truncated form s(x,y) = a·e^(−b·d(x,y)²) − a·e^(−b·ε²) for d(x,y) ≤ ε and s(x,y) = 0 for d(x,y) > ε. (Vote weighting IS a similarity assignment, so the similarity-to-distance graph IS a vote weighting for classification.) (Diagram: the graphs of these similarity functions against d, for C = {a}, marking r1, r2 and a − a·e^(−b·ε²).) Similarity NearNeighborSets (SNNS): Given a similarity s: R×R → PartiallyOrderedSet (e.g., Reals) (i.e., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) for all x,y in R) and given any C ⊆ R, the ordinal disks, skins and rings are: disk(C,k) ⊇ C with |disk(C,k) − C| = k and s(x,C) ≥ s(y,C) for all x in disk(C,k) and y not in disk(C,k); skin(C,k) = disk(C,k) − C (the skin comes from "s k immediate neighbors" and is a kNNS of C); ring(C,k) = cskin(C,k) − cskin(C,k-1); closeddisk(C,k) ≡ the union of all disk(C,k); closedskin(C,k) ≡ the union of all skin(C,k). The cardinal disks, skins and rings are (PartiallyOrderedSet = Reals): disk(C,r) ≡ {x in R | s(x,C) ≥ r}, also = the functional contour f⁻¹([r, ∞)) where f(x) = s_C(x) = s(x,C); skin(C,r) ≡ disk(C,r) − C; ring(C,r2,r1) ≡ disk(C,r2) − disk(C,r1) = skin(C,r2) − skin(C,r1), also = the functional contour s_C⁻¹((r1, r2]). Note: closeddisk(C,r) is redundant, since all r-disks are closed and closeddisk(C,k) = disk(C, s(C,y)) where y = the kth NN of C. L∞ skins: skin(a,k) = {x | for every dimension d, xd is one of the k-NNs of ad} (a local normalizer?).
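
A small sketch of the distance-to-similarity conversions listed above; the parameters a, b and eps (ε) are made-up tuning values:

```python
import math

def sim_reciprocal(d):                 # s = 1 / (1 + d)
    return 1.0 / (1.0 + d)

def sim_gaussian(d, a=1.0, b=0.5):     # s = a * e^(-b d^2)
    return a * math.exp(-b * d * d)

def sim_truncated(d, a=1.0, b=0.5, eps=3.0):
    # drops smoothly to 0 at d = eps and stays 0 beyond it
    if d > eps:
        return 0.0
    return a * math.exp(-b * d * d) - a * math.exp(-b * eps * eps)

for d in (0.0, 1.0, 3.0, 5.0):
    print(d, round(sim_reciprocal(d), 3), round(sim_truncated(d), 3))
```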

  19. Partition tree: R / … \ C1 … Cn; each Ci / … \ Ci,1 … Ci,ni; and so on. P-trees are vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but the network multicast overhead is huge. Use active networking? Clusters of Playstations? ... Formally, P-trees can be defined as any of the following. Partition-tree: a tree of nested partitions (a partition P(R) = {C1..Cn}; each component is partitioned by P(Ci) = {Ci,1..Ci,ni}, i = 1..n; each of those components is partitioned by P(Ci,j) = {Ci,j,1..Ci,j,nij}; ...). Predicate-tree: for a predicate on the leaf nodes of a partition-tree (which also induces predicates on interior nodes using quantifiers); Predicate-tree nodes can be truth values (Boolean P-tree), quantified existentially (1 or a threshold %) or universally, or they can count the number of true leaf children of that component (Count P-tree). Purity-tree: a universally quantified Boolean Predicate-tree (e.g., if the predicate is <=1>, a Pure1-tree or P1-tree); a node holds a 1-bit iff the corresponding component is pure1 (universally quantified). There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1-trees. All P-trees shown so far were 1-dimensional (recursively partitioning by halving bit files), but they can be 2-D (recursively quartering, e.g., for 2-D images), 3-D (recursively eighth-ing), ..., or based on purity runs or LZW-runs or ... Further observations about P-trees: Partition-trees have set nodes; Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree); Purity-trees, being universally quantified Boolean Predicate-trees, have Boolean nodes (since the count is always the "full" count of leaves, expressing Purity-trees as count-trees is redundant). A Partition-tree can be sliced at a level if each partition is labeled with the same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.

  20. The partitions used to create P-trees can come from functional contours (note: there is a natural duality between partitions and functions, namely a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain). In functional-contour terms (i.e., f⁻¹(S) where f: R(A1..An) → Y and S ⊆ Y), the uncompressed P-tree or uncompressed Predicate-tree 0Pf,S is the bitmap of the set-containment predicate: 0Pf,S(x) = true iff x ∈ f⁻¹(S). Equivalently, 0Pf,S is the existential R*-bit map of the predicate R*.Af ∈ S. The compressed P-tree sPf,S is the compression of 0Pf,S with equi-width leaf size s, as follows (see the sketch below): 1. Choose a walk of R (this converts 0Pf,S from a bit map to a bit vector). 2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short). 3. Eliminate and mask to 0 all pure-zero segments (call the mask the NotPure0 Mask or EM). 4. Eliminate and mask to 1 all pure-one segments (call the mask the Pure1 Mask or UM). (EM = existential aggregation, UM = universal aggregation.) Compressing each leaf of sPf,S with leafsize s2 gives s1,s2Pf,S; recursively, s1,s2,s3Pf,S, s1,s2,s3,s4Pf,S, ... (this builds an EM and a UM tree). BASIC P-trees: If Ai is Real or Binary and fi,j(x) ≡ the jth bit of xi, then {(*)Pfi,j,{1} ≡ (*)Pi,j}, j = b..0, are the basic (*)P-trees of Ai, * = s1..sk. If Ai is Categorical and fi,a(x) = 1 if xi = a, else 0, then {(*)Pfi,a,{1} ≡ (*)Pi,a}, a ∈ R[Ai], are the basic (*)P-trees of Ai. Notes: The UM masks (e.g., of 2k,...,20Pi,j, with k = ⌈log2|R|⌉) form a (binary) tree. Whenever the EM bit indicates a pure0 segment, that entire subtree can be eliminated; then a 0-node at level k (lowest level = level 0) with no subtree indicates a 2^k-run of zeros. In this construction the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.
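
A minimal sketch of one compression level (steps 1 to 4 above) on a made-up predicate bit vector; recursing on the masks with further leaf sizes would give the s1,s2,... levels:

```python
# One compression level of sP_f,S: partition the bit vector into segments of
# width s, record an EM (NotPure0) mask and a UM (Pure1) mask, and keep only
# the mixed segments.
def compress(bits, s):
    segments = [bits[i:i + s] for i in range(0, len(bits), s)]  # step 2
    em = [int(any(seg)) for seg in segments]     # step 3: 0 marks a pure-zero segment
    um = [int(all(seg)) for seg in segments]     # step 4: 1 marks a pure-one segment
    mixed = [seg for seg in segments if any(seg) and not all(seg)]
    return em, um, mixed

bits = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]      # made-up predicate bit vector
print(compress(bits, 4))
# ([0, 1, 1], [0, 1, 0], [[0, 1, 1, 0]])
```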

  21. Example functionals: Total Variation (TV) functionals. TV(a) = Σ_{x∈R} (x−a)∘(x−a). Using d as an index variable over the dimensions and i, j, k as bit-slice indexes:
TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² − 2 a_d x_d + a_d²)
= Σ_x Σ_d (Σ_i 2^i x_d,i)(Σ_j 2^j x_d,j) − 2 Σ_x Σ_d a_d (Σ_k 2^k x_d,k) + |R| |a|²
= Σ_{x,d,i,j} 2^(i+j) x_d,i x_d,j − 2 Σ_{x,d,k} 2^k a_d x_d,k + |R| Σ_d a_d².
Since Σ_x x_d = |R| μ_d, this can be written in P-tree root-count form as
TV(a) = Σ_{i,j,d} 2^(i+j) |P_d,i ∧ P_d,j| − Σ_k 2^(k+1) Σ_d a_d |P_d,k| + |R| |a|²   (equation 7)
= Σ_{x,d,i,j} 2^(i+j) x_d,i x_d,j + |R| (−2 Σ_d a_d μ_d + Σ_d a_d²).
Note that the first term does not depend upon a. Thus the derived attribute TV(a) − TV(μ) (which eliminates the first term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, HDTV(a).
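
Equation 7 can be checked numerically; the sketch below compares the direct definition of TV(a) with the root-count form on made-up random 2-D, 4-bit data (the point a, the set R and the dimensions are illustrative assumptions):

```python
import random

# Check: TV(a) = sum_x |x - a|^2 equals
#   sum_{i,j,d} 2^(i+j) |P_d,i ^ P_d,j| - sum_k 2^(k+1) sum_d a_d |P_d,k| + |R| |a|^2
random.seed(0)
n_dims, n_bits = 2, 4
R = [tuple(random.randrange(2 ** n_bits) for _ in range(n_dims)) for _ in range(50)]
a = tuple(random.randrange(2 ** n_bits) for _ in range(n_dims))

def bit(v, k):                       # k-th bit slice of value v
    return (v >> k) & 1

tv_direct = sum(sum((x[d] - a[d]) ** 2 for d in range(n_dims)) for x in R)

root = lambda d, i, j: sum(bit(x[d], i) & bit(x[d], j) for x in R)   # |P_d,i ^ P_d,j|
rc1  = lambda d, k:    sum(bit(x[d], k) for x in R)                  # |P_d,k|

tv_ptree = (sum(2 ** (i + j) * root(d, i, j)
                for d in range(n_dims) for i in range(n_bits) for j in range(n_bits))
            - sum(2 ** (k + 1) * a[d] * rc1(d, k)
                  for d in range(n_dims) for k in range(n_bits))
            + len(R) * sum(ad * ad for ad in a))

print(tv_direct, tv_ptree, tv_direct == tv_ptree)   # the two forms agree
```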

  22. From equation 7, TV(a) = Σ_{x,d,i,j} 2^(i+j) x_d,i x_d,j + |R| (−2 Σ_d a_d μ_d + Σ_d a_d²). The Normalized Total Variation is NTV(a) ≡ TV(a) − TV(μ) = |R| (−2 Σ_d (a_d μ_d − μ_d μ_d) + Σ_d (a_d a_d − μ_d μ_d)) = |R| (Σ_d a_d² − 2 Σ_d a_d μ_d + Σ_d μ_d²) = |R| |a − μ|². Thus there is a simpler function which gives us circular contours, the Log Normal TV function: LNTV(a) = ln(NTV(a)) = ln(TV(a) − TV(μ)) = ln|R| + ln|a − μ|². The value of LNTV(a) depends only on the length of a − μ, so the isobars are hyper-circles centered at μ. The graph of LNTV is a log-shaped hyper-funnel. For an ε-contour ring (radius ε about a), go inward and outward along a − μ by ε to the points: inner point b = μ + (1 − ε/|a−μ|)(a−μ) and outer point c = μ + (1 + ε/|a−μ|)(a−μ). Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval, and use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a). (Diagram: g(a) = LNTV(x), with g(b) and g(c) marked, and the ε-contour of radius ε about a in the x1-x2 plane.)

  23. As pre-processing, calculate basic P-trees for the LNTV derived attribute (or another hyper-circular contour derived attribute). To classify a: 1. Calculate b and c (they depend on a and ε). 2. Form a mask P-tree for the training points with LNTV values in [LNTV(b), LNTV(c)]. 3. Use that P-tree to prune out the candidate NNS. 4. If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function G(x) = Gauss(|x−a|) − Gauss(ε), where Gauss(r) = (1/(std·√(2π))) e^(−(r−mean)²/(2·var)) and std, mean, var are taken with respect to the set of distances from a of the voters, i.e., {r = |x−a| : x a voter}; else prune further using a dimension projection, i.e., use the circumscribing Ad-contour (note: Ad is not a derived attribute at all, but just Ad, so we already have its basic P-trees), if the LNTV circumscribing contour of a is still too populous. We can also note that LNTV can be further simplified (retaining the same contours) using h(a) = |a − μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., h_b(a) = |a − b|. (Diagram: LNTV(x) with LNTV(b) and LNTV(c) marked, the ε-contour of radius ε about a, and the contour of the dimension projection f(a) = a1 at b and c in the x1-x2 plane.)

  24. Graphs of functionals with hyper-circular contours: LNTV, h(a) = |a − μ|, TV − TV(μ), h_b(a) = |a − b|, and TV(x15) − TV(μ), where TV(μ) = TV(x33). (Two surface plots over the X-Y plane, marking μ, b, TV(x15) and TV(μ).)

  25. COS(a) o a ad ad = (1/|a|)d=1..n(xxd) = |R|/|a|d=1..n d = ( |R|/|a| ) ad = |R|/|a|d=1..n((xxd)/|R|)  COS(a)  a COSb(a)?  b a Angular Variation functionals: e.g., AV(a) ( 1/|a| ) xR xoa d is an index over the dimensions, = (1/|a|)xRd=1..nxdad = (1/|a|)d(xxdad) factor out ad COS(a) AV(a)/(|||R|) = oa/(|||a|) = cos(a) COS (and AV) has hyper-conic isobars center on  COS and AV have -contour(a) = the space between two hyper-cones center on  which just circumscribes the Euclidean -hyperdisk at a. Intersection (in pink)with LNTV -contour. Graphs of functionals with hyper-conic contours: E.g., COSb(a) for any vector, b

  26. Adding up the Gaussian-weighted votes for class c. With d an index over the dimensions and i, j, k bit-slice indexes:
f(a)_x = (x−a)∘(x−a) = Σ_{d=1..n} (x_d² − 2 a_d x_d + a_d²)
= Σ_d (Σ_i 2^i x_d,i)(Σ_j 2^j x_d,j) − 2 Σ_d a_d (Σ_k 2^k x_d,k) + |a|²
= Σ_{d,i,j} 2^(i+j) x_d,i x_d,j − Σ_{d,k} 2^(k+1) a_d x_d,k + |a|²,
so in P-tree form
f(a)_x = Σ_{i,j,d} 2^(i+j) (P_d,i ∧ P_d,j)_x − Σ_k 2^(k+1) Σ_d a_d (P_d,k)_x + |a|².
Hence the Gaussian-weighted vote of a training point x is
β exp(−f(a)_x) = β exp(−Σ_{i,j,d} 2^(i+j) (P_d,i ∧ P_d,j)_x) · exp(Σ_{k,d} 2^(k+1) a_d (P_d,k)_x) · exp(−|a|²),
and the total vote for class c is
Σ_{x∈c} β exp(−f(a)_x) = β exp(−|a|²) Σ_{x∈c} exp(−Σ_{i,j,d} 2^(i+j) (P_d,i ∧ P_d,j)_x + Σ_{k,d} 2^(k+1) a_d (P_d,k)_x).
Collecting the diagonal (i = j) terms inside the exp gives
Σ_{x∈c} exp(Σ_{i≠j,d} −2^(i+j) (P_d,i ∧ P_d,j)_x + Σ_{i=j,d} (a_d 2^(i+1) − 2^(2i)) (P_d,i)_x)
= Σ_{x∈c} ( Π_{i≠j,d} exp(−2^(i+j) (P_d,i ∧ P_d,j)_x) · Π_{i=j,d} exp((a_d 2^(i+1) − 2^(2i)) (P_d,i)_x) )
= Σ_{x∈c} ( Π_{i≠j,d: (P_d,i∧P_d,j)_x=1} exp(−2^(i+j)) · Π_{i=j,d: (P_d,i)_x=1} exp(a_d 2^(i+1) − 2^(2i)) ).   (eq 1)
Inside the exp, each coefficient is multiplied by the 1|0 bit (which depends on x); for fixed i, j, d the factor is x-independent if the bit is 1 and drops out if the bit is 0. Some additional formulas: f(a)_x = Σ_{d,i,j} 2^(i+j) (x_d,i − a_d,i)(x_d,j − a_d,j) = Σ_{d,i,j} 2^(i+j) (x_d,i x_d,j − 2 a_d,i x_d,j + a_d,i a_d,j).

  27. f_d(a)_x = |x − a|_d = |Σ_i 2^i (x_d,i − a_d,i)| = |Σ_{i: a_d,i=0} 2^i x_d,i − Σ_{i: a_d,i=1} 2^i x'_d,i|. Thus, for the derived attribute f_d(a) = the numeric distance of x_d from a_d: if we remember that when a_d,i = 1 we subtract those contributing powers of 2 (don't add), and that we then use the complemented dimension-d basic P-trees, it should work. The point is that we can get a set of near-basic or negative-basic P-trees, nbP-trees, for the derived attribute f_d(a) directly from the basic P-trees for A_d, for free. Thus the near-basic P-trees for f_d(a) are the basic A_d P-trees for those bit positions where a_d,i = 0, and the complements of the basic A_d P-trees for those bit positions where a_d,i = 1 (called f_d(a)'s nbP-trees). Caution: subtract the contribution of the nbP-trees for positions where a_d,i = 1. Note: nbP-trees are not predicate trees (are they? what's the predicate?). The EIN ring formulas are related to this; how? If we are simply after easy pruning contours containing a (so that we can scan to get the actual Euclidean epsilon neighbors and/or to get Gaussian-weighted vote counts), we can use Hobbit-type contours (middle-earth contours of a?). See the next slide for a discussion of Hobbit contours.
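
The bit-slice identity above is easy to verify exhaustively for small bit widths; a minimal sketch (4-bit values, names hypothetical):

```python
# |x_d - a_d| equals |sum over positions where a_di = 0 of 2^i x_di  minus
# sum over positions where a_di = 1 of 2^i x'_di|, where x' is the complemented bit.
n_bits = 4

def bit(v, i): return (v >> i) & 1

def dist_from_slices(x_d, a_d):
    pos = sum(2 ** i * bit(x_d, i)       for i in range(n_bits) if bit(a_d, i) == 0)
    neg = sum(2 ** i * (1 - bit(x_d, i)) for i in range(n_bits) if bit(a_d, i) == 1)
    return abs(pos - neg)

assert all(dist_from_slices(x, a) == abs(x - a)
           for x in range(2 ** n_bits) for a in range(2 ** n_bits))
print("identity holds for all 4-bit pairs")
```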

  28. A principle: a job is not done until the Mathematics is completed. The Mathematics of a research job includes: 0. getting to the frontiers of the area (researching, organizing, understanding and integrating everything others have done in the area up to the present moment and what they are likely to do next); 1. developing a killer idea for a better way to do something; 2. proving claims (theorems, performance evaluation, simulation, etc.); 3. simplification (everything is simple once fully understood); 4. generalization (to the widest possible application scope); and 5. insight (what are the main issues and underlying mega-truths, with full drill-down). Therefore, we need to ask the following questions at this point: Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the "point closest to the mean" definition is influenced by skewness, like the mean)? We will denote the vector of medians as μ̃. h̃(a) = |a − μ̃| is an important functional (better than h(a) = |a − μ|?). If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns μ and μ̃ coincide. (So if μ and μ̃ are far apart, that tells us there is high skew in the data, and the coordinates where they differ are the columns where the skew is found.)

  29. Additional Mathematics to enjoy: What about the vector of standard deviations, σ? (computable with P-trees!) Do we have an improvement of BIRCH here, generating similarly comprehensive statistical measures but much faster and more focused? We can do the same for any rank statistic (or order statistic), e.g., the vector of 1st or 3rd quartiles, Q1 or Q3, or the vector of kth rank values (kth ordinal values). If we preprocessed to get the basic P-trees of σ and of each mixed quartile vector (e.g., in 2-D add 5 new derived attributes: σ, Q1,1, Q1,2, Q2,1, Q2,2, where Qi,j is the ith quartile of the jth column), what does this tell us (e.g., what can we conclude about the location of core clusters)? Maybe all we need is the basic P-trees of the column quartiles, Q1..Qn? L∞ ordinal disks: disk(C,k) = {x | x_d is one of the k nearest neighbors of a_d, for every dimension d}. skin(C,k), closed skin(C,k) and ring(C,k) are defined as above. Are they easy P-tree computations? Do they offer advantages? When? What? Why? E.g., do they automatically normalize for us?

  30. The Middle Earth (Hobbit) contours of a are gotten by ANDing in the basic P-tree where a_d,i = 1 and ANDing in the complement where a_d,i = 0 (down to some bit-position threshold in each dimension, bpt_d; bpt_d can be the same for each d or not). Caution: Hobbit contours of a are not symmetric about a. That becomes a problem (for knowing when you have a symmetric neighborhood in the contour), especially when many lowest-order bits of a are identical (e.g., if a_d = 8 = 1000). If the low-order bits of a_d are zeros, one should union (OR) in the Hobbit contour of a_d − 1 (e.g., for 8 also take 7 = 0111). If the low-order bits of a_d are ones, one should union (OR) in the Hobbit contour of a_d + 1 (e.g., for 7 = 111 also take 8 = 1000). Some needed research: since we are looking for an easy prune that gets our mask down to a scannable size (low root count), but not so much of a prune that we have too few voters within Euclidean epsilon distance of a for a good vote, how can we quickly determine an easy choice of a Hobbit prune to accomplish that? Note that there are many Hobbit contours. We can start with pruning in just one dimension, with only the lowest-order bit in that dimension, and work from there; how, though? THIS COULD BE VERY USEFUL?

  31. Suppose there are two classes, red and green, and they are on the cylinder shown. Then the vector connecting medians (vcm) in YZ space is shown in purple, the unit vector in the direction of the vector connecting medians (uvcm) in YZ space is shown in blue, and the vector from the midpoint of the medians to the sample s is in orange. The inner product of the blue and the orange is the same as the inner product we would get by doing the same thing in all 3 dimensions! The point is that the x-component of the red vector of medians and that of the green are identical, so the x-component of the vcm is zero. Thus, when the vcm component in a given dimension is very small or zero, we can eliminate that dimension! That's why I suggest a threshold for the inner product in each dimension first. It is a feature or attribute relevance tool. (Diagram: the cylinder with sample s and axes x, y, z.)

  32. DBQ versus MIKE (DataBase Querying vs. Mining through data for Information and Knowledge Extraction). Why do we call it Mining through data for Information & Knowledge Extraction and not just Data Mining? We mine silver and gold! We don't just mine rock. (The emphasis should be on the desired result, not the discard. The name should emphasize what we mine for, not what we mine from.) Silver and gold are low-volume, high-value products, found (or not) in mountains of rock (high-volume, low-value). Information and knowledge are low-volume, high-value, hiding in mountains of data (high-volume, low-value). In both MIKE and MSG the output and the substrate are substantially different in structure (chemical / data structure). Just as in Mining Silver and Gold we extract (hopefully) silver and gold from raw rock, in Mining through data for Information and Knowledge we extract (hopefully) information and knowledge from raw data. So Mining through data for Information and Knowledge Extraction is the correct terminology and MIKE is the correct acronym, not Data Mining (DM). How is Data Base Querying (DBQ) different from Mining through data for Info & Knowledge (MIKE)? In all mining (MIKE as well as MSG) we hope to successfully mine out something of value, but failure is likely, whereas in DBQ valuable results are likely and no result is unlikely. DBQ should be called Data Base Quarrying, since it is more similar to Granite Quarrying (GQ), in that what we extract has the same structure as that from which we extract it (the substrate); it has higher value because of its detail and specificity. I.e., the output records of a DBQ are exactly the reduced-size set of records we demanded and expected from our query, and the output gravestones of GQ are exactly the size and shape we demanded and expected, and in both cases what is left is a substance that is the same as what is taken. In sum, DBQ = quarrying (highly predictable output, and the output has the same structure as the substrate (sets of records)); MIKE = mining (unpredictable output, and the output has a different structure than the substrate (e.g., T/F or a partition)).

  33. Some good datasets for classification • KDDCUP-99 Dataset (Network Intrusion Dataset) • 4.8 million records, 32 numerical attributes • 6 classes, each containing >10,000 records • Class distribution: • Testing set: 120 records, 20 per class • 4 synthetic datasets (randomly generated): • 10,000 records (SS-I) • 100,000 records (SS-II) • 1,000,000 records (SS-III) • 2,000,000 records (SS-IV)

  34. Speed and Scalability. Speed (scalability) comparison (k=5, hs=25): running time against varying cardinality. Machine: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux. Note: these evaluations were done when we were still sorting the derived TV attribute and before we used Gaussian vote weighting; therefore both the speed and the accuracy of SMART-TV have improved markedly! (Chart: time in seconds, roughly 0 to 100, vs. training set cardinality (x1000), 1000 to 4891; methods: SMART-TV, PKNN, KNN.)

  35. Dataset (Cont.) • OPTICS dataset • 8,000 points, 8 classes (CL-1, CL-2,…,CL-8) • 2 numerical attributes • Training set: 7,920 points • Testing set: 80 points, 10 per class

  36. Dataset (Cont.) • IRIS dataset • 150 samples • 3 classes (iris-setosa, iris-versicolor, and iris-virginica) • 4 numerical attributes • Training set: 120 samples • Testing set: 30 samples, 10 per class

  37. Overall Accuracy. (Chart: overall classification accuracy comparison.)

  38. More Mathematics to enjoy: A past student used a heap process to get the k nearest neighbors of unclassified samples (requiring one scan through the well-pruned nearest-neighbor candidate set). This means that Taufik did not use the closed kNN set, so accuracy will be the same as horizontal kNN (single scan, non-closed); actually, accuracy varies slightly depending on which kth NN is picked from the ties. A great learning experience with respect to using DataMIME, and a great opportunity for a thesis, exists here: namely showing that when one uses closed kNN in SMART-TV, not only do we get a tremendously scalable algorithm, but also a much more accurate result (even slightly faster, since there is no heaping to do). A project: re-run the performance measurements (just for SMART-TV) using a final scan of the pruned candidate Nearest Neighbor Set. Let all candidates vote using a Gaussian vote drop-off function as follows: if a candidate lies at Euclidean distance > ε from a, its vote weight is 0; otherwise we define the Gaussian drop-off function g(x) = Gauss(r = |x−a|) = (1/(std·√(2π))) e^(−(r−mean)²/(2·var)), where std, mean, var refer to the set of distances from a of the non-zero voters (i.e., the set of r = |x−a| numbers), but use the Modified Gaussian MG(x) = g(x) − Gauss(ε), so that the vote weight function drops smoothly to 0 right at the boundary of the ε-disk about a and then stays zero outside it.
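
A minimal sketch of this vote drop-off, assuming MG(r) = g(r) − g(ε) for r ≤ ε and 0 beyond; eps, mean, std and the voter distances below are made-up illustration values:

```python
import math

# A candidate at Euclidean distance r = |x - a| <= eps votes with weight
# MG(r) = g(r) - g(eps), so the weight falls smoothly to 0 at the boundary of
# the eps-disk; beyond eps it is 0. mean/std are taken over the voter distances.
def modified_gaussian_weight(r, eps, mean, std):
    var = std * std
    g = lambda t: (1.0 / (std * math.sqrt(2 * math.pi))) * math.exp(-(t - mean) ** 2 / (2 * var))
    return 0.0 if r > eps else g(r) - g(eps)

dists = [0.4, 0.9, 1.3, 1.8, 2.5]          # made-up voter distances from a
mean = sum(dists) / len(dists)
std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
for r in dists + [3.5]:
    print(r, round(modified_gaussian_weight(r, eps=3.0, mean=mean, std=std), 4))
```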

  39. More Mathematics to enjoy: A good project is the study of how much the above improves accuracy in various settings (and how various parameterizations of g(x), à la statistics, affect it). More enjoyment: if there is an example data set where accuracy goes down when using the Gaussian and closed NNS, does that prove the data set is noise at that point (i.e., that there is no continuity relationship between features and classes at a)? This leads to an interesting theory regarding the continuity of a training set. Everyone assumes the training set is what you work with, and you assume continuity! In fact, cancer studies forge ahead with classification even when p >> n, that is, when there are just too few feature tuples for continuity to even make good sense! So now we have a method of identifying those regions of the training space where there is a discontinuity in the feature-to-class relationship, FC. That will be valuable (e.g., in Bioinformatics). Has this been studied? (I think everyone just goes ahead and classifies based on continuity and lives with the result!)

  40. More enjoyment (continued): I claim that when the class label is real, then with a properly chosen Gaussian vote-weight function (chosen by domain experts) and with a good method of testing the classifier, if SMART-TV mis-classifies a test point, it is not a mis-classification at all; rather, there is a discontinuity in FC there (between feature and class)! In other words, that proves the training set is wrong there and should not be used. Does SMART-TV, properly tuned, DEFINE CORRECT? It will be fun to see how far one can go with this point of view. Be warned: it will cause quite a stir! Thoughts: 1. Choose a vote drop-off Gaussian carefully (with domain knowledge in play) and designate it as "right". What could be more right, if you agree that classification has to be based on FC continuity? 2. Analyze (very carefully) SMART-TV vote histograms for ε1 < ε2 < ... < εh. If all are inconclusive, then the feature-to-class function (FC) is discontinuous at a and classification SHOULD NOT BE ATTEMPTED USING THAT TRAINING SET! (This may get us into kriging.) If the histograms get more and more conclusive as the radius increases, then possibly one would want to rely on the outer-ring votes, but one should also report that there is class noise at a! 3. We can get into tuning the Modified Gaussian, MG, by region. Determine the subregion where MG gives conclusive histograms. Then, for each inconclusive training point, examine modifications (other parameters or even other drop-off functions) until you find one that is conclusive. Determine the full region where that one is conclusive ...

  41. More enjoyment (continued): CAUTION! Many important Class Label Attributes (CLAs) are real (e.g., the level of ill intent in homeland-security data mining, the level of ill intent in network intrusion analysis, the probability of cancer in a cancer tissue microarray dataset), but many important Class Label Attributes are categorical (e.g., bioinformatic annotation, phenotype prediction, etc.). When the class label is categorical, the distance on the CLA becomes the characteristic (discrete) distance function (distance = 0 iff the two categories are the same). Continuity at a becomes: there exists δ > 0 such that d(x,a) < δ implies f(x) = f(a). Possibly boundary determination of the training-set classes is most important in that case? Is that why SVM works so well in those situations? Still, for there to be continuity at a, it must be the case that some NNS of a maps to f(a). However, if the CLA is real: can you find an analysis of "what is the best definition of correct" (do statisticians do this?). Step 1: re-run SMART-TV with the Modified Gaussian, MG(x), and closed kNN. Take several standard UCI MLR classification datasets and randomize the classes in some particular isotropic neighborhood (so that we know where there is definitely an FC discontinuity). Then show (using matlab?) that SVM fails to detect that discontinuity (i.e., SVM gives a definitive class to it without the ability to detect that it doesn't deserve one; or do the Lagrange multipliers do that for them?), and then show that we can detect that situation. Does any other author do this? Other ideas?

  42. Classifying with multi-relational (heterogeneous) training data. How do we classify on a foreign-key table S when some of the features need to come from the primary-key table R? R(A0 A1 A2) and S(B0 B1 B2 B3). (The slide shows the two tables' bit contents and their basic P-trees P11..P43 and S11..S43.) To data mine this PK multi-relation (R.A0 is the ordered ascending primary key and S.B2 is the foreign key), scan S, building (basic P-trees for) the derived attributes Bn+1..Bn+m (here B4, B5) from A1..Am using the bottom-up approach (next slide). Note: once the derived basic P-trees are built, what if a tuple is added to S? If it has a new B2 value, then a new tuple must have been added to R also (with that value in A0); then all basic P-trees must be extended (including the derived ones).

  43. R(B2 A1 A2) and S(B0 B1 B2 B3). (The slide shows the tables' bit contents, the basic P-trees P01..P33, P4,1, P4,2, P5, and the derived columns S.B4,1, S.B4,2, S.B5.) The cost is the same as an indexed nested-loop join (it is reasonable to assume there is a primary index on R). When an insert is made to R, nothing has to change. When an insert is made to S, the P-tree structure is expanded and filled using the values in that insert plus the R-attribute values of the new S.B2 value (this is one index lookup; the S.B2 value must exist in R.A0 by referential integrity). Finally, if we are using, e.g., 4Pi,j P-trees instead of the (4,2,1)Pi,j P-trees shown here, it is the same: the basic P-tree fanout is /\, the left leaf is filled by the first 4 values, and the right leaf is filled with the last 4.
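
A minimal sketch of that indexed nested-loop lookup, with made-up table contents (R keyed on its primary key; each S tuple pulls R's feature attributes through its foreign key; in the slides the appended columns would then be bit-sliced into basic P-trees like any other attribute of S):

```python
# R maps its primary key to its feature attributes (A1, A2); S carries the
# foreign key in position B2.
R = {0b001: (0b11, 0b0),
     0b011: (0b10, 0b1),
     0b100: (0b01, 0b1),
     0b101: (0b01, 0b0)}

S = [(0b0, 0b1, 0b001, 0b1),   # (B0, B1, B2 = foreign key, B3)
     (0b1, 0b0, 0b100, 0b0),
     (0b0, 0b0, 0b011, 0b1)]

def extend(S, R):
    # one index lookup per S tuple; referential integrity guarantees a hit
    return [s + R[s[2]] for s in S]

for row in extend(S, R):
    print(row)                 # each S tuple now carries the derived B4, B5 values
```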

  44. If we are using, e.g., 4Pi,j P-trees instead of the (4,2,1)Pi,j P-trees shown here, it is the same: the basic P-tree fanout is /\, the left leaf is filled by the first 4 values, and the right leaf is filled with the last 4. (The slide shows the derived columns S.B4,1, S.B4,2, S.B5 together with their masks: EM-S.B4,1 = 1,1; EM-S.B4,2 = 1,1; EM-S.B5 = 0,0; UM-S.B4,1 = 0,1; UM-S.B4,2 = 1,1; UM-S.B5 = 0,0.)

  45. VertiGO (Vertical Gene Ontology). The GO is a data structure which needs to be mined together with various valuable bioinformatic data sets. Biologists waste time searching for all available information about each small area of research. This is hampered further by variations in terminology in common usage at any given time, which inhibit effective searching by computers as well as people. E.g., in a search for new targets for antibiotics, you want all gene products involved in bacterial protein synthesis that have significantly different sequence or structure from those in humans. If one DB says these molecules are involved in 'translation' and another uses 'protein synthesis', it is difficult for you, and even harder for a computer, to find functionally equivalent terms. GO is an effort to address the need for consistent descriptions of gene products in different DBs. The project began in 1998 as a collaboration between three model-organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD), and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include several of the world's major repositories for plant, animal and microbial genomes. See the GO web page for a full list of member organizations.

  46. VertiGO (Vertical Gene Ontology). The GO is a DAG which needs to be mined in conjunction with the Gene Table (one tuple for each gene, with feature attributes). The DAG links are IS-A or PART-OF links. (The description follows from the GO website.) Take the simplified view that the GO assigns annotations of types (Biological Process (BP), Molecular Function (MF), Cellular Component (CC)) to genes, with qualifiers ("contributes to", "co-localizes with", "not") and evidence codes: IDA = Inferred from Direct Assay, IGI = Inferred from Genetic Interaction, IMP = Inferred from Mutant Phenotype, IPI = Inferred from Physical Interaction, TAS = Traceable Author Statement, IEP = Inferred from Expression Pattern, RCA = Inferred from Reviewed Computational Analysis, IC = Inferred by Curator, IEA = Inferred by Electronic Annotation, ISS = Inferred from Sequence/Structural Similarity, NAS = Non-traceable Author Statement, ND = No Biological Data Available, NR = Not Recorded. Solution 1: For each annotation (term or GOID) have a 2-bit type-code column GOIDType (BP=11, MF=10, CC=00), a 2-bit qualifier-code column GOIDQualifier (contributesto=11, co-localizeswith=10, not=00), and a 4-bit evidence-code column GOIDEvidence (e.g., IDA=1111, IGI=1110, IMP=1101, IPI=1100, TAS=1011, IEP=1010, ISS=1001, RCA=1000, IC=0111, IEA=0110, NAS=0100, ND=0010, NR=0001), putting the DAG structure in the schema catalog. (This increases the width by 8 bits * #GOIDs to losslessly incorporate the GO info.) Solution 2: The BP, MF and CC DAGs are disjoint (they share no GOIDs? true?), so an alternative is to use a single 4-bit evidence-code/qualifier column, GOIDECQ. For evidence codes: IDA=1111, IGI=1110, IMP=1101, IPI=1100, TAS=1011, IEP=1010, ISS=1001, RCA=1000, IC=0111, IEA=0110, NAS=0101, ND=0100, NR=0011. Qualifiers: 0010=contributesto, 0001=colocalizeswith, 0000=not. (The width increases by 4 bits * #GOIDs for a lossless GO.) Solution 3: Bitmap all 13 evidence codes and all 3 qualifiers (adds a 16-bit map per GO term). Keep in mind that genes are assumed to be inherited up the DAGs but are only listed at the lowest level to which they apply; this will keep the bitmaps sparse. If a GO term has no attached genes, it need not be included (are there many such?): it will be in the schema with its DAG links, and will be assumed to inherit all downstream genes, but it will not generate 16 bit columns in the Gene Table. Is the "not" qualifier the complement of the term bitmap?

  47. GO has 3 structured, controlled vocabularies (ontologies) describing gene products (the RNA or protein resulting after transcription) by their species-independent, associated biological processes (BP), cellular components (CC) and molecular functions (MF). There are three separate aspects to this effort. The GO consortium: 1. writes and maintains the ontologies themselves; 2. makes associations between the ontologies and the genes / gene products in the collaborating DBs; 3. develops tools that facilitate the creation, maintenance and use of the ontologies. The use of GO terms by several collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that you can query them at different levels: e.g., 1. use GO to find all gene products in the mouse genome that are involved in signal transduction, then 2. zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product. GO is not a database of gene sequences or a catalog of gene products; GO describes how gene products behave in a cellular context. GO is not a way to unify biological databases (i.e., GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but it is not sufficient. Reasons include: knowledge changes, and updates lag behind.

  48. Curators evaluate data differently (e.g., it is not enough to agree to use the word 'kinase'; we must also support this by stating how and why we use 'kinase', and apply it consistently; only in this way can we hope to compare gene products and determine whether they are related). GO does not attempt to describe every aspect of biology; for example, domain structure, 3D structure, evolution and expression are not described by GO. GO is not a dictated standard mandating nomenclature across databases; groups participate because of self-interest, and cooperate to arrive at a consensus. The 3 organizing GO principles: molecular function, biological process, cellular component. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. E.g., the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane. Molecular function (an organizing principle of GO) describes, e.g., catalytic or binding activities at the molecular level. GO molecular function terms represent activities rather than the entities (molecules / complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place. Molecular functions correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding. It is easy to confuse a gene product with its molecular function, and thus many GO molecular functions are appended with the word "activity". The documentation on gene products explains this confusion in more depth.

  49. A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms: cellular physiological process or signal transduction. Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step. A biological process is not equivalent to a pathway; we are specifically not capturing or trying to represent any of the dynamics or dependencies that would be required to describe a pathway. A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object, which may be an anatomical structure (e.g., rough endoplasmic reticulum or nucleus) or a gene product group (e.g., ribosome, proteasome or a protein dimer). What does the ontology look like? GO terms are organized in structures called directed acyclic graphs (DAGs), which differ from hierarchies in that a child (a more specialized term) can have many parents (less specialized terms). For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the true path rule: if the child term describes the gene product, then all its parent terms must also apply to that gene product.

  50. It is easy to confuse a gene product and its molecular function, because very often these are described in exactly the same words. For example, 'alcohol dehydrogenase' can describe what you can put in an Eppendorf tube (the gene product) or it can describe the function of this stuff. There is, however, a formal difference: a single gene product might have several molecular functions, and many gene products can share a single molecular function, e.g., there are many gene products that have the function 'alcohol dehydrogenase'. Some, but by no means all, of these are encoded by genes with the name alcohol dehydrogenase. A particular gene product might have both the functions 'alcohol dehydrogenase' and 'acetaldehyde dismutase', and perhaps other functions as well. It is important to grasp that, whenever we use terms such as alcohol dehydrogenase activity in GO, we mean the function, not the entity; for this reason, most GO molecular function terms are appended with the word 'activity'. Many gene products associate into entities that function as complexes, or 'gene product groups', which often include small molecules. They range in complexity from the relatively simple (for example, hemoglobin contains the gene products alpha-globin and beta-globin, and the small molecule heme) to complex assemblies of numerous different gene products, e.g., the ribosome. At present, small molecules are not represented in GO. In the future, we might be able to create cross products by linking GO to existing databases of small molecules such as Klotho or LIGAND.
