Explore the different areas of database analysis and data mining, including querying, machine learning, association rule mining, clustering, and classification. Understand the concepts of nearest neighbor sets, functional contours, and similarity-based classification.
Database analysis can be broken down into two areas: querying and data mining. Data mining can be broken down into two areas: machine learning and association rule mining. Machine learning can be broken down into two areas: clustering and classification. Clustering can be broken down into two types: isotropic (round clusters) and density-based. Classification can be broken down into two types: model-based and neighbor-based.

Machine learning is based on Near Neighbor Sets (NNS). Clustering, even density-based, identifies near-neighbor cores first (round NNSs about a center). Classification is continuity-based, and Near Neighbor Sets are the central concept in continuity:

∀ε>0 ∃δ>0 : d(x,a)<δ ⟹ d(f(x),f(a))<ε

where f assigns a class to a feature vector; the ε-NNS of f(a) has a δ-NNS of a in its pre-image. If f(Dom) is categorical, ∃δ>0 : d(x,a)<δ ⟹ f(x)=f(a).

Caution: for classification, it may be the case that one has to use the continuity in lower dimensions to get a prediction (due to data sparseness). E.g., suppose points 1,2,3,4,5,6,7,8 are all at distance ε from a, with 1,2,3,4 → class C and 5,6,7,8 → class D. Any ε that gives us a vote gives us a tie vote. However, projecting onto the vertical subspace and taking ε/2, the ε/2-neighborhood about a contains only 5 and 6, so it gives us class D.

Using horizontal data, NNS derivation requires at least one scan (at least O(n)). An L∞ disk NNS can be derived using vertical data in O(log₂n), yet usually Euclidean disks are preferred. (Note: Euclidean and Manhattan coincide on binary data sets.)

Our solution in a sentence: circumscribe the desired Euclidean ε-neighborhood with functional contours (sets of the type f⁻¹([b,c])) until the intersection is scannable, then scan it for Euclidean-ε-neighborhood membership. Advantage: the intersection can be determined before scanning, by creating and ANDing functional-contour P-trees.
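As a minimal NumPy sketch of the idea (per-dimension interval contours standing in for contour P-trees, then a scan of only the surviving candidates):

```python
import numpy as np

def euclidean_nbrhd(R, a, eps):
    """Circumscribe the Euclidean eps-disk about a with per-dimension
    interval contours (an L-infinity box), AND the contour masks, then
    scan only the surviving rows for true Euclidean membership."""
    mask = np.ones(len(R), dtype=bool)
    for d in range(R.shape[1]):
        # one functional contour per dimension: f_d^-1([a_d-eps, a_d+eps])
        mask &= (R[:, d] >= a[d] - eps) & (R[:, d] <= a[d] + eps)
    candidates = R[mask]
    return candidates[np.linalg.norm(candidates - a, axis=1) <= eps]

# toy example: the box keeps 5 of 8 points; the final scan keeps 3
R = np.array([[1, 1], [2, 2], [8, 8], [2, 1], [9, 1], [1, 9], [3, 3], [2, 3]])
print(euclidean_nbrhd(R, np.array([2.0, 2.0]), 1.2))
```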
Contours: f: R(A₁..Aₙ) → Y. Equivalently, f defines a derived attribute, A_f, with domain Y (the equivalence is x.A_f = f(x) ∀x∈R), and graph(f) = { (x, f(x)) | x∈R }. [Diagram: the table R(A₁..Aₙ), the extended table R*(A₁..Aₙ, A_f) with A_f = f(x₁..xₙ), and an f-contour(S) pulled back from S ⊆ Y.]

For S ⊆ Y, the f-contour(S) = f⁻¹(S). Equivalently, A_f-contour(S) = SELECT x₁..xₙ FROM R* WHERE x.A_f ∈ S. If S = {a}, we use f-isobar(a), equivalently A_f-isobar(a).

If f is a local density and {S_k} is a partition of Y, then {f⁻¹(S_k)} partitions R. (E.g., in OPTICS, f = reachability distance and {S_k} is the partition produced by the intersections of graph(f), with respect to a walk of R, and a horizontal line.) A weather map uses an equiwidth interval partition of S = Reals (barometric pressure or temperature contours). A grid is the intersection partition with respect to the dimension projection functions (next slide). A class is a contour under f: R → C, the class map. An L∞ ε-disk about a is the intersection of the ε-dimension-projection contours containing a.
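A minimal sketch of an f-contour as a derived-attribute selection (the functional f used here is hypothetical):

```python
# The f-contour of S is just the preimage f^-1(S):
# SELECT x1..xn FROM R* WHERE A_f IN S.
def f_contour(R, f, S):
    R_star = [(x, f(x)) for x in R]        # R* = R extended with A_f
    return [x for (x, af) in R_star if af in S]

R = [(2, 7), (6, 7), (2, 6), (5, 2), (7, 0)]
f = lambda x: x[0] + x[1]                  # a hypothetical functional
print(f_contour(R, f, {9, 7}))             # [(2, 7), (5, 2), (7, 0)]
```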
GRIDs: Given f: R → Y and a partition S = {S_k} of Y, {f⁻¹(S_k)} is the f-grid of R (grid cells = contours).

If Y = Reals with b-bit values, the j.lo f-grid is produced by agglomerating over the j low-order bits of Y, fixing each (b−j) high-order bit pattern. The j lo bits walk [the isobars of] cells; the b−j hi bits identify cells (lo = extension / hi = intention). Letting b−1,...,0 be the b bit positions of Y, the j.lo f-grid is the partition of R generated by f and S = {S_{b-1,...,b-j}}, where S_{b-1,...,b-j} = [σ_{b-1}σ_{b-2}...σ_{b-j}0..0, σ_{b-1}σ_{b-2}...σ_{b-j}1..1], a partition of Y = Reals into intervals with fixed high bits σ_{b-1}..σ_{b-j}.

If F = {f_h}, the j.lo F-grid is the intersection partition of the j.lo f_h-grids (intersection of partitions). The canonical j.lo grid is the j.lo π-grid, with π = {π_d: R → R[A_d] | π_d = d-th coordinate projection}. j.hi gridding is similar (the b−j lo bits walk cell contents / the j hi bits identify cells).

[Figure: 2.lo grid vs. 1.hi grid when the horizontal and vertical dimensions have bitwidths 3 and 2 respectively. Want square cells or a square pattern?]
j.lo and j.hi gridding, continued: if horizontal_bitwidth = vertical_bitwidth = b, then the j.lo grid = the (b−j).hi grid. [Figure: for hb = vb = b = 3 and j = 2, the 2.lo grid coincides with the 1.hi grid.]
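A minimal sketch of j.lo and j.hi cell identification by bit shifting, assuming b-bit attribute values; it also demonstrates the j.lo = (b−j).hi coincidence stated above:

```python
# A j.lo grid cell is identified by the (b-j) high-order bits;
# a j.hi grid cell by the j high-order bits.
def jlo_cell(value, j):
    return value >> j                 # drop the j lo bits that walk the cell

def jhi_cell(value, b, j):
    return value >> (b - j)           # keep only the j hi bits

# For b=3, j=2: the 2.lo grid has 2 cells of width 4,
# and it coincides with the (b-j) = 1.hi grid.
print([jlo_cell(v, 2) for v in range(8)])      # [0, 0, 0, 0, 1, 1, 1, 1]
print([jhi_cell(v, 3, 1) for v in range(8)])   # [0, 0, 0, 0, 1, 1, 1, 1]
```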
A distance, d, generates a similarity in many ways, e.g., s(x,y) = 1/(1+d(x,y)); or, if the relationship varies by location, s(x,y) = ρ(x,y)/(1+d(x,y)). Another: s(x,y) = e^{−d(x,y)²}. A truncated Gaussian: s(x,y) = e^{−d(x,y)²/σ} − e^{−ε²/σ} for d(x,y) ≤ ε, and s(x,y) = 0 for d(x,y) > ε (its maximum, at d = 0, is 1 − e^{−ε²/σ}). Vote weighting IS a similarity assignment, so a similarity-to-distance graph IS a vote weighting for classification. [Figure: graphs of s vs. d for each of these similarities, for C = {a}.]

SOME useful NearNeighborSets (NNS), given a similarity s: R×R → Reals (i.e., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) ∀x,y∈R) and a C ⊆ R:

Ordinal disks, skins and rings:
disk(C,k) ⊇ C : |disk(C,k) ∩ C′| = k and s(x,C) ≥ s(y,C) ∀x∈disk(C,k), y∉disk(C,k)
skin(C,k) = disk(C,k) − C (skin comes from "s k immediate neighbors" and is a kNNS of C)
ring(C,k) = cskin(C,k) − cskin(C,k−1)
closeddisk(C,k) ≡ ∪ of all disk(C,k); closedskin(C,k) ≡ ∪ of all skin(C,k)

Cardinal disks, skins and rings:
disk(C,r) ≡ {x∈R | s(x,C) ≥ r}; also = the functional contour f⁻¹([r,∞)), where f(x) = s_C(x) = s(x,C)
skin(C,r) ≡ disk(C,r) − C
ring(C,r₂,r₁) ≡ disk(C,r₂) − disk(C,r₁) = skin(C,r₂) − skin(C,r₁); also = the functional contour s_C⁻¹([r₂,r₁))

Note: closeddisk(C,k) and closedskin(C,k) are redundant, since closeddisk(C,k) = disk(C,s(C,y)) where y is any k-th NN of C.

L∞ skins: skin∞(a,k) = {x | ∃d such that x_d is one of the k-NNs of a_d}; a local normalization?
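A minimal sketch of the cardinal disk/skin/ring definitions; the choice s(x,C) = max over c∈C of s(x,c) is our assumption, since the extract does not define s(x,C) for a set C:

```python
def sim(x, y, d):                    # one distance-to-similarity map
    return 1.0 / (1.0 + d(x, y))     # s(x,y) = 1/(1+d(x,y))

def disk(R, C, r, s):                # disk(C,r) = {x in R | s(x,C) >= r}
    return {x for x in R if s(x, C) >= r}

def skin(R, C, r, s):                # skin(C,r) = disk(C,r) - C
    return disk(R, C, r, s) - C

def ring(R, C, r2, r1, s):           # ring(C,r2,r1) = disk(C,r2) - disk(C,r1)
    return disk(R, C, r2, s) - disk(R, C, r1, s)

d = lambda x, y: abs(x - y)
s_C = lambda x, C: max(sim(x, c, d) for c in C)   # assumed set-similarity

R, C = set(range(10)), {5}
print(disk(R, C, 0.5, s_C))          # {4, 5, 6}
print(skin(R, C, 0.5, s_C))          # {4, 6}
print(ring(R, C, 0.3, 0.5, s_C))     # {3, 7}
```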
f: R(A₁..Aₙ) → Y, S ⊆ Y. The (uncompressed) predicate-tree 0P_{f,S} is defined by: 0P_{f,S}(x) = 1 (true) iff f(x) ∈ S. 0P_{f,S} is called a P-tree for short and is just the existential R*-bitmap of the predicate S on R*.A_f.

The compressed P-tree, sP_{f,S}, is the compression of 0P_{f,S} with equi-width leaf size, s, as follows:
1. Choose a walk of R (converts 0P_{f,S} from a bit map to a bit vector).
2. Equi-width partition 0P_{f,S} with segment size s (s = leafsize; the last segment can be short).
3. Eliminate, and mask to 0, all pure-zero segments (call the mask the NotPure0 Mask, or EM).
4. Eliminate, and mask to 1, all pure-one segments (call the mask the Pure1 Mask, or UM).
(EM = existential aggregation; UM = universal aggregation.) Compressing each leaf of sP_{f,S} with leafsize s₂ gives s₁,s₂P_{f,S}; recursively, s₁,s₂,s₃P_{f,S}, then s₁,s₂,s₃,s₄P_{f,S}, ... (this builds an EM tree and a UM tree).

BASIC P-trees: If A_i is real or binary and f_{i,j}(x) = the j-th bit of x_i, then {(*)P_{f_{i,j},{1}} ≡ (*)P_{i,j}}_{j=b..0} are the basic (*)P-trees of A_i, * = s₁..s_k. If A_i is categorical and f_{i,a}(x) = 1 if x_i = a, else 0, then {(*)P_{f_{i,a},{1}} ≡ (*)P_{i,a}}_{a∈R[A_i]} are the basic (*)P-trees of A_i.

Notes: The UM masks (e.g., of 2^k,...,2^0 P_{i,j}, with k = ⌈log₂|R|⌉) form a (binary) tree. Whenever the EM bit is 0, that entire subtree can be eliminated (since it represents a pure-0 segment); then a 0-node at level k (lowest level = level 0) with no subtree indicates a 2^k-run of zeros. In this construction, the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction, of the same. We have suppressed the leafsize prefix.
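A minimal sketch of one compression level (the EM and UM masks of steps 3 and 4), assuming a plain Python list as the bit vector:

```python
# Partition the bit vector into equi-width segments of size s, record the
# NotPure0 (EM) and Pure1 (UM) masks, and keep only the mixed leaves.
def compress(bits, s):
    segs = [bits[i:i + s] for i in range(0, len(bits), s)]
    em = [int(any(seg)) for seg in segs]      # 0 <=> pure-zero segment
    um = [int(all(seg)) for seg in segs]      # 1 <=> pure-one segment
    leaves = {i: seg for i, seg in enumerate(segs)
              if em[i] == 1 and um[i] == 0}   # only mixed segments survive
    return em, um, leaves

bits = [0, 0, 0, 0, 1, 0, 1, 1]
print(compress(bits, 2))   # EM=[0,0,1,1], UM=[0,0,0,1], leaf 2 = [1,0]
```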
A data table, R(A₁..Aₙ), containing horizontal structures (records), is converted to vertical basic binary predicate-trees (P-trees): vertically partition the table, then compress each vertical bit slice into a basic binary P-tree, as follows.

R(A₁ A₂ A₃ A₄), with 3-bit attributes:
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100

The horizontal structures (records) are scanned vertically to produce the bit slices R₁₁ R₁₂ R₁₃ R₂₁ ... R₄₃ (R_{ij} = the j-th bit slice of A_i); e.g., R₁₁ = 0 0 0 0 1 0 1 1.

The basic binary P-tree, P₁₁, for R₁₁ is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity:
1. The whole file is not pure1 → root node 0.
2. The 1st half is not pure1 → node 0; but it is pure (pure0), so this branch ends.
3. The 2nd half (1011) is not pure1 → node 0.
4. The 1st half of the 2nd half (10) is not pure → node 0.
5. The 2nd half of the 2nd half (11) is pure1 → node 1.
6. The 1st half of the 1st half of the 2nd half is pure1 → node 1.
7. The 2nd half of the 1st half of the 2nd half is pure0 → node 0, and this branch ends.

The slices are processed vertically (vertical scans), and queries are then processed using multi-operand logical ANDs; e.g., to count occurrences of 111 000 001 100, AND P₁₁^P₁₂^P₁₃^P′₂₁^P′₂₂^P′₂₃^P′₃₁^P′₃₂^P₃₃^P₄₁^P′₄₂^P′₄₃ (worked out two slides ahead).
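A minimal sketch of the vertical partitioning step, using the table above in decimal form; bit_slice is a hypothetical helper name:

```python
# Vertically partition R(A1..A4) (3-bit attributes) into the 12 bit
# slices R11..R43, the inputs to basic P-tree construction.
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 7, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]

def bit_slice(R, attr, bit):
    """The bit slice of attribute `attr` at bit position `bit` (0 = lo)."""
    return [(row[attr] >> bit) & 1 for row in R]

R11 = bit_slice(R, 0, 2)   # high bit of A1
print(R11)                 # [0, 0, 0, 0, 1, 0, 1, 1]
```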
Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient. Bottom-up construction of P₁₁ is done using an in-order tree traversal and the collapsing of pure siblings, as follows. [Figure: R₁₁ = 0 0 0 0 1 0 1 1 built into P₁₁ leaf by leaf; sibling pairs of pure-0 (or pure-1) nodes collapse into a single pure node at the parent level.]
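Here is one way to realize the bottom-up construction as a single left-to-right pass with a stack; the tuple encoding ('m', left, right) for mixed nodes is our assumption, not the slides' storage format:

```python
def build_bottom_up(bits):
    """One left-to-right pass over the bit slice; equal-level siblings
    merge on a stack, and pure sibling pairs collapse to one 0/1 node."""
    stack = []                            # holds (level, node) pairs
    for b in bits:
        node, level = b, 0
        while stack and stack[-1][0] == level:
            _, sib = stack.pop()
            if sib == node and not isinstance(node, tuple):
                pass                      # pure siblings collapse: keep node
            else:
                node = ('m', sib, node)   # mixed internal node
            level += 1
        stack.append((level, node))
    return stack[0][1]                    # assumes len(bits) is a power of 2

print(build_bottom_up([0, 0, 0, 0, 1, 0, 1, 1]))
# ('m', 0, ('m', ('m', 1, 0), 1))  -- the same P11 built top-down above
```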
Processing efficiencies? (Prefixed leaf-sizes have been removed.)

R(A₁ A₂ A₃ A₄) in decimal:
2 7 6 1
3 7 6 0
2 7 5 1
2 7 5 7
5 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

To count occurrences of (7,0,1,4), i.e., of the bit pattern 111 000 001 100, AND the basic P-trees, complemented where the pattern bit is 0:

P₁₁ ^ P₁₂ ^ P₁₃ ^ P′₂₁ ^ P′₂₂ ^ P′₂₃ ^ P′₃₁ ^ P′₃₂ ^ P₃₃ ^ P₄₁ ^ P′₄₂ ^ P′₄₃

Pruning propagates through the AND: a single pure-0 operand node makes the entire corresponding branch 0 without touching the other operands (e.g., one 0 makes the entire left branch 0; a handful of 0s make a node 0). In the result, the 2¹-level has the only 1-bit, so the 1-count = 1·2¹ = 2.
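As an illustration of the AND semantics only (on uncompressed bit slices rather than real compressed P-trees, so the tree pruning is not shown), a minimal sketch that counts a tuple pattern by ANDing basic slices, complemented where the pattern bit is 0:

```python
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 7, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]

def count_pattern(R, pattern, bits=3):
    """Count rows matching `pattern` by ANDing bit slices P / P'."""
    total = 0
    for row in R:
        b = 1
        for attr, val in enumerate(pattern):
            for j in range(bits):
                slice_bit = (row[attr] >> j) & 1
                want = (val >> j) & 1
                b &= slice_bit if want else 1 - slice_bit   # P or P'
        total += b
    return total

print(count_pattern(R, (7, 0, 1, 4)))   # 2
```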
A useful functional, the Total Variation: TV(a) = Σ_{x∈R} (x−a)∘(x−a). Using d as an index variable over the dimensions, and i, j, k as bit-slice indexes:

TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² − 2a_d·x_d + a_d²)
      = Σ_x Σ_d (Σ_i 2^i x_{di})(Σ_j 2^j x_{dj}) − 2 Σ_x Σ_d a_d (Σ_k 2^k x_{dk}) + |R||a|²
      = Σ_{x,d,i,j} 2^{i+j} x_{di} x_{dj} − 2 Σ_{x,d,k} 2^k a_d x_{dk} + |R||a|²
      = Σ_{i,j,d} 2^{i+j} |P_{di∧dj}| − Σ_k 2^{k+1} Σ_d a_d |P_{dk}| + |R||a|²

where |P_{di∧dj}| is the root count of the AND of the basic P-trees for bits i and j of dimension d. Equivalently, since Σ_k 2^k Σ_x x_{dk} = |R|μ_d,

TV(a) = Σ_{x,d,i,j} 2^{i+j} x_{di} x_{dj} + |R| (−2 Σ_d a_d μ_d + Σ_d a_d²).

Note that the first term does not depend upon a. Thus, the derived attribute TV−TV(μ) (eliminate the first term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).
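To make the root-count formula concrete, here is a minimal NumPy sketch that computes TV(a) from the counts |P_{di∧dj}| of ANDed bit slices (plain arrays standing in for compressed P-trees) and checks it against the direct definition; the 3-bit width and the toy table are assumptions:

```python
import numpy as np

def tv(X, a, bits=3):
    """TV(a) = sum_{i,j,d} 2^(i+j) |P_di^dj| - 2|R| sum_d a_d mu_d
             + |R| |a|^2, from root counts of ANDed bit slices."""
    n_rows, dims = X.shape
    sl = [[(X[:, d] >> i) & 1 for i in range(bits)] for d in range(dims)]
    term1 = sum((1 << (i + j)) * int(np.sum(sl[d][i] & sl[d][j]))
                for d in range(dims)
                for i in range(bits) for j in range(bits))
    mu = X.mean(axis=0)
    return term1 - 2 * n_rows * float(a @ mu) + n_rows * float(a @ a)

X = np.array([[2, 7], [3, 7], [5, 2], [7, 0]])
a = np.array([2.0, 2.0])
print(tv(X, a))                                  # 89.0
print(sum(np.sum((x - a) ** 2) for x in X))      # 89.0 (direct definition)
```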
From equation (7), with f(a) = TV(a) − TV(μ):

f(a) = |R| (−2 Σ_d (a_d μ_d − μ_d μ_d) + Σ_d (a_d² − μ_d²)) = |R| |a−μ|², so f(μ) = 0.

Define g(a) = HDTV(a) = ln(f(a)) = ln|R| + ln|a−μ|².

The value of g(a) depends only on the length of a−μ, so the isobars are hyper-circles centered at μ, and the graph of g is a log-shaped hyper-funnel.

For an ε-contour ring (radius ε about a), go inward and outward along a−μ by ε, to the inner point b = μ + (1 − ε/|a−μ|)(a−μ) and the outer point c = μ + (1 + ε/|a−μ|)(a−μ). Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval, and use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a). [Figure: the hyper-funnel graph of g over (x₁,x₂), with the interval [g(b), g(c)] cutting out the ε-contour about a.]
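A minimal sketch of the step "go inward and outward along a−μ by ε", assuming ε < |a−μ|:

```python
import numpy as np

def hdtv_interval(a, mu, n_rows, eps):
    """Inner/outer points b, c of the eps-contour ring about a, and the
    HDTV interval [HDTV(b), HDTV(c)] used to build the mask P-tree."""
    r = np.linalg.norm(a - mu)                 # assumes eps < r
    b = mu + (1 - eps / r) * (a - mu)          # inner point
    c = mu + (1 + eps / r) * (a - mu)          # outer point
    hdtv = lambda x: np.log(n_rows) + 2 * np.log(np.linalg.norm(x - mu))
    return hdtv(b), hdtv(c)

lo, hi = hdtv_interval(np.array([5., 5.]), np.array([1., 1.]), 100, 1.0)
print(lo, hi)   # endpoints of the vertical interval [HDTV(b), HDTV(c)]
```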
As pre-processing, calculate the basic P-trees for the HDTV derived attribute (or another hyper-circular-contour derived attribute). To classify a:
1. Calculate b and c (they depend on a and ε).
2. Form the mask P-tree for the training points with HDTV-values in [HDTV(b), HDTV(c)].
3. Use that P-tree to prune out the candidate NNS.
If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using dimension projections, i.e., use a circumscribing A_d-contour. (Note: A_d is not a derived attribute at all, but just A_d itself, so we already have its basic P-trees.)

If the HDTV circumscribing contour of a is still too populous, use the voting function G(x) = Gauss(|x−a|) − Gauss(ε), where Gauss(r) = (1/(std·√(2π))) e^{−(r−mean)²/(2·var)} and std, mean, var are taken with respect to the set of distances from a of the voters, i.e., {r = |x−a| : x a voter}.

We can also note that HDTV can be further simplified (retaining the same contours) using h(a) = |a−μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., h_b(a) = |a−b| for any fixed b. [Figure: the HDTV funnel about μ, the interval [HDTV(b), HDTV(c)], and the contour of the dimension projection f(a) = a₁.]
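A minimal sketch of the Gaussian vote function G; the mean/std values below are hypothetical, and clamping negative votes to zero is our assumption (only points inside the ε-contour vote):

```python
import math

def gauss(r, mean, std):
    """Gauss(r) = (1/(std*sqrt(2*pi))) * exp(-(r-mean)^2 / (2*var))."""
    return (1.0 / (std * math.sqrt(2 * math.pi))) \
        * math.exp(-(r - mean) ** 2 / (2 * std ** 2))

def vote(x, a, eps, mean, std):
    """G(x) = Gauss(|x-a|) - Gauss(eps); zero at the contour radius eps."""
    r = math.dist(x, a)
    return max(0.0, gauss(r, mean, std) - gauss(eps, mean, std))

print(vote((1.0, 1.0), (0.0, 0.0), 2.0, 0.0, 1.0))
```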
[Figure: graphs of functionals with hyper-circular contours over a 5×5 grid (X, Y in 1..5): TV and TV(x₁₅); TV−TV(μ), with μ = x₃₃, so TV(x₁₅)−TV(μ); HDTV; h(a) = |a−μ|; and h_b(a) = |a−b|.]
Angular variation functionals: e.g., AV(a) ≡ (1/|a|) Σ_{x∈R} x∘a. With d an index over the dimensions:

AV(a) = (1/|a|) Σ_x Σ_{d=1..n} x_d a_d = (1/|a|) Σ_d (Σ_x x_d) a_d   (factor out a_d)
      = (|R|/|a|) Σ_{d=1..n} ((Σ_x x_d)/|R|) a_d = (|R|/|a|) Σ_{d=1..n} μ_d a_d = (|R|/|a|) μ∘a

COS(a) ≡ AV(a)/(|μ||R|) = μ∘a/(|μ||a|) = cos(θ_a), the cosine of the angle between a and μ.

COS (and AV) have hyper-conic isobars centered on μ, and COS and AV have ε-contour(a) = the space between the two hyper-cones centered on μ which just circumscribe the Euclidean ε-hyperdisk at a. [Figure: the hyper-conic ε-contour of COS about a, and its intersection (in pink) with the HDTV ε-contour. Similarly, COS_b(a) can be formed for any vector b.]
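A minimal NumPy sketch of AV and COS on a toy table, checking the closed forms above:

```python
import numpy as np

def av(X, a):
    """AV(a) = (|R|/|a|) * (mu . a)."""
    mu = X.mean(axis=0)
    return len(X) * float(mu @ a) / np.linalg.norm(a)

def cos_functional(X, a):
    """COS(a) = mu.a / (|mu||a|) = cos of the angle between a and mu."""
    mu = X.mean(axis=0)
    return float(mu @ a) / (np.linalg.norm(mu) * np.linalg.norm(a))

X = np.array([[2., 7.], [3., 7.], [5., 2.], [7., 0.]])
a = np.array([1., 1.])
print(av(X, a), cos_functional(X, a))
```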
For the vote of a single training point x, define f(a)_x = (x−a)∘(x−a). With d an index over the dimensions, i, j, k bit-slice indexes, and (P)_x denoting the bit of P-tree P at x:

f(a)_x = Σ_{d=1..n} (x_d² − 2a_d·x_d + a_d²)
       = Σ_d (Σ_i 2^i x_{di})(Σ_j 2^j x_{dj}) − 2 Σ_d a_d (Σ_k 2^k x_{dk}) + |a|²
       = Σ_{d,i,j} 2^{i+j} x_{di} x_{dj} − Σ_{d,k} 2^{k+1} a_d x_{dk} + |a|²
       = Σ_{i,j,d} 2^{i+j} (P_{di∧dj})_x − Σ_k 2^{k+1} Σ_d a_d (P_{dk})_x + |a|²

Adding up the Gaussian-weighted votes for class c:

Σ_{x∈c} β·e^{−f(a)_x} = β·e^{−|a|²} Σ_{x∈c} exp( −Σ_{i,j,d} 2^{i+j} (P_{di∧dj})_x + Σ_{k,d} 2^{k+1} a_d (P_{dk})_x )

Collecting the diagonal (i = j) terms inside the exp:

= β·e^{−|a|²} Σ_{x∈c} exp( Σ_{i≠j,d} −2^{i+j} (P_{di∧dj})_x + Σ_{i,d} (a_d 2^{i+1} − 2^{2i}) (P_{di})_x )
= β·e^{−|a|²} Σ_{x∈c} ( Π_{i≠j,d: (P_{di∧dj})_x=1} e^{−2^{i+j}} ) ( Π_{i,d: (P_{di})_x=1} e^{a_d 2^{i+1} − 2^{2i}} )   (eq. 1)

Inside the exp we have, for fixed i, j, d, coefficients which do not involve x, each multiplied by a 1-bit or a 0-bit depending on x; thus each factor either contributes its x-independent coefficient (if the bit is 1) or does not (if the bit is 0).
Suppose there are two classes, red (−) and green (+), on the ε-cylinder shown. The vector connecting medians (vcm) in YZ space is shown in purple; the unit vector in the direction of the vector connecting medians (uvcm) in YZ space is shown in blue; and the vector from the midpoint of the medians to the sample s is shown in orange. The inner product of the blue and orange vectors is the same as the inner product we would get by doing it in 3D! The point is that the x-component of the red vector of medians and that of the green are identical, so the x-component of the vcm is zero. (A small vcm component means prune that dimension out!)
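A minimal sketch of the pruning rule (a small vcm component means prune that dimension out), with hypothetical toy classes:

```python
import numpy as np

def vcm_keep_dims(red, green, tol=1e-9):
    """Compute the vector connecting medians of the two classes and keep
    only the dimensions where its component is not (near) zero."""
    vcm = np.median(green, axis=0) - np.median(red, axis=0)
    return np.nonzero(np.abs(vcm) > tol)[0]

red   = np.array([[1., 2., 5.], [1., 3., 6.], [1., 2., 7.]])
green = np.array([[1., 8., 1.], [1., 9., 2.], [1., 8., 3.]])
print(vcm_keep_dims(red, green))   # x-medians match, so dim 0 is pruned
```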