Explore the spatial proximity of structural data attributes in machine learning through clustering, classification, and proximity sets. Understand how data analysis is performed based on near neighbor sets and continuous classification methods.
Spatial Proximity of Structural Data Attributes. Maria Canton, William Perrizo, Dept. of CS, North Dakota State University. CATA 2007 – Honolulu, Hawaii
Data analysis can be broken down into two parts: Querying and Data Mining. Data Mining can be broken down into two parts: Machine Learning and Association Rule Mining. Machine Learning can be broken down into two parts: Clustering and Classification. Clustering can be broken down into two parts: Isotropic (round clusters) and Density-based.
So Machine Learning begins by identifying Near Neighbor Set(s), NNS. In Isotropic Clustering, round sets are identified (disk shaped Near Neighbor Sets about a center). In Density Clustering, cores are identified (dense NNSs) then pieced together by overlap. Classification is always based on continuity which is necessarily Near Neighbor Set based.
Classification. We classify a sample based on its NNS class histogram (AKA k Nearest Neighbor or kNN classification), or we identify isotropic NNSs of centroids (AKA k-means), or we build decision trees whose leaves are disjoint training subsets whose histograms classify samples falling to that leaf, or we find class boundaries (e.g., SVM) that distinguish NNSs in one class from the rest.
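The first option above, classifying a sample by the class histogram of its near neighbor set, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_classify`, the toy training data, and the use of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(sample, training, k):
    """Classify `sample` by the class histogram of its k nearest training points.

    `training` is a list of (feature_vector, class_label) pairs. Euclidean
    distance is an assumption of this sketch; any distance could be used.
    """
    # Sort training points by distance to the sample and keep the k closest:
    # this is the k Nearest Neighbor Set of the sample.
    neighbors = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    # The class histogram of that near neighbor set decides the label.
    histogram = Counter(label for _, label in neighbors)
    return histogram.most_common(1)[0][0], histogram

# Hypothetical toy data: two classes in the plane.
training = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
            ((0.9, 1.1), "b"), ((0.2, 0.1), "a")]
label, hist = knn_classify((0.15, 0.15), training, k=3)
print(label, dict(hist))
```

Here the three nearest neighbors of the sample all carry class "a", so the histogram vote is unanimous.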
Continuity. Recall the definition of continuity: $\forall \varepsilon > 0\ \exists \delta > 0 : d(x,a) < \delta \Rightarrow d(f(x),f(a)) < \varepsilon$, or, said using Near Neighbor Sets, $\forall$ NNS about $f(a)$ $\exists$ NNS about $a$ that maps inside it. In a database, class values are discrete (finite) and thus Nearest Neighbor Sets (Proximity Sets) are fundamental to Machine Learning.
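Near Neighbor Sets of a set. Given a similarity, $s: R \times R \to \mathbb{R}$ (e.g., $s(x,y) = s(y,x)$ and $s(x,x) \ge s(x,y)\ \forall x, y \in R$), an extension of $s$ to disjoint subsets of $R$ (e.g., single link / complete link / average link...), and $C \subseteq R$, a k-disk of $C$ (a k Nearest Neighbor Set of $C$) is a set $\mathrm{disk}(C,k) \supseteq C$ such that $|\mathrm{disk}(C,k) \cap C'| = k$ and $s(x,C) \ge s(y,C)\ \forall x \in \mathrm{disk}(C,k),\ y \notin \mathrm{disk}(C,k)$, where $C'$ is the complement of $C$ in $R$.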
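[Figure: disks of radius r1 and r2 about a set C, and about a single point a for C = {a}.]
$\mathrm{skin}(C,k) \equiv \mathrm{disk}(C,k) - C$. Skin stands for "s k immediate neighbors" and is also a kNNS of C. $\mathrm{cskin}(C,k) \equiv \bigcup$ of all skin(C,k)s (the closed skin), and $\mathrm{ring}(C,k) = \mathrm{cskin}(C,k) - \mathrm{cskin}(C,k-1)$.
$\mathrm{disk}(C,r_1) \equiv \{x \in R \mid s(x,C) \ge r_1\}$, $\mathrm{skin}(C,r_1) \equiv \mathrm{disk}(C,r_1) - C$, $\mathrm{ring}(C,r_2,r_1) \equiv \mathrm{disk}(C,r_2) - \mathrm{disk}(C,r_1) = \mathrm{skin}(C,r_2) - \mathrm{skin}(C,r_1)$.
Given a [pseudo] distance, d, rather than a similarity, just reverse all inequalities.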
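The disk, skin, and ring constructions above can be sketched in a few lines. This example uses a distance rather than a similarity, so the inequalities are reversed as the slide describes. The single-link extension, Euclidean distance, the function names, and the toy point set are assumptions chosen for illustration.

```python
import math

def set_distance(x, C):
    """Single-link extension: distance from x to the closest member of C."""
    return min(math.dist(x, c) for c in C)

def disk(R, C, r):
    """disk(C, r): all points of R within distance r of C (C itself included)."""
    return {x for x in R if set_distance(x, C) <= r}

def skin(R, C, r):
    """skin(C, r) = disk(C, r) - C."""
    return disk(R, C, r) - set(C)

def ring(R, C, r2, r1):
    """ring(C, r2, r1) = disk(C, r2) - disk(C, r1), with r2 >= r1 for a distance."""
    return disk(R, C, r2) - disk(R, C, r1)

def k_disk(R, C, k):
    """One k-disk of C: C plus its k nearest outside points.

    Ties at the boundary are broken arbitrarily, which is why a k-disk
    need not be unique.
    """
    outside = sorted(set(R) - set(C), key=lambda x: set_distance(x, C))
    return set(C) | set(outside[:k])

# Hypothetical toy universe R and a one-point set C = {a}.
R = {(0, 0), (1, 0), (2, 0), (3, 0), (0, 1)}
C = {(0, 0)}
print(disk(R, C, 1.0), skin(R, C, 1.0), ring(R, C, 2.0, 1.0), k_disk(R, C, 2))
```

For this data the radius-1 skin and the 2-skin coincide, since exactly two points lie at distance 1 from C.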
Non-Isotropic (Vector) Spatial Proximity Structures. The shadow vector made by a vector x on another vector y, denoted $x_{y\,\mathrm{shadow}}$ or just $x_{y\,\mathrm{shad}}$, is the dot product of x with a unit vector in the y direction times that unit vector:
$x_{y\,\mathrm{shad}} = \left(x \circ \frac{y}{|y|}\right)\frac{y}{|y|} = \frac{x \circ y}{y \circ y}\, y = \frac{x \circ y}{|y|^2}\, y$
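Perp vectors. The perpendicular vector made by a vector x on another vector y, denoted $x_{y\,\mathrm{perpendicular}}$ or just $x_{y\,\mathrm{perp}}$, is the difference of x and its y-shadow: $x_{y\,\mathrm{perp}} \equiv x - x_{y\,\mathrm{shad}}$, so that $|x_{y\,\mathrm{perp}}|^2 = |x|^2 - |x_{y\,\mathrm{shad}}|^2$. Both $x_{y\,\mathrm{shad}}$ and $x_{y\,\mathrm{perp}}$ are linear in x.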
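As a small worked check of the shadow and perp definitions, the sketch below computes both for a pair of plane vectors and confirms the Pythagorean relation between their lengths. The function names `shadow` and `perp` are chosen here for illustration only.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def shadow(x, y):
    """Shadow of x on y: ((x . y) / (y . y)) * y."""
    scale = dot(x, y) / dot(y, y)
    return [scale * c for c in y]

def perp(x, y):
    """Perp of x on y: x minus its shadow on y."""
    return [a - b for a, b in zip(x, shadow(x, y))]

x, y = [3.0, 4.0], [2.0, 0.0]
s, p = shadow(x, y), perp(x, y)
print(s, p)                            # shadow lies along y, perp is orthogonal to y
print(dot(p, y))                       # orthogonality check
print(dot(p, p), dot(x, x) - dot(s, s))  # |perp|^2 versus |x|^2 - |shad|^2
```

Because the shadow is a fixed scalar multiple of y determined by a dot product, both maps are linear in x, as stated above.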
Proximity Structures based on shad and perp. In the collaborative filtering problem, e.g., predicting the rating, $u_m$, of a movie, m, by a user, u, from ratings given by users, v, let's consider users as spatial vectors of ratings over movie dimensions. The other users, v, provide signals for predicting $u_m$. Note that a user, v, whose ratings are $v_n = u_n + 1$ for all movies, n, that u has already rated is just as strong a prediction signal as one with exactly matching ratings, $v_n = u_n\ \forall n \in \mathrm{Supp}(u)$. In standard collaborative filtering, such vs (I will call them +1 signals) are filtered out as not being proximal to u.
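Pure Signals in Collaborative Filters proximity structures. Filter out all collaborators except exact match signals, +1 signals and -1 signals (collectively called pure signals), as non-proximal? For this we use $y = \mathbf{1} = (1,1,\dots,1)$, so $|y|^2 = n$, and $x = v - u$.
In general, $x_{y\,\mathrm{shad}} = \frac{x \circ y}{y \circ y}\, y$ and $x_{y\,\mathrm{perp}} \equiv x - x_{y\,\mathrm{shad}}$. With $y = \mathbf{1}$: $x_{\mathrm{shad}} = \frac{x \circ \mathbf{1}}{n}\,\mathbf{1} = \frac{\sum_k x_k}{n}\,\mathbf{1} = \bar{x}\,\mathbf{1}$ and $x_{\mathrm{perp}} = x - \bar{x}\,\mathbf{1}$.
$\mathrm{SignedLength}\big((v-u)_{\mathrm{shad}}\big) = |v-u|\cos\theta = (v-u) \circ \frac{\mathbf{1}}{\sqrt{n}} = \frac{v \circ \mathbf{1}}{\sqrt{n}} - \frac{u \circ \mathbf{1}}{\sqrt{n}} = \frac{\sum_k v_k}{\sqrt{n}} - \frac{\sum_k u_k}{\sqrt{n}} = \sqrt{n}\,(\bar{v} - \bar{u})$
RatingPrediction from v $= v_m - \frac{1}{\sqrt{n}}\,\mathrm{SignedLength}\big((v-u)_{\mathrm{shad}}\big)$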
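One way to read the pure-signal idea above: a collaborator v is a pure signal for u when v - u has no component perpendicular to (1,1,...,1), that is, when v differs from u by the same constant on every movie u has rated. The sketch below tests for that and predicts the rating of movie m as v's rating of m minus the mean offset. The dictionary representation, the function names, the tolerance, and the toy ratings are assumptions for this example, and it assumes v has rated every movie in Supp(u) as well as m.

```python
def offsets(v, u):
    """Differences v_k - u_k over Supp(u), the movies u has rated."""
    return [v[k] - u[k] for k in u]

def is_pure_signal(v, u, allowed=(-1, 0, 1), tol=1e-9):
    """True when (v - u) is a constant offset in `allowed` over Supp(u).

    A zero perp component relative to (1,...,1) means every difference
    equals the mean difference.
    """
    d = offsets(v, u)
    mean = sum(d) / len(d)
    perp_sq = sum((x - mean) ** 2 for x in d)  # squared length of the perp vector
    return perp_sq < tol and any(abs(mean - a) < tol for a in allowed)

def predict_from(v, u, m):
    """Prediction of u's rating of m from v: v_m minus the mean offset of v from u."""
    d = offsets(v, u)
    return v[m] - sum(d) / len(d)

# Hypothetical ratings: u has rated A, B, C; movie M is to be predicted.
u = {"A": 3, "B": 4, "C": 2}
v = {"A": 4, "B": 5, "C": 3, "M": 5}   # a +1 signal
w = {"A": 3, "B": 1, "C": 5, "M": 2}   # not a pure signal
print(is_pure_signal(v, u), predict_from(v, u, "M"))
print(is_pure_signal(w, u))
```

The +1 signal v rates every shared movie one point above u, so its rating of M is shifted down by one to give the prediction for u.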
RSI domains • Spatial domain operations, used in analyzing remotely sensed imagery, take into account pixels’ structural attributes as well as neighborhood conditions. • Using the programming utility, TM-Mine, we find the following.
Execution Times for Band Functionals of Different Complexities on a Full TM Scene of 210,000,000 Pixels (P4 3.0 GHz; dataset size 2.10 × 10^8 pixels)
Band functional | Formula | Execution time
VI (vegetation index) | NIR / R | 142.5 seconds
NDVI (normalized difference) | (NIR – R) / (NIR + R) | 307.5 seconds
TVI (transformed veg index) | {[(G-B)/(G+B)+0.5]^0.5}*100 | 442.0 seconds
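For reference, the three band functionals in the table can be written as per-pixel functions. This is a plain-Python sketch, not the TM-Mine implementation. The function names and the sample band values are assumptions, the TVI formula is copied as it appears on the slide (using the G and B bands), and denominators are assumed to be nonzero.

```python
import math

def vi(nir, r):
    """Vegetation index: NIR / R."""
    return nir / r

def ndvi(nir, r):
    """Normalized difference vegetation index: (NIR - R) / (NIR + R)."""
    return (nir - r) / (nir + r)

def tvi(g, b):
    """Transformed vegetation index as written on the slide:
    sqrt((G - B) / (G + B) + 0.5) * 100.
    Requires (G - B) / (G + B) >= -0.5 so the square root is defined.
    """
    return math.sqrt((g - b) / (g + b) + 0.5) * 100

# Hypothetical band values for a single pixel.
nir, r, g, b = 120.0, 40.0, 60.0, 20.0
print(vi(nir, r), ndvi(nir, r), tvi(g, b))
```

The growing operation count from VI to NDVI to TVI matches the ordering of the execution times reported in the table.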