Distance Functions on Hierarchies. Eftychia Baikousi. Outline. Definition of metric & similarity Various Distance Functions Minkowski Set based Edit distance Basic concept of OLAP Lattice Distance in same level of hierarchy Distance in different level of hierarchy.
Distance Functions on Hierarchies Eftychia Baikousi
Outline • Definition of metric & similarity • Various Distance Functions • Minkowski • Set based • Edit distance • Basic concepts of OLAP • Lattice • Distance in the same level of hierarchy • Distance in different levels of hierarchy
Definition of metric • A distance function on a given set M is a function d: M × M → ℝ that satisfies the following conditions: • Non-negativity: d(x,y) ≥ 0, and d(x,y) = 0 iff x = y • The distance between two different points is positive, and it is zero precisely from a point to itself • Symmetry: d(x,y) = d(y,x) • The distance between x and y is the same in either direction • Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z) • The direct distance between two points is never longer than a route through a third point • A function that satisfies all of these conditions is a metric
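The three conditions can be checked by brute force on a small finite set of points. The following Python is only a sketch of such a check. The function name `is_metric` and the tolerance parameter are illustrative assumptions, not part of the slides.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the metric conditions for d on a finite collection of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                      # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):        # zero exactly when x == y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

absolute_difference = lambda x, y: abs(x - y)
squared_difference = lambda x, y: (x - y) ** 2

print(is_metric([0, 1, 3, 7], absolute_difference))  # passes all conditions
print(is_metric([0, 1, 3, 7], squared_difference))   # fails the triangle inequality
```

The squared difference fails because d(0,3) = 9 exceeds d(0,1) + d(1,3) = 5.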
Definition of similarity metric • Let s(x,y) be the similarity between two points x and y; then the following properties hold: • 0 ≤ s(x,y) ≤ 1, and s(x,y) = 1 only if x = y • s(x,y) = s(y,x) for all x and y (symmetry) • The triangle inequality does not hold in general
Edit Distance - Levenshtein distance • The edit distance between two strings x = x_1 … x_n and y = y_1 … y_m is defined as the minimum number of atomic edit operations needed to transform x into y • Insert: ins(x,i,c) = x_1 x_2 … x_i c x_{i+1} … x_n • Delete: del(x,i) = x_1 x_2 … x_{i-1} x_{i+1} … x_n • Replace: rep(x,i,c) = x_1 x_2 … x_{i-1} c x_{i+1} … x_n • Every edit operation o is assigned the cost c(o) = 1
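With unit costs for insert, delete and replace, the minimum number of operations can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to write it in Python. The function name `levenshtein` is an illustrative choice.

```python
def levenshtein(x: str, y: str) -> int:
    """Minimum number of unit-cost insert/delete/replace operations turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy, or keep it if equal
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("MARTHA", "MARHTA"))
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of y.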
Edit distances • Needleman-Wunsch distance or Sellers algorithm • Insert • a character: ins(x,i,c) = x_1 x_2 … x_i c x_{i+1} … x_n, with cost(o) = 1 • a gap: ins_g(x,i,g) = x_1 x_2 … x_i g x_{i+1} … x_n, with cost(o) = g • Delete • a character: del(x,i) = x_1 x_2 … x_{i-1} x_{i+1} … x_n, with cost(o) = 1 • a gap: del_g(x,i) = x_1 x_2 … x_{i-1} x_{i+1} … x_n, with cost(o) = g • Replace • a character: rep(x,i,c) = x_1 x_2 … x_{i-1} c x_{i+1} … x_n, with cost(o) = 1
Edit distances • Jaro distance • Let s and t be two strings, and let • s’ = the characters in s that are common with t • t’ = the characters in t that are common with s • T_{s,t} = the number of transpositions of characters in s’ relative to t’
Edit distances • Jaro distance Example • Let s = MARTHA and t = MARHTA • |s’| = 6 • |t’| = 6 • T_{s,t} = 2/2 = 1, since the mismatched characters are T/H and H/T
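The Jaro formula itself is not reproduced in the text of these slides. As a hedged sketch, the Python below computes |s’| and T for the example and then evaluates two ways of combining them. The first is the widely used form (1/3)(|s’|/|s| + |t’|/|t| + (|s’| − T)/|s’|). The second is a variant with 2|s’| in the last denominator. The matching window floor(max(|s|,|t|)/2) − 1 is the usual convention and is assumed here. The names `jaro_parts`, `jaro` and `halve_last_term` are illustrative.

```python
def jaro_parts(s: str, t: str):
    """Return (m, T): the number of common characters and half the out-of-order matches."""
    if not s or not t:
        return 0, 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)  # assumed matching window
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                break
    s_common = [c for c, hit in zip(s, s_hit) if hit]  # s' in the order of s
    t_common = [c for c, hit in zip(t, t_hit) if hit]  # t' in the order of t
    mismatched = sum(a != b for a, b in zip(s_common, t_common))
    return len(s_common), mismatched / 2

def jaro(s: str, t: str, halve_last_term: bool = False) -> float:
    """Jaro similarity; halve_last_term=True uses 2|s'| in the last denominator."""
    m, T = jaro_parts(s, t)
    if m == 0:
        return 0.0
    last_denominator = 2 * m if halve_last_term else m
    return (m / len(s) + m / len(t) + (m - T) / last_denominator) / 3

print(jaro_parts("MARTHA", "MARHTA"))
print(jaro("MARTHA", "MARHTA"))
print(jaro("MARTHA", "MARHTA", halve_last_term=True))
```

For MARTHA and MARHTA the widely used form evaluates to 17/18 ≈ 0.9444, and the variant to 29/36 ≈ 0.8056. The variant agrees with the Jaro value 0.8055 quoted in the Jaro Winkler example, which suggests (without confirming) that the slides use it.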
Edit distances • Jaro Winkler • JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t))) • Where: • prefixLength: the length of the common prefix at the start of the two strings • PREFIXSCALE: a constant scaling factor that gives more favourable ratings to strings that match from the beginning, up to a set prefix length
Edit distances • Jaro Winkler Example • Let s = MARTHA, t = MARHTA and PREFIXSCALE = 0.1 • Jaro(s,t) = 0.8055 • prefixLength = 3 • JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t))) = 0.8055 + (3 * 0.1 * (1 - 0.8055)) = 0.86385
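The arithmetic of this example can be reproduced with a few lines of Python. This is a minimal sketch that takes the Jaro score as an input. The cap of 4 on the prefix length (`max_prefix`) is the value commonly used with Jaro Winkler and is an assumption here, as are the function names.

```python
def common_prefix_length(s: str, t: str, max_prefix: int = 4) -> int:
    """Length of the shared prefix of s and t, capped at max_prefix."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == max_prefix:
            break
        n += 1
    return n

def jaro_winkler(jaro_score: float, s: str, t: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """JWS(s,t) = Jaro(s,t) + prefixLength * PREFIXSCALE * (1 - Jaro(s,t))."""
    prefix_length = common_prefix_length(s, t, max_prefix)
    return jaro_score + prefix_length * prefix_scale * (1.0 - jaro_score)

print(common_prefix_length("MARTHA", "MARHTA"))  # shared prefix "MAR"
print(jaro_winkler(0.8055, "MARTHA", "MARHTA"))
```

The adjustment only moves the score towards 1, and it has no effect when the two strings share no prefix.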
Basic Concepts of OLAP • OLAP concerns the analysis of measurable quantities (measures) • sales, inventory, profit, ... • Dimensions: parameters that determine the context of the measures • date, product, location, salesperson, … • Cubes: combinations of dimensions that determine some measures • The cube defines a multidimensional space of dimensions, and the measures are points in this space
Cubes for OLAP • [Figure: a cube with axes REGION (W, S, N), PRODUCT (Juice, Cola, Soap) and MONTH (Jan); sample cell values 10 and 13]
Basic Concepts of OLAP • The data are considered to be stored in a multi-dimensional array, which is also called a cube or hypercube. • The cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the dimensions of the cube. • The contents of a cell are called measures and represent the measured values of the real world.
Hierarchies of levels for OLAP • A dimension models all the ways in which the data can be aggregated with respect to a specific parameter of their context. • Date, Product, Location, Salesperson, … • Every dimension has an associated hierarchy of levels for aggregating the data. This means that the dimension can be viewed at several levels of coarseness. • Date: day, week, month, year, …
Hierarchies of Levels • Hierarchies of levels: every dimension is organized into different levels of coarseness • The user can navigate from one level to another, creating new cubes each time • Coarseness: the opposite of detail (the correct Greek term is αδρομέρεια)
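Cubes & dimension hierarchies for OLAP • Dimensions: Product, Region, Date • Dimension hierarchies: • Product: Industry > Category > Product • Region: Country > Region > City > Store • Date: Year > Quarter > Month > Day, with Week as an alternative level above Day • [Figure: cube of sales volume with axes Region, Product and Month]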
Lattice • A lattice is a partially ordered set (poset) in which every pair of elements has a unique supremum and a unique infimum • The hierarchy of levels is formally defined as a lattice (L, <) • such that L = (L1, ..., Ln, ALL) is a finite set of levels and • < is a partial order defined among the levels of L • such that L1 < Li < ALL for every 1 < i ≤ n • the upper bound is always the level ALL, • so that we can group all values into the single value ‘all’ • The lower bound of the lattice is the most detailed level of the dimension
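To make the supremum concrete, the sketch below encodes a small, hypothetical Date hierarchy as "is directly below" edges and computes the least upper bound of two levels. The level names, the `covers` dictionary and the function names are illustrative assumptions. In particular, placing Week directly below ALL is a choice made for this example only.

```python
def upper_set(level, covers):
    """All levels that are greater than or equal to `level` in the partial order."""
    seen, stack = {level}, [level]
    while stack:
        for up in covers.get(stack.pop(), ()):
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen

def supremum(a, b, covers):
    """Least upper bound of levels a and b, or None if there is no unique one."""
    common = upper_set(a, covers) & upper_set(b, covers)
    # The supremum is the common upper bound that lies below every other one.
    least = [c for c in common if common <= upper_set(c, covers)]
    return least[0] if len(least) == 1 else None

# Hypothetical Date dimension: Day is the most detailed level, ALL the top.
covers = {
    "Day": ["Week", "Month"],
    "Week": ["ALL"],
    "Month": ["Quarter"],
    "Quarter": ["Year"],
    "Year": ["ALL"],
}

print(supremum("Week", "Month", covers))
print(supremum("Day", "Quarter", covers))
```

Week and Month are incomparable in this example, so their supremum is the top level ALL, while Day and Quarter are comparable and their supremum is Quarter.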
Distances in the same level of Hierarchy • Let D be a dimension, • L1 < Li < ALL its levels of hierarchy, and • x and y two specific values such that x, y ∈ Li • [Figure: hierarchy levels L1 < L2 < ALL]
Distances in the same level of Hierarchy • Explicit • Minkowski • Set Based • Highway • With respect to the detailed level • Attribute Based
Distances in the same level of Hierarchy • Explicit assignment • n² distances for the n values of dom(Li) • Minkowski family • for single values, every member reduces to the Manhattan distance: |x − y| • Set based family • reduces to {0, 1}, where dist(x, y) = 0 if x = y and 1 otherwise
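Both reductions can be checked numerically. In the sketch below, a single value is treated as a one-component vector for the Minkowski family and as a singleton set for the set based family. The Jaccard distance is used as one representative set based distance, which is an assumption, since no particular one is named here. The function names are illustrative.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|, taken as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# One-dimensional values: every order p gives |x - y| (up to rounding).
for p in (1, 2, 3):
    print(p, minkowski([3], [7], p))

# Singleton sets: the distance is 0 for equal values and 1 otherwise.
print(jaccard_distance({"Cola"}, {"Cola"}))
print(jaccard_distance({"Cola"}, {"Juice"}))
```

For higher orders p, the floating-point root may differ from |x − y| in the last digits, which is why a tolerance is appropriate when comparing results.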
Distances in the same level of Hierarchy • Highway distance • Let the values of level Li form a set of k clusters, where each cluster has a representative r_k • dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y), where r_x is the representative of the cluster of x • Specify • k² distances: dist(r_x, r_y), and • k distances: dist(x, r_x)
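The highway formula can be sketched with three lookup tables: the representative of each value, the distance of each value to its representative, and the distances between representatives. All names and numbers below are hypothetical and chosen only to illustrate the formula.

```python
def highway_distance(x, y, rep_of, to_rep, between_reps):
    """dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)."""
    r_x, r_y = rep_of[x], rep_of[y]
    # Representatives of the same cluster coincide, so that leg costs nothing.
    middle = 0 if r_x == r_y else between_reps[frozenset((r_x, r_y))]
    return to_rep[x] + middle + to_rep[y]

# Hypothetical level with two clusters, represented by "a1" and "b1".
rep_of = {"a1": "a1", "a2": "a1", "b1": "b1", "b2": "b1"}
to_rep = {"a1": 0, "a2": 2, "b1": 0, "b2": 3}
between_reps = {frozenset(("a1", "b1")): 10}

print(highway_distance("a2", "b2", rep_of, to_rep, between_reps))
print(highway_distance("a2", "a1", rep_of, to_rep, between_reps))
```

Applied literally, the formula gives dist(x, x) = 2·dist(x, r_x), which is non-zero for any value that is not a representative, so an implementation would probably special-case x = y. That is left out here.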
Distances in the same level of Hierarchy • With respect to the detailed level • f is a function that picks one of the descendants • Attribute based • every value v ∈ dom(L) of a level L is described by attributes [v_1 … v_n] • Distance can be defined with respect to the attributes
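The formula for the distance with respect to the detailed level is not given in the text. One plausible reading, sketched below as an assumption, is to let f pick one descendant of each value at the most detailed level and to measure the distance between the two picks there. The quarter and month data and all names are hypothetical.

```python
def detailed_level_distance(x, y, descendants, pick, base_dist):
    """Distance of x and y measured between one picked descendant of each."""
    return base_dist(pick(descendants[x]), pick(descendants[y]))

# Hypothetical level of quarters whose descendants are month numbers.
descendants = {"Q1": [1, 2, 3], "Q2": [4, 5, 6], "Q4": [10, 11, 12]}
month_gap = lambda a, b: abs(a - b)

# f = min picks the first month of a quarter, f = max the last one.
print(detailed_level_distance("Q1", "Q2", descendants, min, month_gap))
print(detailed_level_distance("Q1", "Q4", descendants, max, month_gap))
```

The result depends on the choice of f, so two different picking functions can rank the same pairs of values differently.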
Distances in different levels of Hierarchy • Explicit • dist1 + dist2 • dist3 + dist4 • With respect to the detailed level • With respect to their least common ancestor • Highway • Attribute Based
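Distances in different levels of Hierarchy • Let D be a dimension, • L1 < Li < ALL its levels of hierarchy, and • x and y two specific values such that x ∈ Lx, y ∈ Ly • Lx < Ly • x_y: the ancestor of x in level Ly • y_x: a descendant of y in level Lx • [Figure: x and y_x in level Lx, x_y and y in level Ly; dist1 between x and x_y, dist2 between x_y and y, dist3 between y and y_x, dist4 between y_x and x]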
Distances in different levels of Hierarchy • Explicit assignment • define dist_{Lx,Ly}(x, y) for every x ∈ Lx, y ∈ Ly • dist1 + dist2 • where dist2 = dist(x_y, y) is a distance of two values from the same level of hierarchy, x_y being the ancestor of x in level Ly • special case: if y is an ancestor of x, then dist2 = 0
Distances in different levels of Hierarchy • dist3 + dist4 • where dist4 = dist(x, y_x) is a distance of two values from the same level of hierarchy, y_x being a descendant of y in level Lx • special case: if y is an ancestor of x, then dist4 = 0
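The two routes, dist1 + dist2 through the ancestor of x and dist3 + dist4 through a descendant of y, can be sketched as follows. How the cross-level legs dist1 and dist3 are measured is not specified, so the sketch takes them as a caller-supplied function `vertical`. That function, the toy hierarchy and all other names are assumptions made for illustration.

```python
def via_ancestor(x, y, ancestor, vertical, dist_ly):
    """dist1 + dist2: climb from x to its ancestor x_y, then compare x_y and y in Ly."""
    x_y = ancestor[x]
    return vertical(x, x_y) + dist_ly(x_y, y)

def via_descendant(x, y, pick_descendant, vertical, dist_lx):
    """dist3 + dist4: descend from y to a descendant y_x, then compare x and y_x in Lx."""
    y_x = pick_descendant(y, x)
    return vertical(y_x, y) + dist_lx(x, y_x)

# Hypothetical two-level hierarchy: c1, c2 belong to A and c3 belongs to B.
ancestor = {"c1": "A", "c2": "A", "c3": "B"}
children = {"A": ["c1", "c2"], "B": ["c3"]}

zero_one = lambda a, b: 0 if a == b else 1   # same-level distance, assumed
one_hop = lambda low, high: 1                # cost of one level change, assumed

def pick_descendant(y, x):
    """Prefer x itself when it is a descendant of y, else take the first child."""
    return x if x in children[y] else children[y][0]

print(via_ancestor("c1", "B", ancestor, one_hop, zero_one))
print(via_ancestor("c1", "A", ancestor, one_hop, zero_one))        # dist2 = 0
print(via_descendant("c1", "A", pick_descendant, one_hop, zero_one))  # dist4 = 0
```

When y is an ancestor of x, both routes collapse to the single cross-level leg, which matches the two special cases stated above.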
Distances in different levels of Hierarchy • With respect to the detailed level • Let x1 and y1 be values of the detailed level chosen for x and y • where dist(x1, y1) is a distance of two values from the same level of hierarchy
Distances in different levels of Hierarchy • With respect to their common ancestor • Let Lz be the level of hierarchy where x and y have their first common ancestor • the distance is the number of “hops” needed to reach the first common ancestor • normalized according to the height of the level
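Counting hops to the first common ancestor can be sketched with a parent map. The exact normalization is not given in the text. Dividing by twice the height of the hierarchy, as done below, is only one assumed possibility. The geography data and the function names are illustrative.

```python
def path_to_root(v, parent):
    """The value v followed by all of its ancestors, bottom-up."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def first_common_ancestor(x, y, parent):
    """Return (z, hops from x to z, hops from y to z) for the first common ancestor z."""
    hops_from_y = {v: i for i, v in enumerate(path_to_root(y, parent))}
    for hops_x, v in enumerate(path_to_root(x, parent)):
        if v in hops_from_y:
            return v, hops_x, hops_from_y[v]
    raise ValueError("x and y have no common ancestor")

def hop_distance(x, y, parent, height):
    """Total hops to the first common ancestor, divided by 2 * height (assumed)."""
    _, hops_x, hops_y = first_common_ancestor(x, y, parent)
    return (hops_x + hops_y) / (2 * height)

# Example hierarchy City < Country < Continent < ALL, so the height is 3.
parent = {
    "Athens": "Greece", "Patras": "Greece", "Rome": "Italy",
    "Greece": "Europe", "Italy": "Europe", "Europe": "ALL",
}

print(first_common_ancestor("Athens", "Italy", parent))
print(hop_distance("Athens", "Italy", parent, height=3))
print(hop_distance("Athens", "Greece", parent, height=3))
```

With this normalization the distance lies between 0 and 1, and a value is closer to its own ancestor than to a value in a different branch.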
Distances in different levels of Hierarchy • Highway distance • Let every level Li be clustered into k_i clusters, where every cluster has its own representative r_{k_i} • Attribute Based • every value v ∈ dom(L) of a level L is described by attributes [v_1 … v_n] • Distance can be defined with respect to the attributes
Types of Levels • Nominal (=) • values hold the distinctness property • values can be explicitly distinguished • Ordinal (<, >) • values hold the distinctness property & the order property • values abide by an order • Interval (+, −) • values hold the distinctness, order & addition properties • a unit of measurement exists • the difference between two values is meaningful