Distributions cont.: Continuous and Multivariate
Distribution, numeric attribute • Continuous data potentially has an infinite domain • probability of any specific value is zero • probabilities are defined over intervals, e.g. (-∞, x] • Cumulative distribution function (CDF) • F_X(x) = P(X ≤ x) • Probability density function (PDF) • first derivative of the CDF • relative density of points for each value • density is not probability
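The relation between CDF and PDF can be checked numerically. The sketch below is only an illustration: it assumes a normal distribution with mean 0 and standard deviation 0.1 (not taken from the slides) and uses `statistics.NormalDist` from the Python standard library. It compares a finite-difference derivative of the CDF with the PDF, and shows that a density value can exceed 1 while interval probabilities stay within [0, 1].

```python
from statistics import NormalDist

def numeric_derivative(cdf, x, h=1e-5):
    """Central finite-difference approximation of d/dx cdf(x)."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)

# Assumed example distribution: narrow normal, so the density is large near 0.
dist = NormalDist(mu=0.0, sigma=0.1)

x = 0.05
density_from_cdf = numeric_derivative(dist.cdf, x)  # slope of the CDF at x
density_exact = dist.pdf(x)                         # PDF at x

# Probability of an interval is a difference of CDF values.
interval_prob = dist.cdf(0.1) - dist.cdf(-0.1)

# Density at the mode: larger than 1, so it cannot be a probability.
peak_density = dist.pdf(0.0)

print(density_from_cdf, density_exact)
print(interval_prob, peak_density)
```

The interval probability is what the CDF delivers directly; the density only becomes a probability after integrating it over an interval.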
Histograms • Estimate density in a discrete way • Define cut points and count occurrences within bins • How to choose cut points • equal width: cut the domain (min → max) into k equal-size intervals • equal height: select cut points such that all k bins contain (approximately) n/k data points
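The two ways of choosing cut points can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the sample data are hypothetical, and ties in the data can make equal-height bins only approximately equal in size.

```python
import bisect

def equal_width_cuts(values, k):
    """k - 1 interior cut points splitting [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_height_cuts(values, k):
    """k - 1 interior cut points so each bin holds roughly n/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def bin_counts(values, cuts):
    """Count occurrences per bin; a value equal to a cut goes to the upper bin."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    return counts

# Hypothetical skewed sample (n = 12), split into k = 4 bins.
values = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 34]

width_cuts = equal_width_cuts(values, 4)
height_cuts = equal_height_cuts(values, 4)
width_counts = bin_counts(values, width_cuts)
height_counts = bin_counts(values, height_cuts)

print(width_cuts, width_counts)
print(height_cuts, height_counts)
```

On skewed data like this, equal-width bins put most points into the first bin, while equal-height bins adapt their width to where the data is dense.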
Kernel Density Estimation • Estimating the density (of the population) from the sample • Observed data is smoothed over numeric domain by means of a kernel (often Gaussian)
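A Gaussian kernel density estimate can be written in a few lines. The sketch below assumes a fixed bandwidth chosen by hand (0.5) and a small hypothetical sample; bandwidth selection is a separate problem that is not covered here.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(sample, x, bandwidth):
    """Density estimate at x: average of kernels centred on each data point."""
    n = len(sample)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
    return total / (n * bandwidth)

# Hypothetical sample with two clusters; the bandwidth is an assumed value.
sample = [1.0, 1.5, 2.0, 4.0, 4.2]
bandwidth = 0.5

dense_point = kde(sample, 1.5, bandwidth)   # inside the first cluster
sparse_point = kde(sample, 3.0, bandwidth)  # in the gap between clusters

# The estimate should integrate to about 1 (simple Riemann sum on a grid).
step = 0.01
grid = [-5 + step * i for i in range(1501)]
area = sum(kde(sample, g, bandwidth) for g in grid) * step

print(dense_point, sparse_point, area)
```

A smaller bandwidth follows the sample more closely, a larger one smooths more strongly.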
Entropy of continuous attribute • Differential entropy • Generalisation of entropy to the continuous case is somewhat problematic • Uniform distribution over [0, a]: H(X) = lg(a) • a = ½ ⇒ H(X) = lg(½) = −1? (differential entropy can be negative)
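The uniform example can be reproduced numerically. The sketch below assumes the usual definition of differential entropy, the integral of −f(x) lg f(x) over the domain, and approximates it with a midpoint rule; the helper names are hypothetical.

```python
import math

def differential_entropy(pdf, lo, hi, steps=10_000):
    """Midpoint-rule approximation of -integral of pdf(x) * log2(pdf(x)) dx."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:
            total -= p * math.log2(p) * dx
    return total

def uniform_pdf(a):
    """Density of the uniform distribution over [0, a]."""
    return lambda x: 1.0 / a if 0 <= x <= a else 0.0

h_half = differential_entropy(uniform_pdf(0.5), 0.0, 0.5)  # about lg(1/2)
h_one = differential_entropy(uniform_pdf(1.0), 0.0, 1.0)   # about lg(1)
h_two = differential_entropy(uniform_pdf(2.0), 0.0, 2.0)   # about lg(2)

print(h_half, h_one, h_two)
```

For a = ½ the density is 2 everywhere on the interval, so lg of the density is positive and the entropy comes out negative, which cannot happen for discrete entropy.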
Joint distributions • How frequent are combinations of values? • Confusion matrix (contingency table, cross table) • counts each combination • complete information • 2 attributes: how informative is one attribute about the other? • Quantifying information between attributes: joint entropy, mutual information, information gain, … • [Figure: cross table of X and Y; the margin shows the univariate (marginal) distribution of X]
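As an illustration of quantifying information between two attributes, the sketch below computes marginal entropies, joint entropy and mutual information from a contingency table of counts, using I(X;Y) = H(X) + H(Y) − H(X,Y). The tables are hypothetical, and rows are assumed to index X and columns Y.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(counts):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a table of counts (rows X, columns Y)."""
    n = sum(sum(row) for row in counts)
    joint = [[c / n for c in row] for row in counts]
    p_x = [sum(row) for row in joint]        # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    h_xy = entropy(p for row in joint for p in row)
    return entropy(p_x) + entropy(p_y) - h_xy

# Hypothetical tables: partial dependence, and X fully determining Y.
mi_partial = mutual_information([[40, 10], [10, 40]])
mi_full = mutual_information([[50, 0], [0, 50]])

print(mi_partial, mi_full)
```

When X fully determines Y, the mutual information equals the entropy of Y, here 1 bit.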
Some joint distributions • X and Y are independent • 0.48 = 0.6 × 0.8 • 0.12 = 0.6 × 0.2 • 0.32 = 0.4 × 0.8 • 0.08 = 0.4 × 0.2 • Y depends on X • higher counts along diagonal • both diagonals possible • X fully determines Y
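The independence example can be verified mechanically: X and Y are independent when every joint probability equals the product of its two marginals. The sketch below uses the four probabilities from the slide; the row and column layout (rows for X, columns for Y) and the second, dependent table are assumptions for illustration.

```python
def marginals(joint):
    """Row sums and column sums of a joint probability table."""
    p_rows = [sum(row) for row in joint]
    p_cols = [sum(col) for col in zip(*joint)]
    return p_rows, p_cols

def is_independent(joint, tol=1e-9):
    """True if every cell equals the product of its row and column marginal."""
    p_rows, p_cols = marginals(joint)
    return all(
        abs(joint[i][j] - p_rows[i] * p_cols[j]) <= tol
        for i in range(len(p_rows))
        for j in range(len(p_cols))
    )

# Probabilities from the slide; marginals are (0.6, 0.4) and (0.8, 0.2).
independent_table = [[0.48, 0.12],
                     [0.32, 0.08]]

# Hypothetical table with more mass on the diagonal.
dependent_table = [[0.40, 0.10],
                   [0.10, 0.40]]

print(marginals(independent_table))
print(is_independent(independent_table), is_independent(dependent_table))
```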
Capturing multivariate continuous distributions • 2 dimensions • Problematic in higher dimensions
Joint distribution over numeric × binary • Of specific relevance in Data Mining • classification • How does the class (T/F) depend on a numeric attribute?
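One simple way to look at the dependence of a binary class on a numeric attribute is to compare the fraction of T below and above a threshold. The sketch below is only one possible view, with hypothetical data and a hand-picked threshold; it is not presented as the method used in the slides.

```python
def class_rate_by_threshold(xs, labels, threshold):
    """Fraction of True labels for x <= threshold and for x > threshold."""
    below = [lab for x, lab in zip(xs, labels) if x <= threshold]
    above = [lab for x, lab in zip(xs, labels) if x > threshold]

    def rate(labs):
        # NaN signals an empty side of the split.
        return sum(labs) / len(labs) if labs else float("nan")

    return rate(below), rate(above)

# Hypothetical data: the class tends to be True for larger x.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
labels = [False, False, False, True, False, True, True, True]

rate_below, rate_above = class_rate_by_threshold(xs, labels, threshold=2.75)
print(rate_below, rate_above)
```

A large difference between the two rates suggests that the numeric attribute is informative about the class at that threshold.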