Probability Densities in Data Mining

Probability Densities in Data Mining Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599

Contenido • Porque son importantes. • Notacion y fundamentos de PDF continuas. • PDFs multivariadas continuas. • Combinando variables aleatorias discretas y continuas.

Porque son importantes? • Real Numbers occur in at least 50% of database records • Can’t always quantize them • So need to understand how to describe where they come from • A great way of saying what’s a reasonable range of values • A great way of saying how multiple attributes should reasonably co-occur

Porque son importantes? • Can immediately get us Bayes Classifiers that are sensible with real-valued data • You’ll need to intimately understand PDFs in order to do kernel methods, clustering with Mixture Models, analysis of variance, time series and many other things • Will introduce us to linear and non-linear regression

A PDF of American Ages in 2000

Poblacion de PR por grupo de edad group freq midpoint freq.rela 0-4 284593 2.5 0.0737516 5-9 301424 7.5 0.0781133 10-14 305025 12.5 0.0790465 15-19 305577 17.5 0.0791895 20-24 299362 22.5 0.0775789 25-29 277415 27.5 0.0718914 30-34 262959 32.5 0.0681452 35-39 265154 37.5 0.0687140 40-44 258211 42.5 0.0669147 45-49 239965 47.5 0.0621863 50-54 233597 52.5 0.0605361 55-59 206552 57.5 0.0535274 60-64 169796 62.5 0.0440022 65-69 141869 67.5 0.0367650 70-74 112416 72.5 0.0291323 75-79 85137 77.5 0.0220630 80-84 57953 82.5 0.0150184 85+ 51801 87.5 0.0134241

A PDF of American Ages in 2000 Let X be a continuous random variable. If p(x) is a Probability Density Function for X then… = 0.36

Properties of PDFs That means…

Donde x-h/2<w<x+h/2). Luego, Asi p(w) tiende a p(x) cuando h tiende a cero

Funcion de distribucion acumulativa. Esta es una funcion No decreciente Notar que p(w) tiende a p(x) cuando h tiende a 0 Se ha mostrado que la derivada de la funcion de distribucion da la funcion de densidad

Properties of PDFs Therefore… Therefore… La dcerivada de una fucnion no dcecreciente es mayor o igual que cero.

Cual es el significado de p(x)? Si p(5.31) = 0.06 and p(5.92) = 0.03 Entonces cuando un valor de X es muestreado de la distribucion, es dos veces mas probable que X este mas cerca a 5.31 que a 5.92.

Yet another way to view a PDF • A recipe for sampling a random age. • Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age,d) • If d < p(age) stop and return age • Else try again: go to Step 1.

Test your understanding • True or False:

Expectations E[X] = the expected value of random variable X = the average value we’d see if we took a very large number of random samples of X

Expectations E[X] = the expected value of random variable X = the average value we’d see if we took a very large number of random samples of X E[age]=35.897 = the first moment of the shape formed by the axes and the blue curve = the best value to choose if you must guess an unknown person’s age and you’ll be fined the square of your error

Expectation of a function m=E[f(X)] = the expected value of f(x) where x is drawn from X’s distribution. = the average value we’d see if we took a very large number of random samples of f(X) Note that in general:

Variance s2 = Var[X] = the expected squared difference between x and E[X] = amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally

Standard Deviation s2 = Var[X] = the expected squared difference between x and E[X] = amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally s = Standard Deviation = “typical” deviation of X from its mean

Estadisticas para PR • E(edad)=35.17 • Var(edad)=501.16 • Desv.Est(edad)=22.38

Funciones de densidad mas conocidas

Funciones de densidad mas conocidas • La densidad uniforme o rectangular • La densidad triangular • La densidad exponencial • La densidad Gamma y la Chi-square • La densidad Beta • La densidad Normal o Gaussiana • Las densidades t y F.

La distribucion rectangular 1/w -w/2 0 w/2

La distribucion triangular 0

The Exponential distribution

La distribucion Normal Estandar

La distribucion Normal General s=15 m=100

General Gaussian Also known as the normal distribution or Bell-shaped curve s=15 m=100 Shorthand: We say X ~ N(m,s2) to mean “X is distributed as a Gaussian with parameters m and s2”. In the above figure, X ~ N(100,152)

The Error Function Assume X ~ N(0,1) Define ERF(x) = P(X<x) = Cumulative Distribution of X

Using The Error Function Assume X ~ N(m,s2) P(X<x| m,s2) =

The Central Limit Theorem • If (X1,X2, … Xn) are i.i.d. continuous random variables • Then define • As n-->infinity, p(z)--->Gaussian with mean E[Xi] and variance Var[Xi] Somewhat of a justification for assuming Gaussian noise is common

Estimadores de funcion de densidad • Histograms • K-nearest neighbors: • Kernel density estimators h ancho de clase dk es la distancia hasta el k-esimo vecino

> x=c( 7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9) > hist(x,freq=F,main="Estimacion de funcion de densidad-histograma") > rug(x,col=2)

Estimacion de densidad por knn en 20 pts con k=1,3,5,7

Estimación por kernels de una función de densidad univariada. En el caso univariado, el estimador por kernels de la función de densidad f(x) se obtiene de la siguiente manera. Consideremos que x1,…xn es una variable aleatoria X con función de densidad f(x), definamos la función de distribución empirica por el cual es un estimador de la función de distribución acumulada F(x) de X. Considerando que la función de densidad f(x) es la derivada de la función de distribución F y usando aproximación para derivada se tiene que

donde h es un valor positivo cercano a cero. Lo anterior es equivalente a la proporción de puntos en el intervalo (x-h, x+h) dividido por 2h. La ecuación anterior puede ser escrita como: donde la función peso K está definida por 0 si |z|>1 K(z)= 1/2 si |z| 1

Muestra: 6, 8, 9 12, 20, 25,18, 31

este es llamado el kernel uniforme y h es llamado el ancho de banda el cual es un parámetro de suavización que indica cuanto contribuye cada punto muestral al estimado en el punto x. En general, K y h deben satisfacer ciertas condiciones de regularidad, tales como: K(z) debe ser acotado y absolutamente integrable en (-,) Usualmente, pero no siempre, K(z)0 y simétrico, luego cualquier función de densidad simétrica puede usarse como kernel.

Eleccion del ancho de banda h Donde n es el numero de datos y s la desviacion estandar de la muestra.

EL KERNEL GAUSSIANO Eneste caso el kernel representa una función peso más suave donde todos los puntos contribuyen al estimado de f(x) en x. Es decir,

EL KERNEL TRIANGULAR K(z)=1- |z| para |z|<1, 0 en otro caso. EL KERNEL "BIWEIGHT" 15/16(1-z2)2 para |z|<1 K(z)= 0 en otro caso

EL KERNEL EPANECHNIKOV para |z|< K(z)= 0 en otro caso

Estimacion de densidad en 20 pts usando kernel gaussiano con h=.5,”opt1”,”opt2”, 4

Variables aleatorias bidimensionales p(x,y) = probability density of random variables (X,Y) at location (x,y)

Estimadores de funcion de densidad bi-dimensionales • Histogramas • K-nearest neighbors: • Kernel density estimators A area de la clase Ak es el area ncluyendo hasta el k-esimo vecino

Estimacion de kernel bivariado Sean xi=(x,y) los valores observados y t=(t1,t2) un punto del plano donde se desea estimar la densidad conjunta Si h1=h2=h

(a1, a2)H-1 donde

Estimacion de densidad-Kernel Gaussiano bivariado

f1= kde2d(autompg1$V1, autompg1$V5,n=100) persp(f1$x,f1$y,f1$z)

Probability Densities in Data Mining