
Special Topics in Learning

Special Topics in Learning. Reinaldo Bianchi, Centro Universitário da FEI, 2012. Lecture 4, Part B. The K-means algorithm. K-Means: a well-known algorithm for clustering patterns, used when the number of clusters can be defined.


Presentation Transcript


  1. Special Topics in Learning • Reinaldo Bianchi • Centro Universitário da FEI • 2012

  2. Lecture 4, Part B

  3. The K-means algorithm

  4. K-Means • A well-known algorithm for clustering patterns. • Used when the number of clusters can be defined: • Choose the desired number of clusters. • Choose the cluster centers and members so as to minimize the error. • This cannot be done by exhaustive search: • too many parameters.

  5. K-Means • Algorithm: • Fix the cluster centers. • Allocate each point to the nearest cluster. • Recompute each cluster center as the mean of the points it represents. • Repeat until the centers stop moving.
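
The four steps on this slide can be sketched in a few lines of plain Python. This is only a minimal illustration, not the lecture's own code: the function name `kmeans`, the `max_iter` safeguard, and the rule that an empty cluster keeps its previous center are assumptions made here, and Euclidean distance is assumed as the distance measure.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain K-means on a list of equal-length tuples.

    `centers` holds the initial cluster centers, so K = len(centers).
    Returns (labels, centers) once the centers stop moving.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Allocate each point to the nearest center (Euclidean distance).
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Recompute each center as the mean of the points it represents.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
                continue
            new_centers.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
            )
        # Stop when the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Illustrative data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # cluster index of each point
print(centers)  # final cluster centers
```

The initial centers are passed in explicitly so that the result is reproducible; choosing them at random is an equally valid way to "fix the centers".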

  6. K-Means • Can be used with any attribute for which a distance can be computed…

  7. Clustering • Partitioning Clustering Approach: • a typical clustering analysis approach that partitions the data set iteratively • construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters is given in advance) • in principle, partitions are achieved by minimising the sum of squared distances within each cluster
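
The "sum of squared distances within each cluster" can be written out explicitly. The notation below is an assumption introduced here for illustration (the transcript gives no formula): K is the number of clusters, C_k is the set of points in cluster k, and m_k is the mean of cluster k.

```latex
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - m_k \rVert^2,
\qquad
m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```

A partitioning method of this kind looks for the assignment of points to clusters that makes J as small as possible.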

  8. Clustering • Given K, find a partition into K clusters that optimises the chosen partitioning criterion: • Global optimum: exhaustively enumerate all partitions • Heuristic method: K-means algorithm (MacQueen’67): • each cluster is represented by the center of the cluster and the algorithm converges to stable cluster centers.

  9. Algorithm • Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation: • Initialisation: set seed points • Step 1: assign each object to the cluster with the nearest seed point • Step 2: compute the seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster) • Step 3: go back to Step 1; stop when no assignment changes

  10. Example • Suppose we have 4 types of medicine and each has two attributes: • pH and • weight index. • Our goal is to group these objects into K=2 groups of medicines.

  11. Example • [Figure: the four objects A, B, C and D]

  12. Step 1: Use initial seed points for partitioning • Assign each object to the cluster with the nearest seed point • Distance measure: Euclidean distance

  13. Step 2: Compute new centroids of the current partition • Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.

  14. Step 2: Renew membership based on new centroids • Compute the distance of all objects to the new centroids • Assign the membership to objects

  15. Step 3: Repeat the first two steps until convergence • Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.

  16. Repeat the first two steps until convergence • Compute the distance of all objects to the new centroids • Stop because there is no new assignment
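
The walk-through in slides 10 to 16 can be traced in code. The transcript does not preserve the coordinates of the four medicines or the initial seed points, so the values below are hypothetical stand-ins chosen only to make the steps concrete: four (pH, weight index) pairs, with objects A and B assumed as the two initial seeds.

```python
import math

# Hypothetical (pH, weight index) values; the slide's real values are not in the transcript.
medicines = {"A": (1.0, 1.0), "B": (2.0, 1.0), "C": (4.0, 3.0), "D": (5.0, 4.0)}
names = list(medicines)
centroids = [medicines["A"], medicines["B"]]  # assumed initial seed points
membership = None
history = []

while True:
    # Step 1: assign each object to the nearest centroid (Euclidean distance).
    new_membership = [
        min((0, 1), key=lambda k: math.dist(medicines[n], centroids[k]))
        for n in names
    ]
    if new_membership == membership:
        break  # no new assignment: converged
    membership = new_membership
    # Step 2: recompute each centroid as the mean of its members.
    for k in (0, 1):
        members = [medicines[n] for n, m in zip(names, membership) if m == k]
        centroids[k] = (
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        )
    history.append((list(membership), list(centroids)))

# Step 3 is the loop itself; print what happened at each pass.
for step, (groups, cents) in enumerate(history, start=1):
    print(step, dict(zip(names, groups)), cents)
```

With these assumed values, B starts in the second group and moves to the first after the first centroid update, and the loop stops with the groups {A, B} and {C, D}.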

  17. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5)

  18. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5) • Randomly guess K cluster centre locations

  19. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5) • Randomly guess K cluster centre locations • Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points)

  20. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5) • Randomly guess K cluster centre locations • Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points) • Each centre finds the centroid of the points it owns

  21. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5) • Randomly guess K cluster centre locations • Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points) • Each centre finds the centroid of the points it owns • …and jumps there

  22. K-means Demo • The user sets the number of clusters they’d like (e.g. K=5) • Randomly guess K cluster centre locations • Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points) • Each centre finds the centroid of the points it owns • …and jumps there • …Repeat until terminated!

  23. K-means example in Matlab

  24. K-means example on the iPad

  25. Relevant Issues • Efficient in computation • O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations. Normally, K, t << n. • Local optimum • sensitive to initial seed points • may converge to a local optimum that is an unwanted solution
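
The sensitivity to seed points can be seen on a toy data set. The sketch below is an illustration under assumptions made here: the four points, the two seed choices, and the helper names `kmeans` and `sse` are not from the slides. It runs the same minimal K-means twice and compares the sum of squared errors of the two stable partitions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Minimal K-means; returns (labels, centers)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its old center (assumption).
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if updated == centers:
            break
        centers = updated
    return labels, centers


def sse(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Four corners of a wide rectangle (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = kmeans(points, centers=[(0.0, 0.0), (4.0, 0.0)])  # seeds on different sides
bad = kmeans(points, centers=[(0.0, 0.0), (0.0, 1.0)])   # seeds on the same side

print(sse(points, *good))  # left/right split
print(sse(points, *bad))   # top/bottom split, a worse local optimum
```

Both runs stop because no assignment changes, yet the second partition has a much larger error. A common remedy is to run the algorithm from several different seeds and keep the partition with the lowest error.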

  26. Relevant Issues • Other problems • Need to specify K, the number of clusters, in advance • Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm) • Not suitable for discovering clusters with non-convex shapes • Applicable only when the mean is defined; categorical data need a different method (the K-Modes algorithm)

  27. Cluster Validity • With different initial conditions, the K-means algorithm may produce different partitions of a given data set. • Which partition is the “best” one for the given data set? • In theory, there is no answer to this question, as no ground truth is available in unsupervised learning.

  28. Cluster Validity • Example: the ratio of the total between-cluster distance to the total within-cluster distance: • Between-cluster distance (BCD): the distance between the means of two clusters • Within-cluster distance (WCD): the sum of all distances between the data points and the mean in a specific cluster • A large BCD:WCD ratio suggests good compactness inside clusters and good separability among different clusters!
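
One possible reading of this ratio in code is sketched below. It is an interpretation, not the slide's definition: "total BCD" is taken here as the sum of distances between the means of every pair of clusters, "total WCD" as the sum over all clusters of point-to-mean distances, and the function names and sample data are made up for illustration.

```python
import math
from itertools import combinations


def mean(cluster):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def bcd_wcd_ratio(clusters):
    """Ratio of total between-cluster to total within-cluster distance."""
    means = [mean(c) for c in clusters]
    # BCD: distance between the means of every pair of clusters.
    bcd = sum(math.dist(a, b) for a, b in combinations(means, 2))
    # WCD: distances between each point and the mean of its own cluster.
    wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    return bcd / wcd  # undefined (division by zero) if every cluster is a single point


# Two partitions of the same four points (illustrative data).
tight = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
loose = [[(0.0, 0.0), (4.0, 0.0)], [(0.0, 1.0), (4.0, 1.0)]]
print(bcd_wcd_ratio(tight))
print(bcd_wcd_ratio(loose))
```

The compact, well-separated partition gets the larger ratio, which matches the slide's rule of thumb.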

  29. Conclusion • K-means algorithm is a simple yet popular method for clustering analysis • There are several variants of K-means to overcome its weaknesses • K-Medoids: resistance to noise and/or outliers • K-Modes: extension to categorical data clustering analysis • CLARA: dealing with large data sets • Mixture models (EM algorithm): handling uncertainty of clusters

  30. And in Matlab?

  31. And in Matlab? • Syntax: • IDX = kmeans(X,k) • Description: • Partitions the points in the n-by-p data matrix X into k clusters. • This iterative partitioning minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. • Returns an n-by-1 vector IDX containing the cluster indices of each point.

  32. RANSAC

  33. RANSAC • RANdom SAmple Consensus. • An alternative way of finding good points from which to fit the line. • Idea: • Choose a subset uniformly at random (the support points). • Fit the line to these points. • Everything far from the fit is noise. • Repeat many times and choose the best fit.
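
The idea can be sketched for 2-D line fitting as follows. This is a minimal illustration under assumptions made here, not the lecture's implementation: the function name `ransac_line`, its parameters (`n_iter`, `threshold`, `seed`), the choice of two points per sample, and the rule "best fit = most points within the threshold" are all introduced for the example.

```python
import math
import random


def ransac_line(points, n_iter=200, threshold=0.5, seed=0):
    """Fit a line a*x + b*y + c = 0 by RANSAC; returns ((a, b, c), inliers)."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(n_iter):
        # Pick a random minimal subset: two support points define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2
        norm = math.hypot(a, b)
        if norm == 0:
            continue  # identical points do not define a line
        a, b = a / norm, b / norm
        c = -(a * x1 + b * y1)
        # Points far from the fitted line are treated as noise.
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) <= threshold]
        # Keep the fit supported by the most points.
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers (illustrative data).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(1.0, 15.0), (7.0, -4.0), (3.0, 30.0)]
line, support = ransac_line(data)
print(len(support))  # number of points supporting the best line
```

The implicit form a*x + b*y + c = 0 is used so that vertical lines need no special case. In practice the line is often refitted to all inliers (for example by least squares) after the best sample has been found.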

  34. RANSAC • Problems: • How many times should it run? • As few as possible… • How large should the subset be? • As small as possible… • What counts as close? • Estimating the order of magnitude is enough… • What is a good fit? • One where the number of nearby points is so large that it is unlikely that all of them are noise.

  35. RANSAC – Example • [Figure: two candidate fits, one with 11 supports and one with 4 supports] • How many samples do we need to draw?

  36. RANSAC – How many samples • How many samples do we need to ensure, with probability p, that at least one of the random samples of S points is free from outliers? (w: inlier probability)
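
The transcript does not preserve the formula from this slide. The standard argument, stated here as an assumption about what the slide showed, is that a sample of S independently drawn points is outlier-free with probability w^S, so N samples all contain an outlier with probability (1 - w^S)^N; requiring this to be at most 1 - p gives N >= log(1 - p) / log(1 - w^S). The function name below is hypothetical.

```python
import math


def ransac_num_samples(p, w, s):
    """Smallest N with 1 - (1 - w**s)**N >= p, for 0 < p < 1 and 0 < w < 1.

    p: desired probability that at least one sample is outlier-free
    w: probability that a single point is an inlier
    s: number of points per sample
    """
    # A sample of s points is all-inlier with probability w**s (assuming
    # independent draws), so N samples all fail with probability (1 - w**s)**N.
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))


print(ransac_num_samples(p=0.99, w=0.5, s=2))  # 17
```

The count grows quickly with the sample size s, which is one reason to keep the random subset as small as possible.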

  37. The RANSAC Song

  38. Conclusion

  39. Conclusion • We have finished covering the purely statistical machine learning methods. • K-NN, Least Squares, PCA, LDA, k-Means • Starting with the next lecture, we will look at methods that are no longer statistical but probabilistic.

  40. Links • Examples taken from: • www.cs.manchester.ac.uk/ugt/FCOMP24111/materials/slides/K-means.ppt
