ITEC4310 Applied Artificial Intelligence Lecture 2 Clustering
Machine Learning • Unsupervised Learning • Supervised Learning • Everything is data • Everything is optimization • UPEI and U Havana slides • Slideshare - Tilani Gunawardena
Supervised vs. Unsupervised [two scatter plots over features x1 and x2] • Supervised: data has labels • Unsupervised: just the data
Supervised vs. Unsupervised • Unsupervised: we only have a set of data, without any further information • Goal: to discover “interesting structures” in the data • Supervised: the data has labels • We have training examples that allow us to train an algorithm • Goal: to correctly predict the class/value of a new sample
Unsupervised Learning • Unsupervised learning is arguably more typical of human and animal learning. • It is also more widely applicable than supervised learning, since it does not require a human expert to manually label the data. • Labeled data is not only expensive to acquire, but it also contains relatively little information, certainly not enough to reliably estimate the parameters of complex models.
“When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.” — Geoffrey Hinton
What is a good clustering? • A good clustering will yield clusters with • High intra-cluster similarity • Low inter-cluster similarity • The quality of the result will depend on the clustering method and the similarity measure used. • There are different ways to measure the quality of a clustering. The goal is to discover the “hidden patterns”…
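The two criteria above can be checked numerically: points within a cluster should be close together, and points in different clusters far apart. A minimal sketch, using made-up 2-D data (the cluster points and function names below are illustrative, not from the lecture):

```python
# Hypothetical 2-D points grouped into two clusters (made-up data).
cluster_a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
cluster_b = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]

def euclidean(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

def mean_pairwise(points_a, points_b):
    """Average distance over every pair of distinct points from the two sets."""
    pairs = [(p, q) for p in points_a for q in points_b if p != q]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

intra = mean_pairwise(cluster_a, cluster_a)  # within-cluster distance (similarity: high)
inter = mean_pairwise(cluster_a, cluster_b)  # between-cluster distance (similarity: low)
print(intra < inter)  # a good clustering: tight clusters, far apart
```

Here the average within-cluster distance is much smaller than the between-cluster distance, which is exactly the "high intra, low inter" pattern a good clustering should exhibit.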
Aspects of clustering • There might be more than one correct answer. • You may have one or more (useful) similarity measures, e.g. Euclidean distance, Manhattan distance, Mahalanobis distance, Pearson correlation… • Clustering is not always performed on real-valued vectors. • Almost never in a two-dimensional space.
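Two of the similarity measures named above are easy to compute directly; a small sketch with made-up points (Mahalanobis and Pearson additionally need a covariance matrix or paired samples, so they are omitted here):

```python
import math

p, q = (1.0, 2.0), (4.0, 6.0)

euclidean = math.dist(p, q)                        # straight-line distance
manhattan = sum(abs(a - b) for a, b in zip(p, q))  # "city-block" distance

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

Note that the two measures can rank pairs of points differently, so the choice of measure changes which clustering the algorithm finds.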
K-Means An iterative clustering algorithm • Pick K random points as cluster centers (means) • Alternate: • Assign each data instance to the closest mean • Move each mean to the average of its assigned points • Stop when no point’s assignment changes
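The steps above can be sketched in a few lines of pure Python. This is a minimal illustration with made-up data, not the lecture's implementation; all names are ours:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: random initial means, then alternate
    assignment and mean-update until assignments stop changing."""
    rng = random.Random(seed)
    means = rng.sample(points, k)           # 1. pick K random points as means
    assignments = None
    for _ in range(iters):
        # 2a. assign each point to its closest mean
        new_assignments = [
            min(range(k), key=lambda j: dist2(p, means[j])) for p in points
        ]
        if new_assignments == assignments:  # 3. stop when nothing changes
            break
        assignments = new_assignments
        # 2b. move each mean to the average of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignments

# Two well-separated blobs (made-up data).
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
means, labels = kmeans(pts, 2)
print(labels)  # points within each blob share a label
```

On this toy data the algorithm converges in a couple of iterations, with one mean per blob.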
K-Means as Optimization • Consider the total distance to the means: φ(a, c) = Σᵢ dist(xᵢ, c_aᵢ), where the xᵢ are the points, c are the means, and a are the assignments • Each iteration reduces φ • Two stages each iteration: • Update assignments: fix means c, change assignments a • Update means: fix assignments a, change means c
Phase I: Update Assignments • For each point, re-assign it to the closest mean: aᵢ = argminⱼ dist(xᵢ, cⱼ) • Can only decrease the total distance φ
Phase II: Update Means • Move each mean to the average of its assigned points: cⱼ = (1/|{i : aᵢ = j}|) Σ_{i : aᵢ = j} xᵢ • Can only decrease the total distance φ
K-Means Getting Stuck • A local optimum: why doesn’t this work out like the earlier example? • K-means requires initial means, and it matters which ones you pick!
Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram)
Agglomerative Clustering • Starts with each instance in its own cluster, then repeatedly joins the two clusters that are most similar, until only one cluster remains.
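The merge loop above can be sketched directly. This minimal pure-Python version uses single linkage (cluster distance = distance between the closest pair of points; the lecture does not fix a particular linkage, so this is an assumption), and stops when k clusters remain rather than continuing to a single cluster:

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    """Distance between clusters = distance of their closest pair of points."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, k=1):
    """Start with one cluster per point; repeatedly merge the two
    closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.1)]
print(agglomerative(pts, k=2))  # two clusters, one per blob
```

Stopping at k clusters corresponds to cutting the dendrogram at the level where k branches remain.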
Dendrogram: Hierarchical Clustering • A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster
Summary • For a dataset consisting of n points • O(n²) space -- requires storing the distance matrix • O(n³) time complexity • Advantages • Dendrograms are great for visualization • Provides hierarchical relations between clusters
Summary • Disadvantages • Not easy to choose the level at which to cut (i.e. the number of clusters) • Can never undo a merge once it is done • Sensitive to the cluster distance measure and to noise/outliers • Experiments have shown that other clustering techniques often outperform hierarchical clustering
Non-convex Clusters Need another technique…
Samples • Animations 1 • https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 • Animations 2 • https://www.youtube.com/watch?v=BVFG7fd1H30
Warning • Garbage in, garbage out • Clustering can create confirmation bias • You will find what you thought you were looking for!