Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. Discovering groups {week 11} from Programming Collective Intelligence by Toby Segaran, O'Reilly Media, 2007, ISBN 978-0-596-52932-1
Data clustering (i) • A cluster is a group of related things • Automatic detection of clusters is a powerful data discovery tool • Detect similar user interests, buying patterns, clickthrough patterns, etc. • Also applicable to the sciences • In computational biology, find groups (or clusters) of genes that exhibit similar behavior
Data clustering (ii) • Data clustering is an example of an unsupervised learning algorithm... • ...which is an AI technique for discovering structure within one or more datasets • The key goal is to find the distinct group(s) that exist within a given dataset • We don't know what we'll find
Data clustering (iii) We first need to identify a common set of numerical attributes that we can compare to see how similar two items are. Can we do anything with word frequencies?
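One concrete option: reduce each document to a table of word frequencies and compare those tables. A minimal sketch of such a counter (the function name and tokenization rule here are illustrative, not from the book):

```python
import re

def get_word_counts(text):
    """Split text into lowercase words and tally how often each occurs."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```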
Clustering blogs via feeds (i) • If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are... • ...similar in terms of blog content • ...similar in terms of writing style • ...of interest for searching, cataloging, etc.
Clustering blogs via feeds (ii) • A feed is a simple XML document containing information about a blog and its entries • Reader apps enable users to read multiple blogs in a single window • Being structured data, feeds are generally more search-friendly
Clustering blogs via feeds (iii) • Check out these feeds: • http://blogs.abcnews.com/theblotter/index.rdf • http://www.wired.com/rss/index.xml • http://www.tmz.com/rss.xml • http://scienceblogs.com/sample/combined.xml • http://www.neilgaiman.com/journal/feed/rss.xml
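To turn a feed like these into clusterable data, we can fetch and parse it with the feedparser library (the library the book uses for its feed-vector code) and tally word counts across entries. A sketch, reusing the hypothetical get_word_counts above:

```python
import feedparser

def feed_word_counts(url):
    """Parse a feed and tally word counts across all of its entries."""
    d = feedparser.parse(url)
    counts = {}
    for entry in d.entries:
        text = entry.get('title', '') + ' ' + entry.get('summary', '')
        for word, n in get_word_counts(text).items():
            counts[word] = counts.get(word, 0) + n
    return d.feed.get('title', url), counts
```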
Clustering blogs via feeds (iv) • Techniques for avoiding stop words: • Ignore words on a predefined stop list • Select words from within a predefined range of occurrence percentages • Lower bound of 10% • Upper bound of 50% • Tune as necessary
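A sketch of that occurrence-percentage filter, assuming blog_counts maps each blog name to its word-count dictionary (all names here are illustrative; the bounds are the tunable parameters from the slide):

```python
def select_words(blog_counts, lower=0.10, upper=0.50):
    """Keep only words appearing in between 10% and 50% of the blogs:
    rarer words are noise, more common ones are effectively stop words."""
    appearances = {}
    for counts in blog_counts.values():
        for word in counts:
            appearances[word] = appearances.get(word, 0) + 1
    n = len(blog_counts)
    return [w for w, c in appearances.items() if lower < c / n < upper]
```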
What next? • Study the resulting blog data • Identify any patterns in the data • Which blogs are very similar? • Which blogs are very different? • How can these techniques be applied to other types of search? • Web search? • Enterprise search?
Hierarchical clustering (i) • Hierarchical clustering is an algorithm that groups similar items together • At each iteration, the two most similar items (or groups) are merged • For example, given five items A–E: [diagram: items A, B, C, D, E scattered in the plane]
Hierarchical clustering (ii) • Calculate the distances between all items • Group the two items that are closest • Repeat! [diagram: the closest pair, A and B, merged into group AB]
Hierarchical clustering (iii) • How do we compare group AB to other items? • Use the midpoint of items A and B [diagram: AB represented by the midpoint x of A and B; D and E merged into DE, then AB and C merged into ABC]
Hierarchical clustering (iv) • When do we stop? • When we have a top-level group that includes all items [diagram: ABC and DE merged into the top-level group ABCDE]
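Putting the last four slides together, here is a sketch of the merge loop: compute all pairwise distances, merge the closest pair, and represent the merged group by the midpoint of its parents. This is a simplified take on the book's hcluster; the names and the returned merge list are illustrative:

```python
def hcluster(vectors, distance):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters until one remains, recording each merge (the merge
    order is what the dendrogram on the next slide displays)."""
    clusters = [(list(v), [i]) for i, v in enumerate(vectors)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters by centroid distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        # Represent the merged group by the midpoint of its parents
        midpoint = [(a + b) / 2 for a, b in zip(ci, cj)]
        merges.append((mi, mj, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((midpoint, mi + mj))
    return merges
```

Calling this with a distance function over the blog word vectors yields the merge sequence a dendrogram plotter would draw.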
Hierarchical clustering (v) • The hierarchical part is based on the discovery order of clusters • This diagram is called a dendrogram... [dendrogram: A and B join into AB, AB and C into ABC, D and E into DE, ABC and DE into ABCDE]
Hierarchical clustering (vi) • A dendrogram is a graph (or tree) • Distances between nodes of the dendrogram show how similar items (or groups) are • AB is closer (to A and B) than DE is (to D and E), so A and B are more similar than D and E • How can we define closeness?
Similarity scores • A similarity score compares two distinct elements from a given set • To measure closeness, we need to calculate a similarity score for each pair of items in the set • Options include: • The Euclidean distance score, which is based on the distance formula in two-dimensional geometry • The Pearson correlation score, which is based on fitting data points to a line
Euclidean distance score • To find the Euclidean distance between two data points, use the distance formula: distance = √((y₂ – y₁)² + (x₂ – x₁)²) • The larger the distance between two items, the less similar they are • So use the reciprocal of distance as a measure of similarity (but be careful of division by zero)
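A sketch of that reciprocal trick, written for n-dimensional vectors rather than just two coordinates; adding 1 to the distance sidesteps the division-by-zero case (the function name is illustrative):

```python
from math import sqrt

def euclidean_similarity(v1, v2):
    """Return 1/(1 + distance), so identical items score 1.0
    and the score shrinks toward 0 as the items grow apart."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return 1 / (1 + d)
```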
Pearson correlation score (i) • The Pearson correlation score is derived by determining the best-fit line for a given set of data points • The best-fit line, on average, comes as close as possible to each item • The Pearson correlation score is a coefficient measuring the degree to which items are on the best-fit line [scatter plot: data points in the (v1, v2) plane with a best-fit line]
Pearson correlation score (ii) • The Pearson correlation score tells us how closely items are correlated to one another • 1.0 is a perfect match; ~0.0 is no relationship [scatter plots: two examples, one with correlation score 0.4 and one with 0.8]
Pearson correlation score (iii) • The algorithm is: • Calculate sum(v1) and sum(v2) • Calculate the sum of the squares of v1 and v2 • Call them sum1Sq and sum2Sq • Calculate the sum of the products of v1 and v2 • (v1[0] * v2[0]) + (v1[1] * v2[1]) + ... + (v1[n-1] * v2[n-1]) • Call this pSum
Pearson correlation score (iv) • Calculate the Pearson score: r = (pSum – (sum(v1) * sum(v2)) / n) / √((sum1Sq – sum(v1)² / n) * (sum2Sq – sum(v2)² / n)) • Much more complex, but often better than the Euclidean distance score
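Translating the steps from the last two slides directly into code (a sketch; the variable names follow the slides, and the zero-denominator guard is my own addition for constant vectors):

```python
from math import sqrt

def pearson(v1, v2):
    """Compute the Pearson correlation coefficient of two
    equal-length numeric vectors using the running sums above."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(x ** 2 for x in v1)
    sum2Sq = sum(x ** 2 for x in v2)
    pSum = sum(x * y for x, y in zip(v1, v2))
    num = pSum - (sum1 * sum2) / n
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    return num / den if den != 0 else 0
```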
What next? • Review the blog-data dendrograms • Identify any patterns in the data • Which blogs are very similar? • Which blogs are very different? • How can these techniques be applied to other types of search? • Web search? • Enterprise search?