Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. Discovering groups {week 11} from Programming Collective Intelligence by Toby Segaran, O'Reilly Media, 2007, ISBN 978-0-596-52932-1
Data clustering (i) • A cluster is a group of related things • Automatic detection of clusters is a powerful data discovery tool • Detect similar user interests, buying patterns, clickthrough patterns, etc. • Also applicable to the sciences • In computational biology, find groups (or clusters) of genes that exhibit similar behavior
Data clustering (ii) • Data clustering is an example of an unsupervised learning algorithm... • ...which is an AI technique for discovering structure within one or more datasets • The key goal is to find the distinct group(s) that exist within a given dataset • We don't know what we'll find
Data clustering (iii) We first need to identify a common set of numerical attributes that we can compare to see how similar two items are. Can we do anything with word frequencies?
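One concrete option: reduce each document to a table of word frequencies and compare those tables. A minimal sketch of such a counter (the function name and tokenization rule here are illustrative, not from the book):

```python
import re

def get_word_counts(text):
    """Split text into lowercase words and tally how often each occurs."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```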
Clustering blogs via feeds (i) • If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are... • ...similar in terms of blog content • ...similar in terms of writing style • ...of interest for searching, cataloging, etc.
Clustering blogs via feeds (ii) • A feed is a simple XML document containing information about a blog and its entries • Reader apps enable users to read multiple blogs in a single window • Being structured data, feeds are generally more search-friendly
Clustering blogs via feeds (iii) • Check out these feeds: • http://blogs.abcnews.com/theblotter/index.rdf • http://www.wired.com/rss/index.xml • http://www.tmz.com/rss.xml • http://scienceblogs.com/sample/combined.xml • http://www.neilgaiman.com/journal/feed/rss.xml
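To turn a feed like these into clusterable data, we can fetch and parse it with the feedparser library (the library the book uses for its feed-vector code) and tally word counts across entries. A sketch, reusing the hypothetical get_word_counts above:

```python
import feedparser

def feed_word_counts(url):
    """Parse a feed and tally word counts across all of its entries."""
    d = feedparser.parse(url)
    counts = {}
    for entry in d.entries:
        text = entry.get('title', '') + ' ' + entry.get('summary', '')
        for word, n in get_word_counts(text).items():
            counts[word] = counts.get(word, 0) + n
    return d.feed.get('title', url), counts
```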
Clustering blogs via feeds (iv) • Techniques for avoiding stop words: • Ignore words on a predefined stop list • Select words from within a predefined range of occurrence percentages • Lower bound of 10% • Upper bound of 50% • Tune as necessary
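A sketch of that occurrence-percentage filter, assuming blog_counts maps each blog name to its word-count dictionary (all names here are illustrative; the bounds are the tunable parameters from the slide):

```python
def select_words(blog_counts, lower=0.10, upper=0.50):
    """Keep only words appearing in between 10% and 50% of the blogs:
    rarer words are noise, more common ones are effectively stop words."""
    appearances = {}
    for counts in blog_counts.values():
        for word in counts:
            appearances[word] = appearances.get(word, 0) + 1
    n = len(blog_counts)
    return [w for w, c in appearances.items() if lower < c / n < upper]
```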
What next? • Study the resulting blog data • Identify any patterns in the data • Which blogs are very similar? • Which blogs are very different? • How can these techniques be applied to other types of search? • Web search? • Enterprise search?
Hierarchical clustering (i) • Hierarchical clustering is an algorithm that groups similar items together • At each iteration, the two most similar items (or groups) are merged • For example, given five items A–E: [diagram: items A, B, C, D, E scattered in the plane]
Hierarchical clustering (ii) • Calculate the distances between all items • Group the two items that are closest • Repeat! [diagram: the closest pair, A and B, merged into group AB]
Hierarchical clustering (iii) • How do we compare group AB to other items? • Use the midpoint of items A and B [diagram: AB represented by the midpoint x of A and B; D and E merged into DE, then AB and C merged into ABC]
Hierarchical clustering (iv) • When do we stop? • When we have a top-level group that includes all items [diagram: ABC and DE merged into the top-level group ABCDE]
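Putting the last four slides together, here is a sketch of the merge loop: compute all pairwise distances, merge the closest pair, and represent the merged group by the midpoint of its parents. This is a simplified take on the book's hcluster; the names and the returned merge list are illustrative:

```python
def hcluster(vectors, distance):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters until one remains, recording each merge (the merge
    order is what the dendrogram on the next slide displays)."""
    clusters = [(list(v), [i]) for i, v in enumerate(vectors)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters by centroid distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        # Represent the merged group by the midpoint of its parents
        midpoint = [(a + b) / 2 for a, b in zip(ci, cj)]
        merges.append((mi, mj, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((midpoint, mi + mj))
    return merges
```

Calling this with a distance function over the blog word vectors yields the merge sequence a dendrogram plotter would draw.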
Hierarchical clustering (v) • The hierarchical part is based on the discovery order of clusters • This diagram is called a dendrogram... [dendrogram: A and B join into AB, AB and C into ABC, D and E into DE, ABC and DE into ABCDE]
Hierarchical clustering (vi) • A dendrogram is a graph (or tree) • Distances between nodes of the dendrogram show how similar items (or groups) are • AB is closer (to A and B) than DE is (to D and E), so A and B are more similar than D and E • How can we define closeness?
Similarity scores • A similarity score compares two distinct elements from a given set • To measure closeness, we need to calculate a similarity score for each pair of items in the set • Options include: • The Euclidean distance score, which is based on the distance formula in two-dimensional geometry • The Pearson correlation score, which is based on fitting data points to a line
Euclidean distance score • To find the Euclidean distance between two data points, use the distance formula: distance = √((y₂ – y₁)² + (x₂ – x₁)²) • The larger the distance between two items, the less similar they are • So use the reciprocal of distance as a measure of similarity (but be careful of division by zero)
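A sketch of that reciprocal trick, written for n-dimensional vectors rather than just two coordinates; adding 1 to the distance sidesteps the division-by-zero case (the function name is illustrative):

```python
from math import sqrt

def euclidean_similarity(v1, v2):
    """Return 1/(1 + distance), so identical items score 1.0
    and the score shrinks toward 0 as the items grow apart."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return 1 / (1 + d)
```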
Pearson correlation score (i) • The Pearson correlation score is derived by determining the best-fit line for a given set of data points • The best-fit line, on average, comes as close as possible to each item • The Pearson correlation score is a coefficient measuring the degree to which items are on the best-fit line [scatter plot: data points in the (v1, v2) plane with a best-fit line]
Pearson correlation score (ii) • The Pearson correlation score tells us how closely items are correlated to one another • 1.0 is a perfect match; ~0.0 is no relationship [scatter plots: two examples, one with correlation score 0.4 and one with 0.8]
Pearson correlation score (iii) • The algorithm is: • Calculate sum(v1) and sum(v2) • Calculate the sum of the squares of v1 and v2 • Call them sum1Sq and sum2Sq • Calculate the sum of the products of v1 and v2 • (v1[0] * v2[0]) + (v1[1] * v2[1]) + ... + (v1[n-1] * v2[n-1]) • Call this pSum
Pearson correlation score (iv) • Calculate the Pearson score: r = (pSum – (sum(v1) * sum(v2)) / n) / √((sum1Sq – sum(v1)² / n) * (sum2Sq – sum(v2)² / n)) • Much more complex, but often better than the Euclidean distance score
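Translating the steps from the last two slides directly into code (a sketch; the variable names follow the slides, and the zero-denominator guard is my own addition for constant vectors):

```python
from math import sqrt

def pearson(v1, v2):
    """Compute the Pearson correlation coefficient of two
    equal-length numeric vectors using the running sums above."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(x ** 2 for x in v1)
    sum2Sq = sum(x ** 2 for x in v2)
    pSum = sum(x * y for x, y in zip(v1, v2))
    num = pSum - (sum1 * sum2) / n
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    return num / den if den != 0 else 0
```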
What next? • Review the blog-data dendrograms • Identify any patterns in the data • Which blogs are very similar? • Which blogs are very different? • How can these techniques be applied to other types of search? • Web search? • Enterprise search?