The College of Saint Rose
CIS 460 – Search and Information Retrieval
David Goldschmidt, Ph.D.
Discovering groups {week 03}
from Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN 978-0-596-52932-1
Data clustering (i)
• A cluster is a group of related things
• Automatic detection of clusters is a powerful data discovery tool
  • Detect similar user interests, buying patterns, clickthrough patterns, etc.
• Also applicable to the sciences
  • In computational biology, find groups (or clusters) of genes that exhibit similar behavior
Data clustering (ii)
• Data clustering is an example of an unsupervised learning algorithm...
• ...which is an AI technique for discovering structure within one or more datasets
• The key goal is to find the distinct group(s) that exist within a given dataset
• We don’t know what we’ll find
Data clustering (iii)
To cluster a set of items, we first need to identify a common set of numerical attributes that can be compared to see how similar the items are.
Can we do anything with word frequencies?
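One way to turn text into comparable numerical attributes is to count how often each word occurs. The sketch below is illustrative only: the function name `word_counts` and the tokenization rule (strip anything that looks like an HTML tag, lowercase, keep runs of the letters a-z) are assumptions, not something the slides prescribe.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word to how often it occurs.

    Hypothetical helper: the tokenization rule here is an assumption.
    """
    # Replace anything that looks like an HTML tag with a space.
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Lowercase the text and keep only runs of the letters a-z as words.
    words = re.findall(r"[a-z]+", no_tags.lower())
    return Counter(words)


counts = word_counts("<p>The cat sat. The cat ran!</p>")
print(counts["the"], counts["cat"], counts["sat"])
```

Each item (here, a piece of text) then becomes a vector of counts over a shared vocabulary, which is what makes items comparable.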
Clustering blogs via feeds (i)
• If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are...
  • ...similar in terms of blog content
  • ...similar in terms of writing style
  • ...of interest for searching, cataloging, etc.
Clustering blogs via feeds (ii)
• A feed is a simple XML document containing information about a blog and its entries
• Reader apps (e.g. Google Reader) enable users to read multiple blogs in a single window
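Because a feed is plain XML, its entries can be pulled out with a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree` on a made-up sample feed. `SAMPLE_FEED` and `parse_feed` are hypothetical names, and the code assumes the RSS 2.0 layout (`rss/channel/item`); RDF-based or Atom feeds use XML namespaces and would need different element lookups.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <description>Hello clustering</description>
    </item>
    <item>
      <title>Second post</title>
      <description>More on feeds</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (blog_title, [(entry_title, entry_text), ...]) for an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    blog_title = channel.findtext("title", default="")
    entries = [
        (item.findtext("title", default=""),
         item.findtext("description", default=""))
        for item in channel.findall("item")
    ]
    return blog_title, entries


title, entries = parse_feed(SAMPLE_FEED)
print(title, len(entries))
```

The entry titles and descriptions returned here are the text whose words would be counted for each blog.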
Clustering blogs via feeds (iii)
• Check out these feeds:
  • http://blogs.abcnews.com/theblotter/index.rdf
  • http://www.wired.com/rss/index.xml
  • http://www.tmz.com/rss.xml
  • http://scienceblogs.com/sample/combined.xml
  • http://www.neilgaiman.com/journal/feed/rss.xml
Clustering blogs via feeds (iv)
• Techniques for avoiding stop words:
  • Ignore words on a predefined stop list
  • Select words from within a predefined range of occurrence percentages
    • Lower bound of 10%
    • Upper bound of 50%
    • Tune as necessary
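Both techniques can be combined in a few lines. The sketch below assumes that "occurrence percentage" means the fraction of blogs in which a word appears at least once, and it treats the 10% and 50% bounds as exclusive; both are interpretations, and the function name `select_words` and the sample data are hypothetical.

```python
def select_words(blog_counts, lower=0.10, upper=0.50, stop_words=frozenset()):
    """Pick words that appear in more than `lower` and fewer than `upper`
    of the blogs, skipping anything on the stop list.

    blog_counts maps a blog name to a {word: count} dictionary.
    """
    n_blogs = len(blog_counts)
    appearances = {}  # word -> number of blogs that contain it
    for counts in blog_counts.values():
        for word, count in counts.items():
            if count > 0:
                appearances[word] = appearances.get(word, 0) + 1
    return sorted(
        word
        for word, n in appearances.items()
        if word not in stop_words and lower < n / n_blogs < upper
    )


# Made-up word counts for five blogs.
blogs = {
    "a": {"the": 5, "python": 2, "search": 1},
    "b": {"the": 3, "python": 1, "search": 4},
    "c": {"the": 4, "search": 2, "recipe": 2},
    "d": {"the": 2, "gossip": 6},
    "e": {"the": 1},
}
selected = select_words(blogs)
print(selected)
```

In this sample, "the" (in every blog) and "search" (in 60% of blogs) fall above the upper bound and are dropped, while words seen in 20% or 40% of blogs are kept. With only a handful of blogs the bounds are coarse, which is one reason to tune them as necessary.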
What next?
• Study the resulting blog data
• Identify any patterns in the data
  • Which blogs are very similar?
  • Which blogs are very different?
• How can these techniques be applied to other types of search?
  • Web search?
  • Enterprise search?
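One possible way to start answering "which blogs are very similar or very different" is to score each pair of blogs by comparing their word-count vectors. The sketch below uses the Pearson correlation coefficient as the similarity measure; that choice is an assumption (other measures would also work), and the function names and sample vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(v1, v2):
    """Pearson correlation of two equal-length numeric vectors.

    Returns 0.0 when either vector has no variation.
    """
    n = len(v1)
    mean1, mean2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    denom = sqrt(var1 * var2)
    return cov / denom if denom else 0.0


def rank_pairs(vectors):
    """Return (score, blog_a, blog_b) tuples, most similar pair first."""
    pairs = [
        (pearson(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs, reverse=True)


# Made-up counts of the same four words in three blogs.
vectors = {
    "tech_a": [4, 0, 3, 1],
    "tech_b": [8, 0, 6, 2],
    "gossip": [0, 5, 1, 4],
}
ranked = rank_pairs(vectors)
for score, a, b in ranked:
    print(f"{a} vs {b}: {score:.3f}")
```

Pearson correlation ignores overall scale, so a blog that simply publishes twice as much text with the same word mix (as `tech_b` does relative to `tech_a`) still scores as highly similar.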