Discovering groups {week 03}

The College of Saint Rose
CIS 460 – Search and Information Retrieval
David Goldschmidt, Ph.D.
Discovering groups {week 03}
from Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN 978-0-596-52932-1

Data clustering (i)
• A cluster is a group of related things
• Automatic detection of clusters is a powerful data discovery tool
• Detect similar user interests, buying patterns, clickthrough patterns, etc.
• Also applicable to the sciences
• In computational biology, find groups (or clusters) of genes that exhibit similar behavior

Data clustering (ii)
• Data clustering is an example of an unsupervised learning algorithm...
• ...which is an AI technique for discovering structure within one or more datasets
• The key goal is to find the distinct group(s) that exist within a given dataset
• We don’t know what we’ll find

Data clustering (iii)
We first need to identify a common set of numerical attributes that we can compare to see how similar the items are.
Can we do anything with word frequencies?
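As a rough illustration of word frequencies as numerical attributes, the sketch below counts how often each word occurs in a piece of text. The function name `word_counts` and the rule that a word is any run of letters are assumptions made for this example, not something the slides specify.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Assumption: a "word" is any run of ASCII letters; digits and
    # punctuation act as separators.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


counts = word_counts("The cat sat on the mat. The mat was flat.")
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

Each document then becomes a vector of counts over a shared vocabulary, which is the kind of common numerical attribute set that can be compared.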

Clustering blogs via feeds (i)
• If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are...
  • ...similar in terms of blog content
  • ...similar in terms of writing style
  • ...of interest for searching, cataloging, etc.

Clustering blogs via feeds (ii)
• A feed is a simple XML document containing information about a blog and its entries
• Reader apps enable users to read multiple blogs in a single window
• Click below to check out the Google Reader blog:
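To make the "simple XML document" concrete, here is a minimal sketch that reads a blog title and its entries from an RSS 2.0 style feed using Python's standard library. The sample feed text and the function name `parse_feed` are invented for illustration. Real feeds vary (RDF and Atom feeds use XML namespaces and different element names), so a dedicated feed-parsing library is likely more robust in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, made up for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <description>Hello clustering world</description>
    </item>
    <item>
      <title>Second post</title>
      <description>More about clustering</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (blog_title, [(entry_title, entry_text), ...]) for an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    blog_title = channel.findtext("title", default="")
    entries = [
        (item.findtext("title", default=""),
         item.findtext("description", default=""))
        for item in channel.findall("item")
    ]
    return blog_title, entries


blog_title, entries = parse_feed(SAMPLE_FEED)
print(blog_title, len(entries))
```

The entry titles and descriptions are the text from which word frequencies would be counted.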

Clustering blogs via feeds (iii)
• Check out these feeds:
  • http://blogs.abcnews.com/theblotter/index.rdf
  • http://www.wired.com/rss/index.xml
  • http://www.tmz.com/rss.xml
  • http://scienceblogs.com/sample/combined.xml
  • http://www.neilgaiman.com/journal/feed/rss.xml

Clustering blogs via feeds (iv)
• Techniques for avoiding stop words:
  • Ignore words on a predefined stop list
  • Select words from within a predefined range of occurrence percentages
    • Lower bound of 10%
    • Upper bound of 50%
    • Tune as necessary
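The range-based technique above might look like the following sketch. It assumes that "occurrence percentage" means the fraction of blogs in which a word appears at least once, and that the bounds are exclusive. Both are interpretations, and the function name `select_words` and the sample data are hypothetical.

```python
def select_words(blog_word_counts, lower=0.10, upper=0.50):
    """Keep words that appear in more than `lower` and fewer than `upper` of the blogs.

    blog_word_counts maps a blog name to a {word: count} dictionary.
    """
    total_blogs = len(blog_word_counts)
    blogs_containing = {}
    for counts in blog_word_counts.values():
        for word in counts:
            blogs_containing[word] = blogs_containing.get(word, 0) + 1
    # Assumption: exclusive bounds on the fraction of blogs containing the word.
    return sorted(
        word
        for word, n in blogs_containing.items()
        if lower < n / total_blogs < upper
    )


# Made-up data: "the" appears everywhere, "zebra" in only one blog.
blogs = {
    "a": {"the": 10, "python": 3},
    "b": {"the": 8, "python": 1, "zebra": 1},
    "c": {"the": 12},
    "d": {"the": 5},
}
# With only four blogs the default 10%/50% bounds are too coarse,
# so this call tunes them, as the slide suggests.
selected = select_words(blogs, lower=0.3, upper=0.8)
print(selected)  # ['python']
```

Words found in nearly every blog carry little information for telling blogs apart, and words found in almost none give too little overlap to compare, which is why both ends of the range are trimmed.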

What next?
• Study the resulting blog data
• Identify any patterns in the data
  • Which blogs are very similar?
  • Which blogs are very different?
• How can these techniques be applied to other types of search?
  • Web search?
  • Enterprise search?
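One possible way to ask which blogs are very similar or very different is to score every pair of blogs by the correlation of their word-count vectors. The slides do not name a similarity measure, so the use of Pearson correlation here is an assumption, as are the function names and the made-up count vectors.

```python
from itertools import combinations
from math import sqrt


def pearson(v1, v2):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(v1)
    mean1, mean2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    if var1 == 0 or var2 == 0:
        return 0.0
    return cov / sqrt(var1 * var2)


def rank_pairs(vectors):
    """Return (score, blog_a, blog_b) tuples, most similar pair first."""
    pairs = [
        (pearson(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical counts for the same four words in three blogs.
vectors = {
    "tech_a": [5, 4, 0, 0],
    "tech_b": [10, 8, 0, 0],
    "gossip": [0, 0, 6, 7],
}
ranked = rank_pairs(vectors)
print("most similar:", ranked[0][1], ranked[0][2])
print("least similar:", ranked[-1][1], ranked[-1][2])
```

Correlation ignores overall scale, so a blog that simply publishes more text than another with the same word mix still scores as highly similar. A clustering algorithm could then group blogs using these pairwise scores.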
