A Data Mining Course for Computer Science: Primary Sources and Implementations • Dave Musicant • Saturday, March 4, 2006
Overview • What is data mining? • Why offer a course in data mining? • Why focus on research papers in an undergraduate class? • What topics do I cover? • What research papers do I use in class? • What assignments do I use? • Does it work?
What is data mining? • “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al) • Data Mining and Machine Learning are two sides of the same coin • Data mining focuses more on larger datasets • Machine learning focuses more on connections with artificial intelligence • ... but there is much overlap between the two areas • My course is titled “Machine Learning and Data Mining” • boosts student enthusiasm
Why offer a course in data mining? • Interesting applied area of CS that uses theoretical techniques • Reinforces and introduces data structures and algorithms • heaps, R-trees, graphs • Privacy and ethics • Personal ownership in assignments • Students choose datasets in areas that interest them • New field, yet accessible • Can be done with only Data Structures as a prereq • It’s my research area
Why research papers? Can it be done? • One approach to the course is to use data mining software • Lopez & Ludwig, University of Minnesota-Morris • I wanted students to implement data mining algorithms • Textbook support with a computer science focus is limited • (I use Margaret Dunham’s text as a side reference) • Primary sources provide a rich experience • With proper selection, papers are accessible to undergraduates • Papers must be supplemented in the classroom • e.g. specific topics in linear algebra, statistics • directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading
Topics, Papers, Assignments • Each topic consists of one or more papers that are assigned to the students to read before class discussion. • Students post to Caucus (electronic message board): • something they didn’t understand, or something they found interesting • potential exam question • Assignment follows class discussion • Detailed references for all papers and datasets can be found in paper
Topic 0: What is Data Mining? • Paper: J. Friedman. “Data Mining and Statistics: What’s the Connection?” • Entertaining and controversial • Pokes fun at flaws on all sides • Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course) • Assignment: For the “census-income” dataset, determine: • Number of records and features • How many features are continuous, how many are nominal • For continuous features: average, median, minimum, maximum, standard deviation • 2-dimensional scatter plots of two features at a time • Interesting patterns
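A minimal sketch of this assignment using pandas and matplotlib; the file name and the rule that numeric columns count as continuous are assumptions, not part of the original assignment.

```python
# Sketch of the Topic 0 assignment on the census-income dataset.
# Assumptions: the data is a comma-separated file named "census-income.data"
# with no header row; numeric columns are treated as continuous, all others as nominal.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("census-income.data", header=None)

print("records:", len(df), "features:", df.shape[1])

continuous = df.select_dtypes(include="number")
nominal = df.select_dtypes(exclude="number")
print("continuous:", continuous.shape[1], "nominal:", nominal.shape[1])

# Average, median, minimum, maximum, and standard deviation per continuous feature.
print(continuous.agg(["mean", "median", "min", "max", "std"]))

# 2-D scatter plot of two continuous features at a time (first two shown here).
x, y = continuous.columns[:2]
df.plot.scatter(x=x, y=y)
plt.show()
```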
Topic 1: Classification and Regression • Example: a first-trimester screening training set, where each patient has input data and a known classification (diagnosis) • Use this training set to learn how to classify patients in a testing set, whose diagnoses are not known • [Figure: training set with input data and classification columns; testing set with input data only] • The input data is often easily obtained, whereas the classification is not.
Technique: Nearest Neighbor • Envision each example as a point in n-dimensional space • Classify a test point the same as its nearest training point • [Figure: an unlabeled test point (“What am I?”) among labeled training points]
Topic 1: Classification and Regression • Focus on scalable nearest neighbor algorithms • Paper: Roussopoulos et al. “Nearest Neighbor Queries” • How to do NN efficiently when data doesn’t fit in core • Requires R-trees (I cover in class) • Assignment: Code up the traditional k-nearest neighbor algorithm, apply it to the census-income data • Experiment with different distance metrics (1-norm, 2-norm, cosine) • Experiment with different values of k • Produce plots showing training and test set accuracies • Interpret results
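A minimal sketch of the in-memory k-nearest-neighbor baseline the assignment asks for, with the three distance metrics as options; the array names and the train/test split are assumed. The R-tree approach in the paper is what makes this scale when the data doesn’t fit in core.

```python
# Sketch of traditional k-NN classification with a selectable distance metric.
# X_train, y_train, X_test, y_test are assumed to be NumPy arrays prepared elsewhere.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, metric="2-norm"):
    """Classify point x by majority vote among its k nearest training points."""
    if metric == "1-norm":
        d = np.abs(X_train - x).sum(axis=1)
    elif metric == "2-norm":
        d = np.linalg.norm(X_train - x, axis=1)
    elif metric == "cosine":  # distance = 1 - cosine similarity
        d = 1 - (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def accuracy(X_train, y_train, X, y, k, metric):
    """Fraction of points in X whose predicted label matches y."""
    preds = [knn_predict(X_train, y_train, x, k, metric) for x in X]
    return np.mean(np.array(preds) == y)
```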
Topic 2: Clustering • Sometimes referred to as unsupervised learning • Goal: find clusters of similar data • Less accurate than supervised learning, but quite useful when no training set is available • Where are the clusters below? How many are there? • [Figure: two scatter plots of tissue (cm) vs. chemical 1 and chemical 2]
Topic 2: Clustering • Assignment: Find dataset of interest from UCI Repository • iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo • this dataset is used for most remaining assignments • if dataset has a class label, discard it for this assignment • Implement basic clustering algorithm (k-means) • Try varying number of clusters • Try two different techniques for initializing clusters • Report and interpret results found
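A minimal sketch of the basic k-means algorithm for this assignment; the two initialization strategies shown (sampling k data points vs. drawing uniformly from the bounding box) are just examples of the experiment the assignment asks for.

```python
# Sketch of Lloyd's k-means. X is assumed to be an (n, d) NumPy array.
import numpy as np

def kmeans(X, k, init="sample", iters=100, seed=0):
    rng = np.random.default_rng(seed)
    if init == "sample":   # initialize from k randomly chosen data points
        centers = X[rng.choice(len(X), k, replace=False)]
    else:                  # initialize uniformly within the data's bounding box
        centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points (leave empty clusters put)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```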
Topic 2: Clustering • Paper: Bradley et al, “Scaling Clustering Algorithms to Large Databases” • Describes “Scalable K-means” algorithm • Class discussion around “data mining desiderata” • Paper: Guha et al, “CURE: An Efficient Clustering Algorithm for Large Databases” • Agglomerative clustering algorithm • completely different approach • Requires use of a heap (as I pose the assignment) • Assignment: Implement stripped-down version of CURE • Run on dataset, interpret results
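A minimal sketch of heap-driven agglomerative clustering with a single centroid per cluster, as a stripped-down stand-in for this assignment; CURE itself keeps several shrunken representative points per cluster and adds sampling and partitioning for scalability, which this sketch omits. The function and variable names are assumptions.

```python
# Sketch of bottom-up (agglomerative) clustering driven by a heap of
# inter-cluster distances. Merges the closest pair until target_k clusters remain.
import heapq
import numpy as np

def agglomerate(X, target_k):
    centroids = {i: X[i].astype(float) for i in range(len(X))}
    sizes = {i: 1 for i in range(len(X))}
    members = {i: [i] for i in range(len(X))}
    heap = [(float(np.linalg.norm(centroids[i] - centroids[j])), i, j)
            for i in centroids for j in centroids if i < j]
    heapq.heapify(heap)
    next_id = len(X)
    while len(centroids) > target_k and heap:
        d, i, j = heapq.heappop(heap)
        if i not in centroids or j not in centroids:
            continue   # stale heap entry: one of these clusters was already merged
        # merge i and j; the new centroid is the size-weighted mean of the old ones
        c = (sizes[i] * centroids[i] + sizes[j] * centroids[j]) / (sizes[i] + sizes[j])
        members[next_id] = members.pop(i) + members.pop(j)
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        del centroids[i], centroids[j]
        centroids[next_id] = c
        # push distances from the new cluster to every surviving cluster
        for other in centroids:
            if other != next_id:
                heapq.heappush(heap, (float(np.linalg.norm(c - centroids[other])),
                                      next_id, other))
        next_id += 1
    return [members[cid] for cid in centroids]   # lists of original point indices
```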
Topic 3: Association Rules • “Supermarket basket analysis” • What items do people tend to buy together at the same time? • Paper: Agrawal et al, “Fast Algorithms for Mining Association Rules” • presents the classic Apriori algorithm (skim other portions of the paper) • Assignment: Implement the Apriori algorithm and apply it to your own dataset
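A minimal sketch of the Apriori frequent-itemset loop (level-wise candidate generation plus subset pruning); rule generation from the frequent itemsets is left out, and the toy baskets are illustrative.

```python
# Sketch of Apriori frequent-itemset mining over transactions given as sets of items.
from itertools import combinations

def apriori(transactions, min_support=0.05):
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Candidate generation: join frequent (k-1)-itemsets, keep size-k unions.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [s for level in frequent for s in level]

# Toy market baskets:
baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
           {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"}]
print(apriori(baskets, min_support=0.5))
```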
Topic 4: Web Mining • How does Google rank importance of web pages? • Every page has a PageRank • PageRank of a page is determined by the PageRank of the pages that link to it • manifests itself as an eigenvalue problem • Paper: Page et al, “The PageRank Citation Ranking: Bringing Order to the Web” • describes basic version of Google PageRank algorithm • cover eigenvalues in class • exposure to linear algebra, numerical analysis
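For reference, one common form of the PageRank recurrence, with damping factor d and N pages (the paper's original formulation differs slightly in the damping term); a fixed point of this recurrence is the dominant eigenvector of the corresponding link matrix, which is where the eigenvalue discussion in class comes in.

```latex
PR(p) \;=\; \frac{1 - d}{N} \;+\; d \sum_{q \to p} \frac{PR(q)}{\mathrm{out}(q)}
```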
Topic 4: Web Mining • Paper: Chakrabarti et al, “Mining the Link Structure of the World Wide Web” • describes the HITS algorithm for ranking web pages • Google isn’t the only way to do it • uses Latent Semantic Analysis, which requires singular value decomposition (cover in class) • Assignment: Implement the PageRank algorithm • try it on an archive of the department website • crawling for an assignment is dangerous • sparse data representation • hashing or other form of map for efficiency • interpret results • [Figure: hubs and authorities in a web link graph]
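A minimal sketch of PageRank by power iteration over a sparse link structure (a dict of out-links, covering the “sparse data representation” and “map for efficiency” points); the damping factor and iteration count are illustrative defaults, not values from the paper.

```python
# Sketch of PageRank via power iteration.
# links: dict mapping every page to the list of pages it links to
# (every linked-to page is assumed to appear as a key as well).
def pagerank(links, d=0.85, iters=50):
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in links}
        for q, outs in links.items():
            if outs:
                share = d * pr[q] / len(outs)
                for p in outs:
                    new[p] += share          # q passes a share of its rank to p
            else:
                for p in new:                # dangling page: spread its rank uniformly
                    new[p] += d * pr[q] / n
        pr = new
    return pr

# Tiny example graph:
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```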
Topic 5: Collaborative Filtering • a.k.a. Recommender Systems • “I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?” • Amazon.com, Yahoo! Launchcast • Paper: Breese et al, “Empirical Analysis of Predictive Algorithms for Collaborative Filtering” • Algorithms are nearest neighbor-like in flavor • Involve averaging numerical scores • Need to normalize for individual biases • Students already working on final project, so no assignment
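A minimal sketch in the spirit of the memory-based algorithms discussed in Breese et al.: predictions are Pearson-weighted averages of other users' mean-centered ratings, which handles the individual-bias normalization mentioned above. The dict-of-dicts ratings layout and helper names are assumptions.

```python
# Sketch of user-based collaborative filtering.
# ratings[user][item] = numeric score.
import math

def pearson(ratings, a, u):
    """Pearson correlation between users a and u over their co-rated items."""
    common = set(ratings[a]) & set(ratings[u])
    if len(common) < 2:
        return 0.0
    ma = sum(ratings[a][i] for i in common) / len(common)
    mu = sum(ratings[u][i] for i in common) / len(common)
    num = sum((ratings[a][i] - ma) * (ratings[u][i] - mu) for i in common)
    den = math.sqrt(sum((ratings[a][i] - ma) ** 2 for i in common) *
                    sum((ratings[u][i] - mu) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, a, item):
    """Predict user a's score for item from similar users' mean-centered scores."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for u in ratings:
        if u != a and item in ratings[u]:
            w = pearson(ratings, a, u)
            mean_u = sum(ratings[u].values()) / len(ratings[u])
            num += w * (ratings[u][item] - mean_u)   # mean-centering removes per-user bias
            den += abs(w)
    return mean_a + (num / den if den else 0.0)
```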
Topic 6: Ethical Issues in Data Mining • Privacy concerns • Good vs. evil uses of data mining • Video: Ramakrishnan et al, “Data Mining: Good, Bad, or Just a Tool?” • Panel discussion from KDD 2004 • Before watching video, students post to Caucus: • how data mining could be exploited • how this could be prevented (if possible) • After watching video • followup commentary Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/
Topic 6: Ethical Issues in Data Mining • Students’ response to the video was more engaged than I expected • More problems than solutions are raised in the video • Students were frustrated that solutions weren’t clear • Many students interested in the issue of accountability • If someone’s privacy is violated, who is responsible? • “Who do I sue?” • Lively class discussion
Final Project • “Do almost anything you want regarding data mining, so long as I approve it” • Find a paper and implement the algorithm within • Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term • Quantitative association rules • Poker association rules • Collaborative filtering (music, art) • Attack KDD Cup problems • KDD Cup 2005: identify categories for web search queries • tried this once: it tended to be too big for students in the time available • could perhaps be done with the right level of support
Conclusions • Papers are most memorable part of course • Students speak very positively about this in evaluations • Significant prep time for me to fill in gaps • Caucus motivates reading papers • Students find this a pain, but are thankful afterwards in evals • Important to set deadline for posting a few hours before class so I have time to read • Programming assignments work (mostly) well • Allow students to work in pairs if they wish • Grading is difficult: unspecified details in algorithms, differing datasets • All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05