CS 548 – Project 3: Association Rules
Skyler Whorton – March 29, 2012
Correlation coefficient
• Symmetric measure of correlation
• Compute contingency table with support counts (entries named as in the code):

            B      not B   total
  A         f11    f10     f1x
  not A     f01    f00     f0x
  total     fx1    fx0     N

• Use formula: phi = (f11 * f00 - f01 * f10) / sqrt(f1x * fx1 * f0x * fx0)
• Weka code in AprioriItemSet.java:

  public double correlationForRule(AprioriItemSet premise,
                                   AprioriItemSet consequence,
                                   int premiseCount,
                                   int consequenceCount) {
    // Compute contingency table entries
    double N = (double) m_totalTransactions;
    double f11 = (double) m_counter;          // support count of "A and B"
    double f1x = (double) premiseCount;       // support count of A
    double fx1 = (double) consequenceCount;   // support count of B
    double f0x = N - f1x;
    double fx0 = N - fx1;
    double f10 = f1x - f11;
    double f01 = fx1 - f11;
    // Support count of "not A and not B"
    double f00 = fx0 - f10;

    // Calculate ratio numerator and denominator
    double num = f11 * f00 - f01 * f10;
    double denom = Math.sqrt(f1x * fx1 * f0x * fx0);

    // Return ratio
    return num / denom;
  }
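To make the calculation on this slide easy to check by hand, here is a minimal standalone sketch of the same ratio, taking raw support counts instead of Weka objects. The class name `PhiCoefficient`, the method `phi`, and the example counts are hypothetical and not part of Weka or the original project. Returning 0 when the denominator is zero is an added assumption; the slide's code would return NaN or an infinity in that case.

```java
// Hypothetical standalone helper; not part of Weka's AprioriItemSet.
final class PhiCoefficient {

    /**
     * Correlation (phi) coefficient of a rule A -> B from support counts.
     *
     * @param n       total number of transactions
     * @param countA  transactions containing the premise A
     * @param countB  transactions containing the consequence B
     * @param countAB transactions containing both A and B
     */
    static double phi(double n, double countA, double countB, double countAB) {
        // Contingency table entries, named as on the slide
        double f11 = countAB;
        double f1x = countA;
        double fx1 = countB;
        double f0x = n - f1x;
        double fx0 = n - fx1;
        double f10 = f1x - f11;
        double f01 = fx1 - f11;
        double f00 = fx0 - f10;

        double num = f11 * f00 - f01 * f10;
        double denom = Math.sqrt(f1x * fx1 * f0x * fx0);

        // Assumption: an empty row or column margin is treated as "no correlation"
        if (denom == 0.0) {
            return 0.0;
        }
        return num / denom;
    }

    public static void main(String[] args) {
        // Made-up counts: 100 transactions, A in 40, B in 50, both in 30
        System.out.println(phi(100, 40, 50, 30));
        // Swapping premise and consequence gives the same value (symmetric measure)
        System.out.println(phi(100, 50, 40, 30));
    }
}
```

With the made-up counts above, the table is f11 = 30, f10 = 10, f01 = 20, f00 = 40, so the ratio is 1000 / sqrt(6,000,000), roughly 0.41.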
College data
• Pre-processing:
  • Equal-frequency discretization into 3 bins: "Lo," "Med," "Hi"
  • Binarize into item-type attributes
  • Remove id, name, state
• Objectives:
  • Which groups of features are highly associated?
  • Which are associated with high tuition costs?
  • What are some different trends between public vs. private schools?
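The two pre-processing steps can be sketched in plain Java as follows. This is an illustration only, not the filters used in the project: the class `EqualFrequencyBins` and its methods are hypothetical, and the binning is purely rank-based, so tied values may land in different bins (a real discretization filter would normally place cut points between distinct values).

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper illustrating equal-frequency binning; not Weka code.
final class EqualFrequencyBins {

    static final String[] LABELS = {"Lo", "Med", "Hi"};

    /** Assigns each value to one of three bins holding (nearly) equal numbers of values. */
    static String[] discretize(double[] values) {
        int n = values.length;

        // Sort the positions 0..n-1 by the value stored at each position
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        // The rank in sorted order decides the bin: first third Lo, then Med, then Hi
        String[] labels = new String[n];
        for (int rank = 0; rank < n; rank++) {
            int bin = (int) ((long) rank * LABELS.length / n);
            labels[order[rank]] = LABELS[bin];
        }
        return labels;
    }

    /** Binarization step: an (attribute, bin) pair becomes one item name. */
    static String toItem(String attribute, String label) {
        return attribute + label;
    }

    public static void main(String[] args) {
        // Made-up values for a single numeric attribute
        double[] tuition = {9, 1, 5, 3, 7, 11};
        String[] bins = discretize(tuition);
        for (String bin : bins) {
            System.out.println(toItem("inStateTuition", bin));
        }
    }
}
```

Each school then becomes a transaction holding one item per attribute, such as `inStateTuitionHi`, which is the form Apriori expects.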
College data
CAR rules (class association rules):
• inStateTuitionHi, stuFacRatioLo → priv
• numFtUndergradHi, inStateTuitionLo, pctAlumniGiveLo → pub
ASSISTments data
• Dataset of 241 teachers, 1,500 problem sets, 1M logs
• Can I make Netflix/Amazon-style recommendations based on these problem set data? (No.)
• Logs too sparse: only ~4,000 items total
• Average transaction width per teacher is about 1% of the items
• Highest-supported rule: 31 instances of premise
ASSISTments data
• Findings
  • Problem set associations
    • "Evaluating Expressions" → "Equation Solving (1)", etc.
  • Mined associations are highly confident
  • Not enough data to make many recommendations
  • Wide, sparse data set
  • Use leverage and lift to your advantage
  • Few highly-supported itemsets
  • Teachers assigning similar content
    • Similar account creation date, and/or
    • Similar school e-mail domains
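Since the findings lean on leverage and lift for sparse data, here is a small sketch of how both follow from support counts, using the standard definitions: lift = P(A and B) / (P(A) * P(B)) and leverage = P(A and B) - P(A) * P(B). The class `RuleMetrics` and the counts in `main` are hypothetical and are not results from the ASSISTments data.

```java
// Hypothetical helper for rule A -> B; not part of Weka.
final class RuleMetrics {

    /** Confidence: fraction of transactions with A that also contain B. */
    static double confidence(double countA, double countAB) {
        return countAB / countA;
    }

    /** Lift: observed joint support divided by the support expected under independence. */
    static double lift(double n, double countA, double countB, double countAB) {
        return (countAB / n) / ((countA / n) * (countB / n));
    }

    /** Leverage: observed joint support minus the support expected under independence. */
    static double leverage(double n, double countA, double countB, double countAB) {
        return countAB / n - (countA / n) * (countB / n);
    }

    public static void main(String[] args) {
        // Made-up counts: 200 transactions, A in 20, B in 25, both in 18
        double n = 200, a = 20, b = 25, ab = 18;
        System.out.println("confidence = " + confidence(a, ab));
        System.out.println("lift       = " + lift(n, a, b, ab));
        System.out.println("leverage   = " + leverage(n, a, b, ab));
    }
}
```

In this made-up case the rule covers only 9% of transactions, yet its lift is well above 1, which is why these measures stay informative when few itemsets have high support.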