Learn about the basics of Association Rule Mining (ARM) and how to find patterns in attribute values between instances. Understand the concepts of support, confidence, lift, and complexity in ARM.
COMP527: Data Mining M. Sulaiman Khan (mskhan@liv.ac.uk) • Dept. of Computer Science • University of Liverpool • 2009 Association Rule Mining March 5, 2009 Slide 1
COMP527: Data Mining Course outline: Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam Association Rule Mining March 5, 2009 Slide 2
Today's Topics COMP527: Data Mining Introduction to Association Rule Mining (ARM) General Issues • Support • Confidence • Lift • Conviction • Complexity! Frequent Itemsets Association Rule Mining March 5, 2009 Slide 3
Introduction COMP527: Data Mining We've spent a long time looking at various classification methods, but there's more to data mining than classification. Given a data set with no classes, just attributes, what might we want to do with it? Association Rule Mining: Find patterns in the attribute values between instances. Instead of predicting an unknown value, we want to find interesting facts about the relationships between the known values. Association Rule Mining March 5, 2009 Slide 4
Introduction COMP527: Data Mining In ARM, these patterns take the form of rules about the co-occurrence of attributes. The easiest example to use is market basket analysis -- finding patterns of things that are bought together in a supermarket. Shopping at a supermarket, you typically buy many things together (as opposed to shopping for a television, say). Perhaps 30 different items. Under 10 items is pretty rare. By comparing your shopping habits over time, the supermarket can learn about you and how best to make you spend more money, increasing their profits. They can also compare all shoppers' habits to find general rules, hopefully for how to increase profits. Association Rule Mining March 5, 2009 Slide 5
Introduction COMP527: Data Mining • Basket1: bread, butter, jam • Basket2: bread, butter • Basket3: bread, butter, milk • Basket4: beer, bread • Basket5: beer, milk What can we find from this? Some simple statistics: bread occurs 80% of the time. butter appears 60% of the time. Less simple: 100% of baskets containing butter also contain bread. 100% of baskets containing butter and jam also contain bread. Association Rule Mining March 5, 2009 Slide 6
Finding Rules COMP527: Data Mining • Basket1: bread, butter, jam • Basket2: bread, butter • Basket3: bread, butter, milk • Basket4: beer, bread • Basket5: beer, milk if (butter jam) then bread if butter then bread if bread then butter To find rules we find sets of items which occur together. The more frequently they occur, the better our rule is. There are some particular factors involved in determining the 'goodness' of a rule... Association Rule Mining March 5, 2009 Slide 7
Support COMP527: Data Mining • Basket1: bread, butter, jam • Basket2: bread, butter • Basket3: bread, butter, milk • Basket4: beer, bread • Basket5: beer, milk Support: Percentage of baskets in which the item(s) occur. bread: 80%, butter: 60%, (bread butter): 60% ... So the support for a rule X => Y is the percentage of instances which contain both X and Y. Association Rule Mining March 5, 2009 Slide 8
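The support figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the names BASKETS and support are illustrative choices, not part of the lecture, and support is returned as a fraction rather than a percentage.

```python
# The five baskets from the slide, one set of items per basket.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


print(support({"bread"}, BASKETS))            # 0.8
print(support({"butter"}, BASKETS))           # 0.6
print(support({"bread", "butter"}, BASKETS))  # 0.6
```

The support of the rule butter => bread is simply the support of the combined itemset (bread butter).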
Confidence COMP527: Data Mining • Basket1: bread, butter, jam • Basket2: bread, butter • Basket3: bread, butter, milk • Basket4: beer, bread • Basket5: beer, milk We also need a confidence for each rule -- how strongly we believe that rule to be true. Here, butter => bread is true 100% of the time, but bread => butter holds for only 3 of the 4 baskets that contain bread, so it is true 75% of the time. Confidence for X => Y is the number of instances that contain both X and Y divided by the number of instances that contain X. Association Rule Mining March 5, 2009 Slide 9
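The confidence definition translates directly into code. A small sketch, assuming the same five baskets; the helper names count and confidence are made up for this example.

```python
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(x, y, baskets):
    """Confidence of X => Y: count(X and Y) / count(X)."""
    n_x = count(x, baskets)
    if n_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return count(set(x) | set(y), baskets) / n_x


print(confidence({"butter"}, {"bread"}, BASKETS))  # 1.0
print(confidence({"bread"}, {"butter"}, BASKETS))  # 0.75
```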
Rule Mining COMP527: Data Mining • Basket1: bread, butter, jam • Basket2: bread, butter • Basket3: bread, butter, milk • Basket4: beer, bread • Basket5: beer, milk ARM algorithms have a minimum threshold for both support and confidence and discard any rules below those thresholds. For example, jam => (butter bread) has 100% confidence but only 20% support, because jam, butter and bread occur together only once. On the other hand, butter => bread has 60% support and 100% confidence, a much more interesting rule to us. Association Rule Mining March 5, 2009 Slide 10
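Thresholding can be sketched as a filter over candidate rules. The thresholds below (50% support, 80% confidence) and the candidate list are assumptions picked for illustration; the slides do not fix particular values.

```python
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


def keep_rule(lhs, rhs, baskets, min_support, min_confidence):
    """True if lhs => rhs meets both thresholds."""
    n_both = count(set(lhs) | set(rhs), baskets)
    n_lhs = count(lhs, baskets)
    if n_lhs == 0:
        return False
    return (n_both / len(baskets) >= min_support
            and n_both / n_lhs >= min_confidence)


# Hypothetical candidate rules, written as (left-hand side, right-hand side).
CANDIDATES = [
    (("jam",), ("butter", "bread")),
    (("butter",), ("bread",)),
    (("bread",), ("butter",)),
]

kept = [rule for rule in CANDIDATES
        if keep_rule(rule[0], rule[1], BASKETS, 0.5, 0.8)]
print(kept)  # [(('butter',), ('bread',))]
```

With these thresholds jam => (butter bread) fails on support (20%) and bread => butter fails on confidence (75%), leaving only butter => bread.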
Lift COMP527: Data Mining Confidence and Support are necessary but not sufficient to find interesting rules. Suppose that X => Y has a confidence of 60%: s(X+Y) / s(X) = 0.6. Sure, that looks interesting... there's a correlation between buying X and buying Y. But what if the probability of Y was 70% overall? Then if you buy X, you're less likely than normal to buy Y... certainly not what the rule is implying! Association Rule Mining March 5, 2009 Slide 11
Lift COMP527: Data Mining Lift is measured in terms of support: lift(X => Y) = s(X+Y) / (s(X) * s(Y)). This takes the likelihood of Y into account and penalises 'obvious' rules where both X and Y are common. For example, bread => milk: if 90% of baskets contain bread and 85% of baskets contain milk, then the lowest support that bread => milk could have is 75%. (At most 10% of baskets contain milk but not bread, and at most 15% contain bread but not milk, so at least 75% must contain both. The maximum is 85%, where all baskets with milk also have bread, 5% have just bread and 10% have neither.) Association Rule Mining March 5, 2009 Slide 12
Lift COMP527: Data Mining Lift: s(X+Y) / (s(X) * s(Y)). If the support for X is 0.25, Y is 0.7, and X+Y is 0.15, then we have: • 0.15 / (0.25 * 0.7) = 0.857 Because this is less than 1, X and Y are negatively correlated. For the bread => milk example: 0.75 / (0.85 * 0.90) = 0.980 --> lift below 1 (negative correlation); 0.85 / (0.85 * 0.90) = 1.111 --> lift above 1 (positive correlation). The break-even point, where lift is exactly 1, is a joint support of 0.85 * 0.90 = 0.765. Association Rule Mining March 5, 2009 Slide 13
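The lift arithmetic on these slides is easy to check numerically. A minimal sketch; the function names lift and joint_support_bounds are invented for this example, and supports are given as fractions.

```python
def lift(s_xy, s_x, s_y):
    """Lift of X => Y from the joint support and the two single supports."""
    return s_xy / (s_x * s_y)


def joint_support_bounds(s_x, s_y):
    """Smallest and largest joint support consistent with s(X) and s(Y)."""
    return max(0.0, s_x + s_y - 1.0), min(s_x, s_y)


# First example on the slide: s(X) = 0.25, s(Y) = 0.7, s(X+Y) = 0.15.
print(round(lift(0.15, 0.25, 0.7), 3))   # 0.857

# bread => milk with s(bread) = 0.90 and s(milk) = 0.85.
low, high = joint_support_bounds(0.90, 0.85)
print(round(low, 3), round(high, 3))     # 0.75 0.85
print(round(lift(low, 0.90, 0.85), 3))   # 0.98
print(round(lift(high, 0.90, 0.85), 3))  # 1.111

# Lift equals 1 when the joint support is s(X) * s(Y).
print(round(0.90 * 0.85, 3))             # 0.765
```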
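Conviction COMP527: Data Mining Conviction expresses the strength of a rule purely in terms of baskets that contain A but not B. “if A then B” is logically equivalent to “not (A and not B)”, so conviction compares how often A and not B would occur together if they were independent with how often they actually do. The formula for conviction is: s(A) * s(not B) / s(A and not B). If every basket that contains A also contains B, the denominator will be 0. Splat. (Treat the conviction as infinite.) Association Rule Mining March 5, 2009 Slide 14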
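A sketch of the conviction formula, applied to the five baskets used earlier in the lecture. Working with raw counts rather than fractions is a choice made here so that the zero-denominator case is detected exactly; the function name conviction is illustrative.

```python
import math

BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def conviction(a, b, baskets):
    """s(A) * s(not B) / s(A and not B), computed from basket counts."""
    a, b = set(a), set(b)
    n = len(baskets)
    n_a = sum(1 for basket in baskets if a <= basket)
    n_not_b = sum(1 for basket in baskets if not b <= basket)
    n_a_not_b = sum(1 for basket in baskets if a <= basket and not b <= basket)
    if n_a_not_b == 0:
        return math.inf  # the rule is never violated: treat as infinite
    # (n_a / n) * (n_not_b / n) / (n_a_not_b / n), simplified.
    return (n_a * n_not_b) / (n * n_a_not_b)


print(conviction({"butter"}, {"bread"}, BASKETS))  # inf
print(conviction({"bread"}, {"butter"}, BASKETS))  # 1.6
```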
Other Evaluation Metrics COMP527: Data Mining [Slide content, apparently a figure or table, was not captured in this transcript.] Association Rule Mining March 5, 2009 Slide 15
Back to Rule Mining COMP527: Data Mining The most common approach to finding rules is: 1. Find sets of 2 or more attributes that occur together in more instances than a minimum support threshold. 2. Generate rules from those sets. The most important thing to note is that any subset of a frequent item set is also frequent. If (bread, milk, butter, beer) is frequent, then (bread, butter, beer) is also frequent because it must occur at least as often as the full set. Association Rule Mining March 5, 2009 Slide 16
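Step 2 can be sketched as follows: split a frequent itemset into every possible left-hand side and right-hand side, and keep the splits whose confidence is high enough. The slides do not spell this step out, so the function rules_from_itemset, its signature and the 80% threshold are assumptions made for illustration, using the five baskets from earlier.

```python
from itertools import combinations

BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


def rules_from_itemset(itemset, baskets, min_confidence):
    """All rules lhs => rhs that split `itemset` and meet `min_confidence`."""
    items = sorted(itemset)
    n_all = count(items, baskets)
    if n_all == 0:
        return []  # the itemset never occurs, so no rule can be supported
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(item for item in items if item not in lhs)
            conf = n_all / count(lhs, baskets)
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


print(rules_from_itemset({"bread", "butter"}, BASKETS, 0.8))
# [(('butter',), ('bread',), 1.0)]
```

Because every left-hand side is a subset of the itemset, it occurs at least as often as the full set, which is why the division is safe once the full set occurs at all.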
Naïve Approach COMP527: Data Mining No problem. Algorithm is obvious: Count all possible itemsets that appear in all transactions. If our transactions are: BC, BD, AC, BCD, ABD, ABCD We count: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD Uhh... And when you have as many different items as a supermarket?? Say 100,000 different products? Ignoring the empty set and the single-item sets, that's 2^100000 - 100000 - 1... You want to know how many that is? Association Rule Mining March 5, 2009 Slide 17
Naïve Approach: BAD!!! COMP527: Data Mining [The slide prints the number out digit by digit: it has 30,103 digits and begins 999002093014384507944032764330...] Association Rule Mining March 5, 2009 Slide 18
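Both halves of the naive approach can be checked in code: brute-force counting on the six small transactions, and the size of the search space for 100,000 products. This is a sketch with illustrative names (count_all_itemsets, n_candidate_itemsets). The digit count is taken from a logarithm instead of converting the integer to a string, since recent Python versions limit int-to-str conversion of very large numbers by default.

```python
import math
from collections import Counter
from itertools import combinations

TRANSACTIONS = ["BC", "BD", "AC", "BCD", "ABD", "ABCD"]


def count_all_itemsets(transactions, min_size=2):
    """Brute force: count every itemset of at least `min_size` items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(min_size, len(items) + 1):
            for itemset in combinations(items, size):
                counts["".join(itemset)] += 1
    return counts


def n_candidate_itemsets(n_items):
    """Itemsets over n items, ignoring the empty set and the singletons."""
    return 2 ** n_items - n_items - 1


counts = count_all_itemsets(TRANSACTIONS)
print(len(counts))    # 11 itemsets, matching the list on the slide
print(counts["BD"])   # 4

print(n_candidate_itemsets(4))  # 11
# Number of decimal digits for 100,000 products.
digits = math.floor(math.log10(n_candidate_itemsets(100_000))) + 1
print(digits)         # 30103
```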
Frequent Itemsets COMP527: Data Mining Let's not try to work out the support for all possible combinations. Subsets of frequent itemsets are frequent: all subsets of a set that meets the minimum support will also necessarily meet the minimum support. So if we know a set is infrequent, any superset of it must also be infrequent. So, instead of trying all combinations, we'll generate itemsets of a particular size and scan the database to see if any of them meet the support threshold. We know that any subsets of frequent sets are also frequent and supersets of infrequent sets are also infrequent, so we don't need to check them. Association Rule Mining March 5, 2009 Slide 19
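The level-by-level idea can be sketched as below. This is a simplified illustration of the pruning principle, not the lecture's own algorithm or a tuned implementation; the function name frequent_itemsets and the use of an absolute count threshold (min_count) are assumptions. The transactions are the six from the naive-approach slide.

```python
from itertools import combinations

TRANSACTIONS = ["BC", "BD", "AC", "BCD", "ABD", "ABCD"]


def frequent_itemsets(transactions, min_count):
    """Level-wise search: size-k candidates are built only from frequent
    (k-1)-itemsets, so supersets of infrequent sets are never counted."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([item]) for item in items]
    k = 1
    while level:
        # Scan the database once for the current candidates.
        counted = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(survivors)
        k += 1
        # Join pairs of survivors into candidates of size k.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))]
    return frequent


result = frequent_itemsets(TRANSACTIONS, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

With a minimum count of 3, the pairs BC and BD survive, and the candidate BCD is pruned without a database scan because its subset CD is infrequent.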
Itemset Lattice COMP527: Data Mining [Figure: itemset lattice in which one itemset is marked as infrequent and all of its supersets are pruned. Lattice borrowed from CSE980 @ MSU.] Association Rule Mining March 5, 2009 Slide 20
A Priori COMP527: Data Mining The algorithm that does this is called A Priori, and most other ARM techniques are based on it. We will look at it in more detail next week. Association Rule Mining March 5, 2009 Slide 21
Issues with WEKA and ARM COMP527: Data Mining ARFF is a horrible, horrible format for ARM. Most datasets are very sparse, with the attributes being present or not present: Bread 0/1, Milk 0/1, etc. We want to record this as {bread, milk, cheese}, not as a huge table of 1s and 0s. Weka doesn't include many ARM algorithms... In fact it has three; thankfully one is A Priori. The book doesn't include much information, but Dunham has good coverage. We'll also look at some other ARM applications built by Frans Coenen and Paul Leng here at Liverpool. Association Rule Mining March 5, 2009 Slide 22
Further Reading COMP527: Data Mining • Witten 4.5 • Dunham 6.1, 6.2 • Han 5.1 • Berry and Browne 14.1-14.3 • Berry and Linoff Chapter 9 • Zhang, Association Rule Mining, Chapter 1, 2.1, 2.2 • Pal and Mitra, 8.3 Association Rule Mining March 5, 2009 Slide 23