Lee McCluskey, room 3/10
Email lee@hud.ac.uk
http://scom.hud.ac.uk/scomtlm/cha2555/
AI Week 15
Machine Learning: Data Mining: Association Rule Mining, Associative Classification, Applications
Last Week
• Data Mining as inducing rule classifiers from classified training examples.
Association Rule Mining (ARM)
• This is an “unsupervised learning activity”: briefly, looking for strong associations between features in data.
• Definitions: a transactional database is a set of “transactions”, e.g. the details of individual sales.
• A transaction can be thought of as an “item-set” where each item is an attribute-value pair, e.g.
• {height = 6, temp = 20, weather = warm}
• As a special case we could have nominal item-sets, e.g.
• {bread, cheese, milk}
Association Rule Mining (ARM): Important Definitions
• An association rule is an expression
• X => Y
• where X and Y are item-sets and, in the standard definition, X and Y have no items in common.
• The support of an association rule is defined as the proportion of transactions in the database that contain
• X ∪ Y (i.e. every item of X and every item of Y).
• The confidence of an association rule is defined as the probability that a transaction contains Y given that it contains X, that is
• confidence(X => Y) = (no. of transactions containing X ∪ Y) / (no. of transactions containing X)
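The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `support` and `confidence` and the `baskets` data are assumptions made up for the example, and each transaction is modelled as a Python set of items.

```python
def support(transactions, itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Transactions containing X and Y, divided by transactions containing X."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if xy <= set(t))
    # Convention chosen here: confidence is 0.0 when X never occurs.
    return n_xy / n_x if n_x else 0.0


# Hypothetical nominal transactions (not taken from the slides).
baskets = [
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"cheese"},
    {"bread", "cheese"},
]

# Rule {bread} => {milk}: support is that of {bread, milk}.
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Note that the support of a rule X => Y is simply the support of the combined item-set, so one `support` function serves both item-sets and rules.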
Aims of ARM
• Given a transactional database D, the association rule problem is to find all rules whose support and confidence are greater than user-specified thresholds, denoted minimum support (MinSupp) and minimum confidence (MinConf), respectively.
• The aim is the discovery of the most significant associations between the items in a transactional data set. This process primarily involves the discovery of so-called frequent item-sets, i.e. item-sets whose support in the transactional data set is above MinSupp. Rules are then generated from the frequent item-sets and kept if their confidence is above MinConf.
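The two-phase process described above (find frequent item-sets, then derive rules from them) might be sketched as follows. This is a simple level-wise search written for illustration only; the names `frequent_itemsets` and `generate_rules` and the `baskets` data are assumptions, and the thresholds are applied with `>=`, which is a choice made for this sketch.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_supp):
    """Level-wise search: map each frequent item-set to its support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    while candidates:
        level = {}
        for c in candidates:
            supp = sum(1 for t in transactions if c <= t) / n
            if supp >= min_supp:
                level[c] = supp
        frequent.update(level)
        # Join frequent k-item-sets into candidate (k+1)-item-sets.
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == len(a) + 1})
    return frequent


def generate_rules(frequent, min_conf):
    """Split each frequent item-set into X => Y and keep confident rules."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for x in combinations(sorted(itemset), k):
                x = frozenset(x)
                # Every subset of a frequent item-set is itself frequent,
                # so its support has already been recorded.
                conf = supp / frequent[x]
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical transactions (not taken from the slides).
baskets = [
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"cheese"},
    {"bread", "cheese"},
]
freq = frequent_itemsets(baskets, min_supp=0.5)
found = generate_rules(freq, min_conf=0.8)
for x, y, supp, conf in found:
    print(sorted(x), "=>", sorted(y), supp, conf)
```

The search relies on the fact that an item-set can only be frequent if all of its subsets are, which is why candidates are built only from item-sets that passed the previous level.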
Example
• A trader deals in the following currencies in a series of 8 transactions:
1  Sterling, Yen, Dollar, Euro
2  Dollar, Euro, Rand, Sterling, Ruble
3  Pesos, Euro, Ruble, Rupee, Yen
4  Rupee, Sterling, Ruble, Euro, Dollar
5  Sterling, Dinars, Rand, Yen
6  Pesos, Kroner, Sterling, Dollar
7  Ruble, Rupee, Kroner, Sterling, Pesos
8  Dollar, Euro, Sterling
• What are the SUPPORT and CONFIDENCE of the following rules?
• {Ruble} → {Rupee}
• {Sterling, Euro} → {Ruble}
• {Sterling, Euro} → {Ruble, Pesos}
• Find an association rule from the set of transactions that has
- at least 2 items in its antecedent,
- better support and better confidence than each of the rules above.
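One way to check answers to this exercise is to count directly over the eight transactions. The sketch below is an illustration under the definitions given earlier; the helper name `rule_stats` is an assumption, and the last rule in the list is just one candidate answer to the final question, not necessarily the one intended in the lecture.

```python
# The eight transactions from the example, one set per transaction.
transactions = [
    {"Sterling", "Yen", "Dollar", "Euro"},
    {"Dollar", "Euro", "Rand", "Sterling", "Ruble"},
    {"Pesos", "Euro", "Ruble", "Rupee", "Yen"},
    {"Rupee", "Sterling", "Ruble", "Euro", "Dollar"},
    {"Sterling", "Dinars", "Rand", "Yen"},
    {"Pesos", "Kroner", "Sterling", "Dollar"},
    {"Ruble", "Rupee", "Kroner", "Sterling", "Pesos"},
    {"Dollar", "Euro", "Sterling"},
]


def rule_stats(x, y):
    """Return (support, confidence) of the rule x => y."""
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    supp = n_xy / len(transactions)
    conf = n_xy / n_x if n_x else 0.0
    return supp, conf


rules_to_check = [
    ({"Ruble"}, {"Rupee"}),
    ({"Sterling", "Euro"}, {"Ruble"}),
    ({"Sterling", "Euro"}, {"Ruble", "Pesos"}),
    ({"Sterling", "Euro"}, {"Dollar"}),  # one candidate answer
]
for x, y in rules_to_check:
    print(sorted(x), "->", sorted(y), rule_stats(x, y))
```

On this data the three given rules come out at support 3/8 with confidence 3/4, support 2/8 with confidence 1/2, and support 0 with confidence 0, while the candidate {Sterling, Euro} → {Dollar} has support 4/8 and confidence 1.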
Example
• [Figure: diagram of the eight transactions for the rule X => Y: Ruble => Rupee, with regions labelled X and X ∪ Y]
Associative Classification
• If we fuse ARM and classification rule mining we get “Associative Classification”: use the association technique, but learn about particular items or item-sets.
• Associative Classification is a branch of data mining that combines classification and association rule mining. In other words, it utilises association rule discovery methods in classification data sets.
• Typically:
• Find Association Rules using ARM
• Sift out the “Class Association Rules”, the ones that have the class of interest on their right-hand sides
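The sifting step can be sketched as a filter over mined rules. Everything named here is hypothetical: the rule tuples `(antecedent, consequent, support, confidence)`, the `attribute=value` item encoding, and the example rules are made up for illustration. The `classify` function goes one step beyond the slide and shows one common way to use the resulting rules (pick the matching rule with the highest confidence); treat it as an assumption rather than the method taught here.

```python
def class_association_rules(rules, class_attribute):
    """Keep rules whose right-hand side is exactly one item of the class attribute."""
    prefix = class_attribute + "="
    return [r for r in rules
            if len(r[1]) == 1 and next(iter(r[1])).startswith(prefix)]


def classify(record, cars, default=None):
    """Predict the class using the most confident rule whose antecedent fits the record."""
    matching = [r for r in cars if r[0] <= record]
    if not matching:
        return default
    # Rank by confidence, then by support (an assumed tie-break).
    best = max(matching, key=lambda r: (r[3], r[2]))
    return next(iter(best[1])).split("=", 1)[1]


# Hypothetical mined rules: (antecedent, consequent, support, confidence).
mined = [
    (frozenset({"outlook=sunny"}), frozenset({"play=no"}), 0.3, 0.9),
    (frozenset({"outlook=sunny"}), frozenset({"humidity=high"}), 0.4, 0.8),
    (frozenset({"windy=false"}), frozenset({"play=yes"}), 0.4, 0.7),
]
cars = class_association_rules(mined, "play")
print(len(cars))
print(classify({"outlook=sunny", "windy=false"}, cars))
```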
Validation in Rule Discovery
• Multi-stage Data Mining “pipelines” are prone to various kinds of error and bias.
• The integrity of the data at each stage of the DM process and the reliability of the results are particularly important.
• DM usually uses “cross validation”, where the data is split into a training set and a testing set, and the results of the data miner applied to the training set are checked against the testing set. This is not really applicable to rule discovery.
• Key idea: look for trends/associations in the data that are output from the process and that represent known associations in the application domain.
DM Application 1: Discovering trends from patient data in the area of Diabetic Retinopathy
• Diabetic Retinopathy: damage to the eyes caused by diabetes, sometimes leading to blindness.
• A HUGE problem, as diabetes is on the increase. If you are a long-term diabetic then you are very likely to suffer some retina damage.
• Clinics keep large amounts of data on patients who are treated in various ways, over long periods of time.
Diabetic Retinopathy Application
• Data on 20,000 patients over 18 years
• Much data cleaning and inference precedes mining: replacing missing values, handling noise, anomalies etc.
• The focus is on a smaller number of patients with a yearly screening (the timestamp) over a period of 4+ years
• Attribute examples (there are several hundred):
• Age_at_Exam,
• Present_Treatment,
• calculated_age_at_diagnosis,
• Retinopathy_in_R_Eye (RE_RET),
• Retinopathy_in_L_Eye (RE_RET),
• calculated_diabetes_type,
• calculated_diabetes_duration
Trend Mining
• Item-sets that have an increasing support over a series of time-stamped instances (events) are called “emerging patterns”.
• The changing support for sets of items during each event can indicate trends in the data. For example, the presence of a particular treatment over a period of time may lead to the alleviation of a symptom.
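The idea of tracking support across time-stamped events can be sketched as below. The function names, the item encoding and the three yearly screenings are all hypothetical, and "increasing" is interpreted here as strictly increasing from one event to the next, which is an assumption; a real system might tolerate small dips.

```python
def support_series(events, itemset):
    """Support of `itemset` within each time-stamped event (a list of transactions)."""
    return [sum(1 for t in txs if itemset <= t) / len(txs) for txs in events]


def is_emerging(series):
    """True if support rises strictly from each event to the next."""
    return len(series) > 1 and all(a < b for a, b in zip(series, series[1:]))


# Hypothetical data: three yearly screenings, four patient records each.
year1 = [{"treatment=drugX", "RET=low"}, {"treatment=drugX", "RET=high"},
         {"RET=high"}, {"RET=low"}]
year2 = [{"treatment=drugX", "RET=low"}, {"treatment=drugX", "RET=low"},
         {"RET=high"}, {"RET=low"}]
year3 = [{"treatment=drugX", "RET=low"}, {"treatment=drugX", "RET=low"},
         {"treatment=drugX", "RET=low"}, {"RET=high"}]

pattern = {"treatment=drugX", "RET=low"}
series = support_series([year1, year2, year3], pattern)
print(series, is_emerging(series))
```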
Diabetic Retinopathy Application
• Aim: to find trends in the data, e.g. (fictitious example):
• calculated_diabetes_duration > Y &
• Age_at_Exam in [60,70] &
• Present_Treatment = drugX &
• calculated_age_at_diagnosis in [50,60] =>
• Retinopathy_in_R_Eye (RE_RET) = low
• Retinopathy_in_L_Eye (RE_RET) = low
• Increasing trend ..
• “People who have had diabetes for a certain length of time, whose age is in their 60s, who were diagnosed in their 50s, and who have been taking drugX, often have low DR levels.”
• An increasing trend adds support for the association.
Example in Road Traffic Control Data
• Numeric data record from individual CARS
• (date, time, position, actual speed, expected speed)
• Textual data of INCIDENTS
• (date, time start, time cleared, position, severity, road type, area, incident category, cause, road-effect, traffic-effect, reporter ..)
• Data sources
• ANPR, Mobile Phones, Road (Vehicle) Sensors, Environment Sensors
Applications in Road Traffic Control
• associations between variations in speeds and near-future incidents
• effect of a particular type of incident (e.g. roadworks) on average speeds on nearby trunk roads
• looking for predictors of "heavy/slow traffic" incidents: look for associations with speed variations or accidents on roads downstream from the incident position (hence causing the incident)
• looking for associations between speeds around a bypass and a later "heavy traffic" incident within the town bypassed
Conclusions
• Data Mining is a powerful set of techniques to help discover hidden knowledge.
• It can be supervised or unsupervised.
• Association Rule Mining and Associative Classification are important classes of technique used in DM.