Data Mining – Input: Concepts, instances, attributes Chapter 2
Concept • Thing to be learned • Ignore any philosophy about what a concept is • Need description that is • Intelligible – can be understood, and thus can be argued / discussed as to its validity by humans • Operational – it can be applied to future examples • How the concept is expressed is the “concept description” • Concept may differ based on different styles of learning … classification, association, clustering, numeric prediction …
Styles of Learning • Classification – learn way of “classifying” unseen examples – put them in the correct category • Association – learn any association between attributes • Clustering – seek groups of examples that belong together, without pre-classification • Numeric prediction – prediction of numeric quantity instead of category
Classification • “Supervised” – the learning scheme is given the correct classification/class/category for the “training” data • Success is measured by trying out what is learned on independent, previously unseen “test” data (withholding the category/class until checking the program’s answer)
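The train/test idea on this slide can be sketched in a few lines of Python. This is only an illustration: the instances are invented, the 70/30 split is an arbitrary choice, and the "learner" simply predicts the most common training class rather than using any real scheme.

```python
import random
from collections import Counter

# Hypothetical labelled instances: (attribute value, class). Invented data.
data = [("sunny", "no"), ("rainy", "yes"), ("overcast", "yes"), ("sunny", "yes"),
        ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"), ("rainy", "yes"),
        ("overcast", "yes"), ("sunny", "yes")]

rng = random.Random(0)              # fixed seed so the split is repeatable
shuffled = data[:]
rng.shuffle(shuffled)
split = len(shuffled) * 7 // 10     # assumed 70% training, 30% test
train, test = shuffled[:split], shuffled[split:]

# "Learn" from the training data only: here, just its most common class.
majority_class = Counter(label for _, label in train).most_common(1)[0][0]

# The test labels are withheld until now and used only to score the answer.
correct = sum(1 for _, label in test if label == majority_class)
accuracy = correct / len(test)
print(f"predicts {majority_class!r} for everything; test accuracy = {accuracy:.2f}")
```

The point of the sketch is the separation: nothing from `test` is consulted while learning, so the accuracy reflects performance on unseen examples.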
Supervision • Classification and numeric prediction are “supervised” • Association and Clustering are “unsupervised”
Inputs – What’s in an Example? • Input is a set of instances (records/examples) • Each instance has a set of values for pre-determined attributes (like a record in a DB) • I.e. input is like a single DB table, or “flat file” • There may be things we’d like to learn that don’t fit into this simple structure – but most current tools handle only this simple form of input • You may find it useful sometimes to “denormalize” a DB – do a JOIN of two or more tables to produce a flat file (just make sure you don’t simply re-learn the primary or foreign keys!)
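A minimal sketch of the denormalizing JOIN, using Python's built-in `sqlite3` module. The `student` and `enrollment` tables and their contents are hypothetical, invented only to show the shape of the operation.

```python
import sqlite3

# Hypothetical two-table schema, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, major TEXT, resident INTEGER);
    CREATE TABLE enrollment (student_id INTEGER REFERENCES student(student_id),
                             course TEXT, grade TEXT);
    INSERT INTO student VALUES (1, 'CS', 1), (2, 'Math', 0);
    INSERT INTO enrollment VALUES (1, 'DB', 'A'), (1, 'AI', 'B'), (2, 'DB', 'C');
""")

# JOIN the tables into one flat set of instances. The key column is used
# only to match rows and is left out of the SELECT list, so a learner
# cannot simply memorise student_id.
flat_rows = conn.execute("""
    SELECT s.major, s.resident, e.course, e.grade
    FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    ORDER BY s.student_id, e.course
""").fetchall()
conn.close()

for row in flat_rows:
    print(row)
```

Each student's attributes are repeated once per enrollment row, which is the price of a flat file: the redundancy the normalized design removed comes back.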
Attributes • Flat file format means that all examples are expected to have values for the same attributes • Some attributes may be irrelevant for some examples • Some attributes’ relevance may depend on the value of another attribute • Usual workaround – irrelevant attributes have a special “irrelevant” value
Kinds of attributes • Binary/boolean – two valued; e.g. Resident Student? • Nominal/categorical/enumerated/discrete – multiple valued, unordered; e.g. Major • Ordinal – ordered, but no sense of distance between values • e.g. Fr, So, Jr, Sr • e.g. Household Income: 1 = < 15K, 2 = 15-20K, 3 = 20-25K, 4 = 25-30K, 5 = 30-40K, 6 = 40-50K, 7 = > 50K • Interval – ordered, distance is measurable; e.g. birth year • Ratio – an actual measurement with a defined zero point, such that we could say that one value is double, triple, or ½ of another; e.g. GPA
Kinds of Attributes • Many algorithms cannot handle all of those different types of attributes • One approach – • Treat binary and nominal as nominal • Treat ordinal, interval, and ratio as “numeric” • Requires coding ordinals such as Fr, So, etc. as numbers
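Coding an ordinal as a number can be as simple as a lookup table. The sketch below assumes the codes 1 to 4 for the class-standing example; any increasing sequence would preserve the order equally well, which is exactly why the distances between codes should not be over-interpreted.

```python
# Class standing is ordinal: the order is meaningful, the spacing is not.
# The specific codes 1..4 are an assumption made for illustration.
YEAR_CODES = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}

def encode_year(label: str) -> int:
    """Map an ordinal class-standing label to its integer rank."""
    return YEAR_CODES[label]

students = ["So", "Fr", "Sr", "Jr"]
coded = [encode_year(s) for s in students]
print(coded)
```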
Preparing the Data • Preparing the data “usually consumes the bulk of the effort invested in the entire data mining process” • Real data is frequently low quality • Data Cleaning is frequently necessary and time consuming
Preparing the Data • Integrating data from multiple sources • E.g. data from different departments – marketing, sales, billing, customer service • E.g. sometimes outside data is valuable – economic conditions, weather data • Challenges – different coding conventions, different time periods, different aggregations, different keys, different kinds of errors • Point of intersection with Data Warehousing – this work needs to be done for BOTH! • May need to iterate to get right
Preparing the Data • Standard format – any tool needs data to be in some standard format • Weka tool requires data to be in ARFF format
ARFF Format • Lines beginning with % are comments • File starts with name of the relation • Attributes are defined • Nominal attributes are followed by the set of values • Numeric attributes list the keyword “numeric” • No identification of class to be predicted – flexible • Beginning of data is flagged with @data • Data itself is comma delimited (easily created from Access or Excel) • Missing values are represented with a ?
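To make the layout concrete, here is a small Python sketch that builds ARFF text following the rules listed on this slide. The `to_arff` helper, the `student` relation, and its attributes are all hypothetical, and the sketch assumes attribute names and nominal values contain no spaces or commas, so it does no quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, spec) pairs, where spec is the string
    "numeric" or a list of allowed nominal values.
    rows: list of tuples; None marks a missing value.
    """
    lines = ["% Hypothetical student data, for illustration only",
             "@relation " + relation,
             ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attributes are followed by their set of values.
            lines.append("@attribute " + name + " {" + ", ".join(spec) + "}")
    lines += ["", "@data"]
    for row in rows:
        # Comma-delimited data; missing values are written as "?".
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

attributes = [("resident", ["yes", "no"]), ("major", ["CS", "Math"]), ("gpa", "numeric")]
rows = [("yes", "CS", 3.4), ("no", None, 2.9)]
arff_text = to_arff("student", attributes, rows)
print(arff_text)
```

Note that nothing in the file says which attribute is the class; that choice is made later, in the tool.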
Data Preparation • You need to understand machine learning schemes before using them for data mining • Some schemes treat numerics as ordinals and only compare < > = • Others treat numerics as ratios and perform distance and other measurements • If distance measurements are to be made, avoid scheme if datasets contain ordinals that distort distances (e.g. income example earlier) • Distance between nominals is frequently all or nothing (0 or 1) • If scheme only deals with nominals, any numerics need to be converted to nominals (e.g. age converted to young, mid, old) (some info is lost) • If dataset has nominals that are coded as integers, don’t confuse the scheme by marking them numeric
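Two of the points above, converting a numeric to a nominal and the all-or-nothing distance between nominals, can be sketched as follows. The cut points at 30 and 60 are assumptions chosen only for illustration; sensible boundaries depend on the domain.

```python
# Hypothetical cut points (30 and 60); real boundaries depend on the domain.
def age_to_nominal(age: float) -> str:
    """Convert a numeric age into one of three nominal values."""
    if age < 30:
        return "young"
    if age < 60:
        return "mid"
    return "old"

def nominal_distance(a: str, b: str) -> int:
    """All-or-nothing distance: 0 if the values match, 1 otherwise."""
    return 0 if a == b else 1

ages = [29, 31, 59]
labels = [age_to_nominal(a) for a in ages]
print(labels)

# Information is lost: 29 and 31 are two years apart but land in different
# categories, while 31 and 59 are 28 years apart and become identical.
print(nominal_distance(labels[0], labels[1]))
print(nominal_distance(labels[1], labels[2]))
```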
Normalization • Some schemes require all numeric attributes to be on a similar scale – thus normalize or standardize (a different sense of the term than DB normalization) • One normalization approach: Norm val = (val – min value for attribute) / (max value for attribute – min value for attribute) • One standardization approach: Stand val = (val – mean) / SD
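Both rescalings are short enough to sketch directly. The function names are made up for this example, and two choices are assumptions the slide does not settle: the sample standard deviation (`statistics.stdev`) is used for SD, and a constant attribute is mapped to 0.0 to avoid dividing by zero.

```python
from statistics import mean, stdev

def min_max_normalize(values):
    """Rescale to [0, 1]: (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information; map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and SD 1: (val - mean) / SD, using the sample SD."""
    m = mean(values)
    sd = stdev(values)  # raises StatisticsError for fewer than two values
    return [(v - m) / sd for v in values]

incomes = [10, 20, 30, 40, 50]  # invented values
normalized = min_max_normalize(incomes)
standardized = standardize(incomes)
print(normalized)
print(standardized)
```

After normalization the smallest value is 0 and the largest is 1; after standardization the values have mean 0 and standard deviation 1, but no fixed range.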
Missing Values • In real datasets, missing values are frequently coded with an implausible value (e.g. –1, 999999) • Sometimes different types of missing values are distinguished – unknown vs. unrecorded vs. not applicable vs. … • Missing values may have meaning – • e.g. income may be left blank more often by people whose income is particularly high or low • e.g. in diagnosis, a particular test may not need to be done for a particular case • Get a data-knowledgeable person involved • Most machine learning schemes assume that a missing value is not particularly meaningful • If meaningful, need to let the scheme know …
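One possible way to handle sentinel codes is sketched below. The codes in `MISSING_CODES` are hypothetical (the real ones have to come from someone who knows the data), and the separate `income_missing` flag is just one option for letting a scheme see that a value was absent.

```python
# Hypothetical sentinel codes; the real ones must come from the data owner.
MISSING_CODES = {-1, 999999}

def clean_value(value):
    """Return None for a sentinel code, otherwise the value unchanged."""
    return None if value in MISSING_CODES else value

incomes = [42000, -1, 58000, 999999, 31000]  # invented values
cleaned = [clean_value(v) for v in incomes]

# If missingness may be meaningful, keep it as an attribute of its own
# so the learning scheme can use it.
income_missing = [v is None for v in cleaned]

print(cleaned)
print(income_missing)
```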
Inaccurate Values • Errors and omissions may be more important to mining algorithms than to the source system • Misspelling of nominal attribute values may suggest incorrect possible values • Typos or incorrect measurement may yield numeric outliers • Find via graphing / involve a data-knowledgeable person • Duplicate records – confuse the scheme by giving heavier weight to the duplicated instances • Deliberate mis-entry occurs (e.g. a supermarket checkout clerk entering their own bonus card)
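A few of these checks can be automated before anyone looks at a graph. The sketch below uses invented records and assumes a 0.0 to 4.0 GPA scale; it only flags suspects, and a data-knowledgeable person still has to decide what each one means.

```python
from collections import Counter

# Invented records: (major, GPA).
records = [
    ("CS", 3.2),
    ("Math", 3.8),
    ("CS", 3.2),      # exact duplicate of the first record
    ("Math", 3.5),
    ("Mathh", 2.9),   # likely misspelling of "Math"
    ("CS", 32.0),     # likely typo for 3.2
]

# Nominal values that occur only once are candidates for misspellings.
major_counts = Counter(major for major, _ in records)
rare_majors = sorted(m for m, n in major_counts.items() if n == 1)

# Records that appear more than once would get extra weight in learning.
row_counts = Counter(records)
duplicates = [row for row, n in row_counts.items() if n > 1]

# Numeric values outside the assumed valid range (0.0 to 4.0 GPA scale).
out_of_range = [row for row in records if not 0.0 <= row[1] <= 4.0]

print(rare_majors)
print(duplicates)
print(out_of_range)
```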
Data Age • We are frequently using data to predict the future • At some point, the world / business has changed enough that the data is no longer appropriate for that
Getting to Know Your Data • Several points above reflect this need • Graphic display of data can help find problems (e.g. outliers, large numbers of unknown value (e.g. 9999), typos of nominals) • Domain knowledgeable people are valuable – explain anomalies, missing values, coding schemes. • Data cleaning is extremely important. • At least look at some records to see what is going on • “Time spent looking at your data is always time well spent”