Slides for “Data Mining” by I. H. Witten and E. Frank
Input: Concepts, instances, attributes • Terminology • What’s a concept? • Classification, association, clustering, numeric prediction • What’s in an example? • Relations, flat files, recursion • What’s in an attribute? • Nominal, ordinal, interval, ratio • Preparing the input • ARFF, attributes, missing values, getting to know data
Terminology • Components of the input: • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Instances: the individual, independent examples of a concept • Note: more complicated forms of input are possible • Attributes: measuring aspects of an instance • We will focus on nominal and numeric ones
What’s a concept? • Styles of learning: • Classification learning: predicting a discrete class • Association learning: detecting associations between features • Clustering: grouping similar instances into clusters • Numeric prediction: predicting a numeric quantity • Concept: thing to be learned • Concept description: output of learning scheme
Classification learning • Example problems: weather data, contact lenses, irises, labor negotiations • Classification learning is supervised • Scheme is provided with actual outcome • Outcome is called the class of the example • Measure success on fresh data for which class labels are known (test data) • In practice success is often measured subjectively
Association learning • Can be applied if no class is specified and any kind of structure is considered “interesting” • Differences from classification learning: • Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time • Hence: far more association rules than classification rules • Thus: constraints are necessary • Minimum coverage and minimum accuracy
Clustering • Finding groups of items that are similar • Clustering is unsupervised • The class of an example is not known • Success often measured subjectively
Numeric prediction • Classification learning, but “class” is numeric • Learning is supervised • Scheme is provided with the target value • Measure success on test data
What’s in an example? • Instance: specific type of example • Thing to be classified, associated, or clustered • Individual, independent example of target concept • Characterized by a predetermined set of attributes • Input to learning scheme: set of instances/dataset • Represented as a single relation/flat file • Rather restricted form of input • No relationships between objects • Most common form in practical data mining
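To make the flat-file idea concrete, here is a minimal sketch (not part of the original slides) of a few instances represented as rows of a single table in plain Python; the attribute names and values follow the weather data referred to elsewhere in these slides.

    # A minimal sketch: the dataset is a single flat table, one row per instance,
    # one column per attribute (values taken from the weather data).
    instances = [
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high", "windy": False, "play": "no"},
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high", "windy": True,  "play": "no"},
        {"outlook": "overcast", "temperature": "hot",  "humidity": "high", "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "mild", "humidity": "high", "windy": False, "play": "yes"},
    ]
    # Every instance is described by the same fixed set of attributes; there are
    # no links between rows, which is what "flat file" means here.
    attributes = list(instances[0].keys())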
A family tree • [figure: family tree with couples Peter = Peggy and Grace = Ray; children Steven, Graham, Pam, Ian, Pippa, Brian; Pam = Ian with daughters Anna and Nikki]
The “sister-of” relation • [table: pairs of people from the family tree labelled as sister-of or not] • Closed-world assumption: pairs not listed as positive examples are assumed to be negative
Generating a flat file • Process of flattening is called “denormalization” • Several relations are joined together to make one • Possible with any finite set of finite relations • Problematic: relationships without a pre-specified number of objects • Example: the concept of a nuclear family • Denormalization may produce spurious regularities that reflect the structure of the database • Example: “supplier” predicts “supplier address”
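A small illustrative sketch of denormalization using pandas: two hypothetical relations (“orders” and “suppliers”, with invented columns) are joined into one flat table, mirroring the supplier / supplier-address example above.

    import pandas as pd

    # Hypothetical relations; names and values are invented for illustration.
    orders = pd.DataFrame({"order_id": [1, 2, 3],
                           "supplier": ["A", "A", "B"],
                           "amount": [100, 250, 80]})
    suppliers = pd.DataFrame({"supplier": ["A", "B"],
                              "supplier_address": ["12 Elm St", "9 Oak Ave"]})

    # Denormalization: join the relations into a single flat table.
    flat = orders.merge(suppliers, on="supplier", how="left")
    # Note the spurious regularity: "supplier" now determines "supplier_address"
    # in every row, purely because of the database structure.
    print(flat)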
Recursion • Infinite relations require recursion • Appropriate techniques are known as “inductive logic programming” (e.g. Quinlan’s FOIL) • Problems: (a) noise and (b) computational complexity
What’s in an attribute? • Each instance is described by a fixed predefined set of features, its “attributes” • But: number of attributes may vary in practice • Possible solution: “irrelevant value” flag • Related problem: existence of an attribute may depend on the value of another one • Possible attribute types (“levels of measurement”): • Nominal, ordinal, interval and ratio
Nominal quantities • Values are distinct symbols • Values themselves serve only as labels or names • Nominal comes from the Latin word for name • Example: attribute “outlook” from weather data • Values: “sunny”, “overcast”, and “rainy” • No relation is implied among nominal values (no ordering or distance measure) • Only equality tests can be performed
Ordinal quantities • Impose order on values • But: no distance between values defined • Example: attribute “temperature” in weather data • Values: “hot” > “mild” > “cool” • Note: addition and subtraction don’t make sense • Example rule: temperature < hot → play = yes • Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
Interval quantities • Interval quantities are not only ordered but measured in fixed and equal units • Example 1: attribute “temperature” expressed in degrees Fahrenheit • Example 2: attribute “year” • Difference of two values makes sense • Sum or product doesn’t make sense • Zero point is not defined!
Ratio quantities • Ratio quantities are ones for which the measurement scheme defines a zero point • Example: attribute “distance” • Distance between an object and itself is zero • Ratio quantities are treated as real numbers • All mathematical operations are allowed • But: is there an “inherently” defined zero point? • Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)
Attribute types used in practice • Most schemes accommodate just two levels of measurement: nominal and ordinal • Nominal attributes are also called “categorical”, “enumerated”, or “discrete” • But: “enumerated” and “discrete” imply order • Special case: dichotomy (“boolean” attribute) • Ordinal attributes are called “numeric” or “continuous” • But: “continuous” implies mathematical continuity
Transforming ordinal to boolean • Simple transformation allows ordinal attribute with n values to be coded using n–1 boolean attributes • Example: attribute “temperature” • Better than coding it as a nominal attribute • [table: original ordinal data → transformed boolean attributes]
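A minimal sketch of the n–1 boolean coding, assuming the ordering cool < mild < hot from the earlier temperature example; variable and attribute names are illustrative.

    # Ordinal values, lowest first; an attribute with n values becomes
    # n-1 boolean attributes of the form "temperature > threshold".
    order = ["cool", "mild", "hot"]
    values = ["hot", "cool", "mild", "hot"]   # one value per instance

    encoded = []
    for v in values:
        rank = order.index(v)
        encoded.append({f"temperature > {threshold}": rank > i
                        for i, threshold in enumerate(order[:-1])})

    print(encoded[0])   # {'temperature > cool': True, 'temperature > mild': True}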
Metadata • Information about the data that encodes background knowledge • Can be used to restrict search space • Examples: • Dimensional considerations(i.e. expressions must be dimensionally correct) • Circular orderings(e.g. degrees in compass) • Partial orderings(e.g. generalization/specialization relations)
Preparing the input • Problem: different data sources (e.g. sales department, customer billing department, …) • Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors • Data must be assembled, integrated, cleaned up • “Data warehouse”: consistent point of access • Denormalization is not the only issue • External data may be required (“overlay data”) • Critical: type and level of data aggregation
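For reference, an ARFF file (the input format listed in the outline and discussed next) is a plain-text header declaring each attribute, followed by comma-separated data rows. The fragment below uses the weather data for illustration; the particular values shown are only an example.

    % Illustrative fragment of the weather data in ARFF format.
    % Nominal attributes list their allowed values; numeric ones are declared "numeric".
    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    % a missing value would be written as ?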
Attribute types • ARFF supports numeric and nominal attributes • Interpretation depends on learning scheme • Numeric attributes are interpreted as • ordinal scales if less-than and greater-than are used • ratio scales if distance calculations are performed (normalization/standardization may be required) • Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise) • Integers: nominal, ordinal, or ratio scale?
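A sketch of the 0/1 distance convention for nominal values mentioned above, combined with a plain numeric difference; this is an illustration in Python, not the implementation of any particular scheme, and in practice numeric attributes would be normalized or standardized first.

    # Distance convention used by many instance-based schemes:
    # nominal attributes contribute 0 if equal, 1 otherwise;
    # numeric attributes contribute their (ideally normalized) difference.
    def attribute_distance(a, b):
        if isinstance(a, str) or isinstance(b, str):   # treat strings as nominal
            return 0.0 if a == b else 1.0
        return abs(a - b)                              # numeric: plain difference

    def instance_distance(x, y):
        return sum(attribute_distance(a, b) for a, b in zip(x, y))

    print(instance_distance(["sunny", 85], ["rainy", 80]))   # 1.0 + 5 = 6.0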
Nominal vs. ordinal • Attribute “age” nominal: only equality tests are possible (e.g. age = young) • Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”): comparisons such as age ≤ pre-presbyopic can be used in rules
Missing values • Frequently indicated by out-of-range entries • Types: unknown, unrecorded, irrelevant • Reasons: • malfunctioning equipment • changes in experimental design • collation of different datasets • measurement not possible • Missing value may have significance in itself (e.g. missing test in a medical examination) • Most schemes assume that is not the case → “missing” may need to be coded as an additional value
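A small sketch, using pandas, of coding a missing value as an additional value when missingness may itself be informative; the column name “test_result” and the replacement label are hypothetical.

    import pandas as pd

    # None marks the missing entries in this toy column.
    data = pd.DataFrame({"test_result": ["positive", None, "negative", None]})

    # Make the missingness explicit as its own value rather than leaving a hole.
    data["test_result"] = data["test_result"].fillna("not_tested")
    print(data["test_result"].value_counts())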
Inaccurate values • Reason: data has typically not been collected for the purpose of mining it • Result: errors and omissions that don’t affect the original purpose of the data (e.g. age of customer) • Typographical errors in nominal attributes → values need to be checked for consistency • Typographical and measurement errors in numeric attributes → outliers need to be identified • Errors may be deliberate (e.g. wrong zip codes) • Other problems: duplicates, stale data
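One common way to flag candidate outliers in a numeric attribute is an interquartile-range rule; the sketch below (pandas, with illustrative values) is just one possible check, not a method prescribed by the slides.

    import pandas as pd

    # Flag values more than 1.5 interquartile ranges outside the quartiles.
    age = pd.Series([23, 31, 27, 45, 38, 29, 240])   # 240 looks like a data-entry error
    q1, q3 = age.quantile(0.25), age.quantile(0.75)
    iqr = q3 - q1
    outliers = age[(age < q1 - 1.5 * iqr) | (age > q3 + 1.5 * iqr)]
    print(outliers)   # 240 is flagged for manual inspection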
Getting to know the data • Simple visualization tools are very useful • Nominal attributes: histograms (Distribution consistent with background knowledge?) • Numeric attributes: graphs (Any obvious outliers?) • 2-D and 3-D plots show dependencies • Need to consult domain experts • Too much data to inspect? Take a sample!
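A brief sketch of a first look at the data with pandas: frequency counts for a nominal attribute, a quick histogram for a numeric one, and a random sample when the full dataset is too large to inspect; the small example frame is illustrative.

    import pandas as pd

    df = pd.DataFrame({"outlook": ["sunny", "sunny", "overcast", "rainy"],
                       "temperature": [85, 80, 83, 70]})

    print(df["outlook"].value_counts())          # distribution consistent with expectations?
    df["temperature"].hist()                     # any obvious outliers? (needs matplotlib)
    sample = df.sample(frac=0.5, random_state=0) # inspect a manageable subset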