Data and its Distribution
The popular table
• Table (relation)
  • propositional, attribute-value
• Example
  • record, row, instance, case
• Table represents a sample from a larger population
  • independent, identically distributed
• Attribute
  • variable, column, feature, item
• Target attribute, class
• Sometimes rows and columns are swapped
  • bioinformatics
Example: play tennis data
[Table: play tennis data, annotated with attributes (columns), examples (rows), and the target attribute]
Example: play tennis data
• Rule: if Outlook = sunny and Humidity = high then play = no
• three examples covered, 100% correct
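The coverage and accuracy of such a rule can be checked by filtering the table. The sketch below assumes the slide's table is the standard 14-row play-tennis dataset (the table itself is not reproduced in this transcript) and keeps only the three columns the rule uses. The names `ROWS` and `evaluate_rule` are illustrative.

```python
# Assumed: the standard 14-row play-tennis data, reduced to the columns
# the rule refers to: (outlook, humidity, play).
ROWS = [
    ("sunny", "high", "no"),
    ("sunny", "high", "no"),
    ("overcast", "high", "yes"),
    ("rain", "high", "yes"),
    ("rain", "normal", "yes"),
    ("rain", "normal", "no"),
    ("overcast", "normal", "yes"),
    ("sunny", "high", "no"),
    ("sunny", "normal", "yes"),
    ("rain", "normal", "yes"),
    ("sunny", "normal", "yes"),
    ("overcast", "high", "yes"),
    ("overcast", "normal", "yes"),
    ("rain", "high", "no"),
]


def evaluate_rule(rows):
    """Apply 'if Outlook = sunny and Humidity = high then play = no'.

    Returns (number of examples covered, fraction of covered examples
    for which the predicted class 'no' is correct).
    """
    covered = [r for r in rows if r[0] == "sunny" and r[1] == "high"]
    correct = [r for r in covered if r[2] == "no"]
    return len(covered), len(correct) / len(covered)


n_covered, accuracy = evaluate_rule(ROWS)
print(n_covered, accuracy)
```

Under that assumption the rule covers three rows and all three have play = no, which matches the figures on the slide.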
Numeric tennis data
[Table: tennis data with numeric attributes]
Numeric tennis data
• if Outlook = sunny and Humidity > 83 then play = no
• if Temperature < Humidity then play = no
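Numeric attributes allow two kinds of conditions: comparing an attribute with a threshold, and comparing two attributes with each other. A minimal sketch of both rule conditions as predicates follows. The three records are hypothetical values made up for illustration, not the slide's table, and the function names are illustrative too.

```python
# Hypothetical records for illustration only; not the slide's data.
records = [
    {"outlook": "sunny", "temperature": 85, "humidity": 90, "play": "no"},
    {"outlook": "sunny", "temperature": 70, "humidity": 65, "play": "yes"},
    {"outlook": "rain", "temperature": 68, "humidity": 80, "play": "yes"},
]


def rule_threshold(r):
    """Condition of: if Outlook = sunny and Humidity > 83 then play = no."""
    return r["outlook"] == "sunny" and r["humidity"] > 83


def rule_compare(r):
    """Condition of: if Temperature < Humidity then play = no."""
    return r["temperature"] < r["humidity"]


def covered(rule, rows):
    """Return the rows for which the rule's condition holds."""
    return [r for r in rows if rule(r)]


print(len(covered(rule_threshold, records)))
print(len(covered(rule_compare, records)))
```

Neither condition is expressible with nominal attributes, which support only equality tests.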
Types
• Nominal, categorical, symbolic, discrete
  • only equality (=)
  • no distance measure
• Numeric
  • inequalities (<, >, ≤, ≥)
  • arithmetic
  • distance measure
• Ordinal
  • inequalities
  • no arithmetic or distance measure
• Binary
  • like nominal, but with only two values, and True (1, yes, y) plays a special role
Univariate (probability) distribution
• What values occur for an attribute and how often?
  • count occurrences
• Counts are complete information about sample
  • actual data can be ignored from here on
• Data is a sample of a population
  • counts are probability estimates
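Counting occurrences and turning the counts into probability estimates can be sketched as follows. The sample values and the function name `distribution` are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def distribution(values):
    """Estimate P(value) for one attribute as count / sample size."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


# Hypothetical sample of a single attribute.
sample = ["yes", "no", "yes", "yes"]
probs = distribution(sample)
print(probs)
```

Once the counts are known, the individual rows are no longer needed to describe this one attribute.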
Attribute information: entropy
• How informative is an attribute?
  • (How informative is an attribute about the value of another attribute?)
  • if an attribute is not informative, it cannot be informative about another
• Entropy
  • a measure for the amount of information/chaos
[Figure: usefulness versus entropy, with 1 bit marked; example attributes: do you own a Mercedes?, social security nr., gender, highest degree]
Distribution of a Binary Attribute
• Only two values
  • probabilities p and 1 − p
• Entropy: H(A) = −p lg(p) − (1 − p) lg(1 − p)
  • lg(p) is the 2-log of p
• H(A) is maximal when p = ½ = 1/m (m is the number of values)
  • uniform distribution
  • e.g., gender
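The binary entropy formula can be evaluated directly. This is a minimal sketch. The function name `binary_entropy` is illustrative, and the usual convention 0 · lg(0) = 0 is assumed for p = 0 and p = 1, which the slide does not state.

```python
import math


def binary_entropy(p):
    """H(A) = -p lg(p) - (1 - p) lg(1 - p), in bits."""
    if p in (0.0, 1.0):
        # Assumed convention: 0 * lg(0) = 0, so a certain outcome has zero entropy.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


print(binary_entropy(0.5))   # uniform case, e.g. a fair coin flip
print(binary_entropy(0.01))  # a rare "yes", far from uniform
```

The value at p = ½ is the maximum of 1 bit. Any skew toward one of the two values lowers the entropy.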
Entropy, Binary case
[Figure: binary entropy, H(A) = −p lg(p) − (1 − p) lg(1 − p); example attributes: gender, coin flip, …; do you own a Mercedes?; do you own a car?; are you an alien?]
Distribution of nominal attribute
• Multiple values (m)
  • each with probability p_i
• Entropy: H(A) = Σ −p_i lg(p_i)
  • notice binary as special case
• H is maximal when p_i = 1/m for all i
  • uniform distribution
  • H_max = −m · (1/m) · lg(1/m) = −lg(1/m) = lg m
• e.g. season of booking date
  • m = 4
  • at most lg(m) = lg(4) = 2 bits
• Q: what if only summer and winter?
[Figure: bar chart]
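The general formula and the season example can be checked numerically. This is a minimal sketch. The function name `entropy` is illustrative, zero probabilities are skipped under the assumed convention 0 · lg(0) = 0, and the skewed distribution is a made-up example.

```python
import math


def entropy(probs):
    """H(A) = sum over i of -p_i lg(p_i), in bits; zero probabilities are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


four_seasons = [0.25, 0.25, 0.25, 0.25]  # uniform over m = 4 values
two_seasons = [0.5, 0.0, 0.5, 0.0]       # only summer and winter, equally likely
skewed = [0.7, 0.1, 0.1, 0.1]            # hypothetical non-uniform distribution

print(entropy(four_seasons))
print(entropy(two_seasons))
print(entropy(skewed))
```

With four equally likely seasons the entropy reaches the maximum lg(4) = 2 bits. If only two seasons occur, the attribute effectively has m = 2 values, so its entropy is at most lg(2) = 1 bit.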