Learning
• Changes in the content and organization of a system's knowledge that enable it to improve its performance on a task (Simon)
• Acquire new knowledge from the environment
• Organize its current knowledge
• Inductive Inference
  • Draw a general conclusion from examples
  • Infer an association between input and output, with some confidence
• Incremental vs. Batch
General Model of a Learning Agent (figure): the environment feeds the sensors; a critic compares what is sensed against a performance standard and gives feedback to the learning module, which changes the performance module's knowledge and sets learning goals; a problem generator proposes exploratory actions, and the performance module drives the effectors. (From Artificial Intelligence: A Modern Approach by Russell and Norvig.)
Classification of Inductive Learning
• Supervised Learning
  • given training examples: correct input-output pairs
  • recover the unknown function from data generated by that function
  • generalization ability for unseen data
  • classification: the function is discrete
  • concept learning: the output is binary
• Unsupervised Learning
Classification of Inductive Learning
• Supervised Learning
• Unsupervised Learning
  • no correct input-output pairs; needs another source for determining correctness
  • reinforcement learning: yes/no answer only (example: chess playing)
  • Clustering: group data into clusters with common characteristics
  • Map Learning: explore unknown territory
  • Discovery Learning: uncover new relationships
Data Mining
• Definition of data mining (data mining): the task of extracting previously unknown, implicit, and potentially useful information from large amounts of real data
Cf. KDD (Knowledge Discovery in Databases): the entire process of extracting knowledge from data; data mining ⊂ KDD
Data Mining Technology (I) (figure): data mining and KDD draw on several related fields: expert systems, machine learning, databases, statistics, and visualization.
Data Mining Technology (II)
• Primary tasks of data mining
  • Classification
  • Clustering
  • Characterization (Summarization)
  • Trend analysis
  • Association rule discovery (market basket analysis)
  • Pattern analysis
  • Estimation
  • Prediction
Data Mining Technology (III)
• Application areas
  • Marketing & Retail
  • Banking
  • Finance
  • Insurance
  • Medicine & Health (Genetics)
  • Quality control
  • Transportation
  • Geo-spatial applications
Data Mining Tasks (1)
• Classification: assign objects to predefined classes
• Examples
  • News ⇒ [international] [domestic] [sports] [culture]…
  • Objects ⇒ [large] [medium] [small]
Data Mining Tasks (2)
• Classification - continued
  • Credit application ⇒ [high] [medium] [low]
  • Water sample ⇒ [grade 1] [grade 2] … [polluted]
• Algorithms
  • Decision trees, memory-based reasoning
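As a concrete illustration of memory-based reasoning, below is a minimal nearest-neighbor classifier sketch in Python; the credit-application features (income, debt), the class labels, and the training records are invented for illustration.

```python
from collections import Counter

def knn_classify(query, training_data, k=3):
    """Classify `query` by majority vote among its k nearest training examples.
    training_data is a list of (feature_vector, class_label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(training_data, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical credit-application data: (income, debt) -> risk class
train = [((70, 10), "low"), ((20, 30), "high"), ((50, 20), "medium"),
         ((80, 5), "low"), ((25, 40), "high"), ((55, 15), "medium")]
print(knn_classify((60, 12), train))   # -> "medium" for these made-up records
```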
Data Mining Tasks (3)
• Estimation: map attributes (attr1, attr2, attr3, …) to a continuous value (cf. classification maps to discrete categories)
• Examples
  • age, gender, blood pressure… ⇒ remaining life expectancy
  • age, gender, occupation… ⇒ annual income
  • region, water volume, population ⇒ pollution level
• Algorithm: neural network
• Estimating a future value is called Prediction
Data Mining Tasks (4)
• Association (market basket analysis): determine which things go together
• Example
  • shopping list ⇒ cross-selling (supermarket shelf, catalog, CF…, home shopping, e-shopping…)
• Association rules
Data Mining Tasks (5)
• Clustering: divide a heterogeneous population into homogeneous subgroups (clusters), e.g. G1, G2, G3, G4
  • cf. classification uses predefined categories; clustering finds new categories and explains them
Data Mining Tasks (6)
• Clustering - continued
• Examples
  • Symptoms ⇒ Disease
  • Customer information ⇒ Selective sales
  • Soil (water quality) data
Note: clustering depends on the features used; e.g. for cards: number, color, suit…
Data Mining Tasks (7)
• Clustering - continued
• Clustering is useful for finding exceptions
  • calling card fraud detection
  • credit card fraud, etc.
• Algorithm: k-means -> k clusters
Note: directed vs. non-directed KDD
Data Mining Technology (IV)
• Data mining techniques
  • association rules
  • k-nearest neighbor
  • decision trees
  • neural networks
  • genetic algorithms
  • statistical techniques
Market Basket Analysis (Associations) (1/10) O: Orange Juice M: Milk S: Soda W: Window Cleaner D: Detergent
Market Basket Analysis (Associations) (2/10)
• Co-occurrence table
Market Basket Analysis (Associations) (3/10)
• {S, O}: co-occurrence count of 2
  • R1: if S then O
  • R2: if O then S
• Support: what percentage of all baskets contains the itemset?
• Confidence: of the baskets containing the LHS, what percentage satisfies the rule?
  e.g. support of R1 = 2/5 = 40%; confidence of R1 = 2/3; confidence of R2 = 2/4
• Support and confidence determine how good a rule is
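A small Python sketch of the support and confidence calculations; the five baskets below are hypothetical, chosen only so that the counts match the slide (S and O co-occur in 2 of 5 baskets, S appears in 3, O in 4).

```python
def support(itemset, baskets):
    """Fraction of all baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Among baskets containing the LHS, the fraction that also contain the RHS."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

# Hypothetical baskets consistent with the slide's counts.
baskets = [{"O", "M", "S"}, {"O", "S", "W"}, {"S", "D"},
           {"O", "M"}, {"O", "D", "W"}]

print(support({"S", "O"}, baskets))          # 0.4   (2/5, support of R1)
print(confidence({"S"}, {"O"}, baskets))     # 0.667 (2/3, confidence of R1)
print(confidence({"O"}, {"S"}, baskets))     # 0.5   (2/4, confidence of R2)
```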
Market Basket Analysis (Associations) (4/10) • Probability Table {A, B, C}
Market Basket Analysis (Associations) (5/10)
• R1: if A ∧ B then C
• R2: if A ∧ C then B
• R3: if B ∧ C then A
• Confidence and support for each rule (support count = 5)
Market Basket Analysis (Associations) (6/10)
• R3 has the best confidence (0.33), but is it GOOD?
  Note: R3: if B ∧ C then A has confidence 0.33, while A alone occurs with probability 0.45
  e.g. "a person with long hair ⇒ woman"
• Improvement -> how good is the rule compared to random guessing?
Market Basket Analysis (Associations) (7/10)
• improvement = P(condition and result) / ( P(condition) × P(result) )
• criterion: improvement > 1
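The improvement (lift) formula as code; it reuses the same hypothetical baskets as the support/confidence sketch above, and a value above 1 means the rule does better than random guessing.

```python
def support(itemset, baskets):
    return sum(itemset <= b for b in baskets) / len(baskets)

def improvement(lhs, rhs, baskets):
    """P(LHS and RHS) / (P(LHS) * P(RHS)); > 1 beats random guessing."""
    return support(lhs | rhs, baskets) / (support(lhs, baskets) * support(rhs, baskets))

# Same hypothetical baskets as before.
baskets = [{"O", "M", "S"}, {"O", "S", "W"}, {"S", "D"}, {"O", "M"}, {"O", "D", "W"}]
print(improvement({"S"}, {"O"}, baskets))   # ~0.83: no better than random here
```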
Market Basket Analysis (Associations) (8/10)
• Some issues
  • overall algorithm: build co-occurrence matrices for 1 item, 2 items, 3 items, etc. -> complex!!
  • pruning, e.g. minimum support pruning
  • virtual items: season, store, geographic information combined with real items
    e.g. if OJ ∧ Milk ∧ Friday then Beer
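A brute-force sketch of the level-wise itemset enumeration with minimum-support pruning mentioned above; the minimum support threshold and maximum itemset size are arbitrary choices, and real Apriori additionally extends only the itemsets that were frequent at the previous level, which is what keeps the search tractable as data grows.

```python
from itertools import combinations

def support(itemset, baskets):
    return sum(itemset <= b for b in baskets) / len(baskets)

def frequent_itemsets(baskets, min_support=0.4, max_size=3):
    """Count 1-item, 2-item, ... itemsets and keep those meeting minimum support."""
    items = sorted({i for b in baskets for i in b})
    frequent = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(set(combo), baskets)
            if s >= min_support:          # minimum-support pruning
                frequent[combo] = s
    return frequent

baskets = [{"O", "M", "S"}, {"O", "S", "W"}, {"S", "D"}, {"O", "M"}, {"O", "D", "W"}]
print(frequent_itemsets(baskets))
```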
Market Basket Analysis (Associations) (9/10)
• Level of description: how specific? (Drink > Soda > Coke)
• Strengths
  - explainability
  - undirected data mining
  - variable-length data
  - simple computation
Market Basket Analysis (Associations) (10/10)
• Weaknesses
  - becomes complex as data grows
  - limited data types (attributes)
  - difficult to determine the right number of items
  - rare items --> pruned away
Clustering Algorithm (1/2)
• k-means method (MacQueen '67) - many variations
• Algorithm
  1. Choose initial k points (seeds)
  2. Assign each record to its closest seed (initial clusters)
  3. Find the centroid of each cluster and use it as the new seed
  4. Go to step 2; stop when nothing changes
Clustering Algorithm (2/2)
• Finding neighbors: assign each point to its nearest seed
• Finding the centroid of points (x1, y1), (x2, y2), …, (xn, yn):
  centroid = ( (x1 + … + xn) / n , (y1 + … + yn) / n )
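A minimal k-means sketch for 2-D points following the four steps above; the sample points and the choice of k are invented, and empty clusters are handled naively by keeping their previous centroid.

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain k-means on 2-D points: seed, assign to nearest centroid, recompute."""
    centroids = random.sample(points, k)                    # step 1: initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # step 2: nearest centroid
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        new_centroids = []
        for i, c in enumerate(clusters):                    # step 3: recompute centroids
            if c:
                new_centroids.append((sum(x for x, _ in c) / len(c),
                                      sum(y for _, y in c) / len(c)))
            else:
                new_centroids.append(centroids[i])          # keep seed for an empty cluster
        if new_centroids == centroids:                      # step 4: stop when no change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```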
Variations of k-means
1. Use a probability density rather than simple distance, e.g. Gaussian mixture models
2. Weighted distance
3. Agglomeration method - hierarchical clustering
Agglomerative Algorithm
1. Start with every single record as its own cluster (N clusters)
2. Select the two closest clusters and combine them (N-1 clusters)
3. Go to step 2
4. Stop at the right level (number of clusters)
What does "closest" mean?
Distance between clusters
• 3 measures
  1. Single linkage: distance between the closest members
  2. Complete linkage: distance between the most distant members
  3. Distance between centroids
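A sketch of the agglomerative procedure using single or complete linkage; Euclidean distance, the sample points, and the stopping cluster count are assumptions made for illustration.

```python
def agglomerative(points, target_clusters=2, linkage="single"):
    """Start with one cluster per point, repeatedly merge the closest pair of clusters."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def cluster_dist(c1, c2):
        pair_dists = [dist(p, q) for p in c1 for q in c2]
        # single linkage: closest members; complete linkage: most distant members
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    clusters = [[p] for p in points]                 # step 1: every record is a cluster
    while len(clusters) > target_clusters:           # steps 2-4: merge until the target level
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

print(agglomerative([(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)], target_clusters=3))
```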
Clustering
• Strengths
  1. Undirected knowledge discovery
  2. Works with categorical, numeric, and textual data
  3. Easy to apply
• Weaknesses
  1. Can be difficult to choose the right (distance) measure & weights
  2. Sensitive to initial parameters
  3. Can be hard to interpret
Decision Tree (contact lens) (figure): the root splits on tear production (reduced -> none; normal -> test astigmatism); astigmatism = no -> soft; astigmatism = yes -> test spectacle prescription (myope -> hard, hypermetrope -> none).
Concept Learning (figure): classification learns a function mapping an input to one of Class 1 … Class n; concept learning maps an input to a yes/no decision about a single concept, e.g. via a decision tree: red ⇒ good customer.
Weather data (attributes × instances)
• Legend: s: sunny, o: overcast, r: rainy; h: hot, m: mild, c: cool; h: high, n: normal (humidity)
Decision Tree for weather (1/4) (figure): the root splits on outlook (sunny / overcast / rainy); the sunny branch tests humidity (high -> no, normal -> yes), the overcast branch is yes, and the rainy branch tests windy (true -> no, false -> yes).
Rule read off the tree: if outlook = sunny and humidity = high then play = no
Decision Tree for weather (2/4)
Note: temperature and humidity can be numeric data, e.g.
  temp > 30 -> hot
  10 <= temp <= 30 -> normal
  temp < 10 -> cool
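A tiny helper showing how such a numeric attribute could be discretized into the hot/normal/cool bands given above; the function name is an assumption.

```python
def discretize_temp(temp_celsius):
    """Map a numeric temperature to the categories used in the weather tree."""
    if temp_celsius > 30:
        return "hot"
    if temp_celsius >= 10:
        return "normal"
    return "cool"

print([discretize_temp(t) for t in (35, 22, 4)])   # ['hot', 'normal', 'cool']
```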
Decision Tree for weather (3/4)
• Attribute types
  • nominal (categorical, discrete)
  • ordinal (ordered categories)
  • interval (e.g. [10, 20])
  • ratio (real numbers)
Decision Tree for weather (4/4)
Note: leaf nodes don't have to be yes/no --> general classification, e.g. the contact-lens tree (split on tear production, then astigmatism) whose leaves are none / soft / hard.
Prediction using Decision Trees (figure): split the data into a training set, a test set, and an evaluation set; build candidate trees A, B, C on the training set, choose the best one (say B) on the test set, and use the evaluation set to predict its expected performance on real data.
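A sketch of the three-way split described above; the 60/20/20 ratios and the fixed random seed are assumptions, not part of the original procedure.

```python
import random

def split_data(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and split them into training / test / evaluation sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]                      # build candidate trees here
    test = shuffled[n_train:n_train + n_test]       # choose the best tree here
    evaluation = shuffled[n_train + n_test:]        # estimate real-world performance here
    return train, test, evaluation

train, test, evaluation = split_data(list(range(100)))
print(len(train), len(test), len(evaluation))       # 60 20 20
```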
Box Diagram of Decision Tree (figure): the weather decision tree redrawn as nested boxes partitioned by outlook (sunny / overcast / rain), humidity, and windy, with each box holding the yes/no labels of the training records that fall into it.
The effect of pruning (figure): error rate vs. depth of tree for training data and unseen data; training error keeps falling with depth while error on unseen data starts rising again, so prune at the depth where the unseen-data error is lowest.
• Some issues
  • where to prune? too high -> unnecessarily complex; too low -> lose information
  • what to split on (first)?
Error Rate
• Example node labels: y y y n y y n y, error rate er = 2/7
• Adjusted error rate of a tree: AE(T) = E(T) + α · leaf_count(T)
• If a subtree T1 of T satisfies AE(T1) <= AE(T), prune all branches that are not part of T1
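The adjusted error rate written as a small function; representing a tree simply by its error count and leaf count, and the particular value of α, are assumptions made for illustration.

```python
def adjusted_error_rate(errors, n_records, leaf_count, alpha=0.05):
    """AE(T) = E(T) + alpha * leaf_count(T): raw error rate plus a complexity penalty."""
    return errors / n_records + alpha * leaf_count

# Hypothetical numbers: the full tree makes fewer raw errors but has many more leaves.
full_tree = adjusted_error_rate(errors=2, n_records=7, leaf_count=8)
pruned    = adjusted_error_rate(errors=3, n_records=7, leaf_count=3)
print(pruned <= full_tree)   # True here, so pruning to the smaller tree is preferred
```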
Possible sub trees for weather data (1/2): which attribute should be split on first? (figure): candidate first splits (a) and (b) on outlook (sunny / overcast / rainy) and on temperature (hot / mild / cool), each branch annotated with the yes/no labels of the records it receives.
Possible sub trees for weather data (2/2) (figure): candidate first splits (c) on windy (true / false) and (d) on humidity (high / normal), again annotated with the yes/no labels under each branch.
Information Theory & Entropy
• info([2,3]) = 0.971 bits, info([4,0]) = 0.0 bits, info([3,2]) = 0.971 bits
• info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
• gain(outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
• gain(temp) = 0.029 bits, gain(humidity) = 0.152 bits, gain(windy) = 0.048 bits
Calculating info(x) - entropy
• if either #yes or #no is 0, then info(x) = 0
• if #yes = #no, then info(x) takes its maximum value
• covers the multi-class case, e.g. info([2,3,4]) = info([2,7]) + (7/9) × info([3,4])
• entropy(p1, p2, …, pn) = -p1·log p1 - p2·log p2 - … - pn·log pn
• info([2,3,4]) = entropy(2/9, 3/9, 4/9)
               = -(2/9)·log(2/9) - (3/9)·log(3/9) - (4/9)·log(4/9)
               = [-2·log 2 - 3·log 3 - 4·log 4 + 9·log 9] / 9
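The same entropy and information-gain calculations as short Python functions; the class counts are the ones from the weather example above, and the function names are just illustrative.

```python
from math import log2

def info(counts):
    """entropy(p1, ..., pn) = -sum(pi * log2(pi)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def weighted_info(partitions):
    """Average entropy of the subsets produced by a split, weighted by subset size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

print(round(info([9, 5]), 3))                                             # 0.940
print(round(weighted_info([[2, 3], [4, 0], [3, 2]]), 3))                  # 0.693
print(round(info([9, 5]) - weighted_info([[2, 3], [4, 0], [3, 2]]), 3))   # gain(outlook) = 0.247
```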
Algorithms: CART, C4.5
• CART - binary trees only (Breiman '84)
• C4.5 (Quinlan), extending ID3 ('86)
• Clementine
• NCR
• CHAID (Hartigan '75)