Data Mining -- Clustering

Data Mining -- Clustering, Dr. Avi Rosenfeld

The general idea: similar things are similar (items with similar features tend to share a label or a group) • How do we group similar things? • Regression, Classification (Supervised): k-NN • Clustering (Unsupervised): k-means • Partitioning algorithms (k-means), hierarchical algorithms • Open question: how to define "closeness" • Euclidean distance • Manhattan distance (Judea Pearl) • Many other options
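The two distance measures named above can be sketched in a few lines. This is a minimal illustration, assuming points are given as equal-length tuples of numbers; the function names `euclidean` and `manhattan` are chosen here and do not come from the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 1.5), (4.0, 5.5)   # illustrative points, differ by (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The same pair of points is "closer" or "farther" depending on the measure, which is why the choice of distance is listed as an open question.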

How should the question mark be classified?

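K-Nearest Neighbor • The classification is computed at query time (model free) • The number of neighbors must be chosen • Neighbors are usually weighted by their distance from the query point • CBR (Case Based Reasoning) is a similar approach • In classification, the label is decided by majority vote (or by a vote weighted by proximity) • In regression, the value is the average of the neighbors' values (or an average weighted by proximity)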
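The bullets above can be sketched as follows. This is a minimal illustration, not the lecturer's code: it assumes the training set is a list of `(point, target)` pairs, uses Euclidean distance via `math.dist` (Python 3.8+), and uses inverse-distance weights for regression as one possible weighting. The names `nearest`, `knn_classify`, and `knn_regress` are hypothetical.

```python
import math
from collections import Counter

def nearest(train, query, k):
    """Return the k (distance, target) pairs closest to the query point."""
    scored = [(math.dist(x, query), y) for x, y in train]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def knn_classify(train, query, k):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k, eps=1e-9):
    """Regression: inverse-distance-weighted average of the k nearest targets.
    eps avoids division by zero when the query coincides with a training point."""
    neighbors = nearest(train, query, k)
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Illustrative data (invented for this sketch).
labeled = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(labeled, (1.5, 1.5), 3))  # A
```

Note that there is no training step: all the work happens when a query arrives, which is what "model free" refers to.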

1-Nearest Neighbor

3-Nearest Neighbor

k NEAREST NEIGHBOR • Choosing the value of k: • If k is too small, the result is sensitive to noise points • If k is too large, the neighborhood may include points from other classes • Choose an odd value for k to avoid ties (in a two-class problem) • k = 1: the query belongs to the square class • k = 3: the query belongs to the triangle class • k = 7: the query belongs to the square class • Source: ICDM, Top Ten Data Mining Algorithms, k nearest neighbor classification, December 2006
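The effect of k can be reproduced with a small sketch. The seven points below are invented for illustration and are not the points in the slide's figure; they are placed so that the single nearest neighbor is a square, two of the three nearest are triangles, and five of all seven are squares.

```python
import math
from collections import Counter

def knn_label(train, query, k):
    """Majority label among the k training points nearest to the query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Hypothetical data: distances from the query are 1, 2, 2.5, 3, 3.5, 4, 4.5.
train = [
    ((1.0, 0.0), "square"),
    ((0.0, 2.0), "triangle"),
    ((-2.5, 0.0), "triangle"),
    ((3.0, 0.0), "square"),
    ((0.0, -3.5), "square"),
    ((4.0, 0.0), "square"),
    ((0.0, 4.5), "square"),
]
query = (0.0, 0.0)

for k in (1, 3, 7):
    print(k, knn_label(train, query, k))
# 1 square
# 3 triangle
# 7 square
```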

Remarks • + Highly effective inductive inference method for noisy training data and complex target functions • + The target function for the whole space may be described as a combination of less complex local approximations • + Learning is very simple • - Classification is time consuming

The basic clustering algorithm: K-means • 1. Choose the desired number of clusters, K • 2. From the sample population (the points), choose K random points. These points are the initial cluster centers (seeds) • 3. Compute the Euclidean distance from every point to each of the chosen centers • 4. Assign each point to its nearest center. This yields K disjoint clusters • 5. In each cluster, set a new center by computing the mean of all points in the cluster • 6. If every center equals its previous value, the process is done; otherwise go back to step 3
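The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of floats, the random seed and the `max_iter` safety cap are additions of this sketch, and an empty cluster simply keeps its old center. The names `kmeans` and `mean_point` are hypothetical.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: random seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # steps 3-4: nearest center
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = [                          # step 5: recompute the means
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:               # step 6: nothing moved, stop
            break
        centers = new_centers
    return centers, clusters

# Illustrative data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Different seeds can lead to different results on less clearly separated data, since the algorithm only finds a local optimum.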

An example with 6 points

Iteration 1 • Points 1 and 3 were chosen at random as the centers C1 and C2 • Points 1, 2 are assigned to center C1. Points 3, 4, 5, 6 are assigned to center C2 • Distance formula: $\text{Distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Choosing new centers • For C1 • X=(1.0+1.0)/2=1.0 • Y=(1.5+4.5)/2=3.0 • For C2 • X=(2.0+2.0+3.0+5.0)/4.0=3.0 • Y=(1.5+3.5+2.5+6.0)/4.0=3.375

Iteration 2 • The new centers: C1(1.0, 3.0), C2(3.0, 3.375) • Points 1, 2, 3 join C1; points 4, 5, 6 join C2
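The two iterations can be checked with a short script. The point coordinates are not listed explicitly in the transcript; the values below are an assumption, reconstructed by pairing the X and Y values in the centroid calculations in the order they appear. The helper names `assign` and `centroids` are hypothetical.

```python
import math

# Assumed coordinates, inferred from the averages shown on the slide.
points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}

def assign(centers):
    """Map each center index to the ids of the points nearest to that center."""
    groups = {i: [] for i in range(len(centers))}
    for pid, p in points.items():
        best = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[best].append(pid)
    return groups

def centroids(groups):
    """Mean X and mean Y of the points in each (non-empty) group."""
    result = []
    for ids in groups.values():
        xs = [points[i][0] for i in ids]
        ys = [points[i][1] for i in ids]
        result.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return result

centers = [points[1], points[3]]   # iteration 1 seeds: points 1 and 3
g1 = assign(centers)
print(g1)                          # {0: [1, 2], 1: [3, 4, 5, 6]}
centers = centroids(g1)
print(centers)                     # [(1.0, 3.0), (3.0, 3.375)]
g2 = assign(centers)
print(g2)                          # {0: [1, 2, 3], 1: [4, 5, 6]}
```

Under these assumed coordinates the script reproduces the assignments and centers stated for iterations 1 and 2.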

The final result

Problems with k-means • The user must define K in advance • Assumes that a mean can be computed • Very sensitive to outliers • Outliers are points that lie far from the other points • An outlier may simply be an error... • Source: CS583, Bing Liu, UIC
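A small sketch shows why the mean makes k-means sensitive to outliers. The points are invented for illustration; `centroid` is a hypothetical helper that computes the coordinate-wise mean, as the center-update step of k-means does.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))                    # (1.5, 1.5)

# One far-away point drags the center outside the original group.
print(centroid(cluster + [(20.0, 20.0)]))   # (5.2, 5.2)
```

With the outlier included, the center no longer sits among the four original points, so later assignments can shift as well.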

Example of an outlier

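Euclidean distance • Euclidean distance: $d(i,j) = \sqrt{\sum_{m=1}^{n} (x_{im} - x_{jm})^2}$ for points $i$ and $j$ with $n$ coordinates • Properties of a metric $d(i,j)$: • $d(i,j) \ge 0$ • $d(i,i) = 0$ • $d(i,j) = d(j,i)$ • $d(i,j) \le d(i,k) + d(k,j)$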

Hierarchical Clustering • Produces a nested sequence of clusters (a tree), also called a dendrogram.

Types of hierarchical clustering • Agglomerative (bottom up) clustering: builds the dendrogram (tree) from the bottom level • Merges the most similar (or nearest) pair of clusters • Stops when all the data points are merged into a single cluster (i.e., the root cluster) • Divisive (top down) clustering: starts with all data points in one cluster, the root • Splits the root into a set of child clusters; each child cluster is recursively divided further • Stops when only singleton clusters remain, i.e., each cluster contains a single data point

Agglomerative clustering • It is more popular than divisive methods • At the beginning, each data point forms a cluster (also called a node) • Merge the nodes/clusters that have the least distance between them • Go on merging • Eventually all nodes belong to one cluster
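The merging loop can be sketched as follows. This is a naive illustration, not the algorithm text from the original slide: it assumes Euclidean distance between points and uses single link (closest pair of points) as the default cluster distance. The names `single_link` and `agglomerative` are hypothetical.

```python
import math

def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, cluster_distance=single_link):
    """Merge the two nearest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]          # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]],
                                                          clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Illustrative data: two tight pairs far from each other.
history = agglomerative([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)])
for left, right in history:
    print(left, "+", right)
```

Read from first to last, the merge history is the dendrogram from the leaves up to the root. The sketch recomputes all pairwise cluster distances on every pass, which is fine for small inputs only.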

Agglomerative clustering algorithm

An example: working of the algorithm

Measuring the distance between two clusters • There are several ways to measure the distance between two clusters • Each choice results in a different variation of the algorithm • Single link • Complete link • Average link • Centroids • …

Single link method • The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster • It can find arbitrarily shaped clusters, but • It may cause the undesirable "chain effect" due to noisy points • (Figure: two natural clusters are split into two)

Complete link method • The distance between two clusters is the distance between the two furthest data points in the two clusters, one data point from each cluster • It is sensitive to outliers because they are far away
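The single link and complete link definitions differ only in taking the minimum or the maximum over all cross-cluster pairs. A minimal sketch, assuming Euclidean distance between points and invented example clusters; the function names are hypothetical.

```python
import math

def single_link(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Distance between the two furthest points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (10.0, 0.0)]   # (10, 0) plays the role of an outlier

print(single_link(a, b))    # 2.0
print(complete_link(a, b))  # 10.0
```

The outlier at (10, 0) does not affect the single link distance here, but it alone determines the complete link distance.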

EM Algorithm • Initialize K cluster centers • Iterate between two steps • Expectation step: assign points to clusters • Maximization step: estimate model parameters
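The two steps can be sketched for one concrete model. The slide does not name a model, so the following assumes a one-dimensional mixture of Gaussians with soft (probabilistic) assignments; the initial variances of 1.0, the equal initial weights, the fixed iteration count, and the names `gaussian` and `em_1d` are all choices of this sketch.

```python
import math

def gaussian(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(data, initial_means, n_iter=50):
    k = len(initial_means)
    mus = list(initial_means)          # initialize K cluster centers
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # Expectation step: soft assignment of each point to each cluster.
        resp = []
        for x in data:
            likes = [w * gaussian(x, m, v)
                     for w, m, v in zip(weights, mus, variances)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # Maximization step: re-estimate each cluster's parameters.
        for j in range(k):
            rj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / rj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / rj
            variances[j] = max(var, 1e-6)   # floor keeps the density finite
            weights[j] = rj / len(data)
    return mus, variances, weights

# Illustrative data: two groups centered near 1 and 5.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
mus, variances, weights = em_1d(data, initial_means=[0.0, 6.0])
print([round(m, 2) for m in mus])
```

If the expectation step assigned each point entirely to its nearest center instead of computing probabilities, the loop would reduce to the k-means procedure.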
