8. Cluster Analysis
Cluster Analysis • Introduction ✔ • Measure of Distance • Hierarchical Clustering • K-means clustering • Density based clustering • Cluster Validation • Kernel K-Means Algorithm
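Your genes can reveal your surname Studying how Chinese surnames and same-surname populations are distributed may become an important new route, and a scientific basis, for investigating the origins of the Chinese people and the evolution of paternally inherited genetic material. Researchers collected more than a million blood-type records gathered over several decades. After computer-based cluster analysis, they found that the regional distribution of blood types, enzymes, and proteins in blood samples from different populations closely matches the regional distribution of surnames. This shows that the distribution of Chinese surnames is stable. The finding points to the possible existence of "surname genes" and indicates that studying how surnames are inherited can help locate the distinctive genes of particular surname groups.
The ancestors of Europeans may have been seven women Scientists at the University of Oxford recently reported that modern Europeans descend from seven female ancestors, and that more than 99% of Europeans are descendants of one of the seven. According to The Times, Professor Bryan Sykes, a human geneticist at Oxford, reached this conclusion by studying the DNA (deoxyribonucleic acid) of mitochondria in the cells of Europeans. Apart from occasional mutations, mitochondrial DNA is passed unchanged from mother to child. Professor Sykes took cell samples from 6,000 randomly selected volunteers and analyzed the characteristics of their mitochondrial DNA. The data fell clearly into seven groups, each belonging to one female ancestor. According to Professor Sykes, the seven women arrived in Europe one after another about 45,000 years ago and their descendants multiplied there. The study also offers new evidence that humans originated in Africa: grouped by mitochondrial DNA, present-day Africans fall into three "families", one of which is closely related by blood to the seven female ancestors of Europeans.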
Types of Clusters: Well-Separated • Well-Separated Clusters: • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
Types of Clusters: Center-Based • Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters
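The difference between a centroid and a medoid can be seen in a small sketch. The point set and the function names below are assumptions made for illustration, and Euclidean distance is assumed as the dissimilarity.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points; it need not be one of the points."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))

def medoid(points):
    """The actual data point with the smallest total distance to all the others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one far-away point.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]

print(centroid(cluster))  # (4.0, 0.0)
print(medoid(cluster))    # (2.0, 0.0)
```

In this example the far-away point pulls the centroid to a location where no data point lies, while the medoid stays on an observed point.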
Types of Clusters: Contiguity-Based • Contiguous Cluster (Nearest neighbor or Transitive) • A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters
Types of Clusters: Density-Based • Density-based • A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. • Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters
Types of Clusters: Conceptual Clusters • Shared Property or Conceptual Clusters • Finds clusters that share some common property or represent a particular concept. 2 Overlapping Circles
How many clusters? • The notion of a cluster can be ambiguous: the same set of points can be read as two, four, or six clusters.
Cluster Analysis • Introduction ✔ • Measure of Distance ✔ • Hierarchical Clustering • K-means clustering • Density based clustering • Cluster Validation • Kernel K-Means Algorithm
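Classification As the saying goes, birds of a feather flock together. But on what basis should things be grouped? For example, the counties of China can be grouped in many ways: • by natural conditions, such as precipitation, land, sunshine, and humidity; • by indicators such as income, education level, medical care, and infrastructure; • by a single indicator, or by several indicators considered together.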
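How do we measure closeness? • Suppose we want to group 100 students. If we know only their math scores, we can group them only by math score; the scores form 100 points on a line, and points that lie close together go into the same group. • If we also know their physics scores, the math and physics scores form 100 points in a two-dimensional plane, and the students can again be grouped by distance. • Three or more dimensions work the same way; beyond three dimensions the points simply cannot be drawn directly.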
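A minimal sketch of the student example, assuming Euclidean distance and three invented students (the scores and the helper names `euclidean` and `nearest` are hypothetical). It shows that adding a second score can change which student counts as the closest.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length score tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name, scores):
    """Name of the student whose scores are closest to those of `name`."""
    return min(
        (other for other in scores if other != name),
        key=lambda other: euclidean(scores[name], scores[other]),
    )

# Hypothetical scores, invented for illustration.
math_only = {"A": (90,), "B": (85,), "C": (80,)}
math_and_physics = {"A": (90, 70), "B": (85, 95), "C": (80, 72)}

print(nearest("A", math_only))         # B
print(nearest("A", math_and_physics))  # C
```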
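Two notions of distance • Distance between points • Distance between clusters • A cluster made up of a single point is the most basic cluster; if every cluster consists of one point, the distance between points is the distance between clusters. • Distance between clusters is defined in terms of distance between points: it can be the distance between the closest pair of points from the two clusters, the distance between the farthest pair, or the distance between the cluster centers.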
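The three between-cluster distances named on this slide (closest pair, farthest pair, centers) can be sketched as follows. Euclidean point distance is assumed, the centers are taken to be centroids, and the function names and sample clusters are illustrative only.

```python
import math
from itertools import product

def closest_pair_distance(c1, c2):
    """Distance between the two nearest points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def farthest_pair_distance(c1, c2):
    """Distance between the two most distant points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def center_distance(c1, c2):
    """Distance between the centroids (coordinate-wise means) of the clusters."""
    def mean(points):
        return tuple(sum(axis) / len(points) for axis in zip(*points))
    return math.dist(mean(c1), mean(c2))

# Hypothetical clusters on a line (second coordinate fixed at 0).
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

print(closest_pair_distance(left, right))   # 3.0
print(farthest_pair_distance(left, right))  # 6.0
print(center_distance(left, right))         # 4.5
```

When each cluster holds a single point, all three functions return the plain point-to-point distance.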
The essence of the Mahalanobis distance: variable standardization • Purpose: remove the effect of measurement units
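The slide's formula is not preserved in the text, so the sketch below assumes the usual definition, the square root of (x − y)ᵀ S⁻¹ (x − y) with S the sample covariance matrix, restricted to two variables so the matrix inverse can be written out by hand. The height and weight figures are invented. The example checks that switching height from metres to centimetres leaves the Mahalanobis distance unchanged, while the Euclidean distance changes.

```python
import math

def covariance_2d(data):
    """Sample covariance terms (sxx, sxy, syy) of two-variable data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(a, b, cov):
    """sqrt((a - b)^T S^-1 (a - b)) for a 2x2 covariance matrix S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Hypothetical (height, weight in kg) records; heights in metres, then centimetres.
in_metres = [(1.60, 50.0), (1.70, 65.0), (1.80, 72.0), (1.75, 80.0)]
in_centimetres = [(100 * h, w) for h, w in in_metres]

d_m = mahalanobis_2d(in_metres[0], in_metres[1], covariance_2d(in_metres))
d_cm = mahalanobis_2d(in_centimetres[0], in_centimetres[1], covariance_2d(in_centimetres))
print(d_m, d_cm)  # the two values agree up to rounding

print(math.dist(in_metres[0], in_metres[1]),
      math.dist(in_centimetres[0], in_centimetres[1]))  # these differ
```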
Matching distance • Suited to categorical variables, especially nominal-scale variables
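The slide does not preserve a formula for the matching distance. One common form, assumed here, is the share of attributes on which two records disagree; the function name and the sample records are hypothetical.

```python
def matching_distance(a, b):
    """Fraction of attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Hypothetical records: (colour, shape, size).
item_1 = ("red", "round", "small")
item_2 = ("red", "square", "small")
item_3 = ("blue", "square", "large")

print(matching_distance(item_1, item_2))  # one of three attributes differs
print(matching_distance(item_1, item_3))  # all three attributes differ
```

Only equality between category labels is used, which is why this kind of measure fits nominal variables that have no order or scale.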
Density Measures • The top-right points “belong” together not only because they are close to each other, but also because each is close to many other points. • The points in the middle could belong to different groups because they are not close to many other points.
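One simple way to express "close to many other points" is to count the neighbours of a point within a fixed radius. This is only a sketch of that idea: the radius, the coordinates, and the function name are assumptions and are not taken from the slide's figure.

```python
import math

def neighbour_count(points, index, radius):
    """Number of other points lying within `radius` of points[index]."""
    centre = points[index]
    return sum(
        1
        for j, other in enumerate(points)
        if j != index and math.dist(centre, other) <= radius
    )

# Hypothetical layout: a tight group in the top right and one isolated point.
points = [(8.0, 8.0), (8.5, 8.0), (8.0, 8.5), (8.5, 8.5), (4.0, 4.0)]

print(neighbour_count(points, 0, radius=1.0))  # 3
print(neighbour_count(points, 4, radius=1.0))  # 0
```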
Cluster Analysis • Introduction ✔ • Measure of Distance ✔ • Hierarchical Clustering ✔ • K-means clustering • Density based clustering • Cluster Validation • Kernel K-Means Algorithm
AGNES (Agglomerative Nesting) • Initially, each object is a cluster • Step-by-step cluster merging, until all objects form a cluster • Single-link approach • Each cluster is represented by all of the objects in the cluster • The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
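The AGNES steps above with the single-link rule can be sketched in plain Python. The brute-force search over cluster pairs is chosen for readability, not efficiency; Euclidean distance, the sample points, and the function names are assumptions.

```python
import math

def single_link_distance(c1, c2, points):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agnes_single_link(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each object starts as a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
for left, right, height in agnes_single_link(points):
    print(left, right, height)
# [0] [1] 1.0
# [2] [3] 2.0
# [0, 1] [2, 3] 4.0
# [4] [0, 1, 2, 3] 13.0
```

Each row records which two clusters were merged (by object index) and the distance at which the merge happened.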
Example: Single linkage • Cities: Toronto, Hamilton, Niagara, Barrie, London, North Bay, Sudbury, Ottawa, Kingston
Dendrogram • Shows how clusters are merged hierarchically • Decomposes the data objects into a multi-level nested partitioning (a tree of clusters) • A clustering of the data objects is obtained by cutting the dendrogram at the desired level • Each connected component then forms a cluster
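Cutting a dendrogram can be sketched as follows: keep only the merges that happen at or below the chosen level, then read off the connected components. The merge history is a hypothetical example in the form (left cluster, right cluster, merge height), with heights that never decrease, and `cut_dendrogram` is an illustrative name.

```python
def cut_dendrogram(n_objects, merges, level):
    """Clusters obtained by keeping only merges with height <= level."""
    parent = list(range(n_objects))  # union-find forest over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for left, right, height in merges:
        if height <= level:
            parent[find(left[0])] = find(right[0])

    groups = {}
    for i in range(n_objects):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge history for five objects.
merges = [
    ([0], [1], 1.0),
    ([2], [3], 2.0),
    ([0, 1], [2, 3], 4.0),
    ([4], [0, 1, 2, 3], 13.0),
]

print(cut_dendrogram(5, merges, level=3.0))  # [[0, 1], [2, 3], [4]]
print(cut_dendrogram(5, merges, level=5.0))  # [[0, 1, 2, 3], [4]]
```

Lowering the level yields more, smaller clusters; raising it yields fewer, larger ones.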
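Hierarchical Clustering • Group data objects into a tree of clusters • Figure: agglomerative (AGNES) runs from Step 0 to Step 4, merging the objects a, b, c, d, e into ab and de, then cde, and finally abcde; divisive (DIANA) runs through the same tree in the opposite direction, from abcde at its Step 0 down to single objects at its Step 4.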
DIANA (DIvisive ANAlysis) • Initially, all objects are in one cluster • Step-by-step cluster splitting, until each cluster contains only one object
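The slide does not say how a cluster is split. The sketch below assumes one common splitting heuristic: split the cluster with the largest diameter, seed a splinter group with the object that has the largest average distance to the rest, and keep moving objects that are on average closer to the splinter group. Euclidean distance, the sample points, and all function names are assumptions.

```python
import math

def mean_distance(i, group, points):
    """Average distance from object i to the members of group other than i."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(math.dist(points[i], points[j]) for j in others) / len(others)

def diameter(group, points):
    """Largest distance between any two members of the group."""
    return max(math.dist(points[i], points[j]) for i in group for j in group)

def split(group, points):
    """Split one cluster (two or more objects) into a splinter group and the rest."""
    seed = max(group, key=lambda i: mean_distance(i, group, points))
    splinter = [seed]
    rest = [i for i in group if i != seed]
    while len(rest) > 1:
        # Positive gain: the object is on average closer to the splinter group.
        gains = {
            i: mean_distance(i, rest, points) - mean_distance(i, splinter, points)
            for i in rest
        }
        best = max(rest, key=lambda i: gains[i])
        if gains[best] <= 0:
            break
        rest.remove(best)
        splinter.append(best)
    return splinter, rest

def diana(points):
    """Return the list of partitions, from one cluster down to singletons."""
    clusters = [list(range(len(points)))]
    history = [sorted(sorted(c) for c in clusters)]
    while any(len(c) > 1 for c in clusters):
        splittable = [c for c in clusters if len(c) > 1]
        target = max(splittable, key=lambda c: diameter(c, points))
        clusters.remove(target)
        clusters.extend(split(target, points))
        history.append(sorted(sorted(c) for c in clusters))
    return history

# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
for partition in diana(points):
    print(partition)
```

Each printed partition has one more cluster than the previous one, ending with every object in its own cluster.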