Learn about Principal Component Analysis (PCA), a technique used in machine learning for dimension reduction. Understand the concepts, optimization objectives, and steps involved in PCA.
Visual Computing: Machine Learning Basics. Weiteng Xie
Learning Map. Scenario: Supervised Learning, Semi-supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning. Task: Regression, Classification, Structured Learning. Method: Linear Model; Non-linear Model (Deep Learning, SVM, decision tree, K-NN, ...).
Bias v.s. Variance. The observed error decomposes into error from bias and error from variance. Underfitting: large bias, small variance. Overfitting: small bias, large variance.
Dimension Reduction. The data looks 3-D, but actually lies on a 2-D surface.
Why? Essentially, PCA projects high-dimensional data onto a lower-dimensional space through a linear transformation. Its two main uses are: 1. noise reduction; 2. removing redundancy.
1. Vector representation and change of basis. X = (3, 2). To describe a vector precisely, first fix a set of basis vectors; the vector is then fully determined by its projection onto the line of each basis vector.
Note that R can be smaller than N, and R determines the dimensionality of the transformed data: we can map N-dimensional data into a lower-dimensional space, and the new dimensionality equals the number of basis vectors. Writing p_i for the row vector representing the i-th basis vector and a_j for the column vector representing the j-th original data record, the change of basis Y = PX transforms the M N-dimensional vectors into the new space spanned by the R basis vectors, where the rows of P are the p_i and the columns of X are the a_j.
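A minimal numpy sketch of this change of basis, assuming the X = (3, 2) example above; the basis vectors chosen here are illustrative:

```python
import numpy as np

# Two orthonormal basis vectors as the rows of P (the axes rotated by 45 degrees).
P = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)   # R x N, here R = N = 2

# Each column of X is one data record a_j; here the single vector (3, 2).
X = np.array([[3.0],
              [2.0]])                      # N x M, here M = 1

Y = P @ X                                  # R x M: coordinates in the new basis
print(Y)   # approx [[3.54], [-0.71]]: projections onto each basis vector
```

Taking only the first R < N rows of P would project the data into a lower-dimensional space, which is exactly the dimension-reduction setting described above.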
2. Optimization objective. How should we choose the basis optimally? In other words, given a set of N-dimensional vectors that we want to reduce to K dimensions (K < N), how should we choose the K basis vectors so as to preserve as much of the original information as possible? First center the data (subtract the mean of each field); then we want the projected values to be as spread out as possible.
3. Variance. As noted above, we want the projected values to be as spread out as possible, and this spread can be measured by the variance. The problem is therefore formalized as: find a one-dimensional basis such that, when all data are expressed as coordinates on this basis, the variance is maximized. Since the data have been centered, the mean is 0, so the variance of a field a is simply Var(a) = (1/m) * sum_i a_i^2.
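A small numpy sketch of this objective on synthetic 2-D data (the data and the brute-force angle scan are illustrative): after centering, the variance of the projection onto a unit vector u is just the mean of the squared projections, and we look for the direction that maximizes it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
X[1] += 2 * X[0]                          # correlated 2-D data, shape N x M
X = X - X.mean(axis=1, keepdims=True)     # center: each field now has mean 0

def projected_variance(u, X):
    """Variance of the data projected onto unit vector u (mean is 0)."""
    a = u @ X                             # one projected value per data point
    return (a ** 2).mean()

# Scan candidate one-dimensional bases and keep the max-variance direction.
angles = np.linspace(0, np.pi, 180)
best = max(angles,
           key=lambda t: projected_variance(np.array([np.cos(t), np.sin(t)]), X))
print(np.degrees(best))                   # close to the slope direction of the data
```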
4. Covariance. For reducing two dimensions to one, it is enough to find the single direction that maximizes the variance. But what about higher dimensions? When the covariance of two fields is 0, the two fields are linearly uncorrelated. To make the covariance 0, the second basis vector can only be chosen in a direction orthogonal to the first, so the chosen directions are always orthogonal. This gives the optimization objective for dimension reduction: to reduce a set of N-dimensional vectors to K dimensions (0 < K < N), choose K orthonormal basis vectors (unit length, pairwise orthogonal) such that, after the original data are transformed onto this basis, the pairwise covariances between fields are 0 and the variances of the fields are as large as possible (under the orthogonality constraint, take the K largest variances).
5. Covariance matrix. Suppose we have only two fields, a and b. Arrange them as the rows of a matrix X, then multiply X by its transpose and scale by 1/m: C = (1/m) X X^T. The diagonal entries of C are the variances of the two fields, and the off-diagonal entries are their covariance. By the derivation above, reaching the optimization objective is equivalent to diagonalizing the covariance matrix: make every off-diagonal element 0, and sort the diagonal elements from largest to smallest from top to bottom.
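A numpy sketch of the two-field case (the synthetic fields a and b are illustrative): build C = (1/m) X X^T and diagonalize it with an eigendecomposition, which works because C is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 0.5 * a + rng.normal(scale=0.3, size=100)   # field b correlated with a
X = np.vstack([a, b])                           # 2 x m: fields as rows
X = X - X.mean(axis=1, keepdims=True)           # zero-center each field

m = X.shape[1]
C = (X @ X.T) / m   # diagonal = variances, off-diagonal = covariance of a and b
print(C)

# C is symmetric, so its eigenvectors are orthogonal and diagonalize it.
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvecs.T @ C @ eigvecs)   # approximately diagonal, eigenvalues on the diagonal
```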
7. Summary of the PCA algorithm. Given m data points, each n-dimensional: 1. Arrange the original data into an n-row, m-column matrix X, one column per data point. 2. Zero-center each row of X (each row is one attribute field) by subtracting that row's mean. 3. Compute the covariance matrix C = (1/m) X X^T. 4. Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors. 5. Stack the eigenvectors as rows, sorted from top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P. 6. Y = PX is the data reduced to k dimensions.
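A minimal numpy implementation that follows these six steps directly; the toy 3-D data in the usage example is illustrative:

```python
import numpy as np

def pca(X, k):
    """PCA following the steps above.

    X: n x m matrix, one column per data point (step 1).
    Returns (Y, P): the k x m reduced data and the k x n projection matrix.
    """
    X = X - X.mean(axis=1, keepdims=True)     # step 2: zero-center each row
    C = (X @ X.T) / X.shape[1]                # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # step 4: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # step 5: sort eigenvalues, descending
    P = eigvecs[:, order[:k]].T               # top-k eigenvectors as rows
    return P @ X, P                           # step 6: Y = PX

# Usage: 3-D points that lie (up to small noise) near a 2-D plane.
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 200))
X = np.vstack([X, X[0] + X[1] + 0.01 * rng.normal(size=200)])   # 3 x 200
Y, P = pca(X, k=2)
print(Y.shape)   # (2, 200): the "looks 3-D, actually 2-D" case from earlier
```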
PCA - MNIST images 30 components: Eigen-digits
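A rough way to reproduce eigen-digits, assuming scikit-learn is available; note load_digits is scikit-learn's small 8x8 digit set standing in for MNIST here:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                        # 8x8 digit images, flattened to 64-D
pca = PCA(n_components=30).fit(digits.data)

# Each principal component is a 64-D vector; reshape it to view an "eigen-digit".
eigen_digits = pca.components_.reshape(-1, 8, 8)
print(eigen_digits.shape)                     # (30, 8, 8)
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept by 30 components
```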
PCA - Face 30 components: http://www.cs.unc.edu/~lazebnik/research/spring08/assignment3.html Eigen-face
Weakness of PCA • Unsupervised (compare LDA, which uses class labels) • Linear (non-linear dimension reduction is covered in the following lectures)
Distributed Representation • Clustering: an object must belong to exactly one cluster • Distributed representation: describe an object by a vector of attribute degrees instead (e.g. 小傑是強化系, "Xiao Jie is an Enhancement type", expressed as degrees over all the types); this is a form of dimension reduction.
Word Embedding. 1-of-N Encoding: apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1]. Word Class: Class 1 (dog, cat, bird), Class 2 (ran, jumped, walk), Class 3 (flower, tree, apple). Word embedding goes further, placing words such as dog, rabbit, cat, run, jump, tree, flower in a continuous space.
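A tiny sketch of 1-of-N encoding with the five-word vocabulary above; it shows why this encoding says nothing about similarity between words (every pair of vectors is equally far apart):

```python
import numpy as np

vocab = ["apple", "bag", "cat", "dog", "elephant"]
index = {w: i for i, w in enumerate(vocab)}

def one_of_n(word):
    """1-of-N encoding: a single 1 in the word's own dimension."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_of_n("cat"))                           # [0. 0. 1. 0. 0.]
print(one_of_n("cat") @ one_of_n("dog"))         # 0.0: no notion of similarity
```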
Word Embedding • Machine learns the meaning of words from reading a lot of documents without supervision. (Embedding plot: similar words cluster together, e.g. dog/cat/rabbit and run/jump, with tree and flower nearby each other.)
Word Embedding • Generating word vectors is unsupervised: the training data is just a large amount of text fed to a neural network.
Word Embedding • Machine learns the meaning of words from reading a lot of documents without supervision • A word can be understood by its context: "You shall know a word by the company it keeps." E.g. from the sentences "蔡英文 520宣誓就職" and "馬英九 520宣誓就職" (both "... took the oath of office on May 20"), 蔡英文 and 馬英九 are something very similar.
How to exploit the context? • Count based: if two words wi and wj frequently co-occur, V(wi) and V(wj) should be close to each other • E.g. GloVe: make the inner product V(wi) . V(wj) match N_i,j, the number of times wi and wj appear in the same document.
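A toy sketch of this count-based idea (not GloVe's actual weighted least-squares objective): build the co-occurrence counts N and fit vectors by gradient descent so that V(wi) . V(wj) approximates N_i,j. The corpus, dimensions, and learning rate are all illustrative.

```python
import numpy as np

docs = [["dog", "runs", "fast"], ["cat", "runs", "fast"], ["dog", "chases", "cat"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# N[i, j] = number of documents in which w_i and w_j co-occur.
N = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for wi in d:
        for wj in d:
            if wi != wj:
                N[idx[wi], idx[wj]] += 1

# Learn vectors so that V(wi) . V(wj) matches N[i, j] (squared-error objective).
rng = np.random.default_rng(0)
V = 0.1 * rng.normal(size=(len(vocab), 2))
for _ in range(2000):
    err = V @ V.T - N            # mismatch between inner products and counts
    np.fill_diagonal(err, 0)     # only fit pairs with i != j
    V -= 0.01 * (err @ V)        # gradient step (constant folded into the rate)

print({w: np.round(V[idx[w]], 2) for w in vocab})   # co-occurring words align
```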
Prediction-based. Train a neural network that takes the 1-of-N encoding of the word wi-1 as input and outputs the probability for each word in the vocabulary of being the next word wi. Take out the input (z1, z2, ...) of the neurons in the first layer and use it to represent the word w: this vector is the word vector / word embedding feature V(w), and in it similar words (dog/rabbit/cat, run/jump, tree/flower) end up close together.
Prediction-based. "You shall know a word by the company it keeps." Training text: "... 蔡英文 宣誓就職 ..." and "... 馬英九 宣誓就職 ...". With either 蔡英文 or 馬英九 as the input wi-1, the output "宣誓就職" ("took the oath of office") should have large probability, so the network is pushed to map 蔡英文 and 馬英九 to nearby points (z1, z2) in the first layer.
Prediction-based: Various Architectures • Continuous bag of word (CBOW) model: given the context wi-1 and wi+1, predict the word wi (predicting the word given its context) • Skip-gram: given the word wi, predict wi-1 and wi+1 (predicting the context given a word)
Prediction-based: Training (Continuous bag of word (CBOW) model). Collect data, e.g. the text "潮水 退了 就 知道 誰 ...", "不爽 不要 買 ...", "公道價 八萬 一 ...". Slide a window over the text to build training pairs: the network reads (潮水, 退了) and should output 就; reads (退了, 就) and should output 知道; and so on. Train by minimizing the cross entropy between the network's output distribution and the 1-of-N encoding of the actual next word.
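A minimal numpy sketch of this training loop under the same (two previous words -> next word) pairing as the slide; the English toy corpus stands in for the Chinese text, and the dimensions and learning rate are illustrative:

```python
import numpy as np

corpus = [["the", "tide", "recedes", "you", "know"],
          ["the", "tide", "rises", "you", "wait"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                      # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = 0.1 * rng.normal(size=(V, D))      # input embeddings (the word vectors)
W_out = 0.1 * rng.normal(size=(D, V))     # output projection

# Training pairs: (two previous words) -> next word, as in the slide.
data = [((s[i - 2], s[i - 1]), s[i]) for s in corpus for i in range(2, len(s))]

for _ in range(500):
    for (c1, c2), target in data:
        h = (W_in[idx[c1]] + W_in[idx[c2]]) / 2           # average context embeddings
        logits = h @ W_out
        p = np.exp(logits - logits.max()); p /= p.sum()   # softmax over the vocabulary
        g = p.copy(); g[idx[target]] -= 1                 # cross-entropy gradient: p - one-hot
        gh = W_out @ g                                    # gradient w.r.t. the hidden vector
        W_out -= 0.1 * np.outer(h, g)
        W_in[idx[c1]] -= 0.1 * gh / 2
        W_in[idx[c2]] -= 0.1 * gh / 2

print(np.round(W_in[idx["tide"]], 2))     # the learned word vector for "tide"
```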
Word Embedding Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Embedding Fu, Ruiji, et al. "Learning semantic hierarchies via word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Long Papers. Vol. 1. 2014.
Word Embedding • Characteristics: differences between word vectors capture relations between words • Solving analogies: Rome : Italy = Berlin : ? Compute V(Berlin) - V(Rome) + V(Italy), then find the word w with the closest V(w).
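A small sketch of this analogy computation; the hand-made 2-D vectors are illustrative, and real embeddings would come from training as described above:

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Solve a : b = c : ? by computing V(b) - V(a) + V(c)
    and returning the word whose vector is closest in cosine similarity."""
    q = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cos(vecs[w], q))

vecs = {"Rome":   np.array([1.0, 0.0]), "Italy":   np.array([1.0, 1.0]),
        "Berlin": np.array([2.0, 0.0]), "Germany": np.array([2.0, 1.0])}
print(analogy("Rome", "Italy", "Berlin", vecs))   # Germany
```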
Multi-lingual Embedding Bilingual Word Embeddings for Phrase-Based Machine Translation, Will Zou, Richard Socher, Daniel Cer and Christopher Manning, EMNLP, 2013
Multi-domain Embedding Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, Andrew Y. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS, 2013
Document Embedding • Word sequences of different lengths → vectors of the same length • The vector represents the meaning of the word sequence • A word sequence can be a document or a paragraph.
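The crudest way to get such a fixed-length vector is to average the word vectors; a sketch with illustrative hand-made vectors is below. This baseline ignores word order, which is exactly the weakness discussed on the following slides.

```python
import numpy as np

def doc_vector(words, vecs):
    """Fixed-length document vector: the average of the known word vectors.
    A bag-of-words-style baseline that is blind to word order."""
    return np.mean([vecs[w] for w in words if w in vecs], axis=0)

vecs = {"white": np.array([0.1, 0.9]), "blood": np.array([0.5, 0.5]),
        "cells": np.array([0.4, 0.6]), "infection": np.array([0.9, 0.1])}
d1 = doc_vector(["white", "blood", "cells", "destroying", "infection"], vecs)
print(d1)   # the same vector for any ordering of these words
```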
Semantic Embedding. Input: bag-of-words. Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
Beyond Bag of Word • To understand the meaning of a word sequence, the order of the words cannot be ignored: "white blood cells destroying an infection" (positive) and "an infection destroying white blood cells" (negative) have exactly the same bag-of-word representation but entirely different meanings.
Transfer Learning. Example: training a dog/cat classifier when we also have data not directly related to the task considered, such as cat/dog images from a different domain (different domains, same task) or images of elephants and tigers (similar domain, different tasks).
http://www.bigr.nl/website/structure/main.php?page=researchlines&subpage=project&id=64 http://www.spear.com.hk/Translation-company-Directory.html Why? Often the task considered has little data while related data is plentiful (task considered ← data not directly related): • Taiwanese speech recognition ← English and Chinese speech • Medical image recognition ← ordinary image recognition data • Text analysis in a specific domain ← general webpages