
PCA: Dimension Reduction Method for Machine Learning

Learn about Principal Component Analysis (PCA), a technique used in machine learning for dimension reduction. Understand the concepts, optimization objectives, and steps involved in PCA.


Presentation Transcript


  1. Visual Computing (可视计算): Machine Learning Basics. Weiteng Xie

  2. Learning Map: machine learning methods organized by scenario, task, and method. Scenario: Supervised Learning, Semi-supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning. Task: Regression, Classification, Structured Learning. Method: Linear Model, Non-linear Model (Deep Learning, SVM, decision tree, K-NN, ...).

  3. Learning Map

  4. Outline

  5. Dimension Reduction

  6. Principal Component Analysis (PCA)

  7. Bias and Variance

  8. Bias vs. Variance: the observed error combines error from bias and error from variance. Underfitting corresponds to large bias with small variance; overfitting corresponds to small bias with large variance.

  9. Dimension Reduction: the data looks 3-D, but is actually 2-D.

  10. Why?

  11. Why? Essentially, PCA projects high-dimensional data onto a lower-dimensional space through a linear transformation. Its main benefits can be summarized in two points: 1. noise reduction; 2. removal of redundancy.

  12. 1. Vector representation and change of basis. X = (3, 2). To describe a vector precisely, we first fix a set of basis vectors; the vector is then fully determined by its projection onto the line spanned by each basis vector.
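As a worked illustration of the statement above (the expansion is standard, not copied from the slide), the coordinates (3, 2) are exactly the projections of X onto the standard basis, and in general the coordinate of a vector x along a unit basis vector p is the inner product p·x:

$$ X = (3, 2) = 3\,(1, 0) + 2\,(0, 1), \qquad \text{coordinate of } x \text{ along a unit vector } p = p \cdot x . $$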

  13. Stack R basis vectors as the rows of a matrix and the M original data records as columns, where p_i is a row vector representing the i-th basis vector and a_j is a column vector representing the j-th original N-dimensional data record. Note that R can be smaller than N, and R determines the dimensionality of the transformed data: we can map N-dimensional data into a lower-dimensional space, and the resulting dimensionality depends only on the number of basis vectors. The product transforms the M N-dimensional vectors into the new space spanned by the R basis vectors.
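The slide's equation did not survive in the transcript; the standard matrix form it refers to, in the same notation (rows p_i are basis vectors, columns a_j are data records), is:

$$ \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_R \end{pmatrix} \begin{pmatrix} a_1 & a_2 & \cdots & a_M \end{pmatrix} = \begin{pmatrix} p_1 a_1 & p_1 a_2 & \cdots & p_1 a_M \\ p_2 a_1 & p_2 a_2 & \cdots & p_2 a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_R a_1 & p_R a_2 & \cdots & p_R a_M \end{pmatrix} $$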

  14. 2. Optimization objective. How should we choose the basis optimally? In other words, given a set of N-dimensional vectors to be reduced to K dimensions (K < N), how do we choose the K basis vectors so that as much of the original information as possible is preserved? First center the data (subtract the mean); we then want the projected values to be as spread out as possible.

  15. 3. Variance. As noted above, we want the projected values to be as spread out as possible, and this spread can be measured by the variance. The problem can therefore be stated formally as: find a one-dimensional basis such that, when all data are expressed as coordinates on this basis, the variance is maximized. Because the data have been centered, the mean is 0.
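Filling in the formula the slide relies on (with field a centered so that its mean is 0), the variance reduces to an average of squares:

$$ \operatorname{Var}(a) = \frac{1}{m} \sum_{i=1}^{m} a_i^2 $$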

  16. 4. Covariance. For reducing two dimensions to one, it is enough to find the single direction that maximizes the variance. But what about higher dimensions? When the covariance is 0, the two fields are completely uncorrelated. To make the covariance 0, the second basis vector can only be chosen in a direction orthogonal to the first, so the chosen directions must all be mutually orthogonal. We thus arrive at the optimization objective for dimension reduction: to reduce a set of N-dimensional vectors to K dimensions (0 < K < N), choose K orthonormal basis vectors (unit norm) such that, after the original data are transformed onto this basis, the pairwise covariances between fields are 0 and the variance of each field is as large as possible (under the orthogonality constraint, take the directions with the K largest variances).
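Again spelling out the formula assumed here (both fields centered to zero mean), the covariance of fields a and b is:

$$ \operatorname{Cov}(a, b) = \frac{1}{m} \sum_{i=1}^{m} a_i b_i $$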

  17. 5. Covariance matrix. Suppose we have only two fields, a and b; stack them as the rows of a matrix X, then multiply X by its transpose and scale by 1/m. From the derivation above, achieving the optimization objective is equivalent to diagonalizing this covariance matrix: all off-diagonal elements become 0, and the diagonal elements are arranged from largest to smallest from top to bottom.
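Written out for the two-field case above (a reconstruction of the slide's equation, assuming the rows a and b of X are zero-mean):

$$ C = \frac{1}{m} X X^{\mathsf T} = \begin{pmatrix} \frac{1}{m}\sum_i a_i^2 & \frac{1}{m}\sum_i a_i b_i \\ \frac{1}{m}\sum_i a_i b_i & \frac{1}{m}\sum_i b_i^2 \end{pmatrix} = \begin{pmatrix} \operatorname{Var}(a) & \operatorname{Cov}(a, b) \\ \operatorname{Cov}(a, b) & \operatorname{Var}(b) \end{pmatrix} $$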

  18. 6. Diagonalizing the covariance matrix.
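The transcript drops the derivation on this slide; the standard result it relies on is that the covariance matrix of the transformed data Y = PX is D = PCP^T, so choosing the rows of P to be the unit eigenvectors of C (sorted by eigenvalue) makes D diagonal:

$$ D = \frac{1}{m} Y Y^{\mathsf T} = P \left( \frac{1}{m} X X^{\mathsf T} \right) P^{\mathsf T} = P C P^{\mathsf T} = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_N \end{pmatrix}, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N $$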

  19. 7. Summary of the PCA algorithm. Given m data records, each n-dimensional: 1. Arrange the original data by columns into an n-by-m matrix X. 2. Zero-center each row of X (each row represents one attribute field) by subtracting that row's mean. 3. Compute the covariance matrix. 4. Compute the eigenvalues of the covariance matrix and their corresponding eigenvectors. 5. Stack the eigenvectors as rows, sorted by eigenvalue from largest to smallest, and take the first k rows to form the matrix P. 6. Y = PX is the data reduced to k dimensions.
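A minimal NumPy sketch of the six steps above (illustrative only: the function name, toy data, and layout choices are my own; in practice a library routine such as scikit-learn's PCA would normally be used):

```python
import numpy as np

def pca(X, k):
    """PCA following the slide's recipe.
    X: n x m matrix (one attribute field per row, one data record per column).
    Returns Y (k x m reduced data) and P (k x n projection matrix)."""
    n, m = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)      # step 2: zero-center each row
    C = Xc @ Xc.T / m                           # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # step 4: eigenvalues/eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues from largest to smallest
    P = eigvecs[:, order[:k]].T                 # step 5: top-k eigenvectors as rows of P
    Y = P @ Xc                                  # step 6: project down to k dimensions
    return Y, P

# Toy usage: 3-D data whose third attribute is nearly a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
X = np.vstack([X, X[0] + 0.01 * rng.normal(size=100)])
Y, P = pca(X, k=2)
print(Y.shape)   # (2, 100)
```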

  20. PCA - MNIST images 30 components: Eigen-digits
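A small scikit-learn sketch in the same spirit (not the slide's own code; it uses sklearn's built-in 8x8 digits dataset as a stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                          # 1797 images, each 8x8 = 64 pixels
pca = PCA(n_components=30)
X_reduced = pca.fit_transform(digits.data)      # each image is now described by 30 numbers

# The principal components are the "eigen-digits": each one is itself a 64-pixel image.
eigen_digits = pca.components_.reshape(30, 8, 8)
print(X_reduced.shape, eigen_digits.shape)      # (1797, 30) (30, 8, 8)
```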

  21. PCA - Face 30 components: http://www.cs.unc.edu/~lazebnik/research/spring08/assignment3.html Eigen-face

  22. Weaknesses of PCA • Unsupervised: PCA ignores class labels (contrast LDA, which is supervised) • Linear: non-linear dimension reduction methods are covered in the following lectures.

  23. Unsupervised Learning: Word Embedding

  24. Distributed Representation • Clustering: an object must belong to exactly one cluster (the slide's example assigns a character to a single class: 小傑是強化系, "Gon is an Enhancer") • Distributed representation: describe an object by a vector of attribute values rather than a single class (小傑是... followed by a value for each type). Dimension reduction produces this kind of distributed representation.

  25. Word Embedding • 1-of-N encoding: apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1] • Word classes: group words into classes, e.g. class 1 (dog, cat, bird), class 2 (ran, jumped, walk), class 3 (flower, tree, apple) • Word embedding: place words such as dog, cat, rabbit, run, jump, tree, flower as points in a continuous vector space.

  26. Word Embedding • The machine learns the meaning of words by reading a lot of documents, without supervision. (The slide plots words such as tree, flower, dog, rabbit, cat, run, and jump in the learned embedding space.)

  27. Word Embedding • Generating word vectors is unsupervised: the training data is simply a large amount of text, and a neural network maps each word (e.g. "Apple") to its vector.

  28. Word Embedding • The machine learns the meaning of words by reading a lot of documents, without supervision. • A word can be understood by its context: "You shall know a word by the company it keeps." Because the corpus contains both "馬英九 520宣誓就職" and "蔡英文 520宣誓就職" (both were sworn into office on 5/20), 蔡英文 and 馬英九 are something very similar.

  29. How to exploit the context? • Count based: if two words wi and wj frequently co-occur, V(wi) and V(wj) should be close to each other. • E.g. GloVe vectors: the inner product V(wi) · V(wj) is trained to match Ni,j, the number of times wi and wj appear in the same document.
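A small sketch of the count-based idea (my own toy code, not GloVe itself): build the document-level co-occurrence counts Ni,j that the embeddings' inner products are meant to approximate.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document is a list of tokens (hypothetical data).
documents = [
    ["dog", "cat", "run"],
    ["dog", "cat", "jump"],
    ["tree", "flower"],
]

# N[(wi, wj)] = number of documents in which wi and wj both appear.
N = Counter()
for doc in documents:
    for wi, wj in combinations(sorted(set(doc)), 2):
        N[(wi, wj)] += 1

print(N[("cat", "dog")])   # 2: "cat" and "dog" co-occur in two documents
# GloVe-style training then fits vectors so that V(wi) . V(wj) tracks these counts.
```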

  30. Prediction-based • Feed the 1-of-N encoding of the word wi-1 into a neural network whose output is the probability of each word in the vocabulary being the next word wi. • Take the inputs of the neurons in the first layer (z1, z2, ...) and use them to represent the word w: this vector is the word embedding feature V(w).
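A minimal NumPy sketch of this architecture (my own illustrative code; the vocabulary, sizes, and initialization are made up): the first-layer weights double as the embedding table.

```python
import numpy as np

vocab = ["蔡英文", "馬英九", "宣誓就職", "520", "dog", "cat"]
V, H = len(vocab), 3                       # vocabulary size, embedding size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(H, V))     # first-layer weights: column j holds the embedding of word j
U = rng.normal(scale=0.1, size=(V, H))     # output-layer weights

def one_hot(word):
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    return x

def next_word_probs(prev_word):
    z = W @ one_hot(prev_word)             # first-layer input z: the embedding of the previous word
    logits = U @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax: probability of each word being the next word

def embedding(word):
    return W @ one_hot(word)               # V(w): the column of W belonging to that word

print(next_word_probs("蔡英文").round(3))
print(embedding("馬英九"))
```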

  31. Prediction-based • "You shall know a word by the company it keeps." • The training text contains "…… 蔡英文 宣誓就職 ……" and "…… 馬英九 宣誓就職 ……" (wi-1 followed by wi). Whether wi-1 is 蔡英文 or 馬英九, "宣誓就職" should get a large probability as the next word wi, so the two names are pushed toward similar first-layer values (z1, z2) and hence similar embeddings.

  32. Prediction-based: Various Architectures • Continuous bag-of-words (CBOW) model: take the context (…… wi-1 ____ wi+1 ……) as input and predict the word wi, i.e. predict the word given its context. • Skip-gram: take the word wi as input and predict its context wi-1 and wi+1, i.e. predict the context given a word.
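As a usage sketch (assuming the gensim library; the toy corpus and parameter values are purely illustrative), both architectures are available through a single class:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (hypothetical data).
sentences = [
    ["蔡英文", "宣誓就職"],
    ["馬英九", "宣誓就職"],
    ["潮水", "退了", "就", "知道", "誰"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0: CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

print(cbow.wv["蔡英文"].shape)                     # (50,) word vector
print(skipgram.wv.most_similar("蔡英文", topn=2))  # nearest words in the embedding space
```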

  33. Word2vec

  34. Prediction-based Training, continuous bag-of-words (CBOW) model • Collect training data, e.g. text such as "潮水 退了 就 知道 誰 …", "不爽 不要 買 …", "公道價 八萬 一 …". • The neural network is trained to predict each word from its neighbouring words (潮水 退了 → 就, 退了 就 → 知道, 就 知道 → 誰, and so on), minimizing the cross entropy between the predicted distribution and the 1-of-N encoding of the target word.
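Continuing the toy NumPy setup from the sketch after slide 30 (my own illustration, not the slide's code), one CBOW-style training step averages the context embeddings and minimizes cross entropy against the target word:

```python
import numpy as np

vocab = ["潮水", "退了", "就", "知道", "誰"]
V, H, lr = len(vocab), 3, 0.1
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(H, V))    # input embeddings: column j is the vector of word j
U = rng.normal(scale=0.1, size=(V, H))    # output-layer weights

context, target = ["潮水", "退了"], "就"    # predict a word from its neighbouring words

for step in range(100):
    z = np.mean([W[:, vocab.index(w)] for w in context], axis=0)  # average the context embeddings (CBOW)
    logits = U @ z
    p = np.exp(logits - logits.max()); p /= p.sum()               # softmax over the vocabulary
    t = vocab.index(target)
    loss = -np.log(p[t])                                          # cross entropy against the target word
    d_logits = p.copy(); d_logits[t] -= 1.0                       # gradient of the loss w.r.t. the logits
    dz = U.T @ d_logits
    U -= lr * np.outer(d_logits, z)                               # update the output weights
    for w in context:                                             # update the shared context embeddings
        W[:, vocab.index(w)] -= lr * dz / len(context)

print(round(float(loss), 4))   # the cross entropy shrinks as training proceeds
```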

  35. Multiplying the trained weight matrix W by the one-hot representation of any word yields that word's own word embedding.
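In code terms (continuing the earlier toy notation, where W has one column per vocabulary word), the statement above is just a column lookup:

```python
import numpy as np

V, H = 5, 3
W = np.arange(V * H, dtype=float).reshape(H, V)   # stand-in for a trained weight matrix

i = 2                                             # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[i] = 1.0

embedding = W @ one_hot                           # multiplying W by the one-hot vector...
assert np.array_equal(embedding, W[:, i])         # ...simply selects column i: that word's embedding
```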

  36. Word Embedding Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014

  37. Word Embedding Fu, Ruiji, et al. "Learning semantic hierarchies via word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Long Papers. Vol. 1. 2014.

  38. Word Embedding • Characteristics: differences between word vectors capture relations between words. • Solving analogies: Rome : Italy = Berlin : ? Compute V(Berlin) - V(Rome) + V(Italy) and find the word w with the closest V(w).
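A toy sketch of the analogy computation (hypothetical vectors; a real system would use trained embeddings and search the full vocabulary by cosine similarity):

```python
import numpy as np

# Hypothetical 2-D word vectors, just to illustrate the arithmetic.
V = {
    "Rome":    np.array([1.0, 0.0]),
    "Italy":   np.array([1.0, 1.0]),
    "Berlin":  np.array([2.0, 0.0]),
    "Germany": np.array([2.0, 1.0]),
    "cat":     np.array([5.0, 5.0]),
}

query = V["Berlin"] - V["Rome"] + V["Italy"]   # Rome : Italy = Berlin : ?

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

answer = max((w for w in V if w not in {"Rome", "Italy", "Berlin"}),
             key=lambda w: cosine(V[w], query))
print(answer)   # Germany
```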

  39. Multi-lingual Embedding Bilingual Word Embeddings for Phrase-Based Machine Translation, Will Zou, Richard Socher, Daniel Cer and Christopher Manning, EMNLP, 2013

  40. Multi-domain Embedding Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, Andrew Y. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS, 2013

  41. Document Embedding • Map word sequences with different lengths to vectors of the same length. • The vector represents the meaning of the word sequence. • A word sequence can be a document or a paragraph.

  42. Semantic Embedding (input: bag-of-words). Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

  43. Beyond Bag-of-Words • To understand the meaning of a word sequence, the order of the words cannot be ignored. Example: "white blood cells destroying an infection" (positive) and "an infection destroying white blood cells" (negative) have exactly the same bag-of-words representation but different meanings.
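A quick check of the claim above (my own snippet): the two sentences reduce to identical bag-of-words counts.

```python
from collections import Counter

positive = "white blood cells destroying an infection"
negative = "an infection destroying white blood cells"

bow_pos = Counter(positive.split())
bow_neg = Counter(negative.split())

print(bow_pos == bow_neg)   # True: bag-of-words cannot tell the two sentences apart
```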

  44. Transfer Learning • Task considered: a dog/cat classifier (labelled dog and cat images). • Data not directly related to the task considered, e.g. elephant and tiger images (similar domain, different tasks) or cat and dog images from a different domain (different domains, same task).

  45. Why? (Slide images: http://www.bigr.nl/website/structure/main.php?page=researchlines&subpage=project&id=64 , http://www.spear.com.hk/Translation-company-Directory.html ) Examples, by task considered vs. data not directly related: • Speech recognition: Taiwanese (task considered); English, Chinese, …… (data not directly related). • Image recognition: medical images (task considered); other image data (data not directly related). • Text analysis: a specific domain (task considered); webpages (data not directly related).
