Matrix Factorization via SGD
Recovering latent factors in a matrix
• V ≈ W H, where V is the n-users × m-movies rating matrix, W is n × r, H is r × m, and r is the number of latent factors
• V[i,j] = user i's rating of movie j
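The factorization above can be sketched in a few lines of NumPy (the sizes and the rank r here are illustrative toy values, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, m_movies, r = 6, 4, 2   # toy sizes; r = number of latent factors

# V ~ W @ H: W holds a length-r factor row per user,
# H holds a length-r factor column per movie.
W = 0.1 * rng.standard_normal((n_users, r))
H = 0.1 * rng.standard_normal((r, m_movies))

# Predicted rating of movie j by user i is the inner product
# of user i's factor row with movie j's factor column.
i, j = 2, 3
pred = W[i, :] @ H[:, j]
```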
Matrix factorization as SGD
• For a sampled observed entry (i,j), take a step along the local gradient of that entry's squared error, (V[i,j] − W[i,:] H[:,j])²
• …scaled up by N (the number of observed entries) to approximate the full gradient
• step size η controls the update magnitude
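A minimal sketch of one such update, assuming a squared-error loss on a single observed entry (the scale-up by N and the step size follow the slide; the function name and toy values are made up):

```python
import numpy as np

def sgd_step(v_ij, i, j, W, H, eta, N):
    """One SGD step on the observed rating V[i,j].

    Local loss: (V[i,j] - W[i,:] @ H[:,j])**2. The local gradient is
    scaled up by N (number of observed entries) to approximate the
    full gradient; eta is the step size.
    """
    err = v_ij - W[i, :] @ H[:, j]
    grad_w = -2.0 * err * H[:, j]   # local gradient w.r.t. W[i,:]
    grad_h = -2.0 * err * W[i, :]   # local gradient w.r.t. H[:,j]
    W[i, :] -= eta * N * grad_w
    H[:, j] -= eta * N * grad_h

# One small step should reduce the local squared error.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))
H = rng.standard_normal((2, 3))
before = (4.0 - W[1, :] @ H[:, 2]) ** 2
sgd_step(4.0, 1, 2, W, H, eta=0.001, N=9)
after = (4.0 - W[1, :] @ H[:, 2]) ** 2
```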
KDD 2011 talk pilfered from …..
Parallel Perceptrons
• Simplest idea:
• Split data into S "shards"
• Train a perceptron on each shard independently
• weight vectors are w(1), w(2), …
• Produce some weighted average of the w(i)'s as the final result
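A sketch of this one-shot scheme; the slide leaves the weighting unspecified, so a uniform average stands in for it, and the toy data is invented for illustration:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Plain perceptron on labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake: update
                w += yi * xi
    return w

def parallel_perceptron(X, y, n_shards=3):
    """Split into shards, train independently, average the w(i)'s."""
    shards = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))
    ws = [train_perceptron(Xi, yi) for Xi, yi in shards]
    return np.mean(ws, axis=0)

# Toy linearly separable data with a margin (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
w_true = np.array([2.0, -1.0])
keep = np.abs(X @ w_true) > 1.0
X = X[keep]
y = np.where(X @ w_true > 0, 1, -1)
w = parallel_perceptron(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```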
Parallelizing perceptrons (diagram): split the instances/labels into example subsets 1–3, compute the vk's on each subset in parallel, then combine vk-1, vk-2, vk-3 by some sort of weighted averaging into a single vk.
Parallel Perceptrons – take 2
• Idea: do the simplest possible thing iteratively:
• Split the data into shards
• Let w = 0
• For n = 1, …
• Train a perceptron on each shard with one pass, starting with w
• Average the weight vectors (somehow) and let w be that average (an All-Reduce)
• Extra communication cost:
• redistributing the weight vectors
• done less frequently than if fully synchronized, more frequently than if fully parallelized
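The iterative version can be sketched as follows, again using a uniform average for the unspecified weighting; the shard count, round count, and toy data are arbitrary:

```python
import numpy as np

def one_pass(w0, X, y):
    """One perceptron pass over a shard, starting from the shared w."""
    w = w0.copy()
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:   # mistake: update
            w += yi * xi
    return w

def iterative_mixing(X, y, n_shards=3, n_rounds=5):
    """Each round: one pass per shard starting from the current average,
    then average the shard weights (the all-reduce) to get the new w."""
    w = np.zeros(X.shape[1])
    Xs = np.array_split(X, n_shards)
    ys = np.array_split(y, n_shards)
    for _ in range(n_rounds):
        ws = [one_pass(w, Xi, yi) for Xi, yi in zip(Xs, ys)]
        w = np.mean(ws, axis=0)
    return w

# Toy linearly separable data with a margin (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
w_true = np.array([1.0, 2.0])
keep = np.abs(X @ w_true) > 1.0
X = X[keep]
y = np.where(X @ w_true > 0, 1, -1)
w = iterative_mixing(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```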
Parallelizing perceptrons – take 2 (diagram): split the instances/labels into example subsets; starting from the previous w, compute local w-1, w-2, w-3 on each subset; combine by some sort of weighted averaging into the new w.
More detail….
• Initialize W, H randomly
• not at zero
• Choose a random ordering (random sort) of the points in a stratum in each "sub-epoch"
• Pick the strata sequence by permuting the rows and columns of M, and using M'[k,i] as the column index of row i in sub-epoch k
• Use "bold driver" to set the step size:
• increase the step size when the loss decreases (in an epoch)
• decrease the step size when the loss increases
• Implemented in Hadoop and R/Snowfall
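The "bold driver" rule can be sketched as a one-per-epoch adjustment; the growth and shrink factors (1.05 and 0.5) are common choices, not taken from the slides:

```python
def bold_driver(eta, prev_loss, loss, grow=1.05, shrink=0.5):
    """Adjust the SGD step size once per epoch:
    grow it while the epoch loss is decreasing,
    cut it sharply as soon as the loss increases."""
    if loss < prev_loss:
        return eta * grow
    return eta * shrink
```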
Wall clock time (8 nodes, 64 cores, R/snow): in-memory implementation
Wall clock time (Hadoop): one map-reduce job per epoch
Hadoop scalability: Hadoop process setup time starts to dominate