SPARSE TENSORS DECOMPOSITION SOFTWARE Papa S. Diaw, Master’s Candidate Dr. Michael W. Berry, Major Professor
Introduction • Large data sets • Nonnegative Matrix Factorization (NMF) • Insights into hidden relationships • Arrange multi-way data into a matrix • Higher memory and CPU requirements for computation • Linear relationships in the matrix representation • Failure to capture important structural information • Slower or less accurate calculations
Introduction (cont'd) • Nonnegative Tensor Factorizations (NTF) • Natural way to handle high dimensionality • Preserves the original multi-way structure of the data • Image processing, text mining
Tensor Toolbox For MATLAB • Sandia National Laboratories • Licenses • Proprietary Software
Motivation of the PILOT • Python Software for NTF • Alternative to Tensor Toolbox for MATLAB • Incorporation into FutureLens • Exposure to NTF • Interest in the open source community
Tensors • Multi-way array • Order/Mode/Ways • High-order • Fiber • Slice • Unfolding • Matricization or flattening • Reordering the elements of an N-th order tensor into a matrix. • Not unique
Tensors (cont’d) • Kronecker Product • Khatri-Rao product (column-wise Kronecker product) • A ⊙ B = [a_1 ⊗ b_1  a_2 ⊗ b_2  …  a_J ⊗ b_J]
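The Khatri-Rao definition above can be checked against NumPy's `np.kron` column by column. The function name `khatri_rao` and the small matrices are assumptions made for this sketch.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x J) and B (K x J): result is (I*K) x J."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    # The broadcast product has shape (I, K, J); merging the first two axes
    # places a_i * b_k in row i*K + k, which is the Kronecker ordering.
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 5.0], [6.0, 7.0], [8.0, 9.0]])
KR = khatri_rao(A, B)   # shape (6, 2)
```

Column j of `KR` equals `np.kron(A[:, j], B[:, j])`.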
Tensor Factorization • Introduced by Hitchcock in 1927 and later developed by Cattell in 1944 and Tucker in 1966 • Rewrite a given tensor as a finite sum of lower-rank tensors. • Tucker and PARAFAC • Determining the approximation rank is a difficult problem
PARAFAC • Parallel Factor Analysis • Canonical Decomposition (CANDECOMP) • Harshman; Carroll and Chang, 1970
PARAFAC (cont’d) • Given a three-way tensor X and an approximation rank R, PARAFAC approximates X by a sum of R rank-one tensors. • The factor matrices collect the vectors of these rank-one components as their columns.
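As an illustration of the model above, a three-way tensor can be rebuilt from its factor matrices in two equivalent ways: as an explicit sum of rank-one outer products, or in a single `np.einsum` call. The sizes and the names `A`, `B`, `C` are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2
A = rng.random((I, R))   # column r holds the mode-1 vector of component r
B = rng.random((J, R))
C = rng.random((K, R))

# Explicit sum of R rank-one tensors a_r o b_r o c_r.
X_sum = np.zeros((I, J, K))
for r in range(R):
    X_sum += np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])

# The same tensor in one call: sum over the shared component index r.
X_einsum = np.einsum("ir,jr,kr->ijk", A, B, C)
```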
PARAFAC (cont’d) • Alternating Least Squares (ALS) • ALS cycles “over all the factor matrices and performs a least-square update for one factor matrix while holding all the others constant.”[7] • NTF can be considered an extension of the PARAFAC model with the constraint of nonnegativity
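A minimal sketch of the ALS cycle for a dense three-way array follows. It is unconstrained: the nonnegativity constraint that NTF adds is deliberately left out, and the function name `cp_als`, its parameters, and the use of `np.linalg.pinv` are assumptions for illustration rather than the PILOT implementation.

```python
import numpy as np

def cp_als(X, rank, n_iter=50, seed=0):
    """Unconstrained ALS for a dense three-way array (illustrative only)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    letters = ["ir", "jr", "kr"]
    norm_x = np.linalg.norm(X)
    errors = []
    for _ in range(n_iter):
        for n in range(3):
            a, b = [m for m in range(3) if m != n]
            # Matricized tensor times Khatri-Rao product, written as one einsum.
            M = np.einsum(f"ijk,{letters[a]},{letters[b]}->{letters[n]}",
                          X, factors[a], factors[b])
            # Gram matrix of the Khatri-Rao product equals the Hadamard
            # product of the two factors' Gram matrices.
            G = (factors[a].T @ factors[a]) * (factors[b].T @ factors[b])
            # Least-squares update of factor n with the other two held fixed.
            factors[n] = M @ np.linalg.pinv(G)
        approx = np.einsum("ir,jr,kr->ijk", *factors)
        errors.append(np.linalg.norm(X - approx) / norm_x)
    return factors, errors

# Usage: fit a rank-2 model to a tensor that has an exact rank-2 structure.
rng = np.random.default_rng(1)
true = [rng.standard_normal((d, 2)) for d in (5, 4, 3)]
X = np.einsum("ir,jr,kr->ijk", *true)
factors, errors = cp_als(X, rank=2)
```

Because each update solves its least-squares subproblem exactly, the relative error recorded after each sweep should not increase, apart from rounding effects.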
Python • Object-oriented, interpreted • Runs on all major platforms • Gentle learning curve • Supports object methods (everything is an object in Python)
Python (cont’d) • Recent interest in the scientific community • Several scientific computing packages • NumPy • SciPy • Python is extensible
Data Structures • Dictionary • Stores the tensor data • Mutable type of container that can store any number of Python objects • Pairs of keys and their corresponding values • Well suited to the sparsity of our tensors • VAST 2007 contest data: 1,385,205,184 elements, of which 1,184,139 are nonzero (nz) • Stores only the nonzero elements; zeros are represented by the default value of the dictionary
Data Structures (cont’d) • NumPy Arrays • Fundamental package for scientific computing in Python • Khatri-Rao products or tensor multiplications • Speed
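A small sketch of the dictionary idea: subscript tuples map to nonzero values, and a lookup with a default of zero stands in for every element that is not stored. Using `dict.get` for the default is an assumption here; the slides do not say which mechanism the software uses. The element counts are the VAST 2007 figures quoted above.

```python
# Only the nonzero entries are stored, keyed by their subscripts.
sparse = {(0, 2, 1): 3.0, (4, 0, 0): 1.5}

stored = sparse.get((0, 2, 1), 0.0)    # a stored nonzero
implicit = sparse.get((1, 1, 1), 0.0)  # not stored, so the default 0.0 is returned

# Fraction of elements that would actually be stored for the VAST 2007 tensor.
total_elements = 1_385_205_184
nonzeros = 1_184_139
density = nonzeros / total_elements
print(f"density: {density:.4%}")
```

Fewer than one element in a thousand is nonzero, so the dictionary holds a tiny fraction of what a dense array would.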
Modules (cont’d) • SPTENSOR • Most important module • Class (subscripts of nonzeros, values) • Flexibility (NumPy Arrays, NumPy Matrix, Python Lists) • Dictionary • Keeps a few instance variables • Size • Number of dimensions • Frobenius norm (Euclidean Norm)
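A hypothetical stand-in for such a class is sketched below. The class name `SparseTensor`, its constructor signature, and its attribute names are assumptions; only the ingredients (subscripts and values as input, a dictionary for storage, size, number of dimensions, Frobenius norm) come from the slide.

```python
import math

class SparseTensor:
    """Illustrative sparse tensor: a dictionary of subscript tuples to nonzero values."""

    def __init__(self, subscripts, values, size):
        if len(subscripts) != len(values):
            raise ValueError("need exactly one value per subscript")
        self.size = tuple(size)        # extent of each mode
        self.ndims = len(self.size)    # number of dimensions (order)
        self.data = {}
        # Works for Python lists and NumPy arrays alike: both are iterated row by row.
        for sub, val in zip(subscripts, values):
            if val != 0:
                key = tuple(int(s) for s in sub)
                # Repeated subscripts are accumulated (a choice made for this sketch).
                self.data[key] = self.data.get(key, 0.0) + float(val)

    def norm(self):
        """Frobenius norm: square root of the sum of squared stored values."""
        return math.sqrt(sum(v * v for v in self.data.values()))

    def __getitem__(self, sub):
        # Elements that are not stored are zeros.
        return self.data.get(tuple(sub), 0.0)

T = SparseTensor([(0, 0, 0), (1, 2, 1)], [3.0, 4.0], size=(2, 3, 2))
```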
Modules (cont’d) • PARAFAC • Coordinates the NTF • Implementation of ALS • Stops at convergence or at the maximum number of iterations • Factor matrices are turned into a Kruskal Tensor
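One way the final steps could look is sketched here: the factor columns are normalized, their scales are collected into weights, and the residual norm is obtained from three scalars without forming the dense approximation. The function names are hypothetical, and the formulas are the standard identities for Kruskal tensors rather than anything stated on the slides.

```python
import numpy as np

def normalize_factors(factors):
    """Scale every column to unit 2-norm and collect the scales as weights."""
    weights = np.ones(factors[0].shape[1])
    normalized = []
    for F in factors:
        norms = np.linalg.norm(F, axis=0)
        norms[norms == 0] = 1.0          # leave all-zero columns unchanged
        weights = weights * norms
        normalized.append(F / norms)
    return weights, normalized

def ktensor_norm(weights, factors):
    """Norm of a Kruskal tensor from its weights and the factors' Gram matrices."""
    gram = np.outer(weights, weights)
    for F in factors:
        gram = gram * (F.T @ F)          # Hadamard product of Gram matrices
    return np.sqrt(abs(gram.sum()))

def residual_norm(norm_x, norm_k, inner):
    """||X - K|| computed from ||X||, ||K|| and the inner product <X, K>."""
    return np.sqrt(max(norm_x ** 2 + norm_k ** 2 - 2.0 * inner, 0.0))

rng = np.random.default_rng(0)
raw = [rng.random((4, 2)), rng.random((3, 2)), rng.random((2, 2))]
weights, unit = normalize_factors(raw)
k_norm = ktensor_norm(weights, unit)
```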
Modules (cont’d) • INNERPROD • Inner product between SPTENSOR and KTENSOR • Used by PARAFAC to compute the residual norm • Kronecker product for matrices • TTV • Product of a sparse tensor with a (column) vector • Returns a tensor • Workhorse of our software package • Performs most of the computation • It is called by the MTTKRP and INNERPROD modules
Modules (cont’d) • MTTKRP • Khatri-Rao product of all factor matrices except the one being updated • Matrix multiplication of the matricized tensor with the Khatri-Rao product obtained above • KTENSOR • Kruskal tensor • Object returned after the factorization is done and the factor matrices are normalized. • Class • Instance variables such as the norm. • Norm of the KTENSOR plays a big part in determining the residual norm in the PARAFAC module.
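A sketch of a tensor-times-vector product on the dictionary representation, and of how an inner product with a single rank-one component could be built from repeated calls to it. Both function names and the plain-dictionary input are assumptions; the real modules operate on SPTENSOR and KTENSOR objects.

```python
def ttv(data, vector, mode):
    """Multiply a sparse tensor (dict of subscript tuples to values) by a vector along `mode`.

    The result is a sparse tensor with that mode removed.
    """
    result = {}
    for index, value in data.items():
        reduced = index[:mode] + index[mode + 1:]
        result[reduced] = result.get(reduced, 0.0) + value * vector[index[mode]]
    return result

def innerprod_rank_one(data, vectors):
    """Inner product of a sparse tensor with the rank-one tensor v0 o v1 o ... ."""
    current = data
    for vec in vectors:
        # Always contract the leading remaining mode.
        current = ttv(current, vec, 0)
    # After all modes are contracted, the only possible key is the empty tuple.
    return current.get((), 0.0)

data = {(0, 1, 0): 2.0, (1, 0, 1): 3.0, (1, 1, 1): 1.0}
contracted = ttv(data, [1.0, 10.0], mode=2)
inner = innerprod_rank_one(data, [[1.0, 2.0], [3.0, 4.0], [1.0, 10.0]])
```

Only the stored nonzeros are visited, so the cost grows with the number of nonzeros rather than with the full size of the tensor.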
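The MTTKRP result can also be accumulated directly from the nonzeros, which gives the same matrix as multiplying the matricized tensor by the Khatri-Rao product but never forms that product explicitly. This is a swapped-in, equivalent formulation for illustration; the function name `mttkrp_sparse` and the dictionary input are assumptions.

```python
import numpy as np

def mttkrp_sparse(data, factors, mode):
    """Matricized tensor times Khatri-Rao product, accumulated per nonzero."""
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))
    for index, value in data.items():
        # Start from the nonzero value and multiply in the matching row of
        # every factor matrix except the one being updated.
        row = np.full(rank, value, dtype=float)
        for m, F in enumerate(factors):
            if m != mode:
                row *= F[index[m]]
        out[index[mode]] += row
    return out

rng = np.random.default_rng(0)
factors = [rng.random((2, 3)), rng.random((3, 3)), rng.random((2, 3))]
data = {(0, 1, 0): 2.0, (1, 0, 1): 3.0, (1, 2, 1): 1.0}
M0 = mttkrp_sparse(data, factors, mode=0)   # shape (2, 3)
```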
Performance • Python Profiler • Run-time performance • Tool for detecting bottlenecks • Code optimization • Negligible improvement • Efficiency loss in some modules
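A short sketch of how the standard-library profiler can expose a bottleneck. The recursive and iterative functions are toy stand-ins invented for this example; they are not taken from the software.

```python
import cProfile
import io
import pstats

def recursive_sum(values):
    """Toy recursive function: each call also copies the rest of the list."""
    if not values:
        return 0
    return values[0] + recursive_sum(values[1:])

def iterative_sum(values):
    """Same result without recursion or list copies."""
    total = 0
    for v in values:
        total += v
    return total

def workload():
    data = list(range(300))
    for _ in range(200):
        recursive_sum(data)
        iterative_sum(data)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The call counts in the report make the contrast visible: the recursive version accounts for hundreds of calls per sum, the loop for one.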
Performance (cont’d) • Lists and Recursions
Performance (cont’d) • NumPy Arrays
Performance (cont’d) • After removing Recursions
Floating-Point Arithmetic • Binary floating-point • “Binary floating-point cannot exactly represent decimal fractions, so if binary floating-point is used it is not possible to guarantee that results will be the same as those using decimal arithmetic.”[12] • Rounding differences make the iterations volatile
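The effect quoted above is easy to reproduce. The tolerance-based check at the end is a hypothetical example of how an iteration might compare successive fit values; the function name and the tolerance are assumptions.

```python
import math
from decimal import Decimal

a = 0.1 + 0.2
exact_equal = (a == 0.3)           # False: neither 0.1 nor 0.2 is exact in binary
close = math.isclose(a, 0.3)       # True: equal up to a relative tolerance
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")   # True

def has_converged(previous_fit, fit, tol=1e-4):
    """Compare successive fit values with a tolerance instead of testing equality."""
    return abs(fit - previous_fit) < tol
```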
Conclusion • There is still work to do after NTF • Preprocessing of data • Post-processing of results, for example with FutureLens • Expertise • Extract and identify hidden components • Tucker implementation • C extension to increase speed • GUI
Acknowledgments • Mr. Andrey Puretskiy • Discussions at all stages of the PILOT • Consultancy in text mining • Testing • Tensor Toolbox For MATLAB (Bader and Kolda) • Understanding of tensor decomposition • PARAFAC
References • [1] http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/ • [2] Tamara G. Kolda, Brett W. Bader, “Tensor Decompositions and Applications”, SIAM Review, June 10, 2008. • [3] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, Shun-ichi Amari, “Nonnegative Matrix and Tensor Factorizations”, John Wiley & Sons, Ltd, 2009. • [4] http://docs.python.org/library/profile.html • [5] http://www.mathworks.com/access/helpdesk/help/techdoc • [6] http://www.scipy.org/NumPy_for_Matlab_Users • [7] Brett W. Bader, Andrey A. Puretskiy, Michael W. Berry, “Scenario Discovery Using Nonnegative Tensor Factorization”, J. Ruiz-Shulcloper and W.G. Kropatsch (Eds.): CIARP 2008, LNCS 5197, pp. 791-805, 2008 • [8] http://docs.scipy.org/doc/numpy/user/ • [9] http://docs.scipy.org/doc/ • [10] http://docs.scipy.org/doc/numpy/user/whatisnumpy.html • [11] Tamara G. Kolda, “Multilinear operators for higher-order decompositions”, SANDIA REPORT, April 2006 • [12] http://speleotrove.com/decimal/decifaq1.html#inexact