Introduction to Python
Session 2: Beginning Numerical Python and Visualization
Jeremy Chen

This Session’s Agenda
• NumPy Arrays
• Random Number Generation
• Numerical Linear Algebra
• Visualization
• An OLS Example
Preliminaries
• By now you should have a Python distribution…
• Run IPython or Spyder
• If not, start downloading one and work with IDLE
NumPy Arrays
• The ndarray is a typed N-dimensional array object:
>>> import numpy as np
>>> A = np.ndarray((2,3)); A    # Kinda the same as np.empty((2,3)); uninitialized data
array([[ 8.48798326e-314, 2.12199579e-314, 0.00000000e+000],
       [ 0.00000000e+000, 1.05290307e-253, 1.47310613e-319]])
>>> A[0,0] = 2; A[0,1] = 3; A[1,1] = 4; A[1,2] = 5; A
array([[ 2., 3., 0.],
       [ 0., 4., 5.]])
>>> 10 * A + 1           # Simple vector operations
array([[ 21., 31., 1.],
       [ 1., 41., 51.]])
>>> np.exp(A)            # And various other vector operations
array([[ 7.3890561 , 20.08553692, 1. ],
       [ 1. , 54.59815003, 148.4131591 ]])
>>> [A.shape, A.ndim]    # And don’t forget ndarrays have a "shape" and "dimension"
[(2, 3), 2]
>>> A.dtype              # ... and a data type.
dtype('float64')
NumPy Arrays: Creating
• Common ways of creating an ndarray
>>> np.array(np.arange(5))          # arange is like range but returns an ndarray
array([0, 1, 2, 3, 4])
>>> np.array(np.arange(5)).dtype    # numpy selects an appropriate datatype
dtype('int32')
>>> np.reshape([3,2,1,3,1,2,6,7], (2,4))
array([[3, 2, 1, 3],
       [1, 2, 6, 7]])
>>> np.ones((2,3,4))
array([[[ 1., 1., 1., 1.],
        [ 1., 1., 1., 1.],
        [ 1., 1., 1., 1.]],
       [[ 1., 1., 1., 1.],
        [ 1., 1., 1., 1.],
        [ 1., 1., 1., 1.]]])
>>> np.zeros((5))
array([ 0., 0., 0., 0., 0.])
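A few more standard constructors, not shown on the slide (a quick sketch; all are ordinary NumPy functions):
>>> np.linspace(0, 1, 5)    # 5 evenly spaced points, endpoints included
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> np.eye(3)               # 3x3 identity (used again in the linear algebra slides)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
>>> np.full((2,2), 7.0)     # constant-filled array
array([[ 7.,  7.],
       [ 7.,  7.]])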
NumPy Arrays: Creating
• An amusing way to create an ndarray
>>> M = np.mat('[1,2,3;4,5,6]'); M    # Like MATLAB
matrix([[1, 2, 3],
        [4, 5, 6]])
>>> A = np.array(M); A                # ... not really necessary
array([[1, 2, 3],
       [4, 5, 6]])
>>> # Here's the obligatory Matrix Multiplication digression
>>> M * M.T        # Matrix multiplication (M.T is the transpose of M)
matrix([[14, 32],
        [32, 77]])
>>> A * A.T        # Not matrix multiplication
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (3,2)
>>> A.dot(A.T)     # Matrix multiplication for 2-dimensional ndarrays
array([[14, 32],
       [32, 77]])
>>> np.dot(A, A.T) # ... same here. (Dot product for 1-dimensional ndarrays)
array([[14, 32],
       [32, 77]])
NumPy Arrays: Accessing
• Simple slicing creates views (not copies)
>>> A = np.reshape(np.arange(2*4)+1, (2,4)); A
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
>>> B = A[:,1:3]; B
array([[2, 3],
       [6, 7]])
>>> B[0,0] = 100; B; A
array([[100, 3],
       [  6, 7]])
array([[ 1, 100, 3, 4],
       [ 5,   6, 7, 8]])
• “Fancy indexing” creates copies
>>> C = A[:, [1,3]]; C    # Using lists/arrays for indexing
array([[100, 4],
       [  6, 8]])
>>> C[0,0] = 200; C; A
array([[200, 4],
       [  6, 8]])
array([[ 1, 100, 3, 4],
       [ 5,   6, 7, 8]])
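If you need a slice that does not alias the original, ask for the copy explicitly (a quick sketch; .copy() is the standard ndarray method, not shown on the slide):
>>> D = A[:, 1:3].copy()    # explicit copy: writes to D no longer touch A
>>> D[0,0] = -1; A[0,1]     # A keeps the value set earlier
100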
NumPy Arrays: A Little More
• Some standard operations
>>> A = np.reshape(np.arange(2*4)+1, (2,4)); A
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
>>> A.sum()
36
>>> A.sum(axis=0)
array([ 6, 8, 10, 12])
>>> A.sum(axis=1)
array([10, 26])
>>> A.cumsum(axis=1)
array([[ 1, 3, 6, 10],
       [ 5, 11, 18, 26]])
>>> (A.mean(axis=1), A.std(axis=1), A.var(axis=1), A.min(axis=1), A.argmax(axis=1))
(array([ 2.5, 6.5]), array([ 1.11803399, 1.11803399]), array([ 1.25, 1.25]),
 array([1, 5]), array([3, 3]))
>>> arr = np.array([1,2,4,9,1,4]); arr.sort(); arr    # Sorting (in place)
array([1, 1, 2, 4, 4, 9])
>>> np.unique(arr)    # Keep only unique entries
array([1, 2, 4, 9])
>>> np.unique(np.array(['A', 'B', 'CC', 'c', 'A', 'CC']))    # Applies to strings too
array(['A', 'B', 'CC', 'c'], dtype='|S2')
Random Number Generation
• The Random Seed
>>> np.random.seed(10)
• Various Distributions (concrete sketch below)
>>> np.random.SOME_DISTRIBUTION(params, size=(dimensions))
>>> # rand ~ Uniform; randn ~ Normal; exponential; beta;
>>> # gamma; binomial; randint (=> Discrete) ... even triangular
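To make the placeholder concrete, here are a few real calls (the generators and signatures are standard numpy.random functions; the parameter values are just illustrative):
>>> np.random.seed(10)                              # reseed for reproducibility
>>> np.random.rand(2, 3)                            # Uniform on [0,1), shape (2,3)
>>> np.random.randn(5)                              # five N(0,1) draws
>>> np.random.binomial(n=10, p=0.3, size=4)         # four Binomial(10, 0.3) draws
>>> np.random.randint(0, 10, size=(2, 2))           # integers in [0, 10)
>>> np.random.triangular(left=0, mode=1, right=3, size=3)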
Numerical Linear Algebra
• Matrix Decompositions
>>> A = np.ones((5,5)) + np.eye(5)
>>> L = np.linalg.cholesky(A)         # Cholesky decomposition.
>>> np.allclose(np.dot(L, L.T), A)    # Check...
True
>>> Q, R = np.linalg.qr(A)            # QR decomposition
>>> np.allclose(np.dot(Q, R), A)
True
>>> # Eigenvalue decomposition (Use eigh when symmetric/Hermitian)
>>> e, V = np.linalg.eig(A)
>>> np.allclose(np.dot(A, V), np.dot(V, np.diag(e)))    # V may not be unitary
True
>>> # Singular Value Decomposition
>>> U, s, V = np.linalg.svd(A, full_matrices=True)
>>> np.allclose(np.dot(np.dot(U, np.diag(s)), V), A)
True
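Since this A is symmetric, eigh applies; a minimal sketch (np.linalg.eigh is the standard symmetric/Hermitian routine, returning eigenvalues in ascending order and an orthonormal V):
>>> w, V = np.linalg.eigh(A)                        # for symmetric/Hermitian input
>>> np.allclose(np.dot(np.dot(V, np.diag(w)), V.T), A)    # A = V diag(w) V^T
True
>>> np.allclose(np.dot(V.T, V), np.eye(5))          # here V really is orthogonal
True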
Numerical Linear Algebra
• Solving Linear Systems
>>> np.linalg.solve(A, np.ones((5,1))).T    # Solve Linear System
array([[ 0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667]])
>>> np.linalg.inv(A)                        # Matrix Inverse
array([[ 0.83333333, -0.16666667, -0.16666667, -0.16666667, -0.16666667],
       [-0.16666667, 0.83333333, -0.16666667, -0.16666667, -0.16666667],
       [-0.16666667, -0.16666667, 0.83333333, -0.16666667, -0.16666667],
       [-0.16666667, -0.16666667, -0.16666667, 0.83333333, -0.16666667],
       [-0.16666667, -0.16666667, -0.16666667, -0.16666667, 0.83333333]])
• Other Stuff
>>> np.linalg.norm(A)           # Matrix (Frobenius) norm
6.324555320336759
>>> np.linalg.det(A)            # Determinant (should be 6...)
5.9999999999999982
>>> np.trace(A)                 # Matrix trace
10
>>> np.linalg.matrix_rank(A)    # Matrix rank
5
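Related, though not on the slide: np.linalg.lstsq solves overdetermined systems in the least-squares sense, which is exactly what the OLS example later does by hand. A minimal sketch with made-up data (Xr, yr, and the coefficients are illustrative):
>>> Xr = np.random.randn(100, 3)                   # 100 observations, 3 predictors
>>> yr = np.dot(Xr, np.array([1., 2., 3.])) + 0.1 * np.random.randn(100)
>>> beta, res, rank, sv = np.linalg.lstsq(Xr, yr)  # minimizes ||Xr.beta - yr||^2
>>> rank                                           # Xr is full rank almost surely
3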
Visualization (By Example)
• By now you should have a Python distribution.
• Run IPython or Spyder
In [1]: %pylab
Using matplotlib backend: module://IPython.kernel.zmq.pylab.backend_inline
Populating the interactive namespace from numpy and matplotlib
In [2]: plot( arange(20) )
Out[2]: [<matplotlib.lines.Line2D at 0x8918850>]
Visualization (By Example)
• A Standard Brownian Motion
In [3]: dx = 0.01    # Hit CTRL-Enter for multi-line input
   ...: walk = (np.random.randn(1000) * np.sqrt(dx)).cumsum()
   ...: plot(np.arange(0,1000)*dx, walk - walk[0])
Out[3]: [<matplotlib.lines.Line2D at 0xd31f770>]
Visualization (By Example)
• A 2D Intensity Plot
In [4]: xy_extent = 20
   ...: xy_range = np.arange(-xy_extent, xy_extent, 0.1)
   ...: X_m, Y_m = np.meshgrid(xy_range, xy_range)
   ...: f = ((X_m-5) - 2 * (Y_m+5)) ** 2 - (1.5*(X_m+3) + 1 * (Y_m-5)) ** 2
   ...: imshow(f, cmap = matplotlib.cm.bone, \
   ...:        extent=[-xy_extent, xy_extent, -xy_extent, xy_extent])
   ...: colorbar()
Out[4]: <matplotlib.colorbar.Colorbar instance at 0x0C3AFE90>
Visualization (By Example)
• A 2D Contour Plot
In [5]: xy_extent = 20
   ...: xy_range = np.arange(-xy_extent, xy_extent, 0.1)
   ...: X_m, Y_m = np.meshgrid(xy_range, xy_range)
   ...: f = ((X_m-5) - 2 * (Y_m+5)) ** 2 - (1.5*(X_m+3) + 1 * (Y_m-5)) ** 2
   ...: contourf(X_m, Y_m, f, 20)
   ...: colorbar()
Out[5]: <matplotlib.colorbar.Colorbar instance at 0x10C12DA0>
Visualization (By Example)
• Our Objective… (the 2×2 figure assembled over the next few slides)
Visualization (By Example)
• Some data for plotting
In [6]: XY = np.arange(1,1+100)/100.0
   ...: XZ = np.arange(1,1+1000)/1000.0
   ...: Y = np.random.randn(100)
   ...: Z = np.random.randn(1000)
   ...: Z.sort()
• Setting up the figure and subplots
In [7]: n_row = 2
   ...: n_col = 2
   ...: fig1 = plt.figure(figsize=(12,8))
   ...: axes = []    # Collect axis references in a list
   ...: for k in range(n_row * n_col):
   ...:     axes.append( fig1.add_subplot(n_row, n_col, k+1) )
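An equivalent way to get the same grid of axes, if you prefer it (a sketch using the standard plt.subplots helper, which the slides don’t use):
fig1, ax_grid = plt.subplots(n_row, n_col, figsize=(12, 8))    # figure + 2x2 array of Axes
axes = list(ax_grid.ravel())                                   # flatten to a plain list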
Visualization (By Example)
• The simple line plot
In [8]: axes[0].plot( XZ - 0.5, np.cos(1.5 * (XZ - 0.5) * \
   ...:               2 * 3.14159), color='k')
   ...: axes[0].set_xlim([-0.6, 0.6])
   ...: axes[0].set_title(r'$\cos\ 2\pi x$')
   ...: axes[0].set_xlabel('x')
   ...: axes[0].set_ylabel('y')
   ...: fig1
Out[8]:
Visualization (By Example)
• The histogram of normal samples
In [9]: axes[1].hist( Z, bins=20, color='b')
   ...: axes[1].set_title('Some Histogram of 1000 samples from $N(0,1)$')
   ...: fig1
Out[9]:
Visualization (By Example)
• The scatter plots (with legend)
In [10]: axes[2].scatter( XY[10:60], Y[:50] + XY[10:60]*0.05, marker='o', \
    ...:                  c=(0.5, 0.5, 0.5), label='Population 1')
    ...: axes[2].scatter( XY[40:90], Y[50:] + XY[40:90]*0.05, marker='x', \
    ...:                  c='#FF00A0', label='Population 2')
    ...: axes[2].legend(loc = 'best')
    ...: fig1
Out[10]:
Visualization (By Example)
• The big mess of normal CDFs
In [11]: axes[3].set_title(r'A Colourful Mess of $N(\mu_k, 1)$ CDFs')
    ...: col = ['r', 'g', 'b', 'c', 'm', 'y', 'k']    # w for white
    ...: line = ['solid', 'dashed', 'dashdot', 'dotted']
    ...: linestyle = [(c,l) for c in col for l in line]
    ...: for k in range(40):
    ...:     ls_idx = k % len(linestyle)
    ...:     axes[3].plot( Z+(k-20)*0.5, XZ, color = linestyle[ls_idx][0], \
    ...:                   linestyle=linestyle[ls_idx][1])
    ...: axes[3].text(-10, 0.45, 'Nothing to See Here.', fontsize=20)
    ...: fig1
Out[11]:
Example: Ordinary Least Squares
• The standard linear model: $Y = X\beta + \varepsilon$,
• where $\varepsilon \sim N(0, \sigma^2 I)$.
• The Maximum Likelihood Estimator is the Least Squares Solution: $\hat{\beta} = (X^T X)^{-1} X^T Y$
In [12]: import math
    ...: import numpy as np
    ...: import matplotlib.pyplot as plt
    ...: import scipy.stats
    ...: # Load data (Choose between data sets with...
    ...: # 100, 200, 500, 1000, 5000, and 10000 samples)
    ...: data = np.load('OLS_data_500.npz')
    ...: X = data['X']
    ...: Y = data['Y']
    ...: true_beta = data['true_beta']    # Yeah... These are provided
    ...: true_error_std = data['true_error_std']
    ...: data.close()
Example: Ordinary Least Squares
In [13]: if X.shape[0] != Y.shape[0]:    # Always good to check for errors
    ...:     raise ValueError("Data dimension mismatch.")
    ...: num_pred = X.shape[1]
    ...: num_samples = X.shape[0]
In [14]: # Descriptive Statistics
    ...: plt_side = int(math.ceil(math.sqrt(num_pred + 1)))    # subplot index must be integral
    ...: fig = plt.figure(figsize = (12,12))
    ...: for k in range(num_pred + 1):
    ...:     ax = fig.add_subplot(plt_side, plt_side, k+1)
    ...:     if k == 0:
    ...:         ax.hist(Y, bins=20)
    ...:         ax.set_title( "Dependent Variable ($\\mu={mean:.3}$, $\\sigma={std:.3}$)".format( \
    ...:             mean = np.mean(Y), std = np.std(Y)) )
    ...:     else:
    ...:         ax.hist(X[:, k-1], bins=20)
    ...:         ax.set_title( "Predictor {pred} ($\\mu={mean:.3}$, $\\sigma={std:.3}$)".format( \
    ...:             pred = k, mean = np.mean(X[:, k-1]), std = np.std(X[:, k-1])) )
Example: Ordinary Least Squares
In [15]: # OLS
    ...: XtX = np.dot(X.T, X)
    ...: XtY = np.dot(X.T, Y)
    ...: beta_est = np.linalg.solve(XtX, XtY)
    ...: s_sq = np.var(Y - np.dot(X, beta_est))
    ...:
    ...: print "True Betas", true_beta.T
    ...: print "Estimated Betas", beta_est.T
    ...: print
    ...:
    ...: print "True Error Variance: ", true_error_std ** 2
    ...: print "Estimated Error Variance: ", s_sq
True Betas [[ 1. 1. 1. 1. 1. 0.]]
Estimated Betas [[ 1.2389584 0.98711812 0.90258976 0.86002269 0.99770273 0.01079312]]
True Error Variance: 4
Estimated Error Variance: 3.85523406528
Example: Ordinary Least Squares
• Hypothesis Testing: under $H_0: \beta_j = 0$, $t_j = \hat{\beta}_j / \mathrm{SE}_j$ with $\mathrm{SE}_j = \sqrt{[s^2 (X^T X)^{-1}]_{jj}}$, compared against a $t$ distribution with $n - p$ degrees of freedom
In [16]: # Hypothesis Testing
    ...: standard_errors = np.reshape(np.sqrt(np.diag( \
    ...:     s_sq * np.linalg.solve(XtX, np.eye(num_pred)) )), (num_pred,1))
    ...: print "Standard Errors: ", standard_errors.T
    ...: t_statistics = beta_est / standard_errors
    ...: print "t-Statistics: ", t_statistics.T
    ...: df = num_samples - num_pred
    ...: p_values = 2 * scipy.stats.t.cdf(-abs(t_statistics), df).T
    ...: print "p-values (2-sided): ", p_values
Standard Errors: [[ 0.26819354 0.05023788 0.16891535 0.09578419 0.09744182 0.06468454]]
t-Statistics: [[ 4.61964296 19.6488801 5.34344442 8.97875377 10.23895845 0.16685784]]
p-values (2-sided): [[ 4.91038992e-06 6.11554987e-64 1.39498275e-07 5.75446104e-18 1.92610049e-22 8.67550186e-01]]
The last predictor’s p-value (≈ 0.868) is nowhere near significant, consistent with its true beta of 0, so we drop it and refit. So let’s do it again!
Example: Ordinary Least Squares
In [17]: # OLS Again
    ...: X = X[:,:-1]; num_pred = X.shape[1]; num_samples = X.shape[0]
    ...: XtX = np.dot(X.T, X)
    ...: XtY = np.dot(X.T, Y)
    ...: beta_est = np.linalg.solve(XtX, XtY)
    ...: s_sq = np.var(Y - np.dot(X, beta_est))
    ...: print "Estimated Betas", beta_est.T
    ...: standard_errors = np.reshape(np.sqrt(np.diag( \
    ...:     s_sq * np.linalg.solve(XtX, np.eye(num_pred)) )), (num_pred,1))
    ...: print "Standard Errors: ", standard_errors.T
    ...: t_statistics = beta_est / standard_errors
    ...: print "t-Statistics: ", t_statistics.T
    ...: df = num_samples - num_pred
    ...: p_values = 2 * scipy.stats.t.cdf(-abs(t_statistics), df).T
    ...: print "p-values (2-sided): ", p_values
Estimated Betas [[ 1.25027031 0.98744205 0.90124032 0.86084415 0.9975555 ]]
Standard Errors: [[ 0.25949091 0.05020176 0.16872633 0.09566025 0.09744054]]
t-Statistics: [[ 4.81816615 19.6694721 5.34143267 8.99897405 10.23758223]]
p-values (2-sided): [[ 1.92950811e-06 4.54116320e-64 1.40853414e-07 4.88551777e-18 1.93166121e-22]]