Best Practices for Convolutional NNs Applied to Visual Document Analysis (according to P. A. Simard, D. Steinkraus, and J. C. Platt)
outline • the task • training set expansion • network architecture • learning
the task • handwriting recognition • segmented handwritten digits • data: • benchmark set of English digit images (MNIST) • size-normalized to 28 x 28 pixels • 60,000 training patterns, 10,000 test patterns • goal: map each image vector to a digit class in {0, 1, …, 9}
the task • example from test set:
training set expansion • Etest − Etrain ~ 1/P (P = size of the training set), so the gap between test and training error shrinks as the training set grows • idea: apply transformations to the original data to generate additional training examples • the learning algorithm then learns invariance to these transformations (wrt. the original, non-transformed input)
training set expansion • examples of transformations: • translation • rotation • skewing • method: for every pixel in the original image, compute a new sampling location from a displacement field, e.g. Δx(x,y) = 1, Δy(x,y) = 0 (shift by one pixel in x) or Δx(x,y) = α·x, Δy(x,y) = α·y (scaling; interpolation is needed when the new location is not an integer) • elastic deformations
training set expansion • worked example: pixel A at (0,0) gets the new sampling location xnew(x,y) = 1.75, ynew(x,y) = −0.5 • surrounding gray levels (gl): gl(1,0) = 3, gl(2,0) = 7, gl(1,−1) = 5, gl(2,−1) = 9 • evaluate gl at (xnew, ynew) with bilinear interpolation: • over x, at y = 0: 3 + 0.75 · (7 − 3) = 6 • over x, at y = −1: 5 + 0.75 · (9 − 5) = 8 • over y: 8 + 0.5 · (6 − 8) = 7 → the new gray level of A is 7 • a code sketch of this interpolation follows
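To make the resampling step concrete, here is a minimal NumPy sketch (not the authors' code; bilinear_sample and warp are illustrative names and the boundary handling is a simplifying assumption). Sampling at (1.75, −0.5) on the gray levels above reproduces the value 7 from the worked example.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Gray level of `img` at the non-integer location (x, y),
    computed with bilinear interpolation as in the worked example."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = x - x0, y - y0                                  # fractional offsets
    # interpolate along x on the two bracketing rows, then along y
    row0 = img[y0, x0] + fx * (img[y0, x1] - img[y0, x0])
    row1 = img[y1, x0] + fx * (img[y1, x1] - img[y1, x0])
    return row0 + fy * (row1 - row0)

def warp(img, dx, dy):
    """Apply a displacement field (dx, dy): each output pixel (x, y)
    takes the interpolated gray level at (x + dx, y + dy)."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            xs, ys = x + dx[y, x], y + dy[y, x]
            if 0 <= xs < w - 1 and 0 <= ys < h - 1:          # stay inside the image
                out[y, x] = bilinear_sample(img, xs, ys)
    return out
```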
training set expansion • elastic deformations • start from random displacement fields: Δx(x,y) = rand(−1, +1), Δy(x,y) = rand(−1, +1) • smooth each field with a Gaussian of a given standard deviation σ (in pixels) • if σ is large, the resulting values are very small (the random values average out to ≈ 0) • if σ is small, the field remains essentially random • intermediate σ: the field looks like an elastic deformation • finally multiply by a factor α that controls the intensity of the deformation • a sketch of this procedure follows
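A short sketch of that recipe, assuming NumPy and SciPy's gaussian_filter; the function name elastic_field and the example values σ = 4, α = 34 are illustrative choices, not taken from the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def elastic_field(shape, sigma, alpha, rng=None):
    """Random displacement fields, smoothed with a Gaussian of std `sigma`
    (in pixels) and scaled by `alpha`, following the recipe on the slide."""
    if rng is None:
        rng = np.random.default_rng()
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    return dx, dy

# example: a 28 x 28 field; warp() from the earlier sketch can then resample the image
dx, dy = elastic_field((28, 28), sigma=4.0, alpha=34.0)
```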
training set expansion • examples of distortions:
network architecture • account for topological properties of the input (shape of curves, edges, etc.) • gradually extract more and more complex features • simple features are extracted at higher resolutions; more complex features at coarser resolutions, where each unit covers a larger region of the input • the conversion from one resolution to the next is done by convolution • coarser resolutions are generated by sub-sampling
network architecture • a set of layers, each with one or more planes (feature maps) • each unit on a plane receives input from a small area of the planes in the previous layer → local receptive fields • weights are shared across all positions on a plane → reduces the number of parameters • multiple planes in each layer detect multiple features • once a feature has been detected, its exact position matters less → spatial sub-sampling (local averaging) • result: (partial) invariance to translation, rotation, scale, and deformation
network architecture • [architecture diagram] • C1: convolution, 5 feature maps (e.g. edge, ink, intersection), kernel size 5 x 5 • S1: sub-sampling by a factor of 2 • C2: convolution, 50 feature maps • S2: sub-sampling by a factor of 2 • fully connected layer with 100 hidden units, followed by the 10 digit outputs • a forward-pass sketch follows
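Below is a minimal NumPy forward-pass sketch of that layout, assuming each C/S pair is folded into a single stride-2 convolution; the parameter names (k1, b1, W1, ...), the tanh nonlinearity, and the resulting map sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conv_layer(x, kernels, bias, stride=2):
    """Valid convolution with sub-sampling folded in (stride 2).
    x: (in_maps, H, W); kernels: (out_maps, in_maps, k, k); bias: (out_maps,)."""
    out_maps, in_maps, k, _ = kernels.shape
    H, W = x.shape[1:]
    oh, ow = (H - k) // stride + 1, (W - k) // stride + 1
    y = np.zeros((out_maps, oh, ow))
    for o in range(out_maps):
        for i in range(oh):
            for j in range(ow):
                patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
                y[o, i, j] = np.sum(patch * kernels[o]) + bias[o]
    return np.tanh(y)                                   # squashing nonlinearity (illustrative)

def forward(img, params):
    """Forward pass through the sketched layout: C1/S1 (5 maps),
    C2/S2 (50 maps), 100 hidden units, 10 outputs."""
    h = conv_layer(img[None], params["k1"], params["b1"])   # (1, 28, 28) -> (5, 12, 12)
    h = conv_layer(h, params["k2"], params["b2"])           # -> (50, 4, 4)
    h = np.tanh(params["W1"] @ h.ravel() + params["c1"])    # 100 hidden units
    return params["W2"] @ h + params["c2"]                  # 10 output scores
```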
gradient-based learning • backpropagation • output: Yp = F(Xp, W) • loss function: Ep = D(Dp, F(Xp, W)) • Etrain(W): average of Ep over the training set {(X1, D1), …, (XP, DP)} • e.g. squared error: Ep = (Dp − F(Xp, W))² / 2 • Etrain(W) = 1/P · Σp Ep • simplest setting: find W that minimizes Etrain(W) • see the snippet below
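A tiny NumPy illustration of those two quantities; the function names are mine, and the squared-error form matches the slide.

```python
import numpy as np

def pattern_loss(d_p, y_p):
    # Ep = (Dp - F(Xp, W))^2 / 2, summed over the output units
    return 0.5 * np.sum((np.asarray(d_p) - np.asarray(y_p)) ** 2)

def training_error(targets, outputs):
    # Etrain(W) = (1/P) * sum_p Ep over the P training patterns
    return np.mean([pattern_loss(d, y) for d, y in zip(targets, outputs)])
```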
gradient-based learning • if E is differentiable wrt. W, gradient-based optimization can be used to find the minimum • each module computes: Xn = Fn(Xn-1, Wn) • Wn: the module's trainable parameters (a subset of W) • Xn-1: the module's input (the previous module's output) • X0: the input pattern Xp
gradient-based learning • if ∂Ep/∂Xn is known, then ∂Ep/∂Wn and ∂Ep/∂Xn-1 can be computed: • ∂Ep/∂Wn = ∂F/∂W(Wn, Xn-1) · ∂Ep/∂Xn (gradient wrt. the module's parameters) • ∂Ep/∂Xn-1 = ∂F/∂X(Wn, Xn-1) · ∂Ep/∂Xn (error propagated backward to the previous module) • ∂F/∂W(Wn, Xn-1): Jacobian of F wrt. W, evaluated at (Wn, Xn-1) • ∂F/∂X(Wn, Xn-1): Jacobian of F wrt. X, evaluated at (Wn, Xn-1) • J[F]: matrix containing the partial derivatives of all outputs wrt. all inputs, Jki = ∂(Xn)k / ∂xi
gradient-based learning • simplest minimization: gradient descent • W is iteratively adjusted: W(t) = W(t-1) − ε · ∂E/∂W • traditional backprop is a special case of gradient-based learning with: Yn = Wn Xn-1, Xn = F(Yn) • a minimal training-loop sketch follows
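The sketch below ties the update rule to a single traditional-backprop module (Yn = Wn Xn-1, Xn = F(Yn)); the sigmoid squashing function, learning rate, and initialization are illustrative choices, not the paper's settings.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def train(X, D, epochs=100, eps=0.1, rng=None):
    """Gradient descent on one module Xn = F(Wn Xn-1) with squared-error loss:
    a minimal instance of the update W(t) = W(t-1) - eps * dE/dW."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(D.shape[1], X.shape[1]))
    for _ in range(epochs):
        for x, d in zip(X, D):                  # one pattern at a time
            y = W @ x                           # Yn = Wn Xn-1
            out = sigmoid(y)                    # Xn = F(Yn)
            dE_dout = out - d                   # dE/dXn for Ep = ||d - out||^2 / 2
            dE_dy = dE_dout * out * (1 - out)   # back through the sigmoid
            dE_dW = np.outer(dE_dy, x)          # dE/dWn, via the module's Jacobian
            W -= eps * dE_dW                    # W(t) = W(t-1) - eps * dE/dW
    return W
```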
application • zip-code scanning (and a generalized version over the time domain) • fax reading • similar techniques are used in other digital image recognition tasks (e.g. face recognition, X-ray, MRI, etc.) • a later version (2003): dynamically changing layer parameters