Presentation Transcript


  1. Best Practices for Convolutional NNs Applied to Visual Document Analysis (according to P. A. Simard, D. Steinkraus, and J. C. Platt)

  2. outline • the task • training set expansion • network architecture • learning

  3. the task • handwriting recognition • segmented handwritten digits • data: • benchmark set of English digit images (MNIST) • size-normalized to 28 x 28 pixels • 60,000 training patterns, 10,000 test patterns • goal: image vector → {0, 1, …, 9}

  4. the task • example from test set:

  5. training set expansion • Etest - Etrain ∝ 1/P (P: size of the training set) • idea: apply transformations to generate additional data • the learning algorithm will learn transformation invariance (w.r.t. the original, non-transformed input)

  6. training set expansion • examples of transformations: • translation • rotation • skewing • method: for every pixel in the original image, compute a new location from a displacement field, e.g. Δx(x,y) = 1, Δy(x,y) = 0 (translate one pixel) or Δx(x,y) = α·x, Δy(x,y) = α·y (scale by α; plus interpolation if the new location is not an integer) • elastic deformations
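
A minimal sketch (assuming NumPy and a grayscale pixel grid such as the 28 x 28 MNIST images from slide 3) of how the displacement fields Δx, Δy named on this slide could be built; the function name and the default α are illustrative, not from the paper.

```python
import numpy as np

def affine_displacement_fields(height, width, mode="translate", alpha=0.1):
    """Per-pixel displacement fields (dx, dy) as on the slide.

    mode="translate": dx(x,y) = 1,        dy(x,y) = 0        (shift one pixel)
    mode="scale":     dx(x,y) = alpha*x,  dy(x,y) = alpha*y  (uniform scaling)
    """
    # coordinate grids: x varies along columns, y along rows
    y, x = np.mgrid[0:height, 0:width].astype(float)
    if mode == "translate":
        dx, dy = np.ones_like(x), np.zeros_like(y)
    else:
        dx, dy = alpha * x, alpha * y
    return dx, dy
```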

  7. training set expansion • worked example: pixel A at (0,0) is mapped to the new location xnew = 1.75, ynew = -0.5; the neighbouring gray levels (gl) are 3 at (1,0), 7 at (2,0), 5 at (1,-1), 9 at (2,-1) • evaluate gl at (xnew, ynew) with bilinear interpolation: • over x: 3 + 0.75 * (7 - 3) = 6 and 5 + 0.75 * (9 - 5) = 8 • over y: 8 + 0.5 * (6 - 8) = 7
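
The slide's bilinear-interpolation arithmetic, reproduced as a small Python sketch; the `gl` dictionary and the function name are only for illustration.

```python
import math

def bilinear(gl, x, y):
    """Gray level at a non-integer location (x, y); gl maps integer
    coordinates (x, y) to gray levels, as in the slide's example."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = x - x0, y - y0
    # interpolate over x on the two neighbouring rows
    row_hi = gl[(x0, y1)] + fx * (gl[(x1, y1)] - gl[(x0, y1)])
    row_lo = gl[(x0, y0)] + fx * (gl[(x1, y0)] - gl[(x0, y0)])
    # then over y between the two rows
    return row_lo + fy * (row_hi - row_lo)

# the slide's numbers: 3 at (1,0), 7 at (2,0), 5 at (1,-1), 9 at (2,-1)
gl = {(1, 0): 3, (2, 0): 7, (1, -1): 5, (2, -1): 9}
print(bilinear(gl, 1.75, -0.5))   # 7.0, as on the slide
```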

  8. training set expansion • elastic deformations • Δx(x,y) = rand(-1, +1), Δy(x,y) = rand(-1, +1) • smooth with a Gaussian of a given standard deviation σ (in pixels) • if σ is large, the resulting displacements are small • if σ is small, the field stays random • intermediate σ: elastic deformation • scale the field by an intensity factor α
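
A hedged sketch of this recipe using SciPy's `gaussian_filter` and `map_coordinates`; `sigma` and `alpha` play the roles of the standard deviation and intensity factor on the slide, and the default values are examples only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, sigma=4.0, alpha=34.0, seed=None):
    """Elastic deformation as on the slide: uniform random displacements
    in [-1, +1], smoothed with a Gaussian of std. dev. `sigma` (pixels),
    scaled by the intensity factor `alpha`, applied with bilinear
    interpolation. `image` is a 2-D grayscale array."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.mgrid[0:h, 0:w]
    coords = np.array([y + dy, x + dx])          # new (row, col) per pixel
    return map_coordinates(image, coords, order=1, mode="reflect")
```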

  9. training set expansion • examples of distortions:

  10. network architecture • account for topological properties of the input (shape of curves, edges, etc.) • gradually extract more complex features • simple features are extracted at higher resolutions; more complex features at coarser resolutions (smaller feature maps) • conversion from one to the other by convolution • coarser resolutions generated by sub-sampling

  11. network architecture

  12. network architecture • set of layers, each with one or more planes • each unit on a plane receives input from a small area on the planes of the previous layer → local receptive fields • shared weights at all points on a plane → reduced number of parameters • multiple planes in each layer → detect multiple features • once a feature is detected, spatial sub-sampling (local averaging) • (partial) invariance to translation, rotation, scale, and deformation
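
A back-of-the-envelope illustration (the layer sizes are assumed, not taken from the slides) of why shared weights cut the parameter count:

```python
# one 24x24 feature plane computed from a 28x28 input with 5x5 receptive fields
in_pixels       = 28 * 28                  # 784 input pixels
plane_units     = 24 * 24                  # 576 units, one per receptive-field position
fully_connected = in_pixels * plane_units  # 451,584 weights if every unit sees every pixel
local_unshared  = plane_units * (5 * 5)    # 14,400 weights with local but unshared fields
shared_kernel   = 5 * 5 + 1                # 26 weights: one shared 5x5 kernel plus a bias
print(fully_connected, local_unshared, shared_kernel)
```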

  13. network architecture • convolutional layer C1: 5 features (e.g. edge, ink, intersection), kernel size 5x5 • sub-sampling layer S1 (factor of 2) • convolutional layer C2: 50 features, kernel size 5x5 • sub-sampling layer S2 (factor of 2) • 100 hidden units
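
A sketch of these layer sizes in PyTorch; the 28x28 input, sigmoid nonlinearities, average pooling, absence of padding, and the class name are assumptions, since the slide only gives the feature counts, kernel size, and sub-sampling factor.

```python
import torch
import torch.nn as nn

class ConvDigitNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 5, kernel_size=5)    # C1: 5 feature maps, 5x5 kernels
        self.s1 = nn.AvgPool2d(2)                   # S1: sub-sample by a factor of 2
        self.c2 = nn.Conv2d(5, 50, kernel_size=5)   # C2: 50 feature maps, 5x5 kernels
        self.s2 = nn.AvgPool2d(2)                   # S2: sub-sample by a factor of 2
        self.fc1 = nn.Linear(50 * 4 * 4, 100)       # 100 hidden units
        self.fc2 = nn.Linear(100, 10)               # one output per digit class

    def forward(self, x):                           # x: (batch, 1, 28, 28)
        x = self.s1(torch.sigmoid(self.c1(x)))      # -> (batch, 5, 12, 12)
        x = self.s2(torch.sigmoid(self.c2(x)))      # -> (batch, 50, 4, 4)
        x = torch.sigmoid(self.fc1(x.flatten(1)))
        return self.fc2(x)

print(ConvDigitNet()(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```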

  14. gradient-based learning • backpropagation • output: Yp = F(Xp, W) • loss function: Ep = D(Dp, F(Xp, W)) • Etrain(W): average of Ep over the training set {(X1, D1), …, (XP, DP)} • e.g. squared error: Ep = (Dp - F(Xp, W))^2 / 2 • Etrain(W) = (1/P) Σp Ep • simplest setting: find W that minimizes Etrain(W)
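
A tiny numeric sketch of the squared-error loss above, with made-up scalar outputs for readability:

```python
import numpy as np

D = np.array([1.0, 0.0, 1.0])     # desired outputs Dp
Y = np.array([0.9, 0.2, 0.6])     # network outputs F(Xp, W)
Ep = (D - Y) ** 2 / 2             # per-pattern errors: [0.005, 0.02, 0.08]
Etrain = Ep.mean()                # (1/P) * sum of Ep = 0.035
```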

  15. gradient-based learning • if E is differentiable w.r.t. W, gradient-based optimization can be used to find the minimum • module output: Xn = Fn(Xn-1, Wn) • Wn: trainable parameters; Wn ⊆ W • Xn-1: module's input (previous module's output) • X0: input pattern Xp

  16. gradient-based learning • if ∂Ep/∂Xn is known, then ∂Ep/∂Wn and ∂Ep/∂Xn-1 can be computed: • ∂Ep/∂Wn = ∂F/∂W (Wn, Xn-1) · ∂Ep/∂Xn (compute the gradient for the weight update) • ∂Ep/∂Xn-1 = ∂F/∂X (Wn, Xn-1) · ∂Ep/∂Xn (propagate backward) • ∂F/∂W (Wn, Xn-1): Jacobian of F w.r.t. W evaluated at (Wn, Xn-1) • ∂F/∂X (Wn, Xn-1): Jacobian of F w.r.t. X evaluated at (Wn, Xn-1) • Jacobian J[F]: matrix containing the partial derivatives of all outputs w.r.t. all inputs, Jki = ∂Fk/∂xi
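
A minimal NumPy sketch of one module's backward pass, assuming (as an example, not from the slides) the module Xn = F(Xn-1, Wn) = sigmoid(Wn Xn-1); given ∂Ep/∂Xn it returns the two quantities of this slide, ∂Ep/∂Wn and ∂Ep/∂Xn-1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def module_forward(W, x_prev):
    """One module: Xn = F(Xn-1, Wn) = sigmoid(Wn Xn-1)."""
    return sigmoid(W @ x_prev)

def module_backward(W, x_prev, dE_dXn):
    """From dEp/dXn, compute dEp/dWn (for the update) and dEp/dXn-1
    (to propagate backward), i.e. the two Jacobian products above."""
    x_n = sigmoid(W @ x_prev)
    delta = dE_dXn * x_n * (1.0 - x_n)   # gradient through the sigmoid
    dE_dW = np.outer(delta, x_prev)      # dEp/dWn
    dE_dx_prev = W.T @ delta             # dEp/dXn-1
    return dE_dW, dE_dx_prev
```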

  17. gradient-based learning • simplest minimization: gradient descent • W is iteratively adjusted: W(t) = W(t-1) - ε · ∂E/∂W • traditional backprop: special case of gradient learning with Yn = Wn Xn-1, Xn = F(Yn)
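
A toy gradient-descent loop (made-up data, a single linear module with identity F) showing the update W(t) = W(t-1) - ε ∂E/∂W; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 training patterns Xp
W_true = np.array([1.0, -2.0, 0.5, 3.0])
D = X @ W_true                           # desired outputs Dp
W = np.zeros(4)
eps = 0.1                                # learning rate (epsilon on the slide)

for t in range(200):
    Y = X @ W                            # network outputs F(Xp, W)
    grad = (Y - D) @ X / len(X)          # dEtrain/dW for Ep = (Dp - Yp)^2 / 2
    W = W - eps * grad                   # W(t) = W(t-1) - eps * dE/dW

print(W.round(2))                        # approaches W_true
```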

  18. application • zip-code scanning (generalized version over the time domain) • fax reading • similar techniques used in other digital image recognition (e.g. face recognition, X-ray, MRI, etc.) • later version (2003): dynamically changing layer parameters
