This presentation discusses the use of parallel computing in multimedia content analysis (MMCA), focusing on image processing. It explores the challenges of analyzing multimedia content at scale and presents Parallel-Horus, a software platform for MMCA on clusters and grids. Realistic problem scenarios such as CCTV analysis and web video search highlight the need for efficient analysis techniques, and potential applications of MMCA in fields such as healthcare, astronomy, and remote sensing are discussed.
Multimedia Content Analysis on Clusters and Grids
Frank J. Seinstra (fjseins@cs.vu.nl)
Computer Systems Group, Faculty of Sciences, Vrije Universiteit, Amsterdam
Parallel Computing 2010 – Vrije Universiteit, Amsterdam
Overview (1)
• Part 1: What is Multimedia Content Analysis (MMCA)?
• Part 2: Why parallel computing in MMCA – and how?
• Part 3: Software Platform: Parallel-Horus
• Part 4: Example – Parallel Image Processing on Clusters
Overview (2)
• Part 5: ‘Grids’ and their specific problems
• Part 6: A Software Platform for MMCA on ‘Grids’?
• Part 7: Large-scale MMCA applications on ‘Grids’
• Part 8: Future research directions => Jungle Computing
Introduction
• A Few Realistic Problem Scenarios
A Real Problem…
• News broadcast - September 21, 2005:
• Police investigation: over 80,000 CCTV recordings
• First match found only 2.5 months after the attacks
• => automatic analysis?
Another real problem…
• Web Video Search (example query: “Sarah Palin”):
• Search based on annotations
• Known to be notoriously bad (e.g., YouTube)
• Instead: search based on video content
Are these realistic problems?
• Beeld&Geluid (Dutch Institute for Sound and Vision, Hilversum):
• Interactive access to Dutch national TV history
• NFI (Dutch Forensics Institute, Den Haag):
• Surveillance Camera Analysis
• Crime Scene Reconstruction
But there are many more:
• Healthcare
• Astronomy
• Remote Sensing
• Entertainment (e.g. see: PhotoSynth.net)
• ….
Part 1
• What is Multimedia Content Analysis?
Multimedia
• Multimedia = Text + Sound + Image + Video + ….
• Video = image + image + image + ….
• In many (not all) multimedia applications:
• calculations are executed on each separate video frame independently
• So: we focus on Image Processing (+ Computer Vision)
What is a Digital Image?
• “An image is a continuous function that has been discretized in spatial coordinates, brightness and color frequencies”
• Most often: 2-D with ‘pixels’ as scalar or vector value
However:
• Image dimensionality can range from 1-D to n-D
• Example (medical): 5-D = x, y, z, time, emission wavelength
• Pixel dimensionality can range from 1-D to n-D
• Generally: 1-D = binary/grayscale; 3-D = color (e.g. RGB)
• n-D = hyper-spectral (e.g. remote sensing by satellites)
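To make the data model concrete: a minimal sketch of a 2-D image with n-D pixel values (my illustration; the Image2D name and layout are assumptions, not Horus code). Later sketches in this deck reuse this type.

    #include <cstddef>
    #include <vector>

    // Minimal sketch: a 2-D image whose pixels are n-D value vectors
    // (n = 1 for grayscale, n = 3 for RGB, n > 3 for hyper-spectral).
    struct Image2D {
        std::size_t width, height, channels;
        std::vector<float> data; // row-major, interleaved channels

        Image2D(std::size_t w, std::size_t h, std::size_t c)
            : width(w), height(h), channels(c), data(w * h * c, 0.0f) {}

        float& at(std::size_t x, std::size_t y, std::size_t c) {
            return data[(y * width + x) * channels + c];
        }
    };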
Complete A-Z Multimedia Applications
• Low level operations: Image ===> (sub-) Image; Image ===> Scalar / Vector Value
• Intermediate level operations: Image ===> Array of S/V Values; ===> Feature Vector
• High level operations: In: image; Out: ‘meaning’ful result (e.g. “Blue Car”, “Pres. Bush stepping off Airforce 1”, “Supernova at X,Y,t…”)
• The (Parallel-) Horus and Impala libraries cover these operation levels
Low Level Image Processing Patterns (1)
• Unary Pixel Operation (example: absolute value)
• Binary Pixel Operation (example: addition)
• N-ary Pixel Operation…
• Template / Kernel / Filter / Neighborhood Operation (example: Gauss filter)
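A minimal sketch of the unary and binary pixel patterns, written against the hypothetical Image2D type above (my illustration, not the Horus API):

    #include <cmath>

    // Unary pixel operation: apply f to every pixel value independently.
    void unaryPixOp(Image2D& dst, const Image2D& src, float (*f)(float)) {
        for (std::size_t i = 0; i < src.data.size(); ++i)
            dst.data[i] = f(src.data[i]);
    }

    // Binary pixel operation: combine corresponding pixel values of two images.
    void binaryPixOp(Image2D& dst, const Image2D& a, const Image2D& b,
                     float (*f)(float, float)) {
        for (std::size_t i = 0; i < a.data.size(); ++i)
            dst.data[i] = f(a.data[i], b.data[i]);
    }

    // Example usage (captureless lambdas convert to function pointers):
    // unaryPixOp(out, in, [](float v) { return std::fabs(v); });   // absolute value
    // binaryPixOp(out, a, b, [](float p, float q) { return p + q; }); // addition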
Low Level Image Processing Patterns (2)
• Reduction Operation (example: sum)
• N-Reduction Operation (example: histogram)
• Geometric Transformation (example: rotation, using a transformation matrix)
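The reduction patterns admit an equally small sketch (same hypothetical Image2D; sum as the reduction, a histogram as the n-reduction):

    #include <cstdint>

    // Reduction: collapse an image to a single value (here: sum).
    float reduceSum(const Image2D& im) {
        float s = 0.0f;
        for (float v : im.data) s += v;
        return s;
    }

    // N-reduction: collapse an image to N values (here: a histogram).
    // Assumes bins > 0 and hi > lo; out-of-range values are clamped.
    std::vector<std::uint64_t> histogram(const Image2D& im, std::size_t bins,
                                         float lo, float hi) {
        std::vector<std::uint64_t> h(bins, 0);
        for (float v : im.data) {
            float t = (v - lo) / (hi - lo);
            if (t < 0.0f) t = 0.0f;
            std::size_t b = static_cast<std::size_t>(t * bins);
            if (b >= bins) b = bins - 1;
            ++h[b];
        }
        return h;
    }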
Example Application: Template Matching
(figure: Input Image, Template, Result Image)

    for all images {
        inputIm = readFile( … );
        unaryPixOpI( sqrdInIm, inputIm, "set" );
        binaryPixOpI( sqrdInIm, inputIm, "mul" );
        for all symbol images {
            symbol = readFile( … );
            weight = readFile( … );
            unaryPixOpI( filtIm1, sqrdInIm, "set" );
            unaryPixOpI( filtIm2, inputIm, "set" );
            genNeighborhoodOp( filtIm1, borderMirror, weight, "mul", "sum" );
            binaryPixOpI( symbol, weight, "mul" );
            genNeighborhoodOp( filtIm2, borderMirror, symbol, "mul", "sum" );
            binaryPixOpI( filtIm1, filtIm2, "sub" );
            binaryPixOpI( maxIm, filtIm1, "max" );
        }
        writeFile( …, maxIm, … );
    }

See: http://www.cs.vu.nl/~fjseins/ParHorusCode/
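One way to read this code (my interpretation; the slide does not spell it out): a weighted sum-of-squared-differences match between input I and template T expands as

    sum w·(I - T)^2 = sum w·I^2 - 2·sum w·I·T + sum w·T^2

The first genNeighborhoodOp computes a sum w·I^2 term (filtIm1), the second computes the cross term sum w·I·T (filtIm2, after folding the weights into the template), and the sum w·T^2 term is constant per template. The per-pixel match score can therefore be formed from filtIm1 and filtIm2 alone, and maxIm keeps the best response over all templates.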
Part 2
• Why Parallel Computing in MMCA (and how)?
The ‘Need for Speed’ in MMCA
• Growing interest in international ‘benchmark evaluations’
• Task: find ‘semantic concepts’ automatically
• Example: NIST TRECVID (200+ hours of video)
• A problem of scale:
• At least 30-50 hours of processing time per hour of video
• Beeld&Geluid: 20,000 hours of TV broadcasts per year
• NASA: over 10 TB of hyper-spectral image data per day
• London Underground: over 120,000 years of processing…!!!
High Performance Computing
• Solution:
• Parallel & distributed computing at a very large scale
• Candidate hardware: General Purpose CPUs, GPUs, Accelerators, Clusters, Grids
• Question:
• What type of high-performance hardware is most suitable?
• Our initial choice:
• Clusters of general purpose CPUs (e.g. DAS-cluster)
• For many pragmatic reasons…
User Transparent Parallelization Tools
• For non-experts in Parallel Computing?
• Existing tools span a spectrum, roughly from high programmer effort / high efficiency down to low effort / lower efficiency:
• Message Passing Libraries (e.g., MPI, PVM)
• Shared Memory Specifications (e.g., OpenMP)
• Parallel Languages (e.g., Occam, Orca)
• Extended High Level Languages (e.g., HPF)
• Parallel Image Processing Languages (e.g., Apply, IAL)
• Automatic Parallelizing Compilers
• Parallel Image Processing Libraries
Existing Parallel Image Processing Libs
• Suffer from many problems:
• No ‘familiar’ programming model:
• Identifying parallelism still the responsibility of programmer (e.g. data partitioning [Taniguchi97], loop parallelism [Niculescu02, Olk95])
• Reduced maintainability / portability:
• Multiple implementations for each operation [Jamieson94]
• Restricted to particular machine [Moore97, Webb93]
• Non-optimal efficiency of parallel execution:
• Ignore machine characteristics for optimization [Juhasz98, Lee97]
• Ignore optimization across library calls [all]
Our Approach
• Sustainable software library for user-transparent parallel image processing
• (1) Sustainability:
• Maintainability, extensibility, portability (i.e. from Horus)
• Applicability to commodity clusters
• (2) User transparency:
• Strictly sequential API (identical to Horus)
• Intra-operation efficiency & inter-operation efficiency
Part 3 (a)
• Software Platform: Parallel-Horus (parallel algorithms)
What Type(s) of Parallelism to support?
• Data parallelism:
• “exploitation of concurrency that derives from the application of the same operation to multiple elements of a data structure” [Foster, 1995]
• Task parallelism:
• “a model of parallel computing in which many different operations may be executed concurrently” [Wilson, 1995]
Why Data Parallelism (only)?
• Natural approach for low level image processing
• Scalability (in general: #pixels >> #different tasks)
• Load balancing is easy
• Finding independent tasks automatically is hard
• In other words: it’s just the best starting point… (but not necessarily optimal at all times)
Many Algorithms Embarrassingly Parallel
• On 2 CPUs:

    Parallel Operation on Image {
        Scatter Image (1)
        Sequential Operation on Partial Image (2)
        Gather Result Data (3)
    }

• Works (with minor issues) for: unary, binary, n-ary operations & (n-) reduction operations
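A minimal MPI sketch of this three-step pattern (my illustration, not Parallel-Horus internals; assumes MPI_Init has been called, the image height divides evenly over the CPUs, and `image` holds width*height floats at the root):

    #include <mpi.h>
    #include <cmath>
    #include <vector>

    // Scatter -> sequential operation -> gather, for a unary pixel op (abs).
    void parallelAbs(std::vector<float>& image, int width, int height) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int localN = (height / size) * width;   // one horizontal slice per CPU
        std::vector<float> part(localN);

        // (1) Scatter: distribute horizontal image slices over the CPUs.
        MPI_Scatter(image.data(), localN, MPI_FLOAT,
                    part.data(), localN, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // (2) Sequential operation on the partial image.
        for (float& v : part) v = std::fabs(v);

        // (3) Gather the partial results back at the root.
        MPI_Gather(part.data(), localN, MPI_FLOAT,
                   image.data(), localN, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }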
Other only marginally more complex (1)

    Parallel Filter Operation on Image {
        Scatter Image (1)
        Allocate Scratch (2)
        Copy Image into Scratch (3)
        Handle / Communicate Borders (4)
        Sequential Filter Operation on Scratch (5)
        Gather Image (6)
    }

• Also possible: ‘overlapping’ scatter
• But not very useful in iterative filtering
• On 2 CPUs (without scatter / gather): each CPU filters its own SCRATCH partition
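Step (4) is the only communicating step. A sketch of such a border (‘halo’) exchange with MPI_Sendrecv (hypothetical helper, not Parallel-Horus code; assumes scratch is laid out as [haloRows rows | localRows owned rows | haloRows rows], each row holding `width` floats):

    // Each CPU exchanges edge rows with its neighbors so the filter
    // can read across partition boundaries.
    void exchangeBorders(std::vector<float>& scratch, int width,
                         int localRows, int haloRows) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
        int n = haloRows * width;

        float* topOwned   = scratch.data() + n;                  // first owned rows
        float* topHalo    = scratch.data();                      // halo above
        float* botOwned   = scratch.data() + localRows * width;  // last owned rows
        float* botHalo    = scratch.data() + (haloRows + localRows) * width;

        // Send owned edge rows up/down; receive neighbors' rows into halos.
        MPI_Sendrecv(topOwned, n, MPI_FLOAT, up, 0,
                     botHalo,  n, MPI_FLOAT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(botOwned, n, MPI_FLOAT, down, 1,
                     topHalo,  n, MPI_FLOAT, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }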
Other only marginally more complex (2)

    Parallel Geometric Transformation on Image {
        Broadcast Image (1)
        Create Partial Image (2)
        Sequential Transform on Partial Image (3)
        Gather Result Image (4)
    }

• Potential faster implementations for special cases
• On 2 CPUs (without broadcast / gather shown): each CPU computes its own part of the RESULT IMAGE
Challenge: Separable Recursive Filtering
• Template / Kernel / Filter / Neighborhood Operation (example: Gauss filter)
• Equivalent: one 2-D filter step … followed by … replaced by two successive 1-D filter steps
Challenge: Separable Recursive Filtering
• Separable filters:
• 1 x 2-D becomes 2 x 1-D (see the sketch below)
• Drastically reduces sequential computation time
• Recursive filtering:
• result of each filter step (a pixel value) is stored back into the input image
• So: a recursive filter uses (part of) its output as input
• For parallelization:
• In each step, newly calculated/stored data may be located on another node
• In each step, horizontal OR vertical data dependencies, with ‘on the fly’ updates of the data
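A sketch of plain (non-recursive) separable filtering on the single-channel Image2D from earlier, replacing one k x k 2-D convolution by two 1-D passes, so k*k multiply-adds per pixel become 2*k (my illustration, not Parallel-Horus code; odd-length kernel, clamped borders):

    #include <algorithm>

    void separableFilter(Image2D& im, const std::vector<float>& kernel) {
        int r = static_cast<int>(kernel.size()) / 2;
        int w = static_cast<int>(im.width), h = static_cast<int>(im.height);
        Image2D tmp(im.width, im.height, 1);
        for (int y = 0; y < h; ++y)              // horizontal 1-D pass
            for (int x = 0; x < w; ++x) {
                float s = 0.0f;
                for (int i = -r; i <= r; ++i) {
                    int xx = std::min(std::max(x + i, 0), w - 1); // clamp border
                    s += kernel[i + r] * im.at(xx, y, 0);
                }
                tmp.at(x, y, 0) = s;
            }
        for (int y = 0; y < h; ++y)              // vertical 1-D pass
            for (int x = 0; x < w; ++x) {
                float s = 0.0f;
                for (int i = -r; i <= r; ++i) {
                    int yy = std::min(std::max(y + i, 0), h - 1); // clamp border
                    s += kernel[i + r] * tmp.at(x, yy, 0);
                }
                im.at(x, y, 0) = s;
            }
    }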
Parallel Recursive Filtering: Solution 1
• Pipeline: (SCATTER) → (FILTER X-dir) → (TRANSPOSE) → (FILTER Y-dir) → (GATHER)
• Drawback: transpose operation is very expensive (esp. when nr. of CPUs is large)
Parallel Recursive Filtering: Solution 2
• Loop-carried dependence at final stage (sub-image level):
• minimal communication overhead
• full serialization
• Loop-carried dependence at innermost stage (pixel-column level):
• high communication overhead
• fine-grained wave-front parallelism
• Tiled loop-carried dependence at intermediate stage (image-tile level):
• moderate communication overhead
• coarse-grained wave-front parallelism
Wavefront parallelism
• Drawback:
• partial serialization
• non-optimal use of available CPUs
Parallel Recursive Filtering: Solution 3
• Multipartitioning:
• Skewed cyclic block partitioning (see the sketch below)
• Each CPU owns at least one tile in each of the distributed dimensions
• All neighboring tiles in a particular direction are owned by the same CPU
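The skewed cyclic assignment itself is a one-liner; a sketch (my illustration) for P CPUs on a P x P tile grid:

    // Owner of tile (row, col) under skewed cyclic block partitioning.
    // Each CPU owns exactly one tile in every tile-row and tile-column,
    // and the neighbor in a fixed direction of any of CPU k's tiles always
    // belongs to the same other CPU (e.g. (k+1) mod P for the right neighbor).
    int tileOwner(int row, int col, int P) {
        return ((col - row) % P + P) % P;
    }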
Parallel Recursive Filtering: Solution 3 (continued)
• Full Parallelism:
• First in one direction…
• And then in the other…
• Border exchange at end of each sweep
• Communication at end of sweep always with the same node
Part 3 (b)
• Software Platform: Parallel-Horus (platform design)
Parallel-Horus: Parallelizable Patterns
• Layering: Horus Sequential API → Parallelizable Patterns → Parallel Extensions (MPI)
• Minimal intrusion:
• Re-use as much as possible the original sequential Horus library codes
• Parallelization localized
• Easy to implement extensions
Pattern implementations (old vs. new)

Old (sequential only):

    template<class …, class …, class …>
    inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
    {
        if (dst == 0) dst = CxArrayClone<DstArrayT>(src);
        CxFuncUpoDispatch(dst, src, upo);
        return dst;
    }

New (sequential + parallel):

    template<class …, class …, class …>
    inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
    {
        if (dst == 0) dst = CxArrayClone<DstArrayT>(src);
        if (!PxRunParallel()) {   // run sequential
            CxFuncUpoDispatch(dst, src, upo);
        } else {                  // run parallel
            PxArrayPreStateTransition(src, …, …);
            PxArrayPreStateTransition(dst, …, …);
            CxFuncUpoDispatch(dst, src, upo);
            PxArrayPostStateTransition(dst);
        }
        return dst;
    }
Inter-Operation Optimization
• Lazy Parallelization (on the fly!):
• Don’t do this: Scatter → ImageOp → Gather → Scatter → ImageOp → Gather
• Do this: Scatter → ImageOp → ImageOp → Gather
• Avoid communication between consecutive operations
Finite State Machine
• Communication operations serve as state transition functions between distributed data structure states
• State transitions performed only when absolutely necessary
• State transition functions allow correct conversion of legal sequential code to legal parallel code at all times
• Nice features:
• Requires no a priori knowledge of loops and branches
• Can be done on the fly at run-time (with no measurable overhead)
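A sketch of the idea (my illustration; the real Parallel-Horus state set and transition functions differ in detail): each distributed array carries its current distribution state, and the pre-state-transition function communicates only on a mismatch.

    // Possible layouts of a distributed array across the CPUs.
    enum class ArrayState { None, Scattered, Replicated, GatheredAtRoot };

    struct DistributedArray {
        ArrayState state = ArrayState::None;
    };

    // Called before an operation that needs the 'required' layout: perform
    // the (expensive) communication only if the array is not in that state.
    void preStateTransition(DistributedArray& a, ArrayState required) {
        if (a.state == required) return;  // lazy: skip redundant scatter/gather
        // ... MPI scatter / gather / broadcast as appropriate ...
        a.state = required;
    }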
Part 4
• Example – Parallel Image Processing on Clusters
Example: Curvilinear Structure Detection
• Apply anisotropic Gaussian filter bank to input image
• Maximum response when filter tuned to line direction
• Here: 3 different implementations
• fixed filters applied to a rotating image
• rotating filters applied to fixed input image
• separable (UV)
• non-separable (2D)
• Depending on parameter space:
• a few minutes to several hours
Sequential = Parallel (1)

IMPLEMENTATION 1:

    for all orientations theta {
        geometricOp( inputIm, &rotatIm, -theta, LINEAR, 0, p, "rotate" );
        for all smoothing scales sy {
            for all differentiation scales sx {
                genConvolution( filtIm1, mirrorBorder, "gauss", sx, sy, 2, 0 );
                genConvolution( filtIm2, mirrorBorder, "gauss", sx, sy, 0, 0 );
                binaryPixOpI( filtIm1, filtIm2, "negdiv" );
                binaryPixOpC( filtIm1, sx*sy, "mul" );
                binaryPixOpI( contrIm, filtIm1, "max" );
            }
        }
        geometricOp( contrIm, &backIm, theta, LINEAR, 0, p, "rotate" );
        binaryPixOpI( resltIm, backIm, "max" );
    }
Sequential = Parallel (2 & 3)

IMPLEMENTATIONS 2 and 3:

    for all orientations theta {
        for all smoothing scales sy {
            for all differentiation scales sx {
                genConvolution( filtIm1, mirrorBorder, "func", sx, sy, 2, 0 );
                genConvolution( filtIm2, mirrorBorder, "func", sx, sy, 0, 0 );
                binaryPixOpI( filtIm1, filtIm2, "negdiv" );
                binaryPixOpC( filtIm1, sx*sy, "mul" );
                binaryPixOpI( resltIm, filtIm1, "max" );
            }
        }
    }
Measurements (DAS-1)
• 512x512 image
• 36 orientations
• 8 anisotropic filters
• => Part of the efficiency of parallel execution always remains in the hands of the application programmer!
Measurements (DAS-2)
• 512x512 image
• 36 orientations
• 8 anisotropic filters
• So: lazy parallelization (or: optimization across library calls) is very important for high efficiency!
Part 5
• ‘Grids’ and their Specific Problems
The ‘Promise of The Grid’
• 1997 and beyond:
• efficient and transparent (i.e. easy-to-use) wall-socket computing over a distributed set of resources
• Compare: the electrical power grid
Grid Problems (1)
• Getting an account on remote compute clusters is hard!
• Find the right person to contact…
• Hope he/she does not completely ignore your request…
• Provide proof of (among other things) relevance, ethics, ‘trusted’ nationality…
• Fill in and sign NDAs, Foreign National Information sheets, official usage documents, etc.…
• Wait for the account to be created, & username to be sent to you…
• Hope to obtain an initial password as well…
• Getting access to an existing international Grid testbed is easier
• But only marginally so…
Grid Problems (2)
• Getting your C++/MPI code to compile and run is hard!
• Copying your code to the remote cluster (‘scp’ often not allowed)…
• Setting up your environment & finding the right MPI compiler (mpicc, mpiCC, … ???)…
• Making the necessary include libraries available…
• Finding out how to use the cluster reservation system…
• Finding the correct way to start your program (mpiexec, mpirun, … and on which nodes ???)…
• Getting your compute nodes to communicate with other machines (generally not allowed)…
• So:
• Nothing is standardized yet (not even Globus)
• A working application in one Grid domain will generally fail in all others
Grid Problems (3)
• Keeping an application running (efficiently) is hard!
• Grids are inherently dynamic:
• Networks and CPUs are shared with others, causing fluctuations in resource availability
• Grids are inherently faulty:
• compute nodes & clusters may crash at any time
• Grids are inherently heterogeneous:
• optimization for run-time execution efficiency is by-and-large unknown territory
• So:
• An application that runs (efficiently) at one moment should be expected to fail a moment later