This presentation explores how parallel computing improves the speed and scalability of multimedia content analysis (MMCA). It covers the Parallel-Horus software platform and MMCA applications on clusters and Grids, illustrated by real-world scenarios such as news broadcast analysis and surveillance camera investigation. It introduces image processing and digital image concepts for MMCA, explains why MMCA research needs scalability through parallel computing, and discusses the roles of general purpose CPUs, GPUs, accelerators, clusters, and grids in high-performance computing for MMCA.
Multimedia Content Analysis on Clusters and Grids Frank Seinstra (fjseins@cs.vu.nl) Vrije Universiteit Department of Computer Science Amsterdam
Overview (1) • Part 1: What is Multimedia Content Analysis (MMCA) ? • Part 2: Why parallel computing in MMCA - and how? • Part 3: Software Platform: Parallel-Horus • Part 4: Example – Parallel Image Processing on Clusters
Overview (2) • Part 5: Grids and their specific problems • Part 6: A Software Platform for MMCA on Grids • Part 7: Large-scale MMCA applications on Grids • Part 8: Future research directions (new projects?)
A Real Problem… • News broadcast – September 21, 2005 • Police investigation: over 80,000 CCTV recordings, analyzed by hand • First match found only 2.5 months after the attacks • Could automatic analysis have helped?
Another Real Problem… • Web Video Search (example query: “Sarah Palin”) • Search based on video annotations • Known to be notoriously bad (e.g. YouTube) • Instead: search based on video content
Are these realistic problems? • NFI (Netherlands Forensic Institute, Den Haag): Surveillance Camera Analysis / Crime Scene Reconstruction • Beeld&Geluid (Dutch Institute for Sound and Vision, Hilversum): Interactive access to Dutch national TV history
But there are many more problems and challenges out there… • Healthcare • Astronomy • Remote Sensing • Entertainment (e.g. see: PhotoSynth.net) • ….
Multimedia • Multimedia = Text + Sound + Image + Video + …. • Video = image + image + image + …. • In many (not all) multimedia applications: calculations are executed on each separate video frame independently • So: we focus on Image Processing (+ Computer Vision)
What is a Digital Image? • An image is a continuous function that has been discretized in spatial coordinates, brightness and color frequencies • Most often: 2-D with ‘pixels’ as scalar or vector value • However: • Image dimensionality can range from 1-D to n-D • Example (medical imaging): 5-D = x, y, z, time, emission wavelength • Pixel dimensionality can range from 1-D to n-D • Generally: 1D = binary/grayscale; 3D = color (e.g. RGB) • n-D = hyper-spectral (e.g. remote sensing by satellites; every 10-20nm)
Complete A-Z Multimedia Applications • (Parallel-) Horus covers the whole chain: in: image, out: ‘meaning’ful result, e.g. a ranked set of images or a recognized object (“Pres. Bush stepping off Airforce 1…”, “Supernova at X,Y,t…”, “Impala”, “Blue Car”) • Low level operations: Image ---> (sub-) Image; Image ---> Scalar / Vector Value; Image ---> Array of S/V Values • Intermediate level operations: Scalar / Vector Value(s) ---> Feature Vector; Feature Vector(s) ---> Similarity Matrix; Feature Vector(s) ---> Clustering (K-Means, …); Feature Vector(s) ---> Classification (SVM, …); …… • High level operations: combine the above into complete A-Z applications
Low Level Image Processing ‘Patterns’ (1) • Unary Pixel Operation (example: absolute value) • Binary Pixel Operation (example: addition) • N-ary Pixel Operation… • Template / Kernel / Filter / Neighborhood Operation (example: Gauss filter)
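To make these pixel-operation patterns concrete, a minimal C++ sketch (illustrative only; the Image type and function names are not the actual Horus API):

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // Illustrative grayscale image: width x height pixel values (not the Horus type).
    struct Image {
        int width = 0, height = 0;
        std::vector<float> pix;                     // size: width * height
    };

    // Unary pixel operation (example: absolute value), applied in place.
    void unaryAbs(Image& im) {
        for (float& p : im.pix) p = std::fabs(p);
    }

    // Binary pixel operation (example: addition): dst = dst + src, pixel by pixel.
    void binaryAdd(Image& dst, const Image& src) {
        for (std::size_t i = 0; i < dst.pix.size(); ++i) dst.pix[i] += src.pix[i];
    }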
Low Level Image Processing ‘Patterns’ (2) • Reduction Operation (example: sum) • N-Reduction Operation (example: histogram) • Geometric Transformation, defined by a transformation matrix M (example: rotation)
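Likewise, a minimal C++ sketch of the reduction patterns (again illustrative only, operating directly on a vector of pixel values):

    #include <vector>
    #include <cstddef>

    // Reduction operation (example: sum): image ---> single scalar value.
    float reduceSum(const std::vector<float>& pix) {
        float sum = 0.0f;
        for (float p : pix) sum += p;
        return sum;
    }

    // N-reduction operation (example: histogram): image ---> array of values.
    std::vector<int> histogram(const std::vector<float>& pix, int bins, float minV, float maxV) {
        std::vector<int> h(bins, 0);
        for (float p : pix) {
            int b = static_cast<int>((p - minV) / (maxV - minV) * bins);
            if (b < 0) b = 0;
            if (b >= bins) b = bins - 1;
            ++h[b];
        }
        return h;
    }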
Example Application: Template Matching (template + input image ---> result image)
Example Application: Template Matching

    for all images {
        inputIm = readFile ( … );
        unaryPixOpI ( sqrdInIm, inputIm, “set” );
        binaryPixOpI ( sqrdInIm, inputIm, “mul” );
        for all symbol images {
            symbol = readFile ( … );
            weight = readFile ( … );
            unaryPixOpI ( filtIm1, sqrdInIm, “set” );
            unaryPixOpI ( filtIm2, inputIm, “set” );
            genNeighborhoodOp ( filtIm1, borderMirror, weight, “mul”, “sum” );
            binaryPixOpI ( symbol, weight, “mul” );
            genNeighborhoodOp ( filtIm2, borderMirror, symbol, “mul”, “sum” );
            binaryPixOpI ( filtIm1, filtIm2, “sub” );
            binaryPixOpI ( maxIm, filtIm1, “max” );
        }
        writeFile ( …, maxIm, … );
    }

See: http://www.science.uva.nl/~fjseins/ParHorusCode/
The ‘Need for Speed’ in MMCA Research & Applications • Growing interest in international ‘benchmark evaluations’ (i.e.: competitions) • Task: find ‘semantic concepts’ automatically • PASCAL VOC Challenge (10,000++ images) • NIST TRECVID (200+ hours of video) • A problem of scale: • At least 30-50 hours of processing time per hour of video • Beeld&Geluid: 20,000 hours of TV broadcasts per year • NASA: over 1 TB of hyper-spectral image data per day • London Underground: over 120,000 years of processing…!!!
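To put these numbers in perspective (a rough back-of-the-envelope estimate, assuming 50 hours of processing per hour of video): the yearly Beeld&Geluid archive alone already requires about 20,000 x 50 = 1,000,000 CPU hours, i.e. well over a century of processing on a single CPU.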
High-Performance Computing • Question: What type of high-performance hardware is most suitable? (general purpose CPUs, GPUs, accelerators, clusters, grids) • Solution: Parallel & distributed computing at a very large scale • Our initial choice: Clusters of general purpose CPUs (e.g. DAS-cluster) • For many pragmatic reasons…
User Transparent Parallelization Tools • But… how to let ‘non-experts’ program clusters easily & efficiently? • Parallelization tools: Compilers, Languages, Parallelization Libraries (general purpose & domain specific) • The available approaches span a spectrum, trading programmer effort against efficiency: Message Passing Libraries (e.g., MPI, PVM), Shared Memory Specifications (e.g., OpenMP), Parallel Languages (e.g., Occam, Orca), Extended High Level Languages (e.g., HPF), Parallel Image Processing Languages (e.g., Apply, IAL), Automatic Parallelizing Compilers, and Parallel Image Processing Libraries
Existing Parallel Image Processing Libraries • Suffer from many problems (cf. Bräunl et al., 2001): • No ‘familiar’ programming model: identifying parallelism is still the responsibility of the programmer (e.g. data partitioning [Taniguchi97], loop parallelism [Niculescu02, Olk95]) • Reduced maintainability / portability: multiple implementations for each operation [Jamieson94]; restricted to a particular machine [Moore97, Webb93] • Non-optimal efficiency of parallel execution: ignore machine characteristics for optimization [Juhasz98, Lee97]; ignore optimization across library calls [all!]
Our Approach • Sustainable library-based software architecture for user-transparent parallel image processing • (1) Sustainability: • Maintainability, extensibility, portability (i.e. from Horus) • Applicability to commodity clusters • (2) User transparency: • Strictly sequential API (identical to Horus) • Intra-operation efficiency & inter-operation efficiency (2003)
What Type(s) of Parallelism to Support? • Data parallelism: • “exploitation of concurrency that derives from the application of the same operation to multiple elements of a data structure” [Foster, 1995] • Task parallelism: • “a model of parallel computing in which many different operations may be executed concurrently” [Wilson, 1995]
Why Data Parallelism (only)? (recall the template matching code shown earlier) • Natural approach for low level image processing • Scalability (in general: #pixels >> #different tasks) • Load balancing is easy • Finding independent tasks automatically is hard • In other words: it’s just the best starting point… (but not necessarily optimal at all times)
Many Low Level Imaging Algorithms are Embarrassingly Data Parallel • On 2 CPUs:

    Parallel Operation on Image {
        Scatter Image                              (1)
        Sequential Operation on Partial Image      (2)
        Gather Result Data                         (3)
    }

• Works (with minor issues) for unary, binary, n-ary operations & (n-) reduction operations
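As an illustration of this scatter / compute / gather pattern, the following sketch (MPI-based C++; not taken from Parallel-Horus, all names and the row-wise partitioning are illustrative) applies a unary pixel operation to an image distributed over all CPUs:

    #include <mpi.h>
    #include <vector>
    #include <cmath>

    // Apply a unary pixel operation (absolute value) to an image of
    // 'rows' x 'cols' floats. The full image is only significant on CPU 0;
    // it is distributed row-wise over all CPUs. Assumes rows is divisible
    // by the number of CPUs (illustration only).
    void parallelUnaryAbs(std::vector<float>& image, int rows, int cols) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int localCount = (rows / size) * cols;
        std::vector<float> partial(localCount);

        // (1) Scatter: each CPU receives its own block of rows.
        MPI_Scatter(image.data(), localCount, MPI_FLOAT,
                    partial.data(), localCount, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // (2) Sequential operation on the partial image.
        for (float& p : partial) p = std::fabs(p);

        // (3) Gather: collect the partial results on CPU 0.
        MPI_Gather(partial.data(), localCount, MPI_FLOAT,
                   image.data(), localCount, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }

The same structure carries over to binary and n-ary operations (scatter all operands) and to reductions (replace the final gather by an MPI_Reduce).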
Other Imaging Algorithms Are Only Marginally More Complex (1) • On 2 CPUs (without scatter / gather):

    Parallel Filter Operation on Image {
        Scatter Image                              (1)
        Allocate Scratch                           (2)
        Copy Image into Scratch                    (3)
        Handle / Communicate Borders               (4)
        Sequential Filter Operation on Scratch     (5)
        Gather Image                               (6)
    }

• Also possible: ‘overlapping’ scatter • But not very useful in iterative filtering
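Step (4), the border exchange, could look as follows for a row-wise partitioned image (again an MPI-based sketch with illustrative names, reusing the includes of the previous sketch; not the Parallel-Horus code). The scratch buffer is assumed to hold 'border' halo rows, then the CPU's own localRows rows, then 'border' halo rows again:

    // Exchange 'border' rows of the local block with the upper and lower
    // neighbor CPUs, so that a filter of half-width 'border' can afterwards
    // be applied to the scratch buffer without further communication.
    void exchangeBorders(std::vector<float>& scratch, int localRows, int cols, int border) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
        int n = border * cols;

        // Scratch layout: [border halo rows][localRows own rows][border halo rows].
        float* topHalo    = scratch.data();
        float* firstRows  = scratch.data() + n;
        float* lastRows   = scratch.data() + localRows * cols;
        float* bottomHalo = scratch.data() + (border + localRows) * cols;

        // Send first own rows up, receive bottom halo from below.
        MPI_Sendrecv(firstRows,  n, MPI_FLOAT, up,   0,
                     bottomHalo, n, MPI_FLOAT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send last own rows down, receive top halo from above.
        MPI_Sendrecv(lastRows,   n, MPI_FLOAT, down, 1,
                     topHalo,    n, MPI_FLOAT, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }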
Other Imaging Algorithms Are Only Marginally More Complex (2) • On 2 CPUs (rotation; without broadcast / gather):

    Parallel Geometric Transformation on Image {
        Broadcast Image                            (1)
        Create Partial Image                       (2)
        Sequential Transform on Partial Image      (3)
        Gather Result Image                        (4)
    }

• Potential faster implementations for special cases
More Challenging: Separable Recursive Filtering (2 x 1-D) • Separable filters (1 x 2-D becomes 2 x 1-D, e.g. the 2-D Gauss filter applied as an x-direction filter followed by a y-direction filter): drastically reduces sequential computation time • Recursive filtering: result of each filter step (a pixel value) stored back into the input image • So: a recursive filter re-uses (part of) its output as input
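For intuition, a minimal sketch of a 1-D recursive filter applied along the rows of an image (a simple first-order smoothing filter with coefficient alpha; illustrative only, not the anisotropic Gaussian filters used later). Each output pixel depends on the previously computed output pixel, which is exactly the loop-carried dependence that makes parallelization hard:

    #include <vector>

    // First-order recursive smoothing along each row:
    //   out[x] = alpha * in[x] + (1 - alpha) * out[x-1]
    // The result is written back into the image, so each step re-uses
    // earlier output as input.
    void recursiveFilterRows(std::vector<float>& image, int rows, int cols, float alpha) {
        for (int y = 0; y < rows; ++y) {
            float* row = image.data() + y * cols;
            for (int x = 1; x < cols; ++x)
                row[x] = alpha * row[x] + (1.0f - alpha) * row[x - 1];
        }
    }

A separable recursive filter would subsequently apply the same idea along the columns (the y-direction).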
Parallel Recursive Filtering: Solution 1 • Scatter, filter in x-direction, transpose, filter in y-direction, gather • Drawback: the transpose operation is very expensive (especially when the number of CPUs is large)
Parallel Recursive Filtering: Solution 2 • Loop-carried dependence at innermost stage (pixel-column level): high communication overhead, fine-grained wave-front parallelism • Tiled loop-carried dependence at intermediate stage (image-tile level): moderate communication overhead, coarse-grained wave-front parallelism • Loop-carried dependence at final stage (sub-image level): minimal communication overhead, full serialization
Parallel Recursive Filtering: Wave-front Parallelism • Drawback: partial serialization; non-optimal use of available CPUs
Parallel Recursive Filtering: Solution 3 • Multipartitioning: • Skewed cyclic block partitioning • Each CPU owns at least one tile in each of the distributed dimensions • All neighboring tiles in a particular direction are owned by the same CPU
Multipartitioning • Full Parallelism: • First in one direction… • And then in the other… • Border exchange at end of each sweep • Communication at end of sweep always with the same node (see the tile-assignment sketch below)
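A hedged sketch of how the skewed cyclic block partitioning can assign tiles to CPUs (one common formulation of multipartitioning; the exact assignment in Parallel-Horus may differ): with P CPUs the image is divided into P x P tiles, and tile (i, j) is owned by CPU (j - i) mod P, so every CPU owns one tile in each tile-row and each tile-column, and its neighbor in a fixed direction is always the same CPU:

    #include <cstdio>

    // Skewed cyclic block partitioning: tile (i, j) of a P x P tile grid is
    // owned by CPU (j - i) mod P.
    int tileOwner(int i, int j, int P) {
        return ((j - i) % P + P) % P;               // non-negative modulo
    }

    int main() {
        const int P = 4;                            // e.g. Processor 0 … Processor 3
        for (int i = 0; i < P; ++i) {
            for (int j = 0; j < P; ++j) std::printf("P%d ", tileOwner(i, j, P));
            std::printf("\n");
        }
        return 0;
    }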
Parallel-Horus: Parallelizable Patterns • Minimal intrusion: re-use as much as possible of the original sequential Horus library code • Parallelization localized in the code • Easy to implement extensions • Layered architecture: Horus Sequential API, Parallelizable Patterns, Parallel Extensions, MPI
Parallel-Horus: Pattern Implementations (old vs. new)

Old (sequential only):

    template<class …, class …, class …>
    inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
    {
        if (dst == 0) dst = CxArrayClone<DstArrayT>(src);
        CxFuncUpoDispatch(dst, src, upo);
        return dst;
    }

New (sequential + parallel):

    template<class …, class …, class …>
    inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
    {
        if (dst == 0) dst = CxArrayClone<DstArrayT>(src);
        if (!PxRunParallel()) {           // run sequential
            CxFuncUpoDispatch(dst, src, upo);
        } else {                          // run parallel
            PxArrayPreStateTransition(src, …, …);
            PxArrayPreStateTransition(dst, …, …);
            CxFuncUpoDispatch(dst, src, upo);
            PxArrayPostStateTransition(dst);
        }
        return dst;
    }
Parallel-Horus: Inter-operation Optimization • Lazy Parallelization: • Don’t do this: communicate (scatter / gather) around every single library call • Do this: keep data distributed across consecutive calls and communicate only when really needed • Avoid communication, decided on the fly!
Parallel-Horus: Distributed Image Data Structures • Distributed image data structure abstraction: a global structure (on CPU 0, the host) plus local structures (on the other CPUs) • 3-tuple: < state of global, state of local, distribution type > • state of global = { none, created, valid, invalid } • state of local = { none, valid, invalid } • distribution type = { none, partial, full, not-reduced } • 9 combinations are ‘legal’ states (e.g.: < valid, valid, partial >)
Lazy Parallelization: Finite State Machine • Communication operations serve as state transition functions between distributed data structure states • State transitions performed only when absolutely necessary • State transition functions allow correct conversion of legal sequential code to legal parallel code at all times • Nice features: • Requires no a priori knowledge of loops and branches • Can be done on the fly at run-time (with no measurable overhead)
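A hedged C++ sketch of how such a distributed state and its lazy transitions might be represented (names and the specific transition logic are illustrative, not the actual Parallel-Horus code):

    // Distributed image state: <state of global, state of local, distribution type>.
    enum class GlobalState { None, Created, Valid, Invalid };
    enum class LocalState  { None, Valid, Invalid };
    enum class DistType    { None, Partial, Full, NotReduced };

    struct DistributedImageState {
        GlobalState global = GlobalState::Valid;
        LocalState  local  = LocalState::None;
        DistType    dist   = DistType::None;
    };

    // Called before a data parallel operation: scatter only if the local
    // partial structures are not already valid (lazy parallelization).
    void preStateTransition(DistributedImageState& s) {
        if (s.local != LocalState::Valid || s.dist != DistType::Partial) {
            // ... scatter the global image over the CPUs here ...
            s.local = LocalState::Valid;
            s.dist  = DistType::Partial;
        }
    }

    // Called after the operation: the global structure is now out of date;
    // it is gathered again only when sequential code really needs it.
    void postStateTransition(DistributedImageState& s) {
        s.global = GlobalState::Invalid;
    }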
Application: Detection of Curvilinear Structures • Apply anisotropic Gaussian filter bank to input image • Maximum response when the filter is tuned to the line direction • Here 3 different implementations: (1) fixed filters applied to a rotating image, (2) rotating filters applied to a fixed input image, separable (UV), (3) rotating filters applied to a fixed input image, non-separable (2D) • Depending on parameter space: few minutes to several hours
Sequential = Parallel (user transparency)

IMPLEMENTATION 1:

    for all orientations theta {
        geometricOp ( inputIm, &rotatIm, -theta, LINEAR, 0, p, “rotate” );
        for all smoothing scales sy {
            for all differentiation scales sx {
                genConvolution ( filtIm1, mirrorBorder, “gauss”, sx, sy, 2, 0 );
                genConvolution ( filtIm2, mirrorBorder, “gauss”, sx, sy, 0, 0 );
                binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
                binaryPixOpC ( filtIm1, sx*sy, “mul” );
                binaryPixOpI ( contrIm, filtIm1, “max” );
            }
        }
        geometricOp ( contrIm, &backIm, theta, LINEAR, 0, p, “rotate” );
        binaryPixOpI ( resltIm, backIm, “max” );
    }

IMPLEMENTATIONS 2 and 3:

    for all orientations theta {
        for all smoothing scales sy {
            for all differentiation scales sx {
                genConvolution ( filtIm1, mirrorBorder, “func”, sx, sy, 2, 0 );
                genConvolution ( filtIm2, mirrorBorder, “func”, sx, sy, 0, 0 );
                binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
                binaryPixOpC ( filtIm1, sx*sy, “mul” );
                binaryPixOpI ( resltIm, filtIm1, “max” );
            }
        }
    }
Measurements on DAS-1 (Vrije Universiteit) • 512x512 image • 36 orientations • 8 anisotropic filters • So: part of the efficiency of parallel execution always remains in the hands of the application programmer!
Measurements on DAS-2 (Vrije Universiteit) • 512x512 image • 36 orientations • 8 anisotropic filters • So: lazy parallelization (or: optimization across library calls) is very important for high efficiency!
Intermediate Level Algorithms • Feature Vector: a labeled sequence of (scalar) values • Each (scalar) value represents an image data related property • Label: from user annotation or from automatic clustering • Example: the histogram of an image region can be approximated by a mathematical function, e.g. a Weibull distribution, with only 2 parameters ‘ß’ and ‘γ’ • Let’s call this region “FIRE”: its feature vector becomes <FIRE, ß=0.93, γ=0.13>
Annotation (low level) • Annotation for low level ‘visual words’ • Define N low level visual concepts (e.g. Sky, Road, USA Flag) • Assign concepts to image regions • For each region, calculate a feature vector: • <SKY, ß=0.93, γ=0.13, … > • <SKY, ß=0.91, γ=0.15, … > • <SKY, ß=0.97, γ=0.12, … > • <ROAD, ß=0.89, γ=0.09, … > • <USA FLAG, ß=0.99, γ=0.14, … > • Result: N human-defined ‘visual words’, each having multiple descriptions
Alternative: Clustering • Example: • Split the image into X regions, and obtain a feature vector for each • All feature vectors have a position in a high-dimensional space • A clustering algorithm is applied to obtain N clusters • => N non-human ‘visual words’, each with multiple descriptions (a minimal K-Means sketch follows below)
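A minimal K-Means sketch for this clustering step (plain C++, illustrative only: naive initialization, fixed number of iterations, squared Euclidean distance; assumes at least N feature vectors):

    #include <vector>
    #include <cstddef>

    using FeatureVec = std::vector<float>;

    static float sqDist(const FeatureVec& a, const FeatureVec& b) {
        float d = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) { float t = a[i] - b[i]; d += t * t; }
        return d;
    }

    // Cluster region feature vectors into N 'visual words'; returns the N cluster centers.
    std::vector<FeatureVec> kMeans(const std::vector<FeatureVec>& data, int N, int iters) {
        std::vector<FeatureVec> centers(data.begin(), data.begin() + N);  // naive init
        std::vector<int> assign(data.size(), 0);
        for (int it = 0; it < iters; ++it) {
            // Assignment step: each feature vector joins its nearest center.
            for (std::size_t i = 0; i < data.size(); ++i) {
                int best = 0;
                for (int c = 1; c < N; ++c)
                    if (sqDist(data[i], centers[c]) < sqDist(data[i], centers[best])) best = c;
                assign[i] = best;
            }
            // Update step: each center becomes the mean of its members.
            for (int c = 0; c < N; ++c) {
                FeatureVec mean(data[0].size(), 0.0f);
                int count = 0;
                for (std::size_t i = 0; i < data.size(); ++i)
                    if (assign[i] == c) {
                        for (std::size_t k = 0; k < mean.size(); ++k) mean[k] += data[i][k];
                        ++count;
                    }
                if (count > 0) {
                    for (float& v : mean) v /= count;
                    centers[c] = mean;
                }
            }
        }
        return centers;
    }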
Feature Vectors for Full Images • (1) Partition the image in regions • (2) Compute the similarity between each image region and each low level visual word • (3) Count the number of region-matches with each visual word, e.g.: 3 x ‘Sky’; 7 x ‘Grass’; 4 x ‘Road’; … • => This defines an accumulated feature vector for the full image (see the sketch below)
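The accumulated feature vector can then be computed by matching each region against the visual words, as in the following sketch (reusing the FeatureVec type and sqDist helper from the K-Means sketch above; illustrative only):

    // Accumulated feature vector for a full image: for each region feature
    // vector, find the most similar visual word and count the matches.
    std::vector<int> bagOfVisualWords(const std::vector<FeatureVec>& regions,
                                      const std::vector<FeatureVec>& visualWords) {
        std::vector<int> counts(visualWords.size(), 0);
        for (const FeatureVec& r : regions) {
            std::size_t best = 0;
            for (std::size_t w = 1; w < visualWords.size(); ++w)
                if (sqDist(r, visualWords[w]) < sqDist(r, visualWords[best])) best = w;
            ++counts[best];                // e.g. 3 x 'Sky', 7 x 'Grass', 4 x 'Road'
        }
        return counts;                     // the accumulated feature vector
    }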
Annotation (high level) • Annotation for high level ‘semantic concepts’ • Define M high level visual concepts, e.g.: ‘sports event’, ‘outdoors’, ‘airplane’, ‘president Bush’, ‘traffic situation’, ‘human interaction’, … • For all images in a known (training) set, assign all appropriate high level concepts (e.g. one image: outdoors, airplane, traffic situation; another: outdoors, traffic situation, president Bush, human interaction)