Collision Detection Design & Final Project Topic

Brandon Smith • November 5, 2008 • ME 964

contact_data Allocation • Possible ways to allocate the contact_data array: • Allocate contact_data[ N(N-1)/2 ] (one slot per possible pair) • Allocate contact_data[ n_contacts ] (one slot per actual contact) • To avoid creating a huge array, I chose the second method, which takes two kernel calls: • 1st kernel call: find the number of contacts. • 2nd kernel call: calculate the contact_data for each contact.
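The count-then-fill idea can be sketched on the host side as follows. This is a minimal serial Python sketch, not the CUDA kernels from the slides: the function names `in_contact` and `two_pass_contacts`, the `(x, y, z, r)` tuple layout, and the choice to count touching spheres as contacts are all assumptions made for illustration.

```python
from itertools import combinations

def in_contact(a, b):
    """Spheres are (x, y, z, r) tuples (assumed layout).
    Assumption: touching spheres (distance == r_a + r_b) count as contacts."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    return dx * dx + dy * dy + dz * dz <= (a[3] + b[3]) ** 2

def two_pass_contacts(spheres):
    pairs = list(combinations(range(len(spheres)), 2))  # N(N-1)/2 tests
    # Pass 1: only count the contacts.
    n_contacts = sum(1 for i, j in pairs if in_contact(spheres[i], spheres[j]))
    # Allocate exactly n_contacts slots instead of N(N-1)/2.
    contact_data = [None] * n_contacts
    # Pass 2: repeat the tests and fill in the data for each contact.
    slot = 0
    for i, j in pairs:
        if in_contact(spheres[i], spheres[j]):
            contact_data[slot] = (i, j)
            slot += 1
    return contact_data

spheres = [(0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, 1.0), (10.0, 0.0, 0.0, 1.0)]
contacts = two_pass_contacts(spheres)
```

The trade-off is that every pair is tested twice in exchange for an output array sized to the actual number of contacts.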

Kernel Call Setup • The total number of contact tests is: n_tests = N(N-1)/2 • The total number of concurrent threads is: n_concurrent_threads = N_SMs * BLOCKS_PER_SM * THREADS_PER_BLOCK • Each thread will perform several tests: tests_per_thread = n_tests / n_concurrent_threads + 1 (integer division)

Collide Kernel: Indexing • Given the block number bx and thread number tx, a range of test numbers (ki, kf) is generated: thread_id = bx*THREADS_PER_BLOCK + tx; ki = tests_per_thread*thread_id + 1; kf = ki + tests_per_thread - 1; • Given a test number k, the indices (i, j) with i < j can be calculated from: • $k = \frac{(j-1)^2 - (j-1)}{2} + i$ • $k \le \frac{j^2 - j}{2}$
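One way to invert the relation between k and (i, j) is sketched below in Python. It assumes 1-based k, i, and j with i < j, and it uses an integer square root to find j. The helper names `pair_from_test` and `thread_range`, the clamp of kf to n_tests, and the small values of N and the thread count are illustrative assumptions, not taken from the slides.

```python
from math import isqrt

def pair_from_test(k):
    """Map a 1-based test number k to a 1-based pair (i, j) with i < j,
    using k = ((j-1)^2 - (j-1))/2 + i and k <= (j^2 - j)/2."""
    m = (isqrt(8 * k - 7) + 1) // 2       # m = j - 1
    j = m + 1
    i = k - (j - 1) * (j - 2) // 2
    return i, j

def thread_range(thread_id, tests_per_thread):
    """Range of test numbers (ki, kf) handled by one thread."""
    ki = tests_per_thread * thread_id + 1
    kf = ki + tests_per_thread - 1
    return ki, kf

N = 5                                      # number of bodies (example value)
n_tests = N * (N - 1) // 2                 # 10 tests
n_concurrent_threads = 4                   # stand-in for N_SMs * BLOCKS_PER_SM * THREADS_PER_BLOCK
tests_per_thread = n_tests // n_concurrent_threads + 1

covered = []
for thread_id in range(n_concurrent_threads):
    ki, kf = thread_range(thread_id, tests_per_thread)
    # Assumption: test numbers past n_tests are skipped by the last threads.
    for k in range(ki, min(kf, n_tests) + 1):
        covered.append(pair_from_test(k))
```

Because tests_per_thread is rounded up, the last thread's range can run past n_tests, so some bounds check of this kind is needed.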

Collide Kernel: Contact Testing • The __global__ function calls a __device__ test function to actually perform the contact test • In the first pass it simply tests for contact • In the second pass it calculates contact_data. • atomicAdd is used to count the number of contacts • Keeps one contact tally for all concurrent threads • No need for condensation of results from each thread • Hassle to compile: nvcc.exe -ccbin "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -c -arch sm_11 -D_CONSOLE -Xcompiler "/EHsc /W3 /nologo /Wp64 /O2 /Zi /MT " -I"C:\CUDA\include" -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -o Release\collide.obj collide.cu

Final Project: Monte Carlo Radiation Transport • Objective: • Compute radiation flux or derived quantities over a spatial/temporal domain. • Method: • Follow the life of individual particles through the domain. • Quality of Results: • Statistical error is proportional to 1/sqrt(n_particles) • Difficult to get even particle distribution across the domain • Many particles are required to achieve low statistical error

Example: Fusion Reactor Shielding • The GPU Advantage: • Increase the number of simulated particles • Decrease statistical error

Tasks during a Particle’s Life • Birth: particles are created at a source • Ray-cast: the distance to the next surface is calculated • Collision: the particle interacts with matter • Next volume: the particle crosses a boundary into another material • Death: if the particle is absorbed, it is killed.
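The life cycle above can be sketched as a loop over one particle history. The Python sketch below uses a deliberately simplified one-dimensional, two-layer slab in which particles move only left or right. The `LAYERS` table, its cross-sections and absorption probabilities, and the function names are hypothetical values chosen for illustration, not data from the Fortran code.

```python
import math
import random

# Hypothetical layers: (x_min, x_max, total cross-section, absorption probability)
LAYERS = [(0.0, 1.0, 1.0, 0.3), (1.0, 2.0, 2.0, 0.6)]

def follow_particle(rng):
    """Follow one particle; returns 'absorbed', 'transmitted' or 'reflected'."""
    # Birth: the source sits on the left face and emits to the right.
    x, direction, layer = 0.0, 1.0, 0
    while True:
        x_min, x_max, sigma_t, p_abs = LAYERS[layer]
        # Ray-cast: distance to the next surface along the flight direction.
        to_surface = (x_max - x) if direction > 0 else (x - x_min)
        # Distance to the next collision, sampled from an exponential law.
        to_collision = -math.log(1.0 - rng.random()) / sigma_t
        if to_collision < to_surface:
            # Collision: the particle interacts with matter.
            x += direction * to_collision
            if rng.random() < p_abs:
                return "absorbed"                      # Death
            direction = rng.choice((-1.0, 1.0))        # Scatter
        else:
            # Next volume: cross the boundary into the neighboring layer.
            layer += 1 if direction > 0 else -1
            if layer < 0:
                return "reflected"
            if layer >= len(LAYERS):
                return "transmitted"
            x = LAYERS[layer][0] if direction > 0 else LAYERS[layer][1]

def run(n_particles, seed=0):
    rng = random.Random(seed)
    tally = {"absorbed": 0, "transmitted": 0, "reflected": 0}
    for _ in range(n_particles):
        tally[follow_particle(rng)] += 1
    return tally

tally = run(2000)
```

Each history is independent of the others, which is what makes one-thread-per-particle parallelism attractive.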

Existing Fortran Code • Geometry: • 3-D geometry supporting boxes and spheres • Physics: • Only neutral particles (neutrons, photons) • No energy dependence • No time dependence • Materials: • Simple materials (only a few isotopes) • Sources: • point, line, area, volume • Results: • mesh tallies and volume tallies

Potential for Parallelism • Usually we can assume each particle is independent, unless: • criticality, weight windows, etc… • Each thread could calculate independent particle trajectories • embarrassingly parallel • When enough particles are simulated, condense the results from each thread

Implementation Challenges • Current code is in Fortran 90 • ~1700 lines • Has anyone tried F2C? • Designed for Fortran 77 • Particles are tracked on a large mesh • ~1 M mesh elements, accessed once per particle • Mesh will need to be in global memory • Mesh will be accessed with an atomic function for data sharing? • Ensure that random numbers are not repeated • Use a pseudo-random number generator for each thread • Each thread will need a different random seed • Check to ensure a sufficiently large stride • Could schedule rendezvous to check for solution convergence • Stop the simulation once the statistical error falls below a set value (e.g., 5%)
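The per-thread generators and the rendezvous-based stopping rule might look like the Python sketch below. It is serial code in which each "thread" is just a separately seeded `random.Random` instance. The slab problem, the seed scheme `1000 + thread_id`, the batch size, and the binomial relative-error estimate are all assumptions for illustration. Note that distinct seeds alone do not prove the streams never overlap, which is the stride concern raised above.

```python
import math
import random

SIGMA_T = 1.0      # hypothetical total cross-section
THICKNESS = 2.0    # hypothetical slab thickness

def run_batch(rng, n_particles):
    """Count particles whose first collision lies beyond the slab."""
    transmitted = 0
    for _ in range(n_particles):
        distance = -math.log(1.0 - rng.random()) / SIGMA_T
        if distance > THICKNESS:
            transmitted += 1
    return transmitted

def simulate(n_threads=4, batch_size=500, target_rel_err=0.05, max_rendezvous=100):
    # One generator per thread, each with a different seed.
    rngs = [random.Random(1000 + thread_id) for thread_id in range(n_threads)]
    transmitted = total = 0
    rel_err = float("inf")
    for _ in range(max_rendezvous):
        # Rendezvous: condense the tallies from every thread.
        transmitted += sum(run_batch(rng, batch_size) for rng in rngs)
        total += n_threads * batch_size
        if transmitted > 0:
            p = transmitted / total
            rel_err = math.sqrt((1.0 - p) / (p * total))   # binomial estimate
        if rel_err < target_rel_err:
            break                                          # converged
    return transmitted / total, rel_err, total

estimate, rel_err, n_used = simulate()
```

The relative error shrinks as 1/sqrt(total), so halving the target error takes roughly four times as many particles.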

ME 964: Project Proposal • Vikalp Mishra

Collision Detection • Aim • Solve the collision detection problem for N rigid spheres in 3D space • Approach • Brute force • Compare each sphere with every other sphere • O(n^2) • If the distance between centers is • more than the sum of radii → no collision • less than the sum of radii → collision • When a collision is detected • compute the contact normal and the object IDs
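A single pair test with its output record could be sketched like this in Python. The function name `sphere_contact`, the `(x, y, z, r)` layout, the direction of the normal (from the first sphere toward the second), and the handling of the exactly-touching and concentric cases are assumptions, since the slide does not specify them.

```python
import math

def sphere_contact(id_a, a, id_b, b):
    """Return (id_a, id_b, unit_normal) if the spheres collide, else None.
    Spheres are (x, y, z, r) tuples (assumed layout)."""
    dx, dy, dz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Assumption: exactly touching spheres are reported as "no collision".
    if dist >= a[3] + b[3]:
        return None
    if dist == 0.0:
        # Concentric spheres: the normal is undefined, pick an arbitrary one.
        return (id_a, id_b, (1.0, 0.0, 0.0))
    # Normal points from sphere A toward sphere B (assumed convention).
    return (id_a, id_b, (dx / dist, dy / dist, dz / dist))

hit = sphere_contact(0, (0.0, 0.0, 0.0, 1.0), 1, (1.5, 0.0, 0.0, 1.0))
miss = sphere_contact(0, (0.0, 0.0, 0.0, 1.0), 2, (3.0, 0.0, 0.0, 1.0))
```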

Final Project: Bone FEA • Title: • GPU based Finite Element Analysis of Femur • Femur • Thigh bone: Bone between hip and knee joint • Longest/ strongest bone in the body

Why study the femur? • To better understand bone mechanics and properties • Across species • To understand the impact and extent of injury under various loading conditions • Use in sports medicine and surgery • To study the impact of DNA changes on bone formation and growth • Improve the process of cloning to develop better species • To study the effect of the nutrition cycle on bone development

Background • In the past • Experiments were done to study bone behavior / material properties • Tests performed • Fracture test • Bending test • Torsion test • Experiments on mouse / pig bones • Costly and time consuming • Only one experiment per sample possible • Alternative • Capture bone geometry and material properties • Use computational tools for various analyses • Saves time and money

Typical approach • Given: • CT scan data of bone (geometry) • Material property distribution • Loading scheme • 3 or 4 point loading / Torsion test / Bending test

Use of FEA • Use Finite Element Method • To capture geometry • Physical properties • Hexahedral elements • Tetrahedral elements • Formulate FE problem • Use boundary conditions to define element level • stiffness matrix ($K_e$) • load vector ($F_e$) • Assemble elements in global matrix ($K_g$, $F_g$) • Solve FE problem • Obtain deflection ($u = K_g^{-1} F_g$) • Compare with experimental results • Verify model
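The element-assemble-solve pipeline can be illustrated on a much simpler problem than a femur mesh. The Python sketch below swaps in a 1D axial bar with two-node elements in place of hexahedral or tetrahedral elements. The element stiffness formula EA/L, the numeric values, and the Gaussian-elimination solver (used instead of forming the inverse explicitly) are assumptions made to keep the example self-contained.

```python
def element_stiffness(ea, length):
    """Ke for a two-node axial bar element (1D stand-in for a solid element)."""
    k = ea / length
    return [[k, -k], [-k, k]]

def assemble(n_elements, ea, length):
    """Add every Ke into the global matrix Kg at its node indices."""
    n_nodes = n_elements + 1
    kg = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_elements):
        ke = element_stiffness(ea, length)   # independent per element
        dofs = (e, e + 1)
        for a in range(2):
            for b in range(2):
                kg[dofs[a]][dofs[b]] += ke[a][b]
    return kg

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

n_elements, ea, length, tip_load = 4, 100.0, 0.25, 10.0
kg = assemble(n_elements, ea, length)
fg = [0.0] * (n_elements + 1)
fg[-1] = tip_load
# Boundary condition: node 0 is fixed, so drop its row and column.
k_reduced = [row[1:] for row in kg[1:]]
u = [0.0] + solve(k_reduced, fg[1:])
```

For this bar the tip deflection should match the closed-form value F·L_total/(EA) = 0.1, which gives a quick check on the assembly.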

Bottleneck • Bone geometry is complex • Large number of elements required • For pig bone ~ 0.5 – 1 million elements (coarse mesh)

GPU based approach • Potential for GPU based computation • Same set of computations for each element • Stiffness matrix computation ($K_e$) • Load vector computation ($F_e$) • Different data sets for each element • SIMD • Approach • Use GPU for element level computation • Accounts for 67% of total time • Use CPU for global matrix inversion • Compare results with MATLAB based model

ME 964 – Midterm and Final Projects Saigopal Nelaturi

CUDA Collision Detection • Problem – Given n spheres in 3D space, compute all pairwise collisions • Approach – Brute force algorithm with quadratic complexity • Idea – Every pair of spheres can be tested independently, and in parallel

Task Parallelism – pseudo code

Final Project • Constructive operators in SE(3) • SE(3) is the group of 4x4 rigid transformation matrices • Point in SE(3) = matrix • Set in SE(3) = set of matrices • Can devise operators using Boolean algebra and matrix multiplication (group operation)

Example • How to compute the workspace? • Position + orientation of a coordinate frame on the coupler • Use set formulation in SE(3) – intersection of sets • Embarrassingly parallel process! • Many other applications in design / geometric modeling / motion planning …

Goals • For very large sets of 4x4 transformation matrices, implement • Intersection – pairwise comparison between matrices • Convolution – pairwise multiplication between matrices • Show some workspace computations (hopefully in 3D) • If possible, implement • Deconvolution – combination of pairwise intersection/multiplication
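As a rough illustration of the two main operators, the Python sketch below treats a set in SE(3) as a list of 4x4 matrices stored as nested tuples. The function names, the element-wise tolerance used to decide that two matrices are "the same", and the pure-translation test data are assumptions, not definitions from the slides.

```python
def matmul4(a, b):
    """Product of two 4x4 matrices (the group operation)."""
    return tuple(
        tuple(sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4))
        for r in range(4)
    )

def translation(x, y, z):
    """Rigid transformation with identity rotation and translation (x, y, z)."""
    return ((1.0, 0.0, 0.0, x), (0.0, 1.0, 0.0, y), (0.0, 0.0, 1.0, z), (0.0, 0.0, 0.0, 1.0))

def close(a, b, tol=1e-9):
    """Pairwise comparison: every entry agrees within tol (assumed criterion)."""
    return all(abs(a[r][c] - b[r][c]) <= tol for r in range(4) for c in range(4))

def convolve(set_a, set_b):
    """Pairwise multiplication of every matrix in set_a with every matrix in set_b."""
    return [matmul4(a, b) for a in set_a for b in set_b]

def intersect(set_a, set_b, tol=1e-9):
    """Matrices of set_a that also appear (within tol) in set_b."""
    return [a for a in set_a if any(close(a, b, tol) for b in set_b)]

set_a = [translation(0.0, 0.0, 0.0), translation(1.0, 0.0, 0.0)]
set_b = [translation(1.0, 0.0, 0.0), translation(2.0, 0.0, 0.0)]
products = convolve(set_a, set_b)
common = intersect(set_a, set_b)
```

Both operators are double loops over independent pairs, which is where the embarrassingly parallel structure comes from.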

Midterm Project Ram Subramanian

The Task • Solve a collision detection problem: given an arbitrary number of rigid spheres with known radii, distributed in 3D space, find out which spheres are in contact with or penetrating which other spheres.

The Algorithm • One pass over the array to determine collisions. • One pass over all the collided bodies to compute the required collision values. • Two kernel calls. • n(n-1)/2 pairwise tests, i.e. O(n^2)

Indexing • Every thread gets a reference body (Body A) and a comparison body (Body B). • Each block has 512 threads (assumption 1). • Each row in a grid has 512 blocks (assumption 2). • Total number of threads is n(n-1)/2. • Compute an index value from the thread ID and block ID. • From this index value and the number of bodies, the indices of Body A and Body B can be determined using div and mod.

Final Project - Image Processing on the GPU • Goal – Implement image processing algorithms for the GPU. Eventually have an image processing library for GPUs using CUDA. • Motivation – Most image processing tasks involve operating on individual pixels or a region of the image. Many of these tasks are embarrassingly parallel.

Proposed Implementations • Harris Corner Detector • Motivation – This algorithm is used in the first processing stage of many other image processing and computer vision algorithms (e.g., 3D reconstruction, scene stitching, object tracking, visual servoing, etc.) • Ambitious goal – Implement an image stitching or 3D reconstruction algorithm that stitches two images together using the Harris Corner Detector.

Harris Corner Detector • At every pixel in the image, place a window W (the larger the better, e.g. 5x5) • Assume either a 4- or 8-neighborhood of the current pixel position • Slide the window to each neighboring pixel, giving W1, W2, …, Wi (where i = 4 or 8)

Harris Corner Detector (cont'd) • Compute the sum of squared differences (SSD) between W and each Wi • A corner is detected when all SSD values are above a threshold set by the user (equivalently, when the smallest SSD value is above the threshold).
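The window-shifting test can be sketched in pure Python as below. The sketch follows the usual convention that a corner is a pixel where shifting the window in every direction changes the image content, so the smallest SSD is large. The 3x3 window, the 4-neighborhood, the synthetic test image, and the function names are illustrative assumptions.

```python
SHIFTS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # 4-neighborhood

def ssd(img, r, c, dr, dc, half):
    """Sum of squared differences between window W at (r, c) and W shifted by (dr, dc)."""
    total = 0
    for y in range(-half, half + 1):
        for x in range(-half, half + 1):
            d = img[r + y][c + x] - img[r + y + dr][c + x + dc]
            total += d * d
    return total

def corner_response(img, r, c, half=1):
    """Smallest SSD over all shifts; large only if every shift changes the window."""
    return min(ssd(img, r, c, dr, dc, half) for dr, dc in SHIFTS_4)

def detect_corners(img, threshold, half=1):
    rows, cols = len(img), len(img[0])
    margin = half + 1                      # keep shifted windows inside the image
    return [
        (r, c)
        for r in range(margin, rows - margin)
        for c in range(margin, cols - margin)
        if corner_response(img, r, c, half) > threshold
    ]

# Synthetic 8x8 image: a bright square in the lower-right quadrant.
image = [[1 if r >= 4 and c >= 4 else 0 for c in range(8)] for r in range(8)]
corners = detect_corners(image, threshold=1)
```

Each pixel's response depends only on a small neighborhood, so one GPU thread per pixel is a natural mapping.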

Midterm and Final Projects Toby Heyn ME 964 11/06/08

Midterm Project • Spatial Subdivision • Partition space into uniform grid (cells) • For each object, determine which cells the object overlaps • Objects can only collide if they occupy the same cell or adjacent cells

Midterm Project • Construct Cell ID Array • Each thread determines the cell IDs of the cells its sphere occupies, loads into Cell ID Array • Sort Cell ID Array • Radix Sort Algorithm • Create Collision Cell List • Scan sorted Cell ID Array, look for changes in cell ID • Write Collision Cell List with Cell ID Array indices, number of objects in the cell • Traverse Collision Cell List • One thread per Collision Cell • Each thread checks all collision pairs in the Collision Cell • Collisions are written to output
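The four stages can be sketched serially in Python as follows. This is a simplified illustration: each sphere is registered in every cell its bounding box overlaps, Python's built-in sort stands in for the radix sort, and a set removes pairs reported from more than one cell. Those choices, the tuple-valued cell IDs, and all names are assumptions rather than details from the slides.

```python
from itertools import combinations, product
from math import floor

def cells_overlapped(sphere, cell_size):
    """All grid cells touched by the sphere's bounding box; sphere is (x, y, z, r)."""
    x, y, z, r = sphere
    ranges = [
        range(floor((c - r) / cell_size), floor((c + r) / cell_size) + 1)
        for c in (x, y, z)
    ]
    return list(product(*ranges))

def find_collisions(spheres, cell_size):
    # 1. Construct the Cell ID Array: one (cell_id, object) entry per overlap.
    cell_ids = [
        (cell, idx)
        for idx, sphere in enumerate(spheres)
        for cell in cells_overlapped(sphere, cell_size)
    ]
    # 2. Sort by cell ID (stand-in for the radix sort).
    cell_ids.sort()
    # 3. Scan for changes in cell ID to build the Collision Cell List.
    cell_list = []                          # (start index, number of objects)
    start = 0
    for pos in range(1, len(cell_ids) + 1):
        if pos == len(cell_ids) or cell_ids[pos][0] != cell_ids[start][0]:
            if pos - start > 1:             # only cells with 2+ objects matter
                cell_list.append((start, pos - start))
            start = pos
    # 4. Traverse: one "thread" per collision cell checks all pairs in it.
    hits = set()
    for start, count in cell_list:
        objs = [cell_ids[p][1] for p in range(start, start + count)]
        for i, j in combinations(objs, 2):
            a, b = spheres[i], spheres[j]
            d2 = sum((a[k] - b[k]) ** 2 for k in range(3))
            if d2 < (a[3] + b[3]) ** 2:
                hits.add((min(i, j), max(i, j)))
    return sorted(hits)

spheres = [(0.5, 0.5, 0.5, 0.4), (1.1, 0.5, 0.5, 0.4), (5.0, 5.0, 5.0, 0.4)]
collisions = find_collisions(spheres, cell_size=1.0)
```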

Midterm Project • Radix Sort • Sorts cell IDs in several passes • Sorts low order bits before higher order bits, retaining order of IDs with same cell ID • This helps in a later step • Takes 4 passes to sort the 32 bit (4 byte) integers • Makes use of parallel scan operation
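A serial Python sketch of a least-significant-digit radix sort is shown below. Four passes over 32-bit keys implies 8 bits per pass, and that digit width, the `(cell_id, payload)` record layout, and the function name are assumptions. The prefix sum over the digit counts is the step a parallel scan would compute on the GPU.

```python
def radix_sort_32(items):
    """Stable LSD radix sort of (cell_id, payload) pairs with 32-bit cell IDs."""
    items = list(items)
    for pass_index in range(4):               # 4 passes x 8 bits = 32 bits
        shift = 8 * pass_index
        # Histogram of the current 8-bit digit.
        counts = [0] * 256
        for cell_id, _ in items:
            counts[(cell_id >> shift) & 0xFF] += 1
        # Exclusive prefix sum (scan) gives each digit's first output slot.
        offsets = [0] * 256
        running = 0
        for digit in range(256):
            offsets[digit] = running
            running += counts[digit]
        # Scatter in input order, which keeps equal digits in their old order.
        output = [None] * len(items)
        for item in items:
            digit = (item[0] >> shift) & 0xFF
            output[offsets[digit]] = item
            offsets[digit] += 1
        items = output
    return items

data = [(0x01020304, "a"), (5, "b"), (0x01020304, "c"), (70000, "d"), (5, "e")]
sorted_data = radix_sort_32(data)
```

Stability is what lets the later passes preserve the ordering established by the earlier ones, and it also keeps entries with the same cell ID in their original relative order.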

Final Project • Default final project – granular dynamics using collision detection from the midterm • Incorporate midterm collision detection into the Chrono::Engine multibody dynamics engine • Simulate a Mars Rover with many (millions of) bodies

Final Project • Chrono::Engine • C++ API • Commands for creating simulation environment, populating with bodies, creating constraints, etc • Uses Bullet for collision detection • Has been used to solve systems with ~100,000 bodies • Has a CUDA parallelized dynamics solver (based on LCP formulation)

Final Project • Each wheel is a union of primitives • Terrain consists of ~5000 spheres (much too coarse) • Obstacles: • Non spherical bodies in wheels • Large mass difference between small grain and large rover

Final Project • Handling non-spherical bodies • Represent the surface of the body as a composite of smaller spheres • New representation has more bodies, but only spheres • Maintain same dimensions, mass, inertia properties

Final Project • Parallelism • Collision detection • Many bodies/collision pairs to check • Spatial sub-division: geometric decomposition, task decomposition • Dynamics • Many equations of motion to solve • Geometric decomposition • Potentially many non-spherical bodies to process in parallel

Final Project • Remaining Issues • Re-use of data • After solving the collision detection problem once, can data be reused to reduce the size of the problem to be solved in subsequent steps? • Automate handling of non-spherical geometry • Can an automated method be created to represent arbitrary geometry with spheres?

ME 964 Midterm & Final Project Justin Madsen

Outline • Midterm & final are the same project • “default scheme” • Collision detection method • Baraff • Brief overview of the two-phase algorithm • Ideas for CUDA implementation • Ideas for final project • Integrating CUDA collision detection with other dynamics programs

Efficient collision detection • Baraff method • Axis-aligned bounding boxes (AABB) • Simple yet efficient • Only dealing with spheres • Can be extended to convex polyhedra • (actually don’t need bounding boxes for spheres, it’s a special case) • Figure 1. AABB size and orientation depend on the local coordinate system

Overview of method • One dimensional case (x-axis) • Sort & Sweep • Each object has a length along the axis according to the AABB • Data: beginning and end values (b and e) of each box • Sorted lowest to highest according to these values Figure 2. Six objects and their AABB axes [1]

Determine possible contacts • After sorting, collision detection happens in two phases • Phase 1: broad phase • Traverse the axis; add objects to the “possible contact list” when b_i is encountered • For the one dimensional case, when b_i is added to the list, it means contact occurs with all other objects in the list
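The one-dimensional sweep can be sketched in Python as below. The slide only describes what happens when a begin value b_i is encountered. Removing an object from the list when its end value e_i is reached, and treating intervals that merely touch as overlapping, are assumptions added to complete the example, as are the names used.

```python
def sweep_1d(intervals):
    """intervals[obj] = (b, e). Returns the sorted list of overlapping pairs."""
    events = []
    for obj, (b, e) in enumerate(intervals):
        events.append((b, 0, obj))    # 0 marks a begin value b_i
        events.append((e, 1, obj))    # 1 marks an end value e_i
    # Sort lowest to highest; at equal values, begins come before ends.
    events.sort()
    active = []                       # the current "possible contact list"
    pairs = []
    for _, kind, obj in events:
        if kind == 0:
            # b_i encountered: object overlaps everything already in the list.
            pairs.extend((min(other, obj), max(other, obj)) for other in active)
            active.append(obj)
        else:
            # e_i encountered: the object leaves the list (assumed step).
            active.remove(obj)
    return sorted(pairs)

intervals = [(0.0, 2.0), (1.0, 3.0), (4.0, 5.0), (2.5, 4.5)]
overlaps_found = sweep_1d(intervals)
```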

Three dimensional case • Phase 1 for 3-D: • Extend one dimensional contact check by checking b and e for values along the y and z axes of the other objects in the list • If contact check comes back positive for all 3 axes, add the object to the “possible contact list” • Possible because…
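One possible reading of the three-axis check is sketched below in Python: objects are swept along x, and each x-overlapping candidate is kept only if its b and e values also overlap along y and z. The box layout `((bx, ex), (by, ey), (bz, ez))`, the early exit based on the sorted begin values, and the names are assumptions, not details from the slides.

```python
def overlaps(lo_a, hi_a, lo_b, hi_b):
    """True if the intervals [lo_a, hi_a] and [lo_b, hi_b] share a point."""
    return lo_a <= hi_b and lo_b <= hi_a

def broad_phase_3d(boxes):
    """boxes[obj] = ((bx, ex), (by, ey), (bz, ez)). Returns possible contact pairs."""
    # Sort objects by their begin value along x.
    order = sorted(range(len(boxes)), key=lambda obj: boxes[obj][0][0])
    possible = []
    for pos, a in enumerate(order):
        for b in order[pos + 1:]:
            if boxes[b][0][0] > boxes[a][0][1]:
                break    # later boxes begin even further right: no x overlap with a
            # x overlaps; require overlap along y and z as well.
            if all(overlaps(*boxes[a][axis], *boxes[b][axis]) for axis in (1, 2)):
                possible.append((min(a, b), max(a, b)))
    return sorted(possible)

boxes = [
    ((0.0, 2.0), (0.0, 2.0), (0.0, 2.0)),
    ((1.0, 3.0), (1.0, 3.0), (1.0, 3.0)),
    ((1.0, 3.0), (5.0, 6.0), (0.0, 2.0)),
    ((10.0, 11.0), (0.0, 1.0), (0.0, 1.0)),
]
possible_contacts = broad_phase_3d(boxes)
```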
