HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs

HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs Duksu Kim Jae-Pil Heo Jaehyuk Huh John Kim Sung-eui Yoon http://sglab.kaist.ac.kr/HPCCD

Collision Detection (CD) • Collision detection is widely used in various applications • Games • Physically-based simulations • Robotics from AION from “Need for speed” from HUBO Lab. In KAIST

Motivations • Increasing demands for accurate and fast collision detection for deformable models • Model complexity continues to grow from Creative Assembly’s “Rome: Total war” Courtesy from Prof.Ko

Goals • Achieve interactive performance for exact collision detection between large-scale deformable models • E.g., deforming models consisting of tens or hundreds of thousand triangles <Cloth-ball, 94K triangles> <Breaking dragon, 252K triangles>

Discrete vs. Continuous • Discrete collision detection (DCD) • Detect collisions at each frame • Fast, but can miss collisions Miss collisions Frame1 Frame2

Discrete vs. Continuous • Discrete collision detection (DCD) • Continuous collision detection (CCD) • Identify the first time-of-contact (ToC) • Accurate, but requires a long computation time • Not widely used in interactive applications The first time-of-contact (ToC) Frame1 Frame2

Inter- and Self-Collisions • Inter-collisions • Collisions between two objects • Self-collisions • Collisions between two regions of a deformable object • Takes a long computation time to detect From Govindaraju’s paper

Parallel Computing Trends • Many core architectures • Multi-core CPU architectures • GPU architectures • Heterogeneous architectures • Intel’s Larabee and AMD’s Fusion • Designing parallel algorithms is important to utilize these parallel architectures

Main Contributions • A novel, hybrid parallel CCD method • Utilize both multi-core CPUs and GPUs • No locking in the main loop of CD • GPU-based exact CD between two triangles • High scalability • Interactive performance

Cloth Benchmark (94K triangles) • Test machine • A quad-core CPU • Two GPUs • Results • 10.4 X speed-up over using a CPU core • 23ms per frame (43 FPS)

Related Work • Algorithms specialized on certain types • Rigid objects [Reden et al. 2002] • Articulated bodies [Zhang et al. 2007] • Meshes with fixed topology [Govindaraju et al. 2005, Wong et al. 2005] • Efficient culling methods • Eliminates redundant elementary tests [Curtis et al. 2008] • Connectivity-based culling [Tang et al. 2008]

Related Work • GPU-based approaches • Visibility queries [Govindaraju et al. 2005] • Unified GPU-framework for proximity queries [Sud et al. 2006] • Multi-core CPU-based approaches • Voxel-based collision detection method [Lawlor and Laxmikant 2002] • Front based task decomposition method [Tang et al. 2009] Use only multi-core CPUs or GPUs and do not provide interactive performance for large-scale models

Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results

BoundingVolumeHierarchies(BVHs) • Organize bounding volumes as a tree • Leaf nodes have triangles

BVH-based Collision Detection • BVH traversal BV overlap test A A X X B C Y Z Dequeue (A,X) • Collision test pair queue

BVH-based Collision Detection • BVH traversal BV overlap test A X B C Y Z Dequeue Refine Self-CD (B,Y) (B,Z) (C,Y) (C,Y) (B,C) (Y,Z) • Collision test pair queue

BVH-based Collision Detection • BVH traversal • Elementary tests • At leaf nodes, exact collision tests between triangles are done by solving cubic equations [Provot 1997]

Lazy BVH Update • Update BVs of visited nodes during BVH traversal • Improve the performance of CD for large-scale models No overlap A X B C Y Z B C Y Z BV updates are not required

Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System overview • Results

Naïve Parallel Method A X B C Y Z • Collision test pair queue (B,Y), (B,Z), (C,Y), (C,Z) (B,C), (Y,Z) Thread 1 Thread 2

Low Scalability of Naïve Method • Two issues • Locking • Load-balancing Ideal Naïve 3.5

Two Issues • Locking • To apply the lazy BVH update method, the locking is required to write the BV updates for nodes • Locking serializes multiple threads and lowers the performance of parallel algorithms A X Y B C Y Z • Collision test pair queue (B,Y), (B,Z), (C,Y), (C,Z) (B,C), (Y,Z) Thread 1 Thread 2

Two Issues • Locking • Load-balancing among threads • High variance of the pair-based workload • Collision test pair queues for each thread Thread 1 Thread 2

Our CPU-based Parallel CCD • Inter-CD based task decomposition • Leads lock-free parallel algorithm in the main loop of parallel algorithm • Enables efficient dynamic load balancing among threads

Inter-CD based Task Decomposition • Combine all BVHs of objects into one BVH • Terminology • Inter-CD(N): Inter-collision detection between two sub-BVHs whose roots are two child nodes of the node N N

Inter-CD Based Task Decomposition • We represent all collision detection tasks as a set of Inter-CDs Inter-CD(N) N N D R D R D E F Self-CDs = Inter-CD(D) + Inter-CD(E) + Inter-CD(F) + etc.

Disjoint Property A B Inter-CD(A) Inter-CD(B) Accessed nodes are disjoint

Inter-CD based Serial CCD Task unit queue A A B C B C Thread D E D E Initialize Pop a node n Push children of the node n Process inter-CD(n) Pop a node n Push children of the node n Process Inter-CD(n)

Inter-CD based Serial CCD Task unit queue A A B C B C C Always satisfy the disjoint property Thread A D E D E Pop a node n Push children of the node n Process inter-CD(n) Pop a node n Push children of the node n Process Inter-CD(n)

Inter-CD based Parallel CCD • Initial task assignment n1 Each thread performs CCD without any locking while lazily update BVHs n2 n3 Front Satisfy the disjoint property n4 n5 n6 n7 n4 n5 n6 n7 T1 T2 T3 T4

Inter-CD based Dynamic Task Assignment • Require locking for scheduling queues, but no locking in the main loop of CD algorithm • Low locking overhead for task assignment Thread1 Thread 2 Task unit queue Task unit queue The front node likely to cause the highest computational workload Request T1 Scheduling queue Scheduling queue

Results of Our CPU-based Parallel CCD • Remove locking in the main loop of CD • Employ efficient dynamic load-balancing based on inter-CD task units Ideal Our method Naïve 7.1 3.5

Results of Our CPU-based Parallel CCD A core 8-cores 2.8 FPS 17.9 FPS We will further improve the performance of CCD by using GPUs <Cloth-ball, 94K triangles> 1.2 FPS 7.7 FPS <Breaking dragon, 252K triangles>

Task Distribution BVH update • BVH traversal Elementary tests • Random accesses - Solving cubic equations CCD Elementary tests BVH update and traversal Streaming processors optimized with floating point operations - Branch prediction - Caching Multi-coreCPUs GPUs

Simple Interface between CPUs and GPUs • Each thread sends information of two triangles to GPUs • Two issues • Data transfer overhead • Each thread maintains its own GPU context T1 GPUs T2 T3 T4 A pair of two triangles (24 byte)

Asynchronous Data Transfer • Asynchronously send geometry during BVH update and traversal • Data transfer time is hidden T1 GPUs Geometry T2 T3 T4

Reduce the Data Size • At leaf nodes, only send two indices of triangles A pair of two triangles (24 byte) Two indices of triangles (8 byte) T1 GPUs T2 T3 T4

Reduce the DeviceContext Maintain Overhead • Use the master-slave model Index pairs Master GPUs Chunks Index pairs Results S1 Triangle index queue (TIQ) Index pairs S2 Index pairs S3 16 Kbytes ~256 Kbytes S4 Segmented queue

Summary of HPCCD Slaves Master Elementary tests Index pairs • Inter-CD based parallel BVH update and traversal • Index pair queue Index pairs Index pairs Results Index pairs Geometry Geometry Index pairs Multi-coreCPUs GPUs

Testing Environment • Machine • One quad-core CPU (Intel i7 CPU, 3.2 GHz ) • Two GPUs (NvidiaGeforce GTX285) • Run eight CPU threads by using Intel’s hyper threading technology

Breaking Dragon Benchmark • 252K triangles • Results • 12.5 X speed-up over using a CPU core • 54ms per frame (19 FPS)

Low-Resolution N-body Bench. • 34K triangles • Results • 11.4 X speed-up over using a CPU core • 6.8ms per frame (148 FPS)

High-Resolution N-body Bench. • 146K triangles • Results • 13.6 X speed-up over using a CPU core • 54ms per frame (19 FPS)

Results of HPCCD • As the number of GPUs is increased, we get higher performances

Limitation • Low scalability for small rigid models

Summary • A novel, hybrid parallel algorithm • Utilize both multi-core CPUs and GPUs • High scalability • About 13 times performance improvement for CCD by using a quad-core CPU and two GPUs compared with using a single CPU core • Interactive performance • Show 19-140 FPSfor various deformable models consisting of tens or hundreds of thousand triangles The implementation code will be available as OpenCCD library (http://sglab.kaist.ac.kr/OpenCCD)

HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs