620 likes | 821 Views
HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs. Duksu Kim Jae-Pil Heo Jaehyuk Huh John Kim Sung- eui Yoon. http://sglab.kaist.ac.kr/HPCCD. Collision Detection (CD). Collision detection is widely used in various applications Games
E N D
HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs Duksu Kim Jae-Pil Heo Jaehyuk Huh John Kim Sung-eui Yoon http://sglab.kaist.ac.kr/HPCCD
Collision Detection (CD) • Collision detection is widely used in various applications • Games • Physically-based simulations • Robotics from AION from “Need for speed” from HUBO Lab. In KAIST
Motivations • Increasing demands for accurate and fast collision detection for deformable models • Model complexity continues to grow from Creative Assembly’s “Rome: Total war” Courtesy from Prof.Ko
Goals • Achieve interactive performance for exact collision detection between large-scale deformable models • E.g., deforming models consisting of tens or hundreds of thousand triangles <Cloth-ball, 94K triangles> <Breaking dragon, 252K triangles>
Discrete vs. Continuous • Discrete collision detection (DCD) • Detect collisions at each frame • Fast, but can miss collisions Miss collisions Frame1 Frame2
Discrete vs. Continuous • Discrete collision detection (DCD) • Continuous collision detection (CCD) • Identify the first time-of-contact (ToC) • Accurate, but requires a long computation time • Not widely used in interactive applications The first time-of-contact (ToC) Frame1 Frame2
Inter- and Self-Collisions • Inter-collisions • Collisions between two objects • Self-collisions • Collisions between two regions of a deformable object • Takes a long computation time to detect From Govindaraju’s paper
Parallel Computing Trends • Many core architectures • Multi-core CPU architectures • GPU architectures • Heterogeneous architectures • Intel’s Larabee and AMD’s Fusion • Designing parallel algorithms is important to utilize these parallel architectures
Main Contributions • A novel, hybrid parallel CCD method • Utilize both multi-core CPUs and GPUs • No locking in the main loop of CD • GPU-based exact CD between two triangles • High scalability • Interactive performance
Cloth Benchmark (94K triangles) • Test machine • A quad-core CPU • Two GPUs • Results • 10.4 X speed-up over using a CPU core • 23ms per frame (43 FPS)
Related Work • Algorithms specialized on certain types • Rigid objects [Reden et al. 2002] • Articulated bodies [Zhang et al. 2007] • Meshes with fixed topology [Govindaraju et al. 2005, Wong et al. 2005] • Efficient culling methods • Eliminates redundant elementary tests [Curtis et al. 2008] • Connectivity-based culling [Tang et al. 2008]
Related Work • GPU-based approaches • Visibility queries [Govindaraju et al. 2005] • Unified GPU-framework for proximity queries [Sud et al. 2006] • Multi-core CPU-based approaches • Voxel-based collision detection method [Lawlor and Laxmikant 2002] • Front based task decomposition method [Tang et al. 2009] Use only multi-core CPUs or GPUs and do not provide interactive performance for large-scale models
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results
BoundingVolumeHierarchies(BVHs) • Organize bounding volumes as a tree • Leaf nodes have triangles
BVH-based Collision Detection • BVH traversal BV overlap test A A X X B C Y Z Dequeue (A,X) • Collision test pair queue
BVH-based Collision Detection • BVH traversal BV overlap test A X B C Y Z Dequeue Refine Self-CD (B,Y) (B,Z) (C,Y) (C,Y) (B,C) (Y,Z) • Collision test pair queue
BVH-based Collision Detection • BVH traversal • Elementary tests • At leaf nodes, exact collision tests between triangles are done by solving cubic equations [Provot 1997]
Lazy BVH Update • Update BVs of visited nodes during BVH traversal • Improve the performance of CD for large-scale models No overlap A X B C Y Z B C Y Z BV updates are not required
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System overview • Results
Naïve Parallel Method A X B C Y Z • Collision test pair queue (B,Y), (B,Z), (C,Y), (C,Z) (B,C), (Y,Z) Thread 1 Thread 2
Low Scalability of Naïve Method • Two issues • Locking • Load-balancing Ideal Naïve 3.5
Two Issues • Locking • To apply the lazy BVH update method, the locking is required to write the BV updates for nodes • Locking serializes multiple threads and lowers the performance of parallel algorithms A X Y B C Y Z • Collision test pair queue (B,Y), (B,Z), (C,Y), (C,Z) (B,C), (Y,Z) Thread 1 Thread 2
Two Issues • Locking • Load-balancing among threads • High variance of the pair-based workload • Collision test pair queues for each thread Thread 1 Thread 2
Our CPU-based Parallel CCD • Inter-CD based task decomposition • Leads lock-free parallel algorithm in the main loop of parallel algorithm • Enables efficient dynamic load balancing among threads
Inter-CD based Task Decomposition • Combine all BVHs of objects into one BVH • Terminology • Inter-CD(N): Inter-collision detection between two sub-BVHs whose roots are two child nodes of the node N N
Inter-CD Based Task Decomposition • We represent all collision detection tasks as a set of Inter-CDs Inter-CD(N) N N D R D R D E F Self-CDs = Inter-CD(D) + Inter-CD(E) + Inter-CD(F) + etc.
Disjoint Property A B Inter-CD(A) Inter-CD(B) Accessed nodes are disjoint
Inter-CD based Serial CCD Task unit queue A A B C B C Thread D E D E Initialize Pop a node n Push children of the node n Process inter-CD(n) Pop a node n Push children of the node n Process Inter-CD(n)
Inter-CD based Serial CCD Task unit queue A A B C B C C Always satisfy the disjoint property Thread A D E D E Pop a node n Push children of the node n Process inter-CD(n) Pop a node n Push children of the node n Process Inter-CD(n)
Inter-CD based Parallel CCD • Initial task assignment n1 Each thread performs CCD without any locking while lazily update BVHs n2 n3 Front Satisfy the disjoint property n4 n5 n6 n7 n4 n5 n6 n7 T1 T2 T3 T4
Inter-CD based Dynamic Task Assignment • Require locking for scheduling queues, but no locking in the main loop of CD algorithm • Low locking overhead for task assignment Thread1 Thread 2 Task unit queue Task unit queue The front node likely to cause the highest computational workload Request T1 Scheduling queue Scheduling queue
Results of Our CPU-based Parallel CCD • Remove locking in the main loop of CD • Employ efficient dynamic load-balancing based on inter-CD task units Ideal Our method Naïve 7.1 3.5
Results of Our CPU-based Parallel CCD A core 8-cores 2.8 FPS 17.9 FPS We will further improve the performance of CCD by using GPUs <Cloth-ball, 94K triangles> 1.2 FPS 7.7 FPS <Breaking dragon, 252K triangles>
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results
Task Distribution BVH update • BVH traversal Elementary tests • Random accesses - Solving cubic equations CCD Elementary tests BVH update and traversal Streaming processors optimized with floating point operations - Branch prediction - Caching Multi-coreCPUs GPUs
Simple Interface between CPUs and GPUs • Each thread sends information of two triangles to GPUs • Two issues • Data transfer overhead • Each thread maintains its own GPU context T1 GPUs T2 T3 T4 A pair of two triangles (24 byte)
Asynchronous Data Transfer • Asynchronously send geometry during BVH update and traversal • Data transfer time is hidden T1 GPUs Geometry T2 T3 T4
Reduce the Data Size • At leaf nodes, only send two indices of triangles A pair of two triangles (24 byte) Two indices of triangles (8 byte) T1 GPUs T2 T3 T4
Reduce the DeviceContext Maintain Overhead • Use the master-slave model Index pairs Master GPUs Chunks Index pairs Results S1 Triangle index queue (TIQ) Index pairs S2 Index pairs S3 16 Kbytes ~256 Kbytes S4 Segmented queue
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results
Summary of HPCCD Slaves Master Elementary tests Index pairs • Inter-CD based parallel BVH update and traversal • Index pair queue Index pairs Index pairs Results Index pairs Geometry Geometry Index pairs Multi-coreCPUs GPUs
Outline • Background on CCD • Inter-CD based parallel CCD • GPU-based elementary tests • System summary • Results
Testing Environment • Machine • One quad-core CPU (Intel i7 CPU, 3.2 GHz ) • Two GPUs (NvidiaGeforce GTX285) • Run eight CPU threads by using Intel’s hyper threading technology
Breaking Dragon Benchmark • 252K triangles • Results • 12.5 X speed-up over using a CPU core • 54ms per frame (19 FPS)
Low-Resolution N-body Bench. • 34K triangles • Results • 11.4 X speed-up over using a CPU core • 6.8ms per frame (148 FPS)
High-Resolution N-body Bench. • 146K triangles • Results • 13.6 X speed-up over using a CPU core • 54ms per frame (19 FPS)
Results of HPCCD • As the number of GPUs is increased, we get higher performances
Limitation • Low scalability for small rigid models
Summary • A novel, hybrid parallel algorithm • Utilize both multi-core CPUs and GPUs • High scalability • About 13 times performance improvement for CCD by using a quad-core CPU and two GPUs compared with using a single CPU core • Interactive performance • Show 19-140 FPSfor various deformable models consisting of tens or hundreds of thousand triangles The implementation code will be available as OpenCCD library (http://sglab.kaist.ac.kr/OpenCCD)