Accelerating CUDA C++ Applications with Multiple GPUs

Discover how to write CUDA C++ applications that efficiently and correctly utilize all available GPUs in a single node, dramatically improving the performance of applications and making the most cost-effective use of systems with multiple GPUs.

8 hours of instruction

OBJECTIVES

  1. Use concurrent CUDA Streams to overlap memory transfers with GPU computation
  2. Scale workloads across all available GPUs on a single node
  3. Combine the use of copy/compute overlap with multiple GPUs
  4. Use the NVIDIA Nsight Systems timeline to identify improvement opportunities and observe the impact of the techniques covered in the workshop

PREREQUISITES

None

SYLLABUS & TOPICS COVERED

  1. Introduction and Using JupyterLab
    • Meet the instructor and get familiar with your GPU-accelerated interactive JupyterLab environment
  2. Application Overview
    • Orient yourself with a single-GPU CUDA C++ application that will be the starting point for the course
    • Observe the current performance of the single-GPU CUDA C++ application using Nsight Systems
  3. Introduction to CUDA Streams
    • Learn the rules that govern concurrent CUDA stream behavior
    • Use multiple CUDA streams to perform concurrent host-to-device and device-to-host memory transfers
    • Utilize multiple CUDA streams for launching GPU kernels
    • Observe multiple streams in the Nsight Systems timeline view
  4. Copy/Compute Overlap with CUDA Streams
    • Learn the key concepts for effectively performing copy/compute overlap
    • Explore robust indexing strategies for the flexible use of copy/compute overlap in applications
    • Refactor the single-GPU CUDA C++ application to perform copy/compute overlap
    • See copy/compute overlap in the Nsight Systems timeline
  5. Multiple GPUs with CUDA C++
    • Learn the key concepts for effectively using multiple GPUs on a single node with CUDA C++
    • Explore robust indexing strategies for the flexible use of multiple GPUs in applications
    • Refactor the single-GPU CUDA C++ application to utilize multiple GPUs
    • See multiple GPU utilization in the Nsight Systems timeline
  6. Copy/Compute Overlap with Multiple GPUs
    • Learn the key concepts for effectively performing copy/compute overlap on multiple GPUs
    • Explore robust indexing strategies for the flexible use of copy/compute overlap on multiple GPUs
    • Refactor the single-GPU CUDA C++ application to perform copy/compute overlap on multiple GPUs
    • Observe performance benefits for copy/compute overlap on multiple GPUs
    • See copy/compute over
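
The syllabus mentions "robust indexing strategies" several times without describing them. As a rough illustration only, and not the course's own material, the host-side sketch below shows one common way to split an array of `n` elements first across GPUs and then across streams on each GPU. It uses ceiling division so that a trailing partial chunk is not dropped, and it clamps each range so that no chunk runs past the end of the data. The names `Chunk`, `ceil_div`, and `partition` are hypothetical. In a real CUDA program each resulting chunk would typically drive a `cudaSetDevice` call, a `cudaMemcpyAsync`, and a kernel launch on the matching stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type; the name and fields are illustrative only.
struct Chunk {
    int gpu;             // index of the device that owns this chunk
    int stream;          // index of the stream on that device
    std::size_t offset;  // first element of the chunk in the full array
    std::size_t count;   // number of elements in the chunk (0 if past the end)
};

// Ceiling division: rounds up so a trailing partial chunk is still covered.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Split n elements across num_gpus devices, then across streams_per_gpu
// streams on each device. Every range is clamped to the end of its parent
// range, so n does not have to divide evenly.
std::vector<Chunk> partition(std::size_t n, int num_gpus, int streams_per_gpu) {
    assert(num_gpus > 0 && streams_per_gpu > 0);
    const std::size_t gpus = static_cast<std::size_t>(num_gpus);
    const std::size_t streams = static_cast<std::size_t>(streams_per_gpu);
    const std::size_t gpu_chunk = ceil_div(n, gpus);
    const std::size_t stream_chunk = ceil_div(gpu_chunk, streams);

    std::vector<Chunk> chunks;
    for (std::size_t g = 0; g < gpus; ++g) {
        const std::size_t gpu_lo = std::min(n, g * gpu_chunk);
        const std::size_t gpu_hi = std::min(n, gpu_lo + gpu_chunk);
        for (std::size_t s = 0; s < streams; ++s) {
            const std::size_t lo = std::min(gpu_hi, gpu_lo + s * stream_chunk);
            const std::size_t hi = std::min(gpu_hi, lo + stream_chunk);
            chunks.push_back({static_cast<int>(g), static_cast<int>(s), lo, hi - lo});
        }
    }
    return chunks;
}
```

For example, `partition(10, 2, 2)` yields four chunks of 3, 2, 3, and 2 elements. With three GPUs the last stream of the last GPU receives an empty chunk, which the caller can simply skip.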

SOFTWARE REQUIREMENTS

Each participant will be provided with dedicated access to a fully configured, GPU-accelerated workstation in the cloud.

About Instructor

DataSociety