Jog Hiller
March 14th, 2025 02:22
NVIDIA’s latest NCCL 2.24 release introduces new features to enhance multi-GPU and multi-node communication, including a RAS subsystem, NIC Fusion, and FP8 support, to optimize deep learning training.
The NVIDIA Collective Communications Library (NCCL) has released version 2.24, bringing significant advances in the reliability and observability of networking for multi-GPU, multi-node (MGMN) communication. As reported on the NVIDIA Developer Blog, the library is optimized specifically for NVIDIA GPUs and networking, making it an essential component of multi-GPU deep learning training.
New features for NCCL 2.24
This update includes several new features aimed at improving performance and reliability:
Reliability, Availability, and Serviceability (RAS) subsystem
User Buffer (UB) registration for multi-node collectives
Optional receive completions
NIC Fusion
FP8 support
Strict enforcement of NCCL_ALGO and NCCL_PROTO
RAS Subsystem
The RAS subsystem is one of the standout additions in NCCL 2.24. It is designed to help users diagnose application issues such as crashes and hangs, especially in large-scale deployments. This low-overhead infrastructure provides a global view of application execution, allowing the detection of anomalies such as unresponsive nodes and delayed processes. It works by creating a network of threads across the NCCL processes that monitor each other’s health through regular keepalive messages.
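RAS requires no changes to application code: its monitoring threads are created inside the NCCL processes themselves once communicators are initialized. As a rough illustration, any program along the lines of the following single-process, multi-GPU sketch (standard ncclCommInitAll and CUDA calls; error handling trimmed) is already covered by RAS as soon as it brings up its communicators.

```c
// Minimal sketch: a single process driving all local GPUs.
// RAS needs no application changes; its monitoring threads start inside
// the NCCL process when communicators are initialized.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_NCCL(cmd) do {                               \
  ncclResult_t r = (cmd);                                  \
  if (r != ncclSuccess) {                                  \
    fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); \
    exit(1);                                               \
  }                                                        \
} while (0)

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  ncclComm_t* comms = malloc(ndev * sizeof(ncclComm_t));
  // One communicator per local GPU; RAS monitoring begins alongside them.
  CHECK_NCCL(ncclCommInitAll(comms, ndev, NULL));

  // ... launch collectives here; if a rank hangs or a node becomes
  // unresponsive, the RAS keepalives make the anomaly visible ...

  for (int i = 0; i < ndev; i++) CHECK_NCCL(ncclCommDestroy(comms[i]));
  free(comms);
  return 0;
}
```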
Enhancement of user buffer registration
NCCL 2.24 introduces user buffer (UB) registration for multi-node collectives, enabling more efficient data transfers and reduced GPU resource consumption. The library supports UB registration for collective networks with multiple ranks per node as well as standard peer-to-peer networks, which provides significant performance improvements, especially for operations such as AllGather and Broadcast.
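For context, the sketch below shows how user buffers are registered against a communicator using NCCL's existing ncclMemAlloc/ncclCommRegister API. The surrounding setup (the communicator, rank count, and CUDA stream) is assumed to exist elsewhere, and error handling is omitted for brevity.

```c
// Sketch of user buffer (UB) registration with an existing communicator.
// Assumes comm (ncclComm_t), nranks, and stream are set up elsewhere.
#include <cuda_runtime.h>
#include <nccl.h>

void allgather_with_registered_buffers(ncclComm_t comm, int nranks,
                                       cudaStream_t stream, size_t count) {
  void *sendbuf, *recvbuf, *sendhandle, *recvhandle;

  // Allocate buffers through NCCL so they meet UB registration requirements.
  ncclMemAlloc(&sendbuf, count * sizeof(float));
  ncclMemAlloc(&recvbuf, (size_t)nranks * count * sizeof(float));

  // Register the buffers with the communicator; eligible collectives such
  // as AllGather can then use zero-copy paths that reduce GPU resource use.
  ncclCommRegister(comm, sendbuf, count * sizeof(float), &sendhandle);
  ncclCommRegister(comm, recvbuf, (size_t)nranks * count * sizeof(float),
                   &recvhandle);

  ncclAllGather(sendbuf, recvbuf, count, ncclFloat, comm, stream);
  cudaStreamSynchronize(stream);

  // Deregister and free once the buffers are no longer needed.
  ncclCommDeregister(comm, sendhandle);
  ncclCommDeregister(comm, recvhandle);
  ncclMemFree(sendbuf);
  ncclMemFree(recvbuf);
}
```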
NIC Fusion
NCCL has adapted to optimize network communication as systems with many NICs become more common. The new NIC Fusion feature allows multiple NICs to be logically merged into a single entity, ensuring efficient use of network resources. It is particularly beneficial for systems with more than one NIC per GPU, addressing issues such as crashes and inefficient resource allocation.
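As a hedged sketch, NIC Fusion behavior is steered through environment variables set before NCCL initialization. The names NCCL_NET_MERGE_LEVEL and NCCL_NET_FORCE_MERGE below follow the 2.24 release notes, but both the names and the accepted values should be confirmed against the documentation of the installed NCCL version.

```c
// Sketch: configuring NIC Fusion via environment variables before NCCL init.
// NCCL_NET_MERGE_LEVEL / NCCL_NET_FORCE_MERGE are taken from the 2.24
// release notes; treat them as assumptions and verify against your version.
#include <stdlib.h>
#include <nccl.h>

void configure_nic_fusion(void) {
  // Merge NICs that sit close together in the PCIe topology. The value
  // "PORT" is illustrative; the set of accepted levels may differ by release.
  setenv("NCCL_NET_MERGE_LEVEL", "PORT", 1);

  // Alternatively, force a specific fusion layout by naming devices
  // explicitly (device names and syntax here are illustrative only):
  // setenv("NCCL_NET_FORCE_MERGE", "mlx5_0,mlx5_1;mlx5_2,mlx5_3", 1);

  // These must be set before the first communicator is created,
  // i.e. before ncclCommInitRank()/ncclCommInitAll().
}
```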
Additional features and fixes
This update also introduces optional receive completions for the LL and LL128 protocols, reducing overhead and congestion. NCCL 2.24 supports native FP8 reductions on NVIDIA Hopper and newer architectures, expanding low-precision capability. Additionally, stricter enforcement of NCCL_ALGO and NCCL_PROTO is implemented to give users more accurate tuning and error handling.
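The sketch below illustrates both behaviors: pinning NCCL_ALGO and NCCL_PROTO (long-standing environment variables whose values 2.24 now validates strictly instead of silently ignoring) and issuing an FP8 AllReduce. The ncclFloat8e4m3 type name follows the 2.24 additions and should be checked against the installed nccl.h; FP8 reductions require Hopper or newer GPUs.

```c
// Sketch: strict NCCL_ALGO/NCCL_PROTO tuning and an FP8 reduction.
// ncclFloat8e4m3 is assumed to match the 2.24 header; verify locally.
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

void pin_tuning(void) {
  // Must be set before NCCL initialization; with 2.24, invalid or
  // unsupported combinations fail loudly rather than being ignored.
  setenv("NCCL_ALGO", "Ring", 1);
  setenv("NCCL_PROTO", "Simple", 1);
}

void fp8_allreduce(void* buf, size_t count, ncclComm_t comm,
                   cudaStream_t stream) {
  // In-place sum reduction over FP8 (e4m3) data; Hopper or newer required.
  ncclAllReduce(buf, buf, count, ncclFloat8e4m3, ncclSum, comm, stream);
}
```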
The update also includes a variety of bug fixes and minor improvements, such as PAT tuning adjustments and expanded memory allocation functions, which increase the overall robustness and efficiency of the NCCL library.