Felix Pinkstone
February 13, 2025 18:01
Nvidia has used the DeepSeek-R1 model with inference-time scaling to improve GPU kernel generation, boosting AI model performance by managing computational resources efficiently during inference.
In a significant advancement in AI model efficiency, Nvidia has demonstrated a technique called inference-time scaling, powered by the DeepSeek-R1 model. The method is used to optimize GPU kernel generation and, according to Nvidia, improves performance by carefully allocating computational resources during inference.
The role of inference-time scaling
Inference-time scaling, also known as test-time scaling or long thinking, allows an AI model to evaluate multiple potential outcomes and select the best one. This approach mirrors human problem-solving and enables more strategic, systematic solutions to complex problems.
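For intuition, here is a minimal sketch of inference-time scaling as best-of-N sampling: the model spends extra compute generating several candidate answers, a verifier scores each one, and the best candidate is kept. The generate_candidate and score_candidate functions are hypothetical mock stand-ins, not Nvidia's or DeepSeek's actual code.

```python
import random

# Minimal sketch of inference-time scaling as best-of-N sampling.
# generate_candidate and score_candidate are hypothetical placeholders
# for a model's sampling call and a verifier, not NVIDIA's actual code.

def generate_candidate(prompt: str, seed: int) -> str:
    """Placeholder for one sampled model output."""
    random.seed(seed)
    return f"{prompt} -> candidate #{seed} (quality {random.random():.2f})"

def score_candidate(candidate: str) -> float:
    """Placeholder verifier: extract the mock quality score."""
    return float(candidate.rsplit(" ", 1)[-1].rstrip(")"))

def best_of_n(prompt: str, n: int = 8) -> str:
    # Spend more compute at inference time: sample n candidates,
    # score each one, and keep the highest-scoring answer.
    candidates = [generate_candidate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score_candidate)

print(best_of_n("optimize this attention kernel"))
```

The key design choice is that quality comes from searching over more candidates at inference time rather than from retraining the model.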
In Nvidia’s latest experiment, engineers used the DeepSeek-R1 model with additional computing power during inference to automatically generate GPU attention kernels. The resulting kernels were numerically accurate and optimized for a variety of attention variants without explicit programming, in some cases outperforming those created by experienced engineers.
Challenges in optimizing attention kernels
Attention mechanisms are central to the development of large language models (LLMs), allowing AI to focus selectively on the most relevant input segments, which improves predictions and reveals hidden patterns in data. However, the computational demand of attention operations grows quadratically with input sequence length, making optimized GPU kernel implementations essential to avoid runtime errors and improve computational efficiency.
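To illustrate why optimized kernels matter, the NumPy sketch below implements plain scaled dot-product attention; the (seq_len x seq_len) score matrix it builds is the source of the quadratic growth in compute and memory. It is a reference illustration, not the optimized GPU kernel described in the article.

```python
import numpy as np

# Illustrative scaled dot-product attention in NumPy. The (seq_len x seq_len)
# score matrix is what makes compute and memory grow quadratically with
# sequence length, which is why optimized GPU kernels matter.

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # shape: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq_len, d_model = 1024, 64
q = np.random.randn(seq_len, d_model)
k = np.random.randn(seq_len, d_model)
v = np.random.randn(seq_len, d_model)
print(attention(q, k, v).shape)              # (1024, 64)
# Doubling seq_len roughly quadruples the size of the score matrix.
```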
Attention variants such as causal attention and relative positional embeddings further complicate kernel optimization. Multimodal models such as vision transformers introduce additional complexity, requiring specialized attention mechanisms to preserve spatial information.
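As one concrete example of such a variant, the sketch below adds a causal mask to the same reference attention, so each position can attend only to itself and earlier positions. Again, this is illustrative NumPy code, not production kernel code.

```python
import numpy as np

# Causal (autoregressive) attention: mask out future positions before the
# softmax. Variants like this, and relative positional embeddings, change
# the kernel that must be implemented and optimized.

def causal_attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)   # future positions
    scores = np.where(mask, -np.inf, scores)               # mask them out
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = k = v = np.random.randn(8, 4)
print(causal_attention(q, k, v).shape)                     # (8, 4)
```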
Innovative workflow using DeepSeek-R1
Nvidia engineers developed a new workflow with DeepSeek-R1 that pairs the model with a verifier in a closed loop during inference. The process begins with a manual prompt that generates the initial GPU code, followed by analysis and iterative improvement guided by the verifier’s feedback.
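A rough sketch of such a closed loop is shown below. The llm_generate and verify_kernel callables are assumed placeholders for a DeepSeek-R1 completion call and a correctness checker; the prompt text and loop structure are illustrative, not Nvidia's actual workflow.

```python
from typing import Callable, Optional, Tuple

def generate_kernel(
    task: str,
    llm_generate: Callable[[str], str],                 # e.g. a DeepSeek-R1 call (placeholder)
    verify_kernel: Callable[[str], Tuple[bool, str]],   # build + numerical check (placeholder)
    max_rounds: int = 10,
) -> Optional[str]:
    """Closed-loop generation: prompt, verify, feed errors back, repeat."""
    prompt = f"Write a correct, optimized GPU attention kernel for: {task}"
    for _ in range(max_rounds):
        kernel = llm_generate(prompt)            # candidate kernel from the model
        ok, feedback = verify_kernel(kernel)     # numerical correctness check
        if ok:
            return kernel                        # verified kernel found
        # Append the verifier's critique so the next attempt can correct it.
        prompt += f"\n\nPrevious attempt failed verification:\n{feedback}\nPlease fix it."
    return None                                  # no verified kernel within the budget
```

The loop simply trades more inference-time compute (more rounds of generation and verification) for higher-quality, verified output.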
This method achieved 100% numerical correctness on Level-1 problems and 96% on Level-2 problems, significantly improving attention kernel generation as measured by Stanford’s KernelBench benchmark.
Future outlook
The introduction of inference-time scaling with DeepSeek-R1 shows promising advances in GPU kernel generation. Although initial results are encouraging, continued research and development will be essential to achieving consistently superior results across a broader range of problems.
For developers and researchers interested in exploring the technology further, the DeepSeek-R1 NIM Microservice is now available on Nvidia’s build platform.