Together AI Launches FlashAttention-3, Achieves 75% GPU Utilisation with NVIDIA H100
In a bid to enhance the efficiency of Transformer architectures, Together AI has introduced FlashAttention-3, building upon its predecessors’ success in accelerating attention mechanisms.
Developed through a collaboration between researchers at Colfax Research, Meta, NVIDIA, Princeton University, and Together AI, FlashAttention-3 incorporates novel approaches to maximize GPU performance on the latest Hopper architecture.
The core innovation lies in leveraging asynchrony and low-precision computing to expedite attention operations.
By overlapping computation and data movement through warp specialisation, and by interleaving block-wise matrix multiplication and softmax operations, FlashAttention-3 achieves up to 75% utilisation of the H100 GPU’s theoretical peak FLOPS. This is a significant improvement over FlashAttention-2, which reaches only around 35% utilisation on the same hardware.
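To put those utilisation figures in perspective, a back-of-the-envelope calculation helps. The sketch below assumes NVIDIA's published dense FP16/BF16 tensor-core peak of roughly 989 TFLOPS for the H100 SXM; actual peaks vary by SKU and clock speed.

```python
# Rough utilisation arithmetic for the H100. The peak figure is
# NVIDIA's published dense FP16/BF16 tensor-core number (no sparsity).
H100_PEAK_TFLOPS_FP16 = 989

fa3_utilisation = 0.75  # FlashAttention-3, per the announcement
fa2_utilisation = 0.35  # FlashAttention-2 on the same GPU

fa3_tflops = H100_PEAK_TFLOPS_FP16 * fa3_utilisation
fa2_tflops = H100_PEAK_TFLOPS_FP16 * fa2_utilisation

print(f"FlashAttention-3: ~{fa3_tflops:.0f} TFLOPS")  # ~742 TFLOPS
print(f"FlashAttention-2: ~{fa2_tflops:.0f} TFLOPS")  # ~346 TFLOPS
print(f"Speedup from utilisation alone: {fa3_tflops / fa2_tflops:.1f}x")  # ~2.1x
```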
One of the standout features of FlashAttention-3 is its support for FP8 precision, which enables throughput of nearly 1.2 PFLOPS while maintaining competitive accuracy. This not only speeds up processing, running up to 2 times faster than FlashAttention-2, but also reduces the memory footprint, potentially lowering operational costs for large-scale AI deployments.
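The memory saving from FP8 is straightforward to quantify: each element occupies one byte instead of the two needed for FP16, halving the size of the attention inputs. The shapes below are illustrative values, not figures from the paper.

```python
# Illustrative memory footprint of Q, K, V for one attention layer.
batch, seqlen, nheads, headdim = 4, 8192, 32, 128
elements = 3 * batch * seqlen * nheads * headdim  # Q, K and V combined

fp16_bytes = elements * 2  # 2 bytes per FP16 element
fp8_bytes = elements * 1   # 1 byte per FP8 element

print(f"FP16: {fp16_bytes / 2**30:.2f} GiB")  # 0.75 GiB
print(f"FP8:  {fp8_bytes / 2**30:.2f} GiB")   # 0.38 GiB
```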
Moreover, the implementation of FlashAttention-3 facilitates the handling of longer contextual inputs in large language models (LLMs), crucial for applications demanding extensive text comprehension and generation capabilities.
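For readers who want to experiment, the sketch below calls attention through the flash-attn Python package using its long-standing flash_attn_func entry point. The FlashAttention-3 beta for Hopper GPUs ships a similar function, though its exact import path may differ, so treat this as indicative rather than the definitive FlashAttention-3 API. It requires a CUDA GPU with the package installed.

```python
# Minimal usage sketch of the flash-attn package's public API.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 8192, 16, 128

# The kernels expect half-precision tensors on the GPU,
# shaped (batch, seqlen, nheads, headdim).
q = torch.randn(batch, seqlen, nheads, headdim,
                device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal=True applies the masking used in decoder-only LLMs.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```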
By minimizing memory reads and writes through optimized tiling and softmax rescaling, the algorithm runs up to 4 times faster than standard attention implementations.
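The tiling-and-rescaling idea is worth seeing concretely. The NumPy sketch below computes exact attention one key/value tile at a time, maintaining a running row maximum and softmax denominator so the full score matrix is never materialised; the real kernels do the same bookkeeping on-chip in SRAM. This is a pedagogical model, not the production implementation.

```python
# Tiled attention with online softmax rescaling (single head).
import numpy as np

def tiled_attention(q, k, v, block=128):
    seqlen, d = k.shape
    out = np.zeros_like(q)
    row_max = np.full(q.shape[0], -np.inf)  # running max per query row
    row_sum = np.zeros(q.shape[0])          # running softmax denominator

    for start in range(0, seqlen, block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) / np.sqrt(d)         # scores for this tile only

        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)   # rescale earlier accumulators
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Check against the naive formulation.
rng = np.random.default_rng(0)
q = rng.standard_normal((256, 64))
k = rng.standard_normal((512, 64))
v = rng.standard_normal((512, 64))
s = (q @ k.T) / np.sqrt(64)
ref = np.exp(s - s.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```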
The introduction of new hardware features on Hopper GPUs, such as WGMMA (Warpgroup Matrix Multiply-Accumulate) and TMA (Tensor Memory Accelerator), further amplifies FlashAttention-3’s performance gains. These features enable asynchronous data transfer and higher-throughput matrix multiplication, ensuring that GEMM and softmax operations proceed concurrently, thus maximizing computational throughput.
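Warp specialisation itself only exists inside a GPU kernel, but the producer-consumer structure it creates can be sketched conceptually: one worker streams tiles (the role the TMA plays) while another computes on tiles already staged (the role of the WGMMA warpgroups). The Python threads below are an analogy for that overlap, not a model of how the CUDA kernel is written.

```python
# Conceptual producer-consumer analogue of compute/data-movement overlap.
import queue
import threading
import numpy as np

tiles = queue.Queue(maxsize=2)  # bounded buffer, like a double buffer
N_TILES = 8
SENTINEL = None

def producer():
    # Stands in for asynchronous TMA copies from HBM into shared memory.
    for _ in range(N_TILES):
        tiles.put(np.random.rand(128, 128))
    tiles.put(SENTINEL)

def consumer():
    # Stands in for the compute warpgroups consuming staged tiles.
    acc = np.zeros((128, 128))
    while (tile := tiles.get()) is not SENTINEL:
        acc += tile @ tile.T  # "GEMM" on the current tile
    return acc

t = threading.Thread(target=producer)
t.start()
result = consumer()
t.join()
print(result.shape)  # (128, 128)
```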
Overall, FlashAttention-3 represents a significant leap forward in optimizing attention mechanisms for Transformer-based architectures. Its adoption promises not only enhanced computational efficiency and cost-effectiveness but also broader capabilities in handling complex AI tasks requiring extended contextual analysis. The release of FlashAttention-3 underscores ongoing efforts to push the boundaries of AI model training and inference, catering to the growing demands of modern computing applications.
For more detailed insights and technical specifications, the research paper and implementation are available on GitHub, offering a comprehensive overview of the methodology and benchmarks achieved by FlashAttention-3.