fast-reduction
Fast-reduction implements fused GPU kernels for computing linear transformations with cross-entropy and entropy losses on NVIDIA Hopper (H100) architectures, reducing memory consumption from 56GB to 1.3GB while improving speed. It provides a systematic progression of implementations, from unfused PyTorch operations through increasingly optimized CuTe DSL variants, culminating in a megakernel whose combined forward and backward passes run 2x faster than the baseline. The project includes detailed performance benchmarks and error analysis, and is aimed at researchers and engineers optimizing large language model training on H100s.
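The memory gap comes from the unfused path materializing the full logits tensor before the loss reduces it. A back-of-envelope sketch of that scaling, in plain Python; the token count and vocabulary size below are assumptions for illustration, not the project's actual benchmark configuration:

```python
# Unfused linear + cross-entropy materializes a [tokens, vocab] logits
# tensor in HBM; the fused kernels avoid ever storing it.
tokens = 64 * 1024   # batch * sequence length (assumed)
vocab = 128_256      # vocabulary size (assumed, Llama-3-like)
bytes_fp32 = 4

logits_bytes = tokens * vocab * bytes_fp32
# The backward pass typically holds a same-shaped gradient as well,
# roughly doubling the peak footprint.
total_gib = 2 * logits_bytes / 2**30

print(f"logits alone:  {logits_bytes / 2**30:.1f} GiB")
print(f"with gradient: {total_gib:.1f} GiB")
```

A fused kernel instead computes the reduction tile-by-tile, so only per-tile partial results ever leave registers, which is where the orders-of-magnitude memory saving comes from.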
- ✓ Achieves a 400x precision improvement by reading the fp32 accumulator directly from registers instead of the truncated bf16 copy in HBM
- ✓ Comprehensive benchmark ladder with detailed error decomposition, showing the epilogue contributes only 0.3% of total error
- ✓ Megakernel implementation delivers best-in-class performance (243ms forward+backward) with superior gradient accuracy (0.0009 MAE)
- → Add an automated test suite with correctness verification and performance regression tests across different tensor sizes
- → Include installation instructions, dependency requirements, and minimal usage examples in the README
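The register-vs-HBM precision point above can be illustrated without a GPU. Below is a minimal pure-Python sketch that simulates bf16 rounding in software and compares a high-precision running sum (standing in for the fp32 register accumulator) against a sum that round-trips through bf16 at every step (standing in for reading truncated values back from HBM). The `to_bf16` helper and the toy summation are illustrative assumptions, not the project's code:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 precision (keep 8 exponent + 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the discarded low 16 bits, then zero them.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Sum many small terms. Python floats are fp64, so acc_hi only approximates
# an fp32 register accumulator, but the contrast with bf16 is the point.
n, term = 100_000, 1e-3
acc_hi = 0.0
acc_bf16 = 0.0
for _ in range(n):
    acc_hi += term                                 # stays in high precision
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(term))   # truncated every step

exact = 100.0
print(f"high-precision accumulator error: {abs(acc_hi - exact):.2e}")
print(f"bf16 round-trip accumulator error: {abs(acc_bf16 - exact):.2e}")
```

Once the bf16 running sum grows a few hundred times larger than each term, new terms fall below half an ulp and the sum stagnates entirely; keeping the accumulator in registers at full precision avoids this, which is the mechanism behind the large accuracy gap the benchmark reports.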