
fast-reduction

Cherished

Fast-reduction implements fused GPU kernels for computing linear transformations with cross-entropy and entropy losses on NVIDIA Hopper architectures, reducing memory consumption from 56GB to 1.3GB while improving computational speed. It provides a progression of implementations from unfused PyTorch operations through increasingly optimized CuTe DSL variants, culminating in a megakernel that runs both the forward and backward passes 2x faster than baseline approaches. Designed for researchers and engineers optimizing large language model training on H100s.
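The memory win comes from never materializing the full logits matrix: a fused (or chunked) linear-plus-cross-entropy computes the loss from one slice of logits at a time. Below is a minimal NumPy sketch of that idea, not the repo's CUDA/CuTe code; the function name, shapes, and chunk size are illustrative assumptions.

```python
import numpy as np

def chunked_linear_cross_entropy(x, w, targets, chunk=128):
    """Mean cross-entropy of softmax(x @ w) against integer targets.

    Processes `chunk` rows at a time, so only a (chunk, V) slice of the
    logits exists at once instead of the full (N, V) matrix -- the same
    memory-saving idea the fused kernels exploit on-chip.
    """
    n = x.shape[0]
    total = 0.0
    for i in range(0, n, chunk):
        logits = x[i:i + chunk] @ w                    # (chunk, V) slice only
        m = logits.max(axis=1, keepdims=True)          # stabilize log-sum-exp
        lse = m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1))
        total += (lse - logits[np.arange(logits.shape[0]),
                               targets[i:i + chunk]]).sum()
    return total / n
```

With vocabulary sizes in the 100k+ range, the (N, V) logits tensor dominates memory in the unfused PyTorch path; chunking (and, in the kernels, fusing the matmul with the reduction) trades that for a small working set.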

Submitted April 19, 2026
Clauded With Love Rating
8.5 / 10

Fast-reduction implements highly optimized GPU kernels for linear transformations with cross-entropy and entropy losses on NVIDIA H100s, reducing memory usage from 56GB to 1.3GB while achieving 2x speedup. The project provides a systematic progression from unfused PyTorch operations to a fused megakernel, with detailed performance benchmarks and error analysis.

  • Code Quality: 7.5
  • Usefulness: 9.2
  • Claude Usage: 8.1
  • Documentation: 8.7
  • Originality: 8.9
Highlights
  • Achieves a 400x precision improvement by reading the fp32 accumulator directly from registers instead of truncated bf16 values from HBM
  • Comprehensive benchmark ladder with detailed error decomposition showing epilogue contributes only 0.3% of total error
  • Megakernel implementation delivers best-in-class performance (243ms forward+backward) with superior gradient accuracy (0.0009 MAE)
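The first highlight is easy to see numerically: storing an fp32 accumulator as bf16 keeps only 8 mantissa bits, so every value read back from HBM carries a relative error of roughly 2^-9. A small NumPy sketch (truncation used as a simplification; real hardware typically rounds to nearest) illustrates the loss the kernels avoid by reading the accumulator from registers:

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 values to bfloat16 by zeroing the low 16 bits,
    mimicking what a kernel loses when it spills bf16 to HBM."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
acc = rng.standard_normal(10_000).astype(np.float32)  # stand-in fp32 accumulator
err = np.abs(to_bf16(acc) - acc).mean()
# err is small but nonzero (~1e-3 for unit-scale values); keeping the fp32
# accumulator in registers avoids this truncation entirely.
```

Accumulating in fp32 and only down-casting the final result (or never down-casting, as here) is what makes the quoted precision gap plausible.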
To Improve
  • Add automated test suite with correctness verification and performance regression tests across different tensor sizes
  • Include installation instructions, dependency requirements, and minimal usage examples in the README