fast-reduction
Fast-reduction implements fused GPU kernels for computing linear transformations with cross-entropy and entropy losses on NVIDIA Hopper (H100) architectures, reducing memory consumption from 56GB to 1.3GB while improving speed. It provides a systematic progression of implementations, from unfused PyTorch operations through increasingly optimized CuTe DSL variants, culminating in a megakernel whose combined forward and backward passes run 2x faster than the baseline. The project includes detailed performance benchmarks and error analysis, and is aimed at researchers and engineers optimizing large language model training on H100s.
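The memory gap comes from the unfused path materializing the full logits tensor before the loss reduces it. A back-of-envelope sketch of that scaling, in plain Python; the token count and vocabulary size below are assumptions for illustration, not the project's actual benchmark configuration:

```python
# Unfused linear + cross-entropy materializes a [tokens, vocab] logits
# tensor in HBM; the fused kernels avoid ever storing it.
tokens = 64 * 1024   # batch * sequence length (assumed)
vocab = 128_256      # vocabulary size (assumed, Llama-3-like)
bytes_fp32 = 4

logits_bytes = tokens * vocab * bytes_fp32
# The backward pass typically holds a same-shaped gradient as well,
# roughly doubling the peak footprint.
total_gib = 2 * logits_bytes / 2**30

print(f"logits alone:  {logits_bytes / 2**30:.1f} GiB")
print(f"with gradient: {total_gib:.1f} GiB")
```

A fused kernel instead computes the reduction tile-by-tile, so only per-tile partial results ever leave registers, which is where the orders-of-magnitude memory saving comes from.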
- ✓ Achieves a 400x precision improvement by reading the fp32 accumulator directly from registers instead of the truncated bf16 copy in HBM
- ✓ Comprehensive benchmark ladder with detailed error decomposition, showing the epilogue contributes only 0.3% of total error
- ✓ Megakernel implementation delivers best-in-class performance (243ms forward+backward) with superior gradient accuracy (0.0009 MAE)
- → Add an automated test suite with correctness verification and performance regression tests across different tensor sizes
- → Include installation instructions, dependency requirements, and minimal usage examples in the README
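The register-vs-HBM precision point above can be illustrated without a GPU. Below is a minimal pure-Python sketch that simulates bf16 rounding in software and compares a high-precision running sum (standing in for the fp32 register accumulator) against a sum that round-trips through bf16 at every step (standing in for reading truncated values back from HBM). The `to_bf16` helper and the toy summation are illustrative assumptions, not the project's code:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 precision (keep 8 exponent + 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the discarded low 16 bits, then zero them.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Sum many small terms. Python floats are fp64, so acc_hi only approximates
# an fp32 register accumulator, but the contrast with bf16 is the point.
n, term = 100_000, 1e-3
acc_hi = 0.0
acc_bf16 = 0.0
for _ in range(n):
    acc_hi += term                                 # stays in high precision
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(term))   # truncated every step

exact = 100.0
print(f"high-precision accumulator error: {abs(acc_hi - exact):.2e}")
print(f"bf16 round-trip accumulator error: {abs(acc_bf16 - exact):.2e}")
```

Once the bf16 running sum grows a few hundred times larger than each term, new terms fall below half an ulp and the sum stagnates entirely; keeping the accumulator in registers at full precision avoids this, which is the mechanism behind the large accuracy gap the benchmark reports.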