dp-data-pipeline
Experiments for filtering and evaluating Triton kernel datasets for SFT
Clauded With Love Rating: 7.1 / 10
This project creates a comprehensive data pipeline for filtering and preparing PyTorch-to-Triton kernel datasets for supervised fine-tuning. It processes 18k+ samples through multiple filtering, evaluation, and synthetic data generation stages to create high-quality training datasets.
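The repository's scripts are not included here, but the staged structure described above can be sketched as a sequence of predicate filters applied in order. Everything below is illustrative: the stage names, field names, and checks are assumptions, not the project's actual code.

```python
# Hypothetical sketch of a staged filtering pipeline for PyTorch-to-Triton
# samples. Stage functions and sample fields are illustrative only.

def has_triton_code(sample: dict) -> bool:
    # Placeholder check; the real pipeline would compile/run the kernel.
    return bool(sample.get("triton_code", "").strip())

def has_reasoning(sample: dict) -> bool:
    # Samples without a reasoning trace would feed the "no-reasoning" variant.
    return bool(sample.get("reasoning"))

STAGES = [has_triton_code, has_reasoning]

def run_pipeline(samples: list[dict], stages=STAGES) -> list[dict]:
    """Apply each filter stage in order, keeping only passing samples."""
    for stage in stages:
        samples = [s for s in samples if stage(s)]
    return samples

samples = [
    {"triton_code": "@triton.jit ...", "reasoning": "step 1 ..."},
    {"triton_code": "", "reasoning": "no kernel body"},
]
assert len(run_pipeline(samples)) == 1
```

Each stage here is a pure predicate, so stages can be reordered or dropped to produce the different dataset variants the review mentions.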
- Code Quality: 5.5
- Usefulness: 8.0
- Claude Usage: 6.0
- Documentation: 8.5
- Originality: 7.5
Highlights
- ✓ Comprehensive pipeline with 6 distinct processing steps, including difficulty rating and deduplication
- ✓ Well-organized dataset hierarchy with multiple variants (filtered, unique, synthetic tasks, no-reasoning) hosted on HuggingFace
- ✓ Excellent documentation with clear dataset descriptions, sample counts, and step-by-step usage instructions
To Improve
- → Add the implementation code and tests: the repository appears to contain only documentation, without the Python scripts it references
- → Include data-validation and error-handling details, plus examples of the input/output formats for each pipeline stage
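The second suggestion could start with per-sample schema validation. This is a sketch under assumed field names (`pytorch_code`, `triton_code`, `difficulty` are hypothetical), not the project's schema:

```python
# Hypothetical required schema for one dataset sample.
REQUIRED_FIELDS = {"pytorch_code": str, "triton_code": str, "difficulty": float}

def validate(sample: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the sample is valid."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in sample:
            errors.append(f"missing field: {field}")
        elif not isinstance(sample[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    return errors

good = {"pytorch_code": "x + y", "triton_code": "tl.load(...)", "difficulty": 3.0}
bad = {"pytorch_code": "x + y"}
assert validate(good) == []
assert len(validate(bad)) == 2  # missing triton_code and difficulty
```

Returning a list of errors rather than raising on the first failure lets a pipeline stage log every problem with a rejected sample before dropping it.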