
dp-data-pipeline

Cherished

Experiments for filtering and evaluating Triton kernel datasets for SFT

by S1ro1
Submitted April 15, 2026
Clauded With Love Rating
7.1 / 10

This project builds a data pipeline for filtering and preparing PyTorch-to-Triton kernel datasets for supervised fine-tuning. It processes 18k+ samples through six filtering, evaluation, and synthetic data generation stages to produce high-quality training datasets.

Code Quality: 5.5
Usefulness: 8.0
Claude Usage: 6.0
Documentation: 8.5
Originality: 7.5
Highlights
  • Comprehensive pipeline with 6 distinct processing steps including difficulty rating and deduplication
  • Well-organized dataset hierarchy with multiple variants (filtered, unique, synthetic tasks, no-reasoning) hosted on HuggingFace
  • Excellent documentation with clear dataset descriptions, sample counts, and step-by-step usage instructions
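The filtering and deduplication steps above could look something like the following sketch. The field name `triton_code`, the length threshold, and the whitespace normalization are assumptions for illustration, not the repository's actual logic, which isn't included in the documentation:

```python
import hashlib

def normalize(code: str) -> str:
    # Strip blank lines and indentation so trivially reformatted kernels hash alike
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())

def dedup_and_filter(samples, max_len=20000):
    """Keep one sample per normalized-kernel hash, dropping oversized entries."""
    seen, kept = set(), []
    for s in samples:
        if len(s["triton_code"]) > max_len:
            continue  # length filter: drop pathologically long kernels
        h = hashlib.sha256(normalize(s["triton_code"]).encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate after normalization
        seen.add(h)
        kept.append(s)
    return kept

samples = [
    {"id": 1, "triton_code": "import triton\n\ndef k(): pass"},
    {"id": 2, "triton_code": "import triton\ndef k(): pass"},  # dup after normalization
    {"id": 3, "triton_code": "import triton\ndef k2(): pass"},
]
print([s["id"] for s in dedup_and_filter(samples)])  # → [1, 3]
```

Hash-based exact dedup like this is cheap at 18k samples; a real pipeline might additionally use fuzzy or embedding-based near-duplicate detection.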
To Improve
  • Add the actual implementation code and tests: the repository appears to contain only documentation, without the Python scripts it references
  • Include data validation and error handling details, plus examples of input/output formats for each pipeline stage
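The suggested per-stage validation could be as simple as a schema check run before each pipeline step. The field names and types below are hypothetical, since the actual sample format isn't documented:

```python
# Hypothetical per-sample schema; adjust to the dataset's real fields.
REQUIRED_FIELDS = {"pytorch_code": str, "triton_code": str, "difficulty": int}

def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems; an empty list means the sample passes."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in sample:
            errors.append(f"missing field: {field}")
        elif not isinstance(sample[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
        elif ftype is str and not sample[field].strip():
            errors.append(f"{field}: empty")
    return errors

ok = {"pytorch_code": "x + y", "triton_code": "tl.load(...)", "difficulty": 3}
bad = {"pytorch_code": "", "difficulty": "easy"}
print(validate_sample(ok))   # → []
print(validate_sample(bad))  # three errors: empty code, missing field, wrong type
```

Collecting errors instead of raising on the first one makes it easy to log why samples were dropped at each stage, which also documents the input/output contract between stages.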