papercutter


Papercutter automates the extraction of structured data from PDF research papers and books into analysis-ready datasets. It combines PDF-to-markdown conversion with LLM-powered, schema-based extraction, letting researchers configure custom data fields and generate CSV matrices or PDF reports from document collections. It is built for systematic reviews and meta-analyses, where manual data entry from dozens or hundreds of papers is infeasible.

by rawatpranjal

★ 2 · submitted April 17, 2026


Clauded With Love Rating

7.5 / 10

Papercutter automates extraction of structured data from PDF research papers into analysis-ready datasets using PDF-to-markdown conversion and LLM-powered schema extraction. It targets researchers conducting systematic reviews and meta-analyses who need to process dozens or hundreds of papers efficiently.

Code Quality: 6.5

Usefulness: 8.5

Claude Usage: 7.0

Documentation: 8.0

Originality: 7.5

Highlights

✓ Solves a genuine pain point for researchers with a complete end-to-end pipeline from PDF ingestion to CSV output and LaTeX reports
✓ Well-structured CLI interface with logical command progression (ingest → configure → extract → report) that matches researcher workflow
✓ Includes concrete examples directory with real outputs from seminal ML papers and book processing, demonstrating practical value
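
As a rough illustration of the schema-to-CSV step described above, the sketch below aligns per-paper extracted fields to a fixed schema and writes one row per paper. It is not Papercutter's code: the names SCHEMA, to_row, and write_matrix, the field names, and the sample values are all hypothetical, and the LLM call is replaced by a hard-coded dictionary.

```python
import csv
import io

# Hypothetical schema: column name -> description that would be given to the LLM.
SCHEMA = {
    "title": "Paper title",
    "sample_size": "Number of participants or observations",
    "method": "Primary method or model",
}


def to_row(paper_id, extracted, schema=SCHEMA):
    """Align one paper's extracted fields to the schema; missing fields become empty."""
    row = {"paper_id": paper_id}
    for field in schema:
        value = extracted.get(field)
        row[field] = "" if value is None else str(value)
    return row


def write_matrix(rows, schema=SCHEMA):
    """Render rows as CSV text: one row per paper, one column per schema field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["paper_id", *schema], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for LLM output (assumption); a real run would parse the model's response.
extracted = {
    "paper_a": {"title": "Example A", "sample_size": 120, "method": "RCT"},
    "paper_b": {"title": "Example B", "method": "Survey"},
}
rows = [to_row(paper_id, fields) for paper_id, fields in extracted.items()]
matrix = write_matrix(rows)
print(matrix)
```

Fixing the column order from the schema rather than from each paper's output keeps the matrix rectangular even when the model omits a field, which matters when the CSV feeds a meta-analysis.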

To Improve

→ Add error handling documentation and recovery strategies for failed PDF processing or LLM extraction failures
→ Implement batch processing controls and rate limiting options for large document collections to prevent API quota exhaustion
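
One way the two suggestions above could look in practice is sketched below: a minimum-interval rate limiter combined with retries and exponential backoff, with failed documents recorded instead of aborting the batch. This is an assumption-laden sketch, not part of Papercutter; RateLimiter, process_batch, flaky_worker, and the file names are hypothetical, and the worker stands in for a real PDF-conversion or LLM call.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()


def process_batch(items, worker, limiter, max_retries=3, base_delay=1.0,
                  sleep=time.sleep):
    """Run worker on each item; retry with exponential backoff, collect failures."""
    results, failures = {}, {}
    for item in items:
        for attempt in range(max_retries):
            limiter.wait()
            try:
                results[item] = worker(item)
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    # Out of retries: record the error and move on to the next item.
                    failures[item] = str(exc)
                else:
                    sleep(base_delay * 2 ** attempt)
    return results, failures


# Hypothetical worker that simulates one transient and one permanent failure.
calls = {}


def flaky_worker(name):
    calls[name] = calls.get(name, 0) + 1
    if name == "bad.pdf":
        raise ValueError("unreadable PDF")
    if name == "flaky.pdf" and calls[name] == 1:
        raise TimeoutError("transient API error")
    return f"extracted:{name}"


limiter = RateLimiter(min_interval=0.0)  # use e.g. 1.0 for one request per second
results, failures = process_batch(
    ["ok.pdf", "flaky.pdf", "bad.pdf"],
    flaky_worker,
    limiter,
    sleep=lambda seconds: None,  # skip real backoff delays in this demo
)
print(results)
print(failures)
```

Returning failures separately lets a later run retry only the documents that did not succeed, rather than repeating paid API calls for the whole collection.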

Topic

Automation, AI/ML, Data Pipeline

Language

Python

Type

CLI Tool