aiwolf-nlp-llm-judge
An evaluation system that scores AIWolf game logs using LLMs against predefined criteria, supporting multiple game formats (5-player, 13-player, etc.) with both common and format-specific metrics. It processes CSV game logs and JSON character files in parallel, aggregates results by team, and outputs detailed JSON evaluations alongside CSV summaries.
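The parallel scoring step described above could be sketched roughly as follows. This is a minimal sketch, not the project's actual API: `score_log`, the criteria names, and the file names are hypothetical placeholders, and the real system would call an LLM instead of returning stub scores.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def score_log(log_path: str, criteria: list[str]) -> dict:
    # Hypothetical stand-in for one LLM evaluation call: score a single
    # game log against each criterion. A real implementation would send
    # the log and criteria to an LLM API and parse its response.
    return {"log": log_path, "scores": {c: 0.0 for c in criteria}}

def evaluate_all(log_paths: list[str], criteria: list[str], workers: int = 4) -> list[dict]:
    # Game logs are independent of each other, so they can be
    # scored concurrently with a simple thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: score_log(p, criteria), log_paths))

# Usage: score two logs against two criteria and dump the detailed JSON.
results = evaluate_all(["game1.csv", "game2.csv"], ["persuasion", "consistency"])
print(json.dumps(results, indent=2))
```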
Clauded With Love Rating
7.7 / 10
Code Quality: 7.5
Usefulness: 8.0
Claude Usage: 7.0
Documentation: 8.5
Originality: 7.5
Highlights
- ✓ Excellent bilingual documentation with comprehensive setup instructions, usage examples, and clear project structure
- ✓ Sophisticated evaluation architecture supporting both common and game-format-specific criteria with flexible configuration
- ✓ Thoughtful aggregation-regeneration feature that allows rebuilding summaries without re-running expensive LLM evaluations
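The aggregation-regeneration idea highlighted above can be sketched as follows: rebuild the team-level CSV summary purely from already-saved per-game JSON evaluations, so no LLM calls are repeated. The file layout and the `team_scores` field name are assumptions for illustration, not the project's actual schema.

```python
import csv
import json
import pathlib
from collections import defaultdict

def regenerate_summary(eval_dir: str, out_csv: str) -> None:
    # Re-aggregate per-team mean scores from previously saved JSON
    # evaluation files; the expensive LLM step is never re-run.
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for path in pathlib.Path(eval_dir).glob("*.json"):
        data = json.loads(path.read_text())
        for team, score in data["team_scores"].items():  # assumed field name
            totals[team] += score
            counts[team] += 1
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["team", "mean_score"])
        for team in sorted(totals):
            writer.writerow([team, totals[team] / counts[team]])
```

Because the summary is derived entirely from cached JSON, criteria weights or aggregation rules can be tweaked and the CSV rebuilt at zero API cost.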
To Improve
- → Add comprehensive test coverage with unit tests for core evaluation logic and integration tests for the full pipeline
- → Implement proper error handling and retry mechanisms for LLM API calls, including rate limiting and graceful degradation
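A minimal retry helper along the lines suggested above could wrap each LLM call: exponential backoff with jitter, re-raising after the final attempt so failures still surface. This is a generic sketch around an arbitrary callable, not code from the project.

```python
import random
import time

def call_with_retry(fn, max_attempts: int = 5, base_delay: float = 1.0):
    # Retry a flaky API call (e.g. an LLM request hitting rate limits)
    # with exponential backoff plus jitter. The last failure is
    # re-raised so callers can still degrade gracefully.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Wrapping each per-log evaluation in `call_with_retry` would let transient rate-limit errors recover automatically instead of aborting the whole run.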