
aiwolf-nlp-llm-judge

Cherished

A system that evaluates AIWolf game logs using large language models against predefined criteria, supporting multiple game formats (5-player, 13-player) with both common and game-specific evaluation dimensions. Results are aggregated by team and exported as structured JSON and CSV outputs, with support for parallel processing and regeneration of aggregations without re-invoking the LLM.

Submitted April 16, 2026
Clauded With Love Rating
7.5 / 10

This system evaluates AIWolf game logs using large language models against predefined criteria, supporting multiple game formats with both common and game-specific evaluation dimensions. It provides structured JSON and CSV outputs with team aggregation capabilities and parallel processing for efficient batch evaluation.

Code Quality: 7.5
Usefulness: 8.0
Claude Usage: 7.0
Documentation: 8.5
Originality: 6.5
Highlights
  • Sophisticated aggregation system that can regenerate team statistics without re-invoking LLM, saving computational costs
  • Clean separation of common and game-specific evaluation criteria with flexible YAML configuration
  • Comprehensive documentation with clear setup instructions, directory structure examples, and detailed output format specifications
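The first highlight, regenerating team statistics from cached per-game results without new LLM calls, can be sketched as a small aggregation pass over the stored JSON files. This is a hypothetical sketch, not the repository's actual code: the directory layout, the `evaluations`/`team`/`scores` schema, and the function name `aggregate_teams` are all assumptions for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

def aggregate_teams(results_dir: str) -> dict[str, dict[str, float]]:
    """Rebuild per-team average scores from cached per-game JSON files.

    No LLM is invoked: the expensive evaluation step already wrote its
    scores to disk, so aggregation is a cheap, repeatable re-read.
    (Hypothetical schema: each file holds {"evaluations": [{"team": ...,
    "scores": {criterion: number, ...}}, ...]}.)
    """
    totals: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for path in Path(results_dir).glob("*.json"):
        game = json.loads(path.read_text(encoding="utf-8"))
        for entry in game["evaluations"]:
            for criterion, score in entry["scores"].items():
                totals[entry["team"]][criterion].append(score)
    # Average each criterion per team, rounded for CSV-friendly output.
    return {
        team: {crit: round(mean(scores), 2) for crit, scores in crits.items()}
        for team, crits in totals.items()
    }
```

Because the aggregation only touches files already on disk, it can be re-run after fixing a bug in the statistics without paying API costs again, which is the cost-saving property the highlight describes.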
To Improve
  • Add comprehensive test coverage with unit tests for core evaluation logic and integration tests for the full pipeline
  • Implement proper error handling and validation for malformed log files, missing JSON files, and API failures with graceful degradation
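The second improvement point, graceful degradation on malformed logs, could look like the following sketch: collect failures per file and continue the batch rather than aborting. The helper name `load_game_logs` and the assumption that logs are JSON files are illustrative, not taken from the repository.

```python
import json
from pathlib import Path

def load_game_logs(log_dir: str) -> tuple[list[dict], list[str]]:
    """Load every JSON game log in a directory, skipping unreadable or
    malformed files instead of crashing the whole evaluation run.

    Returns (parsed_games, failure_messages) so callers can both proceed
    with the valid logs and report what was skipped.
    """
    games: list[dict] = []
    failures: list[str] = []
    for path in sorted(Path(log_dir).glob("*.json")):
        try:
            games.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError) as exc:
            # Record the problem and move on: one bad log should not
            # invalidate an otherwise successful batch evaluation.
            failures.append(f"{path.name}: {exc}")
    return games, failures
```

The same pattern extends to API failures: wrap each LLM call, log the error with the game identifier, and emit a partial result set plus a failure report instead of an unhandled traceback.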