
aiwolf-nlp-llm-judge

Cherished

A system that evaluates AIWolf game logs using large language models against predefined criteria, supporting multiple game formats (5-player, 13-player) with both common and game-specific evaluation dimensions. Results are aggregated by team and exported as structured JSON and CSV outputs, with support for parallel processing and regeneration of aggregations without re-invoking the LLM.

Submitted April 16, 2026
Clauded With Love Rating
7.5 / 10

This system evaluates AIWolf game logs using large language models against predefined criteria, supporting multiple game formats with both common and game-specific evaluation dimensions. It provides structured JSON and CSV outputs with team aggregation capabilities and parallel processing for efficient batch evaluation.

Code Quality: 7.5
Usefulness: 8.0
Claude Usage: 7.0
Documentation: 8.5
Originality: 6.5
Highlights
  • Sophisticated aggregation system that can regenerate team statistics without re-invoking LLM, saving computational costs
  • Clean separation of common and game-specific evaluation criteria with flexible YAML configuration
  • Comprehensive documentation with clear setup instructions, directory structure examples, and detailed output format specifications
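The first highlight, regenerating team statistics from cached per-game results without new LLM calls, can be sketched as a small aggregation pass over the stored JSON files. This is a hypothetical sketch, not the repository's actual code: the directory layout, the `evaluations`/`team`/`scores` schema, and the function name `aggregate_teams` are all assumptions for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

def aggregate_teams(results_dir: str) -> dict[str, dict[str, float]]:
    """Rebuild per-team average scores from cached per-game JSON files.

    No LLM is invoked: the expensive evaluation step already wrote its
    scores to disk, so aggregation is a cheap, repeatable re-read.
    (Hypothetical schema: each file holds {"evaluations": [{"team": ...,
    "scores": {criterion: number, ...}}, ...]}.)
    """
    totals: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for path in Path(results_dir).glob("*.json"):
        game = json.loads(path.read_text(encoding="utf-8"))
        for entry in game["evaluations"]:
            for criterion, score in entry["scores"].items():
                totals[entry["team"]][criterion].append(score)
    # Average each criterion per team, rounded for CSV-friendly output.
    return {
        team: {crit: round(mean(scores), 2) for crit, scores in crits.items()}
        for team, crits in totals.items()
    }
```

Because the aggregation only touches files already on disk, it can be re-run after fixing a bug in the statistics without paying API costs again, which is the cost-saving property the highlight describes.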
To Improve
  • Add comprehensive test coverage with unit tests for core evaluation logic and integration tests for the full pipeline
  • Implement proper error handling and validation for malformed log files, missing JSON files, and API failures with graceful degradation
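The second improvement point, graceful degradation on malformed logs, could look like the following sketch: collect failures per file and continue the batch rather than aborting. The helper name `load_game_logs` and the assumption that logs are JSON files are illustrative, not taken from the repository.

```python
import json
from pathlib import Path

def load_game_logs(log_dir: str) -> tuple[list[dict], list[str]]:
    """Load every JSON game log in a directory, skipping unreadable or
    malformed files instead of crashing the whole evaluation run.

    Returns (parsed_games, failure_messages) so callers can both proceed
    with the valid logs and report what was skipped.
    """
    games: list[dict] = []
    failures: list[str] = []
    for path in sorted(Path(log_dir).glob("*.json")):
        try:
            games.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError) as exc:
            # Record the problem and move on: one bad log should not
            # invalidate an otherwise successful batch evaluation.
            failures.append(f"{path.name}: {exc}")
    return games, failures
```

The same pattern extends to API failures: wrap each LLM call, log the error with the game identifier, and emit a partial result set plus a failure report instead of an unhandled traceback.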