Benchmarking Framework
Cogitator includes a framework for running benchmarks to compare the performance of different Chain-of-Thought methods on various datasets.
Overview
The framework consists of two main scripts:

- benches/run.py: Handles the generation phase. It loads datasets, configures LLMs and CoT strategies based on benches.yml and CLI arguments, runs the strategies on dataset questions, and saves the raw LLM outputs and timings to a JSONL file.
- benches/evaluate.py: Handles the evaluation phase. It reads the results file generated by run.py, extracts final answers from the raw outputs (using either heuristics or an LLM), compares them to gold answers, and calculates accuracy metrics for each method (see the sketch below).
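To make the evaluation phase concrete, here is a minimal sketch of reading a results JSONL file and computing per-method accuracy with a simple numeric-extraction heuristic. The record fields (method, raw_output, gold) and the extraction rule are assumptions for illustration only; the actual schema and logic live in benches/run.py and benches/evaluate.py.

```python
import json
import re
from collections import defaultdict


def extract_answer(raw_output: str) -> str:
    """Heuristic extraction: take the last number mentioned in the raw LLM output."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw_output)
    return numbers[-1] if numbers else ""


def score(results_path: str) -> dict[str, float]:
    """Compare extracted answers to gold answers and aggregate accuracy per method."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)  # hypothetical fields: method, raw_output, gold
            method = record["method"]
            total[method] += 1
            if extract_answer(record["raw_output"]) == str(record["gold"]).strip():
                correct[method] += 1
    return {method: correct[method] / total[method] for method in total}


if __name__ == "__main__":
    print(score("results.jsonl"))
```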
Configuration
Benchmarks are configured using the benches.yml file located in the project root; its settings can be overridden by command-line arguments passed to the scripts.
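The precedence is YAML first, CLI second. Below is a minimal sketch of how such merging might work, assuming hypothetical flags such as --provider and --model; the real option names and benches.yml keys are defined by the scripts and documented in benches/README.md.

```python
import argparse

import yaml  # requires PyYAML

# Hypothetical flags for illustration; the actual CLI options may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--provider", default=None)
parser.add_argument("--model", default=None)
args = parser.parse_args()

# Load the base configuration from the YAML file in the project root.
with open("benches.yml") as f:
    config = yaml.safe_load(f)

# Any CLI argument that was explicitly provided overrides the YAML value.
for key, value in vars(args).items():
    if value is not None:
        config[key] = value
```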
How to Run
Detailed instructions on how to configure and run the benchmarks are available in the benchmarking README (benches/README.md). This includes:

- Command-line options for run.py and evaluate.py.
- A detailed explanation of the benches.yml configuration structure.
- The benchmark workflow.
- Example usage commands.
- Available datasets and dependencies.