Benchmarking Framework

Cogitator includes a benchmarking framework for comparing the performance of different Chain-of-Thought (CoT) methods across datasets.

Overview

The framework consists of two main scripts:

  1. benches/run.py: Handles the generation phase. It loads datasets, configures LLMs and CoT strategies based on benches.yml and CLI arguments, runs the strategies on dataset questions, and saves the raw LLM outputs and timings to a JSONL file.
  2. benches/evaluate.py: Handles the evaluation phase. It reads the results file generated by run.py, extracts final answers from the raw outputs (using either heuristics or an LLM), compares them to gold answers, and calculates accuracy metrics for each method (a rough sketch of this phase follows the list).
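
To make the two phases concrete, the sketch below shows how an evaluation step over a JSONL results file might look. The field names (method, raw_output, gold) and the numeric extraction heuristic are assumptions for illustration only; the actual record schema and extraction logic are defined by benches/run.py and benches/evaluate.py.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def extract_answer(raw_output: str) -> str | None:
    """Heuristic extraction: take the last number mentioned in the raw LLM output."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw_output)
    return numbers[-1] if numbers else None


def score(results_path: Path) -> dict[str, float]:
    """Compute per-method accuracy from a JSONL results file (hypothetical schema)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with results_path.open() as fh:
        for line in fh:
            record = json.loads(line)
            method = record["method"]          # e.g. "zero_shot_cot", "self_consistency"
            total[method] += 1
            if extract_answer(record["raw_output"]) == record["gold"]:
                correct[method] += 1
    return {method: correct[method] / total[method] for method in total}


if __name__ == "__main__":
    print(score(Path("results.jsonl")))
```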

Configuration

Benchmarks are configured via the benches.yml file in the project root; its settings can be overridden by command-line arguments passed to either script.
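
A common pattern for this kind of layered configuration is to load the YAML file first and then apply only the CLI flags the user actually supplied. The sketch below illustrates the idea; the keys and flags shown (--provider, --model) are hypothetical and do not reflect the actual benches.yml schema or the CLI of run.py.

```python
import argparse

import yaml  # PyYAML


def load_config(path: str, cli_args: argparse.Namespace) -> dict:
    """Load the YAML config, then apply any CLI overrides that were supplied."""
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}
    # Only override keys the user explicitly passed on the command line.
    overrides = {
        key: value
        for key, value in vars(cli_args).items()
        if key != "config" and value is not None
    }
    config.update(overrides)
    return config


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="benches.yml")
    parser.add_argument("--provider")  # hypothetical override, e.g. an LLM backend name
    parser.add_argument("--model")     # hypothetical override, e.g. a model identifier
    args = parser.parse_args()
    print(load_config(args.config, args))
```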

How to Run

Detailed instructions on how to configure and run the benchmarks are available in the benchmarking README (benches/README.md), which covers:

  • Command-line options for run.py and evaluate.py.
  • Detailed explanation of the benches.yml configuration structure.
  • The benchmark workflow.
  • Example usage commands.
  • Available datasets and dependencies.