From c6d70b9beaa1a101db4ebf6b08da12e1f3fd02ca Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Mon, 16 Feb 2026 13:08:56 +0200
Subject: [PATCH] add AGENTS.md

---
 examples/llama-eval/AGENTS.md | 190 ++++++++++++++++++++++++++++++++++
 1 file changed, 190 insertions(+)
 create mode 100644 examples/llama-eval/AGENTS.md

diff --git a/examples/llama-eval/AGENTS.md b/examples/llama-eval/AGENTS.md
new file mode 100644
index 0000000000..60700aefc7
--- /dev/null
+++ b/examples/llama-eval/AGENTS.md
@@ -0,0 +1,190 @@
# llama-eval Codebase Guidelines

## Overview

This directory contains Python evaluation tools for llama.cpp:
- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
- `llama-server-simulator.py` - Flask-based server simulator for testing
- `test-simulator.sh` - Test script for the simulator

## Build/Run Commands

### Virtual Environment
The project uses a virtual environment located at `venv/`:
```bash
source venv/bin/activate
```

### Running the Main Evaluator
```bash
python llama-eval.py \
    --server http://127.0.0.1:8013 \
    --model gpt-oss-20b-hf-low \
    --dataset aime \
    --n_cases 10 \
    --grader-type llm \
    --seed 42
```

### Running the Simulator (for testing)
```bash
python llama-server-simulator.py --port 8033 --success-rate 0.8
```

### Running Tests
```bash
./test-simulator.sh
```

## Code Style Guidelines

### Imports
- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
- Third-party imports (requests, tqdm, datasets, flask) after standard library
- Relative imports not used
- Group imports by category with blank line between groups

### Formatting
- 4-space indentation
- Max line length: 125 characters (per parent project's .flake8)
- Use double quotes for strings
- Use triple double quotes for docstrings
- Binary operators at the beginning of continued lines

### Naming Conventions
- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
- Variables: snake_case (e.g., `question_text`, `correct_count`)
- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)

### Types
- Use type hints for all function signatures
- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
- Use `@dataclass` for data structures
- Prefer `Optional[T]` over `Union[T, None]`

### Error Handling
- Use try/except for network requests and file operations
- Return `None` or `False` on errors when appropriate
- Use `ValueError` for invalid arguments
- Use `FileNotFoundError` for missing files
- CLI scripts should handle exceptions gracefully

### Dataclasses
- Use `@dataclass` for structured data
- Define fields with explicit types
- Use `Optional[T]` for nullable fields
- Provide default values where appropriate

### String Formatting
- Use f-strings for formatting (Python 3.6+)
- Use triple double quotes for multi-line strings
- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`

### File Paths
- Use `pathlib.Path` instead of string paths
- Create directories with `mkdir(parents=True, exist_ok=True)`
- Use `Path.home()` for user home directory

### Logging
- Use `print()` for user-facing output
- Use `sys.stderr` for debug logging
- Simulator writes debug logs to `/tmp/simulator-debug.log`

### Testing

- Test script uses bash with `set -e` for strict error handling
- Simulator runs in background with PID tracking
- Tests verify correct answers, error cases, and edge cases
- Use `curl` for HTTP testing in shell scripts

### Whitespace Cleanup
- Remove trailing whitespace from all lines
- When making edits, do not leave trailing whitespace

## Dataset Support

### AIME Dataset
- 90 questions from 2025 AIME competition
- Answers in `\boxed{answer}` format
- Supports regex, CLI, and LLM grading

### AIME2025 Dataset
- 30 questions from 2025 AIME I & II
- Answers in `\boxed{answer}` format
- Requires loading two config parts

### GSM8K Dataset
- 7473 math word problems
- Answers are numeric values following a `####` separator
- Supports regex, CLI, and LLM grading

### GPQA Dataset
- 198 questions from GPQA Diamond
- Multiple choice with shuffled options (A, B, C, D)
- **Requires LLM grader** (returns letter A/B/C/D)

## Grading Types

### Regex Grader
- Built-in patterns per dataset
- Prioritizes `\boxed{}` for AIME datasets
- Extracts last number for GSM8K

### CLI Grader
- External script interface
- Call: `grader.sh --answer <answer> --expected <expected>`
- Exit code 0 = correct, non-zero = incorrect

### LLM Grader
- Uses judge model for answer extraction
- Includes few-shot examples
- Case-insensitive comparison
- Required for GPQA

## Configuration

### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed to API if explicitly specified

### Default Values
- `--n_predict`: -1 (infinite)
- `--grader-type`: llm
- `--seed`: 1234
- `--threads`: 32
- `--output`: llama-eval-state.json

## Output Format

### Progress Table
- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
- Uses `tqdm` for progress bars

### Results Summary
- Format: `Results: X/Y correct (Z%)`
- Displayed after all tasks complete

### JSON Output
- Complete eval state saved to output file
- Contains: task IDs, correctness, prompts, extracted answers, sampling config
- Uses `dataclasses.asdict()` for serialization

## HuggingFace Datasets

- Cache directory: `~/.cache/huggingface/datasets`
- Set via `HF_DATASETS_CACHE` environment variable
- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
- Datasets loaded with `datasets.load_dataset()`

## Flask Simulator

- Runs on configurable port (default: 5000)
- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
- Uses Dice coefficient for question matching
- Configurable success rate for testing
- Debug logs to `/tmp/simulator-debug.log`
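
The simulator's Dice-coefficient question matching can be sketched as follows. This is an illustrative implementation over lowercase character bigrams; the simulator's actual tokenization and any matching threshold may differ:

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of character bigrams, case-insensitive."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice similarity: 2 * |A & B| / (|A| + |B|) over bigram multisets."""
    ca, cb = bigrams(a), bigrams(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total
```

The simulator would then reply with the canned answer for whichever stored question scores highest against the incoming prompt.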
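
The regex grader's two extraction strategies described under "Grading Types" (prefer `\boxed{...}` for AIME datasets, fall back to the last number for GSM8K) could be sketched like this. The patterns and the `extract_answer` helper are illustrative, not the project's actual `GRADER_PATTERNS` registry:

```python
import re
from typing import Optional

# Illustrative patterns; the real registry may differ.
BOXED_RE = re.compile(r'\\boxed{([^{}]+)}')
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def extract_answer(completion: str, dataset: str) -> Optional[str]:
    """Extract a candidate answer from a model completion."""
    if dataset.startswith("aime"):
        # Prefer the last \boxed{...} occurrence in the completion.
        boxed = BOXED_RE.findall(completion)
        if boxed:
            return boxed[-1].strip()
    # GSM8K-style fallback: take the last number in the text.
    numbers = NUMBER_RE.findall(completion.replace(",", ""))
    return numbers[-1] if numbers else None
```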
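
A grader satisfying the CLI grader contract (exit code 0 = correct, non-zero = incorrect) does not have to be a shell script; a minimal Python sketch is shown below. The flag parsing and the exact-string comparison are assumptions for illustration, not the project's actual grader:

```python
#!/usr/bin/env python3
"""Hypothetical CLI grader: exit 0 = correct, non-zero = incorrect."""
import argparse
import sys


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    # Exact match after trimming; a real grader might normalize numbers first.
    return 0 if args.answer.strip() == args.expected.strip() else 1


if __name__ == "__main__":
    sys.exit(main())
```

Invoked as `grader.py --answer 42 --expected 42`, the process exit status is what the evaluator checks.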