From c6d70b9beaa1a101db4ebf6b08da12e1f3fd02ca Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Mon, 16 Feb 2026 13:08:56 +0200
Subject: [PATCH] add AGENTS.md

---
 examples/llama-eval/AGENTS.md | 190 ++++++++++++++++++++++++++++++++++
 1 file changed, 190 insertions(+)
 create mode 100644 examples/llama-eval/AGENTS.md

diff --git a/examples/llama-eval/AGENTS.md b/examples/llama-eval/AGENTS.md
new file mode 100644
index 0000000000..60700aefc7
--- /dev/null
+++ b/examples/llama-eval/AGENTS.md
@@ -0,0 +1,190 @@
# llama-eval Codebase Guidelines

## Overview

This directory contains Python evaluation tools for llama.cpp:
- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
- `llama-server-simulator.py` - Flask-based server simulator for testing
- `test-simulator.sh` - Test script for the simulator

## Build/Run Commands

### Virtual Environment
The project uses a virtual environment located at `venv/`:
```bash
source venv/bin/activate
```

### Running the Main Evaluator
```bash
python llama-eval.py \
    --server http://127.0.0.1:8013 \
    --model gpt-oss-20b-hf-low \
    --dataset aime \
    --n_cases 10 \
    --grader-type llm \
    --seed 42
```

### Running the Simulator (for testing)
```bash
python llama-server-simulator.py --port 8033 --success-rate 0.8
```

### Running Tests
```bash
./test-simulator.sh
```

## Code Style Guidelines

### Imports
- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
- Third-party imports (requests, tqdm, datasets, flask) after standard library
- Relative imports not used
- Group imports by category with blank line between groups

### Formatting
- 4-space indentation
- Max line length: 125 characters (per parent project's .flake8)
- Use double quotes for strings
- Use triple double quotes for docstrings
- Binary operators at the beginning of continued lines

### Naming Conventions
- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
- Variables: snake_case (e.g., `question_text`, `correct_count`)
- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)

### Types
- Use type hints for all function signatures
- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
- Use `@dataclass` for data structures
- Prefer `Optional[T]` over `Union[T, None]`

### Error Handling
- Use try/except for network requests and file operations
- Return `None` or `False` on errors when appropriate
- Use `ValueError` for invalid arguments
- Use `FileNotFoundError` for missing files
- CLI scripts should handle exceptions gracefully

### Dataclasses
- Use `@dataclass` for structured data
- Define fields with explicit types
- Use `Optional[T]` for nullable fields
- Provide default values where appropriate

### String Formatting
- Use f-strings for formatting (Python 3.6+)
- Use triple double quotes for multi-line strings
- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`

### File Paths
- Use `pathlib.Path` instead of string paths
- Create directories with `mkdir(parents=True, exist_ok=True)`
- Use `Path.home()` for user home directory

### Logging
- Use `print()` for user-facing output
- Use `sys.stderr` for debug logging
- Simulator writes debug logs to `/tmp/simulator-debug.log`

### Testing

- Test script uses bash with `set -e` for strict error handling
- Simulator runs in background with PID tracking
- Tests verify correct answers, error cases, and edge cases
- Use `curl` for HTTP testing in shell scripts

### Whitespace Cleanup
- Remove trailing whitespace from all lines
- When making edits, do not leave trailing whitespace

## Dataset Support

### AIME Dataset
- 90 questions from 2025 AIME competition
- Answers in `\boxed{answer}` format
- Supports regex, CLI, and LLM grading

### AIME2025 Dataset
- 30 questions from 2025 AIME I & II
- Answers in `\boxed{answer}` format
- Requires loading two config parts

### GSM8K Dataset
- 7473 math word problems
- Answers are numeric values following a `####` separator
- Supports regex, CLI, and LLM grading

### GPQA Dataset
- 198 questions from GPQA Diamond
- Multiple choice with shuffled options (A, B, C, D)
- **Requires LLM grader** (returns letter A/B/C/D)

## Grading Types

### Regex Grader
- Built-in patterns per dataset
- Prioritizes `\boxed{}` for AIME datasets
- Extracts last number for GSM8K

### CLI Grader
- External script interface
- Call: `grader.sh --answer <answer> --expected <expected>`
- Exit code 0 = correct, non-zero = incorrect

### LLM Grader
- Uses judge model for answer extraction
- Includes few-shot examples
- Case-insensitive comparison
- Required for GPQA

## Configuration

### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed to API if explicitly specified

### Default Values
- `--n_predict`: -1 (infinite)
- `--grader-type`: llm
- `--seed`: 1234
- `--threads`: 32
- `--output`: llama-eval-state.json

## Output Format

### Progress Table
- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
- Uses `tqdm` for progress bars

### Results Summary
- Format: `Results: X/Y correct (Z%)`
- Displayed after all tasks complete

### JSON Output
- Complete eval state saved to output file
- Contains: task IDs, correctness, prompts, extracted answers, sampling config
- Uses `dataclasses.asdict()` for serialization

## HuggingFace Datasets

- Cache directory: `~/.cache/huggingface/datasets`
- Set via `HF_DATASETS_CACHE` environment variable
- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
- Datasets loaded with `datasets.load_dataset()`

## Flask Simulator

- Runs on configurable port (default: 5000)
- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
- Uses Dice coefficient for question matching
- Configurable success rate for testing
- Debug logs to `/tmp/simulator-debug.log`
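
The simulator's Dice-coefficient question matching can be sketched as follows. This is an illustrative implementation over lowercase character bigrams; the simulator's actual tokenization and any matching threshold may differ:

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of character bigrams, case-insensitive."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice similarity: 2 * |A & B| / (|A| + |B|) over bigram multisets."""
    ca, cb = bigrams(a), bigrams(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total
```

The simulator would then reply with the canned answer for whichever stored question scores highest against the incoming prompt.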
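
The regex grader's two extraction strategies described under "Grading Types" (prefer `\boxed{...}` for AIME datasets, fall back to the last number for GSM8K) could be sketched like this. The patterns and the `extract_answer` helper are illustrative, not the project's actual `GRADER_PATTERNS` registry:

```python
import re
from typing import Optional

# Illustrative patterns; the real registry may differ.
BOXED_RE = re.compile(r'\\boxed{([^{}]+)}')
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def extract_answer(completion: str, dataset: str) -> Optional[str]:
    """Extract a candidate answer from a model completion."""
    if dataset.startswith("aime"):
        # Prefer the last \boxed{...} occurrence in the completion.
        boxed = BOXED_RE.findall(completion)
        if boxed:
            return boxed[-1].strip()
    # GSM8K-style fallback: take the last number in the text.
    numbers = NUMBER_RE.findall(completion.replace(",", ""))
    return numbers[-1] if numbers else None
```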
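
A grader satisfying the CLI grader contract (exit code 0 = correct, non-zero = incorrect) does not have to be a shell script; a minimal Python sketch is shown below. The flag parsing and the exact-string comparison are assumptions for illustration, not the project's actual grader:

```python
#!/usr/bin/env python3
"""Hypothetical CLI grader: exit 0 = correct, non-zero = incorrect."""
import argparse
import sys


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    # Exact match after trimming; a real grader might normalize numbers first.
    return 0 if args.answer.strip() == args.expected.strip() else 1


if __name__ == "__main__":
    sys.exit(main())
```

Invoked as `grader.py --answer 42 --expected 42`, the process exit status is what the evaluator checks.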