8113 Commits

Author SHA1 Message Date
Georgi Gerganov
c0c3e428dd
refactor 2026-02-16 23:02:45 +02:00
Georgi Gerganov
7f049860b4
reasoning and error handling 2026-02-16 22:16:15 +02:00
Georgi Gerganov
2ffa45edfc
add tokens 2026-02-16 21:52:54 +02:00
Georgi Gerganov
9c29be1177
store full response 2026-02-16 21:44:29 +02:00
Georgi Gerganov
013963cfd5
add html 2026-02-16 21:22:06 +02:00
Georgi Gerganov
e2e998a2d6
fix prompts 2026-02-16 21:02:25 +02:00
Georgi Gerganov
6c41664b8b
simplify 2026-02-16 19:50:27 +02:00
Georgi Gerganov
7b84af8051
fix counts 2026-02-16 16:38:31 +02:00
Georgi Gerganov
60a501e138
cleanup 2026-02-16 16:31:14 +02:00
Georgi Gerganov
e6e777cfb3
resume eval 2026-02-16 16:21:36 +02:00
Georgi Gerganov
ad3a54eb68
ignore errors 2026-02-16 15:23:23 +02:00
Georgi Gerganov
c6d70b9bea
add AGENTS.md 2026-02-16 13:13:35 +02:00
Georgi Gerganov
de956a6ca8
cleanup 2026-02-16 12:02:16 +02:00
Georgi Gerganov
350e7c1409
datasets : fix aime2025 2026-02-16 11:55:57 +02:00
Georgi Gerganov
db10dda1f3
grade : improve regex + logs 2026-02-16 11:51:36 +02:00
Georgi Gerganov
52759bf078
grader : update prompt 2026-02-16 11:17:53 +02:00
Georgi Gerganov
99e3c3d02c
datasets : add aime2025 2026-02-16 11:07:54 +02:00
Georgi Gerganov
c6315655b7
cont 2026-02-16 10:56:58 +02:00
Georgi Gerganov
f762a71d56
grader : improve example answers 2026-02-16 10:51:41 +02:00
Georgi Gerganov
73e61d5b75
rename 2026-02-16 10:30:10 +02:00
Georgi Gerganov
cffd268bb3
add gpqa + sampling + docs 2026-02-16 00:52:33 +02:00
Georgi Gerganov
e8a807519a
datasets : add gsm8k 2026-02-15 23:19:46 +02:00
Georgi Gerganov
1db8428f00
remove old files 2026-02-15 22:16:54 +02:00
Georgi Gerganov
7751ae2796
docs 2026-02-15 22:15:50 +02:00
Georgi Gerganov
d2b10302ce
improve grader 2026-02-15 22:12:02 +02:00
Georgi Gerganov
68dde884d6
minor 2026-02-15 21:21:40 +02:00
Georgi Gerganov
fd90796da2
eval : support multiple dataset runs 2026-02-15 21:08:24 +02:00
Georgi Gerganov
8156d549f6
sim : fix answer matching 2026-02-15 21:08:24 +02:00
Georgi Gerganov
9695e6feb4
test : fix path 2026-02-15 21:08:24 +02:00
Georgi Gerganov
fb1481d60d
eval : add prompts 2026-02-15 21:08:24 +02:00
Georgi Gerganov
812ae13ec1
eval : print progress 2026-02-15 21:08:24 +02:00
Georgi Gerganov
e79e8d02d5
examples: add task summary table to llama-eval-new.py 2026-02-15 21:08:23 +02:00
Georgi Gerganov
a939f4c47e
docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-02-15 21:08:23 +02:00
Georgi Gerganov
62b04cef54
examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-02-15 21:08:23 +02:00
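
The threading change described in commit 62b04cef54 can be pictured roughly as follows. This is a minimal sketch and not the code from llama-eval-new.py: `process_single_case` and `process_all` are hypothetical names, the per-case work is a stand-in for sending a request and grading the answer, and `n_threads` plays the role of the `--threads` argument.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_single_case(case_id):
    """Hypothetical stand-in for one request + grading step.

    It touches no shared state, so it is safe to run from several threads.
    Here even case ids count as "correct" purely for illustration.
    """
    return case_id, case_id % 2 == 0


def process_all(case_ids, n_threads=4):
    """Run all cases on a thread pool and report progress as they finish."""
    case_ids = list(case_ids)
    results = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(process_single_case, c) for c in case_ids]
        # as_completed yields futures in completion order; results are
        # collected in the calling thread, so no lock is needed here.
        for done, fut in enumerate(as_completed(futures), start=1):
            case_id, ok = fut.result()
            results[case_id] = ok
            status = "correct" if ok else "incorrect"
            print(f"[{done}/{len(case_ids)}] case {case_id}: {status}")
    return results
```

Collecting results in the calling thread keeps the progress counter consistent even though cases finish out of order.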
Georgi Gerganov
37b26cafee
docs: update llama-eval-discussion.md with session work summary 2026-02-15 21:08:23 +02:00
Georgi Gerganov
04f6872116
examples: use cached dataset path in simulator to avoid HF Hub requests 2026-02-15 21:08:23 +02:00
Georgi Gerganov
c2619c18bf
examples: use cached dataset path to avoid HF Hub requests 2026-02-15 21:08:23 +02:00
Georgi Gerganov
87f8930968
examples: remove HF_HUB_OFFLINE to allow dataset download 2026-02-15 21:08:23 +02:00
Georgi Gerganov
9453f9de12
examples: use HF_HUB_OFFLINE to avoid HF Hub warnings 2026-02-15 21:08:23 +02:00
Georgi Gerganov
5a1be6ce37
examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-02-15 21:08:23 +02:00
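
The two grading modes listed in commit 5a1be6ce37 could look something like the sketch below. It is an illustration under assumptions, not the Grader class from the commit: the function names, the two regex patterns, and the convention that a CLI grader signals a correct answer with exit code 0 are all hypothetical. Only the `--answer` / `--expected` arguments and the 30-second timeout come from the commit message.

```python
import re
import subprocess
import sys

# Assumed patterns: a \boxed{N} answer, or a bare integer of up to 3 digits.
BOXED = re.compile(r"\\boxed\{(\d+)\}")
PLAIN = re.compile(r"\b(\d{1,3})\b")


def extract_aime_answer(text):
    """Return the last boxed integer if present, else the last plain one."""
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1]
    plain = PLAIN.findall(text)
    return plain[-1] if plain else None


def grade_regex(response, expected):
    """Exact match on the extracted integer (leading zeros are ignored)."""
    pred = extract_aime_answer(response)
    return pred is not None and int(pred) == int(expected)


def grade_cli(script, pred, expected, timeout=30):
    """Delegate grading to an external script.

    Assumption: the script exits with status 0 when the answer is correct.
    A grader that exceeds the timeout is treated as a failed grade.
    """
    cmd = [sys.executable, script, "--answer", pred, "--expected", expected]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Preferring the boxed form over plain digits avoids picking up intermediate numbers from the model's reasoning when a final boxed answer is present.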
Georgi Gerganov
a80814e97b
docs: remove README.md from llama-eval 2026-02-15 21:08:23 +02:00
Georgi Gerganov
5cc2258e82
examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
2026-02-15 21:08:22 +02:00
Georgi Gerganov
c87af1d527
docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
2026-02-15 21:08:22 +02:00
Georgi Gerganov
23d4e21a81
examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
2026-02-15 21:08:22 +02:00
Georgi Gerganov
07d5e1e0ea
examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
2026-02-15 21:08:22 +02:00
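
For readers unfamiliar with the question-matching idea in commit 07d5e1e0ea, here is a minimal sketch of Levenshtein-based lookup. It assumes a plain dynamic-programming edit distance and a hypothetical `closest_question` helper; the simulator itself may use a library implementation, a distance threshold, or a different selection rule.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def closest_question(query, questions):
    """Return the known question with the smallest edit distance to query."""
    return min(questions, key=lambda q: levenshtein(query, q))
```

Matching by edit distance lets the simulator recognize a dataset question even when the incoming prompt differs slightly in whitespace or punctuation.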
gatbontonpc
8839037528
add checkpointing 2026-02-15 21:08:22 +02:00
gatbontonpc
89cab3dbc5
Add readme 2026-02-15 21:08:22 +02:00
gatbontonpc
c2d83ca048
multi source llama-eval 2026-02-15 21:08:22 +02:00
gatbontonpc
c05df17ce3
working llama-eval mc and math suite 2026-02-15 21:08:19 +02:00
David Friehs
27b93cbd15
cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined
2026-02-15 22:38:42 +05:30
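
The two rewrites quoted in commit 27b93cbd15 can be sanity-checked outside CUDA. The sketch below is only a brute-force check in Python, not the kernel code: it assumes the division is C-style integer division (truncating toward zero), that the scale is a 4-bit value, and that `aux32` is an unsigned 32-bit word. The function names are hypothetical, and overflow is not modeled because Python integers are unbounded.

```python
def trunc_div(a, b):
    """Integer division truncating toward zero, as in C (requires b > 0)."""
    q = abs(a) // b
    return q if a >= 0 else -q


def scaled_old(total, scale):
    # (sum * scale + sum / 2) / 4
    return trunc_div(total * scale + trunc_div(total, 2), 4)


def scaled_new(total, scale):
    # (sum * (scale * 2 + 1)) / 8
    return trunc_div(total * (scale * 2 + 1), 8)


def factor_old(aux32):
    # ((aux32 >> 28) * 2 + 1)
    return (aux32 >> 28) * 2 + 1


def factor_new(aux32):
    # (aux32 >> 27 | 1); the shift binds tighter than the bitwise OR
    return aux32 >> 27 | 1


def identities_hold(max_abs_sum=4096):
    """Exhaustively compare old and new forms over the assumed ranges."""
    sums_ok = all(
        scaled_old(s, sc) == scaled_new(s, sc)
        for s in range(-max_abs_sum, max_abs_sum + 1)
        for sc in range(16)
    )
    shifts_ok = all(
        factor_old((hi << 27) | lo) == factor_new((hi << 27) | lo)
        for hi in range(32)               # every combination of the top 5 bits
        for lo in (0, 1, 0x7FFFFFF)       # a few low-bit patterns
    )
    return sums_ok and shifts_ok
```

The shift form works because shifting by 27 keeps one extra low bit, which the `| 1` then forces to 1, giving the same value as doubling the top four bits and adding 1.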