ik_llama.cpp/tests
Kawrakow 2fe098e938
Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
slow with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for GPU pairs
(0,1), (2,3), (0,2), (1,3) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the final solution (I cannot be creating
pairwise communicators and the associated logic for every
possible number of GPUs).
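A minimal sketch of the pairwise idea described above (not the actual ik_llama.cpp code; PairComm, init_pair and allreduce_4gpu are illustrative names, and an in-place float all-reduce is assumed): with communicator groups for the pairs (0,1), (2,3), (0,2), (1,3), a 4-GPU all-reduce becomes two rounds of pairwise all-reduces.

    #include <cuda_runtime.h>
    #include <nccl.h>

    // One NCCL communicator group per GPU pair: (0,1), (2,3), (0,2), (1,3).
    struct PairComm { int devs[2]; ncclComm_t comms[2]; };

    static void init_pair(PairComm & pc, int a, int b) {
        pc.devs[0] = a; pc.devs[1] = b;
        ncclCommInitAll(pc.comms, 2, pc.devs);   // a 2-GPU clique for just this pair
    }

    // Two-stage (butterfly) all-reduce over 4 GPUs using the pairwise groups.
    // Expects pcs[0]=(0,1), pcs[1]=(2,3), pcs[2]=(0,2), pcs[3]=(1,3);
    // bufs and streams are indexed by device id.
    static void allreduce_4gpu(PairComm pcs[4], float * bufs[4], size_t count,
                               cudaStream_t streams[4]) {
        for (int stage = 0; stage < 2; ++stage) {
            ncclGroupStart();
            for (int p = 2*stage; p < 2*stage + 2; ++p) {
                for (int i = 0; i < 2; ++i) {
                    const int dev = pcs[p].devs[i];
                    cudaSetDevice(dev);          // device must be current for its stream
                    ncclAllReduce(bufs[dev], bufs[dev], count, ncclFloat, ncclSum,
                                  pcs[p].comms[i], streams[dev]);
                }
            }
            ncclGroupEnd();
        }
    }

After stage 1, GPUs 0/1 hold the partial sum of 0 and 1, and GPUs 2/3 the partial sum of 2 and 3; stage 2 combines the two partials so every GPU ends up with the full reduction. Both stages for a device are enqueued on the same stream, so stream ordering keeps them in the right order.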

* WIP: Cohere2

* Explicitly set the device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so 3 backends), not using async is slightly faster.
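A sketch of the heuristic just described (the function name is illustrative):

    static bool use_async_eval(bool async_requested, int n_backends) {
        // with 2 GPUs there are 3 backends, and synchronous evaluation is slightly faster
        return async_requested && n_backends >= 4;
    }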

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
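A minimal sketch (not the actual scheduler code; the names are illustrative) of what "use OpenMP" amounts to here: one thread per backend evaluates its split of the compute graph, and OpenMP's persistent worker pool avoids recreating std::thread + std::barrier on every graph evaluation.

    #include <omp.h>
    #include <functional>
    #include <vector>

    // One callable per backend; the real code would set the CUDA device and
    // evaluate that backend's portion of the compute graph inside the callable.
    using split_fn = std::function<void()>;

    static void eval_splits_async(const std::vector<split_fn> & splits) {
        // One thread per backend; schedule(static, 1) pins split i to thread i.
        #pragma omp parallel for schedule(static, 1) num_threads((int) splits.size())
        for (int i = 0; i < (int) splits.size(); ++i) {
            splits[i]();
        }
    }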

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels
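Presumably why this matters (illustrative, not the actual init code): the per-backend dispatch above is itself an OpenMP parallel region, and work inside a split can open nested parallel regions; most runtimes keep only one level active by default, so without raising the limit the inner regions would run single-threaded.

    #include <omp.h>

    static void init_omp_nesting(void) {
        // allow the outer per-backend region plus one nested level inside a split
        omp_set_max_active_levels(2);
    }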

* Be more careful about setting the device before using a stream
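A small sketch of the rule this bullet refers to (helper names are illustrative): a CUDA stream is tied to the device that was current when it was created, and that device must also be current when work is submitted into the stream.

    #include <cuda_runtime.h>

    static cudaStream_t make_stream_on(int device) {
        cudaSetDevice(device);   // make `device` current first
        cudaStream_t s;
        cudaStreamCreate(&s);    // the stream is now bound to `device`
        return s;
    }

    static void copy_on(int device, cudaStream_t s, const float * src, float * dst, size_t n) {
        cudaSetDevice(device);   // set the device again before using its stream
        cudaMemcpyAsync(dst, src, n*sizeof(float), cudaMemcpyDeviceToDevice, s);
    }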

* Command-line option to turn on async. Set to false by default for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
.gitignore tests : gitignore ggml-common.h 2024-03-09 14:17:11 +02:00
CMakeLists.txt Async compute graph evaluation (2 or more GPUs) (#1089) 2025-12-27 08:18:06 +01:00
get-model.cpp ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
get-model.h ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
run-json-schema-to-grammar.mjs json-schema-to-grammar improvements (+ added to server) (#5978) 2024-03-21 11:50:43 +00:00
test-autorelease.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-backend-ops.cpp Update mtmd to improve accuracy of M-RoPE (#993) 2025-11-29 07:27:15 +01:00
test-c.c Nomic Vulkan backend (#4456) 2024-01-29 15:50:50 -05:00
test-chat-parser.cpp Add --webui arg to launch llama.cpp new webui (#786) 2025-10-27 14:22:02 +02:00
test-chat-template.cpp Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
test-chat.cpp fix kimi-k2 tool call (#996) 2025-11-24 06:51:16 +01:00
test-double-float.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-function-calls.cpp Fix for Deepseek r1 parsing (#676) 2025-08-08 13:56:44 +03:00
test-grad0.cpp ggml : refactor rope norm/neox (#7634) 2024-06-05 11:29:20 +03:00
test-grammar-integration.cpp Update grammar (#1023) 2025-11-30 18:45:38 +01:00
test-grammar-llguidance.cpp Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
test-grammar-parser.cpp grammars: x{min,max} repetition operator (#6640) 2024-06-06 10:07:06 +01:00
test-json-partial.cpp Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
test-json-schema-to-grammar.cpp Update grammar (#1023) 2025-11-30 18:45:38 +01:00
test-llama-grammar.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-model-load-cancel.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-opt.cpp code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
test-quantize-fns.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-quantize-perf.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-regex-partial.cpp Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
test-rope.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-sampling.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
test-tokenizer-0.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-0.py py : logging and flake8 suppression refactoring (#7081) 2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh tests : fix test-tokenizer-0.sh 2024-05-28 15:04:09 +03:00
test-tokenizer-1-bpe.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-1-spm.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-random.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00