ik_llama.cpp/examples
Kawrakow a719349982 POC: CUDA tensor parallel (MoE models) (#1022)
* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entire FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

The graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler into creating
two splits that can be done in parallel, and then have a graph split
where the results get combined.
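
The idea of two independent splits followed by a combining split can be illustrated with a toy FFN. This is a minimal sketch and not code from this repository: the helper names (`matvec`, `ffn`, `ffn_split`), the ReLU activation, and the tiny integer weights are assumptions chosen for illustration. Each "device" holds a slice of the hidden dimension (rows of `W_up`, matching columns of `W_down`) and computes a partial output without needing anything from the other device; the partial outputs are summed only at the synchronization point.

```python
def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    # Element-wise activation (assumed here; any element-wise function works).
    return [max(0, a) for a in v]

def ffn(W_up, W_down, x):
    # Reference FFN: y = W_down * relu(W_up * x)
    return matvec(W_down, relu(matvec(W_up, x)))

def ffn_split(W_up, W_down, x, bounds):
    # bounds: one (start, end) range over the hidden dimension per device.
    partials = []
    for start, end in bounds:
        up_i = W_up[start:end]                        # this device's rows of W_up
        down_i = [row[start:end] for row in W_down]   # matching columns of W_down
        partials.append(ffn(up_i, down_i, x))         # independent of other devices
    # Synchronization point: combine the partial results by summation.
    return [sum(vals) for vals in zip(*partials)]

W_up = [[1, -2], [3, 0], [-1, 1], [2, 2]]    # hidden size 4, input size 2
W_down = [[1, 0, 2, -1], [0, 1, 1, 1]]       # output size 2, hidden size 4
x = [1, 2]

full = ffn(W_up, W_down, x)
split = ffn_split(W_up, W_down, x, [(0, 2), (2, 4)])
print(full, split)
```

Because the activation is element-wise, slicing the hidden dimension leaves each partial computation self-contained, so only the final sum requires both devices to have finished.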

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.
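
The granularity rule can be checked with a small calculation. This is a sketch under stated assumptions, not the repository's implementation: the function `split_attention`, its rounding scheme, and the example fractions are hypothetical, and the sizes (64 query heads, 8 KV heads, head size 128) are just a 70B-LLaMA-like example. Rows of Wq/Wo are handed out in units of head size * gqa_ratio, so dividing each device's share by gqa_ratio always yields a whole number of KV heads for Wk/Wv.

```python
def split_attention(n_head, n_head_kv, head_dim, fractions):
    # Hypothetical helper: distribute attention rows over devices.
    assert n_head % n_head_kv == 0
    gqa_ratio = n_head // n_head_kv
    granularity = head_dim * gqa_ratio   # smallest unit of Wq/Wo rows per device
    n_units = n_head_kv                  # (n_head * head_dim) / granularity
    total = sum(fractions)
    units, assigned, cum = [], 0, 0.0
    for f in fractions:
        cum += f
        upto = round(n_units * cum / total)   # cumulative rounding keeps the sum exact
        units.append(upto - assigned)
        assigned = upto
    q_rows = [u * granularity for u in units]      # rows of Wq (and Wo) per device
    kv_rows = [q // gqa_ratio for q in q_rows]     # rows of Wk and Wv per device
    return q_rows, kv_rows

q_rows, kv_rows = split_attention(n_head=64, n_head_kv=8, head_dim=128,
                                  fractions=[0.6, 0.4])
print(q_rows, kv_rows)

# Counter-example: with granularity equal to head size only, a 37/27 head split
# gives 37 * 128 = 4736 Wq rows, and 4736 / 8 = 592 is not a multiple of 128.
bad_kv_rows = (37 * 128) // 8
print(bad_kv_rows % 128)
```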

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disappears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guard against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-01 19:25:40 +01:00
baby-llama Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
batched Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
batched-bench MoE fix for R4 quants (#170) 2025-01-12 13:19:14 +02:00
batched.swift Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
benchmark build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
convert-llama2c-to-ggml build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
cvector-generator CUDA: set compute parameters via command line arguments (#910) 2025-11-07 07:11:23 +02:00
deprecation-warning Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
embedding Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
eval-callback Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
export-lora Merge vulkan code from mainline up to commit of 6/28/2025 (#563) 2025-07-02 08:49:42 +02:00
gbnf-validator Update grammar (#1023) 2025-11-30 18:45:38 +01:00
gguf Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
gguf-hash Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
gguf-split gguf-split : update (#444) 2025-05-23 08:07:42 +03:00
gritlm llama : allow pooled embeddings on any model (#7477) 2024-06-21 08:38:22 +03:00
imatrix Fix imatrix calculation for MLA models (#411) 2025-05-13 17:53:38 +03:00
infill Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
jeopardy build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
llama-bench POC: CUDA tensor parallel (MoE models) (#1022) 2025-12-01 19:25:40 +01:00
llama.android Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama.swiftui Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llava add dry sampler (#513) 2025-06-19 10:24:53 +03:00
lookahead add dry sampler (#513) 2025-06-19 10:24:53 +03:00
lookup add dry sampler (#513) 2025-06-19 10:24:53 +03:00
main Port mdmd from mainline + Qwen2/2.5-VL support (#798) 2025-09-27 08:45:29 +02:00
main-cmake-pkg Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
mtmd Update mtmd to improve accuracy of M-RoPE (#993) 2025-11-29 07:27:15 +01:00
parallel Tool calls support from mainline (#723) 2025-09-01 08:38:49 +03:00
passkey Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
perplexity More informative PPL readout line (#914) 2025-11-07 16:41:24 +02:00
quantize Allow quantization of ffn_gate_inp (#896) 2025-11-05 10:44:32 +02:00
quantize-stats Disable experimental code that causes issues with MSVC (#707) 2025-08-19 18:09:49 +03:00
retrieval Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
rpc RPC: support multiple devices including cpu (#1024) 2025-11-30 18:48:02 +01:00
save-load-state Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
server Update grammar (#1023) 2025-11-30 18:45:38 +01:00
simple Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
speculative Support --device and --device-draft parameter (#866) 2025-10-27 18:13:28 +02:00
sweep-bench sweep-bench: be able to set TG tokens via -n (#897) 2025-11-04 14:39:30 +02:00
sycl Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
tokenize Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
base-translate.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
chat-13B.bat Create chat-13B.bat (#592) 2023-03-29 20:21:09 +03:00
chat-13B.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
chat-persistent.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
chat-vicuna.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
chat.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
CMakeLists.txt Port mdmd from mainline + Qwen2/2.5-VL support (#798) 2025-09-27 08:45:29 +02:00
convert_legacy_llama.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
json_schema_pydantic_example.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
json_schema_to_grammar.py Update grammar (#1023) 2025-11-30 18:45:38 +01:00
llama.vim llama.vim : added api key support (#5090) 2024-01-23 08:51:27 +02:00
llm.vim llm.vim : stop generation at multiple linebreaks, bind to <F2> (#2879) 2023-08-30 09:50:55 +03:00
Miku.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
pydantic_models_to_grammar_examples.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
pydantic_models_to_grammar.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
reason-act.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
regex_to_grammar.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
server_embd.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
server-llama2-13B.sh build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) 2024-06-13 00:41:52 +01:00
ts-type-to-grammar.sh JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00