ik_llama.cpp/github-data/issues/255 - Feature Request_ dynamic layer by layer offloading during prompt proces.md

#255 - Feature Request: dynamic layer by layer offloading during prompt processing for VRAM constrained scenarios

Author binjiechen
State Closed
Created 2025-03-13
Updated 2025-03-15

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

During prompt processing (possibly with a long context), allow dynamic layer-by-layer offloading instead of a fixed offload split. I.e., offload layer 1 to the GPU, process a batch of tokens, then free layer 1 and offload layer 2 to the GPU, and so on. A large batch can be used, and the compute buffers can be freed before token generation. Optionally, some layers can be retained on the GPU if VRAM is large enough. I guess it would only work for parallel = 1.
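For illustration, a minimal sketch of the control flow this would need (all types and function names below are hypothetical, not actual ik_llama.cpp APIs):

```cpp
#include <vector>

// Hypothetical sketch of the proposed streaming offload; none of these types or
// functions exist in ik_llama.cpp, they only illustrate the control flow.
struct Weights { /* one layer's tensors, resident in RAM */ };
struct GpuSlot { /* the same tensors, resident in VRAM   */ };
struct Tensor  { /* activations for the whole batch      */ };

GpuSlot gpu_upload(const Weights &);                     // RAM -> VRAM copy
void    gpu_free(GpuSlot &);                             // release the VRAM
Tensor  layer_forward(const Tensor &, const GpuSlot &);  // run one layer on the GPU

Tensor process_prompt_streaming(const std::vector<Weights> & layers, Tensor x) {
    for (const Weights & w : layers) {
        GpuSlot slot = gpu_upload(w);  // ideally overlapped with the previous
                                       // layer's compute on a separate stream
        x = layer_forward(x, slot);    // the whole batch goes through this layer
        gpu_free(slot);                // make room for the next layer
    }
    // Optionally keep some layers resident if VRAM allows; after the prompt,
    // fall back to the usual fixed split for token generation.
    return x;
}
```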

Motivation

In my experience, the prompt processing stage is compute bound, since a large batch size is usually used. When VRAM < model size, only part of the model can be offloaded to the GPU, and the CPU part can become the bottleneck. So, if we offload layer by layer, the GPU can be fully utilized and offer better performance.

I have a 4090 and a 13600K (power limited to 125 W) and 192 GB of memory. I ran some tests on Qwen 2.5 32B, which has 64 blocks:

| model | size | params | backend | ngl | threads | n_ubatch | fa | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 64 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 2627.90 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 63 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 572.61 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 60 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 173.71 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 40 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 30.66 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 20 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 16.93 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 0 | 13 | 2048 | 1 | 0 | 1 | pp2048 | 10.76 ± 0.00 |

Even if only 1 block is left on the CPU, t/s drops by 78%. I think layer-by-layer offloading should help a lot in this situation. Assuming an 8 GiB/s RAM-to-VRAM transfer speed, offloading the whole model would only cost about 2.7 s in this case, resulting in a speed of about 758 t/s (if compute is hidden by the transfer).
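Spelling out the arithmetic behind this estimate (my assumption: the 21.66 GiB of weights stream at 8 GiB/s, and compute is fully hidden by the transfer):

$$
t_{\text{transfer}} = \frac{21.66\ \text{GiB}}{8\ \text{GiB/s}} \approx 2.7\ \text{s},
\qquad
\frac{2048\ \text{tokens}}{2.7\ \text{s}} \approx 758\ \text{t/s}
$$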

Possible Implementation

No response


💬 Conversation

👤 ikawrakow commented the 2025-03-14 at 08:13:39:

I can look into this next week (travelling right now). But I think there may be something wrong with the offloading to the GPU: I don't think the non-offloaded layers are run on the CPU. To check, build without CUDA and run the same benchmarks. I expect performance much better than what you observe with zero layers offloaded.


👤 binjiechen commented the 2025-03-14 at 10:42:34:

> I can look into this next week (travelling right now). But I think there may be something wrong with the offloading to the GPU: I don't think the non-offloaded layers are run on the CPU. To check, build without CUDA and run the same benchmarks. I expect performance much better than what you observe with zero layers offloaded.

The result with a CPU-only build is basically the same:

| model | size | params | backend | threads | n_ubatch | fa | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | BLAS | 13 | 2048 | 1 | 0 | 1 | pp2048 | 11.79 ± 0.00 |

I think the non-offloaded layers are indeed run on the CPU: in the previous test with the CUDA backend, I observed full CPU utilization in htop. Anyway, thanks for your great work, and enjoy your travels!


👤 ikawrakow commented the 2025-03-15 at 08:32:57:

Very strange. My GPU is an RTX 4080, so I can fit a maximum of 45 layers on the GPU for the 32B Qwen2.5, and here is what I get with that:

| model | size | params | backend | ngl | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CUDA | 45 | 16 | 2048 | pp2048 | 1030.55 ± 11.82 |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CUDA | 40 | 16 | 2048 | pp2048 | 985.30 ± 2.18 |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CUDA | 20 | 16 | 2048 | pp2048 | 817.11 ± 1.09 |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CUDA | 10 | 16 | 2048 | pp2048 | 750.98 ± 0.70 |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CUDA | 0 | 16 | 2048 | pp2048 | 703.04 ± 16.27 |
| qwen2 ?B Q4_K - Medium | 18.50 GiB | 32.76 B | CPU | 0 | 32 | 2048 | pp2048 | 40.63 ± 0.55 |

The last line in the table is with a CPU-only build; the other line with zero layers offloaded is from the CUDA build. So, clearly the model gets offloaded to the GPU for the actual computation. Performance with 0 layers offloaded is ~70% of the performance with 45 layers offloaded. When 45 layers are offloaded, computing a batch of 2048 tokens takes about 2 seconds. With zero layers offloaded it is 2.9 seconds, so the offloading takes 0.9 seconds. 45 layers are 45/64 × 18.5 ≈ 13 GiB, so we can estimate the throughput of the PCI-E transfer to be 13/0.9 ≈ 14.4 GiB/s, pretty much in line with expectations.

It would seem that in your case the layers do not get offloaded to the GPU for some reason. What is the exact model you are using?

Btw, the current multi-threading here (and also upstream) is not very good for CPUs with performance and efficiency cores. The work simply gets split into n_threads equal chunks, so the duration of each operation is determined by the performance of the efficiency cores. Have you tried using just the P cores?


👤 ikawrakow commented the 2025-03-15 at 11:06:05:

Aha, I know where the problem is. Try disabling BLAS. I never enable it because the iqk_mul_mat matrix multiplications are faster than any CPU BLAS implementation I have tried.

What happens with BLAS enabled is this: the scheduler goes through all back-ends and checks whether they support the operation being scheduled. If more than one back-end supports the operation, it is scheduled on the back-end that already has the model weights participating in the op. Hence, with BLAS enabled (another back-end), matrix multiplications for non-offloaded layers get scheduled on the BLAS back-end, and so they run on the CPU.
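In code terms, the selection heuristic described above looks roughly like this (a simplified sketch, not the actual ggml scheduler; the `Backend` interface and its methods are hypothetical):

```cpp
#include <vector>

// Simplified illustration of the backend-selection heuristic described above.
struct Op;
struct Backend {
    virtual bool supports(const Op &) const = 0;       // can this backend run the op?
    virtual bool holds_weights(const Op &) const = 0;  // do the op's weights live here?
    virtual ~Backend() = default;
};

Backend * pick_backend(const Op & op, const std::vector<Backend *> & backends) {
    // Among all backends that support the op, prefer the one that already
    // holds the model weights participating in it. With BLAS enabled, the
    // CPU-resident weights of non-offloaded layers match the BLAS backend,
    // so those matmuls never reach the CUDA backend.
    for (Backend * b : backends) {
        if (b->supports(op) && b->holds_weights(op)) {
            return b;
        }
    }
    return nullptr; // fall back to the default backend
}
```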


👤 binjiechen commented the 2025-03-15 at 12:58:47:

> Aha, I know where the problem is. Try disabling BLAS. I never enable it because the iqk_mul_mat matrix multiplications are faster than any CPU BLAS implementation I have tried.
>
> What happens with BLAS enabled is this: the scheduler goes through all back-ends and checks whether they support the operation being scheduled. If more than one back-end supports the operation, it is scheduled on the back-end that already has the model weights participating in the op. Hence, with BLAS enabled (another back-end), matrix multiplications for non-offloaded layers get scheduled on the BLAS back-end, and so they run on the CPU.

Ah, yes, I resolved the problem. I now have a better understanding of how llama.cpp works. Thank you very much!

| model | size | params | backend | ngl | threads | n_ubatch | fa | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 64 | 6 | 2048 | 1 | 0 | 1 | pp2048 | 2657.76 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 32 | 6 | 2048 | 1 | 0 | 1 | pp2048 | 1622.08 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 0 | 6 | 2048 | 1 | 0 | 1 | pp2048 | 1161.20 ± 0.00 |

Results look really great this time.

| model | size | params | backend | threads | n_ubatch | fa | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CPU | 13 | 2048 | 1 | 0 | 1 | pp2048 | 10.04 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CPU | 8 | 2048 | 1 | 0 | 1 | pp2048 | 8.32 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CPU | 6 | 2048 | 1 | 0 | 1 | pp2048 | 11.46 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | BLAS | 6 | 2048 | 1 | 0 | 1 | pp2048 | 11.84 ± 0.00 |

For the CPU backend, it's true that using only the P cores gives better performance. Intel oneMKL BLAS is slightly faster under this setting.

| model | size | params | backend | ngl | threads | n_ubatch | fa | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 64 | 6 | 2048 | 1 | 0 | 1 | tg128 | 25.23 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA+BLAS | 64 | 6 | 2048 | 1 | 0 | 1 | tg128 | 25.60 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 32 | 6 | 2048 | 1 | 0 | 1 | tg128 | 4.00 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA+BLAS | 32 | 6 | 2048 | 1 | 0 | 1 | tg128 | 4.45 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA | 0 | 6 | 2048 | 1 | 0 | 1 | tg128 | 2.18 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CUDA+BLAS | 0 | 6 | 2048 | 1 | 0 | 1 | tg128 | 2.43 ± 0.00 |

Also, I find that for token generation, which seems memory bound, CUDA+BLAS gives better performance (for ngl > 64 they're the same). So is it possible to add an option that keeps the CPU as a valid backend and does the computation there during token generation?


👤 ikawrakow commented the 2025-03-15 at 13:19:12:

What happens if you add -rtr 1? Is oneMKL still faster for CPU-only PP?


👤 binjiechen commented the 2025-03-15 at 13:48:19:

| model | size | params | backend | threads | n_ubatch | fa | mmap | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | CPU | 6 | 2048 | 1 | 0 | 1 | 1 | pp2048 | 16.96 ± 0.00 |
| qwen2 ?B Q5_K - Medium | 21.66 GiB | 32.76 B | BLAS | 6 | 2048 | 1 | 0 | 1 | 1 | pp2048 | 11.78 ± 0.00 |

With -rtr 1, the BLAS version is not affected and the non-BLAS version is significantly faster.


👤 ikawrakow commented the 2025-03-15 at 16:07:08:

In that case, is there a reason to use BLAS? Your TG benchmark shows slightly better TG performance with BLAS, but I don't really understand why that would be the case. For TG, matrix multiplications are not done by BLAS even if it is enabled.


👤 binjiechen commented the 2025-03-15 at 16:43:12:

> In that case, is there a reason to use BLAS? Your TG benchmark shows slightly better TG performance with BLAS, but I don't really understand why that would be the case. For TG, matrix multiplications are not done by BLAS even if it is enabled.

No, BLAS is not needed. I thought that during TG the non-offloaded layers were also computed on the GPU, and what I meant previously was to keep the computation on the CPU for the non-offloaded layers, so the weight transfer does not happen, which might increase performance.

But I'm confused now: when ngl is 0, if all computation were done on the GPU, then TG speed shouldn't be as high as 2 t/s. So during TG, is the computation for non-offloaded layers actually done on the CPU?


👤 ikawrakow commented the 2025-03-15 at 16:48:25:

> So during TG, is the computation for non-offloaded layers actually done on the CPU?

Yes. There is a magic threshold set in the CUDA back-end (currently 32). If the batch size is less than that, tensors are not offloaded to the GPU, and the calculation is done on the CPU. One could try to be more intelligent and make it dependent on the amount of data that needs to be uploaded, the PCI-E speed, the relative CPU vs GPU matrix multiplication performance, etc. But for now, that's what it is.
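In code terms, the behaviour described above is roughly (an illustrative sketch; only the threshold value 32 comes from the comment above, the function name is hypothetical):

```cpp
// Illustration of the "magic threshold" described above: below this batch size,
// uploading the weights over PCI-E would cost more than the matrix
// multiplication itself, so the op stays on the CPU.
bool should_offload_to_gpu(int batch_size) {
    const int min_batch_size = 32; // current threshold in the CUDA back-end
    return batch_size >= min_batch_size;
}
```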


👤 binjiechen commented the 2025-03-15 at 17:00:25:

Ok, I got it now. Thanks for your patience!