ik_llama.cpp

History

Kawrakow f725576345 Minor performance improvements (#179 ) * Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without associated NEON egression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point to have the special purpose implementation. * q4_0_r8 (AVX2) * q4_0_r8 (NEON) Tiny bit faster PP (~128 vs ~126 t/s), same TG. * q4_0_r8 (Zen4) Somehow only marginally faster? 268 t/s vs 261 t/s * q4_0_r8 (Zen4) - slightly better 282 t/s for a pure q4_0 L3-8B quantization. * Apply platform specific modifications when repacking E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0. This results in a ~3% performance improvement. Hence, * Changed the signature of the repack_X functions to take a bool argument indicating if the repacking is done online and, if so, apply modifications as appropriate while repacking. * Added iqk_modify_tensor to apply modifications to models that have already been repacked while loading the model. Caveat: just like rtr, this needs to have mmap disabled (else one would need to move the data to a not mmap-ed buffer, so much more complicated). * Apply platform specific modifications when repacking On Zen4 we can pre-convert the signed quants in q8_0_r4 and q8_k_r8 to unsigned thus avoiding these operations in matrix multiplications. With this change we hit PP-512 = 382.40 t/s (q8_k_r8) PP-512 = 306.92 t/s (q8_0_r4) for L3-8B on a Ryzen-7950X using q8_0 KV-cache. * Process up to 16 columns per kernel call for q8_k_r8 This brings PP-512 up to 389 t/s. * Be able to load Deepseek-v2-Lite --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>		2025-01-27 18:53:47 +02:00
..
CMakeLists.txt	Be able to repack tensors at run time (#147 )	2024-12-17 14:16:34 +01:00
llama-grammar.cpp	Merge mainline - Aug 12 2024 (#17 )	2024-08-12 15:14:32 +02:00
llama-grammar.h	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00
llama-impl.h	Time to fix replace_all (#68 )	2024-09-28 17:59:47 +03:00
llama-sampling.cpp	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00
llama-sampling.h	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00
llama-vocab.cpp	Deepseek V3 support added (#176 )	2025-01-23 18:24:10 +02:00
llama-vocab.h	Merge mainline - Aug 12 2024 (#17 )	2024-08-12 15:14:32 +02:00
llama.cpp	Minor performance improvements (#179 )	2025-01-27 18:53:47 +02:00
unicode-data.cpp	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00
unicode-data.h	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00
unicode.cpp	Deepseek V3 support added (#176 )	2025-01-23 18:24:10 +02:00
unicode.h	Merge mainline llama.cpp (#3 )	2024-07-27 07:55:01 +02:00