ik_llama.cpp/ggml/src
Latest commit b8d1fac97b by Kawrakow:
Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

Since there is no way to tell whether a repacked quant has been modified,
the modification had to be removed, at the cost of a slight performance
decrease. This affects q8_0_r8, q8_KV_r8, and q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
ggml-cann Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-cuda Make Q8_0 KV cache work with mla=2,fa on CUDA (#264) 2025-03-18 15:40:47 +01:00
ggml-sycl Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
iqk Convert models to row-interleaved quants using the quantize tool (#272) 2025-03-21 07:23:36 +01:00
kompute@4565194ed7 Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
kompute-shaders Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llamafile Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
vulkan-shaders Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
CMakeLists.txt Compile time option to use bf16 for quants without MMQ kernels (#261) 2025-03-18 07:37:10 +01:00
ggml-aarch64.c Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-aarch64.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
ggml-alloc.c Give the user the option to override where model weights are stored (#232) 2025-02-25 17:55:58 +02:00
ggml-backend-impl.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
ggml-backend.c FlashMLA-2 (CPU): faster and smaller compute buffer size (#253) 2025-03-13 12:07:43 +02:00
ggml-blas.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-cann.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-common.h Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194) 2025-02-09 09:14:52 +02:00
ggml-cuda.cu Prevent FlashMLA-1 from running on CUDA (#268) 2025-03-19 13:03:59 +01:00
ggml-impl.h Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-kompute.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-metal.m Faster MoE inference (#112) 2024-10-31 12:05:27 +01:00
ggml-metal.metal Faster MoE inference (#112) 2024-10-31 12:05:27 +01:00
ggml-quants.c Flash MLA (CPU only) (#240) 2025-03-03 15:17:51 +02:00
ggml-quants.h IQ1_M_R4: better 1.75 bpw quants (#187) 2025-02-06 14:08:52 +02:00
ggml-rpc.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-sycl.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml-vulkan.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
ggml.c Convert models to row-interleaved quants using the quantize tool (#272) 2025-03-21 07:23:36 +01:00