✨ #626 - Feature Request: Add IQK GEMM for IQ1_M
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-07-18 |
| Updated | 2025-07-18 |
## Description

### Prerequisites

- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
Quite a few people are trying to run Unsloth models that contain tensors quantized with IQ1_M. In addition, there are now the quantization recipes prepared by the @Thireus GGUF suite, which also tend to contain IQ1_M when a low-bpw has been requested.
When a model contains IQ1_M FFN tensors and -fmoe is specified, ik_llama.cpp will crash with an assert when the number of tokens processed by one of the routed experts is less than 32. This is due to the fused ffn_up+ffn_gate op assuming the presence of an IQK GEMM kernel, which is not implemented.
So, either add IQK GEMM for IQ1_M, or at least guard against the absence of a GEMM kernel in the CPU implementation of the fused ffn_up+ffn_gate op.
### Motivation
Same as the feature description above.
### Possible Implementation
Either add IQK GEMM for IQ1_M, or at least guard against the absence of a GEMM kernel in the CPU implementation of the fused ffn_up+ffn_gate op.
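The guard option could look something like the following minimal sketch. All names here (`iqk_gemm_available`, `ffn_up_gate_path`, the type enum) are hypothetical and illustrative, not ik_llama.cpp's actual symbols; the point is only the dispatch shape: query kernel availability for the quant type before taking the fused path, and fall back to the unfused ops instead of asserting at runtime.

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for ggml's quantization type enum.
enum ggml_type_sketch { TYPE_IQ1_S, TYPE_IQ1_M, TYPE_Q4_K };

// Hypothetical capability query: true only for quant types that
// have an IQK GEMM kernel implemented.
static bool iqk_gemm_available(ggml_type_sketch t) {
    switch (t) {
        case TYPE_IQ1_S:
        case TYPE_Q4_K:
            return true;   // kernel implemented
        default:
            return false;  // e.g. IQ1_M, per this issue
    }
}

// Dispatch sketch: only fuse ffn_up+ffn_gate when the fused path's
// GEMM kernel actually exists for this quant type.
static std::string ffn_up_gate_path(ggml_type_sketch t) {
    if (!iqk_gemm_available(t)) {
        return "unfused";  // guard: run the two mat-muls separately
    }
    return "fused";
}
```

With a guard like this, -fmoe would silently lose the fusion benefit for IQ1_M tensors rather than crashing; adding the actual IQK GEMM kernel is the better fix.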
## 💬 Conversation
👤 ubergarm commented on 2025-07-18 at 14:43:32:
I'll not open a new issue regarding unsloth's Kimi-K2-Instruct-IQ1_S failing with -fmoe, as discussed in other threads here and reported on Hugging Face here. I also recreated the issue and observed that removing -fmoe allows that model to run.
I confirmed using the gguf-dump.py script that the model in question indeed has a handful of IQ1_M ffn tensors:
```console
$ cat logs/gguf-dump-Kimi-K2-Instruct-UD-IQ1_S-0000* | grep IQ1_M
163: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.18.ffn_gate_exps.weight
167: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.18.ffn_up_exps.weight
111: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.50.ffn_gate_exps.weight
115: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.50.ffn_up_exps.weight
129: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.51.ffn_gate_exps.weight
133: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.51.ffn_up_exps.weight
147: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.52.ffn_gate_exps.weight
151: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.52.ffn_up_exps.weight
165: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.53.ffn_gate_exps.weight
169: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.53.ffn_up_exps.weight
183: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.54.ffn_gate_exps.weight
187: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.54.ffn_up_exps.weight
21: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.56.ffn_gate_exps.weight
25: 5637144576 | 7168, 2048, 384, 1 | IQ1_M | blk.56.ffn_up_exps.weight
```
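The grep above can also be done programmatically. A small standalone helper, sketched below, collects tensor names matching a given quantization type from (name, type) pairs such as those in a GGUF dump; `tensors_with_quant` is a hypothetical name, not part of any shipped tool.

```cpp
#include <string>
#include <utility>
#include <vector>

// Collect the names of tensors whose quantization type matches `quant`
// (e.g. "IQ1_M") from (tensor name, type name) pairs, as one would
// obtain them by parsing a GGUF dump.
static std::vector<std::string> tensors_with_quant(
        const std::vector<std::pair<std::string, std::string>> & tensors,
        const std::string & quant) {
    std::vector<std::string> out;
    for (const auto & [name, type] : tensors) {
        if (type == quant) {
            out.push_back(name);
        }
    }
    return out;
}
```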
Given the "unsloth dynamic" is to change the tensor size up and down across layers for the same tensor name, it wasn't obvious from the first GGUF splits that it contained IQ1_M.
👤 ikawrakow commented on 2025-07-18 at 14:46:02:
I created issue #626 for this, so no need to add another one.
👤 ubergarm commented on 2025-07-18 at 17:34:41:
Confirmed I can now run unsloth's Kimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf with -fmoe! Thanks!
```console
$ ./build/bin/llama-server --version
version: 3808 (38012f72)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
```