🐛 #167 - Bug: Unable to quantize Falcon 10B 1.58 bitnet model
| Author | raymond-infinitecode |
|---|---|
| State | ❌ Closed |
| Created | 2025-01-09 |
| Updated | 2025-01-11 |
Description
What happened?
Model Source: https://huggingface.co/tiiuae/Falcon3-10B-Instruct-1.58bit/tree/main
llama-quantize ggml-model-f32.gguf output.gguf IQ1_BN
output
main: build = 3525 (3e685162)
main: built with MSVC 19.37.32825.0 for x64
main: quantizing 'd:\llamafile-0.9.0\ggml-model-f32.gguf' to 'output.gguf' as IQ1_BN
ggml_calloc: failed to allocate 0.00 MB
D:\ik_llama.cpp\ggml\src\ggml.c:378: fatal error
Name and Version
D:\ik_llama.cpp\build\bin\Release>llama-cli --version
version: 3525 (3e685162)
built with MSVC 19.37.32825.0 for x64
What operating system are you seeing the problem on?
No response
Relevant log output
No response
💬 Conversation
👤 raymond-infinitecode commented on 2025-01-09 at 15:39:01:
How to convert that model to a GGUF that can be used with ik_llama.cpp?
👤 ikawrakow commented on 2025-01-09 at 15:48:13:
I haven't looked into this model at all. Does it work in mainline llama.cpp? I see them talking about cloning a Microsoft BitNet repository to use this model, so this does not look like a standard llama.cpp GGUF to me.
👤 raymond-infinitecode commented on 2025-01-10 at 03:02:26:
Hi Ikawrakow, it doesn't work with llama.cpp, but it works with the BitNet repository https://github.com/microsoft/BitNet. To be precise, it works only with the [merge-dev] branch of https://github.com/Eddie-Wang1120/llama.cpp.git.
👤 ikawrakow commented on 2025-01-10 at 07:14:34:
When a ternary Falcon3 model is released in a more standard format, it will be supported also here. In the meantime you can use the quoted Microsoft BitNet repository.
👤 raymond-infinitecode commented on 2025-01-10 at 11:17:41:
The problem with the Microsoft BitNet repository is that llama-server is not built. I wonder if they did that intentionally.
👤 ikawrakow commented on 2025-01-10 at 11:34:24:
And the problem with the model that you want to run is that it is stored quantized as I2_S, which is Microsoft BitNet specific, and does not exist anywhere else. There is no f16 or f32 or q8_0 GGUF. If I follow the BitNet setup instructions, running
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s
actually fetches an f32 version of Falcon3-7B-Instruct-1.58bit from tiiuae/Falcon3-7B-Instruct-1.58bit. Quantizing that model to IQ1_BN or IQ2_BN works just fine. There is a minor modification required in llama.cpp to add the Falcon3 pre-tokenizer configuration, and then everything works.
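For the 7B case, the workflow described above amounts to two steps. The setup_env.py invocation is quoted from the BitNet instructions, and the quantize step follows the command form used earlier in this issue; the output filename and the location of the downloaded f32 GGUF are illustrative, not verified:

```
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s
llama-quantize ggml-model-f32.gguf falcon3-7b-instruct-iq2_bn.gguf IQ2_BN
```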
But to use the 10B model, which appears to be available only as BitNet I2_S quants, one would need to write an I2_S -> IQ2_BN or IQ1_BN or F16/F32 converter. I think it is much easier to ask tiiuae to post the model in a standard llama.cpp type (f16, f32, q8_0) than to write converters from obscure quantization types.
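As a purely illustrative aside, such a converter would essentially unpack 2-bit ternary codes back into a standard float type. The sketch below is not the actual I2_S layout nor what was later implemented on the linked branch; the bit ordering, the code-to-value mapping, and the single per-tensor scale are all assumptions.

```cpp
// Hypothetical I2_S -> f32 unpacking sketch (layout details are assumptions).
#include <cstdint>
#include <cstddef>

// Assumes: 4 weights per byte, low bits first, 2-bit codes 0/1/2 -> -1/0/+1,
// and one float scale per tensor.
void i2s_to_f32(const uint8_t * packed, float scale, float * out, size_t n_weights) {
    for (size_t i = 0; i < n_weights; ++i) {
        const uint8_t byte = packed[i / 4];
        const int     code = (byte >> (2 * (i % 4))) & 0x3;
        out[i] = scale * (float)(code - 1);
    }
}
```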
👤 ikawrakow commented on 2025-01-10 at 15:46:36:
OK, it doesn't seem to be that hard. WIP on this branch
👤 raymond-infinitecode commented on 2025-01-11 at 05:09:18:
Wow, you are really a genius, completing the conversion implementation in less than half a day!