ik_llama.cpp/src
ubergarm baeefb4731
Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.

There are still some cases where it doesn't work:
* offloading >=60 layers to the GPU
* running with flash attention

* Remove seemingly unused llm_tensor enums

Both of these appear unused, and LLM_TENSOR_ATTN_POST_NORM already
exists and seems to serve the same purpose. They don't appear to be
used in the Python conversion code either.

So these were removed as likely just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non flash attention precision to f32 on GLM4

* Remove reshape_3d() for Vcur in build_glm4()

This fixes non-flash-attention inference on both CPU and CUDA.
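
The two precision bullets above reflect a common failure mode: with long rows and large activations, accumulating the K·Q dot products in f16 can overflow (f16 tops out near 65504), which is a plausible reason GLM4 needs the f32 override. A minimal sketch of the effect, using NumPy with hypothetical sizes and magnitudes, not actual ik_llama.cpp code:

```python
import numpy as np

# Hypothetical query/key rows; the length and magnitudes are chosen only
# to demonstrate overflow, not taken from the GLM4 model.
q = np.full(4096, np.float16(8.0))
k = np.full(4096, np.float16(8.0))

# Accumulate the dot product in float16, as a low-precision kernel might.
acc16 = np.float16(0.0)
for qi, ki in zip(q, k):
    acc16 = np.float16(acc16 + np.float16(qi * ki))

# The same dot product accumulated in float32.
acc32 = np.dot(q.astype(np.float32), k.astype(np.float32))

print(np.isinf(acc16))  # True: 4096 * 64 = 262144 exceeds the f16 max (~65504)
print(acc32)            # 262144.0
```

Once a single attention score saturates to inf, the softmax over that row produces NaNs, so forcing f32 precision for these ops on affected architectures keeps the scores finite.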
2025-04-26 17:34:04 +02:00
CMakeLists.txt Be able to repack tensors at run time (#147) 2024-12-17 14:16:34 +01:00
llama-grammar.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
llama-grammar.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-impl.h Add copyright notices (#317) 2025-04-07 10:43:26 +02:00
llama-sampling.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-sampling.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-vocab.cpp LlaMA-4 support (text only) (#321) 2025-04-10 09:05:21 +02:00
llama-vocab.h Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
llama.cpp Add GLM-4-0414 Model Support (#344) 2025-04-26 17:34:04 +02:00
unicode-data.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
unicode-data.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
unicode.cpp Deepseek V3 support added (#176) 2025-01-23 18:24:10 +02:00
unicode.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00