ik_llama.cpp/src
ubergarm baeefb4731
Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.

There are still some cases where it doesn't work:
* offloading >=60 layers to the GPU
* running with flash attention

* Remove seemingly unused llm_tensor enums

Both of these appear unused, and LLM_TENSOR_ATTN_POST_NORM already
exists and seems to serve the same purpose. They don't appear to be
used in the Python conversion code either.

So these were removed as likely just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non flash attention precision to f32 on GLM4

* Remove reshape_3d() for Vcur in build_glm4()

This fixes non-flash-attention inference on both CPU and CUDA.
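
The two precision bullets above reflect a common failure mode: with long rows and large activations, accumulating the K·Q dot products in f16 can overflow (f16 tops out near 65504), which is a plausible reason GLM4 needs the f32 override. A minimal sketch of the effect, using NumPy with hypothetical sizes and magnitudes, not actual ik_llama.cpp code:

```python
import numpy as np

# Hypothetical query/key rows; the length and magnitudes are chosen only
# to demonstrate overflow, not taken from the GLM4 model.
q = np.full(4096, np.float16(8.0))
k = np.full(4096, np.float16(8.0))

# Accumulate the dot product in float16, as a low-precision kernel might.
acc16 = np.float16(0.0)
for qi, ki in zip(q, k):
    acc16 = np.float16(acc16 + np.float16(qi * ki))

# The same dot product accumulated in float32.
acc32 = np.dot(q.astype(np.float32), k.astype(np.float32))

print(np.isinf(acc16))  # True: 4096 * 64 = 262144 exceeds the f16 max (~65504)
print(acc32)            # 262144.0
```

Once a single attention score saturates to inf, the softmax over that row produces NaNs, so forcing f32 precision for these ops on affected architectures keeps the scores finite.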
2025-04-26 17:34:04 +02:00
CMakeLists.txt Be able to repack tensors at run time (#147) 2024-12-17 14:16:34 +01:00
llama-grammar.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
llama-grammar.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-impl.h Add copyright notices (#317) 2025-04-07 10:43:26 +02:00
llama-sampling.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-sampling.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
llama-vocab.cpp LlaMA-4 support (text only) (#321) 2025-04-10 09:05:21 +02:00
llama-vocab.h Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
llama.cpp Add GLM-4-0414 Model Support (#344) 2025-04-26 17:34:04 +02:00
unicode-data.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
unicode-data.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
unicode.cpp Deepseek V3 support added (#176) 2025-01-23 18:24:10 +02:00
unicode.h Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00