llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-04-26 18:42:50 +02:00

Author	SHA1	Message	Date
SamareshSingh	81df3f7cfa	fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102 ) * llama: fix crash in print_info for GLM-DSA when vocab_only is set * addressed code review comments * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-20 10:32:46 +03:00
Sigbjørn Skjæret	4f02d47339	model : refactor bias tensor variable names (#22079 ) * refactor bias tensor variable names * use create_tensor_qkv for jina-bert-v2	2026-04-18 20:12:00 +02:00
Johannes Gäßler	fd1c0ec3f0	llama: fit ctx size for CPU only (#21568 )	2026-04-18 08:16:04 +02:00
Eric Zhang	fcc7508759	model : Gemma4 model type detection (#22027 ) * model : Gemma4 model type detection * model : Gemma4 model type detection	2026-04-17 10:07:11 +02:00
Xuan-Son Nguyen	089dd41fe3	cmake: use glob to collect src/models sources (#22005 )	2026-04-16 23:25:16 +02:00
Xuan-Son Nguyen	4fbdabdc61	model: using single llm_build per arch (#21970 ) * model: using single llm_build per arch * fix merge * nits	2026-04-16 21:10:22 +02:00
PikaPikachu	9db77a020c	model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245 ) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s	2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret	f772f6e434	model : support NVFP4 tensors for Gemma4 (#21971 ) * support nvfp4 tensors for Gemma4 * add wo_s to build_attn * add wo_s to build_attn * fix glm4	2026-04-16 16:51:47 +02:00
Xuan-Son Nguyen	fae3a28070	ggml : remove ggml-ext.h (#21869 ) * ggml: correct placement of ggml-ext.h * ggml : remove ggml-ext.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-14 17:32:58 +03:00
Johannes Gäßler	865ff06b2f	TP: fix Qwen 3 Next data split (#21732 )	2026-04-11 09:23:42 +02:00
MoonRide303	e62fa13c24	model : make Gemma 4 shared-KV tail attn_k tensors optional on load (#21739 )	2026-04-10 21:45:50 +02:00
Johannes Gäßler	d6f3030047	ggml: backend-agnostic tensor parallelism (experimental) (#19378 ) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. 
It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-09 16:42:19 +02:00
Xuan-Son Nguyen	057dba336e	model: fix multimodal padding token for gemma3n/gemma4 (#21625 ) * model: fix multimodal padding token for gemma3n/gemma4 * nits	2026-04-09 12:18:23 +02:00
Piotr Wilkin (ilintar)	0ec191e1d7	vocab: add gemma4 tokenizer tests, fix edge case (#21534 ) * YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests! * Remove unnecessary hash from update script. * minor: move constant	2026-04-09 11:41:14 +02:00
Aldehir Rojas	d9a12c82f0	vocab : remove </s> eog token if gemma4 (#21492 )	2026-04-08 09:53:06 -05:00
Erik Scholz	3ba12fed0a	kv-cache : extend cache quantization checks (#21586 ) to also check for enabled flash attention, instead of just auto.	2026-04-08 16:08:57 +03:00
Georgi Gerganov	5764d7c6a6	gemma : perform per-layer projections in the first layer (#21612 ) * gemma : reduce graph splits by keeping per-layer ops in the input layer * gemma : put the per-layer proj in the first layer * cont : move the projection before the layer loop	2026-04-08 16:06:30 +03:00
Georgi Gerganov	4eb19514dd	kv-cache : support attention rotation for heterogeneous iSWA (#21513 ) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert	2026-04-07 20:31:28 +03:00
Son H. Nguyen	0d049d6a92	unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257 ) * unicode : add custom Qwen2 regex handler to fix segfault on long input std::regex uses recursive backtracking internally, which causes a stack overflow (segfault) when tokenizing long sequences of repeated characters (e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to the std::regex fallback path instead of using a custom handler. Add unicode_regex_split_custom_qwen2() following the established pattern used by gpt2, llama3, kimi_k2, and afmoe custom handlers. Closes: https://github.com/ggml-org/llama.cpp/issues/21113 * cont : remove TODO comment * cont : update comment to reflect original regex * use the correct regex in the comment this time... [no ci] --------- Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-04-07 16:13:38 +03:00
Johannes Gäßler	a8ec0df461	llama: remove per-arch tensor name lists (#21531 )	2026-04-07 15:02:03 +02:00
Pasha Khosravi	2e1f0a889e	ggml: add Q1_0 1-bit quantization support (CPU) (#21273 ) * ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU) * add generic fallback for x86 * remove Q1_0 (group size 32) * rename Q1_0_g128 => Q1_0 * fix Q1_0 LlamaFileType Enum * Fix trailing spaces; add generic fallback for other backends * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix /r/n spacing + arch-fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-06 20:55:21 +02:00
Aldehir Rojas	4aa962e2b0	vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488 )	2026-04-06 09:08:37 -05:00
Georgi Gerganov	400ac8e194	convert : set "add bos" == True for Gemma 4 (#21500 ) * convert : set "add bos" == True for Gemma 4 * cont : handle old GGUFs	2026-04-06 13:52:07 +03:00
anchortense	58190cc84d	llama : correct platform-independent loading of BOOL metadata (#21428 ) * model-loader : fix GGUF bool array conversion * model-loader : fix remaining GGUF bool pointer uses	2026-04-06 01:40:38 +02:00
Richard Davison	af76639f72	model : add HunyuanOCR support (#21395 ) * HunyuanOCR: add support for text and vision models - Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge - Add separate HUNYUAN_OCR chat template (content-before-role format) - Handle HunyuanOCR's invalid pad_token_id=-1 in converter - Fix EOS/EOT token IDs from generation_config.json - Support xdrope RoPE scaling type - Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.) - Register HunYuanVLForConditionalGeneration for both text and mmproj conversion * fix proper mapping * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * address comments * update * Fix typecheck * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-05 23:32:14 +02:00
Aldehir Rojas	b8635075ff	common : add gemma 4 specialized parser (#21418 ) * common : add gemma4 dedicated parser * cont : add '<|tool_response>' as eog * cont : emit JSON from Gemma4 tool call AST * cont : more fixes * cont : refactor convert function * cont : refine rules and mapping * cont : add more tests * cont : clean up * cont : remove autoparser gemma4 implementation * cont : more cleanup * cont : rename gemma4.jinja to match the others * cont : add custom template to support interleaved thinking * cont : preserve reasoning in model turns * cont : fix initializer error * cont : fix unused vars * cont : fix accidental static * cont : fix specialized_template signature * fix extra semicolon * remove debug line and extra space [no ci]	2026-04-04 20:39:00 +02:00
SamareshSingh	650bf14eb9	llama-model: read final_logit_softcapping for Gemma 4 (#21390 )	2026-04-04 13:05:10 +02:00
Aman Gupta	b7ad48ebda	llama: add custom newline split for Gemma 4 (#21406 )	2026-04-04 15:06:34 +08:00
Piotr Wilkin (ilintar)	d3416a4aa9	fix: remove stale assert (#21369 )	2026-04-03 13:40:41 +02:00
Piotr Wilkin (ilintar)	b069b10ab4	vocab: fix Gemma4 tokenizer (#21343 ) * seems to work * fix case with new line Co-authored-by: sayap <sokann@gmail.com> * gemma 4: fix pre tok regex --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: sayap <sokann@gmail.com>	2026-04-03 10:33:03 +02:00
Georgi Gerganov	39b27f0da0	(revert) kv-cache : do not quantize SWA KV cache (#21332 ) This reverts commit `17193cce34`.	2026-04-03 09:07:01 +03:00
Bartowski	7992aa7c8e	tests : add unit test coverage for llama_tensor_get_type (#20112 ) * Add unit test coverage for llama_tensor_get_type * Fix merge conflicts, add more schemas * clang formatter changes * Trailing whitespace * Update name * Start rebase * Updating files with upstream changes prior to rebase * Changes needed from rebase * Update attn_qkv schema, change throw behaviour * Fix merge conflicts * White space * Update with latest changes to state counters * Revert accidental personal CLAUDE.md changes * Change quotation mark * Reuse metadata.name since we have it * Move test-only stuff out of llama-quant.cpp * Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns * cont : initial deslop guidelines * Cleanup based on review comments * Continue cleanup * Small cleanup * Manually set proper ordering of tensors, mostly applies to gemma * Formatting * Update tests/test-quant-type-selection.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix merge conflicts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-02 22:53:58 +02:00
Xuan-Son Nguyen	63f8fe0ef4	model, mtmd: fix gguf conversion for audio/vision mmproj (#21309 ) * fix gguf conversion for audio/vision mmproj * fix test	2026-04-02 17:10:32 +02:00
Jesus Talavera	6137c325a1	chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804 ) * chat : add Granite 4.0 chat template with correct tool_call role mapping Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite 3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`). The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the `assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`. Without a matching C++ handler, the fallback path emits the literal role `assistant_tool_call` which the model does not recognize, breaking tool calling when `--jinja` is not used. Changes: - Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X` (preserves existing 3.x behavior unchanged) - Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler - Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0, otherwise → 3.x - Add production Granite 4.0 Jinja template - Add tests for both 3.x and 4.0 template paths (C++ and Jinja) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Code review: follow standard format and use common logic in test-chat-template.cpp * Rename custom_conversation variable for extra_conversation to give it a more meaningful name --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 11:28:56 +02:00
Georgi Gerganov	17193cce34	kv-cache : do not quantize SWA KV cache (#21277 )	2026-04-02 11:54:05 +03:00
Georgi Gerganov	744c0c7310	llama : rotate activations for better quantization (#21038 ) * llama : rotate activations for better quantization * cont : rotate V more + refactor * cont : rotate caches separately + support non-power-of-2 head sizes * cont : simplify * cont : add reference for V rotation * cont : refactor * cont : support context shift * cont : consolidate * cont : dedup + allow different types for the rotation matrix * cont : add env variable to disable rotation * cont : simplify attn rot kv cache logic + rename env * cont : pre-compute the Hadamard matrices	2026-04-01 16:58:01 +03:00
Ettore Di Giacinto	e1cb817483	memory: respect unified KV cache in hybrid memory for eval tasks (#21224 ) The hybrid memory paths (`llama-memory-hybrid.cpp` and `llama-memory-hybrid-iswa.cpp`) always used sequential equal split, ignoring the unified KV cache flag. This caused hellaswag, winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with: split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically enabling unified KV mode and setting n_parallel >= 4 for multi-choice eval tasks. However, the hybrid memory paths were not updated. This commit mirrors the iswa fix: use non-sequential split when KV cache is unified (n_stream == 1), which is automatically set by llama-perplexity for hellaswag/winogrande/multiple-choice since #19954. Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model): - HellaSwag: 83.0% (400 tasks) - Winogrande: 74.5% (400 tasks) - MMLU: 41.2% - ARC-Challenge: 56.2% - TruthfulQA: 37.7% All previously failed with llama_decode() error.	2026-04-01 12:50:17 +03:00
Ed Addario	4951250235	llama : refactor llama_model_quantize_params to expose a pure C interface (#20346 ) * Refactor llama_model_quantize_params to expose a pure C interface * Restore comment and cleanup struct def * Code review refactoring Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Code review refactoring --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-01 08:43:00 +03:00
lainon1	0b6ff47996	fix: correct misspellings in code comments (#21217 ) - emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp, gemma-embedding.cpp) - imlpemented → implemented (llama-adapter.cpp) - interere → interfere (llama-graph.cpp) - overridde → overridden (chat.cpp) - stastistics → statistics (ngram-map.h) - layed → laid (llama-kv-cache.h) - worster → worst (llama-context.cpp) - sequantial → sequential (llama-batch.h)	2026-03-31 13:50:51 +02:00
Aman Gupta	278521c33a	llama-model-loader: print warning when using overrides with mmap (#20978 ) * llama-model-loader: use pinned memory for tensor overrides * change to warning	2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret	7c203670f8	add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150 )	2026-03-29 19:45:40 +02:00
Saba Fallah	1743d98057	mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 ) * mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-27 00:07:55 +01:00
Michael Wand	f8d4abae86	convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505 ) * convert : fix Qwen3.5 NVFP4 conversion * Updated copilot concerns and rebased * move into _LinearAttentionVReorderBase and simplify * --flake * new_name not needed * Added input_scale to gguf * Fixed input_scale addition as tensor * Added input scale to loader and named _in_s * Update convert_hf_to_gguf.py Re-removed input_scale from aux cleanup Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-26 16:52:06 +01:00
Saba Fallah	a970515bdb	mtmd: Add DeepSeekOCR Support (#17400 ) * mtmd: llama.cpp DeepSeekOCR support init commit * loading sam tensors * mtmd: fix vision model processing * deepseek-ocr clip-vit model impl * mtmd: add DeepSeek-OCR LM support with standard attention * mtmd: successfully runs DeepSeek-OCR LM in llama-cli * mtmd: Fix RoPE type for DeepSeek-OCR LM. * loading LM testing Vision model loading * sam warmup working * sam erroneous return corrected * clip-vit: corrected cls_embd concat * clip-vit: model convert qkv_proj split * corrected combining of image encoders' results * fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model * concat image_newline and image_seperator tokens * visual_model warmup (technically) works * window partitioning using standard ggml ops * sam implementation without using CPU only ops * clip: fixed warnings * Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr * mtmd: fix get_rel_pos * mtmd: fixed the wrong scaler for get_rel_pos * image encoding technically works but the output can't be checked singe image decoding fails * mtmd: minor changed * mtmd: add native resolution support * - image encoding debugged - issues fixed mainly related wrong config like n_patches etc. 
- configs need to be corrected in the converter * mtmd: correct token order * - dynamic resizing - changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4 * mtmd: quick fix token order * mtmd: fix danling pointer * mtmd: SAM numerically works * mtmd: debug CLIP-L (vit_pre_ln) * mtmd: debug CLIP-L & first working DeepSeek-OCR model * mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work * mtmd: simplify SAM patch embedding * mtmd: adapt Pillow image resizing function * mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing * mtmd: remove --dsocr-mode argument * mtmd: refactor code & remove unused helper functions * mtmd: fix tensor names for image newlines and view separator * clean up * reverting automatically removed spaces * reverting automatically removed spaces * mtmd: fixed bad ocr check in Deepseek2 (LM) * mtmd: support combined QKV projection in buid_vit * using common build_attn in sam * corrected code-branch when flash-attn disabled enabling usage of --flash-attn option * mtmd: minor fix * minor formatting and style * fixed flake8 lint issues * minor editorconfig-check fixes * minor editorconfig-check fixes * mtmd: simplify get_rel_pos * mtmd: make sam hparams configurable * mtmd: add detailed comments for resize_bicubic_pillow * mtmd: fixed wrong input setting * mtmd: convert model in FP16 * mtmd: minor fix * mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template * fix: test-1.jpg ORC issue with small (640) resolution setting min-resolution base (1024) max large (1280) for dynamic-resolution * minor: editconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909 added new opt to tests.sh to disable flash-attn * minor: editconfig-check fix * testing deepseek-ocr quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR * quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909 * refactoring, one 
single builder function and static helpers * added deepseek-ocr test to tests.sh * minor formatting fixes * check with fixed expected resutls * minor formatting * editorconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042 * minor - added GLM-4.6V to big tests - added missing deps for python test * convert: minor fix * mtmd: format code * convert: quick fix * convert: quick fix * minor python formatting * fixed merge build issue * merge resolved - fixed issues in convert - tested several deepseek models * minor fix * minor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * - removed clip_is_deepseekocr - removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo - simplified image-preprocessing - removed/simplified debug functions * - cleaning commented out code * fixing instabilities issues reintroducing resize_bicubic_pillow * - use f16 model for deepseek-ocr test - ignore llama-arch test for deepseek-ocr * rename fc_w --> mm_fc_w * add links to OCR discussion * cleaner loading code * add missing .weight to some tensors * add default jinja template (to be used by server) * move test model to ggml-org * rolling back upscale change * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: bluebread <hotbread70127@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-03-25 19:57:40 +01:00
Saba Fallah	80322ebdaf	model: codefuse-ai/F2LLM-v2 support	2026-03-25 18:33:42 +01:00
Dowon	44c51e526b	model : allow causal_attn and pooling_type on all architectures (#20973 ) * models : allow causal_attn and pooling_type on all architectures * fix: move location	2026-03-25 18:12:38 +01:00
Johannes Gäßler	36dafba5c4	llama: fix llama-model-saver (#20503 ) * llama : add fd-based model loading via llama_model_load_from_fd * llama : address review feedback for fd-based model loading * llama : use FILE pointer instead of fd in public API * llama : use FILE pointer consistently, address review feedback * fixup * fix tensor names * fix llama-model-saver * roundtrip tests * fixup * refactor tests * fix prints * fix model saving * fix CI, disable Chameleon * print seed --------- Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>	2026-03-25 12:53:16 +02:00
Georgi Gerganov	9f102a1407	models : move the token embedding norms to the first layer (#20943 ) * models : move the token embedding norms to the first layer * cont : fix LLM_TENSOR_CONV1D + fix il indexing	2026-03-24 17:00:30 +02:00
Aman Gupta	3fc6f1aed1	ggml-backend: re-enable graph reuse with pipeline parallelism (#20927 )	2026-03-24 20:47:00 +08:00
Aman Gupta	e852eb4901	llama-fit: fix regex pattern for gate_up tensors (#20910 ) * llama-fit: fix regex pattern for gate_up tensors * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-24 12:57:57 +08:00

920 Commits