llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-03-25 16:40:58 +01:00

Georgi Gerganov d28961d81e llama : enable chunked fused GDN path (#20340)		2026-03-11 22:46:40 +02:00

* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix the fix
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix
* metal : add GDN kernel (#20361)
  * metal : add Metal backend for GGML_OP_GATED_DELTA_NET
    Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon.
    Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.
    Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
    tg128: 170 -> 213 t/s (+25%)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * metal : validate contiguity of all input tensors in supports_op
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * metal : add algorithm equivalence comment for GDA decay path
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * cont : unslop + optimize
  * cont : clean-up
  ---------
  Co-authored-by: Paul Flynn <paul@arkavo.com>
  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CUDA: AR gated delta net improvements (#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps
    This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies).
  * Remove unneded include in gated_delta_net.cu
  * Improve comments
  * Apply code-formating
  * Make sharding HIP-compatible
    1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
    2. Add test with partial warp to test sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked_GDN
  * Actually remove the TODO from `2068908975`
  * Get warp size at runtime
    warp_size is not known at compile time in hip host code.
  * Don't expose ggml_cuda_get_physical_warp_size on host
  ---------
  Co-authored-by: uvos <devnull@uvos.xyz>
* llama : refactor llm_build_delta_net_base API
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>

peg-parser	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
.gitignore	common : introduce composable PEG parser combinators for chat parsing (#17136)	2025-12-03 12:45:32 +02:00
CMakeLists.txt	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
get-model.cpp	ci : add model tests + script wrapper (#4586)	2024-01-26 14:18:00 +02:00
get-model.h	ci : add model tests + script wrapper (#4586)	2024-01-26 14:18:00 +02:00
gguf-model-data.cpp	tests : model metadata loading from huggingface (#19796)	2026-02-28 10:44:38 +01:00
gguf-model-data.h	tests : model metadata loading from huggingface (#19796)	2026-02-28 10:44:38 +01:00
run-json-schema-to-grammar.mjs	llama : move end-user examples to tools directory (#13249)	2025-05-02 20:27:13 +02:00
test-alloc.cpp	chore : correct typos [no ci] (#20041)	2026-03-05 08:50:21 +01:00
test-arg-parser.cpp	ci, tests : use cmake to download models and remove libcurl dependency (#18791)	2026-01-14 07:46:27 +01:00
test-autorelease.cpp	docs : Minor cleanups (#19252)	2026-02-02 08:38:55 +02:00
test-backend-ops.cpp	llama : enable chunked fused GDN path (#20340)	2026-03-11 22:46:40 +02:00
test-backend-sampler.cpp	tests : fix typos in comments in test-backend-sampler [no ci] (#19824)	2026-02-23 17:12:02 +01:00
test-barrier.cpp	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748)	2025-12-10 12:32:23 -08:00
test-c.c	ggml : remove kompute backend (#14501)	2025-07-03 07:48:32 +03:00
test-chat-auto-parser.cpp	common : gracefully handle incomplete output (#20191)	2026-03-08 17:17:02 +01:00
test-chat-peg-parser.cpp	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
test-chat-template.cpp	Autoparser - complete refactoring of parser architecture (#18675)	2026-03-06 21:01:00 +01:00
test-chat.cpp	common: map developer role to system (#20215)	2026-03-09 14:25:11 +01:00
test-double-float.cpp	ggml : minor naming changes (#8433)	2024-07-12 10:46:02 +03:00
test-gbnf-validator.cpp	cmake : do not include ./src as public for libllama (#13062)	2025-04-24 16:00:10 +03:00
test-gguf-model-data.cpp	tests : model metadata loading from huggingface (#19796)	2026-02-28 10:44:38 +01:00
test-gguf.cpp	ggml/gguf : prevent integer overflows (#19856)	2026-02-24 20:17:11 +02:00
test-grammar-integration.cpp	llama : add token matching support to llama-grammar (#17816)	2025-12-09 00:32:57 -06:00
test-grammar-llguidance.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-grammar-parser.cpp	llama : add token matching support to llama-grammar (#17816)	2025-12-09 00:32:57 -06:00
test-jinja.cpp	jinja: correct stats for tojson and string filters (#19785)	2026-02-22 21:08:23 +01:00
test-json-partial.cpp	common : handle unicode during partial json parsing (#16526)	2025-10-12 16:18:47 +03:00
test-json-schema-to-grammar.cpp	examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)	2026-03-10 14:38:18 +01:00
test-llama-archs.cpp	llama: end-to-end tests (#19802)	2026-03-08 12:30:21 +01:00
test-llama-grammar.cpp	llama : add token matching support to llama-grammar (#17816)	2025-12-09 00:32:57 -06:00
test-log.cpp	common : use common_ prefix for common library functions (#9805)	2024-10-10 22:57:42 +02:00
test-lora-conversion-inference.sh	cli: new CLI experience (#17824)	2025-12-10 15:28:59 +01:00
test-model-load-cancel.cpp	llama : update llama_model API names (#11063)	2025-01-06 10:55:18 +02:00
test-mtmd-c-api.c	mtmd : add C public API (#13184)	2025-05-04 23:43:42 +02:00
test-opt.cpp	tests : fix test-opt with GGML_BACKEND_DL (#15599)	2025-08-26 22:14:38 +02:00
test-peg-parser.cpp	Autoparser - complete refactoring of parser architecture (#18675)	2026-03-06 21:01:00 +01:00
test-quantize-fns.cpp	ggml : add NVFP4 quantization type support (#19769)	2026-03-11 21:02:54 +01:00
test-quantize-perf.cpp	ci: run the x64 and arm ci on the github machines instead (#16183)	2025-09-25 08:06:06 +03:00
test-quantize-stats.cpp	server: introduce API for serving / loading / unloading multiple models (#17470)	2025-12-01 19:41:04 +01:00
test-reasoning-budget.cpp	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
test-regex-partial.cpp	common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)	2026-01-03 16:02:43 -06:00
test-rope.cpp	ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805)	2025-11-11 13:33:24 +02:00
test-sampling.cpp	sampling : optimize samplers by reusing bucket sort (#15665)	2025-08-31 20:41:02 +03:00
test-state-restore-fragmented.cpp	kv-cache: Fix state restore fragmented cache (#17982)	2025-12-15 19:28:35 +02:00
test-thread-safety.cpp	server : support unified cache across slots (#16736)	2025-11-02 18:14:04 +02:00
test-tokenizer-0.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-0.py	py : logging and flake8 suppression refactoring (#7081)	2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh	model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826)	2026-02-26 12:14:09 +01:00
test-tokenizer-1-bpe.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-1-spm.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-random.py	requirements : update transformers/torch for Embedding Gemma (#15828)	2025-09-09 06:06:52 +02:00
test-tokenizers-repo.sh	devops: add s390x & ppc64le CI (#15925)	2025-09-27 02:03:33 +08:00
testing.h	common : implement new jinja template engine (#18462)	2026-01-16 11:22:06 +01:00