llama.cpp/src
Gabe Goodhart 0c74f32632
memory: Hybrid context shift (#17009)
* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success: a recurrent cache keeps no separate memory for positions before
the final token, so there is nothing left to remove for that range.
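
The rule above can be sketched as a small standalone predicate. This is a
minimal illustration under stated assumptions, not the code from this commit:
`pos_t` and `recurrent_erase_ok` are hypothetical names, the range is treated
as half-open `[p0, p1)`, and negative bounds are taken to mean "open-ended".

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a token position type; not the llama.cpp type.
using pos_t = std::int32_t;

// Returns true if erasing positions [p0, p1) can be reported as a success for
// a recurrent sequence whose single state reflects everything up to tail_pos.
// Assumption: negative p0 means "from the start", negative p1 means "to the end".
bool recurrent_erase_ok(pos_t tail_pos, pos_t p0, pos_t p1) {
    assert(tail_pos >= 0); // the sketch only models a sequence that has a tail
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<pos_t>::max();

    // A range starting at 0 is a full erasure: the whole state can be dropped.
    const bool partial = p0 > 0;
    // The in-place state is only affected if the range covers the final token.
    const bool includes_tail = p0 <= tail_pos && tail_pos < p1;

    // Only a partial erasure that reaches the final token has to fail.
    return !(partial && includes_tail);
}
```

With a tail at position 9, erasing `[2, 5)` or a range starting past the tail
(`p0 = 12`) is accepted, while erasing `[5, end)` is rejected because it would
require rolling the state back.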

One potential case this does not address is pruning the cache to remove
sensitive data from the context. Partial removal from the middle of an
attention cache would not achieve that either: the KV state at each
position depends on the positions before it, so states at later sequence
positions would still be based on the sensitive data even if that data is
no longer cached. I therefore don't think this case is relevant, but it is
worth noting that the semantics of this change for a partial erasure in the
middle of the cache are essentially "my context is already compressed" and
not "all trace of the removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
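
The fallback described above might look roughly like the following sketch. It
uses a toy cache rather than the real llama.cpp memory API: `ToyCache`,
`remove_from`, and `reuse_prefix` are hypothetical names introduced only for
illustration, and the toy recurrent cache simply refuses any tail trim.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy cache: stores tokens and, when marked recurrent, refuses to
// trim its tail because its state has been updated in place.
struct ToyCache {
    std::vector<int> tokens;
    bool recurrent = false;

    // Remove tokens from position p0 to the end; returns false if unsupported.
    bool remove_from(std::size_t p0) {
        if (p0 >= tokens.size()) return true;  // nothing to remove
        if (recurrent && p0 > 0) return false; // cannot roll the state back
        tokens.resize(p0);
        return true;
    }

    void clear() { tokens.clear(); }
};

// Returns how many cached tokens can be reused as a prefix of the new prompt.
std::size_t reuse_prefix(ToyCache & cache, const std::vector<int> & prompt) {
    std::size_t n_match = 0;
    while (n_match < cache.tokens.size() && n_match < prompt.size() &&
           cache.tokens[n_match] == prompt[n_match]) {
        ++n_match;
    }
    // Try to drop the non-matching tail; if the cache cannot do that, clear it
    // entirely and report that nothing is reused.
    if (!cache.remove_from(n_match)) {
        cache.clear();
        n_match = 0;
    }
    return n_match;
}
```

The important detail is that the match count is reset together with the cache,
so the caller re-evaluates the whole prompt instead of skipping tokens that are
no longer cached.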

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 17:14:23 +02:00
models hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
CMakeLists.txt model : add openPangu-Embedded (#16941) 2025-11-05 10:28:58 +01:00
llama-adapter.cpp aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-adapter.h aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-arch.cpp model : add openPangu-Embedded (#16941) 2025-11-05 10:28:58 +01:00
llama-arch.h model : add openPangu-Embedded (#16941) 2025-11-05 10:28:58 +01:00
llama-batch.cpp batch : fix consistency checks for the input positions (#16890) 2025-10-31 13:50:33 +02:00
llama-batch.h llama: store mrope data in KV cell (#16825) 2025-10-29 18:09:18 +01:00
llama-chat.cpp model : add openPangu-Embedded (#16941) 2025-11-05 10:28:58 +01:00
llama-chat.h model : add openPangu-Embedded (#16941) 2025-11-05 10:28:58 +01:00
llama-context.cpp hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
llama-context.h server : support unified cache across slots (#16736) 2025-11-02 18:14:04 +02:00
llama-cparams.cpp cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188) 2025-06-15 10:08:58 +03:00
llama-cparams.h server : support unified cache across slots (#16736) 2025-11-02 18:14:04 +02:00
llama-grammar.cpp server: streaming of tool calls and thoughts when --jinja is on (#12379) 2025-05-25 01:48:08 +01:00
llama-grammar.h tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) 2025-03-05 13:05:13 +00:00
llama-graph.cpp hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
llama-graph.h graph : support cacheless embeddings with FA and iSWA (#16528) 2025-10-13 22:42:37 +03:00
llama-hparams.cpp hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
llama-hparams.h hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
llama-impl.cpp GGUF: C++ refactor, backend support, misc fixes (#11030) 2025-01-07 18:01:58 +01:00
llama-impl.h llama: use FA + max. GPU layers by default (#15434) 2025-08-30 16:32:10 +02:00
llama-io.cpp llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-io.h llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-kv-cache-iswa.cpp kv-cache : pad the cache size to 256 for performance (#17046) 2025-11-07 20:03:25 +02:00
llama-kv-cache-iswa.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-kv-cache.cpp model: add support for qwen3vl series (#16780) 2025-10-30 16:19:14 +01:00
llama-kv-cache.h memory : remove KV cache size padding (#16812) 2025-10-28 20:19:44 +02:00
llama-kv-cells.h llama: store mrope data in KV cell (#16825) 2025-10-29 18:09:18 +01:00
llama-memory-hybrid.cpp memory : use sequential equal splits for recurrent modules (#16442) 2025-10-07 08:24:17 +03:00
llama-memory-hybrid.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-memory-recurrent.cpp memory: Hybrid context shift (#17009) 2025-11-10 17:14:23 +02:00
llama-memory-recurrent.h llama: consistent ctx <-> buf order for KV cache (#16746) 2025-10-28 11:23:54 +01:00
llama-memory.cpp memory : correctly handle failure in apply() (#14438) 2025-06-30 18:03:03 +03:00
llama-memory.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-mmap.cpp llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) 2025-06-05 11:57:42 +02:00
llama-mmap.h llama-mmap: fix missing include (#11796) 2025-02-10 20:58:18 +02:00
llama-model-loader.cpp model : Apertus model implementation (#15852) 2025-10-02 20:43:22 +03:00
llama-model-loader.h model: support GLM 4.5 family of models (#14939) 2025-08-04 20:29:25 +02:00
llama-model-saver.cpp llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
llama-model-saver.h llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
llama-model.cpp hparams : add n_embd_inp() to support extended embed (#16928) 2025-11-07 19:27:58 +01:00
llama-model.h model : Minimax M2 (#16831) 2025-10-31 21:20:47 +01:00
llama-quant.cpp llama : use std::abs instead of abs (#16853) 2025-10-30 08:30:58 +02:00
llama-quant.h llama : refactor src/llama.cpp (#10902) 2025-01-03 10:18:53 +02:00
llama-sampling.cpp vocab : mark EOT token for Granite models (#16499) 2025-10-10 17:17:31 +03:00
llama-sampling.h llama : add llama_vocab, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama-vocab.cpp model : Minimax M2 (#16831) 2025-10-31 21:20:47 +01:00
llama-vocab.h model : Minimax M2 (#16831) 2025-10-31 21:20:47 +01:00
llama.cpp llama-quant: add support for mmproj (#16592) 2025-10-15 14:48:08 +02:00
unicode-data.cpp server : better security control for public deployments (#9776) 2024-10-08 13:27:04 +02:00
unicode-data.h llama : reduce compile time and binary size (#9712) 2024-10-02 15:49:55 +02:00
unicode.cpp model : add Kimi-K2 support (#14654) 2025-07-15 21:54:22 +02:00
unicode.h devops: add s390x & ppc64le CI (#15925) 2025-09-27 02:03:33 +08:00