llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-05-01 03:42:01 +02:00

kvc0 c807c6e3b0 server: (anthropic API) fix prefix caching (#21793)

When testing claude code against llama.cpp, I noticed that only n_past 18577 was used even when context was 60k or more. The log in llama-server says:

```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```

I observed that the cch value changed every time. Reading about that, the x-anthropic-billing-header system message seems to be specially handled inside of the anthropic api. I could remove it, but there is a meaningful string sometimes included at the end. So instead, I just replace the changing cch checksum with fffff.

I'm treating this as an anthropic message body API detail - I think this is the right way to do this, but by all means please correct me! It's always 5 hexadecimal characters, but I've written the replacement defensively in case they change the protocol.

2026-04-23 17:45:02 +02:00
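The normalization described in the commit message above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the helper name `normalize_cch` is hypothetical, and the field format (`cch=` followed directly by hexadecimal digits) is an assumption inferred from the log excerpt. Matching one or more hex digits, rather than exactly five, mirrors the "defensive" replacement the commit mentions.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not taken from the llama.cpp sources).
// Replaces the per-request checksum after "cch=" with the fixed value
// "fffff", so that two otherwise identical prompts share the same prefix.
// Assumption: the checksum is a run of hex digits directly after "cch=".
std::string normalize_cch(const std::string & text) {
    // One or more hex digits, so a longer or shorter checksum is still caught.
    static const std::regex cch_re("cch=[0-9a-fA-F]+");
    return std::regex_replace(text, cch_re, "cch=fffff");
}
```

With this in place, two system messages that differ only in their checksum normalize to the same string, so a prefix cache keyed on the prompt text no longer diverges at that position. Text without a `cch=` field passes through unchanged.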
batched-bench	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
cli	cli : cleanup auto-completion code (#21745)	2026-04-23 15:03:28 +02:00
completion	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
cvector-generator	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
export-lora	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
fit-params	fit-params : refactor + add option to output estimated memory per device (#22171)	2026-04-21 09:54:36 +03:00
gguf-split	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
imatrix	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
llama-bench	fit-params : refactor + add option to output estimated memory per device (#22171)	2026-04-21 09:54:36 +03:00
mtmd	mtmd: also support LLAMA_ROPE_TYPE_NONE (#22242)	2026-04-22 12:16:29 +02:00
parser	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
perplexity	fit-params : refactor + add option to output estimated memory per device (#22171)	2026-04-21 09:54:36 +03:00
quantize	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
results	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
rpc	rpc : add native RDMA transport for RPC backend (RoCEv2) (#20590)	2026-04-15 16:44:02 +03:00
server	server: (anthropic API) fix prefix caching (#21793)	2026-04-23 17:45:02 +02:00
tokenize	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
tts	libs : rename libcommon -> libllama-common (#21936)	2026-04-17 11:11:46 +03:00
CMakeLists.txt	llama: end-to-end tests (#19802)	2026-03-08 12:30:21 +01:00