Commit Graph

7873 Commits

Author SHA1 Message Date
Georgi Gerganov
6c8a04576e
experiments 2026-01-28 09:45:07 +02:00
Georgi Gerganov
003c90352d
ngram-map : take into account the input can become shorter 2026-01-27 11:56:13 +02:00
Georgi Gerganov
9f8401a533
ngram-map : fix uninitialized values 2026-01-27 11:07:18 +02:00
Georgi Gerganov
bc33838037
common : rename speculative.draftless_type -> speculative.type 2026-01-27 10:19:36 +02:00
Georgi Gerganov
351e798b2a
Merge branch 'master' into pr/18471 2026-01-27 10:04:19 +02:00
Gaurav Garg
a83c73a18a
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work is submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
2026-01-27 08:52:44 +02:00
Daniel Bevenius
fc3cdf32ce
common : clarify HTTPS build options in error message (#19103)
* common : clarify HTTPS build options in error message

This commit updates the https error message to provide clearer
instructions for users who encounter the "HTTPS is not supported" error.

The motivation for this is that it might not be clear to users that only
one of these options is needed to enable HTTPS support.
The LLAMA_OPENSSL option is also added to the message to cover all
possible build configurations.

* clarify that OpenSSL is the default for HTTPS support
2026-01-27 06:16:00 +01:00
shalinib-ibm
7afdfc9b84
ggml-cpu: Enable FP16 MMA kernels on PPC (#19060) 2026-01-27 11:52:34 +08:00
lhez
94eeb5967c
opencl: add flattened q6_K mv (#19054)
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`

* opencl: clean up

* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`

* opencl: tweak the workgroup size a bit

* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`

* opencl: proper alignment for q6_K

* opencl: boundary handling for flattened q6_K mv

* opencl: rename q6_K mv kernel file

* opencl: put flattened q6_K mv in its own file

* opencl: use lower k in file name

* opencl: use K in variable names
2026-01-26 19:36:24 -08:00
Johannes Gäßler
b0311c16d2
CUDA: fix padding of GQA to power of 2 in FA (#19115) 2026-01-26 23:24:58 +01:00
Sascha Rogmann
dd23149dea CODEOWNERS: add common/ngram-map.* (#18471) 2026-01-26 22:06:43 +01:00
Sascha Rogmann
72f416e973 minor: comments 2026-01-26 22:04:00 +01:00
Georgi Gerganov
8f80d1b254
graph : fix nkvo offload with FA (#19105) 2026-01-26 20:18:34 +02:00
Sigbjørn Skjæret
142cbe2ac6
ci : use new 1vCPU runner for lightweight jobs (#19107)
* use new 1vCPU runner for lightweight jobs

* pyright is too heavy, look into ty some day

use new pip-install input
2026-01-26 15:22:49 +01:00
Georgi Gerganov
1f8d36665d
minor : cleanup + fix build 2026-01-26 14:05:17 +02:00
Georgi Gerganov
a3300937e5
common : better names 2026-01-26 13:59:08 +02:00
Georgi Gerganov
f895bca71a
minor : cleanup 2026-01-26 13:56:28 +02:00
Georgi Gerganov
56f3ebf38e
model : add correct type for GLM 4.7 Flash (#19106) 2026-01-26 11:24:30 +02:00
Sascha Rogmann
fd4d803c60 common: print performance in spec decoding 2026-01-26 00:20:05 +01:00
Sascha Rogmann
288ab50597 doc: (draftless) speculative decoding 2026-01-25 23:58:55 +01:00
Sascha Rogmann
8ea068e5f8 spec: remove --spec-config 2026-01-25 23:56:29 +01:00
Johannes Gäßler
0c21677e43
CUDA: faster FA for GQA > 1 but not power of 2 (#19092) 2026-01-25 21:19:47 +01:00
Georgi Gerganov
9ac881767c
cont : naming 2026-01-25 21:39:54 +02:00
ccbinn
0440bfd160
metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-25 20:07:19 +02:00
Sigbjørn Skjæret
0bf5636938
convert : yield Gemma3N custom_map tensors directly (#19091) 2026-01-25 18:03:34 +01:00
Georgi Gerganov
924517dd38
spec : refactor 2026-01-25 18:21:57 +02:00
Sascha Rogmann
af382c384a common: cleanup (use common_speculative_state_draft) 2026-01-25 16:41:44 +01:00
Aman Gupta
bcb43163ae
ggml-cpu: Use tiled FA for prompt-processing (#19012)
* ggml-cpu: Use tiled FA for prompt-processing

FA performance on CPU is poor at long contexts because it essentially uses a vector kernel. This PR adds a tiled FA kernel for prompt processing. Tile sizes were tuned on an AMD EPYC single-socket 64-core machine.

* fix out of bounds for mask

* skip rows where there are all masks

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-25 23:25:58 +08:00
Georgi Gerganov
d9c6ce46f7
kv-cache : support V-less cache (#19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-25 15:48:56 +02:00
Sigbjørn Skjæret
70d860824a
convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084)
* fix Gemma3N and Ernie4.5Moe

* fix GraniteMoe
2026-01-25 13:05:05 +01:00
Georgi Gerganov
080b161995
completion : fix prompt cache for recurrent models (#19045) 2026-01-25 09:12:50 +02:00
Molly Sophia
1243f93a2d
readme: update RWKV7 model links (#19061)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2026-01-25 09:11:19 +02:00
Jakkala Mahesh
24bc238303
llama: fix integer type consistency in split helpers (#18894)
* llama: fix integer type consistency in split helpers

* llama: apply minor style fixes

* llama: remove trailing whitespace
2026-01-25 09:10:52 +02:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages

This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.

The motivation for this change is that with only one decimal place,
values generated using --help or llama-gen-docs do not display the
correct defaults.

For example, the value of top-p in tools/server/README.md is currently
`0.9`, but the default value is actually `0.95`. Running llama-gen-docs
does not update this value because it uses the output of the help
message, which shows only one decimal place, so the values appear
unchanged.

* docs : run llama-gen-docs to update docs
2026-01-25 07:31:42 +01:00
Bartowski
9981c30130
convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064)
* Add undo_permute = False where needed

* Replace super().modify_tensors with ModelBase

* Add one more ModelBase.modify_tensors

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-25 02:36:47 +01:00
Sascha Rogmann
cb3a40277a common: moved self-spec impl to ngram-map 2026-01-25 01:16:06 +01:00
Johannes Gäßler
e9fd8dcab4
llama-fit-params: keep explicit --ctx-size 0 (#19070) 2026-01-24 22:13:08 +01:00
Johannes Gäßler
4e5b83b226
GGUF: check that tensor size is representable (#19072) 2026-01-24 21:57:51 +01:00
Xuan-Son Nguyen
bb02f74c61
chat: fix language input for translategemma (#19052)
* chat: fix language input for translategemma

* Update common/chat.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-01-24 17:58:45 +01:00
Sascha Rogmann
a1584ac80f server: cleanup (remove slot.batch_spec, rename) 2026-01-24 15:55:02 +01:00
Sascha Rogmann
1e29af4ea5 common: add option --spec-draftless 2026-01-24 15:55:02 +01:00
Sascha Rogmann
eb43748b05 common: add vector of speculative states 2026-01-24 15:55:02 +01:00
Sascha Rogmann
b38eb5907c common: add enum common_speculative_type 2026-01-24 15:55:02 +01:00
Sascha Rogmann
456268fa7f common: ngram map, config self-speculative decoding 2026-01-24 15:36:44 +01:00
Sascha Rogmann
907d094f9e server: can_speculate() requires a task instance 2026-01-24 15:36:44 +01:00
Sascha Rogmann
f1f6584ce6 common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
917f4bb14b server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
38f7c28795 server: can_speculate() tests self-spec 2026-01-24 15:36:44 +01:00
Sascha Rogmann
e3e809cc01 can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
1faeb628db server: moved self-call into speculative.cpp 2026-01-24 15:36:43 +01:00