Commit Graph

7873 Commits

Author SHA1 Message Date
Georgi Gerganov
6c8a04576e
experiments 2026-01-28 09:45:07 +02:00
Georgi Gerganov
003c90352d
ngram-map : take into account the input can become shorter 2026-01-27 11:56:13 +02:00
Georgi Gerganov
9f8401a533
ngram-map : fix uninitialized values 2026-01-27 11:07:18 +02:00
Georgi Gerganov
bc33838037
common : rename speculative.draftless_type -> speculative.type 2026-01-27 10:19:36 +02:00
Georgi Gerganov
351e798b2a
Merge branch 'master' into pr/18471 2026-01-27 10:04:19 +02:00
Gaurav Garg
a83c73a18a
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work is submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
2026-01-27 08:52:44 +02:00
Daniel Bevenius
fc3cdf32ce
common : clarify HTTPS build options in error message (#19103)
* common : clarify HTTPS build options in error message

This commit updates the https error message to provide clearer
instructions for users who encounter the "HTTPS is not supported" error.

The motivation for this is that it might not be clear to users that only
one of these options is needed to enable HTTPS support.
The LLAMA_OPENSSL option is also added to the message to cover all
possible build configurations.

* clarify that OpenSSL is the default for HTTPS support
2026-01-27 06:16:00 +01:00
shalinib-ibm
7afdfc9b84
ggml-cpu: Enable FP16 MMA kernels on PPC (#19060) 2026-01-27 11:52:34 +08:00
lhez
94eeb5967c
opencl: add flattened q6_K mv (#19054)
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`

* opencl: clean up

* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`

* opencl: tweak the workgroup size a bit

* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`

* opencl: proper alignment for q6_K

* opencl: boundary handling for flattened q6_K mv

* opencl: rename q6_K mv kernel file

* opencl: put flattened q6_K mv in its own file

* opencl: use lower k in file name

* opencl: use K in variable names
2026-01-26 19:36:24 -08:00
Johannes Gäßler
b0311c16d2
CUDA: fix padding of GQA to power of 2 in FA (#19115) 2026-01-26 23:24:58 +01:00
Sascha Rogmann
dd23149dea CODEOWNERS: add common/ngram-map.* (#18471) 2026-01-26 22:06:43 +01:00
Sascha Rogmann
72f416e973 minor: comments 2026-01-26 22:04:00 +01:00
Georgi Gerganov
8f80d1b254
graph : fix nkvo offload with FA (#19105) 2026-01-26 20:18:34 +02:00
Sigbjørn Skjæret
142cbe2ac6
ci : use new 1vCPU runner for lightweight jobs (#19107)
* use new 1vCPU runner for lightweight jobs

* pyright is too heavy, look into ty some day

use new pip-install input
2026-01-26 15:22:49 +01:00
Georgi Gerganov
1f8d36665d
minor : cleanup + fix build 2026-01-26 14:05:17 +02:00
Georgi Gerganov
a3300937e5
common : better names 2026-01-26 13:59:08 +02:00
Georgi Gerganov
f895bca71a
minor : cleanup 2026-01-26 13:56:28 +02:00
Georgi Gerganov
56f3ebf38e
model : add correct type for GLM 4.7 Flash (#19106) 2026-01-26 11:24:30 +02:00
Sascha Rogmann
fd4d803c60 common: print performance in spec decoding 2026-01-26 00:20:05 +01:00
Sascha Rogmann
288ab50597 doc: (draftless) speculative decoding 2026-01-25 23:58:55 +01:00
Sascha Rogmann
8ea068e5f8 spec: remove --spec-config 2026-01-25 23:56:29 +01:00
Johannes Gäßler
0c21677e43
CUDA: faster FA for GQA > 1 but not power of 2 (#19092) 2026-01-25 21:19:47 +01:00
Georgi Gerganov
9ac881767c
cont : naming 2026-01-25 21:39:54 +02:00
ccbinn
0440bfd160
metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-25 20:07:19 +02:00
Sigbjørn Skjæret
0bf5636938
convert : yield Gemma3N custom_map tensors directly (#19091) 2026-01-25 18:03:34 +01:00
Georgi Gerganov
924517dd38
spec : refactor 2026-01-25 18:21:57 +02:00
Sascha Rogmann
af382c384a common: cleanup (use common_speculative_state_draft) 2026-01-25 16:41:44 +01:00
Aman Gupta
bcb43163ae
ggml-cpu: Use tiled FA for prompt-processing (#19012)
* ggml-cpu: Use tiled FA for prompt-processing

FA performance on CPU is poor at long contexts because it essentially uses a vector kernel. This PR adds a tiled FA kernel for prompt processing. Tile sizes were tuned on an AMD EPYC single-socket 64-core machine.

* fix out of bounds for mask

* skip rows where there are all masks

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-25 23:25:58 +08:00
Georgi Gerganov
d9c6ce46f7
kv-cache : support V-less cache (#19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-25 15:48:56 +02:00
Sigbjørn Skjæret
70d860824a
convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084)
* fix Gemma3N and Ernie4.5Moe

* fix GraniteMoe
2026-01-25 13:05:05 +01:00
Georgi Gerganov
080b161995
completion : fix prompt cache for recurrent models (#19045) 2026-01-25 09:12:50 +02:00
Molly Sophia
1243f93a2d
readme: update RWKV7 model links (#19061)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2026-01-25 09:11:19 +02:00
Jakkala Mahesh
24bc238303
llama: fix integer type consistency in split helpers (#18894)
* llama: fix integer type consistency in split helpers

* llama: apply minor style fixes

* llama: remove trailing whitespace
2026-01-25 09:10:52 +02:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages

This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.

The motivation for this change is that with only one decimal place,
values generated using --help or llama-gen-docs do not display the
correct defaults.

For example, the value of top-p in tools/server/README.md is currently
`0.9`, but the default value is actually `0.95`. Running llama-gen-docs
does not update this value because it uses the output of the help
message, which shows only one decimal place, so the values appear
unchanged.

* docs : run llama-gen-docs to update docs
2026-01-25 07:31:42 +01:00
Bartowski
9981c30130
convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064)
* Add undo_permute = False where needed

* Replace super().modify_tensors with ModelBase

* Add one more ModelBase.modify_tensors

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-25 02:36:47 +01:00
Sascha Rogmann
cb3a40277a common: moved self-spec impl to ngram-map 2026-01-25 01:16:06 +01:00
Johannes Gäßler
e9fd8dcab4
llama-fit-params: keep explicit --ctx-size 0 (#19070) 2026-01-24 22:13:08 +01:00
Johannes Gäßler
4e5b83b226
GGUF: check that tensor size is representable (#19072) 2026-01-24 21:57:51 +01:00
Xuan-Son Nguyen
bb02f74c61
chat: fix language input for translategemma (#19052)
* chat: fix language input for translategemma

* Update common/chat.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-01-24 17:58:45 +01:00
Sascha Rogmann
a1584ac80f server: cleanup (remove slot.batch_spec, rename) 2026-01-24 15:55:02 +01:00
Sascha Rogmann
1e29af4ea5 common: add option --spec-draftless 2026-01-24 15:55:02 +01:00
Sascha Rogmann
eb43748b05 common: add vector of speculative states 2026-01-24 15:55:02 +01:00
Sascha Rogmann
b38eb5907c common: add enum common_speculative_type 2026-01-24 15:55:02 +01:00
Sascha Rogmann
456268fa7f common: ngram map, config self-speculative decoding 2026-01-24 15:36:44 +01:00
Sascha Rogmann
907d094f9e server: can_speculate() requires a task instance 2026-01-24 15:36:44 +01:00
Sascha Rogmann
f1f6584ce6 common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
917f4bb14b server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
38f7c28795 server: can_speculate() tests self-spec 2026-01-24 15:36:44 +01:00
Sascha Rogmann
e3e809cc01 can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-24 15:36:44 +01:00
Sascha Rogmann
1faeb628db server: moved self-call into speculative.cpp 2026-01-24 15:36:43 +01:00