Commit Graph

3895 Commits

Author SHA1 Message Date
Johannes Gäßler
41d5d7bb0e CUDA: fix padding of GQA to power of 2 in FA (llama/19115) 2026-01-30 15:56:40 +02:00
Johannes Gäßler
f63848eada CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092) 2026-01-30 15:56:40 +02:00
ccbinn
4372b87b8e metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (llama/19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-30 15:56:40 +02:00
Aman Gupta
1642a4fb60 ggml-cpu: Use tiled FA for prompt-processing (llama/19012)
* ggml-cpu: Use tiled FA for prompt-processing

On CPU, FA performance at long contexts is poor because it essentially uses a vector kernel. This PR adds a tiled FA path for prompt processing. Tile sizes were perf-tuned on an AMD EPYC single-socket 64-core machine.

* fix out of bounds for mask

* skip rows that are fully masked

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-30 15:56:40 +02:00
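The tiled FA path above depends on the online-softmax update: when a new tile raises the running maximum, the previously accumulated state is rescaled before the tile's contributions are added. A minimal pure-Python sketch of that update (the function name, tile size, and scalar values are illustrative, not the actual kernel):

```python
import math

def online_softmax(scores, values, tile=4):
    """Streaming softmax-weighted average over `values`, processing
    `scores` in tiles and rescaling the running state whenever a new
    maximum is seen (the same trick tiled flash attention uses)."""
    m = -math.inf   # running max of the scores seen so far
    l = 0.0         # running normalizer: sum of exp(s - m)
    acc = 0.0       # running unnormalized weighted sum
    for i in range(0, len(scores), tile):
        s_tile = scores[i:i + tile]
        v_tile = values[i:i + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)   # rescale previous state to new max
        l *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc += p * v
        m = m_new
    return acc / l
```

The result matches a two-pass softmax exactly, but each score/value pair is read only once, which is what makes the tiled formulation viable for long contexts.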
Georgi Gerganov
d2b51404e4 kv-cache : support V-less cache (llama/19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-30 15:56:40 +02:00
Johannes Gäßler
f53eafd745 CUDA: re-use MLA K data for V in MMA FA (llama/19057) 2026-01-30 15:56:40 +02:00
Aman Gupta
13577a6ce4 ggml-cuda: enable cuda-graphs for n-cpu-moe (llama/18934)
* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds
2026-01-30 15:56:40 +02:00
nullname
79f1bb3d35 ggml-hexagon: flash-attn opt (llama/19025)
* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip
2026-01-30 15:56:40 +02:00
Neo Zhang
0d9dda5a99 use malloc to support both iGPU and dGPU at the same time (llama/18992)
* use malloc to support both iGPU and dGPU at the same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez
e090d91f5e ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (llama/18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
2026-01-30 15:56:40 +02:00
Georgi Gerganov
3f96a1da0e mla : make the V tensor a view of K (llama/18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Johannes Gäßler
f21d0cbb1a CUDA: fix alignment check for FA (llama/19023) 2026-01-30 15:56:40 +02:00
lhez
0e030b852a opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well
2026-01-30 15:56:40 +02:00
Aman Gupta
d4fafcfc6f CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953) 2026-01-30 15:56:40 +02:00
shaofeiqi
167fec69d5 opencl: add TRI op support (llama/18979) 2026-01-30 15:56:40 +02:00
Aleksei Nikiforov
55927d42ef ggml-zdnn : mark zDNN buffers as non-host (llama/18967)
While the buffers reside in host memory, an additional transformation
is needed before they can be used with zDNN.

Fixes #18848
2026-01-30 15:56:40 +02:00
Jeff Bolz
b7e323f40b vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-30 15:56:40 +02:00
Jeff Bolz
b2bc4d810b vulkan: support flash attention GQA/split_k with small batches (llama/18938) 2026-01-30 15:56:40 +02:00
Masato Nakasaka
3bbf4ced47 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831)
This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.
2026-01-30 15:56:40 +02:00
Jeff Bolz
660d943ff8 vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-30 15:56:40 +02:00
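The reasoning above — loop a mat-vec kernel over the batch dimension so the cost stays linear in n — can be sketched in pure Python (hypothetical helper names; the real change lives in `ggml_vk_mul_mat_vec_id_q_f16`):

```python
def mat_mat_via_matvec(W, X):
    """Compute W @ X by looping a mat-vec kernel over the columns of X.
    Each column costs one mat-vec, so the total cost is linear in n
    (the number of columns) instead of paying mat-mat overheads at
    small n."""
    def matvec(W, x):
        # the fast n == 1 kernel this path reuses
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

    n = len(X[0])  # batch size / number of columns
    cols = [matvec(W, [row[j] for row in X]) for j in range(n)]
    # re-assemble the per-column results into the output matrix
    return [[cols[j][i] for j in range(n)] for i in range(len(W))]
```

The trade-off stated in the commit applies here too: the weights `W` are re-read once per column (no reuse across columns as in a true mat-mat kernel), which is acceptable when n is small.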
Oliver Simons
924a9e292c CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request
2026-01-30 15:56:40 +02:00
Oliver Simons
fdc83ee3c0 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930)
* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu
2026-01-30 15:56:40 +02:00
Adrien Gallouët
bf71ffa6b3 ggml : cleanup path_str() (llama/18928)
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-30 15:56:40 +02:00
Georgi Gerganov
b0517d6912 metal : enable FA for MLA heads (llama/18950) 2026-01-30 15:56:40 +02:00
Georgi Gerganov
47f3e3b927 ggml : add ggml_build_forward_select (llama/18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-30 15:56:40 +02:00
lhez
62a09b106d opencl: fix q6_K mv for m=1 (llama/18893) 2026-01-30 15:56:40 +02:00
Reese Levine
389dafc7c2 ggml webgpu: support for backend sampling (llama/18880) 2026-01-30 15:56:40 +02:00
Thore Koritzius
511ca7a1f4 ggml : extend ggml_pool_1d + metal (llama/16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 15:56:40 +02:00
Perry Naseck
ecb4b80c35 ggml-blas: hide warnings from included BLAS headers (llama/18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers
2026-01-30 15:56:40 +02:00
Raul Torres
42960b6073 CANN: Remove unused ggml_cann_get_device function (llama/18625) 2026-01-30 15:56:40 +02:00
Chenguang Li
2fceb5a80f CANN: fix an issue where get_env was not fully renamed (llama/18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-30 15:56:40 +02:00
hipudding
854274a297 CANN: support gated linear attn (llama/18653)
* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.

Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: optimize OP gla

Optimize gla for high performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-30 15:56:40 +02:00
shaofeiqi
ed6004d051 OpenCL: add SOLVE_TRI op support (llama/18846) 2026-01-30 15:56:40 +02:00
Georgi Gerganov
290ff3d28d cuda : print less debug logs when disabling cuda graphs (llama/18868) 2026-01-30 15:56:40 +02:00
Johannes Gäßler
f2f0ba0384 CUDA: fix alignment on register spill for FA (llama/18815) 2026-01-30 15:56:40 +02:00
shalinib-ibm
78a23d4830 ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837) 2026-01-30 15:56:40 +02:00
Max Krasnyansky
50b7ab3d46 hexagon: support for OP_CPY, host buffers now optional (llama/18822) 2026-01-30 15:56:40 +02:00
Oliver Simons
bc09047405 CUDA: Factor out and re-use block_reduce function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails unless the smem is
accounted for once per call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-30 15:56:40 +02:00
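The two-stage scheme refactored above — each warp reduces its own slice, the per-warp partials go to shared memory, and one final warp-level pass reduces the partials — can be simulated in a few lines of Python (function names and the stand-in `warp_reduce_sum` are illustrative; on the GPU both stages would be shuffle-based reductions):

```python
WARP_SIZE = 32

def warp_reduce_sum(vals):
    # stand-in for a butterfly shuffle reduction within one warp
    return sum(vals)

def block_reduce_sum(vals, warp_size=WARP_SIZE):
    """Two-stage block reduction: stage 1 reduces each warp-sized
    slice independently; the partials play the role of the shared
    memory buffer; stage 2 reduces the partials with one more
    warp-level pass."""
    partials = [warp_reduce_sum(vals[i:i + warp_size])
                for i in range(0, len(vals), warp_size)]
    # a block holds at most warp_size warps, so one warp finishes the job
    assert len(partials) <= warp_size
    return warp_reduce_sum(partials)
```

This structure is why the smem placement discussed in the commit matters: the partials buffer is the only shared state between the two stages, so hoisting it to the `__global__` function lets several reductions in one kernel reuse the same allocation.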
Jeff Bolz
4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-30 15:56:40 +02:00
Daniel Bevenius
25aeb66a4a CUDA : fix typo in clang pragma comment [no ci] (llama/18830) 2026-01-30 15:56:40 +02:00
Ruben Ortlam
49762e8fb3 vulkan: work around Intel fp16 bug in mmq (llama/18814) 2026-01-30 15:56:40 +02:00
Perry Naseck
17656e56dc ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705) 2026-01-30 15:56:40 +02:00
yulo
c6a495ae5d HIP: add fattn-mma-f16 for RDNA4 (llama/18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Bráulio Oliveira
7aa8818647
examples : use -dev/--device and WHISPER_ARG_DEVICE (#3557)
Align device selection naming with llama.cpp.
2026-01-21 08:40:30 +01:00
Yshtola
f53dc74843
whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592)
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
2026-01-16 14:16:05 +02:00
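The fix above amounts to choosing a split point by character boundary rather than raw byte count. In UTF-8, continuation bytes always match `0b10xxxxxx`, so a byte-budget split can simply back up until it lands on a lead byte. A hedged sketch of that idea (hypothetical helper, not the actual `whisper_wrap_segment()` code):

```python
def utf8_safe_split(data: bytes, max_bytes: int) -> int:
    """Return a split point <= max_bytes that never lands inside a
    multi-byte UTF-8 sequence. UTF-8 continuation bytes have the form
    0b10xxxxxx, so back up while the byte at the candidate split
    position is a continuation byte."""
    if len(data) <= max_bytes:
        return len(data)
    cut = max_bytes
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut
```

Splitting on a byte count alone (the pre-fix `strlen()` behavior) could leave a dangling lead byte at the end of a segment, which decodes to U+FFFD; backing up to the previous boundary keeps every segment independently decodable.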
Georgi Gerganov
2eeeba56e9
release : v1.8.3 2026-01-15 11:54:31 +02:00
Georgi Gerganov
21c1765fcb
benches : update 2026-01-15 11:53:09 +02:00
Georgi Gerganov
47af2fb70f sync : ggml 2026-01-14 09:11:59 +02:00
Georgi Gerganov
6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) 2026-01-14 09:11:59 +02:00
Jeff Bolz
ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00