Commit Graph

438 Commits

Author SHA1 Message Date
Jeff Bolz
d4ac481d64 test: mul_mat tests with huge batch size (llama/19519) 2026-02-25 12:32:13 +02:00
Georgi Gerganov
6d4c4d1bf9 ggml : avoid UB in gemm ukernel (llama/19642) 2026-02-25 12:32:13 +02:00
Jeff Bolz
525ca0d104 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 00:20:18 +02:00
ymcki
32f763fc0b fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 00:20:18 +02:00
Georgi Gerganov
0a79951dfa metal : support GGML_OP_SET (llama/19548) 2026-02-15 00:20:18 +02:00
Georgi Gerganov
9c80bbfba8 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 00:20:18 +02:00
Georgi Gerganov
ee3d399357 ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 00:20:18 +02:00
Georgi Gerganov
b5b0841322 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 00:20:18 +02:00
Georgi Gerganov
80acf764cd metal : consolidate unary ops (llama/19490) 2026-02-15 00:20:18 +02:00
Xuan-Son Nguyen
ef84715232 test: fix IMROPE perf test case (llama/19465) 2026-02-15 00:20:18 +02:00
Georgi Gerganov
b07ff83232 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 00:20:18 +02:00
Jeff Bolz
80fab56064 tests: reduce number of FA test permutations (llama/19381)
Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).
2026-02-07 10:37:38 +02:00
Jeff Bolz
8ab9ddbf6b vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-07 10:37:38 +02:00
Georgi Gerganov
8ecbda9b21 tests : add non-cont, inplace rope tests (llama/19296)
* tests : add non-cont, inplace rope tests

* cont : exercise dim 3

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* cont : more dim3 exercises

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2026-02-07 10:37:38 +02:00
Aman Gupta
218a005716 ggml-cpu: FA split across kv for faster TG (llama/19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-07 10:37:38 +02:00
Georgi Gerganov
6203de9940 tests : add GQA=20 FA test (llama/19095) 2026-02-07 10:37:38 +02:00
Johannes Gäßler
43e6c668ee CUDA: fix padding of GQA to power of 2 in FA (llama/19115) 2026-01-30 13:49:29 +02:00
Georgi Gerganov
91d417c595 mla : make the V tensor a view of K (llama/18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 13:49:29 +02:00
Jeff Bolz
541c0510c7 vulkan: support flash attention GQA/split_k with small batches (llama/18938) 2026-01-30 13:49:29 +02:00
Thore Koritzius
32201fa57c ggml : extend ggml_pool_1d + metal (llama/16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 13:49:29 +02:00
Oliver Simons
f3a50ebb5b CUDA: Factor out and re-use block_reduce function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also adresss other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-30 13:49:29 +02:00
Jeff Bolz
803ae58a04 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-13 19:09:03 +02:00
Aman Gupta
d8faba2711 test-backend-ops: fix mxfp4 tests on blackwell (llama/18736) 2026-01-11 11:02:08 +02:00
Jeff Bolz
4a25293eb0 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582) 2026-01-11 11:02:08 +02:00
Jeff Bolz
0624ce207f vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-11 11:02:08 +02:00
Chenguang Li
34e1e6cd3a CANN: add operator fusion support for ADD + RMS_NORM (llama/17512)
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).

Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable

The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
2026-01-11 11:02:08 +02:00
Daniel Bevenius
175f6455f8 sampling : add support for backend sampling (llama/17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilites need to be
transferred back to system memory for further processing by CPU
samplers.

Currently the backend sampling works in a similar manner to how
pooling works, it is a function that is called by build_graph and the
sampler operations become part of the models computation graph.

* llama-cli : add backend sampler configuration

* server : add backend sampling options/configuration

* webui : add backend sampling options

* ggml : add initial cumsum implementation for CUDA

* sampling : enable all backend sampler tests

This commit enables all exisiting backend sampler tests in the
test-backend-sampler. Previously, some tests were disabled because
there were missing ggml operation implementations.

* graph : do not include llama-model.h

* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that this enables both common/sampling.cpp
and src/llama-sampling.cpp can simplify their logic.

Not all backends samplers that process logits need to set the
sampled_tokens_id as they may not change the order of the logits, for
example the temperature sampler only scales the logits but does not
change their order. Simliar the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device to host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is used for top-k currently. WE optimize argsort by 2 things:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entrypoint (the function chooses different execution paths,
   it contains `DeviceSegmentedRadixSort` as one of the paths and will
   choose the best one according to heuristics.
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview

Some perf numbers for a RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```
2026-01-11 11:02:08 +02:00
Jeff Bolz
1fae2224fb vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-11 11:02:08 +02:00
Jeff Bolz
256ba84f06 vulkan: Support UPSCALE w/antialias (llama/18327) 2025-12-31 12:39:43 +02:00
Jeff Bolz
d063a199a4 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 12:39:43 +02:00
Jeff Bolz
582cf26561 vulkan: Extend rope fusions to allow mrope (llama/18264)
Extend the test-backend-ops tests as well.
2025-12-31 12:39:43 +02:00
Jeff Bolz
e68c6fb7af vulkan: fix im2col overflowing maxworkgroupcount (llama/18180) 2025-12-31 12:39:43 +02:00
Jeff Bolz
0fb429250e vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.

CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
2025-12-31 12:39:43 +02:00
Jeff Bolz
9287ae2aa8 tests: Avoid floating point precision false positives in SUM (llama/17471)
* tests: Avoid floating point precision false positives in SUM

* also apply to test_mean
2025-12-31 12:39:43 +02:00
Jeff Bolz
68b14a56e2 test-backend-ops: improve msvc build time (llama/18209) 2025-12-31 12:39:43 +02:00
Xuan-Son Nguyen
cdb1e3f3c0 model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-31 12:39:43 +02:00
Jeff Bolz
b9730e5719 vulkan: Multi-pass softmax for large number of cols (llama/17892)
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
2025-12-14 16:40:47 +02:00
Jeff Bolz
27b5a372b4 vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872) 2025-12-14 16:40:47 +02:00
Piotr Wilkin (ilintar)
43926a80a7 SOLVE_TRI extension to more dimensions (llama/17793)
* Extended TRI

* Fix whitespace

* chore: update webui build output

* Just use cuBLAS for everything...

* Merge both versions

* Remove incorrect imports causing failures for CI

* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"

* Defines for hipBlas

* Aaaand MUSA defines...

* I hate this job...

* Stupid typo...

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-14 16:40:47 +02:00
Georgi Gerganov
4767bda161
ggml : remove GGML_KQ_MASK_PAD constant (llama/17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-11 15:33:01 +02:00
Gabe Goodhart
019442facd
metal: SSM kernel improvements (llama/17876)
* feat: Add a batched version of ssm_conv

This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Optimized SSM_SCAN kernel for metal

This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Add test-backend-ops perf tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Real representitive tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use function constant for ssm_conv batch size

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: backend op tests for ssm_scan from granite4 1b-h

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: remove commented out templates

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: float4 version of ssm_conv_batched

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing ggml_metal_cv_free

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-11 15:33:00 +02:00
Piotr Wilkin (ilintar)
9652e04ad1
Add DIAG for CUDA (llama/17873)
* Add DIAG for CUDA

* Refactor parameters
2025-12-11 15:33:00 +02:00
Phylliida Dev
fba011ed76
ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985)
* Feat: Added vulkan circular tiling support

* Feat: Added cpu circular

* Feat: Added cuda kernels

* Added tests

* Added tests

* Removed non-pad operations

* Removed unneded changes

* removed backend non pad tests

* Update test-backend-ops.cpp

* Fixed comment on pad test

* removed trailing whitespace

* Removed unneded test in test-backend-ops

* Removed removed test from calls

* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam <picard12@live.de>

* Fixed alignment

* Formatting

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Format pad

* Format

* Clang format

* format

* format

* don't change so much stuff

* clang format and update to bool

* fix duplicates

* don't need to fix the padding

* make circular bool

* duplicate again

* rename vulkan to wrap around

* Don't need indent

* moved to const expr

* removed unneded extra line break

* More readable method calls

* Minor wording changes

* Added final newline

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Added circular pad ext tests

* Gate non circular pad devices

* Cleaned gating of non-circular pad devices

---------

Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-11 15:32:58 +02:00
Jeff Bolz
0124b66315
vulkan: support solve_tri with larger N/K values (llama/17781)
Split N into chunks to fit into shared memory.
If K > 128, use a larger workgroup with enough invocations.
Add perf tests matching qwen3next.
2025-12-11 15:32:57 +02:00
Jeff Bolz
e95bb22ddd
vulkan: fix top_k bug when there are ties in the input (llama/17659)
* vulkan: Reduce temporary memory usage for TOP_K

- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.

* vulkan: fix top_k bug when there are ties in the input

I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).

I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-11 15:32:56 +02:00
Acly
a61f64a017
vulkan : support conv-2d with large output size (llama/17685) 2025-12-11 15:32:56 +02:00
Piotr Wilkin (ilintar)
3571e69bdd
Add support for CUMSUM and TRI for CUDA. (llama/17584)
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-11 15:32:54 +02:00
Reese Levine
e4b2ee160e
ggml webgpu: add support for emscripten builds (llama/17184)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-12-11 15:32:53 +02:00
Tarek Dakhran
1ab5b45b8a
model: LFM2-VL fixes (llama/17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
2025-12-11 15:32:50 +02:00
Jeff Bolz
c12559bcda
vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582) 2025-12-11 15:32:49 +02:00