* ggml-cpu: Use tiled FA for prompt-processing
the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
* Boilerplate for q5_Kx8 REPACK on ARM and fallback
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q5_K repack gemm and gemv generics
* Gemm and Gemv ARM implementations (i8mm)
* Improved qh manipulation looking at non-repack vec_dot implementation
* Full unroll
* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix wrong fallback definitions of Q5_K
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed comments. Reverted unnecessary formatting
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed typo in generic definitions
* Switching AND + Shift with Shift Insert. Better op interleaving.
* Vectorize + unroll the block scales
* Apply gemm optimizations to gemv
* Improve bias calculation
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* opencl: add `copy_to_contiguous` and utilize mm kernels
* opencl: only copy to cont for f32 and f16 tensors
* opencl: use cont mm for fallback when dst is large
* opencl: use nb local to copy-to-cont
* opencl: use local offset as well
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu
* CANN: fix an issue where get_env was not fully renamed
* ci: add cann with acl group
* ci: define use_acl_graph using GitHub Action
* ci: update cann dockerfile with acl graph
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: optimize OP gla
Optimize gla for high preformance
* Remove unused comments
---------
Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
Also integrate it into norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
Also adresss other requests by @am17an such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
Unit-tests passed locally, let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
This is more type-safe than plain enum
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:
https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
* Inline definitions
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).