* Fix delayed AllReduce on Gemma-4 MoE
Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
* Check for all sources before skipping nodes
* Address review comments
* Implemented optimized q1_0 dot for x86 and generic
* Removed redundant helper definition
* Removed two redundant instructions from AVX q1_0 dot
* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback
* Style cleanup around AVX q1_0 dot
* Replaced explicitly unrolled blocks with inner for loop for q1_0
* Replaced scalar ARM q1_0 impl with new generic one
* Merged properly, but q3_k and q5_k are slow with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes
The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.
I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has a
row bounds check (`if (row >= nrows) return;`), so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup,
so the collective reduce stays safe.
For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.
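As a rough illustration of the rounding (the helper name below is made up;
the real change lives inside the four launch helpers):

```cpp
#include <cassert>

// Hypothetical sketch of the fix: instead of asserting that the block
// count divides evenly into subgroup-sized workgroups, round it up.
static int pad_to_subgroups(int block_num_y, int subgroup_size) {
    return ((block_num_y + subgroup_size - 1) / subgroup_size) * subgroup_size;
}

int main() {
    assert(pad_to_subgroups(64, 16) == 64); // aligned: launch shape unchanged
    assert(pad_to_subgroups(65, 16) == 80); // unaligned: padded up one workgroup
    // Padded threads map to row >= nrows and hit the kernel's existing
    // early-exit, so they do no out-of-bounds work.
    return 0;
}
```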
Thanks to @arthw for flagging the relationship to #21527.
Fixes #22020.
AI-assisted coding; tested on Intel B70 hardware.
* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches
Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). This is a compile-time no-op on
the Intel target, where WARP_SIZE is 16, but it makes the relationship to
the subgroup size explicit. Per review by @NeoZhangJianyu on #22035.
Assisted by Claude.
* cache subgraph splits when cgraph is unchanged
Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.
Assign a uid to every sub-graph so that CUDA's fast uid-check path hits too.
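Roughly the shape of the caching, as a hedged toy sketch; the real fields
and split construction in ggml-backend-meta.cpp differ:

```cpp
#include <cstdint>
#include <vector>

struct cgraph   { /* nodes elided */ };
struct subgraph { uint64_t uid; }; // per-split uid feeds backend fast-uid checks

struct meta_state {
    const cgraph *        last_graph = nullptr; // graph seen on the previous call
    std::vector<subgraph> splits;               // splits built for it
};

static uint64_t next_uid() { static uint64_t n = 0; return ++n; }

static void graph_compute(meta_state & st, const cgraph * g) {
    if (g != st.last_graph) {
        // stand-in for the expensive per-call split construction
        st.splits     = { { next_uid() }, { next_uid() } };
        st.last_graph = g;
    }
    // ... dispatch st.splits to the backends ...
}

int main() {
    meta_state st;
    cgraph g;
    graph_compute(st, &g); // first call: builds splits and assigns uids
    graph_compute(st, &g); // same graph again: cached splits reused
    return 0;
}
```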
* Address review comments
* Keep the scope as is
* Rename the last_uid and last_n_subgraphs fields. Remove the last_max_tmp_size field. Refactor code.
* Address review comments
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.
Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.
Reported-by: oobabooga
Refs: #21630
Co-authored-by: texasich <texasich@users.noreply.github.com>
* rpc : refactor the RPC transport
Move all transport-related code into a separate file and use the
socket_t interface to hide all transport implementation details.
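For illustration, a minimal sketch of what hiding the transport behind such
an interface can look like (this is not the actual socket_t definition):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <memory>
#include <vector>

// Callers see only this interface; the transport implementation stays
// in one translation unit.
struct socket_t {
    virtual ~socket_t() = default;
    virtual bool send(const void * data, size_t size) = 0;
    virtual bool recv(void * data, size_t size) = 0;
};

// In-memory stand-in for a real TCP transport, for illustration only.
struct loopback_socket : socket_t {
    std::vector<unsigned char> buf;
    bool send(const void * data, size_t size) override {
        const unsigned char * p = static_cast<const unsigned char *>(data);
        buf.insert(buf.end(), p, p + size);
        return true;
    }
    bool recv(void * data, size_t size) override {
        if (buf.size() < size) return false;
        std::memcpy(data, buf.data(), size);
        buf.erase(buf.begin(), buf.begin() + size);
        return true;
    }
};

int main() {
    std::unique_ptr<socket_t> sock(new loopback_socket());
    int msg = 42, out = 0;
    sock->send(&msg, sizeof(msg));
    sock->recv(&out, sizeof(out));
    std::printf("%d\n", out); // protocol code never touched the transport
    return 0;
}
```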
* fix win32
* better socket_t construction
* Update workflows to remove dependence on llvmpipe
* Try setting Dawn_DIR
* remove c++20 initializers
* Move to proper guid
* Try avoiding segfaults on vulkan backend process exit
* Remove compiler warnings on parameter casting
* Fix soft_max and update reg_tile accumulation to f32 for better precision
* Refactor flash_attn a bit
* remove c++20 initializers and format
* Increase div precision for NVIDIA
* revert div precision and comment out ggml-ci node for now
* Formatting
* Try debugging on a failing CI node
* Revert "Try debugging on a failing CI node"
This reverts commit 1971e33cba.
* optimize hmx_mat_mul functions by calculating row and column tiles upfront
* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability
* wip
* set scale outside of loop
* wip
* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts
* wip
* wip
* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions
* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation
* wip
* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization
* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking
* refactor core_dot_chunk_fp16 to improve tile stride calculations for output
* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization
* fix compiling error
* wip
* optimize row and column tile indexing in core_mma_chunk_fp16 function
* wip
* Revert "wip"
This reverts commit cde679eff7.
* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions
* wip
* server: tests: fetch random media marker via /apply-template (#21962 fix)
* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var
get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.
Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.
Address review feedback from ngxson
* server: make get_media_marker() thread-safe via magic statics
Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.
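A minimal sketch of the pattern, assuming a hypothetical
make_random_marker() helper for the fallback:

```cpp
#include <cstdlib>
#include <random>
#include <string>

// Hypothetical stand-in for the random-marker generation, just to keep
// the sketch self-contained.
static std::string make_random_marker() {
    std::random_device rd;
    return "<__media_" + std::to_string(rd()) + "__>";
}

// The magic-static pattern: the lambda initializer runs exactly once,
// and the C++11 runtime serializes concurrent first calls, so no
// explicit lock is needed.
static const std::string & get_media_marker() {
    static const std::string marker = [] {
        if (const char * env = std::getenv("LLAMA_MEDIA_MARKER")) {
            return std::string(env); // pinned via env var, used as-is
        }
        return make_random_marker(); // otherwise fall back to a random marker
    }();
    return marker;
}

int main() {
    return get_media_marker().empty() ? 1 : 0;
}
```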
Address review feedback from ggerganov
* nits
* nits
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers
* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns
* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code