* Add model metadata loading from huggingface for use with other tests
* Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails
* Add support for split models, load metadata from each individual split file, also avoid mmproj
* Code cleanup, revert incremental downloading
* Only compile when cpp-httplib has SSL support
* Fix formatting
This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, which was updated in Commit
142cbe2ac6 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").
The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem not available in the
ubuntu-slim runner. This is currently causing an error in the workflow
and preventing the gguf-publish workflow from running successfully.
Today was the the first time since the original change (I think) that
publish task has been run which may be why the issue was not noticed
before.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
The binary relies on model files that it tries to find. However, when
configuring the build directory to be parallel to the source tree those
heuristics fail.
This sets the working directory for the test executable to be the
source-tree which resolves this issue.
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN‑2026‑WW08)
- add static lib support in CMake
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```
these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend sychronize` and `butf
get_max_size` would have segfaulted the backend if not implemented.
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limed to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
* server: fix query params lost when proxying requests in multi-model router mode
* server: re-encode query params using httplib::encode_query_component in proxy
* vulkan: allow using fp16 in scalar flash attention shader
* split rows inside of subgroups for faster synchronization
* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
* use f32 scalar FA if f16 is not supported by device
* fix amd workgroup size issue
* optimize masksh use
* add medium rows FA shader Br size
* fixes
* add padding to mask shmem buffer
* cache q values into registers for KQ
* fuse lf accumulation, pf and v accumulation into a loop
* stage K loads through shmem
* stage V loads through shmem
* only stage through shmem on Nvidia
* default to Bc 32
* also stage V through shmem when this is done for K
* dynamic subgroups for intel
* use vectorized stores
* use float_type for dequantize4 functions
* use smaller scalar rows size for smaller rows count
* relax flash attention split_k condition to allow non-gqa use
* use minimal subgroup size on Intel
* fix shmem support function
* fix rebase issues
* fixes
* Bc 4 for scalar FA is not a valid configuration
* Use wave32 on AMD RDNA for scalar FA
* add Intel shader core count lookup-table
* fix regressions
* device tuning
* tmpsh size fix
* fix editorconfig
* refactor fa tuning logic into a single place
* fix gqa opt logic
* fix block_rows with small n_rows
* amd tuning
* fix hsk=72/80 issue
* tuning
* allow condition skipping for column check
* use float16 for Of if available
* address feedback
* fix bad RDNA performance on head size <= 128 by limiting occupancy
* allow printing pipeline stats
* cleanup and fixes
* limit occupancy for GCN for small batch FA with large HSK
* disable f16 FA for GCN AMD GPUs on the proprietary driver
* hexagon: refactor set/get/sum-rows ops to use local context
* hexagon: refactor ROPE and Softmax Ops to use local context
Improves performance a bit by precomputing things and saving in the context.
* hexagon: refactor activation ops to use local context struct
* hexagon: refactor unary ops to use local context struct and DMA/VTCM
* hexagon: use aligned hvx_scale function
* hexagon: remove unused fields from op_context
* hexagon: rewrite ROPE to use DMA and VTCM scratchpad
* hex-rope: keep N rows in scratchpad (instead of just two)
* hex-rope: introduce rowidx cache
* hex-rope: remove unused fields
* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute
also removes the need for fastdiv.
* hex-rope: minor formatting
* hex-rope: use indices and unroll the loops
* hex-rope: more updates to cleanup rope-block handling
* hexagon: cleanup supported type/dims checks
* hexagon: all reduce funcs replicated across lanes
There is no need to explicitly replicate the first value.
* snapdragon: update adb and windows scripts to use ubatch-size 256
Updated Ops support handles larger ubatches.
This commit replaces/merges the inspect-org-model.py script with the
contents tensor-info.py script. The merged script has also been updated
to also print tensor sizes which was the only thing that was not done
before (by tensor-info.py that is).
The motivation for this is that tensor-info.py does not load the tensor
weights which can be time consuming for larger models. And also now that
both are doing almost the same thing it makes sense to just have one and
not two scripts to maintain.