* nix: support unified apple-sdk
* Impl roll op for Metal
* Revert "nix: support unified apple-sdk"
This reverts commit abfa473360.
* update ops.md
* update op docs
* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM
The Q8_0 reorder optimization (#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.
Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.
Fixes#21589
AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
* SYCL: fix reorder crash when device memory is full
The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.
Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.
Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.
Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.
Fixes#20478
* SYCL: add RAII temp buffer class + macro guard for host fallback
Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.
Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.
Addresses arthw's review on PR #21638.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K
Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.
Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md
* hexagon: introduce op request batching and rewrite buffer managment
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grain per tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
* hexagon: limit l2 flushes to 1MB which covers l2 cache
* hex-opreq: limit cache flush to 4MB
Looks like 4MB cont. vitual space should cover the 1MB cache.
* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap all buffers
* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now
Nothing new just formalizing the repack / qyn.quant types we've been using.
* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers
* hex-buf: add support for allocating shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: beter naming for opbatch related classes and code
* hex-opbatch: reuse batched tensors with same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for vmem limit for op batching
* hex-opbatch: update htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
* hex-opbatch: fixed src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
* hex-mm: minor fix vtcm and dma handling in matmul
cleaning up some left-overs from merges
* hex-opbatch: allocate extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
* hex-mm: properly handle hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond 128KB threshold
* hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex hex backend will reject it.
* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
* hexagon: improved vtcm acquision to remove inter-op overhead
Fully compatible with QNN-HTP coex
* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.
* hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
* hex-opbatch: cleanup naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
* hexagon: fixed editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture.
This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.
* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
* ggml-zendnn : add braces to sgemm failure condition for consistency
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
* Obtain source tag name from git tag
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: update docker images to 8.5.0
- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0
Move to newer stable releases.
* cann: update CANN.md
* Update CANN.md to include BF16 support
Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.
* Fix formatting issues in CANN.md
Fix 234: Trailing whitespace
* mtmd: llama.cpp DeepSeekOCR support
init commit
* loading sam tensors
* mtmd: fix vision model processing
* deepseek-ocr clip-vit model impl
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully runs DeepSeek-OCR LM in llama-cli
* mtmd: Fix RoPE type for DeepSeek-OCR LM.
* loading LM
testing Vision model loading
* sam warmup working
* sam erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model
* concat image_newline and image_seperator tokens
* visual_model warmup (technically) works
* window partitioning using standard ggml ops
* sam implementation without using CPU only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scaler for get_rel_pos
* image encoding technically works but the output can't be checked singe image decoding fails
* mtmd: minor changed
* mtmd: add native resolution support
* - image encoding debugged
- issues fixed mainly related wrong config like n_patches etc.
- configs need to be corrected in the converter
* mtmd: correct token order
* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix token order
* mtmd: fix danling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L & first working DeepSeek-OCR model
* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code & remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* reverting automatically removed spaces
* reverting automatically removed spaces
* mtmd: fixed bad ocr check in Deepseek2 (LM)
* mtmd: support combined QKV projection in buid_vit
* using common build_attn in sam
* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make sam hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template
* fix: test-1.jpg ORC issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution
* minor: editconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn
* minor: editconfig-check fix
* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring, one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected resutls
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor
- added GLM-4.6V to big tests
- added missing deps for python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor python formatting
* fixed merge build issue
* merge resolved
- fixed issues in convert
- tested several deepseek models
* minor fix
* minor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
* - cleaning commented out code
* fixing instabilities issues reintroducing resize_bicubic_pillow
* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default jinja template (to be used by server)
* move test model to ggml-org
* rolling back upscale change
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5
and rebuild docs/ops.md via scripts/create_ops_docs.py.
Five ops were incorrectly marked as not supported (❌) for Metal:
- DIAG: ❌ → ✅
- POOL_1D: ❌ → ✅
- SET: ❌ → ✅
- SOLVE_TRI: ❌ → ✅
- GATED_DELTA_NET:❌ → 🟡 (partial, depends on head_size % 32)
* Update build doc
* Add cgraph tensor output name to OV op name
* Update openvino build instructions
* Add initial NPU support
* draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU work
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all device work
* Refactor: clean, fix warning
* Update clang-format
* Statful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* temp. changes for mark decomp
* matmul in fp32
* mulmat input conversion fix
* mulmat type conversion update
* add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: Treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant type: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simpilfy translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate nop ReshapeReshape
* STYLE: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* WA for npu 1st token acc issue
* Apply EliminateZP only for npu
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: not add attention_size_swa for non-swa model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* kvcachefusion support
* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* change graph to 4d, support multi sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp
Hitting error while compiling on windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.
Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node. Moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* fallback naive run with accuracy issue
* NPU support llma-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* minor update due to ov 2025.4
* remove unused API GgmlOvDecoder::get_output_names()
* remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU remove Q6_K requantization
* NPU fix wrong model output shape
* NPU fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* only use remote tensor for kvcache
* only use remote tensor for kvcache for GPU
* FIX: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* code cleanup
* npu perf fix
* requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcode names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcode names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accept non-ov backend buffer
* remove changes in llama-graph.cpp
* stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
* hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions.
* fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* fix editorconfig-checker
* Update docs
* updated model link to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initializing non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* unify api calling
* Update utils.cpp
* When the dim is dynamic, throw an error, need to is stastic forst
* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* use the shape from infer request of output tensor create to avoid issue
* fix dynamic output shape issue
* fix issue for the unused node in tests
* Remove unused lock
* Add comment
* Update openvino docs
* update to OV release version 2026.0
* add ci ov-gpu self hosted runner
* fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
* Rewrite the model inputs finding logistic
* Put stateful shape handle in get input shape
* Put the iteration logistic in func
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slice the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp. fix for test requant errors
* Update to OV ggml-ci to low-perf
* ci : temporary disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and Update docs
---------
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml-Vulkan: add ELU support
* ggml-Vulkan: remove extra spaces and variables
* ggml-Vulkan: fix format issue
* ggml-Vulkan: fix format issue
* fix whitespace issue
* Update Vulkan.csv and ops.md
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN‑2026‑WW08)
- add static lib support in CMake
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```
these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend sychronize` and `butf
get_max_size` would have segfaulted the backend if not implemented.
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limed to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes