llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-04-18 21:26:07 +02:00

Katostrofik b1be68e8ca [SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM (#21638)	2026-04-16 08:34:05 +03:00

* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM

  The Q8_0 reorder optimization (#21527) was missing a reorder-aware dequantizer for the GEMM code path used during prompt processing. After token generation reordered Q8_0 weights (via DMMV/MMVQ), the next prompt processing pass would read them with the standard dequantizer, producing garbage output.

  Add dequantize_block_q8_0_reorder() and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the pattern already used by Q4_0, Q4_K, and Q6_K.

  Fixes #21589

  AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware.

* SYCL: fix reorder crash when device memory is full

  The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device. When VRAM is nearly full (large models on a single GPU), this allocation fails and the subsequent memcpy crashes on a NULL pointer.

  Fix: try device allocation first, fall back to host memory if device memory is full. The reorder kernel still works correctly reading from host memory over PCIe. This is slower for the one-time reorder (~21 t/s vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for all subsequent inference. If both device and host allocation fail, skip the reorder and fall back to the unoptimized kernel path.

  Also fixes a bug where opt_for_reorder() marked tensors as reordered even when the reorder was skipped due to allocation failure. This caused DMMV/MMVQ kernels to read the original AoS data as if it were SoA, producing garbage output or NaN results.

  Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was AI-assisted (Claude), reviewed and tested on hardware by a human.

  Fixes #20478

* SYCL: add RAII temp buffer class + macro guard for host fallback

  Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free functions with sycl_reorder_temp_buffer RAII class. The host_fallback bool is now a private member, and cleanup happens automatically at scope exit.

  Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard the host memory fallback code path. Device access to host memory requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

  Addresses arthw's review on PR #21638.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K

  Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not DMMV. When the DMMV path encountered reordered data it would abort. Add DMMV kernels that read from the SOA reorder layout for both types. Same math as the non-reorder versions, different memory access pattern.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
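
The allocation fallback and the "only mark as reordered on success" rule described in the commit message can be illustrated with a small host-only sketch. This is not the code from the PR: `reorder_temp_buffer`, `tensor_sketch`, `try_reorder`, and `alloc_fn` are hypothetical names, the "device" and "host" allocators are plain callables backed by `std::malloc` so the example runs without a SYCL toolchain, and the block layout (a 2-byte scale followed by 32 one-byte quants, reordered to all quants first and all scales after) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical allocator type: returns nullptr on failure. Both allocators in
// this sketch must hand out std::malloc memory; a real SYCL version would
// allocate and free through the SYCL USM API instead.
using alloc_fn = std::function<void *(std::size_t)>;

// RAII temporary buffer: tries the "device" allocator first, then the "host"
// one. The memory is released automatically at scope exit.
class reorder_temp_buffer {
public:
    reorder_temp_buffer(std::size_t size, const alloc_fn & device_alloc, const alloc_fn & host_alloc) {
        ptr_ = device_alloc(size);
        if (ptr_ == nullptr) {
            ptr_ = host_alloc(size);
            host_fallback_ = (ptr_ != nullptr);
        }
    }
    ~reorder_temp_buffer() { std::free(ptr_); }
    reorder_temp_buffer(const reorder_temp_buffer &) = delete;
    reorder_temp_buffer & operator=(const reorder_temp_buffer &) = delete;

    void * get() const { return ptr_; }
    bool valid() const { return ptr_ != nullptr; }
    bool on_host() const { return host_fallback_; }

private:
    void * ptr_ = nullptr;
    bool host_fallback_ = false;
};

// Assumed block layout for this sketch: 2-byte scale + 32 one-byte quants.
constexpr std::size_t QK = 32;
constexpr std::size_t SCALE_BYTES = 2;
constexpr std::size_t BLOCK_BYTES = SCALE_BYTES + QK;

struct tensor_sketch {
    std::vector<std::uint8_t> data;  // whole blocks, AoS until reordered
    bool reordered = false;          // set only after a successful reorder
};

// Rewrites t.data from AoS (scale, quants per block) to SoA (all quants, then
// all scales). Returns false and leaves the tensor untouched if no temporary
// buffer could be allocated.
bool try_reorder(tensor_sketch & t, const alloc_fn & device_alloc, const alloc_fn & host_alloc) {
    const std::size_t size = t.data.size();
    assert(size % BLOCK_BYTES == 0);
    const std::size_t nblocks = size / BLOCK_BYTES;

    reorder_temp_buffer tmp(size, device_alloc, host_alloc);
    if (!tmp.valid()) {
        return false;  // skip the reorder; the flag must stay false
    }

    std::memcpy(tmp.get(), t.data.data(), size);
    const auto * src = static_cast<const std::uint8_t *>(tmp.get());
    std::uint8_t * qs_out = t.data.data();
    std::uint8_t * d_out = t.data.data() + nblocks * QK;

    for (std::size_t i = 0; i < nblocks; ++i) {
        const std::uint8_t * block = src + i * BLOCK_BYTES;
        std::memcpy(d_out + i * SCALE_BYTES, block, SCALE_BYTES);
        std::memcpy(qs_out + i * QK, block + SCALE_BYTES, QK);
    }

    t.reordered = true;
    return true;
}
```

A caller would pass two allocators, for example one that always returns `nullptr` to simulate full device memory and one that wraps `std::malloc`, and then choose the reorder-aware or the standard kernel path based on `tensor_sketch::reordered` rather than on whether a reorder was merely attempted.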
..
snapdragon	hexagon: add support for linux on snapdragon (#21707)	2026-04-10 15:57:23 -07:00
VirtGPU	ggml-virtgpu: Fix some build commands (#20341)	2026-03-12 15:47:45 +08:00
BLIS.md	make : deprecate (#10514)	2024-12-02 21:22:53 +02:00
CANN.md	CANN: update docker images to 8.5.0 and improve CANN.md (#20801)	2026-03-27 08:53:00 +08:00
CUDA-FEDORA.md	docs: update: improve the Fedoa CUDA guide (#12536)	2025-03-24 11:02:26 +00:00
OPENCL.md	docs: add linux to index (#18907)	2026-01-18 18:03:35 +08:00
OPENVINO.md	docs : fix broken link to ggml-openvino in OPENVINO.md (#21709)	2026-04-10 09:50:08 +02:00
SYCL.md	[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM (#21638)	2026-04-16 08:34:05 +03:00
VirtGPU.md	ggml-virtgpu: improve the reliability of the code (#19846)	2026-02-26 20:00:57 +08:00
zDNN.md	ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)	2025-12-07 00:13:33 +08:00
ZenDNN.md	ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315)	2026-04-03 12:19:08 +03:00