llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-04-25 21:14:49 +02:00

Author	SHA1	Message	Date
Katostrofik	b1be68e8ca	[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM (#21638 ) * [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM The Q8_0 reorder optimization (#21527) was missing a reorder-aware dequantizer for the GEMM code path used during prompt processing. After token generation reordered Q8_0 weights (via DMMV/MMVQ), the next prompt processing pass would read them with the standard dequantizer, producing garbage output. Add dequantize_block_q8_0_reorder() and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the pattern already used by Q4_0, Q4_K, and Q6_K. Fixes #21589 AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware. * SYCL: fix reorder crash when device memory is full The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device. When VRAM is nearly full (large models on a single GPU), this allocation fails and the subsequent memcpy crashes on a NULL pointer. Fix: try device allocation first, fall back to host memory if device memory is full. The reorder kernel still works correctly reading from host memory over PCIe. This is slower for the one-time reorder (~21 t/s vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for all subsequent inference. If both device and host allocation fail, skip the reorder and fall back to the unoptimized kernel path. Also fixes a bug where opt_for_reorder() marked tensors as reordered even when the reorder was skipped due to allocation failure. This caused DMMV/MMVQ kernels to read the original AoS data as if it were SoA, producing garbage output or NaN results. Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was AI-assisted (Claude), reviewed and tested on hardware by a human. Fixes #20478 * SYCL: add RAII temp buffer class + macro guard for host fallback Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free functions with sycl_reorder_temp_buffer RAII class. The host_fallback bool is now a private member, and cleanup happens automatically at scope exit. Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard the host memory fallback code path. Device access to host memory requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it. Addresses arthw's review on PR #21638. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not DMMV. When the DMMV path encountered reordered data it would abort. Add DMMV kernels that read from the SOA reorder layout for both types. Same math as the non-reorder versions, different memory access pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:34:05 +03:00
Neo Zhang	ecac98ee53	[SYCL] Update SYCL.md for binary package for Windows (#20401 ) * add download binary package * update prefix	2026-03-11 22:21:22 +08:00
Neo Zhang	213c4a0b81	[SYCL] supprt Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190 ) * support flash-attention for fp32/fp16/Q4/Q5/Q8 * rm warining * update for JIT	2026-03-08 12:00:07 +08:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
Maciej Lisowski	e99f1083a0	docs: Fix broken links for preparing models in Backends (#19684 )	2026-02-18 23:50:23 +08:00
Neo Zhang	bf38346d13	Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. (#19246 ) User can't build up the software for Nvidia & AMD GPU. rm the oneMath since it is only used in NV and AMD code path.	2026-02-02 21:06:21 +08:00
Neo Zhang	2634ed207a	create test.sh to enhance the parameters for testing, update the guide, rm useless script (#19243 )	2026-02-01 18:24:00 +08:00
DDXDB	d284baf1b5	Fix typos in SYCL documentation (#19162 ) * Fix typos in SYCL documentation * Update SYCL.md * Update SYCL.md * Update SYCL.md * Update docs/backend/SYCL.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * Update SYCL.md --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 09:46:57 +08:00
Jesse Ikonen	1ce0126b18	docs: Fix typos in SYCL documentation (#18269 )	2025-12-24 17:19:47 +08:00
Francisco Herrera	279cef27c2	added note for old Intel hardware pre sycl (#18017 ) * added note for old Intel hardware pre sycl Older hardware used opencl * typo * use consistent terms	2025-12-16 17:45:09 +08:00
Neo Zhang	7d2add51d8	sycl : support to malloc memory on device more than 4GB, update the doc and script (#17566 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-11-29 14:59:44 +02:00
Neo Zhang Jianyu	2be72c2b12	SYCL: Update to oneAPI 2025.2 (#16371 ) * update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-02 10:16:25 +03:00
Anton Mitkov	2bf9d539dd	sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973 )	2025-06-25 18:09:55 +02:00
Alberto Cabrera Pérez	725f23f1f3	sycl : backend documentation review (#13544 ) * sycl: reviewing and updating docs * Updates Runtime error codes * Improves OOM troubleshooting entry * Added a llama 3 sample * Updated supported models * Updated releases table	2025-05-19 14:38:20 +01:00
Łukasz Ślusarczyk	9c404ed54c	sycl: use oneDNN for matrices multiplication (#12972 )	2025-05-15 16:53:41 +02:00
Romain Biessy	8ed71242f4	sycl: update documentation to use -no-cnv (#12845 )	2025-04-09 11:22:04 +02:00
Nicolò Scipione	94148ba330	sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625 )	2025-04-04 16:00:46 +02:00
Atharva Dubey	2004644b7a	ci : add env variable in ggml-ci and document the same in SYCL.md (#12736 )	2025-04-03 15:12:39 +03:00
Romain Biessy	8293970542	SYCL: Rename oneMKL to oneMath (#12192 ) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections	2025-04-01 16:24:29 +08:00
Svetlozar Georgiev	9ffcc9e374	sycl: cleanup oneDNN related code (#12097 )	2025-03-21 10:15:56 +08:00
Łukasz Ślusarczyk	35cae5ba05	SYCL: using graphs is configurable by environment variable and compile option (#12371 ) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-18 11:16:31 +01:00
Neo Zhang Jianyu	08d5986290	[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035 ) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-24 22:33:23 +08:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
Jafar Uruç	a07c2c8a52	docs : Update readme to build targets for local docker build (#11368 )	2025-01-24 14:30:13 +01:00
Neo Zhang Jianyu	ad21c9e1f1	update rel to 4040 (#10395 ) Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-11-20 13:54:25 +08:00
Romain Biessy	2a1507c162	sycl : Add option to set the SYCL architecture for all targets (#10266 ) * Add option to set the SYCL architecture for all targets * Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option * Document that setting GGML_SYCL_ARCH can improve the performance	2024-11-19 08:02:23 +00:00
Romain Biessy	5a54af4d4f	sycl: Use syclcompat::dp4a (#10267 ) * sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit `90cb61d692`.	2024-11-15 11:09:12 +08:00
Zhiyuan Li	3bcd40b3c5	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133 ) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and tranfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more inline with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-07 15:19:10 +08:00
Alberto Cabrera Pérez	f536f4c439	[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658 ) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-02 13:57:18 +01:00
Neo Zhang Jianyu	faf67b3de4	[SYCL]set context default value to avoid memory issue, update guide (#9476 ) * set context default to avoid memory issue, update guide * Update docs/backend/SYCL.md Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-09-18 08:30:31 +08:00
蕭澧邦	cddae4884c	Correct typo run_llama2.sh > run-llama2.sh (#9149 )	2024-08-30 22:10:01 +10:00
luoyu-intel	1731d4238f	[SYCL] Add oneDNN primitive support (#9091 ) * add onednn * add sycl_f16 * add dnnl stream * add engine map * use dnnl for intel only * use fp16fp16fp16 * update doc	2024-08-22 12:50:10 +08:00
Neo Zhang	a21c6fd450	update guide (#8909 ) Co-authored-by: Neo Zhang <>	2024-08-11 14:07:43 +05:30
Chen Xi	ed67bcb24f	[SYCL] fix multi-gpu issue on sycl (#8554 ) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-07-25 19:45:18 +08:00
Xuan Son Nguyen	be20e7f49d	Reorganize documentation pages (#8325 ) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections	2024-07-05 18:08:32 +02:00

35 Commits