mirror of
https://github.com/ggerganov/llama.cpp
synced 2026-03-09 08:39:41 +01:00
Avoid the xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8 inputs to FP16 and using FP16×FP16→FP32 MMA. This removes the post-processing overhead and improves performance.

Performance impact: 1.5–2x improvement in prompt-processing speed for Q4 and Q8 models, measured with llama-bench and llama-batched-bench.

- Q8 model: granite-4.0-h-micro-Q8_0.gguf (from Hugging Face)
- Q4 model: Meta-Llama3-8b Q4 model (generated with llama-quantize from the f32 model)

llama-bench Q8 model results:

| model | size | params | backend | threads | test | Base t/s | Patch t/s |
|---|---|---|---|---|---|---|---|
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp8 | 64.48 ± 4.72 | 73.99 ± 0.27 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 80.11 ± 0.32 | 112.53 ± 0.40 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 89.10 ± 0.27 | 152.95 ± 0.68 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 93.65 ± 0.25 | 187.83 ± 0.83 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 99.93 ± 0.02 | 201.32 ± 0.11 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 102.32 ± 0.40 | 208.32 ± 0.41 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp512 | 103.42 ± 0.40 | 209.98 ± 0.14 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg128 | 20.35 ± 0.01 | 19.57 ± 0.01 |

llama-bench Q4 model results:

| model | size | params | backend | threads | test | Base t/s | Patch t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp8 | 34.77 ± 0.10 | 41.23 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.81 ± 0.04 | 64.55 ± 0.15 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.65 ± 0.05 | 90.84 ± 0.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.49 ± 0.03 | 114.39 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 49.29 ± 0.24 | 120.13 ± 0.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 49.77 ± 0.23 | 121.51 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 49.89 ± 0.23 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.40 ± 0.01 | 13.37 ± 0.00 |

Llama perplexity results:

| Model | Base final PPL estimate | Patch final PPL estimate |
|---|---|---|
| granite-4.0-h-micro-Q8_0 | 1.3862 ± 0.04424 | 1.3868 ± 0.04432 |
| Meta-Llama3-8b Q4 | 1.3801 ± 0.04116 | 1.3803 ± 0.04116 |
Signed-off-by: Shalini.Salomi.Bodapati <Shalini.Salomi.Bodapati@ibm.com>