mirror of
https://github.com/ggerganov/llama.cpp
synced 2026-03-09 08:39:41 +01:00
Avoid the xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8 inputs to FP16 and using FP16×FP16→FP32 MMA. This removes the post-processing overhead and improves performance.

Performance impact: 1.5–2x improvement in prompt-processing speed for Q4 and Q8 models, measured with llama-bench and llama-batched-bench.

- Q8 model: granite-4.0-h-micro-Q8_0.gguf (from Hugging Face)
- Q4 model: Meta-Llama3-8b Q4 model (generated with llama-quantize from the f32 model)

llama-bench Q8 model results:

| model | size | params | backend | threads | test | Base t/s | Patch t/s |
|---|---|---|---|---|---|---|---|
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp8 | 64.48 ± 4.72 | 73.99 ± 0.27 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 80.11 ± 0.32 | 112.53 ± 0.40 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 89.10 ± 0.27 | 152.95 ± 0.68 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 93.65 ± 0.25 | 187.83 ± 0.83 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 99.93 ± 0.02 | 201.32 ± 0.11 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 102.32 ± 0.40 | 208.32 ± 0.41 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp512 | 103.42 ± 0.40 | 209.98 ± 0.14 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg128 | 20.35 ± 0.01 | 19.57 ± 0.01 |

llama-bench Q4 model results:

| model | size | params | backend | threads | test | Base t/s | Patch t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp8 | 34.77 ± 0.10 | 41.23 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.81 ± 0.04 | 64.55 ± 0.15 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.65 ± 0.05 | 90.84 ± 0.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.49 ± 0.03 | 114.39 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 49.29 ± 0.24 | 120.13 ± 0.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 49.77 ± 0.23 | 121.51 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 49.89 ± 0.23 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.40 ± 0.01 | 13.37 ± 0.00 |

Llama perplexity results:

| Model | Base final PPL estimate | Patch final PPL estimate |
|---|---|---|
| granite-4.0-h-micro-Q8_0 | 1.3862 ± 0.04424 | 1.3868 ± 0.04432 |
| Meta-Llama3-8b Q4 | 1.3801 ± 0.04116 | 1.3803 ± 0.04116 |
Signed-off-by: Shalini.Salomi.Bodapati <Shalini.Salomi.Bodapati@ibm.com>