ik_llama.cpp: llama.cpp fork with better CPU performance

License: MIT

TL;DR

This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA and FlashMLA, fused MoE operations, tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, and more.
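
A typical use of the tensor overrides for hybrid GPU/CPU inference is to keep attention and shared weights on the GPU while the routed MoE experts stay in system RAM. The sketch below is illustrative only: the model path and the override regex are placeholders, and the exact flag behavior is documented in PR 232 and PR 405.

```bash
# Offload all layers to the GPU (-ngl), then override the MoE expert
# tensors (names matching "exps") so they stay in RAM and run on the CPU.
# Paths, regex, and flag spellings are illustrative; see the linked PRs.
./llama-cli -m /models/deepseek-v3-IQ4_KS.gguf \
    -ngl 99 \
    -ot "exps=CPU" \
    -fa \
    -p "Hello"
```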

Latest News

Model Support

LLaMA-3-Nemotron PR 377, Qwen3 PR 355, GLM-4 PR 344, Command-A PR 341, bitnet-b1.58-2B-4T PR 337, LLaMA-4 PR 321, Gemma3 PR 276, DeepSeek-V3 PR 176

Quantization

Quantization additions

Trellis quants (IQ2_KT, IQ3_KT, IQ4_KT)

Information and the original CUDA implementation in PR 113. Additional implementations: Metal PR 475, Neon PR 471, CPU PR 441

IQK quants

Information can be found in Discussion 8.

Initial implementations (Zen4, AVX2, NEON): IQ5_KS_R4 PR 426, IQ5_KS PR 422, IQ4_KS_R4 PR 150, IQ5_K_R4 PR 149, IQ2_K_R4 PR 146, IQ3_K_R4 PR 145, IQ4_K_R4 PR 138, IQ4_KSS PR 89, IQ2_KS PR 85, IQ4_KS PR 83, IQ6_K PR 14, IQ2_K, IQ3_K and IQ5_K PR 7, IQ4_K PR 6

CUDA implementations: IQ4_KS_R4 and IQ5_KS_R4 PR 493, IQ1_S_R4 PR 492, IQ1_M_R4 PR 494, IQ4_KS_R4 and IQ5_KS_R4 PR 462, IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 PR 461, IQ4_K, IQ5_K, IQ6_K PR 417, IQ2_KS, IQ2_K, IQ3_K PR 418

Quantization improvements

IQ1_M PR 327, IQ2_XS PR 312, Q2_K, Q4_K, Q5_K, Q4_1, Q5_1 PR 302, Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL PR 295
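
As a rough sketch, these quantization types are produced with the usual llama-quantize tool; the file names below are placeholders, and an importance matrix is generally recommended for the low-bit types.

```bash
# Quantize an f16 GGUF to IQ4_KS using an importance matrix.
# Input/output paths and the imatrix file are placeholders.
./llama-quantize --imatrix imatrix.dat \
    model-f16.gguf model-IQ4_KS.gguf IQ4_KS
```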

Quantization performance improvements

  • Faster CPU prompt processing for Trellis quants and MoE models. PR 488
  • Trellis quants: faster CPU prompt processing PR 482.
  • Minor (~2%) IQ2_KS TG performance improvement on CUDA PR 468
  • Faster IQ3_KT and IQ4_KT PR 453
  • Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS PR 428
  • Fast GEMM/GEMV for IQ1_S PR 212

Features

  • Legacy quant conversion schemes in convert_hf_to_gguf.py PR 449; Q6_0 added in PR 483
  • June 8 2025: Webui updated (the legacy webui is still available when --path ./examples/server/public_legacy is passed; see the server example after this list) PR 481
  • June 8 2025: RPC improvements PR 480
  • June 7 2025: Added a server endpoint that lists all saved prompt caches PR 502
  • June 6 2025: Made prompt cache saving and restoring MLA-aware PR 497
  • June 3 2025: Added new samplers: XTC PR 486 and top-n σ PR 489
  • May 22 2025: Refactored iqk_mul_mat.cpp, which significantly reduces compilation time. PR 435
  • May 17 2025: Option to enable or disable the CPU FA kernels PR 429.
  • May 12 2025: Users can now control whether, and which, operations on tensors held in RAM are offloaded to the GPU. See PR 405
  • May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
  • April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux), see PR 336
  • March 7 2025: Custom quantization mixes using regular expressions PR 244
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference PR 239
  • Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU) PR 232
  • Feb 23 2025: sweep-bench - better performance benchmarking PR 225
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization PR 208
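
As mentioned in the webui item above, the legacy webui can still be served via the server's --path option. A minimal sketch (model path, host, and port are placeholders):

```bash
# Default: start llama-server with the updated webui (PR 481).
./llama-server -m model.gguf --host 127.0.0.1 --port 8080

# Alternative: serve the legacy webui assets instead.
./llama-server -m model.gguf --path ./examples/server/public_legacy
```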

Performance improvements

  • May 13 2025: Better CPU FA performance for DeepSeek-Lite. PR 410
  • May 11 2025: Slightly faster flash attention for DeepSeek models on CUDA, along with extending compatibility to Turing or newer GPUs. PR 408
  • May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR 370
  • April 17 2025: Better CPU Flash Attention token generation performance. PR 332
  • April 3 2025: Much faster MoE implementation on Metal. PR 307
  • March 25 2025: Better MoE performance on CUDA PR 283
  • March 23 2025: Better batched processing speed for DeepSeek models PR 282
  • March 18 2025: Reduce compute buffer size PR 237
  • March 10 2025: Better TG performance for MoE models on CUDA PR 248
  • Feb 23 2025: Fused FFN ops for faster MoE inference PR 229

Flash-MLA

  • May 7 2025: 🚀 FlashMLA-3 for DeepSeek models on CUDA. PR 386. Caveat: Ampere or newer Nvidia GPU required (a usage sketch follows this list)
  • March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models PR 273
  • March 17 2025: 🚀 FlashMLA-2 performance improvements PR 253
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA PR 265
  • March 9 2025: 🚀 FlashMLA on CUDA PR 247
  • March 8 2025: 🚀 Faster FlashMLA CPU implementation PR 243
  • March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention PR 240
  • Feb 27 2025: MLA without transposed cache PR 235
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA PR 206
  • Feb 11 2025: 🚀 Flash Attention support for DeepSeek models PR 200
  • Feb 9 2025: 🚀 MLA for DeepSeek models PR 188
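
A rough sketch of running a DeepSeek model with MLA and flash attention follows. The -mla level, -fa, and -ctk flags and the combinations they support are assumptions here and depend on the backend, GPU generation, and build; consult the PRs above for details.

```bash
# DeepSeek with FlashMLA and a Q8_0-quantized K cache.
# Flag names/values are illustrative; valid MLA/FA combinations depend
# on your hardware and build (see PR 240, 247, 273, 386).
./llama-cli -m /models/deepseek-r1.gguf \
    -mla 3 -fa \
    -ctk q8_0 \
    -p "Hello"
```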

Fixes

  • Fix bug in MMVQ kernel PR 446
  • Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K PR 427
  • Fix standard attention on the CPU PR 421
  • Fix imatrix calculation for MLA models PR 411
  • Fix new CUDA FA on Turing PR 413
  • Fix SER (Smart Expert Reduction). CPU: PR 415, CUDA: PR 416

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests usually contain detailed information, so browsing them is often the best way to learn about new features and how to use them. In addition:

  • The Wiki page has performance comparisons to mainline llama.cpp
  • This guide is a good place to start if you came here because of DeepSeek models
  • This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
  • This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.

License

MIT