ik_llama.cpp: llama.cpp fork with better CPU performance

License: MIT

TL;DR

This repository is a fork of llama.cpp offering better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, improved DeepSeek performance via MLA, FlashMLA, fused MoE operations, tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, and more.
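
To make the hybrid GPU/CPU workflow concrete, here is a minimal sketch of how tensor overrides might be combined with flash attention and fused MoE ops when serving a large MoE model. The model filename is hypothetical, and the exact flag names and regex syntax (`-ot`, `-fa`, `-fmoe`) should be verified against `--help` in your build:

```shell
# Sketch: keep attention and shared layers on the GPU while pinning the large
# MoE expert tensors to CPU RAM via a regex-based tensor override.
# Flag spellings are assumptions; run ./llama-server --help to confirm.
./llama-server \
    -m DeepSeek-V3-IQ4_K.gguf \   # hypothetical model file
    -ngl 99 \                     # offload all layers that fit to the GPU
    -ot "exps=CPU" \              # tensor override: experts stay in CPU RAM
    -fa -fmoe \                   # flash attention + fused MoE operations
    -c 16384                      # context size
```

The idea is that the attention weights (small, latency-critical) live on the GPU, while the expert tensors (large, touched sparsely) stay in system RAM, letting models far larger than VRAM run at a usable speed.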

Latest News

  • April 29 2025: Qwen3 support added
  • April 26 2025: GLM-4 support added
  • April 26 2025: Command-A support added
  • April 22 2025: Support for the latest Microsoft Bitnet model added
  • April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux)
  • April 17 2025: Better CPU Flash Attention token generation performance
  • April 13 2025: IQ1_M quantization improvements
  • April 10 2025: LLaMA-4 support added
  • April 7 2025: IQ2_XS quantization improvements
  • April 3 2025: Much faster MoE implementation on Metal
  • April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1
  • March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL
  • March 25 2025: Better MoE performance on CUDA
  • March 23 2025: Better batched processing speed for DeepSeek models
  • March 22 2025: Gemma3 support added
  • March 21 2025: FlashMLA-3: fastest CPU-only inference for DeepSeek models
  • March 18 2025: Reduced compute buffer size
  • March 17 2025: FlashMLA-2 performance improvements
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
  • March 10 2025: Better TG performance for MoE models on CUDA
  • March 9 2025: FlashMLA on CUDA
  • March 8 2025: Faster FlashMLA CPU implementation
  • March 7 2025: Custom quantization mixes using regular expressions
  • March 5 2025: FlashMLA on CUDA
  • March 3 2025: Introducing FlashMLA - MLA with Flash Attention
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference
  • Feb 27 2025: MLA without transposed cache
  • Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU)
  • Feb 23 2025: Fused FFN ops for faster MoE inference
  • Feb 23 2025: sweep-bench - better performance benchmarking
  • Feb 20 2025: Fast GEMM/GEMV for IQ1_S
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA
  • Feb 11 2025: Flash Attention support for DeepSeek models
  • Feb 9 2025: MLA for DeepSeek models
  • Jan 23 2025: DeepSeek-V3 support added
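
One recurring theme in the list above is quantization control, e.g. the March 7 2025 item on custom quantization mixes using regular expressions. As a hedged sketch of what such an invocation could look like — the `--custom-q` flag name, its `regex=type` syntax, and the file names are assumptions, so check `--help` in your build for the actual interface:

```shell
# Sketch: quantize a model with a custom per-tensor mix, overriding the base
# type (IQ4_XS) for tensors whose names match the given regexes.
# Flag name and syntax are assumptions; see ./llama-quantize --help.
./llama-quantize \
    --custom-q "attn_v=q6_K,ffn_down=q5_K" \  # hypothetical regex->type pairs
    model-f16.gguf model-custom.gguf IQ4_XS
```

The motivation for per-tensor mixes is that some tensors (e.g. attention value projections) are more sensitive to quantization error than others, so spending extra bits only where it matters improves quality at nearly the same file size.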

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests often contain detailed information, so browsing them is the best way to learn about new features and how to use them. In addition:

  • The Wiki page has performance comparisons to mainline llama.cpp
  • This guide is a good place to start if you came here because of DeepSeek models
  • This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
  • This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), and general discussions are welcome.

License

MIT