Commit Graph

3754 Commits

Author SHA1 Message Date
Iwan Kawrakow
c8cf128099 Add missing break 2025-06-09 15:18:11 +03:00
Iwan Kawrakow
336bcd4ad7 New iq2_kt: Metal - very slow.
It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.
2025-06-09 15:04:23 +03:00
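For reference, the operation the commit refers to is a horizontal sum of the four 8-bit lanes of a 32-bit word. A minimal scalar C++ sketch (illustrative only, not the Metal kernel) looks like this:

    #include <cstdint>

    // Horizontal sum of the four 8-bit lanes of a 32-bit word - the
    // "add 4 8-bit ints" mentioned above. Illustrative scalar fallback.
    static inline int sum4_u8(uint32_t x) {
        return int( x        & 0xff) + int((x >>  8) & 0xff) +
               int((x >> 16) & 0xff) + int((x >> 24) & 0xff);
    }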
Iwan Kawrakow
ab802a1881 New iq2_kt: slightly faster NEON GEMM 2025-06-09 12:58:27 +03:00
Iwan Kawrakow
b6069a4e12 New iq2_kt: NEON GEMM/GEMV 2025-06-09 12:20:42 +03:00
Iwan Kawrakow
6da24256b9 Adding forgotten file 2025-06-09 10:26:27 +03:00
Iwan Kawrakow
fcea0045dd New iq2_kt: AVX2 GEMM/GEMV 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
dbbc7eaec8 New iq2_kt: AVX2 dequantize 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
5ad250ce9d New iq2_kt: CUDA GEMV 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
44015ad83e Switching iq2_kt to new trellis - CUDA MMQ 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
f59fe11764 Adding forgotten file 2025-06-09 09:59:39 +03:00
Iwan Kawrakow
07efec0e4c Cleanup 2025-06-08 13:44:20 +03:00
Iwan Kawrakow
b30bfc1377 Remove the extra 4 bytes of row meta data that is no longer used 2025-06-08 13:36:47 +03:00
Iwan Kawrakow
314674df85 New iq4_kt trellis: not working Metal implementation 2025-06-08 11:47:55 +03:00
Iwan Kawrakow
1e6ff8a788 Minor 2025-06-08 10:29:40 +03:00
Iwan Kawrakow
d334cbf552 New iq4_kt: faster NEON
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
2025-06-08 10:05:49 +03:00
Iwan Kawrakow
b41f4718ad New iq4_kt: slightly faster NEON 2025-06-08 09:17:32 +03:00
Iwan Kawrakow
eed22154d1 New iq4_kt: slightly faster NEON 2025-06-08 08:43:02 +03:00
Iwan Kawrakow
8cc8b1e795 New iq4_kt: NEON implementation
We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
2025-06-08 08:31:56 +03:00
Iwan Kawrakow
9d7bf1c821 New iq4_kt: fix vanilla AVX2 2025-06-07 19:33:33 +03:00
Iwan Kawrakow
65e654a69d New iq4_kt: AVX2 dot product finally works
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.
2025-06-07 19:12:09 +03:00
Iwan Kawrakow
2b6acd5843 Fix iq2_kt that got broken along the way 2025-06-07 18:51:02 +03:00
Iwan Kawrakow
fb776ab7ba For now have only iq4_kt use the new trellis 2025-06-07 18:35:10 +03:00
Iwan Kawrakow
9b4103ed54 New iq4_kt: CUDA MMQ 2025-06-07 18:21:43 +03:00
Iwan Kawrakow
98f35bfaa2 New iq4_kt: CUDA MMVQ 2025-06-07 17:43:36 +03:00
Iwan Kawrakow
a296134434 Something is not working with the AVX2 dot product 2025-06-07 16:18:58 +03:00
Iwan Kawrakow
eb61f498d1 New iq4_kt trellis
The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.
2025-06-07 12:30:37 +03:00
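A hedged sketch of the generator described above (assuming idx is advanced by the same affine recurrence at each step; the function name and the ka/kb values passed in are illustrative placeholders, not the repository's actual constants):

    #include <cstdint>

    // Illustrative sketch of the integer trellis generator: mask each byte of
    // ka*idx + kb to 6 bits, sum the four bytes as uint8_t, subtract 126.
    // Each masked byte is in 0..63, so the sum is 0..252 and fits in int8_t.
    static inline int8_t trellis_next(uint32_t & idx, uint32_t ka, uint32_t kb) {
        idx = ka*idx + kb;               // affine step, wraps modulo 2^32 (assumed)
        uint32_t x = idx & 0x3f3f3f3f;   // keep the low 6 bits of each byte
        int sum = int( x        & 0xff) + int((x >>  8) & 0xff) +
                  int((x >> 16) & 0xff) + int((x >> 24) & 0xff);
        return int8_t(sum - 126);        // re-center around zero
    }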
Kawrakow
8ffad187ab
MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493)
* MMQ for iq4_ks_r4

* MMQ for iq5_ks_r4

* Add forgotten file

* Another forgotten file

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:31:20 +03:00
Kawrakow
0b10f7418f
Faster CPU prompt processing for Trellis quants and MoE models (#488)
* Also do the dequantize approach for mul_mat_id

* Also do the dequantize approach for iqk_moe_fused_up_gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:30:35 +03:00
Kawrakow
7e79665a31
CUDA implementation for IQ1_S_R4 (#492)
* iq1_s_r4: CUDA dequantize

* iq1_s_r4: CUDA GEMV

* iq1_s_r4: MMQ on CUDA

Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 07:24:31 +03:00
Kawrakow
f6d5fbdc57
Adding top-n-sigma sampler (#489)
* Adding top-n-sigma sampler

* Fix typos in XTC PR

* Update README.md for main and server

* More README

* More README

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 17:35:09 +03:00
Kawrakow
ccb265c016
Adding the XTC sampler (#486)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 11:32:03 +03:00
Nexes the Elder
4f8b05a0d7
convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483)
* Direct conversion from fp16 to Q6_0

* forgotten comma

* More precise info
2025-06-03 09:30:30 +03:00
Kawrakow
7a8abe29f7
Minor (~2%) iq2_ks TG performance improvement on CUDA (#468)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:33 +03:00
Kawrakow
3df1a3a44d
Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM

For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s

Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying by the row scale
before doing the multiply-adds (sketched below, after this entry).

* Experimenting with dequant + f16 GEMM on NEON

iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:05 +03:00
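To illustrate the overflow point above (a hedged sketch assuming a generic f16 type such as _Float16 on a supporting compiler; the function name and loop structure are not the repository's NEON kernel): the small per-row scale is folded into the dequantized values before the multiply-adds so the running f16 sum stays within f16 range.

    // Illustrative only: pre-scaling by the small row scale keeps f16 partial
    // sums far below the ~65504 f16 maximum instead of overflowing.
    static float dot_f16_prescaled(const int8_t * q, const _Float16 * x, int n, float row_scale) {
        _Float16 sum = 0;
        for (int i = 0; i < n; ++i) {
            _Float16 w = (_Float16)(row_scale * q[i]);  // scale before the multiply-add
            sum += w * x[i];
        }
        return (float)sum;
    }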
Kawrakow
35374bc7e8
Metal implementation for the trellis quants. (#475)
* iq2_kt: Metal dequantize

* iq2_kt: Metal GEMV

Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B

* iq3_kt: Metal dequantize

* iq3_kt: Metal GEMV

Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B.
Flipping signs is a costly affair.

* iq4_kt: Metal dequantize - getting NaNs

* iq4_kt: Metal GEMV - also not working

* iq4_kt: Metal still not working

* Disable iq4_kt on Metal for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:23:44 +03:00
Nexes the Elder
7239ce6b35
forgotten refs and typo (#478) 2025-05-31 07:36:50 +03:00
Kawrakow
2cf12eb12d
Replace MLA-specific KV cache with the standard KV cache (#469)
* Remove kv_l, kvt_l and just use k_l and v_l

* Hopefully take care of missing V cache (MLA)

* Replace MLA-specific KV cache with the standard KV cache V2 (#473)

* Fix save and restore when there is no V cache

* Fix double print

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: saood06 <saood05@gmail.com>
2025-05-30 11:08:17 +03:00
Kawrakow
1eac9e8487
NEON implementation for trellis quants (#471)
* iq2_kt: NEON implementation

* iq3_kt: NEON implementation

* iq4_kt: not working NEON implementation

* iq4_kt: NEON implementation

Have to use f32 arithmetic, else I get gibberish?
Correspondingly, it is ridiculously slow.

* Cleanup

* iq4_kt: slightly faster TG on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-29 18:57:41 +03:00
saood06
ccd6d9cdf6
set cache_prompt default to true (#465) 2025-05-28 08:18:25 +03:00
Kawrakow
0976467845
CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462)
* CUDA: iq4_ks_r4 GEMV and GEMM

* CUDA: iq5_ks_r4 GEMV and GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-27 08:37:44 +03:00
Kawrakow
1429291326
CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461)
* CUDA: iq4_k_r4 dequantize

* CUDA: iq4_k_r4 GEMV

~10% slower than iq4_k.

* CUDA: slightly faster iq4_k_r4 GEMV

* CUDA: slightly faster iq4_k_r4 GEMV

We are now within 3% of iq4_k

* CUDA: iq5_k_r4 dequantize

* CUDA: iq5_k_r4 GEMV

~3% slower than iq5_k.

* CUDA: iq3_k_r4 dequantize

* CUDA: iq3_k_r4 GEMV

* CUDA: slightly faster iq3_k_r4 GEMV

* CUDA: iq2_k_r4 GEMV

* CUDA: faster iq2_k_r4 GEMV

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-26 19:34:54 +03:00
Kawrakow
24c010b391
Add missing gguf-py constants (#458)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-25 09:55:36 +03:00
Nexes the Elder
c7ecd4e23a
Legacy quants conversion schemes in convert_hf_to_gguf.py (#449)
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0`,`Q4_1` are here using embeddings, output, attn_k and attn_v in q5_0.
`Q5_0`,`Q5_1` are here using embeddings, output, attn_k and attn_v in q8_0.

Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
2025-05-24 11:49:10 +03:00
Kawrakow
a2c42f9985
Faster IQ3_KT and IQ4_KT (#453)
* Somewhat faster iq3_kt (AVX2)

* Cleanup

* Slightly faster iq4_kt

* Slightly faster iq4_kt

PP is now almost 50% better than original, TG is ~20% better

* Cleanup

* Very slightly faster iq4_kt TG

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-24 11:48:52 +03:00
Kawrakow
9fb82af3a8
Fix bug in MMVQ kernel (#446)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 18:25:11 +03:00
Kawrakow
6b12c2e7e8
Fix MSVC compilation (#448)
* Fix MSVC compilation

* MSVC cannot capture constexpr in lambdas

* Arghhh

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 16:46:27 +03:00
Kawrakow
7f2edd1a85
Fix typo in non-AVX2 code branch (#445)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 12:02:54 +03:00
Andrew Chan
a1c931c30c
Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (sketched below, after this entry).

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement in quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
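A hedged sketch of the cluster trick referenced in the quantization-speedup items of the PR above. All names, the cluster layout, and the plain L2 distance are illustrative assumptions, not the repository's code: candidate points are grouped into clusters up front, each cluster's points are stored contiguously, and the search runs only over the cluster whose centroid is nearest to the target group of weights.

    #include <cstdint>
    #include <vector>
    #include <limits>

    // Illustrative sketch: cluster-restricted search for the candidate
    // point closest to a group of weights.
    struct PointClusters {
        int dim = 8;                                  // e.g. a group of 8 weights
        std::vector<std::vector<float>>    centroids; // one centroid per cluster
        std::vector<std::vector<float>>    points;    // each cluster's points, stored contiguously
        std::vector<std::vector<uint32_t>> codes;     // trellis code of each point
    };

    static uint32_t find_best_code(const PointClusters & pc, const float * target) {
        auto dist2 = [&](const float * p) {
            float d = 0;
            for (int j = 0; j < pc.dim; ++j) { float t = p[j] - target[j]; d += t*t; }
            return d;
        };
        // 1. Pick the nearest cluster by centroid distance.
        size_t best_c = 0; float best_d = std::numeric_limits<float>::max();
        for (size_t c = 0; c < pc.centroids.size(); ++c) {
            float d = dist2(pc.centroids[c].data());
            if (d < best_d) { best_d = d; best_c = c; }
        }
        // 2. Search only that cluster; contiguous storage gives sequential access.
        const auto & pts = pc.points[best_c];
        uint32_t best_code = 0; best_d = std::numeric_limits<float>::max();
        for (size_t p = 0; p*pc.dim < pts.size(); ++p) {
            float d = dist2(pts.data() + p*pc.dim);
            if (d < best_d) { best_d = d; best_code = pc.codes[best_c][p]; }
        }
        return best_code;
    }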
Nexes the Elder
3efdd6df67
gguf-split : update (#444)
gguf-split : improve --split and --merge logic (#9619)

* make sure params --split and --merge are not specified at same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

---------

gguf-split : add basic checks (#9499)

* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed

Authored-by: slaren <slarengh@gmail.com>
2025-05-23 08:07:42 +03:00
Nexes the Elder
ec4563221e
Streamline a bit the quant strategies (#443)
* Streamline a bit the quant strategies

No change over the existing patterns, except for the bump for attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA).
The rest is applying the existing patterns to the new IQ_K quants.
Also, a Q8_0 for attn_q had slipped into the 8-experts MoE rule; I removed it because that tensor is much bigger than attn_k or attn_v.

* remove <=8 experts condition.
2025-05-22 18:04:47 +03:00