Commit Graph

3754 Commits

Author SHA1 Message Date
Iwan Kawrakow
c8cf128099 Add missing break 2025-06-09 15:18:11 +03:00
Iwan Kawrakow
336bcd4ad7 New iq2_kt: Metal - very slow.
It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.
2025-06-09 15:04:23 +03:00
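For reference, the operation the commit refers to is a horizontal sum of the four 8-bit lanes of a 32-bit word. A minimal scalar C++ sketch (illustrative only, not the Metal kernel) looks like this:

    #include <cstdint>

    // Horizontal sum of the four 8-bit lanes of a 32-bit word - the
    // "add 4 8-bit ints" mentioned above. Illustrative scalar fallback.
    static inline int sum4_u8(uint32_t x) {
        return int( x        & 0xff) + int((x >>  8) & 0xff) +
               int((x >> 16) & 0xff) + int((x >> 24) & 0xff);
    }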
Iwan Kawrakow
ab802a1881 New iq2_kt: slightly faster NEON GEMM 2025-06-09 12:58:27 +03:00
Iwan Kawrakow
b6069a4e12 New iq2_kt: NEON GEMM/GEMV 2025-06-09 12:20:42 +03:00
Iwan Kawrakow
6da24256b9 Adding forgotten file 2025-06-09 10:26:27 +03:00
Iwan Kawrakow
fcea0045dd New iq2_kt: AVX2 GEMM/GEMV 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
dbbc7eaec8 New iq2_kt: AVX2 dequantize 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
5ad250ce9d New iq2_kt: CUDA GEMV 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
44015ad83e Switching iq2_kt to new trellis - CUDA MMQ 2025-06-09 10:26:04 +03:00
Iwan Kawrakow
f59fe11764 Adding forgotten file 2025-06-09 09:59:39 +03:00
Iwan Kawrakow
07efec0e4c Cleanup 2025-06-08 13:44:20 +03:00
Iwan Kawrakow
b30bfc1377 Remove the extra 4 bytes of row meta data that is no longer used 2025-06-08 13:36:47 +03:00
Iwan Kawrakow
314674df85 New iq4_kt trellis: not working Metal implementation 2025-06-08 11:47:55 +03:00
Iwan Kawrakow
1e6ff8a788 Minor 2025-06-08 10:29:40 +03:00
Iwan Kawrakow
d334cbf552 New iq4_kt: faster NEON
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
2025-06-08 10:05:49 +03:00
Iwan Kawrakow
b41f4718ad New iq4_kt: slightly faster NEON 2025-06-08 09:17:32 +03:00
Iwan Kawrakow
eed22154d1 New iq4_kt: slightly faster NEON 2025-06-08 08:43:02 +03:00
Iwan Kawrakow
8cc8b1e795 New iq4_kt: NEON implementation
We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
2025-06-08 08:31:56 +03:00
Iwan Kawrakow
9d7bf1c821 New iq4_kt: fix vanilla AVX2 2025-06-07 19:33:33 +03:00
Iwan Kawrakow
65e654a69d New iq4_kt: AVX2 dot product finally works
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.
2025-06-07 19:12:09 +03:00
Iwan Kawrakow
2b6acd5843 Fix iq2_kt that got broken along the way 2025-06-07 18:51:02 +03:00
Iwan Kawrakow
fb776ab7ba For now have only iq4_kt use the new trellis 2025-06-07 18:35:10 +03:00
Iwan Kawrakow
9b4103ed54 New iq4_kt: CUDA MMQ 2025-06-07 18:21:43 +03:00
Iwan Kawrakow
98f35bfaa2 New iq4_kt: CUDA MMVQ 2025-06-07 17:43:36 +03:00
Iwan Kawrakow
a296134434 Something is not working with the AVX2 dot product 2025-06-07 16:18:58 +03:00
Iwan Kawrakow
eb61f498d1 New iq4_kt trellis
The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.
2025-06-07 12:30:37 +03:00
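A hedged sketch of the generator described above (assuming idx is advanced by the same affine recurrence at each step; the function name and the ka/kb values passed in are illustrative placeholders, not the repository's actual constants):

    #include <cstdint>

    // Illustrative sketch of the integer trellis generator: mask each byte of
    // ka*idx + kb to 6 bits, sum the four bytes as uint8_t, subtract 126.
    // Each masked byte is in 0..63, so the sum is 0..252 and fits in int8_t.
    static inline int8_t trellis_next(uint32_t & idx, uint32_t ka, uint32_t kb) {
        idx = ka*idx + kb;               // affine step, wraps modulo 2^32 (assumed)
        uint32_t x = idx & 0x3f3f3f3f;   // keep the low 6 bits of each byte
        int sum = int( x        & 0xff) + int((x >>  8) & 0xff) +
                  int((x >> 16) & 0xff) + int((x >> 24) & 0xff);
        return int8_t(sum - 126);        // re-center around zero
    }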
Kawrakow
8ffad187ab
MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493)
* MMQ for iq4_ks_r4

* MMQ for iq5_ks_r4

* Add forgotten file

* Another forgotten file

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:31:20 +03:00
Kawrakow
0b10f7418f
Faster CPU prompt processing for Trellis quants and MoE models (#488)
* Also do the dequantize approach for mul_mat_id

* Also do the dequantize approach for iqk_moe_fused_up_gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:30:35 +03:00
Kawrakow
7e79665a31
CUDA implementation for IQ1_S_R4 (#492)
* iq1_s_r4: CUDA dequantize

* iq1_s_r4: CUDA GEMV

* iq1_s_r4: MMQ on CUDA

Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 07:24:31 +03:00
Kawrakow
f6d5fbdc57
Adding top-n-sigma sampler (#489)
* Adding top-n-sigma sampler

* Fix typos in XTC PR

* Update README.md for main and server

* More README

* More README

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 17:35:09 +03:00
Kawrakow
ccb265c016
Adding the XTC sampler (#486)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 11:32:03 +03:00
Nexes the Elder
4f8b05a0d7
convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483)
* Direct conversion from fp16 to Q6_0

* forgotten comma

* More precise info
2025-06-03 09:30:30 +03:00
Kawrakow
7a8abe29f7
Minor (~2%) iq2_ks TG performance improvement on CUDA (#468)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:33 +03:00
Kawrakow
3df1a3a44d
Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM

For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s

Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying by the row scale
before doing the multiply-adds (sketched below, after this entry).

* Experimenting with dequant + f16 GEMM on NEON

iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:05 +03:00
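To illustrate the overflow point above (a hedged sketch assuming a generic f16 type such as _Float16 on a supporting compiler; the function name and loop structure are not the repository's NEON kernel): the small per-row scale is folded into the dequantized values before the multiply-adds so the running f16 sum stays within f16 range.

    // Illustrative only: pre-scaling by the small row scale keeps f16 partial
    // sums far below the ~65504 f16 maximum instead of overflowing.
    static float dot_f16_prescaled(const int8_t * q, const _Float16 * x, int n, float row_scale) {
        _Float16 sum = 0;
        for (int i = 0; i < n; ++i) {
            _Float16 w = (_Float16)(row_scale * q[i]);  // scale before the multiply-add
            sum += w * x[i];
        }
        return (float)sum;
    }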
Kawrakow
35374bc7e8
Metal implementation for the trellis quants. (#475)
* iq2_kt: Metal dequantize

* iq2_kt: Metal GEMV

Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B

* iq3_kt: Metal dequantize

* iq3_kt: Metal GEMV

Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B.
Flipping signs is a costly affair.

* iq4_kt: Metal dequantize - getting NaNs

* iq4_kt: Metal GEMV - also not working

* iq4_kt: Metal still not working

* Disable iq4_kt on Metal for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:23:44 +03:00
Nexes the Elder
7239ce6b35
forgotten refs and typo (#478) 2025-05-31 07:36:50 +03:00
Kawrakow
2cf12eb12d
Replace MLA-specific KV cache with the standard KV cache (#469)
* Remove kv_l, kvt_l and just use k_l and v_l

* Hopefully take care of missing V cache (MLA)

* Replace MLA-specific KV cache with the standard KV cache V2 (#473)

* Fix save and restore when there is no V cache

* Fix double print

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: saood06 <saood05@gmail.com>
2025-05-30 11:08:17 +03:00
Kawrakow
1eac9e8487
NEON implementation for trellis quants (#471)
* iq2_kt: NEON implementation

* iq3_kt: NEON implementation

* iq4_kt: not working NEON implementation

* iq4_kt: NEON implementation

Have to use f32 arithmetic, else I get gibberish?
Correspondingly, it is ridiculously slow.

* Cleanup

* iq4_kt: slightly faster TG on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-29 18:57:41 +03:00
saood06
ccd6d9cdf6
set cache_prompt default to true (#465) 2025-05-28 08:18:25 +03:00
Kawrakow
0976467845
CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462)
* CUDA: iq4_ks_r4 GEMV and GEMM

* CUDA: iq5_ks_r4 GEMV and GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-27 08:37:44 +03:00
Kawrakow
1429291326
CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461)
* CUDA: iq4_k_r4 dequantize

* CUDA: iq4_k_r4 GEMV

~10% slower than iq4_k.

* CUDA: slightly faster iq4_k_r4 GEMV

* CUDA: slightly faster iq4_k_r4 GEMV

We are now within 3% of iq4_k

* CUDA: iq5_k_r4 dequantize

* CUDA: iq5_k_r4 GEMV

~3% slower than iq5_k.

* CUDA: iq3_k_r4 dequantize

* CUDA: iq3_k_r4 GEMV

* CUDA: slightly faster iq3_k_r4 GEMV

* CUDA: iq2_k_r4 GEMV

* CUDA: faster iq2_k_r4 GEMV

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-26 19:34:54 +03:00
Kawrakow
24c010b391
Add missing gguf-py constants (#458)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-25 09:55:36 +03:00
Nexes the Elder
c7ecd4e23a
Legacy quants conversion schemes in convert_hf_to_gguf.py (#449)
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0`,`Q4_1` are here using embeddings, output, attn_k and attn_v in q5_0.
`Q5_0`,`Q5_1` are here using embeddings, output, attn_k and attn_v in q8_0.

Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
2025-05-24 11:49:10 +03:00
Kawrakow
a2c42f9985
Faster IQ3_KT and IQ4_KT (#453)
* Somewhat faster iq3_kt (AVX2)

* Cleanup

* Slightly faster iq4_kt

* Slightly faster iq4_kt

PP is now almost 50% better than original, TG is ~20% better

* Cleanup

* Very slightly faster iq4_kt TG

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-24 11:48:52 +03:00
Kawrakow
9fb82af3a8
Fix bug in MMVQ kernel (#446)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 18:25:11 +03:00
Kawrakow
6b12c2e7e8
Fix MSVC compilation (#448)
* Fix MSVC compilation

* MSVC cannot capture constexpr in lambdas

* Arghhh

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 16:46:27 +03:00
Kawrakow
7f2edd1a85
Fix typo in non-AVX2 code branch (#445)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 12:02:54 +03:00
Andrew Chan
a1c931c30c
Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (sketched below, after this entry).

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement in quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
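A hedged sketch of the cluster trick referenced in the quantization-speedup items of the PR above. All names, the cluster layout, and the plain L2 distance are illustrative assumptions, not the repository's code: candidate points are grouped into clusters up front, each cluster's points are stored contiguously, and the search runs only over the cluster whose centroid is nearest to the target group of weights.

    #include <cstdint>
    #include <vector>
    #include <limits>

    // Illustrative sketch: cluster-restricted search for the candidate
    // point closest to a group of weights.
    struct PointClusters {
        int dim = 8;                                  // e.g. a group of 8 weights
        std::vector<std::vector<float>>    centroids; // one centroid per cluster
        std::vector<std::vector<float>>    points;    // each cluster's points, stored contiguously
        std::vector<std::vector<uint32_t>> codes;     // trellis code of each point
    };

    static uint32_t find_best_code(const PointClusters & pc, const float * target) {
        auto dist2 = [&](const float * p) {
            float d = 0;
            for (int j = 0; j < pc.dim; ++j) { float t = p[j] - target[j]; d += t*t; }
            return d;
        };
        // 1. Pick the nearest cluster by centroid distance.
        size_t best_c = 0; float best_d = std::numeric_limits<float>::max();
        for (size_t c = 0; c < pc.centroids.size(); ++c) {
            float d = dist2(pc.centroids[c].data());
            if (d < best_d) { best_d = d; best_c = c; }
        }
        // 2. Search only that cluster; contiguous storage gives sequential access.
        const auto & pts = pc.points[best_c];
        uint32_t best_code = 0; best_d = std::numeric_limits<float>::max();
        for (size_t p = 0; p*pc.dim < pts.size(); ++p) {
            float d = dist2(pts.data() + p*pc.dim);
            if (d < best_d) { best_d = d; best_code = pc.codes[best_c][p]; }
        }
        return best_code;
    }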
Nexes the Elder
3efdd6df67
gguf-split : update (#444)
gguf-split : improve --split and --merge logic (#9619)

* make sure params --split and --merge are not specified at same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

---------

gguf-split : add basic checks (#9499)

* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed

Authored-by: slaren <slarengh@gmail.com>
2025-05-23 08:07:42 +03:00
Nexes the Elder
ec4563221e
Streamline a bit the quant strategies (#443)
* Streamline a bit the quant strategies

No change over the existing patterns, except for the bump for attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA).
The rest is applying the existing patterns to the new IQ_K quants.
Also, a Q8_0 for attn_q had slipped into the 8-experts MoE rule; I removed it because that tensor is much bigger than attn_k or attn_v.

* remove <=8 experts condition.
2025-05-22 18:04:47 +03:00