Commit Graph

13 Commits

Author SHA1 Message Date
Eve
a549c86de4 ci: run the x64 and arm ci on the github machines instead (llama/16183)
* run the x64 ci on regular machines

* set up the same thing for arm

fix test-quantize-perf just like #12306

* try to disable sve

* add another sve run
2025-09-25 11:56:34 +03:00
Georgi Gerganov
5cbbe51096 ggml : inttypes.h -> cinttypes (llama/0)
ggml-ci
2024-11-18 10:56:51 +02:00
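
For context, a minimal C++ sketch of the header change (illustrative only): <cinttypes> is the C++ counterpart of <inttypes.h>, and PRId64 remains the portable format macro for int64_t.

    // Minimal C++ sketch: <cinttypes> replaces <inttypes.h> in C++ sources,
    // and PRId64 still formats int64_t portably.
    #include <cinttypes>
    #include <cstdio>

    int main() {
        int64_t n = 1234567890123LL;
        std::printf("n = %" PRId64 "\n", n);
        return 0;
    }
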
Diego Devesa
813a12ce9e ggml : build backends as libraries (llama/10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-15 22:51:53 +02:00
Diego Devesa
9d0762e2af ggml : move CPU backend to a separate file (llama/10144) 2024-11-04 19:42:09 +02:00
Diego Devesa
d6a7a6856b ggml : fix BLAS with unsupported types (llama/9775)
* ggml : do not use BLAS with types without to_float

* ggml : return a pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

It's not really internal if everybody uses it.
2024-10-16 11:28:38 +03:00
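
A minimal usage sketch of the renamed API, assumed rather than taken from the commit (the helper can_use_blas_for is hypothetical): ggml_get_type_traits now returns a pointer, and a missing to_float conversion is what rules a type out for the BLAS path.

    // Assumed usage sketch; can_use_blas_for is a hypothetical helper.
    #include "ggml.h"
    #include <cstdio>

    static bool can_use_blas_for(enum ggml_type type) {
        // f32 needs no conversion; other types must provide a to_float routine
        if (type == GGML_TYPE_F32) {
            return true;
        }
        const auto * traits = ggml_get_type_traits(type);  // now returns a pointer
        return traits->to_float != nullptr;
    }

    int main() {
        std::printf("q4_0 usable with BLAS: %d\n", can_use_blas_for(GGML_TYPE_Q4_0));
        return 0;
    }
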
Georgi Gerganov
9f59af6088 ggml : minor naming changes (llama/8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-27 18:26:12 +03:00
snadampal
3655bf83a9 ggml : add mmla kernels for quantized GEMM (llama/4966)
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for the
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for the
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for the
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
2024-02-12 09:25:26 +02:00
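
For context, a minimal sketch of the building block, not the actual ggml kernel: the vmmlaq_s32 intrinsic, guarded by the same feature macro, multiplies two 2x8 int8 tiles and accumulates a 2x2 int32 result, i.e. 32 multiply-adds per instruction versus 16 for sdot.

    // Sketch only, not the ggml kernel: one SMMLA accumulation step.
    // vmmlaq_s32 treats a and b as 2x8 int8 tiles and accumulates the
    // 2x2 int32 product a * b^T into acc.
    #if defined(__ARM_FEATURE_MATMUL_INT8)
    #include <arm_neon.h>

    static inline int32x4_t smmla_2x2(int32x4_t acc, int8x16_t a, int8x16_t b) {
        return vmmlaq_s32(acc, a, b);
    }
    #endif
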
Kawrakow
f7b408495c SOTA 3-bit quants (llama/5196)
* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish, at about half-way between q2_K and
q3_K, so we need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent; ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy test did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:21:10 +02:00
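
A rough back-of-the-envelope sketch of where a ~3-bit quant lands in size; the block layout used here (a 2-byte fp16 scale plus 96 bytes of packed data per 256 weights) is an assumption for illustration, not quoted from the commit.

    // Back-of-the-envelope sketch; the block layout below is an assumption
    // for illustration, not quoted from the commit.
    #include <cstdio>

    int main() {
        const int    weights_per_block = 256;
        const int    bytes_per_block   = 2 /* fp16 scale */ + 96 /* packed data */;
        const double bits_per_weight   = 8.0 * bytes_per_block / weights_per_block;
        std::printf("%.4f bits per weight\n", bits_per_weight);  // prints 3.0625
        return 0;
    }
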
Georgi Gerganov
845d01bab3 sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec) (#662)
* sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec)

ggml-ci

* ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)

* llama : fix platforms without mmap (#4578)

* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber and an unnecessary std::string in std::runtime_error

* ggml-alloc : fix ggml_tallocr_is_own

* whisper : minor

* ggml : cuda jetson + arm quants warnings

ggml-ci

---------

Co-authored-by: Herman Semenov <GermanAizek@yandex.ru>
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-22 17:53:50 +02:00
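
A minimal usage sketch of the three synced APIs, assumed rather than taken from the commit: ggml_row_size reports row storage size, ggml_mul_mat_set_prec requests higher accumulation precision on a matmul result, and ggml_scale now takes a plain float factor.

    // Assumed usage sketch; not code from the commit.
    #include "ggml.h"
    #include <cstdio>

    int main() {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,
            /*.mem_buffer =*/ nullptr,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        // ggml_row_size: bytes needed to store one row of ne elements of a type
        std::printf("row of 4096 f16 values: %zu bytes\n",
                    ggml_row_size(GGML_TYPE_F16, 4096));

        struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
        struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);

        struct ggml_tensor * ab = ggml_mul_mat(ctx, a, b);
        ggml_mul_mat_set_prec(ab, GGML_PREC_F32);            // request f32 accumulation

        struct ggml_tensor * scaled = ggml_scale(ctx, ab, 0.125f);  // scale by a float
        (void) scaled;

        ggml_free(ctx);
        return 0;
    }
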
Georgi Gerganov
5c7bd24f84 sync : llama (mul_mat_id + get_rows kernels, typos) (#649)
* sync : llama (mul_mat_id + get_rows kernels, typos)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

* cuda : restore correct im2col kernel

ggml-ci

* metal : fix moe test by reducing the expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13 21:53:20 +02:00
Georgi Gerganov
ef336850d5 sync : llama.cpp (training, refactoring) (#548)
* sync : llama.cpp (training, refactoring)

* examples : fix ggml_rope

* ggml : better optimizer cancel handling

ggml-ci

* ggml : fix UBs

ggml-ci

* ggml : add TODO for refactoring the opt cancellation
2023-10-04 15:53:05 +03:00
Georgi Gerganov
c06cb61f66 sync : whisper (POSIX) (#511)
* sync : whisper (POSIX)

ggml-ci

* sync : llama (HBM + Metal + style)

ggml-ci
2023-09-08 17:57:04 +03:00
Georgi Gerganov
d8fbf15c60 tests : sync from llama.cpp and disable some obsolete tests 2023-07-05 20:38:20 +03:00