git/ggml

mirror of https://github.com/ggerganov/ggml synced 2026-03-06 07:02:51 +01:00

Author	SHA1	Message	Date
Eve	a549c86de4	ci: run the x64 and arm ci on the github machines instead (llama/16183) * run the x64 ci on regular machines * set up the same thing for arm fix test-quantize-perf just like #12306 * try to disable sve * add another sve run	2025-09-25 11:56:34 +03:00
Georgi Gerganov	5cbbe51096	ggml : inttypes.h -> cinttypes (llama/0) ggml-ci	2024-11-18 10:56:51 +02:00
Diego Devesa	813a12ce9e	ggml : build backends as libraries (llama/10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-15 22:51:53 +02:00
Diego Devesa	9d0762e2af	ggml : move CPU backend to a separate file (llama/10144)	2024-11-04 19:42:09 +02:00
Diego Devesa	d6a7a6856b	ggml : fix BLAS with unsupported types (llama/9775) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-10-16 11:28:38 +03:00
Georgi Gerganov	9f59af6088	ggml : minor naming changes (llama/8433) * ggml : minor naming changes ggml-ci * ggml : use PRId64 [no ci] * ggml : revert FA K/Q names	2024-07-27 18:26:12 +03:00
snadampal	3655bf83a9	ggml : add mmla kernels for quantized GEMM (llama/4966) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-12 09:25:26 +02:00
Kawrakow	f7b408495c	SOTA 3-bit quants (llama/5196) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 21:21:10 +02:00
Georgi Gerganov	845d01bab3	sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec) (#662) * sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec) ggml-ci * ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203) * llama : fix platforms without mmap (#4578) * llama : fix platforms without mmap * win32 : limit prefetch size to the file size * fix win32 error clobber, unnecessary std::string in std::runtime_error * ggml-alloc : fix ggml_tallocr_is_own * whisper : minor * ggml : cuda jetson + arm quants warnings ggml-ci --------- Co-authored-by: Herman Semenov <GermanAizek@yandex.ru> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-22 17:53:50 +02:00
Georgi Gerganov	5c7bd24f84	sync : llama (mul_mat_id + get_rows kernels, typos) (#649) * sync : llama (mul_mat_id + get_rows kernels, typos) ggml-ci * cuda : restore im2col ggml-ci * metal : fix accuracy of dequantization kernels * cuda : restore correct im2col kernel ggml-ci * metal : fix moe test by reducing the expert size ggml-ci * cuda : fix bin bcast when src1 and dst have different types --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-13 21:53:20 +02:00
Georgi Gerganov	ef336850d5	sync : llama.cpp (training, refactoring) (#548) * sync : llama.cpp (training, refactoring) * examples : fix ggml_rope * ggml : better optimizer cancel handling ggml-ci * ggml : fix UBs ggml-ci * ggml : add TODO for refactoring the opt cancellation	2023-10-04 15:53:05 +03:00
Georgi Gerganov	c06cb61f66	sync : whisper (POSIX) (#511) * sync : whisper (POSIX) ggml-ci * sync : llama (HBM + Metal + style) ggml-ci	2023-09-08 17:57:04 +03:00
Georgi Gerganov	d8fbf15c60	tests : sync from llama.cpp and disable some obsolete tests	2023-07-05 20:38:20 +03:00

13 Commits