ik_llama.cpp/tests
Kawrakow 3f7899c250 Faster Gemma2 (#27)
* soft_cap_max: initial CPU version of fused softcap + soft_max

With this vanilla CPU implementation I'm already getting a ~3% speedup
for Gemma-2-9b and a prompt of 8192 tokens.
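To make the fusion concrete, here is a minimal scalar sketch of what a fused softcap + softmax computes per row, assuming Gemma-2's `softcap * tanh(x / softcap)` capping. The function name and signature are illustrative only, not the actual ggml op API; the real kernel operates on tensors, this is a single-row reference.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Reference for fused softcap + softmax on one row of logits.
// Fusing means the tanh capping happens in the same pass that
// finds the row maximum, so the logits are only read once.
std::vector<float> soft_cap_max(const std::vector<float>& logits, float softcap) {
    std::vector<float> out(logits.size());
    float max_val = -INFINITY;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = softcap * std::tanh(logits[i] / softcap);  // softcap
        max_val = std::max(max_val, out[i]);                // running row max
    }
    float sum = 0.0f;
    for (float& v : out) { v = std::exp(v - max_val); sum += v; }  // stable exp
    for (float& v : out) v /= sum;                                 // normalize
    return out;
}
```

The separate (unfused) path would first materialize the capped logits with one op and then run a standard softmax over them; the fusion saves that intermediate read/write, which is where the ~3% comes from.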

* soft_cap_max: WIP - something is wrong with CUDA

* soft_cap_max: looks good on CPU and CUDA

* Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).

On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change FA cannot be used at all, so one gets
2300 t/s (with the fused softcap+softmax) or 2000 t/s (without
the fusion).

In comparison, mainline llama.cpp gets PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler has also added softcap to FA),
and PP-16384 = 3097 t/s after that PR.
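For FA the softcap cannot be a separate op at all, because the attention scores never exist as a full matrix; it has to be folded into the streaming (online) softmax inside the kernel. A minimal scalar sketch of that, assuming a 1-D score/value layout (names are illustrative; the real CUDA kernels are block-tiled and vectorized):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Online softmax with softcap, as used conceptually inside a
// flash-attention inner loop: each score is capped before it
// enters the running max / running denominator update.
float online_softmax_weighted_sum(const std::vector<float>& scores,
                                  const std::vector<float>& values,
                                  float softcap) {
    float m   = -INFINITY;  // running maximum of capped scores
    float d   = 0.0f;       // running softmax denominator
    float acc = 0.0f;       // running weighted sum of values
    for (size_t i = 0; i < scores.size(); ++i) {
        float s     = softcap * std::tanh(scores[i] / softcap);  // softcap first
        float m_new = std::max(m, s);
        float scale = std::exp(m - m_new);   // rescale earlier partial sums
        d   = d   * scale + std::exp(s - m_new);
        acc = acc * scale + std::exp(s - m_new) * values[i];
        m   = m_new;
    }
    return acc / d;
}
```

Because the capping is elementwise and applied before the max/sum bookkeeping, it slots into the existing online-softmax recurrence without changing its structure.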

* soft_cap_max: Metal

* Flash attention with softcap: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-27 17:40:59 +03:00
.gitignore tests : gitignore ggml-common.h 2024-03-09 14:17:11 +02:00
CMakeLists.txt Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
get-model.cpp ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
get-model.h ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
run-json-schema-to-grammar.mjs json-schema-to-grammar improvements (+ added to server) (#5978) 2024-03-21 11:50:43 +00:00
test-autorelease.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-backend-ops.cpp Faster Gemma2 (#27) 2024-08-27 17:40:59 +03:00
test-c.c Nomic Vulkan backend (#4456) 2024-01-29 15:50:50 -05:00
test-chat-template.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-double-float.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-grad0.cpp ggml : refactor rope norm/neox (#7634) 2024-06-05 11:29:20 +03:00
test-grammar-integration.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-grammar-parser.cpp grammars: x{min,max} repetition operator (#6640) 2024-06-06 10:07:06 +01:00
test-json-schema-to-grammar.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-llama-grammar.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-model-load-cancel.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-opt.cpp code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
test-quantize-fns.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-quantize-perf.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-rope.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-sampling.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
test-tokenizer-0.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-0.py py : logging and flake8 suppression refactoring (#7081) 2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh tests : fix test-tokenizer-0.sh 2024-05-28 15:04:09 +03:00
test-tokenizer-1-bpe.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-1-spm.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-random.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00