ik_llama.cpp/tests
Kawrakow 3f7899c250 Faster Gemma2 (#27)
* soft_cap_max: initial CPU version of fused softcap + soft_max

With this vanilla CPU implementation I'm already getting a ~3% speedup
for Gemma-2-9b and a prompt of 8192 tokens.
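To make the fusion concrete, here is a minimal scalar sketch of what a fused softcap + softmax computes per row, assuming Gemma-2's `softcap * tanh(x / softcap)` capping. The function name and signature are illustrative only, not the actual ggml op API; the real kernel operates on tensors, this is a single-row reference.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Reference for fused softcap + softmax on one row of logits.
// Fusing means the tanh capping happens in the same pass that
// finds the row maximum, so the logits are only read once.
std::vector<float> soft_cap_max(const std::vector<float>& logits, float softcap) {
    std::vector<float> out(logits.size());
    float max_val = -INFINITY;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = softcap * std::tanh(logits[i] / softcap);  // softcap
        max_val = std::max(max_val, out[i]);                // running row max
    }
    float sum = 0.0f;
    for (float& v : out) { v = std::exp(v - max_val); sum += v; }  // stable exp
    for (float& v : out) v /= sum;                                 // normalize
    return out;
}
```

The separate (unfused) path would first materialize the capped logits with one op and then run a standard softmax over them; the fusion saves that intermediate read/write, which is where the ~3% comes from.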

* soft_cap_max: WIP - something is wrong with CUDA

* soft_cap_max: looks good on CPU and CUDA

* Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).

On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change FA cannot be used at all, so one gets
2300 t/s (with the fused softcap+softmax) or 2000 t/s (without
the fusion).

In comparison, mainline llama.cpp gets PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler has also added softcap to FA),
and PP-16384 = 3097 t/s after that PR.
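For FA the softcap cannot be a separate op at all, because the attention scores never exist as a full matrix; it has to be folded into the streaming (online) softmax inside the kernel. A minimal scalar sketch of that, assuming a 1-D score/value layout (names are illustrative; the real CUDA kernels are block-tiled and vectorized):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Online softmax with softcap, as used conceptually inside a
// flash-attention inner loop: each score is capped before it
// enters the running max / running denominator update.
float online_softmax_weighted_sum(const std::vector<float>& scores,
                                  const std::vector<float>& values,
                                  float softcap) {
    float m   = -INFINITY;  // running maximum of capped scores
    float d   = 0.0f;       // running softmax denominator
    float acc = 0.0f;       // running weighted sum of values
    for (size_t i = 0; i < scores.size(); ++i) {
        float s     = softcap * std::tanh(scores[i] / softcap);  // softcap first
        float m_new = std::max(m, s);
        float scale = std::exp(m - m_new);   // rescale earlier partial sums
        d   = d   * scale + std::exp(s - m_new);
        acc = acc * scale + std::exp(s - m_new) * values[i];
        m   = m_new;
    }
    return acc / d;
}
```

Because the capping is elementwise and applied before the max/sum bookkeeping, it slots into the existing online-softmax recurrence without changing its structure.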

* soft_cap_max: Metal

* Flash attention with softcap: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-27 17:40:59 +03:00
.gitignore tests : gitignore ggml-common.h 2024-03-09 14:17:11 +02:00
CMakeLists.txt Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
get-model.cpp ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
get-model.h ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
run-json-schema-to-grammar.mjs json-schema-to-grammar improvements (+ added to server) (#5978) 2024-03-21 11:50:43 +00:00
test-autorelease.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-backend-ops.cpp Faster Gemma2 (#27) 2024-08-27 17:40:59 +03:00
test-c.c Nomic Vulkan backend (#4456) 2024-01-29 15:50:50 -05:00
test-chat-template.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-double-float.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-grad0.cpp ggml : refactor rope norm/neox (#7634) 2024-06-05 11:29:20 +03:00
test-grammar-integration.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-grammar-parser.cpp grammars: x{min,max} repetition operator (#6640) 2024-06-06 10:07:06 +01:00
test-json-schema-to-grammar.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-llama-grammar.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-model-load-cancel.cpp ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
test-opt.cpp code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
test-quantize-fns.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-quantize-perf.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-rope.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-sampling.cpp Merge mainline - Aug 12 2024 (#17) 2024-08-12 15:14:32 +02:00
test-tokenizer-0.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-0.py py : logging and flake8 suppression refactoring (#7081) 2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh tests : fix test-tokenizer-0.sh 2024-05-28 15:04:09 +03:00
test-tokenizer-1-bpe.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-1-spm.cpp Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00
test-tokenizer-random.py Merge mainline llama.cpp (#3) 2024-07-27 07:55:01 +02:00