llama.cpp

mirror of https://github.com/ggerganov/llama.cpp synced 2026-03-01 21:00:04 +01:00

History

Jayant Lohia ecbcb7ea9d CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806 ) * CUDA: add CDNA3 MFMA support for flash attention MMA kernel Add MI300X (gfx942) MFMA tensor core flash attention using v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate). - Add FATTN_WARP_SIZE=64 for CDNA wavefront64 - Add CDNA config for head sizes 64, 80, 96, 112, 128 - Add FP16 MFMA intrinsic path in mma.cuh - Add manual V transpose load for MFMA register layout - Route CDNA to MMA for prompt processing, VEC for token generation - Fix Q loading and combine stride granularity for non-power-of-2 heads Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X): pp512 +7%, pp1024 +13%, pp2048 +23%, pp4096 +39% tg128 -10% (FA overhead, VEC used for both) All 2480 flash attention tests pass. Ref: https://github.com/ggml-org/llama.cpp/issues/17917 * address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch - Replace #define FATTN_WARP_SIZE with constexpr int warp_size = ggml_cuda_get_physical_warp_size() in each device function - Use ne[1]gqa_ratio threshold for MMA vs tile dispatch. Benchmarked crossover on MI300X @ d32768 with power-of-2 GQA models: hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%) hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%) Unified threshold: eff_nq >= 128 for all head sizes. - Remove VEC fallback; small batches fall through to tile kernel Update ggml/src/ggml-cuda/fattn.cu * use ggml_cuda_info().devices warp_size instead of hardcoded check --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>		2026-02-27 19:37:26 +01:00
..
cmake
include	ggml/gguf : prevent integer overflows (#19856 )	2026-02-24 20:17:11 +02:00
src	CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806 )	2026-02-27 19:37:26 +01:00
.gitignore
CMakeLists.txt	ggml : bump version to 0.9.7 (ggml/1425)	2026-02-15 22:24:29 +02:00