ik_llama.cpp

Kawrakow 0d7eb34185 Graph parallel: the next generation (#1080)	2025-12-24 08:31:48 +01:00

* WIP: absorb adding input into std_attn and std_ffn
* WIP: NCCL infra
* WIP: add reduce and fake_cpy ops
* WIP
* WIP: graph appears to work, layer is broken
* WIP: Qwen3-MoE works with graph, layer still broken
* WIP: GLM-4.5 graph works
* WIP: fix sm layer (dense)
* WIP: fix sm layer (MoE)
* WIP: fast PP with bespoke 4-GPU NCCL

  I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs).
* WIP: Cohere2
* Explicitely set device
* Bespoke 3-GPU case
* WIP
* Do not repeat get_rows multiple times
* Fix 3 GPUs
* OK, let's leave it in
* Implement the reduce op without NCCL available
* Be able to build without NCCL

  cmake -DGGML_NCCL=OFF disables it
* Make --max-gpu work again
* Slightly better for 4 GPUs without NCCL
* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
arm64-windows-llvm.cmake	ggml : prevent builds with -ffinite-math-only (#7726)	2024-06-04 17:01:09 +10:00
arm64-windows-msvc.cmake	Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (#7191)	2024-05-16 12:47:36 +10:00
build-info.cmake	Merge mainline llama.cpp (#3)	2024-07-27 07:55:01 +02:00
FindNCCL.cmake	Graph parallel: the next generation (#1080)	2025-12-24 08:31:48 +01:00
git-vars.cmake	Merge mainline llama.cpp (#3)	2024-07-27 07:55:01 +02:00
llama-config.cmake.in	Merge mainline llama.cpp (#3)	2024-07-27 07:55:01 +02:00
llama.pc.in	cmake : add pkg-config spec file for llama.cpp (#7702)	2024-06-03 11:06:24 +03:00