mirror of
https://github.com/ggerganov/llama.cpp
synced 2026-04-19 05:36:29 +02:00
llama.cpp/example/parallel
Simplified simulation of serving incoming requests in parallel
Example
Generate 128 client requests (-ns 128), simulating 8 concurrent clients (-np 8). The system prompt is shared (-pps), so it is computed only once, at the start. Each client request consists of up to 10 junk questions (--junk 10) followed by the actual question.
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
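To make the roles of these options more concrete, here is a minimal Python sketch of the scheduling idea only. It is not the code in parallel.cpp and runs no model: the function `simulate`, its parameters, the random number of generation steps per request, and the exact range used for the junk-question count are all assumptions made for illustration.

```python
import random
from collections import deque


def simulate(n_requests=128, n_slots=8, max_junk=10,
             share_system_prompt=True, seed=0):
    """Toy model of serving n_requests over n_slots concurrent client slots.

    Hypothetical illustration only: no tokens are decoded, and the amount of
    work per request is a random placeholder.
    """
    rng = random.Random(seed)
    pending = deque(range(n_requests))   # requests not yet assigned to a slot
    slots = [None] * n_slots             # one entry per concurrent client
    completed = []
    max_active = 0
    steps = 0

    # With a shared system prompt, it is evaluated once, up front.
    system_prompt_evals = 1 if share_system_prompt else 0

    while pending or any(s is not None for s in slots):
        # Hand pending requests to any free slots.
        for i in range(n_slots):
            if slots[i] is None and pending:
                if not share_system_prompt:
                    system_prompt_evals += 1  # re-evaluated for every request
                slots[i] = {
                    "id": pending.popleft(),
                    # Assumed range: between 0 and max_junk junk questions.
                    "n_junk": rng.randint(0, max_junk),
                    # Placeholder for the number of generation steps left.
                    "remaining": rng.randint(1, 4),
                }

        active = sum(s is not None for s in slots)
        max_active = max(max_active, active)

        # One batched step: every active slot advances together.
        for i in range(n_slots):
            s = slots[i]
            if s is None:
                continue
            s["remaining"] -= 1
            if s["remaining"] == 0:
                completed.append(s["id"])
                slots[i] = None          # slot becomes free for the next request
        steps += 1

    return {
        "completed": completed,
        "max_active": max_active,
        "steps": steps,
        "system_prompt_evals": system_prompt_evals,
    }


if __name__ == "__main__":
    result = simulate()
    print(len(result["completed"]), "requests served,",
          result["system_prompt_evals"], "system prompt evaluation(s)")
```

In this sketch, calling `simulate(share_system_prompt=False)` counts one system prompt evaluation per request instead of one in total, which is the cost that -pps is meant to avoid.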
Note
Base models are recommended for this example. Instruction-tuned models might not properly follow the custom chat template specified here, so the results might differ from what you expect.