ik_llama.cpp/github-data/issues/380 - Drop at the start of generation.md

📝 #380 - Drop at the start of generation

Author intulint
State Closed
Created 2025-05-04
Updated 2025-05-25

Description

The server crashes as soon as generation starts. This only happens with Qwen3-30B-A3B; I checked several quants. Regular dense models work, including other dense Qwen3 models. What could be the problem? I liked the speedup on dense models and expected MoE to fly, but it doesn't work. It crashes without an error: the process just returns to the command prompt when generation starts.

win10, Microsoft Visual Studio\2022, main branch

cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
cmake --build ./build --config Release -j 16

./llama-server.exe -t 7 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf


💬 Conversation

👤 ikawrakow commented the 2025-05-05 at 05:12:28:

Can you post the output of the above commands (including the cmake commands)? Thanks.


👤 intulint commented the 2025-05-05 at 10:10:19:

Sure, but it turned out to be a lot of text. I also noticed that unicode.cpp and unicode-data.cpp take a long time to compile, in a single thread. I don't know if this is normal or not. From a third-party frontend, generation does not start at all and the program exits. If I connect through the server's own frontend, about 140 tokens are generated and then it crashes again without any message.


** Visual Studio 2022 Developer Command Prompt v17.13.6 ** Copyright (c) 2022 Microsoft Corporation


C:\Program Files\Microsoft Visual Studio\2022\Community>cd C:\neuro\ik_llama.cpp

C:\neuro\ik_llama.cpp>git pull
Already up to date.

C:\neuro\ik_llama.cpp>cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.20348.0 to target Windows 10.0.19045.
-- The C compiler identification is MSVC 19.43.34810.0
-- The CXX compiler identification is MSVC 19.43.34810.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.43.34808/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.43.34808/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.1.windows.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -openmp (found version "2.0")
-- Found OpenMP: TRUE (found version "2.0")
-- OpenMP found
-- Using optimized iqk matrix multiplications
-- Using llamafile
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Configuring done (24.9s)
-- Generating done (1.9s)
-- Build files have been written to: C:/neuro/ik_llama.cpp/build

C:\neuro\ik_llama.cpp>cmake --build ./build --config Release -j 16
MSBuild version 17.13.19+0d9f5a35a for .NET Framework

1>Checking Build System
Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf-hash/CMakeLists.txt
Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf-hash/CMakeLists.txt
Generating build details from Git
Building Custom Rule C:/neuro/ik_llama.cpp/ggml/src/CMakeLists.txt
Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf-hash/CMakeLists.txt
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.1.windows.2")
sha1.c
xxhash.c
sha256.c
ggml.c
Building Custom Rule C:/neuro/ik_llama.cpp/common/CMakeLists.txt
build-info.cpp
ggml-alloc.c
sha1.vcxproj -> C:\neuro\ik_llama.cpp\build\examples\gguf-hash\sha1.dir\Release\sha1.lib
build_info.vcxproj -> C:\neuro\ik_llama.cpp\build\common\build_info.dir\Release\build_info.lib
sha256.vcxproj -> C:\neuro\ik_llama.cpp\build\examples\gguf-hash\sha256.dir\Release\sha256.lib
ggml-backend.c
xxhash.vcxproj -> C:\neuro\ik_llama.cpp\build\examples\gguf-hash\xxhash.dir\Release\xxhash.lib
ggml-quants.c
C:\Program Files (x86)\Windows Kits\10\Include\10.0.20348.0\ucrt\assert.h(21,9): warning C4005: 'static_assert': macro redefinition [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
(compiling source file '../../../ggml/src/ggml-quants.c')
C:\neuro\ik_llama.cpp\ggml\src\ggml-common.h(69,9): see previous definition of 'static_assert'

ggml-aarch64.c
C:\Program Files (x86)\Windows Kits\10\Include\10.0.20348.0\ucrt\assert.h(21,9): warning C4005: 'static_assert': macro redefinition [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
(compiling source file '../../../ggml/src/ggml-aarch64.c')
C:\neuro\ik_llama.cpp\ggml\src\ggml-common.h(69,9): see previous definition of 'static_assert'

Generating Code...
sgemm.cpp
iqk_mul_mat.cpp
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(177,16): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(260,16): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(9584,9): warning C4065: switch statement contains 'default' but no 'case' labels [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(3049,84): warning C4244: 'argument': conversion from 'const uint16_t' to 'char', possible loss of data [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(3049,84): the template instantiation context (the oldest one first) is
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(9649,21): see reference to function template instantiation 'void `anonymous-namespace'::MulMat::set_functions<`anonymous-namespace'::DequantizerIQ2KS>(`anonymous-namespace'::MulMat &)' being compiled
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(9511,30): see reference to function template instantiation 'void `anonymous-namespace'::mul_mat_qX_K_q8_K_T<Dequantizer,1>(int,const void *,size_t,const `anonymous-namespace'::DataInfo &,int)' being compiled
with
[
Dequantizer=`anonymous-namespace'::DequantizerIQ2KS
]
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_mul_mat.cpp(3240,35): see reference to function template instantiation '__m256i `anonymous-namespace'::DequantizerIQ2KS::new_block<`anonymous-namespace'::Q8<1,block_q8_K>>(int,const Q8 &,__m256 *)' being compiled
with
[
Q8=`anonymous-namespace'::Q8<1,block_q8_K>
]

iqk_flash_attn.cpp
C:\neuro\ik_llama.cpp\ggml\src\iqk\iqk_flash_attn.cpp(88,24): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\ggml\src\ggml.vcxproj]
iqk_quantize.cpp
Generating Code...
Auto build dll exports
Creating library C:/neuro/ik_llama.cpp/build/ggml/src/Release/ggml.lib and object C:/neuro/ik_llama.cpp/build/ggml/src/Release/ggml.exp
ggml.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\ggml.dll
Building Custom Rule C:/neuro/ik_llama.cpp/src/CMakeLists.txt
llama.cpp
C:\neuro\ik_llama.cpp\src\llama.cpp(2635,40): warning C4305: 'initializing': truncation from 'double' to 'float' [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(5511,17): warning C4065: switch statement contains 'default' but no 'case' labels [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(5520,17): warning C4065: switch statement contains 'default' but no 'case' labels [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 2 has type 'int64_t' [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%lld' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%Id' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%I64d' in the format string

C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 3 has type 'int64_t' [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%lld' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%Id' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%I64d' in the format string

C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 4 has type 'int64_t' [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%lld' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%Id' in the format string
C:\neuro\ik_llama.cpp\src\llama.cpp(8970,24): consider using '%I64d' in the format string

llama-vocab.cpp
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(138,26): warning C4244: 'return': conversion from 'long' to 'uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(211,35): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(211,30): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(543,39): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(543,34): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(583,82): warning C4267: '=': conversion from 'size_t' to 'llm_symbol::index', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(586,61): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(680,37): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(680,25): warning C4267: 'initializing': conversion from 'size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-vocab.cpp(1543,20): warning C4267: 'return': conversion from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
llama-grammar.cpp
llama-sampling.cpp
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(26,20): warning C4244: '=': conversion from 'time_t' to 'uint32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(70,23): warning C4267: '=': conversion from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(405,33): warning C4244: '=': conversion from 'double' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(409,34): warning C4244: '/=': conversion from 'double' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(510,34): warning C4244: 'initializing': conversion from 'float' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(510,27): warning C4244: 'initializing': conversion from 'float' to 'const int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
C:\neuro\ik_llama.cpp\src\llama-sampling.cpp(530,61): warning C4244: 'argument': conversion from 'const int32_t' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\src\llama.vcxproj]
unicode.cpp
unicode-data.cpp
Generating Code...
Auto build dll exports
Creating library C:/neuro/ik_llama.cpp/build/src/Release/llama.lib and object C:/neuro/ik_llama.cpp/build/src/Release/llama.exp
llama.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama.dll
Building Custom Rule C:/neuro/ik_llama.cpp/examples/llava/CMakeLists.txt
llava.cpp
C:\neuro\ik_llama.cpp\examples\llava\llava.cpp(346,24): warning C4244: 'initializing': conversion from 'double' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
clip.cpp
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(590,32): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(590,26): warning C4267: 'initializing': conversion from 'size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(824,149): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(824,130): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(824,111): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(824,92): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(838,23): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(838,43): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(872,149): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(872,130): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(872,111): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(872,92): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(887,23): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(887,43): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1218,27): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1010,9): warning C4297: 'clip_model_load': function assumed not to throw an exception but does [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1010,9): __declspec(nothrow), throw(), noexcept(true), or noexcept was specified on the function

C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1429,13): warning C4297: 'clip_model_load': function assumed not to throw an exception but does [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj]
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1429,13): __declspec(nothrow), throw(), noexcept(true), or noexcept was specified on the function

C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1529,48): warning C4267: 'argument': conversion from 'size_t' to 'int' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,58): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,46): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,88): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,77): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,98): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,137): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,125): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,163): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,154): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data 
[C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,173): warning C4244: 'argument': conversion from 'float' to 'cons t unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1627,103): warning C4244: '=': conversion from 'int' to 'float', possi ble loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,58): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,46): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,88): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,77): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,98): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,137): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,125): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,163): warning C4244: 'argument': conversion from 'int' to 'float' , 
possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,154): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,173): warning C4244: 'argument': conversion from 'float' to 'cons t unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1628,103): warning C4244: '=': conversion from 'int' to 'float', possi ble loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,58): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,46): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,88): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,77): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,98): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,137): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,125): warning C4244: 'argument': conversion 
from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,163): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,154): warning C4244: 'argument': conversion from 'int' to 'float' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,173): warning C4244: 'argument': conversion from 'float' to 'cons t unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1629,103): warning C4244: '=': conversion from 'int' to 'float', possi ble loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1630,58): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1630,46): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1630,84): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1630,75): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1630,94): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1632,45): warning C4244: 
'=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1633,40): warning C4244: '=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1634,60): warning C4244: '=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1642,45): warning C4244: '=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1643,40): warning C4244: '=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1644,60): warning C4244: '=': conversion from 'double' to 'float', pos sible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,49): warning C4244: 'initializing': conversion from 'const _Ty' t o 'uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,49): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,49): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,49): warning C4244: _Ty=float [C:\neuro\ik_llama.cpp
build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,49): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,39): warning C4244: 'initializing': conversion from 'const _Ty' t o 'const uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,39): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,39): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,39): warning C4244: _Ty=float [C:\neuro\ik_llama.cpp
build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1647,39): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1648,68): warning C4244: '=': conversion from 'float' to '_Ty', possib le loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1648,68): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1648,68): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1648,68): warning C4244: _Ty=uint8_t [C:\neuro\ik_llama.cp p\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1648,68): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1821,21): warning C4244: 'initializing': conversion from 'double' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1838,32): warning C4244: 'initializing': conversion from 'double' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1838,27): warning C4244: 'initializing': conversion from 'double' to ' const float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1839,63): warning C4244: 'initializing': conversion from 'double' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1839,23): warning C4244: 'initializing': conversion from 'double' to ' const float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] 
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1840,30): warning C4244: 'initializing': conversion from 'double' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1840,24): warning C4244: 'initializing': conversion from 'double' to ' const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1903,32): warning C4244: 'initializing': conversion from 'double' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1903,27): warning C4244: 'initializing': conversion from 'double' to ' const float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1904,63): warning C4244: 'initializing': conversion from 'double' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1904,23): warning C4244: 'initializing': conversion from 'double' to ' const float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1905,30): warning C4244: 'initializing': conversion from 'double' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(1905,24): warning C4244: 'initializing': conversion from 'double' to ' const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,44): warning C4244: 'initializing': conversion from 'const _Ty' t o 'uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,44): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa 
mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,44): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,44): warning C4244: _Ty=float [C:\neuro\ik_llama.cpp
build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,44): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,34): warning C4244: 'initializing': conversion from 'const _Ty' t o 'const uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,34): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,34): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,34): warning C4244: _Ty=float [C:\neuro\ik_llama.cpp
build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2077,34): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2157,11): warning C4267: 'initializing': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2158,11): warning C4267: 'initializing': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2162,24): warning C4244: '=': conversion from 'double' to '_Ty', possi ble loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2162,24): warning C4244: with [C:\neuro\ik_llama.cpp\build\exa mples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2162,24): warning C4244: [ [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2162,24): warning C4244: _Ty=float [C:\neuro\ik_llama.cpp
build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2162,24): warning C4244: ] [C:\neuro\ik_llama.cpp\build\exampl es\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2184,11): warning C4267: 'initializing': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2185,11): warning C4267: 'initializing': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2259,20): warning C4267: 'initializing': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2320,47): warning C4244: '=': conversion from 'double' to 'int', possi ble loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2539,68): warning C4244: 'return': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2542,56): warning C4244: 'return': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2545,46): warning C4244: 'return': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2548,46): warning C4244: 'return': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2555,5): warning C4297: 'clip_n_mmproj_embd': function assumed not to throw an exception but does [C:\neuro\ik_llama.cpp\build\examples\llava\llava.vcxproj] 
C:\neuro\ik_llama.cpp\examples\llava\clip.cpp(2555,5): __declspec(nothrow), throw(), noexcept(true), or noexcept was specified on the function

Generating Code... llava.vcxproj -> C:\neuro\ik_llama.cpp\build\examples\llava\llava.dir\Release\llava.lib Building Custom Rule C:/neuro/ik_llama.cpp/common/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/benchmark/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/quantize-stats/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/llava/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/llava/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt common.cpp benchmark-matmult.cpp gguf.cpp quantize-stats.cpp Creating library C:/neuro/ik_llama.cpp/build/examples/llava/Release/llava_shared.lib and object C:/neuro/ik_lla ma.cpp/build/examples/llava/Release/llava_shared.exp llava_static.vcxproj -> C:\neuro\ik_llama.cpp\build\examples\llava\Release\llava_static.lib test-c.c Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf-hash/CMakeLists.txt llava_shared.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llava_shared.dll C:\neuro\ik_llama.cpp\examples\gguf\gguf.cpp(69,31): warning C4244: '=': conversion from 'int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf\llama-gguf.vcxproj] test-c.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-c.exe gguf-hash.cpp llama-gguf.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-gguf.exe llama-bench-matmult.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-bench-matmult.exe C:\neuro\ik_llama.cpp\common\common.cpp(328,30): warning C4996: 'strdup': The POSIX name for this item is deprecated . Instead, use the ISO C and C++ conformant name: _strdup. See online help for details. 
[C:\neuro\ik_llama.cpp\build \common\common.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-hash\gguf-hash.cpp(383,55): warning C4267: 'argument': conversion from 'size_t' to 'uint32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-hash\llama-gguf-hash.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-hash\gguf-hash.cpp(412,80): warning C4267: 'argument': conversion from 'size_t' to 'uint32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-hash\llama-gguf-hash.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-hash\gguf-hash.cpp(453,78): warning C4267: 'argument': conversion from 'size_t' to 'uint32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-hash\llama-gguf-hash.vcxproj] llama-gguf-hash.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-gguf-hash.exe llama-quantize-stats.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-quantize-stats.exe sampling.cpp C:\neuro\ik_llama.cpp\common\sampling.cpp(105,45): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\sampling.cpp(105,20): warning C4267: 'initializing': conversion from 'size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] console.cpp C:\neuro\ik_llama.cpp\common\console.cpp(253,30): warning C4267: 'initializing': conversion from 'size_t' to 'DWORD' , possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\console.cpp(407,28): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] grammar-parser.cpp json-schema-to-grammar.cpp C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(139,46): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] 
C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(139,37): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(154,50): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(154,41): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(234,29): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(245,33): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\json-schema-to-grammar.cpp(558,60): warning C4101: 'e': unreferenced local variable [C: \neuro\ik_llama.cpp\build\common\common.vcxproj] train.cpp ngram-cache.cpp C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(20,50): warning C4244: 'argument': conversion from 'int64_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(100,16): warning C4267: 'initializing': conversion from 'size_t' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(147,34): warning C4267: 'initializing': conversion from 'size_t' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(147,24): warning C4267: 'initializing': conversion from 'size_t' to 'co nst int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] 
C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(156,82): warning C4267: 'initializing': conversion from 'size_t' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(156,38): warning C4267: 'initializing': conversion from 'size_t' to 'co nst int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(170,77): warning C4267: 'initializing': conversion from 'size_t' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(170,38): warning C4267: 'initializing': conversion from 'size_t' to 'co nst int', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(202,50): warning C4267: 'initializing': conversion from 'size_t' to 'in t32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] C:\neuro\ik_llama.cpp\common\ngram-cache.cpp(202,31): warning C4267: 'initializing': conversion from 'size_t' to 'co nst int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\common\common.vcxproj] Generating Code... 
common.vcxproj -> C:\neuro\ik_llama.cpp\build\common\Release\common.lib Building Custom Rule C:/neuro/ik_llama.cpp/examples/llava/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/lookup/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/gguf-split/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/sweep-bench/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/tokenize/CMakeLists.txt lookup-merge.cpp llava-cli.cpp test-sampling.cpp test-json-schema-to-grammar.cpp test-quantize-fns.cpp test-quantize-perf.cpp Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(157,34): warning C4244: 'argument': conversion from 'llama_token' to ' float', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(164,45): warning C4267: 'initializing': conversion from 'size_t' to 'l lama_token', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(164,36): warning C4267: 'initializing': conversion from 'size_t' to 'c onst llama_token', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] 
C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(179,38): warning C4267: 'initializing': conversion from 'size_t' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(179,24): warning C4267: 'initializing': conversion from 'size_t' to 'c onst int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(189,67): warning C4267: 'initializing': conversion from 'size_t' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(189,39): warning C4267: 'initializing': conversion from 'size_t' to 'c onst int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(190,55): warning C4244: 'initializing': conversion from 'float' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(190,48): warning C4244: 'initializing': conversion from 'float' to 'co nst int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(192,33): warning C4267: '=': conversion from 'size_t' to 'llama_token' , possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(212,31): warning C4244: 'initializing': conversion from 'float' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(216,34): warning C4244: '=': conversion from 'float' to 'llama_token', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(229,12): warning C4477: 'printf' : format string '%05ld' requires an a rgument of type 'long', but variadic argument 2 has type 
'const size_t' [C:\neuro\ik_llama.cpp\build\tests\test-samp ling.vcxproj] C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(229,12): consider using '%zd' in the format string

Building Custom Rule C:/neuro/ik_llama.cpp/examples/export-lora/CMakeLists.txt C:\neuro\ik_llama.cpp\tests\test-sampling.cpp(275,49): warning C4305: 'argument': truncation from 'double' to 'const float' [C:\neuro\ik_llama.cpp\build\tests\test-sampling.vcxproj] test-tokenizer-1-spm.cpp Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt test-rope.cpp gguf-split.cpp test-tokenizer-0.cpp test-model-load-cancel.cpp get-model.cpp Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Generating Code... get-model.cpp get-model.cpp Generating Code... Generating Code... C:\neuro\ik_llama.cpp\examples\llava\llava-cli.cpp(89,105): warning C4267: 'argument': conversion from 'size_t' to ' int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llama-llava-cli.vcxproj] get-model.cpp get-model.cpp Generating Code... Generating Code... sweep-bench.cpp export-lora.cpp tokenize.cpp test-backend-ops.cpp test-grad0.cpp test-chat-template.cpp get-model.cpp Generating Code... 
test-grammar-integration.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/passkey/CMakeLists.txt test-tokenizer-1-bpe.cpp C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(601,20): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(632,24): warning C4244: 'initializing': conversion from 'int64_t' t o 'double', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,87): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,87): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,87): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,87): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,87): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,75): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,75): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,75): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,75): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. 
cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,75): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,63): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,63): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,63): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,63): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,63): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,51): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,51): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,51): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,51): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. 
cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(778,51): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,87): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,87): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,87): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,87): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,87): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,75): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,75): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,75): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,75): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. 
cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,75): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,63): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,63): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,63): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,63): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,63): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,51): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,51): warning C4244: with [C:\neuro\ik_llama.cpp\build\t ests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,51): warning C4244: [ [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,51): warning C4244: _Ty=int64_t [C:\neuro\ik_llama. 
cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(814,51): warning C4244: ] [C:\neuro\ik_llama.cpp\build\test s\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1280,85): warning C4244: 'argument': conversion from 'const int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1280,81): warning C4244: 'argument': conversion from 'const int' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1431,35): warning C4244: '=': conversion from 'int' to '_Ty', possi ble loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1431,35): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1431,35): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1431,35): warning C4244: _Ty=float [C:\neuro\ik_llama.c pp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1431,35): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,94): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,94): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,94): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,94): warning C4244: _Ty=int64_t [C:\neuro\ik_llama .cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,94): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,83): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,83): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,83): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,83): warning C4244: _Ty=int64_t [C:\neuro\ik_llama .cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,83): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,73): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,73): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,73): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,73): warning C4244: _Ty=int64_t [C:\neuro\ik_llama .cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,73): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,62): warning C4244: 'argument': conversion from 'const _Ty' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,62): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,62): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,62): warning C4244: _Ty=int64_t [C:\neuro\ik_llama .cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1504,62): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(1677,77): warning C4244: 'argument': conversion from 'const int64_t ' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2377,32): warning C4244: 'initializing': conversion from 'const _El em' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2377,32): warning C4244: with [C:\neuro\ik_llama.cpp\build
tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2377,32): warning C4244: [ [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2377,32): warning C4244: _Elem=int [C:\neuro\ik_llama.c pp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2377,32): warning C4244: ] [C:\neuro\ik_llama.cpp\build\tes ts\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2383,125): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2386,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2387,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2388,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2392,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2393,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2394,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2395,129): warning C4244: 'argument': conversion from 'float' 
to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2396,129): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-backend-ops.cpp(2399,125): warning C4244: 'argument': conversion from 'float' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-backend-ops.vcxproj] C:\neuro\ik_llama.cpp\tests\test-chat-template.cpp(117,143): warning C4267: 'argument': conversion from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-chat-template.vcxproj] C:\neuro\ik_llama.cpp\tests\test-chat-template.cpp(131,32): warning C4267: 'argument': conversion from 'size_t' to ' int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-chat-template.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(257,68): warning C4267: 'argument': conversion from 'size_t ' to 'uint16_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-split\llama-gguf-split.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(278,16): warning C4477: 'printf' : format string '%ld' requ ires an argument of type 'long', but variadic argument 1 has type 'unsigned __int64' [C:\neuro\ik_llama.cpp\build\ex amples\gguf-split\llama-gguf-split.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(278,16): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(288,20): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 3 has type 'size_t' [C:\neuro\ik_llama.cpp\build\examples\gguf-split\llama-gguf-split.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(288,20): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(295,21): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-split\llama-gguf-split.vcxproj] C:\neuro\ik_llama.cpp\examples\gguf-split\gguf-split.cpp(369,17): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gguf-split\llama-gguf-split.vcxproj] Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/save-load-state/CMakeLists.txt test-llama-grammar.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/simple/CMakeLists.txt C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(254,16): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 2 has type 'size_t' [C:\neuro\ik_llama.cpp\build\examples\export-lora\llama-export-lora.vcxproj] C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(254,16): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(255,16): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 2 has type 'unsigned __int64' [C:\neuro\ik_llama.cpp\build\examples\export-lora\llama-export-lora.vcxproj] C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(255,16): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(337,24): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 2 has type 'size_t' [C:\neuro\ik_llama.cpp\build\examples\export-lora\llama-export-lora.vcxproj] C:\neuro\ik_llama.cpp\examples\export-lora\export-lora.cpp(337,24): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(94,77): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(98,57): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(150,91): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(155,25): warning C4267: 'initializing': conversion from 'size_t ' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(172,52): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(185,31): warning C4267: 'initializing': conversion from 'size_t ' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(185,20): warning C4267: 'initializing': conversion from 'size_t ' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(399,16): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 1 has type 'unsigned __int64' [C:\neuro\ik_llama.cpp\build\exampl es\tokenize\llama-tokenize.vcxproj] C:\neuro\ik_llama.cpp\examples\tokenize\tokenize.cpp(399,16): consider using '%zd' in the format string

get-model.cpp passkey.cpp test-autorelease.cpp save-load-state.cpp simple.cpp Generating Code... get-model.cpp test-tokenizer-1-spm.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-tokenizer-1-spm.exe llama-lookup-merge.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-lookup-merge.exe C:\neuro\ik_llama.cpp\tests\test-llama-grammar.cpp(205,20): warning C4267: '=': conversion from 'size_t' to 'uint32_ t', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-llama-grammar.vcxproj] get-model.cpp Generating Code... Generating Code... get-model.cpp C:\neuro\ik_llama.cpp\examples\save-load-state\save-load-state.cpp(45,69): warning C4267: 'argument': conversion fro m 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\save-load-state\llama-save-load -state.vcxproj] C:\neuro\ik_llama.cpp\examples\save-load-state\save-load-state.cpp(46,26): warning C4267: '+=': conversion from 'siz e_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\save-load-state\llama-save-load-state.vcx proj] C:\neuro\ik_llama.cpp\examples\simple\simple.cpp(64,45): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\simple\llama-simple.vcxproj] C:\neuro\ik_llama.cpp\examples\simple\simple.cpp(64,24): warning C4267: 'initializing': conversion from 'size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\simple\llama-simple.vcxproj] C:\neuro\ik_llama.cpp\examples\simple\simple.cpp(92,48): warning C4267: 'argument': conversion from 'size_t' to 'lla ma_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\simple\llama-simple.vcxproj] Generating Code... 
C:\neuro\ik_llama.cpp\examples\passkey\passkey.cpp(29,23): warning C4244: 'argument': conversion from 'time_t' to 'u nsigned int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\passkey\llama-passkey.vcxproj] C:\neuro\ik_llama.cpp\examples\passkey\passkey.cpp(94,80): warning C4267: 'initializing': conversion from 'size_t' t o 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\passkey\llama-passkey.vcxproj] C:\neuro\ik_llama.cpp\examples\passkey\passkey.cpp(94,31): warning C4267: 'initializing': conversion from 'size_t' t o 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\passkey\llama-passkey.vcxproj] C:\neuro\ik_llama.cpp\examples\passkey\passkey.cpp(96,46): warning C4267: 'initializing': conversion from 'size_t' t o 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\passkey\llama-passkey.vcxproj] C:\neuro\ik_llama.cpp\examples\passkey\passkey.cpp(96,28): warning C4267: 'initializing': conversion from 'size_t' t o 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\passkey\llama-passkey.vcxproj] get-model.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/lookup/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/pocs/vdot/CMakeLists.txt Generating Code... 
Building Custom Rule C:/neuro/ik_llama.cpp/examples/retrieval/CMakeLists.txt Creating library C:/neuro/ik_llama.cpp/build/examples/llava/Release/llama-llava-cli.lib and object C:/neuro/ik_ llama.cpp/build/examples/llava/Release/llama-llava-cli.exp lookup.cpp test-sampling.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-sampling.exe q8dot.cpp test-grad0.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-grad0.exe test-rope.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-rope.exe llama-llava-cli.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-llava-cli.exe retrieval.cpp test-quantize-fns.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-quantize-fns.exe test-tokenizer-1-bpe.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-tokenizer-1-bpe.exe test-autorelease.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-autorelease.exe llama-tokenize.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-tokenize.exe test-tokenizer-0.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-tokenizer-0.exe get-model.cpp C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(56,102): warning C4267: 'argument': conversion from 'size_t' to 'in t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(92,33): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(92,23): warning C4267: 'initializing': conversion from 'size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(105,16): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(210,57): warning C4267: 
'argument': conversion from 'size_t' to 'll ama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup.cpp(214,35): warning C4267: '+=': conversion from 'size_t' to 'int', po ssible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(79,43): warning C4267: 'argument': conversion from 'size_t' t o 'llama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\retrieval\llama-retrieval.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(146,12): warning C4477: 'printf' : format string '%ld' requir es an argument of type 'long', but variadic argument 1 has type 'unsigned __int64' [C:\neuro\ik_llama.cpp\build\exam ples\retrieval\llama-retrieval.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(146,12): consider using '%zd' in the format string

C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(214,37): warning C4267: 'initializing': conversion from 'size _t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\retrieval\llama-retrieval.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(214,24): warning C4267: 'initializing': conversion from 'size _t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\retrieval\llama-retrieval.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(215,49): warning C4244: 'argument': conversion from 'const ui nt64_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\retrieval\llama-retrieval.vcxproj] C:\neuro\ik_llama.cpp\examples\retrieval\retrieval.cpp(263,59): warning C4244: 'argument': conversion from 'const ui nt64_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\retrieval\llama-retrieval.vcxproj] Generating Code... Building Custom Rule C:/neuro/ik_llama.cpp/examples/gritlm/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/llava/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/main/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/pocs/vdot/CMakeLists.txt test-chat-template.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-chat-template.exe Building Custom Rule C:/neuro/ik_llama.cpp/examples/perplexity/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/cvector-generator/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/embedding/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/tests/CMakeLists.txt gritlm.cpp minicpmv-cli.cpp vdot.cpp main.cpp perplexity.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/convert-llama2c-to-ggml/CMakeLists.txt C:\neuro\ik_llama.cpp\examples\gritlm\gritlm.cpp(23,43): warning C4267: 'initializing': conversion from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gritlm\llama-gritlm.vcxproj] 
C:\neuro\ik_llama.cpp\examples\gritlm\gritlm.cpp(23,30): warning C4267: 'initializing': conversion from 'size_t' to 'const int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gritlm\llama-gritlm.vcxproj] C:\neuro\ik_llama.cpp\examples\gritlm\gritlm.cpp(30,82): warning C4267: 'initializing': conversion from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gritlm\llama-gritlm.vcxproj] C:\neuro\ik_llama.cpp\examples\gritlm\gritlm.cpp(30,30): warning C4267: 'initializing': conversion from 'size_t' to 'const int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gritlm\llama-gritlm.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\minicpmv-cli.cpp(198,27): warning C4244: 'initializing': conversion from 'doubl e' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llama-minicpmv-cli.vcxproj] C:\neuro\ik_llama.cpp\examples\llava\minicpmv-cli.cpp(204,30): warning C4244: 'initializing': conversion from 'doubl e' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llava\llama-minicpmv-cli.vcxproj] C:\neuro\ik_llama.cpp\examples\gritlm\gritlm.cpp(77,65): warning C4244: 'argument': conversion from 'uint64_t' to 'i nt', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\gritlm\llama-gritlm.vcxproj] Building Custom Rule C:/neuro/ik_llama.cpp/examples/speculative/CMakeLists.txt cvector-generator.cpp embedding.cpp test-quantize-perf.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-quantize-perf.exe convert-llama2c-to-ggml.cpp test-model-load-cancel.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-model-load-cancel.exe llama-gguf-split.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-gguf-split.exe llama-retrieval.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-retrieval.exe test-backend-ops.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-backend-ops.exe test-json-schema-to-grammar.vcxproj -> 
C:\neuro\ik_llama.cpp\build\bin\Release\test-json-schema-to-grammar.exe test-grammar-parser.cpp llama-q8dot.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-q8dot.exe llama-lookup.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-lookup.exe llama-simple.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-simple.exe llama-export-lora.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-export-lora.exe speculative.cpp C:\neuro\ik_llama.cpp\tests\test-grammar-parser.cpp(39,73): warning C4267: 'argument': conversion from 'size_t' to ' unsigned int', possible loss of data [C:\neuro\ik_llama.cpp\build\tests\test-grammar-parser.vcxproj] C:\neuro\ik_llama.cpp\examples\main\main.cpp(399,19): warning C4804: '>': unsafe use of type 'bool' in operation [C: \neuro\ik_llama.cpp\build\examples\main\llama-cli.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\pca.hpp(29,43): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcx proj] (compiling source file '../../examples/cvector-generator/cvector-generator.cpp')

C:\neuro\ik_llama.cpp\examples\cvector-generator\pca.hpp(41,23): warning C4305: 'initializing': truncation from 'double' to 'float' [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] (compiling source file '../../examples/cvector-generator/cvector-generator.cpp')

C:\neuro\ik_llama.cpp\examples\cvector-generator\pca.hpp(318,26): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] (compiling source file '../../examples/cvector-generator/cvector-generator.cpp')

C:\neuro\ik_llama.cpp\examples\cvector-generator\pca.hpp(319,39): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] (compiling source file '../../examples/cvector-generator/cvector-generator.cpp')

C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(99,41): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(100,41): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(101,50): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(106,60): warning C4244: 'argument': conversion from 'float' to 'const unsigned __int64', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(117,24): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(127,45): warning C4305: 'argument': truncation from 'double' to 'float' [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(133,28): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(135,20): warning C4244: 'initializing': conversion from 'int64_t' to 'int', possible
loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-c vector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(232,24): warning C4267: 'initializing': conve rsion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cv ector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(342,73): warning C4267: 'argument': conversio n from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cv ector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(355,71): warning C4267: 'argument': conversio n from 'size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cv ector-generator.vcxproj] C:\neuro\ik_llama.cpp\examples\cvector-generator\cvector-generator.cpp(450,29): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\cvector-generator\llama-cvector-gener ator.vcxproj] get-model.cpp Generating Code... 
Building Custom Rule C:/neuro/ik_llama.cpp/examples/eval-callback/CMakeLists.txt Building Custom Rule C:/neuro/ik_llama.cpp/examples/gbnf-validator/CMakeLists.txt Generating colorthemes.css.hpp test-llama-grammar.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-llama-grammar.exe C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(47,27): warning C4244: '=': conversion from 'time_t' to ' uint32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(154,33): warning C4267: 'initializing': conversion from ' size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(154,23): warning C4267: 'initializing': conversion from ' size_t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vc xproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(175,20): warning C4267: 'initializing': conversion from ' size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(176,20): warning C4267: 'initializing': conversion from ' size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(244,102): warning C4267: 'argument': conversion from 'siz e_t' to 'Ty', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(244,102): warning C4267: with [C:\neuro\ik_llama. 
cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(244,102): warning C4267: [ [C:\neuro\ik_llama.cpp \build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(244,102): warning C4267: Ty=unsigned int [C: \neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(244,102): warning C4267: ] [C:\neuro\ik_llama.cpp \build\examples\speculative\llama-speculative.vcxproj] C:\neuro\ik_llama.cpp\examples\speculative\speculative.cpp(260,33): warning C4244: 'initializing': conversion from ' double' to 'float', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\speculative\llama-speculative.vcxpro j] Generating style.css.hpp llama-passkey.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-passkey.exe Building Custom Rule C:/neuro/ik_llama.cpp/examples/lookup/CMakeLists.txt eval-callback.cpp gbnf-validator.cpp Generating theme-beeninorder.css.hpp test-grammar-integration.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-grammar-integration.exe Generating theme-ketivah.css.hpp Creating library C:/neuro/ik_llama.cpp/build/examples/llava/Release/llama-minicpmv-cli.lib and object C:/neuro/ ik_llama.cpp/build/examples/llava/Release/llama-minicpmv-cli.exp lookup-stats.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/infill/CMakeLists.txt Generating theme-mangotango.css.hpp llama-gritlm.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-gritlm.exe llama-minicpmv-cli.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-minicpmv-cli.exe infill.cpp Generating theme-playground.css.hpp test-grammar-parser.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\test-grammar-parser.exe llama-embedding.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-embedding.exe C:\neuro\ik_llama.cpp\examples\eval-callback\eval-callback.cpp(134,73): warning C4267: 
'argument': conversion from ' size_t' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\eval-callback\llama-eval-callback. vcxproj] llama-save-load-state.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-save-load-state.exe Building Custom Rule C:/neuro/ik_llama.cpp/examples/batched/CMakeLists.txt C:\neuro\ik_llama.cpp\examples\lookup\lookup-stats.cpp(66,33): warning C4267: 'initializing': conversion from 'size t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup-stats.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup-stats.cpp(66,23): warning C4267: 'initializing': conversion from 'size t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup-stats.vcxproj] Generating theme-polarnight.css.hpp C:\neuro\ik_llama.cpp\examples\lookup\lookup-stats.cpp(92,39): warning C4267: '+=': conversion from 'size_t' to 'int ', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup-stats.vcxproj] Building Custom Rule C:/neuro/ik_llama.cpp/examples/batched-bench/CMakeLists.txt Generating theme-snowstorm.css.hpp Generating index.html.hpp batched.cpp batched-bench.cpp llama-convert-llama2c-to-ggml.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-convert-llama2c-to-ggml.exe llama-cvector-generator.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-cvector-generator.exe llama-gbnf-validator.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-gbnf-validator.exe llama-perplexity.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-perplexity.exe llama-sweep-bench.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-sweep-bench.exe llama-eval-callback.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-eval-callback.exe Generating index-new.html.hpp C:\neuro\ik_llama.cpp\examples\batched\batched.cpp(57,45): warning C4267: 'initializing': conversion from 'size_t' t o 'int', possible loss of data 
[C:\neuro\ik_llama.cpp\build\examples\batched\llama-batched.vcxproj] C:\neuro\ik_llama.cpp\examples\batched\batched.cpp(57,24): warning C4267: 'initializing': conversion from 'size_t' t o 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\batched\llama-batched.vcxproj] C:\neuro\ik_llama.cpp\examples\batched\batched.cpp(96,50): warning C4267: 'argument': conversion from 'size_t' to 'i nt32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\batched\llama-batched.vcxproj] C:\neuro\ik_llama.cpp\examples\batched\batched.cpp(105,48): warning C4267: 'argument': conversion from 'size_t' to ' llama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\batched\llama-batched.vcxproj] Building Custom Rule C:/neuro/ik_llama.cpp/examples/lookahead/CMakeLists.txt Generating index.js.hpp Generating completion.js.hpp Generating system-prompts.js.hpp llama-lookup-stats.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-lookup-stats.exe lookahead.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/baby-llama/CMakeLists.txt Generating prompt-formats.js.hpp llama-batched-bench.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-batched-bench.exe Generating json-schema-to-grammar.mjs.hpp baby-llama.cpp llama-batched.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-batched.exe Building Custom Rule C:/neuro/ik_llama.cpp/examples/server/CMakeLists.txt server.cpp llama-infill.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-infill.exe C:\neuro\ik_llama.cpp\examples\lookahead\lookahead.cpp(90,33): warning C4267: 'initializing': conversion from 'size t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookahead\llama-lookahead.vcxproj] C:\neuro\ik_llama.cpp\examples\lookahead\lookahead.cpp(90,23): warning C4267: 'initializing': conversion from 'size t' to 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookahead\llama-lookahead.vcxproj] 
C:\neuro\ik_llama.cpp\examples\lookahead\lookahead.cpp(107,16): warning C4267: 'initializing': conversion from 'size _t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookahead\llama-lookahead.vcxproj] llama-speculative.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-speculative.exe C:\neuro\ik_llama.cpp\examples\lookahead\lookahead.cpp(364,129): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookahead\llama-lookahead.vcxproj] llama-cli.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-cli.exe llama-vdot.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-vdot.exe Building Custom Rule C:/neuro/ik_llama.cpp/examples/quantize/CMakeLists.txt llama-baby-llama.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-baby-llama.exe C:\neuro\ik_llama.cpp\examples\server\utils.hpp(171,16): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] (compiling source file '../../../examples/server/server.cpp')

C:\neuro\ik_llama.cpp\examples\server\utils.hpp(182,52): warning C4267: '=': conversion from 'size_t' to 'uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] (compiling source file '../../../examples/server/server.cpp')

C:\neuro\ik_llama.cpp\examples\server\utils.hpp(203,48): warning C4267: '=': conversion from 'size_t' to 'uint8_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] (compiling source file '../../../examples/server/server.cpp')

quantize.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/parallel/CMakeLists.txt parallel.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/lookup/CMakeLists.txt llama-lookahead.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-lookahead.exe C:\neuro\ik_llama.cpp\examples\parallel\parallel.cpp(163,21): warning C4267: '=': conversion from 'size_t' to 'int32 _t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\parallel\llama-parallel.vcxproj] C:\neuro\ik_llama.cpp\examples\parallel\parallel.cpp(169,55): warning C4267: 'initializing': conversion from 'size_t ' to 'int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\parallel\llama-parallel.vcxproj] C:\neuro\ik_llama.cpp\examples\parallel\parallel.cpp(169,35): warning C4267: 'initializing': conversion from 'size_t ' to 'const int32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\parallel\llama-parallel.vcxproj] C:\neuro\ik_llama.cpp\examples\parallel\parallel.cpp(263,68): warning C4267: 'argument': conversion from 'size_t' to 'llama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\parallel\llama-parallel.vcxproj] C:\neuro\ik_llama.cpp\examples\parallel\parallel.cpp(271,58): warning C4267: '=': conversion from 'size_t' to 'int32 _t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\parallel\llama-parallel.vcxproj] lookup-create.cpp Building Custom Rule C:/neuro/ik_llama.cpp/examples/imatrix/CMakeLists.txt imatrix.cpp C:\neuro\ik_llama.cpp\examples\server\server.cpp(361,48): warning C4244: '+=': conversion from 'const double' to 'ui nt64_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(362,48): warning C4244: '+=': conversion from 'const double' to 'ui nt64_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(368,43): warning C4244: '+=': conversion 
from 'const double' to 'ui nt64_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(369,43): warning C4244: '+=': conversion from 'const double' to 'ui nt64_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(842,37): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(845,29): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] Building Custom Rule C:/neuro/ik_llama.cpp/examples/llama-bench/CMakeLists.txt C:\neuro\ik_llama.cpp\examples\server\server.cpp(1570,73): warning C4267: 'initializing': conversion from 'size_t' t o 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(1570,32): warning C4267: 'initializing': conversion from 'size_t' t o 'const int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\lookup\lookup-create.cpp(39,96): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\lookup\llama-lookup-create.vcxproj] llama-bench.cpp C:\neuro\ik_llama.cpp\examples\server\server.cpp(1969,103): warning C4267: 'argument': conversion from 'size_t' to ' llama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2001,71): warning C4267: 'argument': conversion from 'size_t' to 'l lama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] 
C:\neuro\ik_llama.cpp\examples\server\server.cpp(2083,66): warning C4267: '=': conversion from 'size_t' to 'int32_t' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2143,74): warning C4267: '=': conversion from 'size_t' to 'int32_t' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2167,58): warning C4267: '=': conversion from 'size_t' to 'int32_t' , possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2203,46): warning C4805: '!=': unsafe mix of type 'int32_t' and typ e 'bool' in operation [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2253,97): warning C4267: 'argument': conversion from 'size_t' to 'l lama_pos', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(2421,57): warning C4267: 'argument': conversion from 'size_t' to 'i nt32_t', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] C:\neuro\ik_llama.cpp\examples\server\server.cpp(3363,21): warning C4267: 'initializing': conversion from 'size_t' t o 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\server\llama-server.vcxproj] llama-parallel.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-parallel.exe llama-quantize.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-quantize.exe C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(409,30): warning C4996: 'strdup': The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name: _strdup. See online help for details. 
[C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
llama-lookup-create.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-lookup-create.exe
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(1235,31): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): warning C4244: 'initializing': conversion from 'double' to 'T', possible loss of data [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): warning C4244: with [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): warning C4244: [ [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): warning C4244: T=uint64_t [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): warning C4244: ] [C:\neuro\ik_llama.cpp\build\examples\llama-bench\llama-bench.vcxproj]
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(92,13): the template instantiation context (the oldest one first) is
C:\neuro\ik_llama.cpp\examples\llama-bench\llama-bench.cpp(1145,18): see reference to function template instantiation 'T stdev<uint64_t>(const std::vector<uint64_t,std::allocator<uint64_t>> &)' being compiled with [ T=uint64_t ]

llama-imatrix.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-imatrix.exe llama-bench.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-bench.exe llama-server.vcxproj -> C:\neuro\ik_llama.cpp\build\bin\Release\llama-server.exe Building Custom Rule C:/neuro/ik_llama.cpp/CMakeLists.txt

C:\neuro\ik_llama.cpp>


PS C:\neuro\ik_llama.cpp\build\bin\Release> ./llama-server.exe -t 7 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf INFO [ main] build info | tid="11116" timestamp=1746438993 build=3667 commit="e3fec173" INFO [ main] system info | tid="11116" timestamp=1746438993 n_threads=7 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 17: 
qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 48 llm_load_print_meta: 
n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 30.532 B llm_load_print_meta: model size = 20.228 GiB (5.691 BPW) llm_load_print_meta: repeating layers = 19.791 GiB (5.684 BPW, 29.910 B parameters) llm_load_print_meta: general.name = Qwen3-30B-A3B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 '├Д─м' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: CPU buffer size = 20713.44 MiB ................................................................................................... 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 1.16 MiB llama_new_context_with_model: CPU compute buffer size = 304.75 MiB llama_new_context_with_model: graph nodes = 2165 llama_new_context_with_model: graph splits = 1 INFO [ init] initializing slots | tid="11116" timestamp=1746439008 n_slots=1 INFO [ init] new slot | tid="11116" timestamp=1746439008 id_slot=0 n_ctx_slot=4096 INFO [ main] model loaded | tid="11116" timestamp=1746439008 INFO [ main] chat template | tid="11116" timestamp=1746439008 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true INFO [ main] HTTP server listening | tid="11116" timestamp=1746439008 hostname="127.0.0.1" port="8080" n_threads_http="15" INFO [ update_slots] all slots are idle | tid="11116" timestamp=1746439008 INFO [ log_server_request] request | tid="19268" timestamp=1746439081 remote_addr="127.0.0.1" remote_port=63234 status=404 method="GET" path="/models" params={} INFO [ launch_slot_with_task] slot is processing task | tid="11116" timestamp=1746439086 id_slot=0 id_task=0 INFO [ update_slots] kv cache rm [p0, end) | tid="11116" timestamp=1746439086 id_slot=0 id_task=0 p0=0 PS C:\neuro\ik_llama.cpp\build\bin\Release>


PS C:\neuro\ik_llama.cpp\build\bin\Release> ./llama-server.exe -t 7 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf INFO [ main] build info | tid="21556" timestamp=1746439373 build=3667 commit="e3fec173" INFO [ main] system info | tid="21556" timestamp=1746439373 n_threads=7 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 17: 
qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 48 llm_load_print_meta: 
n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 30.532 B llm_load_print_meta: model size = 20.228 GiB (5.691 BPW) llm_load_print_meta: repeating layers = 19.791 GiB (5.684 BPW, 29.910 B parameters) llm_load_print_meta: general.name = Qwen3-30B-A3B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 '├Д─м' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: CPU buffer size = 20713.44 MiB ................................................................................................... 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 1.16 MiB llama_new_context_with_model: CPU compute buffer size = 304.75 MiB llama_new_context_with_model: graph nodes = 2165 llama_new_context_with_model: graph splits = 1 INFO [ init] initializing slots | tid="21556" timestamp=1746439379 n_slots=1 INFO [ init] new slot | tid="21556" timestamp=1746439379 id_slot=0 n_ctx_slot=4096 INFO [ main] model loaded | tid="21556" timestamp=1746439379 INFO [ main] chat template | tid="21556" timestamp=1746439379 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true INFO [ main] HTTP server listening | tid="21556" timestamp=1746439379 hostname="127.0.0.1" port="8080" n_threads_http="15" INFO [ update_slots] all slots are idle | tid="21556" timestamp=1746439379 INFO [ log_server_request] request | tid="16816" timestamp=1746439384 remote_addr="127.0.0.1" remote_port=57484 status=200 method="GET" path="/" params={} INFO [ log_server_request] request | tid="15152" timestamp=1746439384 remote_addr="127.0.0.1" remote_port=61232 status=200 method="GET" path="/completion.js" params={} INFO [ log_server_request] request | tid="19108" timestamp=1746439384 remote_addr="127.0.0.1" remote_port=61590 status=200 method="GET" 
path="/json-schema-to-grammar.mjs" params={} INFO [ log_server_request] request | tid="16816" timestamp=1746439384 remote_addr="127.0.0.1" remote_port=57484 status=200 method="GET" path="/index.js" params={} INFO [ log_server_request] request | tid="16816" timestamp=1746439384 remote_addr="127.0.0.1" remote_port=57484 status=404 method="GET" path="/favicon.ico" params={} INFO [ launch_slot_with_task] slot is processing task | tid="21556" timestamp=1746439391 id_slot=0 id_task=0 INFO [ update_slots] kv cache rm [p0, end) | tid="21556" timestamp=1746439391 id_slot=0 id_task=0 p0=0 INFO [ print_timings] prompt eval time = 1253.52 ms / 50 tokens ( 25.07 ms per token, 39.89 tokens per second) | tid="21556" timestamp=1746439402 id_slot=0 id_task=0 t_prompt_processing=1253.524 n_prompt_tokens_processed=50 t_token=25.070479999999996 n_tokens_second=39.88754902179775 INFO [ print_timings] generation eval time = 10483.45 ms / 120 runs ( 87.36 ms per token, 11.45 tokens per second) | tid="21556" timestamp=1746439402 id_slot=0 id_task=0 t_token_generation=10483.451 n_decoded=120 t_token=87.36209166666666 n_tokens_second=11.44661237983561 INFO [ print_timings] total time = 11736.97 ms | tid="21556" timestamp=1746439402 id_slot=0 id_task=0 t_prompt_processing=1253.524 t_token_generation=10483.451 t_total=11736.974999999999 INFO [ update_slots] slot released | tid="21556" timestamp=1746439402 id_slot=0 id_task=0 n_ctx=4096 n_past=169 n_system_tokens=0 n_cache_tokens=169 truncated=false INFO [ update_slots] all slots are idle | tid="21556" timestamp=1746439402 INFO [ log_server_request] request | tid="17584" timestamp=1746439402 remote_addr="127.0.0.1" remote_port=64288 status=200 method="POST" path="/completion" params={} INFO [ update_slots] all slots are idle | tid="21556" timestamp=1746439402 INFO [ launch_slot_with_task] slot is processing task | tid="21556" timestamp=1746439409 id_slot=0 id_task=122 INFO [ update_slots] kv cache rm [p0, end) | tid="21556" 
timestamp=1746439409 id_slot=0 id_task=122 p0=49 PS C:\neuro\ik_llama.cpp\build\bin\Release>


👤 intulint commented the 2025-05-05 at 10:14:49:

Even the benchmark crashes during generation. I don't know what the problem is, but it seems to be related to what happens during generation.

PS C:\neuro\ik_llama.cpp\build\bin\Release> .\llama-sweep-bench.exe -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf -c 4096 -t 7 llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... 
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 48 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: 
f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 30.532 B llm_load_print_meta: model size = 20.228 GiB (5.691 BPW) llm_load_print_meta: repeating layers = 19.791 GiB (5.684 BPW, 29.910 B parameters) llm_load_print_meta: general.name = Qwen3-30B-A3B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 '├Д─м' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: CPU buffer size = 20713.44 MiB ................................................................................................... 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 0.58 MiB llama_new_context_with_model: CPU compute buffer size = 304.75 MiB llama_new_context_with_model: graph nodes = 2165 llama_new_context_with_model: graph splits = 1

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, n_gpu_layers = -1, n_threads = 7, n_threads_batch = 7

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|----|----|------|--------|----------|--------|----------|
| 512 | 128 | 0 | 10.780 | 47.49 | 8.250 | 15.51 |
PS C:\neuro\ik_llama.cpp\build\bin\Release>
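
For reading the single benchmark row printed before the crash, here is a minimal sketch of how the speed columns relate to the time columns. It assumes that S_PP and S_TG are simply the token counts (PP, TG) divided by the elapsed seconds (T_PP, T_TG). The helper name `tokens_per_second` is made up for this illustration and is not part of ik_llama.cpp.

```python
# Recompute the throughput columns of the llama-sweep-bench row quoted above.
# Assumption: S_PP and S_TG are token count divided by elapsed seconds.

def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Return throughput in tokens per second (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / seconds

# Values copied from the one row that was printed.
pp_tokens, tg_tokens = 512, 128
t_pp_s, t_tg_s = 10.780, 8.250

s_pp = tokens_per_second(pp_tokens, t_pp_s)  # prompt processing speed
s_tg = tokens_per_second(tg_tokens, t_tg_s)  # token generation speed

print(f"prompt processing: {s_pp:.2f} t/s")
print(f"token generation:  {s_tg:.2f} t/s")
```

The recomputed values should agree with the reported 47.49 and 15.51 t/s to within rounding, since the times in the table are themselves rounded to three decimals.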

👤 ikawrakow commented the 2025-05-05 at 10:22:33:

Can you try running with -t 8?

If that works, try also adding -fa -rtr -fmoe.


👤 intulint commented the 2025-05-05 at 10:42:45:

Using 8 threads makes no difference. With -fa -rtr -fmoe it finally works, but I noticed that generation pauses for about half a second every time before a comma is written. This is the first time I have seen this. In the llama.cpp AVX2 release, generation is much faster.

PS C:\neuro\ik_llama.cpp\build\bin\Release> ./llama-server.exe -t 8 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf INFO [ main] build info | tid="11244" timestamp=1746440931 build=3667 commit="e3fec173" INFO [ main] system info | tid="11244" timestamp=1746440931 n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 17: 
qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 48 llm_load_print_meta: 
n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 30.532 B llm_load_print_meta: model size = 20.228 GiB (5.691 BPW) llm_load_print_meta: repeating layers = 19.791 GiB (5.684 BPW, 29.910 B parameters) llm_load_print_meta: general.name = Qwen3-30B-A3B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 '├Д─м' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: CPU buffer size = 20713.44 MiB ................................................................................................... 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 1.16 MiB llama_new_context_with_model: CPU compute buffer size = 304.75 MiB llama_new_context_with_model: graph nodes = 2165 llama_new_context_with_model: graph splits = 1 INFO [ init] initializing slots | tid="11244" timestamp=1746440937 n_slots=1 INFO [ init] new slot | tid="11244" timestamp=1746440937 id_slot=0 n_ctx_slot=4096 INFO [ main] model loaded | tid="11244" timestamp=1746440937 INFO [ main] chat template | tid="11244" timestamp=1746440937 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true INFO [ main] HTTP server listening | tid="11244" timestamp=1746440937 hostname="127.0.0.1" port="8080" n_threads_http="15" INFO [ update_slots] all slots are idle | tid="11244" timestamp=1746440937 INFO [ launch_slot_with_task] slot is processing task | tid="11244" timestamp=1746440956 id_slot=0 id_task=0 INFO [ update_slots] kv cache rm [p0, end) | tid="11244" timestamp=1746440956 id_slot=0 id_task=0 p0=0 PS C:\neuro\ik_llama.cpp\build\bin\Release>


PS C:\neuro\ik_llama.cpp\build\bin\Release> ./llama-server.exe -t 8 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf -fa -rtr -fmoe INFO [ main] build info | tid="12376" timestamp=1746441162 build=3667 commit="e3fec173" INFO [ main] system info | tid="12376" timestamp=1746441162 n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: 
n_layer = 48 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 30.532 B llm_load_print_meta: model size = 20.228 GiB (5.691 BPW) llm_load_print_meta: repeating layers = 19.791 GiB (5.684 BPW, 29.910 B parameters) llm_load_print_meta: general.name = Qwen3-30B-A3B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 '├Д─м' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0.25 MiB llm_load_tensors: CPU buffer size = 20713.44 MiB 
................................................................................................... ============ Repacked 337 tensors llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 1 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CPU output buffer size = 1.16 MiB llama_new_context_with_model: CPU compute buffer size = 300.75 MiB llama_new_context_with_model: graph nodes = 1878 llama_new_context_with_model: graph splits = 1 INFO [ init] initializing slots | tid="12376" timestamp=1746441190 n_slots=1 INFO [ init] new slot | tid="12376" timestamp=1746441190 id_slot=0 n_ctx_slot=4096 INFO [ main] model loaded | tid="12376" timestamp=1746441190 INFO [ main] chat template | tid="12376" timestamp=1746441190 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true INFO [ main] HTTP server listening | tid="12376" timestamp=1746441190 hostname="127.0.0.1" port="8080" n_threads_http="15" INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441190 INFO [ launch_slot_with_task] slot is processing task | tid="12376" timestamp=1746441214 id_slot=0 id_task=0 INFO [ update_slots] kv cache rm [p0, end) | tid="12376" timestamp=1746441214 id_slot=0 id_task=0 p0=0 INFO [ print_timings] prompt eval time = 767.18 ms / 51 tokens ( 15.04 ms per token, 66.48 tokens per second) | 
tid="12376" timestamp=1746441236 id_slot=0 id_task=0 t_prompt_processing=767.178 n_prompt_tokens_processed=51 t_token=15.04270588235294 n_tokens_second=66.47740159389348 INFO [ print_timings] generation eval time = 21654.80 ms / 288 runs ( 75.19 ms per token, 13.30 tokens per second) | tid="12376" timestamp=1746441236 id_slot=0 id_task=0 t_token_generation=21654.802 n_decoded=288 t_token=75.19028472222222 n_tokens_second=13.299590548091828 INFO [ print_timings] total time = 22421.98 ms | tid="12376" timestamp=1746441236 id_slot=0 id_task=0 t_prompt_processing=767.178 t_token_generation=21654.802 t_total=22421.98 INFO [ update_slots] slot released | tid="12376" timestamp=1746441236 id_slot=0 id_task=0 n_ctx=4096 n_past=338 n_system_tokens=0 n_cache_tokens=338 truncated=false INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441236 INFO [ log_server_request] request | tid="21628" timestamp=1746441236 remote_addr="127.0.0.1" remote_port=65237 status=200 method="POST" path="/completion" params={} INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441236 INFO [ launch_slot_with_task] slot is processing task | tid="12376" timestamp=1746441247 id_slot=0 id_task=290 INFO [ update_slots] kv cache rm [p0, end) | tid="12376" timestamp=1746441247 id_slot=0 id_task=290 p0=50 INFO [ print_timings] prompt eval time = 4001.53 ms / 296 tokens ( 13.52 ms per token, 73.97 tokens per second) | tid="12376" timestamp=1746441271 id_slot=0 id_task=290 t_prompt_processing=4001.527 n_prompt_tokens_processed=296 t_token=13.518672297297297 n_tokens_second=73.9717612801313 INFO [ print_timings] generation eval time = 19925.00 ms / 245 runs ( 81.33 ms per token, 12.30 tokens per second) | tid="12376" timestamp=1746441271 id_slot=0 id_task=290 t_token_generation=19924.999 n_decoded=245 t_token=81.32652653061224 n_tokens_second=12.296111031172448 INFO [ print_timings] total time = 23926.53 ms | tid="12376" timestamp=1746441271 id_slot=0 id_task=290 
t_prompt_processing=4001.527 t_token_generation=19924.999 t_total=23926.525999999998 INFO [ update_slots] slot released | tid="12376" timestamp=1746441271 id_slot=0 id_task=290 n_ctx=4096 n_past=590 n_system_tokens=0 n_cache_tokens=590 truncated=false INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441271 INFO [ log_server_request] request | tid="21948" timestamp=1746441271 remote_addr="127.0.0.1" remote_port=50253 status=200 method="POST" path="/completion" params={} INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441271 INFO [ launch_slot_with_task] slot is processing task | tid="12376" timestamp=1746441283 id_slot=0 id_task=537 INFO [ update_slots] kv cache rm [p0, end) | tid="12376" timestamp=1746441283 id_slot=0 id_task=537 p0=3 INFO [ print_timings] prompt eval time = 7425.26 ms / 523 tokens ( 14.20 ms per token, 70.44 tokens per second) | tid="12376" timestamp=1746441292 id_slot=0 id_task=537 t_prompt_processing=7425.256 n_prompt_tokens_processed=523 t_token=14.197430210325049 n_tokens_second=70.43528196199566 INFO [ print_timings] generation eval time = 1970.69 ms / 24 runs ( 82.11 ms per token, 12.18 tokens per second) | tid="12376" timestamp=1746441292 id_slot=0 id_task=537 t_token_generation=1970.687 n_decoded=24 t_token=82.11195833333333 n_tokens_second=12.178494098758453 INFO [ print_timings] total time = 9395.94 ms | tid="12376" timestamp=1746441292 id_slot=0 id_task=537 t_prompt_processing=7425.256 t_token_generation=1970.687 t_total=9395.943 INFO [ update_slots] slot released | tid="12376" timestamp=1746441292 id_slot=0 id_task=537 n_ctx=4096 n_past=549 n_system_tokens=0 n_cache_tokens=549 truncated=false INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441292 INFO [ log_server_request] request | tid="14164" timestamp=1746441292 remote_addr="127.0.0.1" remote_port=55394 status=200 method="POST" path="/completion" params={} INFO [ update_slots] all slots are idle | tid="12376" 
timestamp=1746441292 INFO [ log_server_request] request | tid="20768" timestamp=1746441292 remote_addr="127.0.0.1" remote_port=64794 status=200 method="POST" path="/tokenize" params={} INFO [ log_server_request] request | tid="18372" timestamp=1746441301 remote_addr="127.0.0.1" remote_port=51189 status=404 method="GET" path="/models" params={} INFO [ log_server_request] request | tid="18372" timestamp=1746441303 remote_addr="127.0.0.1" remote_port=51189 status=404 method="GET" path="/models" params={} INFO [ launch_slot_with_task] slot is processing task | tid="12376" timestamp=1746441304 id_slot=0 id_task=563 INFO [ update_slots] kv cache rm [p0, end) | tid="12376" timestamp=1746441304 id_slot=0 id_task=563 p0=0 INFO [ print_timings] prompt eval time = 6708.66 ms / 512 tokens ( 13.10 ms per token, 76.32 tokens per second) | tid="12376" timestamp=1746441368 id_slot=0 id_task=563 t_prompt_processing=6708.662 n_prompt_tokens_processed=512 t_token=13.10285546875 n_tokens_second=76.3192421976245 INFO [ print_timings] generation eval time = 56613.50 ms / 647 runs ( 87.50 ms per token, 11.43 tokens per second) | tid="12376" timestamp=1746441368 id_slot=0 id_task=563 t_token_generation=56613.499 n_decoded=647 t_token=87.50154404945904 n_tokens_second=11.428369760364042 INFO [ print_timings] total time = 63322.16 ms | tid="12376" timestamp=1746441368 id_slot=0 id_task=563 t_prompt_processing=6708.662 t_token_generation=56613.499 t_total=63322.16100000001 INFO [ update_slots] slot released | tid="12376" timestamp=1746441368 id_slot=0 id_task=563 n_ctx=4096 n_past=1158 n_system_tokens=0 n_cache_tokens=0 truncated=false INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441368 INFO [ log_server_request] request | tid="18372" timestamp=1746441368 remote_addr="127.0.0.1" remote_port=51189 status=200 method="POST" path="/chat/completions" params={} INFO [ update_slots] all slots are idle | tid="12376" timestamp=1746441368


PS C:\neuro\llama-avx2> ./llama-server.exe -t 8 -c 4096 -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf build: 5273 (8ae5ebcf) with MSVC 19.43.34808.0 for x64 system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15 main: loading model srv load_model: loading model 'F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf' llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 30B-A3B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, 
"#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["─а ─а", "─а─а ─а─а", "i n", "─а t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 17 llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q5_K: 289 tensors llama_model_loader: - type q6_K: 49 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q5_K - Medium print_info: file size = 20.23 GiB (5.69 BPW) load: special tokens cache size = 26 load: token to piece cache size = 0.9311 MB print_info: arch = qwen3moe print_info: vocab_only = 0 print_info: n_ctx_train = 40960 print_info: n_embd = 2048 print_info: n_layer = 48 print_info: n_head = 32 print_info: n_head_kv = 4 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 8 print_info: n_embd_k_gqa = 512 print_info: n_embd_v_gqa = 512 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: 
n_ff = 6144 print_info: n_expert = 128 print_info: n_expert_used = 8 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 40960 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 30B.A3B print_info: model params = 30.53 B print_info: general.name = Qwen3-30B-A3B print_info: n_ff_exp = 768 print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 11 ',' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151654 '<|vision_pad|>' print_info: LF token = 198 '─К' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = true) load_tensors: offloading 0 repeating layers to GPU load_tensors: offloaded 0/49 layers to GPU load_tensors: CPU_Mapped model buffer size = 20713.44 MiB ................................................................................................... 
llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 4096 llama_context: n_ctx_per_seq = 4096 llama_context: n_batch = 2048 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 0 llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized llama_context: CPU output buffer size = 0.58 MiB llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32 llama_kv_cache_unified: CPU KV buffer size = 384.00 MiB llama_kv_cache_unified: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_context: CPU compute buffer size = 300.75 MiB llama_context: graph nodes = 3126 llama_context: graph splits = 1 common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) srv log_server_r: request: GET / 127.0.0.1 503 srv log_server_r: request: GET / 127.0.0.1 503 srv init: initializing slots, n_slots = 1 slot init: id 0 | task -1 | new slot n_ctx_slot = 4096 main: model loaded main: chat template, chat_template: {%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0].role == 'system' %} {{- messages[0].content + '\n\n' }} {%- endif %} {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": , "arguments": }\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0].role == 'system' %} {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %} {%- for forward_message in messages %} {%- set index = (messages|length - 1) - loop.index0 %} {%- set message = messages[index] %} {%- set tool_start = '<tool_response>' %} {%- set tool_start_length = tool_start|length %} {%- set start_of_message = message.content[:tool_start_length] %} {%- set tool_end = '</tool_response>' %} {%- set tool_end_length = tool_end|length %} {%- set start_pos = (message.content|length) - tool_end_length %} {%- if start_pos < 0 %} {%- set start_pos = 0 %} {%- endif %} {%- set end_of_message = message.content[start_pos:] %} {%- if ns.multi_step_tool and message.role == "user" and not(start_of_message == tool_start and end_of_message == tool_end) %} {%- set ns.multi_step_tool = false %} {%- set ns.last_query_index = index %} {%- endif %} {%- endfor %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + 
'\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content %} {%- set reasoning_content = '' %} {%- if message.reasoning_content is defined and message.reasoning_content is not none %} {%- set reasoning_content = message.reasoning_content %} {%- else %} {%- if '' in message.content %} {%- set content = (message.content.split('')|last).lstrip('\n') %} {%- set reasoning_content = (message.content.split('')|first).rstrip('\n') %} {%- set reasoning_content = (reasoning_content.split('')|last).lstrip('\n') %} {%- endif %} {%- endif %} {%- if loop.index0 > ns.last_query_index %} {%- if loop.last or (not loop.last and reasoning_content) %} {{- '<|im_start|>' + message.role + '\n\n' + reasoning_content.strip('\n') + '\n\n\n' + content.lstrip('\n') }} {%- else %} {{- '<|im_start|>' + message.role + '\n' + content }} {%- endif %} {%- else %} {{- '<|im_start|>' + message.role + '\n' + content }} {%- endif %} {%- if message.tool_calls %} {%- for tool_call in message.tool_calls %} {%- if (loop.first and content) or (not loop.first) %} {{- '\n' }} {%- endif %} {%- if tool_call.function %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {%- if tool_call.arguments is string %} {{- tool_call.arguments }} {%- else %} {{- tool_call.arguments | tojson }} {%- endif %} {{- '}\n</tool_call>' }} {%- endfor %} {%- endif %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- if enable_thinking is defined and enable_thinking is false %} {{- '\n\n\n\n' }} {%- 
endif %} {%- endif %}, example_format: '<|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant ' main: server is listening on http://127.0.0.1:8080 - starting the main loop srv update_slots: all slots are idle srv log_server_r: request: GET / 127.0.0.1 200 srv params_from_: Chat format: Content-only slot launch_slot_: id 0 | task 0 | processing task slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 20 slot update_slots: id 0 | task 0 | kv cache rm [0, end) slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000 slot update_slots: id 0 | task 0 | prompt done, n_past = 20, n_tokens = 20 slot release: id 0 | task 0 | stop processing: n_past = 67, truncated = 0 slot print_timing: id 0 | task 0 | prompt eval time = 713.89 ms / 20 tokens ( 35.69 ms per token, 28.02 tokens per second) eval time = 3163.91 ms / 48 tokens ( 65.91 ms per token, 15.17 tokens per second) total time = 3877.80 ms / 68 tokens srv update_slots: all slots are idle srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200


👤 ikawrakow commented the 2025-05-05 at 11:00:59:

So, with -rtr -fa -fmoe it works, but TG is slow (slower than llama.cpp). How much slower? What about prompt processing, or when you have a few thousand tokens in the KV cache? Is the llama.cpp build done with MSVC or with GCC/clang?

Without these flags it does not work. If you try -rtr -fmoe and -fa -fmoe separately, this will help me pinpoint the issue.


👤 intulint commented the 2025-05-05 at 11:05:55:

The speeds are in my message above. It is long, of course, but I tried to include all the information.


👤 intulint commented the 2025-05-05 at 11:15:26:

-fa -fmoe - works, but it also pauses before displaying commas. The speed is low as well.

INFO [ print_timings] prompt eval time = 9586.72 ms / 512 tokens ( 18.72 ms per token, 53.41 tokens per second) | tid="16952" timestamp=1746443401 id_slot=0 id_task=354 t_prompt_processing=9586.721 n_prompt_tokens_processed=512 t_token=18.724064453125 n_tokens_second=53.407207740790625 INFO [ print_timings] generation eval time = 40935.66 ms / 426 runs ( 96.09 ms per token, 10.41 tokens per second) | tid="16952" timestamp=1746443401 id_slot=0 id_task=354 t_token_generation=40935.658 n_decoded=426 t_token=96.09309389671363 n_tokens_second=10.406575118445634

-rtr -fmoe - crashes


👤 ikawrakow commented the 2025-05-05 at 11:15:51:

Ah, OK. I see

  • ik_llama.cpp: PP = 76.3 t/s (512 tokens), TG = 11.4 t/s (647 tokens)
  • llama.cpp: PP = 28.02 t/s (20 tokens), TG = 15.17 t/s (48 tokens)

Correct? I think it would be more fair to compare for the same (or at least similar) number of tokens generated and same number of tokens in the prompt.


👤 intulint commented the 2025-05-05 at 11:35:12:

llama.cpp, ~1000 prompt tokens and ~500 generated tokens:

prompt eval time = 35744.63 ms / 1053 tokens ( 33.95 ms per token, 29.46 tokens per second)
eval time = 33454.47 ms / 426 tokens ( 78.53 ms per token, 12.73 tokens per second)

ik_llama.cpp -fa -fmoe, ~1000 prompt tokens and ~500 generated tokens:

INFO [ print_timings] prompt eval time = 20147.56 ms / 1057 tokens ( 19.06 ms per token, 52.46 tokens per second) | tid="5624" timestamp=1746444960 id_slot=0 id_task=0 t_prompt_processing=20147.559 n_prompt_tokens_processed=1057 t_token=19.06107757805109 n_tokens_second=52.46293111736265 INFO [ print_timings] generation eval time = 40472.90 ms / 422 runs ( 95.91 ms per token, 10.43 tokens per second) | tid="5624" timestamp=1746444960 id_slot=0 id_task=0 t_token_generation=40472.905 n_decoded=422 t_token=95.90735781990522 n_tokens_second=10.426728696642853
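
To compare the two runs on equal footing, the throughput figures can be recomputed directly from the token counts and elapsed times logged above. The following is only a small sketch: the helper name `tokens_per_second` and the dictionary layout are illustrative and not part of either project; the numbers are copied from the two logs in this comment.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s from a token count and an elapsed time in ms."""
    return n_tokens / (elapsed_ms / 1000.0)


# (tokens, milliseconds) pairs copied from the print_timings lines above.
runs = {
    "llama.cpp": {"pp": (1053, 35744.63), "tg": (426, 33454.47)},
    "ik_llama.cpp -fa -fmoe": {"pp": (1057, 20147.56), "tg": (422, 40472.90)},
}

rates = {
    name: {phase: tokens_per_second(*pair) for phase, pair in phases.items()}
    for name, phases in runs.items()
}

for name, r in rates.items():
    print(f"{name}: PP = {r['pp']:.2f} t/s, TG = {r['tg']:.2f} t/s")

# Ratio > 1 means ik_llama.cpp is faster than llama.cpp for that phase.
pp_ratio = rates["ik_llama.cpp -fa -fmoe"]["pp"] / rates["llama.cpp"]["pp"]
tg_ratio = rates["ik_llama.cpp -fa -fmoe"]["tg"] / rates["llama.cpp"]["tg"]
print(f"PP ratio = {pp_ratio:.2f}, TG ratio = {tg_ratio:.2f}")
```

On these two runs, ik_llama.cpp comes out about 1.78x faster at prompt processing and about 0.82x as fast at token generation.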


👤 ikawrakow commented the 2025-05-05 at 11:41:03:

OK, thanks. I'll look into the failure without flash attention.

> -fa -rtr -fmoe Finally it works, but I noticed that every time before writing a comma the generation stops for half a second.

Sorry for asking, but in what language is your conversation? I'm asking because a pause before a comma may indicate a performance issue in the token id -> utf-8 conversion code. I haven't looked at that since I forked llama.cpp last June, and they may have improved it since then.


👤 intulint commented the 2025-05-05 at 11:43:33:

That is a good question. I somehow didn't pay attention to which language was in use when the pauses occurred. Usually Russian, but also English. I'll check now. We need generation in English, right? Or is it important that the entire context is in one language?


👤 ikawrakow commented the 2025-05-05 at 11:46:02:

> Or is it important that the entire context is in one language?

I don't know. I'm just looking for clues as to what could be slowing it down.


👤 intulint commented the 2025-05-05 at 11:54:19:

I ran it in English only and looked more closely: a pause in generation appears right before or after a comma is displayed. It lasts a noticeable fraction of a second, and then generation continues. It usually happens in places like "Okay, the", "So, if", "than B, the".
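
One way to confirm that the pauses really line up with commas, rather than judging by eye, is to record the arrival time of each streamed chunk and flag the gaps that are much longer than typical. The sketch below is hypothetical: `find_pauses`, the `factor` threshold, and the sample timestamps are made up for illustration and were not measured on this setup.

```python
from typing import List, Tuple


def find_pauses(
    chunks: List[Tuple[float, str]], factor: float = 3.0
) -> List[Tuple[str, str, float]]:
    """Return (previous_text, text, gap_seconds) for every gap between
    consecutive chunks that exceeds `factor` times the median gap."""
    pairs = list(zip(chunks, chunks[1:]))
    gaps = [b[0] - a[0] for a, b in pairs]
    if not gaps:
        return []
    median = sorted(gaps)[len(gaps) // 2]
    return [
        (a[1], b[1], b[0] - a[0])
        for a, b in pairs
        if b[0] - a[0] > factor * median
    ]


# Made-up (arrival_time_seconds, chunk_text) samples, for illustration only.
sample = [
    (0.00, "Okay"),
    (0.08, ","),
    (0.58, " the"),
    (0.66, " user"),
    (0.74, " asks"),
]

pauses = find_pauses(sample)
for prev_text, text, gap in pauses:
    print(f"pause of {gap:.2f}s between {prev_text!r} and {text!r}")
```

Comparing gaps against the median keeps the check independent of the absolute generation speed, so the same threshold can be used for both builds.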


👤 intulint commented the 2025-05-05 at 11:56:28:

To avoid confusion, I checked in 2 frontends. I noticed pauses only on commas.


👤 ikawrakow commented the 2025-05-05 at 11:57:24:

Interesting. I don't observe such effects on my Linux box. Are the sampling parameters exactly the same?


👤 intulint commented the 2025-05-05 at 12:01:40:

In the server's native frontend the sampling settings are the defaults, as far as I understand. I only changed the max tokens when measuring the speed. It didn't affect the pauses.

Image

Image


👤 intulint commented the 2025-05-05 at 12:16:22:

Maybe it's a compiler version? I don't know much, but as I understand it, a fresh one was used during assembly. I remember there were messages during assembly about changing the format of variables and that data loss could occur.


👤 ikawrakow commented the 2025-05-05 at 12:17:11:

For reference, here is what I get on my vanilla AVX2 Linux box using 8 threads with the commands

./bin/llama-sweep-bench -m Qwen_Qwen3-30B-A3B-Q5_K_M.gguf -c 4096 -t 8 -fa -ctk q8_0 -ctv q8_0 -rtr -fmoe
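| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.081 | 166.16 | 5.223 | 24.51 |
| 512 | 128 | 512 | 3.331 | 153.69 | 5.502 | 23.26 |
| 512 | 128 | 1024 | 3.606 | 141.97 | 5.740 | 22.30 |
| 512 | 128 | 1536 | 3.873 | 132.20 | 5.984 | 21.39 |
| 512 | 128 | 2048 | 4.154 | 123.25 | 6.212 | 20.61 |
| 512 | 128 | 2560 | 4.419 | 115.87 | 6.443 | 19.87 |
| 512 | 128 | 3072 | 4.691 | 109.15 | 6.685 | 19.15 |
| 512 | 128 | 3584 | 4.959 | 103.26 | 6.906 | 18.54 |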
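
For reading these sweep-bench tables, it helps to know how the columns relate. The sketch below assumes that S_PP is simply PP / T_PP and S_TG is TG / T_TG (tokens divided by seconds); the helper name `speeds` is made up for illustration. It recomputes two rows of the table above, and small differences are expected because the times are rounded to three decimals.

```python
# Sanity check of the sweep-bench columns, assuming S_PP = PP / T_PP and S_TG = TG / T_TG.
# Rows are (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s), copied from the table above.
ROWS = [
    (512, 128, 0, 3.081, 166.16, 5.223, 24.51),
    (512, 128, 3584, 4.959, 103.26, 6.906, 18.54),
]


def speeds(pp, tg, t_pp, t_tg):
    """Return (prompt t/s, generation t/s) from token counts and times in seconds."""
    return pp / t_pp, tg / t_tg


for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in ROWS:
    calc_pp, calc_tg = speeds(pp, tg, t_pp, t_tg)
    print(f"N_KV={n_kv}: S_PP {calc_pp:.2f} (reported {s_pp}), "
          f"S_TG {calc_tg:.2f} (reported {s_tg})")
```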

The model is this one from Bartowski

The CPU has a Zen3 core, so I'm not expecting it to be faster than a reasonably up-to-date AVX2 capable CPU.

In my case it also works without issues with just -c 4096 -t 8.

So, something goes seriously wrong with the Windows build.

Not sure how to debug. I don't have access to a Windows box.


👤 intulint commented the 2025-05-05 at 12:23:26:

Got it. I'll try to figure out how and by how much to downgrade the compiler, maybe that will help. If not, I don't know what to do next, I'll run it with llama.cpp.


👤 ikawrakow commented the 2025-05-05 at 12:31:36:

You can try building with GCC or clang. I cannot give you instructions how one does that as it is a long time since I last did that, so I have forgotten. But IIRC, the GCC build was running ~40% faster than the MSVC build. It wasn't an LLM, but it did involve algorithms with heavy number crunching. It must have been around 2017-2018, so don't know if MSVC has improved since then.


👤 intulint commented the 2025-05-05 at 12:33:50:

> Is the llama.cpp build done with MSVC or with GCC/clang?

I have written a script that downloads the latest official releases; I have never compiled such large projects myself before.

By the way, yes, we found the parameters under which it starts.

PS C:\neuro\ik_llama.cpp\build\bin\Release> .\llama-sweep-bench.exe -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf -c 4096 -t 8 -fa -fmoe

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 9.384 | 54.56 | 8.596 | 14.89 |
| 512 | 128 | 512 | 10.704 | 47.83 | 8.700 | 14.71 |
| 512 | 128 | 1024 | 10.833 | 47.26 | 8.572 | 14.93 |
| 512 | 128 | 1536 | 11.697 | 43.77 | 8.849 | 14.47 |
| 512 | 128 | 2048 | 12.257 | 41.77 | 9.372 | 13.66 |
| 512 | 128 | 2560 | 13.290 | 38.53 | 9.859 | 12.98 |
| 512 | 128 | 3072 | 14.514 | 35.28 | 11.724 | 10.92 |
| 512 | 128 | 3584 | 14.406 | 35.54 | 10.795 | 11.86 |

👤 intulint commented the 2025-05-05 at 12:35:11:

Got it, I'll try it in the evening if I figure it out.


👤 ikawrakow commented the 2025-05-05 at 12:46:18:

You didn't say what your CPU was, so here is another reference point from me on a more recent CPU (Ryzen-7950X). Again using 8 threads to be comparable to yours, same command as above:

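| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.874 | 273.19 | 5.253 | 24.37 |
| 512 | 128 | 512 | 1.993 | 256.92 | 5.414 | 23.64 |
| 512 | 128 | 1024 | 2.131 | 240.24 | 5.523 | 23.17 |
| 512 | 128 | 1536 | 2.273 | 225.30 | 5.620 | 22.77 |
| 512 | 128 | 2048 | 2.417 | 211.83 | 5.721 | 22.37 |
| 512 | 128 | 2560 | 2.549 | 200.86 | 5.821 | 21.99 |
| 512 | 128 | 3072 | 2.688 | 190.46 | 5.925 | 21.60 |
| 512 | 128 | 3584 | 2.828 | 181.02 | 6.013 | 21.29 |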

In comparison, mainline llama.cpp on the same computer (just pulled and rebuilt)

With flash attention

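| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 6.668 | 76.79 | 5.408 | 23.67 |
| 512 | 128 | 512 | 8.692 | 58.91 | 6.007 | 21.31 |
| 512 | 128 | 1024 | 10.831 | 47.27 | 6.781 | 18.88 |
| 512 | 128 | 1536 | 12.907 | 39.67 | 7.603 | 16.84 |
| 512 | 128 | 2048 | 14.947 | 34.26 | 8.544 | 14.98 |
| 512 | 128 | 2560 | 16.958 | 30.19 | 9.603 | 13.33 |
| 512 | 128 | 3072 | 19.009 | 26.93 | 10.614 | 12.06 |
| 512 | 128 | 3584 | 21.115 | 24.25 | 11.577 | 11.06 |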

Without flash attention

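| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 6.246 | 81.98 | 5.522 | 23.18 |
| 512 | 128 | 512 | 6.696 | 76.46 | 5.781 | 22.14 |
| 512 | 128 | 1024 | 7.157 | 71.54 | 6.009 | 21.30 |
| 512 | 128 | 1536 | 7.639 | 67.02 | 6.207 | 20.62 |
| 512 | 128 | 2048 | 8.089 | 63.30 | 6.468 | 19.79 |
| 512 | 128 | 2560 | 8.577 | 59.70 | 6.708 | 19.08 |
| 512 | 128 | 3072 | 9.010 | 56.82 | 7.012 | 18.25 |
| 512 | 128 | 3584 | 9.498 | 53.91 | 7.144 | 17.92 |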
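
To put the three Ryzen-7950X tables side by side, the following sketch computes prompt-processing speed ratios from the S_PP columns. The numbers are copied from the tables in this comment; the list names and the `ratios` helper are made up for illustration, and the sketch says nothing about differences in build flags or command-line options between the runs.

```python
# S_PP (prompt processing, t/s) per N_KV, copied from the three Ryzen-7950X tables.
N_KV = [0, 512, 1024, 1536, 2048, 2560, 3072, 3584]
IK_FA = [273.19, 256.92, 240.24, 225.30, 211.83, 200.86, 190.46, 181.02]
MAINLINE_FA = [76.79, 58.91, 47.27, 39.67, 34.26, 30.19, 26.93, 24.25]
MAINLINE_NO_FA = [81.98, 76.46, 71.54, 67.02, 63.30, 59.70, 56.82, 53.91]


def ratios(fast, slow):
    """Element-wise speed ratio fast/slow for two equally long lists."""
    return [f / s for f, s in zip(fast, slow)]


vs_fa = ratios(IK_FA, MAINLINE_FA)
vs_no_fa = ratios(IK_FA, MAINLINE_NO_FA)
for n_kv, a, b in zip(N_KV, vs_fa, vs_no_fa):
    print(f"N_KV={n_kv:4d}: {a:.2f}x vs mainline with FA, {b:.2f}x vs mainline without FA")
```

With these figures the ratio against mainline with flash attention grows with context, from about 3.6x at N_KV = 0 to about 7.5x at N_KV = 3584, while the ratio against mainline without flash attention stays near 3.3x.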

👤 intulint commented the 2025-05-05 at 12:59:45:

Ah, indeed. This is an assembly on an old server processor 1660v4 with 4 memory channels, 32 GB in total. The speeds during generation are quite good, since the memory gives somewhere around 55 GB/s. Of course, this is not comparable with modern processors.
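
The link between memory bandwidth and generation speed can be sketched with a rough upper bound: each generated token has to read the active weights from memory once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The figures below are illustrative assumptions, not measurements from this thread: about 3 billion active parameters per token (suggested by "A3B" in the model name) and roughly 5.5 bits per weight for Q5_K_M. The function name `tg_upper_bound` is also made up, and KV cache reads and compute time are ignored.

```python
def tg_upper_bound(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Bandwidth-limited ceiling on generation speed in tokens per second.

    bandwidth_gb_s   -- sustained memory bandwidth in GB/s
    active_params_b  -- parameters read per token, in billions (assumption)
    bits_per_weight  -- average quantized size of a weight (assumption)
    """
    gb_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token


# 55 GB/s is the figure quoted above; 3.0 and 5.5 are assumed values.
bound = tg_upper_bound(55.0, 3.0, 5.5)
print(f"bandwidth-limited ceiling: {bound:.1f} t/s")
```

Under these assumptions the ceiling sits well above the 14.89 t/s measured at N_KV = 0 in the Windows sweep-bench run, which is consistent with generation being limited by more than raw bandwidth on that machine.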


👤 saood06 commented the 2025-05-05 at 22:30:50:

> You can try building with GCC or clang. I cannot give you instructions how one does that as it is a long time since I last did that, so I have forgotten.

The easiest way I found to use non MSVC to compile this on Windows was with https://github.com/skeeto/w64devkit but I don't use that as I can't compile there with CUDA (and my Nvidia GPU is the only advantage of my Windows machine), and it wasn't any faster on my machine even for CPU only from what I remember.


👤 alex1284B commented the 2025-05-14 at 16:37:33:

I think I have a similar problem: Qwen3 stops producing valid output after about two lines of tokens. I tried different quants (IQ_K, Q6) and got the same problem, but Qwen2.5 is fine. Base llama.cpp also works fine. Linux, CPU only. I'm not sure, but the sampler order here is different from base llama.cpp:

CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature

vs

sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist

`ik_llama.cpp$ ./build/bin/llama-cli --color -m /home/ollama/models/gguf/Qwen3-30B-A3B-Q6_K_L.gguf --threads 12 --temp 0.6 --min-p 0 --top-k 20 --top-p 0.95 -p "<|im_start|>user\nA drinks machine offers three selections - Tea, Coffee or Random but the machine has been wired up wrongly so that each button does not give what it claims. If each drink costs 50p, how much minimum money do you have to put into the machine to work out which button gives which selection ?<|im_end|>\n<|im_start|>assistant\n" Log start main: build = 3693 (0435b68e) main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu main: seed = 1747238169 llama_model_loader: loaded meta data with 41 key-value pairs and 579 tensors from /home/ollama/models/gguf/Qwen3-30B-A3B-Q6_K_L.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B llama_model_loader: - kv 3: general.basename str = Qwen3 llama_model_loader: - kv 4: general.size_label str = 30B-A3B llama_model_loader: - kv 5: general.license str = apache-2.0 llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B... llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 30B A3B Base llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B... 
llama_model_loader: - kv 11: general.tags arr[str,1] = ["text-generation"] llama_model_loader: - kv 12: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 13: qwen3moe.context_length u32 = 32768 llama_model_loader: - kv 14: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 15: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 16: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 17: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 18: qwen3moe.rope.freq_base f32 = 1000000,000000 llama_model_loader: - kv 19: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0,000001 llama_model_loader: - kv 20: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 21: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 22: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 23: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 24: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 34: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 35: general.quantization_version u32 = 2 llama_model_loader: - kv 36: general.file_type u32 = 18 llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Qwen3-30B-A3B-GGUF/Qwen_Q... llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 384 llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 209 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type q8_0: 50 tensors llama_model_loader: - type q6_K: 288 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0,9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3moe llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_layer = 48 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0,0e+00 llm_load_print_meta: f_norm_rms_eps = 1,0e-06 llm_load_print_meta: f_clamp_kqv = 0,0e+00 llm_load_print_meta: f_max_alibi_bias = 0,0e+00 llm_load_print_meta: f_logit_scale = 0,0e+00 llm_load_print_meta: n_ff = 6144 llm_load_print_meta: n_expert = 128 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000,0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: 
n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q6_K llm_load_print_meta: model params = 30,532 B llm_load_print_meta: model size = 23,515 GiB (6,616 BPW) llm_load_print_meta: repeating layers = 22,900 GiB (6,577 BPW, 29,910 B parameters) llm_load_print_meta: general.name = Qwen3 30B A3B llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_ff_exp = 768 llm_load_tensors: ggml ctx size = 0,25 MiB llm_load_tensors: CPU buffer size = 24079,77 MiB .................................................................................................... llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000,0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 3072,00 MiB llama_new_context_with_model: KV self size = 3072,00 MiB, K (f16): 1536,00 MiB, V (f16): 1536,00 MiB llama_new_context_with_model: CPU output buffer size = 0,58 MiB llama_new_context_with_model: CPU compute buffer size = 2136,01 MiB llama_new_context_with_model: graph nodes = 2165 llama_new_context_with_model: graph splits = 1

system_info: n_threads = 12 / 24 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | sampling: repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000 top_k = 20, tfs_z = 1,000, top_p = 0,950, min_p = 0,000, typical_p = 1,000, temp = 0,600 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 0

user A drinks machine offers three selections - Tea, Coffee or Random but the machine has been wired up wrongly so that each button does not give what it claims. If each drink costs 50p, how much minimum money do you have to put into the machine to work out which button gives which selection ? assistant Okay, so there's this drinks machine with three buttons: Tea, Coffee, and Random. But the problem is, each button is wired up incorrectly. That means if you press Tea, it's not going to give Tea; same with Coffee. And Random is also not giving a random selection. So, the challenge is to figure out how much money you need to put in to determine which button actually gives which drink. Each drink costs 50p, so we need to find the minimum amount required.

First, let me try to understand the problem better. The machine has three buttons, each labeled incorrectly. So, the Tea button doesn't give Tea, Coffee doesn't give Coffee, and Random doesn't give a random selection. Wait, but what does. So… well... okay. [. So. So,. Wait, well? Wait,. Wait, but then. Wait,. Wait, but now. Wait, So,,. Wait, Hey,. But... So, let me.? So. So,. So, Also,ab. But the.\ the.. Let of, Wait,. Hmm. Let is the ,... So,. Let is probably. So, let. Let, let, actually., and go.,.,.,,., But, wait,… So,\n). So,... etc. So, \ If,… but… but I'm, \ is the the same thing., , \ The is the the you.,., \ But, \ So, , , . So, ,.\a. So,!\n't sure, , \ But, ,.\ but the same. the question. So, \ is the problem. So, \ I'm,.\

But,., \ So, \ you can you can you have to figure out of, \ the same, , , , \ but there's a lot. So, , \ Let you get the problem. So, \ I think that's, , \ but I think that is the problem. So,.\

But, . So, \ the question. you are you can you can you know,, , \ but I'm. But, ,.\ The problem is the answer the problem. But, \ I'm the answer the previous. but I'm. But, so you need to be careful, but I am I'm a problem. But, the problem. I have to see, I'm, I'm a bit, I know, but the actual, but, that. Then, but, but I'm a lot. So, but, but I don. So, but, I don! So, but I'm not, but, but, but, the number. But, but I can it's a bit, let me, I don… let me. So, but I can you need to see, in your answer. Let, but, but I need to the, but, but, I don, so, but, but I think it's a bit, but I think I have to be, I'm just that's not, and! So, in my, I have to be you know, but I can you need to solve this is a bit of \ what's the

Okay! It to be, I have to see, etc, I'm, you are you, andab, and, etc, I'll, I'm not. So, and \ I can you are you, I need to the other than, but I can you, I know, I need to make. But, I don, I think of. But, but, I have to make it's, you, I can I can I'm not, the the to me have you, but, I don… I think, I don… but, I am the which, I have to see, I'm going to be it's, I'm a person, I've been, no, I think. For, but I'm. If, I'm. I'm, the all, that, I'm just to be, I think I don. I don the the same, I will, but I am it's a new. But, but, I'm, or, but, but, but, but, but, but, but. But, but I don, I have been confused. So, but, no, in this is the same, I don? But, I think, but, I think, but, you can't, I want to you. That, but this is, but I can you, I mean, I need to. So, but, but the same, I'm, Im, but, I can you, I'm on the, I'm just, I can I know, I'm in the, I have to me you, but, but, but I'm not. I don… but, I need to be. I need to know, the question, I think, but, but, but I have to say, I'm not to the only, but, no, I think, I'm going to think, but that, you, and and I'm, but, I have you! It, I think, that. I can you \ I was a) the question, I need, is, or, I have the. The problem.

The thing, but, it have to be, I was a lot, but, I know how is the way, but, but, I have to see, Im not, I think. But, and! Let, I have you! I will be it's. It, but, I, and, I want to be, I don, I'm. I'm, I need to the problem. It, that.

I need to have you… I have to make, but, and. I need to. So, but, but, if, I'm going to be, but, I have, or, I think about, that, but, I have to get, but, I'm, that, but, and , and! I'm, I need to be, I just, I need to the, but, that, but, but, that, I don, I think, but, I don! I'm, in this is a very, the, what is that, you. I'm not. But, I was, I think that's a lot, the, that, I'm going to be the, but, it, I need to say, I'm, and. So, but, I'm, but, I have to be, I am, but, is a problem, I need. Im in the problem, that, you! I think, I'm, I am, but it, I'm not, I think, if I, in the, in the, that, and, but the, but, I can't. But, I, I'm trying

llama_print_timings: load time = 1206,27 ms llama_print_timings: sample time = 49,64 ms / 1459 runs ( 0,03 ms per token, 29392,21 tokens per second) llama_print_timings: prompt eval time = 337,36 ms / 69 tokens ( 4,89 ms per token, 204,53 tokens per second) llama_print_timings: eval time = 60951,79 ms / 1458 runs ( 41,81 ms per token, 23,92 tokens per second) llama_print_timings: total time = 61937,29 ms / 1527 tokens`


👤 ikawrakow commented the 2025-05-14 at 16:57:33:

@alex1284B

I tried your prompt and I see that it does not work. But if you add -fa -fmoe, then it works. Please create a separate issue for this. Thanks.


👤 alex1284B commented the 2025-05-14 at 17:23:47:

Thank you, I probably missed these options for starting. My bad.


👤 ikawrakow commented the 2025-05-25 at 07:10:25:

Closed via #420