Mirror of https://github.com/ggerganov/llama.cpp (synced 2026-03-01 21:00:04 +01:00)
* common : fix Step-3.5-Flash format detection and thinking support

  Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
  (<tool_call><function=...><parameter=...>), but its Jinja template lacks the
  bare <function> and plural <parameters> markers that the detection logic
  previously required. This caused it to fall through to Hermes 2 Pro, which
  doesn't call func_args_not_string(), so arguments stayed as JSON strings and
  templates using arguments|items crashed.

  Additionally, the Qwen3-Coder-XML format handler had no thinking support.
  Models like Step-3.5-Flash that unconditionally emit <think> in their
  generation prompt need the same thinking_forced_open handling that
  Nemotron v3 and Hermes 2 Pro already have; otherwise reasoning_content is
  never separated from content in API responses.

  Changes:
  - Relax Qwen3-Coder XML detection to require only the 3 shared markers
  - Tighten the Nemotron v3 branch to also require bare <function> and plural
    <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
  - Add thinking_forced_open support to the Qwen3-Coder-XML init function
  - Add <think>/</think> to the preserved tokens
  - Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
    grammar root rule, allowing </think> before tool calls
  - Add a Step-3.5-Flash chat template and a format detection test

  Builds on: https://github.com/ggml-org/llama.cpp/pull/19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

  Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
  Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
  unconditional <think> output. Route it to the Nemotron v3 PEG parser for
  streaming and schema-aware parameter parsing.

  Detection: templates with <think> plus the XML tool tags use the Nemotron v3
  PEG parser; templates without <think> (Qwen3-Coder) use the GBNF grammar.
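The routing rule above can be sketched as a small function. This is a hypothetical illustration of the described detection logic, not llama.cpp's actual implementation; the marker strings are taken from the commit message, and the format labels are made up for the example:

```python
# Hypothetical sketch of the detection rule from the commit message:
# all three shared XML tool markers plus <think> => Nemotron v3 PEG parser;
# markers without <think> => Qwen3-Coder GBNF grammar.
XML_TOOL_MARKERS = ("<tool_call>", "<function=", "<parameter=")

def detect_format(template_src: str) -> str:
    has_xml = all(m in template_src for m in XML_TOOL_MARKERS)
    if has_xml and "<think>" in template_src:
        return "nemotron-v3-peg"   # e.g. Step-3.5-Flash, Nemotron 3 Nano
    if has_xml:
        return "qwen3-coder-gbnf"  # Qwen3-Coder: no <think> in template
    return "other"

step_template = "<think> ... <tool_call><function=f><parameter=p>"
qwen_template = "<tool_call><function=f><parameter=p>"
```

Under this sketch, `detect_format(step_template)` routes to the PEG parser and `detect_format(qwen_template)` to the GBNF grammar, matching the split described above.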
  Tests cover: basic messages, tool calls with and without thinking content,
  parallel tool calls, code string parameters, optional </parameter> closing
  tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

  Remove the thinking-handling code that became unreachable after routing
  Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in
  its template, so the thinking_forced_open logic, preserved tokens, and
  grammar prefix were dead paths.
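The shared XML tool-call format and the func_args_not_string() concern can be illustrated with a minimal parser sketch. This is a hypothetical regex-based toy, not the PEG or grammar-based parser in llama.cpp (it requires closing </parameter> tags, which the real parser treats as optional); the point is only that arguments come back as a parsed object rather than a JSON string, so a template iterating with arguments|items would not crash:

```python
import re

# Toy parser for the shared XML tool-call shape:
# <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>", re.S)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.S)

def parse_xml_tool_calls(text: str) -> list:
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Arguments as a dict, not a JSON string -- the behavior the
        # func_args_not_string() fix restores for Step-3.5-Flash.
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls

calls = parse_xml_tool_calls(
    "<tool_call><function=get_weather>"
    "<parameter=city>Paris</parameter>"
    "</function></tool_call>")
```

Here `calls[0]["arguments"]` is a plain dict, so `.items()` (the Python analogue of Jinja's arguments|items) works, whereas a JSON string in that slot would fail exactly as the first commit describes.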
| Name |
|---|
| .. |
| templates |
| .editorconfig |
| ggml-vocab-aquila.gguf |
| ggml-vocab-baichuan.gguf |
| ggml-vocab-bert-bge.gguf |
| ggml-vocab-bert-bge.gguf.inp |
| ggml-vocab-bert-bge.gguf.out |
| ggml-vocab-command-r.gguf |
| ggml-vocab-command-r.gguf.inp |
| ggml-vocab-command-r.gguf.out |
| ggml-vocab-deepseek-coder.gguf |
| ggml-vocab-deepseek-coder.gguf.inp |
| ggml-vocab-deepseek-coder.gguf.out |
| ggml-vocab-deepseek-llm.gguf |
| ggml-vocab-deepseek-llm.gguf.inp |
| ggml-vocab-deepseek-llm.gguf.out |
| ggml-vocab-falcon.gguf |
| ggml-vocab-falcon.gguf.inp |
| ggml-vocab-falcon.gguf.out |
| ggml-vocab-gpt-2.gguf |
| ggml-vocab-gpt-2.gguf.inp |
| ggml-vocab-gpt-2.gguf.out |
| ggml-vocab-gpt-neox.gguf |
| ggml-vocab-llama-bpe.gguf |
| ggml-vocab-llama-bpe.gguf.inp |
| ggml-vocab-llama-bpe.gguf.out |
| ggml-vocab-llama-spm.gguf |
| ggml-vocab-llama-spm.gguf.inp |
| ggml-vocab-llama-spm.gguf.out |
| ggml-vocab-mpt.gguf |
| ggml-vocab-mpt.gguf.inp |
| ggml-vocab-mpt.gguf.out |
| ggml-vocab-nomic-bert-moe.gguf |
| ggml-vocab-phi-3.gguf |
| ggml-vocab-phi-3.gguf.inp |
| ggml-vocab-phi-3.gguf.out |
| ggml-vocab-qwen2.gguf |
| ggml-vocab-qwen2.gguf.inp |
| ggml-vocab-qwen2.gguf.out |
| ggml-vocab-refact.gguf |
| ggml-vocab-refact.gguf.inp |
| ggml-vocab-refact.gguf.out |
| ggml-vocab-starcoder.gguf |
| ggml-vocab-starcoder.gguf.inp |
| ggml-vocab-starcoder.gguf.out |