mirror of
https://github.com/ggerganov/llama.cpp
synced 2026-03-07 23:59:26 +01:00
* examples : add debug utility/example
This commit introduces a new example named llama-debug, a utility intended to assist with developing and debugging a converted model.
The motivation for this utility is to assist in model-conversion work by verifying that the model produces the expected outputs. It is intended to replace logits.cpp in examples/model-conversion.
Example usage:
```console
./build/bin/llama-debug \
-m models/Qwen2.5-0.5B-Instruct.gguf \
--prompt "Hello, my name is" \
--save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11) my(847) name(829) is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```
For more details about the options available for this example, please
refer to examples/debug/README.md.
* throw runtime error instead of logging error
* remove params.warmup and enable the warmup/nowarmup option
* model-conversion : remove logits.cpp
This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.
* examples : remove model-conversion directory
This was missed in the previous commit.
* model-conversion : add support for saving prompt and token ids
This commit adds support for storing the prompt and the token ids for the prompt when running the original models.
The motivation for this is that it allows comparing the prompt and the tokens generated for it when verifying the converted model. Currently, even if the same prompt is used, the generated tokens can differ when the original and converted models tokenize differently, and this would go unnoticed (the verification will most likely fail, but it might not be obvious why).
* squash! model-conversion : add support for saving prompt and token ids
fix pyright errors.
* model-conversion : add compare_tokens utility
This commit adds a script to compare token outputs between original and
converted models.
Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
There is also a verbose flag that additionally prints the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v
Original model prompt (pytorch-gemma-3-270m-it):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
* model-conversion : add token comparison to verification scripts
This commit adds calls to the compare_tokens function in compare-logits.py and semantic_check.py to ensure that the token ids produced by the tokenizers are the same before proceeding to verify the logits/embeddings.
Placing the calls in the existing scripts, instead of invoking them separately, ensures that the token comparison is always done prior to the logit/embedding verifications.
A follow-up commit/PR could refactor the causal logits verification into a single script instead of the two that exist now. This would reduce the code and make it consistent with the embeddings verification, which has only a single script.
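The ordering described above can be sketched roughly as follows; compare_tokens_or_fail and verify_model are illustrative placeholders, not the actual functions in compare-logits.py or semantic_check.py:

```python
def compare_tokens_or_fail(original_ids, converted_ids):
    """Abort early if the two tokenizations disagree."""
    if original_ids != converted_ids:
        raise SystemExit(
            "Token ids differ between original and converted model; "
            "skipping logits/embeddings verification"
        )

def verify_model(original_ids, converted_ids):
    # The token comparison always runs first ...
    compare_tokens_or_fail(original_ids, converted_ids)
    # ... and only then do the logits/embeddings checks proceed.
    return "tokens match, proceeding with verification"

print(verify_model([2, 9259, 236764], [2, 9259, 236764]))
```

Embedding the gate in the verification scripts themselves, rather than as a separate manual step, is what guarantees the ordering.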
* debug : use llama_model_n_embd_out
This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.
The motivation for this change is to support late interaction retrieval models, like LFM2-ColBert-350M, where the output embeddings are down-projected to a lower dimension.
* debug : add print_usage function
This commit adds a print_usage function that is passed to common_params_parse.
The motivation for this is that it enables a specific usage message, printed after all the options, for example:
```console
example usage:
Print tensors:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose
The tensors to be printed can be filtered with --tensor-filter option.
Save logits/embeddings:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits
Add --embedding to save embeddings
```
244 lines · 8.8 KiB · Python · Executable File
#!/usr/bin/env python3

import argparse
import os
import sys
import importlib

from transformers import AutoTokenizer, AutoConfig, AutoModel
import torch

# Add parent directory to path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
from utils.common import save_output_data


def parse_arguments():
    parser = argparse.ArgumentParser(description='Run original embedding model')
    parser.add_argument(
        '--model-path',
        '-m',
        help='Path to the model'
    )
    parser.add_argument(
        '--prompts-file',
        '-p',
        help='Path to file containing prompts (one per line)'
    )
    parser.add_argument(
        '--use-sentence-transformers',
        action='store_true',
        help=('Use SentenceTransformer to apply all numbered layers '
              '(01_Pooling, 02_Dense, 03_Dense, 04_Normalize)')
    )
    parser.add_argument(
        '--device',
        '-d',
        help='Device to use (cpu, cuda, mps, auto)',
        default='auto'
    )
    return parser.parse_args()


def load_model_and_tokenizer(model_path, use_sentence_transformers=False, device="auto"):
    if device == "cpu":
        device_map = {"": "cpu"}
        print("Forcing CPU usage")
    elif device == "auto":
        # On Mac, "auto" device_map can cause issues with accelerate,
        # so we detect the best device manually.
        if torch.cuda.is_available():
            device_map = {"": "cuda"}
            print("Using CUDA")
        elif torch.backends.mps.is_available():
            device_map = {"": "mps"}
            print("Using MPS (Apple Metal)")
        else:
            device_map = {"": "cpu"}
            print("Using CPU")
    else:
        device_map = {"": device}

    if use_sentence_transformers:
        from sentence_transformers import SentenceTransformer
        print("Using SentenceTransformer to apply all numbered layers")
        model = SentenceTransformer(model_path)
        tokenizer = model.tokenizer
        config = model[0].auto_model.config  # type: ignore
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

        # This can be used to override the sliding window size for manual testing.
        # This can be useful to verify the sliding window attention mask in the
        # original model and compare it with the converted .gguf model.
        if hasattr(config, 'sliding_window'):
            original_sliding_window = config.sliding_window
            print(f"Modified sliding window: {original_sliding_window} -> {config.sliding_window}")

        unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
        if unreleased_model_name:
            print(f"Using unreleased model: {unreleased_model_name}")
            model_name_lower = unreleased_model_name.lower()
            unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
            class_name = f"{unreleased_model_name}Model"
            print(f"Importing unreleased model module: {unreleased_module_path}")

            try:
                model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
                model = model_class.from_pretrained(
                    model_path,
                    device_map=device_map,
                    offload_folder="offload",
                    trust_remote_code=True,
                    config=config
                )
            except (ImportError, AttributeError) as e:
                print(f"Failed to import or load model: {e}")
                sys.exit(1)
        else:
            model = AutoModel.from_pretrained(
                model_path,
                device_map=device_map,
                offload_folder="offload",
                trust_remote_code=True,
                config=config
            )
        print(f"Model class: {type(model)}")
        print(f"Model file: {type(model).__module__}")

        # Verify the model is using the correct sliding window
        if hasattr(model.config, 'sliding_window'):  # type: ignore
            print(f"Model's sliding_window: {model.config.sliding_window}")  # type: ignore
        else:
            print("Model config does not have sliding_window attribute")

    return model, tokenizer, config


def get_prompt(args):
    if args.prompts_file:
        try:
            with open(args.prompts_file, 'r', encoding='utf-8') as f:
                return f.read().strip()
        except FileNotFoundError:
            print(f"Error: Prompts file '{args.prompts_file}' not found")
            sys.exit(1)
        except Exception as e:
            print(f"Error reading prompts file: {e}")
            sys.exit(1)
    else:
        return "Hello world today"


def main():
    args = parse_arguments()

    model_path = os.environ.get('EMBEDDING_MODEL_PATH', args.model_path)
    if model_path is None:
        print("Error: Model path must be specified either via --model-path argument "
              "or EMBEDDING_MODEL_PATH environment variable")
        sys.exit(1)

    # Determine if we should use SentenceTransformer
    use_st = (
        args.use_sentence_transformers
        or os.environ.get('USE_SENTENCE_TRANSFORMERS', '').lower() in ('1', 'true', 'yes')
    )

    model, tokenizer, config = load_model_and_tokenizer(model_path, use_st, args.device)

    # Get the device the model is on
    if not use_st:
        device = next(model.parameters()).device
    else:
        # For SentenceTransformer, get device from the underlying model
        device = next(model[0].auto_model.parameters()).device  # type: ignore

    model_name = os.path.basename(model_path)

    prompt_text = get_prompt(args)
    texts = [prompt_text]

    with torch.no_grad():
        if use_st:
            embeddings = model.encode(texts, convert_to_numpy=True)
            all_embeddings = embeddings  # Shape: [batch_size, hidden_size]

            encoded = tokenizer(
                texts,
                padding=True,
                truncation=True,
                return_tensors="pt"
            )
            tokens = encoded['input_ids'][0]
            token_ids = tokens.cpu().tolist()
            token_strings = tokenizer.convert_ids_to_tokens(tokens)
            for token_id, token_str in zip(tokens, token_strings):
                print(f"{token_id:6d} -> '{token_str}'")

            print(f"Embeddings shape (after all SentenceTransformer layers): {all_embeddings.shape}")
            print(f"Embedding dimension: {all_embeddings.shape[1] if len(all_embeddings.shape) > 1 else all_embeddings.shape[0]}")  # type: ignore
        else:
            # Standard approach: use base model output only
            encoded = tokenizer(
                texts,
                padding=True,
                truncation=True,
                return_tensors="pt"
            )

            tokens = encoded['input_ids'][0]
            token_ids = tokens.cpu().tolist()
            token_strings = tokenizer.convert_ids_to_tokens(tokens)
            for token_id, token_str in zip(tokens, token_strings):
                print(f"{token_id:6d} -> '{token_str}'")

            # Move inputs to the same device as the model
            encoded = {k: v.to(device) for k, v in encoded.items()}
            outputs = model(**encoded)
            hidden_states = outputs.last_hidden_state  # Shape: [batch_size, seq_len, hidden_size]

            all_embeddings = hidden_states[0].float().cpu().numpy()  # Shape: [seq_len, hidden_size]

            print(f"Hidden states shape: {hidden_states.shape}")
            print(f"All embeddings shape: {all_embeddings.shape}")
            print(f"Embedding dimension: {all_embeddings.shape[1]}")

    if len(all_embeddings.shape) == 1:
        n_embd = all_embeddings.shape[0]  # type: ignore
        n_embd_count = 1
        all_embeddings = all_embeddings.reshape(1, -1)
    else:
        n_embd = all_embeddings.shape[1]  # type: ignore
        n_embd_count = all_embeddings.shape[0]  # type: ignore

    print()

    for j in range(n_embd_count):
        embedding = all_embeddings[j]
        print(f"embedding {j}: ", end="")

        # Print first 3 values
        for i in range(min(3, n_embd)):
            print(f"{embedding[i]:9.6f} ", end="")

        print(" ... ", end="")

        # Print last 3 values
        for i in range(n_embd - 3, n_embd):
            print(f"{embedding[i]:9.6f} ", end="")

        print()  # New line

    print()

    flattened_embeddings = all_embeddings.flatten()
    print(f"Total values: {len(flattened_embeddings)} ({n_embd_count} embeddings × {n_embd} dimensions)")
    print()

    save_output_data(flattened_embeddings, token_ids, prompt_text, model_name, type_suffix="-embeddings")


if __name__ == "__main__":
    main()