Go to file
Max Krasnyansky 0a6f36a11d Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list supported backends

* hexagon: stack malmuts with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

this should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2025-11-01 09:41:35 +02:00
.github ci : add self-hosted workflows (#1357) 2025-09-29 15:15:13 +03:00
ci ci : print results [no ci] (#1358) 2025-09-29 16:20:52 +03:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094) 2025-08-14 14:17:28 +03:00
docs Update gguf specification to synchronize the ggml_types declaration shown in the doc with the actual one. (#1342) 2025-09-16 13:42:24 +02:00
examples Rewrite simple-backend to use sched and ggml_backend_load_all (#1376) 2025-10-29 18:10:19 +01:00
include Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547) 2025-11-01 09:41:35 +02:00
scripts sync : whisper.cpp 2025-10-22 12:59:34 +03:00
src Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547) 2025-11-01 09:41:35 +02:00
tests CUDA: topk-moe: add optional parameter for gpt-oss (llama/16649) 2025-11-01 09:41:35 +02:00
.editorconfig gguf : add file format specification (#302) 2023-11-01 19:01:49 +02:00
.gitignore gitignore : ignore idea files (#1339) 2025-09-09 13:17:07 +02:00
.gitmodules git : remove kompute submodule (#1300) 2025-07-12 16:12:49 +03:00
AUTHORS authors : update 2025-02-04 13:03:55 +02:00
CMakeLists.txt Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547) 2025-11-01 09:41:35 +02:00
CONTRIBUTING.md contrib : recommend PRs to llama.cpp (#1312) 2025-07-25 07:05:38 +03:00
ggml.pc.in pkg-config: include the new GGML_VERSION as a version (#1348) 2025-09-25 18:59:38 +02:00
LICENSE license : update copyright notice + add AUTHORS 2024-04-09 20:17:51 +03:00
README.md readme : remove transfer notice (#1107) 2025-02-08 10:33:44 +02:00
requirements.txt ci : update requirements.txt 2024-12-03 21:05:37 +02:00

ggml

Roadmap / Manifesto

Tensor library for machine learning

Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos

Features

  • Low-level cross-platform implementation
  • Integer quantization support
  • Broad hardware support
  • Automatic differentiation
  • ADAM and L-BFGS optimizers
  • No third-party dependencies
  • Zero memory allocations during runtime

Build

git clone https://github.com/ggml-org/ggml
cd ggml

# install python dependencies in a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# build the examples
mkdir build && cd build
cmake ..
cmake --build . --config Release -j 8

GPT inference (example)

# run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

For more information, checkout the corresponding programs in the examples folder.

Using CUDA

# fix the path to point to your CUDA compiler
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..

Using hipBLAS

cmake -DCMAKE_C_COMPILER="$(hipconfig -l)/clang" -DCMAKE_CXX_COMPILER="$(hipconfig -l)/clang++" -DGGML_HIP=ON

Using SYCL

# linux
source /opt/intel/oneapi/setvars.sh
cmake -G "Ninja" -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL=ON ..

# windows
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
cmake -G "Ninja" -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DGGML_SYCL=ON ..

Compiling for Android

Download and unzip the NDK from this download page. Set the NDK_ROOT_PATH environment variable or provide the absolute path to the CMAKE_ANDROID_NDK in the command below.

cmake .. \
   -DCMAKE_SYSTEM_NAME=Android \
   -DCMAKE_SYSTEM_VERSION=33 \
   -DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
   -DCMAKE_ANDROID_NDK=$NDK_ROOT_PATH
   -DCMAKE_ANDROID_STL_TYPE=c++_shared
# create directories
adb shell 'mkdir /data/local/tmp/bin'
adb shell 'mkdir /data/local/tmp/models'

# push the compiled binaries to the folder
adb push bin/* /data/local/tmp/bin/

# push the ggml library
adb push src/libggml.so /data/local/tmp/

# push model files
adb push models/gpt-2-117M/ggml-model.bin /data/local/tmp/models/

adb shell
cd /data/local/tmp
export LD_LIBRARY_PATH=/data/local/tmp
./bin/gpt-2-backend -m models/ggml-model.bin -p "this is an example"

Resources