mirror of https://github.com/ggerganov/llama.cpp
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
* hexagon: add support for SCALE fp32
* hexagon: replace scalar fp32 -> fp16 copy with HVX
* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA
  - Implements double-buffered DMA prefetching for K, V, and Mask tensors (see the streaming sketch after this list).
  - Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations (see the padding sketch after this list).
  - Correctly synchronizes DMA transfers to prevent race conditions.
  - Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
* hexagon: use aligned mad_f16
* hexagon: flash_attn: use more aligned ops
* hexagon: optimize scale_f32 hvx helpers
* hexagon: unroll flash-attention loops
* hexagon: remove unused set-rows log
* hexagon: flash_attn_ext: add support for DMAing Q
  - Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
  - Pad Q row size to 128 bytes for alignment.
  - Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
  - Update dot product computations to use VTCM-buffered Q data.
* hexagon: fix handling of NaNs in hvx dot products
* hexagon: clean up scratchpad allocation in flash-attn
* hexagon: improve fp16/fp32 matmul
  - Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics (a scalar reference for what these compute follows this list).
  - Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
  - Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
  - Implemented fallback logic to the original implementation for complex broadcasting scenarios.
* hexagon: fix HVX_ARCH check
* hexagon: matmul cleanup and fp16 fixes. Use aligned vec_dot_f16 for 2D matmuls and the unaligned version for 4D.
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring
* hexagon: add support for GET_ROWS f32 -> f32. Also optimize SET_ROWS threading a bit when we have just a few rows to process.
* hexagon: optimize set-rows threading (see the thread-split sketch after this list)
* hexagon: update adb/run-bench.sh to properly support experimental and verbose options
* hexagon: flash_attn: use aligned vectors for dot products
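Several items above rely on rows staged in VTCM being padded out to 128 bytes, the HVX vector width, so every row starts on an aligned boundary and the aligned dot-product/mad kernels can be used. Below is a minimal sketch of that padding arithmetic; the helper names are hypothetical, not the identifiers used in the hexagon backend.

```cpp
#include <cstddef>

// HVX vectors are 128 bytes wide; rows staged in VTCM are padded so each row
// starts on a 128-byte boundary and can be touched with aligned vector loads.
constexpr std::size_t kHvxVectorBytes = 128;

// Round a row size in bytes up to the next multiple of 128 (hypothetical helper).
static inline std::size_t pad_row_bytes(std::size_t row_bytes) {
    return (row_bytes + kHvxVectorBytes - 1) & ~(kHvxVectorBytes - 1);
}

// Scratchpad space for n_rows padded rows, e.g. a 96-element f16 K row
// (192 bytes) occupies 256 bytes per row in VTCM.
static inline std::size_t padded_rows_bytes(std::size_t row_bytes, std::size_t n_rows) {
    return n_rows * pad_row_bytes(row_bytes);
}
```

The same rounding is what "Pad Q row size to 128 bytes for alignment" refers to when the Q row size is added to the scratchpad allocation.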
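The flash-attention path streams K, V, and Mask blocks from DDR into VTCM with double-buffered DMA: while block b is being consumed, block b+1 is already in flight. The sketch below shows only that ping-pong control flow under stated assumptions; `dma_start_2d` and `dma_wait` are hypothetical stand-ins for the Hexagon user-DMA calls (implemented here as a synchronous copy so the sketch runs anywhere), and the block size of 128 rows mirrors the commit's `FLASH_ATTN_BLOCK_SIZE`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-ins for the Hexagon user-DMA API (hypothetical names). The real
// backend queues an asynchronous 2D descriptor; this sketch copies
// synchronously so the control flow can be followed on any machine.
static void dma_start_2d(uint8_t * dst, std::size_t dst_stride,
                         const uint8_t * src, std::size_t src_stride,
                         std::size_t row_bytes, std::size_t n_rows) {
    for (std::size_t r = 0; r < n_rows; ++r) {
        std::memcpy(dst + r * dst_stride, src + r * src_stride, row_bytes);
    }
}
static void dma_wait() { /* would block until the queued transfer completes */ }

constexpr std::size_t kBlockRows = 128;  // FLASH_ATTN_BLOCK_SIZE in the commit

// Double-buffered streaming of K rows from DDR into two VTCM blocks:
// block b+1 is transferred while block b is being consumed.
void stream_k_blocks(const uint8_t * k_ddr, std::size_t src_stride, std::size_t row_bytes,
                     std::size_t n_rows, std::size_t vtcm_stride, uint8_t * vtcm_buf[2],
                     void (*consume)(const uint8_t * rows, std::size_t n)) {
    const std::size_t n_blocks = (n_rows + kBlockRows - 1) / kBlockRows;
    if (n_blocks == 0) return;

    auto rows_in = [&](std::size_t b) { return std::min(kBlockRows, n_rows - b * kBlockRows); };

    dma_start_2d(vtcm_buf[0], vtcm_stride, k_ddr, src_stride, row_bytes, rows_in(0));

    for (std::size_t b = 0; b < n_blocks; ++b) {
        dma_wait();  // block b is now resident in VTCM

        if (b + 1 < n_blocks) {  // prefetch block b+1 before touching block b
            dma_start_2d(vtcm_buf[(b + 1) % 2], vtcm_stride,
                         k_ddr + (b + 1) * kBlockRows * src_stride, src_stride,
                         row_bytes, rows_in(b + 1));
        }

        consume(vtcm_buf[b % 2], rows_in(b));  // compute overlaps the next DMA
    }
}
```

Waiting on block b before issuing the prefetch for b+1 is what keeps the two buffers from racing, which is the synchronization point the commit calls out.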
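The new `vec_dot_f16_f16` / `vec_dot_f16_f16_rx2` kernels are written with HVX dot-product intrinsics; the sketch below is only a scalar reference for what such a kernel computes, assuming accumulation is done in fp32. The commit's NaN-handling fix lives in the HVX path and is not reproduced here, and the converter is a simplified stand-in for ggml's own f16-to-f32 conversion.

```cpp
#include <cstdint>
#include <cstring>

// Minimal f16 -> f32 converter so the sketch is self-contained (ggml has its
// own conversion helpers). Subnormals are flushed to zero for brevity.
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t mant = (uint32_t)(h & 0x3FF) << 13;
    uint32_t bits;
    if (exp == 0)         bits = sign;                              // zero / subnormal -> 0
    else if (exp == 0x1F) bits = sign | 0x7F800000u | mant;         // inf / NaN preserved
    else                  bits = sign | ((exp + 112) << 23) | mant; // 112 = 127 - 15
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference for an f16 x f16 dot product accumulated in fp32,
// which avoids overflowing the narrow f16 range mid-sum.
static float vec_dot_f16_f16_ref(const uint16_t * x, const uint16_t * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += fp16_to_fp32(x[i]) * fp16_to_fp32(y[i]);
    }
    return sum;
}
```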
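For SET_ROWS with only a few rows, spreading the work over the full thread pool mostly buys synchronization overhead, which is what the threading optimization above targets. Below is a hedged sketch of one way to cap the parallelism at the row count; the names and the exact heuristic are hypothetical, not the backend's actual threading code.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: split n_rows across at most n_threads workers, but
// never activate more workers than there are rows, so a SET_ROWS call with a
// handful of rows does not pay full-pool synchronization costs.
struct row_range { std::size_t first; std::size_t count; };

static row_range rows_for_thread(std::size_t n_rows, int n_threads, int ith) {
    const std::size_t active = std::min<std::size_t>((std::size_t) n_threads, n_rows);
    if ((std::size_t) ith >= active) {
        return {0, 0};  // surplus threads have nothing to do
    }
    const std::size_t per = n_rows / active;
    const std::size_t rem = n_rows % active;  // first `rem` threads take one extra row
    const std::size_t first = (std::size_t) ith * per + std::min<std::size_t>((std::size_t) ith, rem);
    const std::size_t count = per + ((std::size_t) ith < rem ? 1 : 0);
    return {first, count};
}
```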
| Name |
|---|
| apple |
| jinja |
| snapdragon |
| bench-models.sh |
| build-info.sh |
| check-requirements.sh |
| compare-commits.sh |
| compare-llama-bench.py |
| compare-logprobs.py |
| create_ops_docs.py |
| debug-test.sh |
| fetch_server_test_models.py |
| gen-authors.sh |
| gen-unicode-data.py |
| get_chat_template.py |
| get-flags.mk |
| get-hellaswag.sh |
| get-pg.sh |
| get-wikitext-2.sh |
| get-wikitext-103.sh |
| get-winogrande.sh |
| hf.sh |
| install-oneapi.bat |
| serve-static.js |
| server-bench.py |
| sync_vendor.py |
| sync-ggml-am.sh |
| sync-ggml.last |
| sync-ggml.sh |
| tool_bench.py |
| tool_bench.sh |
| verify-checksum-models.py |
| xxd.cmake |