mirror of
https://github.com/ggerganov/whisper.cpp
synced 2026-03-17 04:10:38 +01:00
This commit addresses an issue with token timestamps when audio segments
are skipped, in `whisper_exp_compute_token_level_timestamps` related to
the VAD processing and the energy levels.
The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:
```console
(skipped introduction)
↓
Audio segment: [2600ms → 5600ms] (3 seconds of actual audio)
Energy array: [0 → 480652] (samples for 3 seconds)
Token timestamps: [3266ms → 3408ms] (absolute timestamps)
```
So both `s0` and `t1` get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.
This is addressed by using segment-relative timestamps in the
`timestamp_to_sample` and `sample_to_timestamp`.
|
||
|---|---|---|
| .. | ||
| coreml | ||
| openvino | ||
| CMakeLists.txt | ||
| whisper-arch.h | ||
| whisper.cpp | ||