mirror of
https://github.com/PABannier/bark.cpp
synced 2026-03-03 13:40:57 +01:00
DOC Update README (#55)
This commit is contained in:
parent
8ec7c7287a
commit
00ff99bcd2
1
.gitignore
vendored
@@ -1,6 +1,7 @@
ggml_weights/*
*.dSYM/
build/
models/

bark
encodec
219
README.md
@@ -1,19 +1,214 @@
# bark.cpp

![bark.cpp](./assets/banner.jpeg)

[](https://github.com/PABannier/bark.cpp/actions)
[](https://opensource.org/licenses/MIT)

[Roadmap](https://github.com/users/PABannier/projects/1) / [encodec.cpp](https://github.com/PABannier/encodec.cpp) / [ggml](https://github.com/ggerganov/ggml)

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.
## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model, efficiently, using only a CPU.
Bark has essentially four components:

- [x] Semantic model to encode the text input
- [x] Coarse model
- [x] Fine model
- [ ] Encoder (quantizer + decoder) to generate the waveform from the tokens
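These four stages chain together: the semantic model maps text to semantic tokens, the coarse and fine models turn those into audio tokens, and the encoder decodes tokens into a waveform. As a rough illustration of the data flow only (every function below is a toy placeholder, not the actual `bark.cpp` API):

```python
# Illustrative sketch of Bark's four-stage pipeline.
# All functions here are toy placeholders -- NOT the real bark.cpp API.

def semantic_encode(text):
    """Semantic model: text -> semantic tokens (toy: one token per word)."""
    return [sum(map(ord, w)) % 10_000 for w in text.split()]

def coarse_encode(semantic_tokens):
    """Coarse model: semantic tokens -> first audio codebooks."""
    return [t % 1024 for t in semantic_tokens]

def fine_encode(coarse_tokens):
    """Fine model: fills in the remaining codebooks for each coarse token."""
    return [(t, (t + 1) % 1024) for t in coarse_tokens]

def encodec_decode(fine_tokens):
    """Encoder/decoder: audio tokens -> waveform samples (toy: silence)."""
    return [0.0] * len(fine_tokens)

def generate_audio(text):
    return encodec_decode(fine_encode(coarse_encode(semantic_encode(text))))
```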
- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [ ] Mixed F16 / F32 precision
- [ ] 4-bit, 5-bit and 8-bit integer quantization
## Roadmap

The initial implementation of `bark.cpp` covers Bark's 24kHz English model. We expect to support multiple languages in the future, as well as other vocoders (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)).

This project is for educational purposes.

- [ ] Quantization
- [ ] FP16
- [ ] Swift package for iOS devices
**Supported platforms:**

- [X] Mac OS
- [X] Linux
- [X] Windows

**Supported models:**

- [X] Bark's 24kHz model
- [ ] Bark's 48kHz model
- [ ] Multiple voices
---

Here is a typical run using Bark:

```
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms

bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms

Number of frames written = 51840.

main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
```
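From the log above one can back out a rough real-time factor: 51840 frames at Bark's 24kHz sample rate is about 2.16 s of audio for roughly 34.5 s of eval time. A quick sanity check of that arithmetic:

```python
frames = 51840          # "Number of frames written" in the log above
sample_rate = 24_000    # Bark's 24kHz model
eval_time_s = 34.52053  # "main: eval time" in the log, converted to seconds

audio_s = frames / sample_rate  # duration of the generated audio
rtf = eval_time_s / audio_s     # real-time factor (> 1 means slower than real time)
print(f"{audio_s:.2f}s of audio, RTF = {rtf:.1f}x")
```

So this particular run generates audio roughly 16x slower than real time on the machine in the log.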
## Usage

Here are the steps to run the Bark model.

### Get the code

```bash
git clone https://github.com/PABannier/bark.cpp.git
cd bark.cpp
```
### Build

To build `bark.cpp`, you have two options. We recommend using `CMake` on Windows.

- Using `make`:
  - On Linux or MacOS:

    ```bash
    make
    ```

- Using `CMake`:

  ```bash
  mkdir build
  cd build
  cmake ..
  cmake --build . --config Release
  ```
### Prepare data & Run

```bash
# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml format
python3 convert.py \
    --dir-model ./models \
    --codec-path ./models \
    --vocab-path ./models \
    --out-dir ./ggml_weights/

# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
```
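After a successful run, you may want to sanity-check the generated audio. A minimal sketch using Python's standard `wave` module; note that the output path passed to the helper is your choice — the source does not specify where the binary writes its file:

```python
import wave

def audio_duration_seconds(path):
    # Duration of a generated WAV file = frame count / sample rate.
    # The path is whatever your build wrote; adjust it accordingly.
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# e.g. audio_duration_seconds("output.wav")
```

At 24kHz, the 51840 frames reported in the example log would correspond to about 2.16 seconds of audio.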
### Seminal papers and background on models

- Bark
  - [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
  - [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
### Contributing

`bark.cpp` is a continuous endeavour that relies on the community's efforts to last and evolve. Your contribution is welcome and highly valuable. You can help by:

- Reporting bugs: if you encounter a bug while using `bark.cpp`, don't hesitate to report it in the issues section.
- Requesting features: if you want to add a new model or support a new platform, use the issues section to make suggestions.
- Opening pull requests: whether you fixed a bug, added a feature, or corrected a small typo in the documentation, you can submit a pull request and a reviewer will reach out to you.
||||
### Coding guidelines
|
||||
|
||||
- Avoid adding third-party dependencies, extra files, extra headers, etc.
|
||||
- Always consider cross-compatibility with other operating systems and architectures
|
||||
- Avoid fancy looking modern STL constructs, keep it simple
|
||||
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`
|
||||
|
||||
BIN
assets/banner.jpeg
Normal file
Binary file not shown.
After Width: | Height: | Size: 690 KiB |
49
download_weights.py
Normal file
@@ -0,0 +1,49 @@
import argparse
from pathlib import Path

from huggingface_hub import hf_hub_download
import torch


# NOTE: kept as a plain string; wrapping the URL in Path() would collapse
# "https://" into "https:/" and break the download.
ENCODEC_PATH = "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th"

REMOTE_MODEL_PATHS = {
    "text": {
        "repo_id": "suno/bark",
        "file_name": "text_2.pt",
    },
    "coarse": {
        "repo_id": "suno/bark",
        "file_name": "coarse_2.pt",
    },
    "fine": {
        "repo_id": "suno/bark",
        "file_name": "fine_2.pt",
    },
}

parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", type=str, required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    out_dir = Path(args.download_dir)

    out_dir.mkdir(parents=True, exist_ok=True)

    print(" ### Downloading bark encoders...")
    for model_k in REMOTE_MODEL_PATHS.keys():
        model_details = REMOTE_MODEL_PATHS[model_k]
        repo_id, filename = model_details["repo_id"], model_details["file_name"]
        hf_hub_download(repo_id=repo_id, filename=filename, local_dir=out_dir)

    print(" ### Downloading EnCodec weights...")
    state_dict = torch.hub.load_state_dict_from_url(
        ENCODEC_PATH,
        map_location="cpu",
        check_hash=True,
    )
    with open(out_dir / Path(ENCODEC_PATH).name, "wb") as fout:
        torch.save(state_dict, fout)

    print("Done.")
3
requirements.txt
Normal file
@@ -0,0 +1,3 @@
huggingface-hub>=0.14.1
numpy
torch