DOC Update README (#55)

PAB 2023-08-12 20:35:15 +02:00 committed by GitHub
parent 8ec7c7287a
commit 00ff99bcd2
5 changed files with 260 additions and 12 deletions

1
.gitignore vendored

@@ -1,6 +1,7 @@
ggml_weights/*
*.dSYM/
build/
models/
bark
encodec

219
README.md

@@ -1,19 +1,214 @@
# bark.cpp (coming soon!)
# bark.cpp
Inference of SunoAI's bark model in pure C/C++ using [ggml](https://github.com/ggerganov/ggml).
![bark.cpp](./assets/banner.jpeg)
[![Actions Status](https://github.com/PABannier/bark.cpp/actions/workflows/build.yml/badge.svg)](https://github.com/PABannier/bark.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[Roadmap](https://github.com/users/PABannier/projects/1) / [encodec.cpp](https://github.com/PABannier/encodec.cpp) / [ggml](https://github.com/ggerganov/ggml)
Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.
## Description
The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model.
The main goal of `bark.cpp` is to efficiently synthesize audio from text input with the [Bark](https://github.com/suno-ai/bark) model, using only the CPU.
Bark has essentially 4 components:
- [x] Semantic model to encode the text input
- [x] Coarse model
- [x] Fine model
- [ ] Encoder (quantizer + decoder) to generate the waveform from the tokens
- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [ ] Mixed F16 / F32 precision
- [ ] 4-bit, 5-bit and 8-bit integer quantization
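To make the flow of data between these stages concrete, here is a minimal, self-contained C++ sketch of the four-stage pipeline. All of the types and `run_*` helpers below are hypothetical stand-ins for illustration only; the actual entry points live in the library sources and have different signatures.

```cpp
// Illustrative sketch of the Bark generation pipeline (hypothetical helpers, NOT the real API).
#include <cstdio>
#include <string>
#include <vector>

using tokens = std::vector<int>;    // stand-in for ggml-backed token buffers
using audio  = std::vector<float>;  // stand-in for the output waveform

// Each stage consumes the output of the previous one:
tokens run_semantic(const std::string & text) { return tokens(text.size(), 0); }   // text     -> semantic tokens
tokens run_coarse(const tokens & semantic)    { return semantic; }                  // semantic -> coarse codebook tokens
tokens run_fine(const tokens & coarse)        { return coarse; }                    // coarse   -> fine codebook tokens
audio  run_encodec(const tokens & fine)       { return audio(fine.size(), 0.0f); }  // tokens   -> waveform (EnCodec decoder)

int main() {
    audio wav = run_encodec(run_fine(run_coarse(run_semantic("this is an audio"))));
    std::printf("generated %zu samples\n", wav.size());
    return 0;
}
```

In the actual implementation, the first three stages are GPT-style models and the last one is the EnCodec decoder, as can be seen in the loading and timing log of the example run below.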
## Roadmap
The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple languages in the future, as well as other vocoders (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)).
This project is for educational purposes.
- [ ] Quantization
- [ ] FP16
- [ ] Swift package for iOS devices
**Supported platforms:**
- [X] Mac OS
- [X] Linux
- [X] Windows
**Supported models:**
- [X] Bark's 24kHz model
- [ ] Bark's 48kHz model
- [ ] Multiple voices
---
Here is a typical run using Bark:
```java
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab
bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB
bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB
bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB
bark_model_load: total model size = 74.64 MB
bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................
bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms
bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................
bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms
bark_forward_fine_encoder: .....
bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms
bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms
Number of frames written = 51840.
main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
```
## Usage
Here are the steps to run the Bark model.
### Get the code
```bash
git clone https://github.com/PABannier/bark.cpp.git
cd bark.cpp
```
### Build
To build `bark.cpp`, you have two options. We recommend using `CMake` on Windows.
- Using `make`:
- On Linux or MacOS:
```bash
make
```
- Using `CMake`:
```bash
mkdir build
cd build
cmake ..
cmake --build . --config Release
```
### Prepare data & Run
```bash
# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models
# install Python dependencies
python3 -m pip install -r requirements.txt
# convert the model to ggml format
python convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./models \
--out-dir ./ggml_weights/
# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
```
### Seminal papers and background on models
- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
### Contributing
`bark.cpp` is a continuous endeavour that relies on community efforts to last and evolve. Your contributions are welcome and highly valuable. They can take the form of:
- bug reports: if you encounter a bug while using `bark.cpp`, don't hesitate to report it in the issues section.
- feature requests: if you want to add a new model or support a new platform, use the issues section to make suggestions.
- pull requests: whether you have fixed a bug, added a feature, or corrected a small typo in the documentation, submit a pull request and a reviewer will reach out to you.
### Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, keep it simple
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`
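As a quick illustration of these conventions (the function below is a made-up example, not project code):

```cpp
#include <cstdio>

// 4-space indentation, brackets on the same line, `int * ptr` / `const int & ref` spacing.
static void add_offset(int * data, int n, const int & offset) {
    for (int i = 0; i < n; i++) {
        data[i] += offset;
    }
}

int main() {
    int values[3] = {1, 2, 3};
    add_offset(values, 3, 10);
    std::printf("%d %d %d\n", values[0], values[1], values[2]);
    return 0;
}
```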

BIN
assets/banner.jpeg Normal file

Binary file not shown (new image, 690 KiB).

49
download_weights.py Normal file

@@ -0,0 +1,49 @@
import argparse
from pathlib import Path
from huggingface_hub import hf_hub_download
import torch
# Pre-trained EnCodec 24kHz checkpoint, kept as a plain string (pathlib would collapse the "//" in the URL).
ENCODEC_PATH = "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th"
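# Bark sub-model checkpoints (text/semantic, coarse, fine) hosted on the Hugging Face Hub.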
REMOTE_MODEL_PATHS = {
    "text": {
        "repo_id": "suno/bark",
        "file_name": "text_2.pt",
    },
    "coarse": {
        "repo_id": "suno/bark",
        "file_name": "coarse_2.pt",
    },
    "fine": {
        "repo_id": "suno/bark",
        "file_name": "fine_2.pt",
    },
}
parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", type=str, required=True)
if __name__ == "__main__":
args = parser.parse_args()
out_dir = Path(args.download_dir)
out_dir.mkdir(parents=True, exist_ok=True)
print(" ### Downloading bark encoders...")
for model_k in REMOTE_MODEL_PATHS.keys():
model_details = REMOTE_MODEL_PATHS[model_k]
repo_id, filename = model_details["repo_id"], model_details["file_name"]
hf_hub_download(repo_id=repo_id, filename=filename, local_dir=out_dir)
print(" ### Downloading EnCodec weights...")
state_dict = torch.hub.load_state_dict_from_url(
str(ENCODEC_PATH),
map_location="cpu",
check_hash=True
)
with open(out_dir / ENCODEC_PATH.name, "wb") as fout:
torch.save(state_dict, fout)
print("Done.")

3
requirements.txt Normal file

@@ -0,0 +1,3 @@
huggingface-hub>=0.14.1
numpy
torch