DOC Update README (#55)

PAB 2023-08-12 20:35:15 +02:00 committed by GitHub
parent 8ec7c7287a
commit 00ff99bcd2
5 changed files with 260 additions and 12 deletions

1
.gitignore vendored

@@ -1,6 +1,7 @@
ggml_weights/*
*.dSYM/
build/
models/
bark
encodec

219
README.md

@@ -1,19 +1,214 @@
# bark.cpp (coming soon!)
# bark.cpp
Inference of SunoAI's bark model in pure C/C++ using [ggml](https://github.com/ggerganov/ggml).
![bark.cpp](./assets/banner.jpeg)
[![Actions Status](https://github.com/PABannier/bark.cpp/actions/workflows/build.yml/badge.svg)](https://github.com/PABannier/bark.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[Roadmap](https://github.com/users/PABannier/projects/1) / [encodec.cpp](https://github.com/PABannier/encodec.cpp) / [ggml](https://github.com/ggerganov/ggml)
Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.
## Description
The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model.
The main goal of `bark.cpp` is to efficiently synthesize audio from text input with the [Bark](https://github.com/suno-ai/bark) model, using only the CPU.
Bark has essentially 4 components:
- [x] Semantic model to encode the text input
- [x] Coarse model
- [x] Fine model
- [ ] Encoder (quantizer + decoder) to generate the waveform from the tokens
- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [ ] Mixed F16 / F32 precision
- [ ] 4-bit, 5-bit and 8-bit integer quantization
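To make the flow of data between these stages concrete, here is a minimal, self-contained C++ sketch of the four-stage pipeline. All of the types and `run_*` helpers below are hypothetical stand-ins for illustration only; the actual entry points live in the library sources and have different signatures.

```cpp
// Illustrative sketch of the Bark generation pipeline (hypothetical helpers, NOT the real API).
#include <cstdio>
#include <string>
#include <vector>

using tokens = std::vector<int>;    // stand-in for ggml-backed token buffers
using audio  = std::vector<float>;  // stand-in for the output waveform

// Each stage consumes the output of the previous one:
tokens run_semantic(const std::string & text) { return tokens(text.size(), 0); }   // text     -> semantic tokens
tokens run_coarse(const tokens & semantic)    { return semantic; }                  // semantic -> coarse codebook tokens
tokens run_fine(const tokens & coarse)        { return coarse; }                    // coarse   -> fine codebook tokens
audio  run_encodec(const tokens & fine)       { return audio(fine.size(), 0.0f); }  // tokens   -> waveform (EnCodec decoder)

int main() {
    audio wav = run_encodec(run_fine(run_coarse(run_semantic("this is an audio"))));
    std::printf("generated %zu samples\n", wav.size());
    return 0;
}
```

In the actual implementation, the first three stages are GPT-style models and the last one is the EnCodec decoder, as can be seen in the loading and timing log of the example run below.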
## Roadmap
The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple languages in the future, as well as other vocoders (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)).
This project is for educational purposes.
- [ ] Quantization
- [ ] FP16
- [ ] Swift package for iOS devices
**Supported platforms:**
- [X] Mac OS
- [X] Linux
- [X] Windows
**Supported models:**
- [X] Bark's 24kHz model
- [ ] Bark's 48kHz model
- [ ] Multiple voices
---
Here is a typical run using Bark:
```java
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab
bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB
bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB
bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB
bark_model_load: total model size = 74.64 MB
bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................
bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms
bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................
bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms
bark_forward_fine_encoder: .....
bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms
bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms
Number of frames written = 51840.
main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
```
## Usage
Here are the steps to run the Bark model.
### Get the code
```bash
git clone https://github.com/PABannier/bark.cpp.git
cd bark.cpp
```
### Build
To build `bark.cpp`, you have two options. We recommend using `CMake` on Windows.
- Using `make`:
- On Linux or MacOS:
```bash
make
```
- Using `CMake`:
```bash
mkdir build
cd build
cmake ..
cmake --build . --config Release
```
### Prepare data & Run
```bash
# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models
# install Python dependencies
python3 -m pip install -r requirements.txt
# convert the model to ggml format
python convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./models \
--out-dir ./ggml_weights/
# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
```
### Seminal papers and background on models
- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
### Contributing
`bark.cpp` is a continuous endeavour that relies on community efforts to last and evolve. Your contributions are welcome and highly valuable. They can take the form of:
- bug reports: if you encounter a bug while using `bark.cpp`, don't hesitate to report it in the issues section.
- feature requests: if you want to add a new model or support a new platform, use the issues section to make suggestions.
- pull requests: whether you have fixed a bug, added a feature, or corrected a small typo in the documentation, submit a pull request and a reviewer will reach out to you.
### Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, keep it simple
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`
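As a quick illustration of these conventions (the function below is a made-up example, not project code):

```cpp
#include <cstdio>

// 4-space indentation, brackets on the same line, `int * ptr` / `const int & ref` spacing.
static void add_offset(int * data, int n, const int & offset) {
    for (int i = 0; i < n; i++) {
        data[i] += offset;
    }
}

int main() {
    int values[3] = {1, 2, 3};
    add_offset(values, 3, 10);
    std::printf("%d %d %d\n", values[0], values[1], values[2]);
    return 0;
}
```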

BIN
assets/banner.jpeg Normal file

Binary file not shown (new image, 690 KiB).

49
download_weights.py Normal file

@@ -0,0 +1,49 @@
import argparse
from pathlib import Path
from huggingface_hub import hf_hub_download
import torch
# Pre-trained EnCodec 24kHz checkpoint, kept as a plain string (pathlib would collapse the "//" in the URL).
ENCODEC_PATH = "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th"
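# Bark sub-model checkpoints (text/semantic, coarse, fine) hosted on the Hugging Face Hub.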
REMOTE_MODEL_PATHS = {
    "text": {
        "repo_id": "suno/bark",
        "file_name": "text_2.pt",
    },
    "coarse": {
        "repo_id": "suno/bark",
        "file_name": "coarse_2.pt",
    },
    "fine": {
        "repo_id": "suno/bark",
        "file_name": "fine_2.pt",
    },
}
parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", type=str, required=True)
if __name__ == "__main__":
args = parser.parse_args()
out_dir = Path(args.download_dir)
out_dir.mkdir(parents=True, exist_ok=True)
print(" ### Downloading bark encoders...")
for model_k in REMOTE_MODEL_PATHS.keys():
model_details = REMOTE_MODEL_PATHS[model_k]
repo_id, filename = model_details["repo_id"], model_details["file_name"]
hf_hub_download(repo_id=repo_id, filename=filename, local_dir=out_dir)
print(" ### Downloading EnCodec weights...")
state_dict = torch.hub.load_state_dict_from_url(
str(ENCODEC_PATH),
map_location="cpu",
check_hash=True
)
with open(out_dir / ENCODEC_PATH.name, "wb") as fout:
torch.save(state_dict, fout)
print("Done.")

3
requirements.txt Normal file

@@ -0,0 +1,3 @@
huggingface-hub>=0.14.1
numpy
torch