High-performance C++ inference implementation for the BS Roformer and Mel-Band Roformer audio source separation models.
This project is a pure C++ inference engine for the BS Roformer and Mel-Band Roformer audio source separation models, built on the GGML tensor library. It is primarily used for extracting vocals or accompaniment from music.
- 🚀 High-Performance Inference: Supports CPU/GPU (CUDA, Vulkan) acceleration
- 🏗️ Multi-Architecture: Support for both Mel-Band Roformer and BS Roformer
- 📦 GGUF Model Format: Unified model file format for easy distribution
- 🎚️ Multiple Quantization Support: FP32/FP16/Q8_0/Q4_0/Q4_1/Q5_0/Q5_1
- 🔧 Easy Deployment: Only requires executable and GGML library
- 🎵 Complete Audio Pipeline: Built-in STFT/ISTFT and audio I/O
- ⚡ Pipeline Optimization: CPU preprocessing and GPU inference run in parallel
- Pre-built Binaries: Download executables for your platform from the Releases page
- GGUF Models: Download pre-converted model files from BSRoformer-GGUF
```
./bs_roformer-cli <model.gguf> <input.wav> <output.wav> [options]

Options:
  --chunk-size <N>   Chunk size (in samples), defaults to model value
  --overlap <N>      Number of overlaps, defaults to model value
  --help, -h         Show help message
```
Parameter Description:
| Parameter | Description |
|---|---|
| `--chunk-size` | Number of audio samples to process at once. Larger values require more VRAM but may improve processing efficiency. Default is typically 352800 (~8 seconds @ 44100 Hz). |
| `--overlap` | Number of overlapping chunks. Increasing this value can improve output quality, since it reduces artifacts when chunks are reassembled, but it also increases inference time. Recommended value: 2-4. |
Examples:
```shell
# Basic usage (using model defaults)
./bs_roformer-cli model.gguf song.wav vocals.wav

# Custom chunking parameters
./bs_roformer-cli model.gguf song.wav vocals.wav --chunk-size 352800 --overlap 2

# High quality mode (increase overlap to reduce artifacts)
./bs_roformer-cli model.gguf song.wav vocals.wav --overlap 4
```
Note: Input audio must be 44100 Hz. Stereo or mono is supported (mono is auto-expanded).
- CMake >= 3.17
- C++17 compatible compiler (MSVC 2019+, GCC 9+, Clang 10+)
- GGML source code (submodule or local directory)
The project supports multiple ways to obtain GGML:
```shell
# Option 1: Git Submodule (Recommended)
git submodule add https://github.com/ggerganov/ggml.git
git submodule update --init --recursive

# Option 2: Sibling Directory
cd ..
git clone https://github.com/ggerganov/ggml.git

# Option 3: Explicit Path
cmake -B build -DGGML_DIR=/path/to/ggml
```
See GGML_DEPENDENCY.md for details.
```shell
# CPU Build
cmake -B build
cmake --build build --config Release --parallel

# CUDA Acceleration (Recommended)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --parallel

# Enable Tests
cmake -B build -DGGML_CUDA=ON -DBSR_BUILD_TESTS=ON
cmake --build build --config Release --parallel
```

| Option | Default | Description |
|---|---|---|
| `GGML_CUDA` | ON | Enable CUDA backend |
| `BSR_BUILD_CLI` | ON | Build command line tool |
| `BSR_BUILD_TESTS` | OFF | Build test suite |

Breaking Change: build/test prefixes were renamed from `MBR_*` to `BSR_*` with no compatibility aliases.
If you need to convert models yourself, use convert_to_gguf.py to convert PyTorch weights to GGUF format.
Install Dependencies:
```shell
pip install torch numpy pyyaml librosa einops gguf
```
Conversion Command:
```shell
python scripts/convert_to_gguf.py \
    --ckpt model.ckpt \
    --config config.yaml \
    --out model.gguf \
    --dtype q8_0

# For BS Roformer (optional, usually auto-detected)
python scripts/convert_to_gguf.py ... --arch bs
```

| Type | Precision | Size | Recommended Use |
|---|---|---|---|
| `fp32` | Highest | 100% | Debugging/Baseline |
| `fp16` | High | 50% | High precision needs |
| `q8_0` | Good | 25% | Recommended (balance of precision and performance) |
| `q5_1` | Medium | 18% | Resource constrained |
| `q4_0` | Lower | 12.5% | Extreme compression |
Note: The conversion script currently does not support K-Quant types (Q4_K, Q5_K, etc.). This is mainly because the gguf-py library has not yet implemented K-Quant quantization (it only supports reading/dequantization), and because most models do not satisfy the K-Quant requirement that tensor dimensions be divisible by 256.
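The divisibility constraint is easy to check per tensor dimension before choosing a quantization type. A small sketch; the block sizes are taken from GGML's documented layouts, where K-Quant types use 256-element super-blocks and the simpler types use 32-element blocks:

```cpp
// K-Quant types (Q4_K, Q5_K, ...) pack weights into 256-element
// super-blocks, so every quantized row length must be divisible by 256.
bool SupportsKQuant(long long dim) { return dim % 256 == 0; }

// The simpler types (Q4_0, Q5_1, Q8_0, ...) use 32-element blocks,
// which is why they apply to far more models.
bool SupportsBasicQuant(long long dim) { return dim % 32 == 0; }

// Example: SupportsKQuant(384) is false (384 % 256 == 128),
// while SupportsBasicQuant(384) is true (384 % 32 == 0).
```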
```cpp
#include <atomic>
#include <iostream>

#include <bs_roformer/inference.h>
#include <bs_roformer/audio.h>

// 1. Load audio file
AudioBuffer input = AudioFile::Load("input.wav");

// 2. Initialize inference engine
Inference engine("model.gguf");

// 3. Get model's recommended inference parameters
int chunk_size = engine.GetDefaultChunkSize();   // e.g., 352800
int num_overlap = engine.GetDefaultNumOverlap(); // e.g., 2

// 4. Run inference (with progress + cancel callbacks)
std::atomic<bool> should_cancel{false};
auto stems = engine.Process(input.data, chunk_size, num_overlap,
    [](float progress) {
        std::cout << "Progress: " << int(progress * 100) << "%" << std::endl;
    },
    [&should_cancel]() {
        return should_cancel.load();
    });

// 5. Save result
AudioBuffer output{stems[0], 2, 44100, stems[0].size()};
AudioFile::Save("vocals.wav", output);
```
If the cancel callback returns true, `Process()` throws `std::runtime_error("Inference cancelled")`.
```
BSRoformer.cpp/
├── include/
│   └── bs_roformer/
│       ├── inference.h            # Inference Engine API
│       └── audio.h                # Audio I/O API
├── src/
│   ├── model.h/cpp                # Model weight loading & graph building (internal)
│   ├── inference.cpp              # Core inference logic (STFT → Network → ISTFT)
│   ├── stft.h                     # STFT/ISTFT implementation (Radix-2 FFT)
│   ├── audio.cpp                  # Audio read/write implementation (dr_wav)
│   └── utils.h/cpp                # NPY loading, tensor comparison tools
├── third_party/
│   └── dr_libs/dr_wav.h           # dr_libs audio library
├── cli/
│   └── main.cpp                   # Command line tool
├── scripts/
│   ├── convert_to_gguf.py         # PyTorch → GGUF conversion tool
│   ├── generate_test_data.py      # Test data generation script
│   └── generate_test_audio.py     # CI test audio generation (no external files needed)
├── tests/                         # Unit test suite
├── models/                        # Model file directory
└── CMakeLists.txt                 # Build configuration
```
The `BSRoformer` class is responsible for:
- GGUF Weight Loading: parsing hyperparameters and tensors from the file
- Buffer Generation: `freq_indices`, `num_bands_per_freq`, etc.
- Computation Graph Building:
  - `BuildBandSplitGraph()` - band split layer
  - `BuildTransformersGraph()` - time-frequency Transformer stack
  - `BuildMaskEstimatorGraph()` - mask estimator
The Inference class implements the complete audio processing pipeline:
Input Audio → Chunking → STFT → Neural Network → Mask Application → ISTFT → Overlap-Add → Output
Key Methods:
| Method | Function |
|---|---|
| `Process()` | Process complete audio (auto-chunking) |
| `ProcessChunk()` | Process a single audio chunk |
| `ComputeSTFT()` | Short-Time Fourier Transform |
| `PostProcessAndISTFT()` | Mask application and inverse transform |
Pipeline Optimization:
```
Chunk N:   [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
Chunk N+1:                    [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
                                     ↑ Parallel execution
```
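After all chunks are processed, overlap-add stitches the per-chunk outputs back into one signal. A minimal sketch of uniform overlap-add, as a hypothetical helper illustrating the technique rather than the engine's actual code:

```cpp
#include <cstddef>
#include <vector>

// Uniform overlap-add: accumulate each chunk's output into the result and
// divide by how many chunks covered each sample. Illustrative only; the
// real engine may use windowed weighting instead of uniform averaging.
std::vector<float> OverlapAdd(const std::vector<std::vector<float>>& chunks,
                              int hop, int total_len) {
    std::vector<float> out(total_len, 0.0f);
    std::vector<float> weight(total_len, 0.0f);
    for (std::size_t c = 0; c < chunks.size(); ++c) {
        const int start = static_cast<int>(c) * hop;
        for (std::size_t i = 0; i < chunks[c].size(); ++i) {
            const int pos = start + static_cast<int>(i);
            if (pos >= total_len) break;
            out[pos] += chunks[c][i];
            weight[pos] += 1.0f;  // count how many chunks touch this sample
        }
    }
    for (int i = 0; i < total_len; ++i)
        if (weight[i] > 0.0f) out[i] /= weight[i];
    return out;
}
```

With two 4-sample chunks of constant 1.0 and hop 2, the doubly covered middle samples sum to 2.0 but average back to 1.0, which is why seams between chunks disappear.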
Pure C++ implementation, numerically aligned with PyTorch torch.stft/istft:
- Radix-2 Cooley-Tukey FFT: Efficient O(N log N) implementation
- Hann Window: Periodic window function
- Center Padding: Reflect mode padding
- OpenMP Parallelization: Frame-level parallel acceleration
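Two of the pieces listed above, the periodic Hann window and reflect-mode center padding, are small enough to sketch directly. This mirrors the `torch.stft` conventions the project targets, but is an illustrative sketch, not the project's internal code:

```cpp
#include <cmath>
#include <vector>

// Periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / N)). The periodic
// form divides by N (not N-1), matching torch's periodic=True default,
// so w[0] == 0 but w[N-1] != 0.
std::vector<float> HannPeriodic(int N) {
    const double kPi = std::acos(-1.0);
    std::vector<float> w(N);
    for (int n = 0; n < N; ++n)
        w[n] = 0.5f * (1.0f - static_cast<float>(std::cos(2.0 * kPi * n / N)));
    return w;
}

// Reflect padding without repeating the edge sample, as in torch.stft's
// center=True: [a b c d] padded by 2 becomes [c b a b c d c b].
// Requires pad < x.size().
std::vector<float> ReflectPad(const std::vector<float>& x, int pad) {
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n + 2 * pad);
    for (int i = 0; i < n + 2 * pad; ++i) {
        int src = i - pad;
        if (src < 0) src = -src;                // mirror at the left edge
        if (src >= n) src = 2 * (n - 1) - src;  // mirror at the right edge
        y[i] = x[src];
    }
    return y;
}
```

Getting both conventions right (periodic vs. symmetric window, reflect vs. edge-repeat padding) is what numerical alignment with PyTorch hinges on.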
Lightweight audio processing based on dr_libs:
- Read: WAV file → `float32` interleaved format
- Write: `float32` interleaved format → WAV file
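"Interleaved" means stereo samples alternate L, R, L, R in a single buffer. A sketch of splitting such a buffer into per-channel (planar) arrays, which DSP code typically prefers; this is an illustrative helper, not part of the project's API:

```cpp
#include <vector>

// Split an interleaved stereo buffer [L0 R0 L1 R1 ...] into two
// per-channel (planar) buffers of `frames` samples each.
void Deinterleave(const std::vector<float>& interleaved,
                  std::vector<float>& left, std::vector<float>& right) {
    const std::size_t frames = interleaved.size() / 2;
    left.resize(frames);
    right.resize(frames);
    for (std::size_t i = 0; i < frames; ++i) {
        left[i]  = interleaved[2 * i];      // even indices: left channel
        right[i] = interleaved[2 * i + 1];  // odd indices: right channel
    }
}
```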
```shell
# Set environment variables
$env:BSR_MODEL_PATH = "models/model.gguf"
$env:BSR_TEST_DATA_DIR = "test_data"

# Run all tests
ctest --test-dir build -C Release

# Run specific test
ctest --test-dir build -C Release -R test_inference
```

| Test File | Verification Content |
|---|---|
| `test_audio` | Audio read/write functionality |
| `test_component_stft` | STFT/ISTFT numerical precision |
| `test_component_bandsplit` | Band split layer |
| `test_component_layers` | Transformer layers |
| `test_component_mask` | Mask estimator |
| `test_inference` | End-to-end inference |
| `test_chunking_logic` | Chunking overlap-add logic |
First clone Music-Source-Separation-Training and install its dependencies:
```shell
git clone https://github.com/ZFTurbo/Music-Source-Separation-Training.git
cd Music-Source-Separation-Training
pip install -r requirements.txt
cd ..

python scripts/generate_test_data.py \
    --model-repo "Music-Source-Separation-Training" \
    --audio "test.wav" \
    --checkpoint "model.ckpt" \
    --output "test_data"
```

- ggerganov/ggml - Efficient tensor library
- ZFTurbo/Music-Source-Separation-Training - PyTorch reference implementation
- dr_libs - Lightweight audio library