s2.cpp

ALPHA — EXPERIMENTAL SOFTWARE This is an early-stage, community-built inference engine. Expect rough edges, missing features, and breaking changes. Not production-ready.

s2.cpp — Fish Audio's S2 Pro Dual-AR text-to-speech model running locally via a pure C++/GGML inference engine with CPU, Vulkan, and CUDA GPU backends. No Python runtime required after build.

Built on Fish Audio S2 Pro The model weights are licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved. See LICENSE.md for full terms. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.

What this is

This repository contains:

s2.cpp — a self-contained C++17 inference engine built on ggml, handling tokenization, Dual-AR generation, audio codec encode/decode, and WAV output with no Python dependency
tokenizer.json — Qwen3 BPE tokenizer with ByteLevel pre-tokenization
GGUF model files are not included here — see Model variants below

The engine runs the full pipeline: text → tokens → Slow-AR transformer (with KV cache) → Fast-AR codebook decoder → audio codec → WAV file.

Model variants

GGUF files are available at rodrigomt/s2-pro-gguf on Hugging Face.

File	Size	Notes
`s2-pro-f16.gguf`	9.9 GB	Full precision — reference quality
`s2-pro-q8_0.gguf`	5.6 GB	Near-lossless — recommended for 8+ GB VRAM
`s2-pro-q6_k.gguf`	4.5 GB	Good quality/size balance — recommended for 6+ GB VRAM
`s2-pro-q5_k_m.gguf`	4.0 GB	Smaller with still-good quality
`s2-pro-q4_k_m.gguf`	3.6 GB	Best compact variant so far in quick RU validation
`s2-pro-q3_k.gguf`	3.0 GB	Usable, but starts stretching short words
`s2-pro-q2_k.gguf`	2.6 GB	Lowest-size experimental variant

All variants include both the transformer weights and the audio codec in a single file. The quantized variants above were regenerated with the codec tensors (c.*) kept in F16, so only the AR transformer is quantized.

Requirements

Build dependencies

CMake ≥ 3.14
C++17 compiler (GCC ≥ 10, Clang ≥ 11, MSVC 2019+)
For Vulkan GPU support: Vulkan SDK and glslc
For CUDA/NVIDIA GPU support: CUDA Toolkit ≥ 12.4
- MSVC 2019+ note: MSVC 2019 and later require CUDA ≥ 12.4 when building GGML. Older CUDA versions will produce compiler compatibility errors; upgrade to 12.4+ to resolve them.

# Ubuntu / Debian
sudo apt install cmake build-essential

# Vulkan (optional, for AMD/Intel GPU acceleration)
sudo apt install vulkan-tools libvulkan-dev glslc

# CUDA (optional, for NVIDIA GPU acceleration)
# Install from https://developer.nvidia.com/cuda-downloads

Runtime

No Python or PyTorch required. The binary links only against the ggml shared libraries built alongside it.

Building

Clone with submodules (ggml is a submodule):

git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp

CPU only

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)

With Vulkan GPU support (AMD/Intel)

cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)

With CUDA GPU support (NVIDIA)

cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_CUDA=ON
cmake --build build --parallel $(nproc)

The binary is produced at build/s2.

Usage

Basic synthesis (CPU)

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "The quick brown fox jumps over the lazy dog." \
  -o output.wav

tokenizer.json is searched automatically in the same directory as the model file, then the parent directory. If not found in either, it falls back to tokenizer.json in the current working directory.

Voice cloning with a reference audio

Provide a short reference clip (5–30 seconds, WAV or MP3) and a transcript of it:

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of what the reference speaker says." \
  -text "Now synthesize this text in that voice." \
  -o output.wav

By default, the engine uses fish-speech-aligned sampling defaults: --min-tokens-before-end 0, no trailing-silence trim, no peak normalization, and no dynamic loudness normalization. All of these behaviors are optional and can be enabled from the CLI.

GPU inference via Vulkan (AMD/Intel)

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -v 0 \
  -o output.wav

-v 0 selects the first Vulkan device.

GPU inference via CUDA (NVIDIA)

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -c 0 \
  -o output.wav

-c 0 selects the first CUDA device. The transformer runs on GPU; the audio codec always runs on CPU (executes only twice per synthesis).

All options

Flag	Default	Description
`-m`, `--model`	`model.gguf`	Path to GGUF model file
`-t`, `--tokenizer`	`tokenizer.json`	Path to tokenizer.json
`-text`	`"Hello world"`	Text to synthesize
`-pa`, `--prompt-audio`	—	Reference audio file for voice cloning (WAV/MP3)
`-pt`, `--prompt-text`	—	Transcript of the reference audio
`-o`, `--output`	`out.wav`	Output WAV file path
`-v`, `--vulkan`	`-1` (CPU)	Vulkan device index (`-1` = CPU only)
`-c`, `--cuda`	`-1` (CPU)	CUDA device index (`-1` = CPU only)
`-threads N`	`4`	Number of CPU threads
`-max-tokens N`	`1024`	Max tokens to generate
`--min-tokens-before-end N`	`0`	Minimum generated tokens before `EOS` is allowed; `0` matches fish-speech default behavior
`-temp F`	`0.8`	Sampling temperature
`-top-p F`	`0.8`	Top-p nucleus sampling
`-top-k N`	`30`	Top-k sampling
`--dynamic-normalize` / `--no-dynamic-normalize`	`disabled`	Enable or disable dynamic RMS normalization
`--trim-silence` / `--no-trim-silence`	`trim` disabled	Enable or disable trailing silence trimming on the saved WAV
`--normalize` / `--no-normalize`	`normalize` disabled	Enable or disable peak normalization to `0.95` on the saved WAV
`--server`	—	Start HTTP server instead of CLI synthesis
`-H`, `--host`	`127.0.0.1`	Server bind address
`-P`, `--port`	`3030`	Server port

Setting --min-tokens-before-end 0 matches the upstream fish-speech behavior. Non-zero values deliberately bias the model away from early EOS.

HTTP server mode

Start the server:

./build/s2 -m s2-pro-q6_k.gguf --server
# or with custom host/port:
./build/s2 -m s2-pro-q6_k.gguf --server -H 0.0.0.0 -P 8080

POST /generate — synthesize audio (multipart/form-data)

Field	Type	Required	Description
`text`	string	yes	Text to synthesize
`reference`	file	no	Reference audio file for voice cloning (WAV or MP3). Aliases: `reference_audio`, `prompt_audio`, `ref_audio`
`reference_text`	string	if reference audio is provided	Transcript of the reference audio. Aliases: `ref_text`, `prompt_text`
`params`	JSON string	no	Generation params: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `min_tokens_before_end`, `n_threads`, `verbose`

Returns audio/wav.

# Basic
curl -X POST http://127.0.0.1:3030/generate \
  --form "text=Hello world" \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# With voice cloning
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference=@reference.wav" \
  --form "reference_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# Same request using the accepted aliases
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference_audio=@reference.wav" \
  --form "ref_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  -o output.wav

Choosing a model

VRAM available	Recommended model
≥ 10 GB	`q8_0` — near-lossless quality
6–9 GB	`q6_k` — good quality/size balance
5–7 GB	`q4_k_m` — best compact variant in current quick validation
< 5 GB	`q3_k` or `q2_k` — experimental, quality drops faster

VRAM usage at runtime is roughly on the order of the model size, but actual usage depends on backend buffers, KV cache length, and allocator overhead. The audio codec executes on CPU during inference.

Architecture notes

S2 Pro uses a Dual-AR architecture:

Slow-AR — a 36-layer Qwen3-based transformer (4.13B params) that processes the full token sequence with GQA (32 heads, 8 KV heads), RoPE at 1M base, QK norm, and a persistent KV cache
Fast-AR — a 4-layer transformer (0.42B params) that autoregressively generates 10 acoustic codebook tokens from the Slow-AR hidden state for each semantic step
Audio codec — a convolutional encoder/decoder with residual vector quantization (RVQ, 10 codebooks × 4096 entries) that converts between audio waveforms and discrete codes

Total: ~4.56B parameters.

Implementation notes

The C++ engine (src/) is built entirely on ggml (unmodified, pinned as a submodule). Key design decisions:

Separate persistent gallocr allocators for Slow-AR and Fast-AR — each path keeps its own compute buffer, avoiding memory re-planning per token
Temporary prefill allocator — freed immediately after prefill, so the large compute buffer does not persist into the generation loop
Codec on CPU — the audio codec executes once per synthesis (decode only) or twice when a reference audio is provided (encode reference + decode output), so running it on CPU has zero impact on generation throughput
posix_fadvise(DONTNEED) after loading the weights (Linux only) — advises the kernel to drop the GGUF file from page cache once the tensors are already in the backend buffer, reducing duplicate RAM use
Correct ByteLevel tokenization — the GPT-2 byte-to-unicode table is applied before BPE, producing token IDs identical to the HuggingFace reference tokenizer

Tips

Long outputs

Voice quality and amplitude tend to degrade after ~800 tokens (~37 s of audio). For longer texts, split into sentences and concatenate the resulting WAV files. Optional post-processing flags such as --dynamic-normalize, --normalize, and --trim-silence can help clean up the result, but splitting remains the most reliable approach.

Known limitations (alpha)

No streaming output — WAV is written only after full generation completes
No batch inference
Voice cloning quality depends heavily on reference audio length and SNR
Windows: CUDA and Vulkan backends are supported; when using MSVC 2019+, ensure CUDA ≥ 12.4 is installed before building
macOS is untested

License

The model weights and associated materials are licensed under the Fish Audio Research License. Key points:

Research and non-commercial use: free, under the terms of this Agreement
Commercial use: requires a separate written license from Fish Audio
When distributing, you must include a copy of the license and the attribution notice

Full license: LICENSE.md

Commercial licensing: https://fish.audio · business@fish.audio

The inference engine source code (src/) is a Derivative Work of the Fish Audio Materials as defined in the Agreement and is distributed under the same Fish Audio Research License terms.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
ggml @ c044a8e		ggml @ c044a8e
include		include
src		src
third_party		third_party
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE.md		LICENSE.md
README.md		README.md
tokenizer.json		tokenizer.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

s2.cpp

What this is

Model variants

Requirements

Build dependencies

Runtime

Building

CPU only

With Vulkan GPU support (AMD/Intel)

With CUDA GPU support (NVIDIA)

Usage

Basic synthesis (CPU)

Voice cloning with a reference audio

GPU inference via Vulkan (AMD/Intel)

GPU inference via CUDA (NVIDIA)

All options

HTTP server mode

Choosing a model

Architecture notes

Implementation notes

Tips

Long outputs

Known limitations (alpha)

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

s2.cpp

What this is

Model variants

Requirements

Build dependencies

Runtime

Building

CPU only

With Vulkan GPU support (AMD/Intel)

With CUDA GPU support (NVIDIA)

Usage

Basic synthesis (CPU)

Voice cloning with a reference audio

GPU inference via Vulkan (AMD/Intel)

GPU inference via CUDA (NVIDIA)

All options

HTTP server mode

Choosing a model

Architecture notes

Implementation notes

Tips

Long outputs

Known limitations (alpha)

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages