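This repository contains the Hugging Face remote-code implementation and weights for MOSS-Audio-Tokenizer-Nano, the lightweight audio codec used by MOSS-TTS-Nano.

MOSS-Audio-Tokenizer-Nano is a compact discrete audio codec based on the Cat (Causal Audio Tokenizer with Transformer) architecture from *MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models*. The checkpoint in this repository has 21,969,664 parameters (about 22M), far fewer than the full-size MOSS-Audio-Tokenizer, while keeping the 48 kHz stereo codec interface used by the MOSS-TTS family.

Summary: By combining a compact causal Transformer codec with native 48 kHz stereo modeling, MOSS-Audio-Tokenizer-Nano lowers the deployment cost of the MOSS audio codec interface while retaining high-fidelity reconstruction of speech, general audio, and music. It provides a lightweight, low-frame-rate, streaming-friendly discrete audio representation for MOSS-TTS-Nano and other real-time speech generation workflows.

This repository includes a lightweight remote-code implementation that mirrors the `transformers.models.moss_audio_tokenizer` module in current Hugging Face Transformers. Load it with `trust_remote_code=True` when needed.

The table below compares the reconstruction quality of MOSS-Audio-Tokenizer-Nano with open-source audio tokenizers of at most 120M parameters on speech, audio, and music data. Among the compared models, MOSS-Audio-Tokenizer-Nano keeps a small model size while supporting 48 kHz stereo reconstruction.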
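`ch=1` denotes mono audio and `ch=2` denotes stereo audio.

| Model | Params (M) | Sample rate | Channels (ch) | Bitrate (bps) | Quantizers | Speech: SIM ↑ (EN / ZH) | Speech: STOI ↑ (EN / ZH) | Speech: PESQ-NB ↑ (EN / ZH) | Speech: PESQ-WB ↑ (EN / ZH) | Audio / Music: Mel-Loss ↓ | Audio / Music: STFT-Dist. ↓ |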
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mimi VAE | 28 | 24k | 1 | -- | -- | 0.75 / 0.54 | 0.91 / 0.83 | 2.92 / 2.20 | 2.30 / 1.73 | 1.35 / 1.31 | 2.70 / 2.59 |
| DAC | 77 | 44.1k | 1 | 861 | 1 | 0.30 / 0.20 | 0.76 / 0.68 | 1.55 / 1.36 | 1.24 / 1.15 | 1.25 / 1.18 | 2.71 / 2.54 |
| SpeechTokenizer | 120 | 16k | 1 | 1000 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| Mimi | 96 | 24k | 1 | 1100 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS-Audio-Tokenizer-Nano | 22 | 48k | 2 | 750 | 6 | 0.64 / 0.61 | 0.90 / 0.85 | 2.65 / 2.28 | 2.11 / 1.87 | 1.04 / 1.01 | 2.42 / 2.27 |
| MOSS-Audio-Tokenizer-Nano | 22 | 48k | 2 | 1000 | 8 | 0.75 / 0.69 | 0.92 / 0.87 | 2.92 / 2.48 | 2.36 / 2.04 | 1.00 / 0.97 | 2.37 / 2.22 |
| EnCodec | 19 | 48k | 2 | 1500 | 1 | 0.35 / 0.30 | 0.76 / 0.75 | 1.54 / 1.60 | 1.25 / 1.32 | 1.25 / 1.05 | 2.73 / 2.30 |
| SpeechTokenizer | 120 | 16k | 1 | 1500 | 3 | 0.52 / 0.38 | 0.84 / 0.75 | 2.00 / 1.60 | 1.57 / 1.33 | -- / -- | -- / -- |
| Mimi | 96 | 24k | 1 | 1512.5 | 11 | 0.82 / 0.67 | 0.92 / 0.88 | 3.10 / 2.50 | 2.54 / 2.00 | 1.19 / 1.14 | 2.55 / 2.42 |
| DAC | 77 | 44.1k | 1 | 1723 | 2 | 0.57 / 0.47 | 0.86 / 0.80 | 2.21 / 1.85 | 1.74 / 1.49 | 1.03 / 0.99 | 2.43 / 2.26 |
| SpeechTokenizer | 120 | 16k | 1 | 2000 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Mimi | 96 | 24k | 1 | 2062.5 | 15 | 0.87 / 0.73 | 0.94 / 0.90 | 3.36 / 2.76 | 2.81 / 2.22 | 1.14 / 1.09 | 2.49 / 2.36 |
| MOSS-Audio-Tokenizer-Nano | 22 | 48k | 2 | 1500 | 12 | 0.84 / 0.77 | 0.94 / 0.90 | 3.25 / 2.77 | 2.71 / 2.31 | 0.95 / 0.91 | 2.31 / 2.14 |
| MOSS-Audio-Tokenizer-Nano | 22 | 48k | 2 | 2000 | 16 | 0.88 / 0.81 | 0.95 / 0.91 | 3.40 / 2.93 | 2.89 / 2.47 | 0.93 / 0.89 | 2.28 / 2.11 |
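
The bitrates listed for MOSS-Audio-Tokenizer-Nano can be cross-checked with simple arithmetic. The sketch below is illustrative only: it takes the 48 kHz sampling rate and the `downsample_rate=3840` value quoted in this README's streaming example, and it assumes a codebook size of 1024 (10 bits per code). That codebook size is not stated in this document; it is inferred from the table, since it is the value that reproduces the listed bitrates when the Bitrate column is read as bits per second. The actual value can be read from `model.config.quantizer_kwargs["codebook_size"]`.

```python
import math

SAMPLING_RATE = 48_000   # Hz, from this README
DOWNSAMPLE_RATE = 3_840  # samples per code frame, from the streaming example
CODEBOOK_SIZE = 1_024    # ASSUMPTION: inferred from the table, not stated here


def bitrate_bps(num_quantizers: int,
                sampling_rate: int = SAMPLING_RATE,
                downsample_rate: int = DOWNSAMPLE_RATE,
                codebook_size: int = CODEBOOK_SIZE) -> float:
    """Bits per second for a residual quantizer stack of the given depth."""
    frame_rate = sampling_rate / downsample_rate  # 12.5 frames per second
    bits_per_code = math.log2(codebook_size)      # 10 bits under the assumption
    return num_quantizers * frame_rate * bits_per_code


if __name__ == "__main__":
    for n in (6, 8, 12, 16):
        print(n, bitrate_bps(n))
```

Under these assumptions, 6, 8, 12, and 16 quantizers give 750, 1000, 1500, and 2000 bps, matching the four MOSS-Audio-Tokenizer-Nano rows above.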

```python
import torchaudio
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

wav, sr = torchaudio.load("demo/demo_gt.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)

# The public waveform interface expects stereo audio.
if wav.shape[0] == 1:
    # Duplicate the mono channel to fill both stereo channels.
    wav = wav.repeat(model.config.number_channels, 1)
else:
    # Keep only the first `number_channels` channels.
    wav = wav[: model.config.number_channels]

wav = wav.unsqueeze(0)  # add the batch dimension
enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")

dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")
wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode with the first 8 codebooks, roughly 1 kbps.
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
```

`config.attention_implementation` controls whether the Transformer layers prefer `sdpa` or `flash_attention_2`.
`config.compute_dtype` controls the autocast dtype for the non-quantizer modules and accepts `fp32`, `bf16`, and `fp16`.

```python
model.set_attention_implementation("flash_attention_2")
model.set_compute_dtype("fp16")
```

The quantizer always runs in fp32.
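
Which pair of settings to use depends on the hardware. The helper below is a hypothetical sketch, not part of this repository: `choose_runtime_options` and its capability flags are names introduced here for illustration. It encodes one plausible heuristic, assuming that `flash_attention_2` needs a CUDA device, the `flash_attn` package, and a half-precision dtype, and that CPU inference is safest with `sdpa` and `fp32`.

```python
import importlib.util
from typing import Optional, Tuple


def choose_runtime_options(has_cuda: bool,
                           bf16_supported: bool,
                           flash_attn_installed: Optional[bool] = None) -> Tuple[str, str]:
    """Return (attention_implementation, compute_dtype) as config strings.

    Hypothetical helper: the decision rule is an assumption, not library policy.
    """
    if flash_attn_installed is None:
        # Detect the optional flash-attn package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None

    if not has_cuda:
        # No GPU: stay with the portable kernel and full precision.
        return "sdpa", "fp32"

    compute_dtype = "bf16" if bf16_supported else "fp16"
    attention = "flash_attention_2" if flash_attn_installed else "sdpa"
    return attention, compute_dtype


if __name__ == "__main__":
    print(choose_runtime_options(has_cuda=False, bf16_supported=False))
```

With PyTorch available, the two flags could be filled from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the returned strings passed to `model.set_attention_implementation(...)` and `model.set_compute_dtype(...)`.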

`MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through the `chunk_duration` argument.
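
A chunk length has to map onto a whole number of code frames. The following minimal sketch checks a candidate `chunk_duration` before it is passed to the model. `chunk_samples` is a hypothetical helper, not a library function, and the defaults 48000 and 3840 are the sampling rate and `downsample_rate` quoted in this README's streaming example; in real use they would be read from the model config.

```python
from fractions import Fraction


def chunk_samples(chunk_duration: float,
                  sampling_rate: int = 48_000,
                  downsample_rate: int = 3_840) -> int:
    """Return the chunk length in samples, or raise if it is not usable.

    The duration is converted through its decimal string so that values such
    as 0.08 are treated exactly instead of as binary floating point.
    """
    samples = Fraction(str(chunk_duration)) * sampling_rate
    if samples <= 0 or samples.denominator != 1:
        raise ValueError(f"{chunk_duration}s is not a positive whole number of samples")
    if samples.numerator % downsample_rate != 0:
        raise ValueError(
            f"{samples.numerator} samples is not a multiple of {downsample_rate}"
        )
    return samples.numerator


if __name__ == "__main__":
    print(chunk_samples(0.08))  # one frame at 48 kHz with downsample_rate=3840
```

With these defaults, 0.08 s and 0.16 s are accepted (3840 and 7680 samples), while 0.1 s is rejected because 4800 samples is not a multiple of 3840.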
`chunk_duration` is given in seconds, and `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`. Waveform input has shape `(2, T)`, or `(B, 2, T)` for batched stereo input.

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

audio = torch.randn(2, 48000 * 6)  # dummy stereo waveform
# 6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840

# Single-item streaming: 0.08 s chunks are 3840 samples each.
enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)

# Batched streaming with items of different lengths (6 s and 3 s).
batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
# Trim each item's codes to its valid length before decoding.
codes_list = [
    batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
    for i in range(batch_enc.audio_codes.shape[1])
]
batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
```

For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.

- The first streaming call can pass `max_batch_size=...`. If it is omitted, the first batch size reserves the fixed-slot decoder resources for that public stream.
- `finalize_indices` means "decode these rows one last time, then evict them". Indices are interpreted against the logical order before the call.
- `reset_stream=True` discards the hidden public streaming state and starts a new stream.

Milestone 1 boundaries:

- fixed-slot decoder reservation determined by `max_batch_size`

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
codebook_size = model.config.quantizer_kwargs["codebook_size"]

# Random code chunks for three logical streams A, B, and C.
codes_a0 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b0 = torch.randint(0, codebook_size, (num_quantizers, 3))
codes_a1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_c0 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_a2 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_b2 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_c1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b3 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_c2 = torch.randint(0, codebook_size, (num_quantizers, 1))

# First call reserves 3 fixed decoder slots for A and B.
out_ab0 = model.batch_decode(
    [codes_a0, codes_b0],
    streaming=True,
    max_batch_size=3,
    reset_stream=True,
)

# Same logical rows continue in order; C is a tail append.
out_abc1 = model.batch_decode(
    [codes_a1, codes_b1, codes_c0],
    streaming=True,
)

# Finalize A against the pre-call logical order. A still decodes in this call,
# then is evicted immediately afterward.
out_abc2 = model.batch_decode(
    [codes_a2, codes_b2, codes_c1],
    streaming=True,
    finalize_indices=[0],
)

# The next call can shrink to the surviving logical rows only.
out_bc3 = model.batch_decode(
    [codes_b3, codes_c2],
    streaming=True,
)
```

Repository files:

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`

If you use this model or code in your research, please cite:

```bibtex
@misc{gong2026mossttstechnicalreport,
  title={MOSS-TTS Technical Report},
  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2603.18090},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.18090}
}

@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
  title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
  author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2602.10934},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2602.10934}
}
```