Summary:
Notes on the music generation model MusicGen in the MindSpore AI framework: concepts, practical usage, and step-by-step examples, covering four generation modes: unconditional generation, text-prompted generation, audio-prompted generation, and batched audio-prompted generation.
I. Concepts
Music generation model MusicGen
MusicGen is built on a language model (LM) and generates high-quality music samples from text descriptions or audio prompts.
For details, see the paper "Simple and Controllable Music Generation".
The MusicGen model is based on the Transformer architecture and breaks generation into three stages:
Stage 1: user text description -> text encoder model (Google t5-base and its weights) -> hidden-state representations;
Stage 2: the MusicGen decoder (a language-model architecture) is trained to predict discrete audio tokens from the hidden states;
Stage 3: the audio compression model (EnCodec 32kHz and its weights) decodes the audio tokens to recover the audio waveform.
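The three stages above can be sketched with toy stand-in functions (all function names, shapes, and values here are illustrative, not the mindnlp API):

```python
import numpy as np

# Toy stand-ins for the three MusicGen stages; shapes are illustrative only.
def text_encoder(prompt: str, hidden_dim: int = 8) -> np.ndarray:
    """Stage 1: map a text prompt to hidden states (here: random features)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(prompt.split()), hidden_dim))

def decoder_predict_tokens(hidden: np.ndarray, num_codebooks: int = 4, steps: int = 16) -> np.ndarray:
    """Stage 2: autoregressively predict discrete audio tokens, one row per codebook."""
    rng = np.random.default_rng(1)
    return rng.integers(0, 2048, size=(num_codebooks, steps))

def encodec_decode(tokens: np.ndarray, samples_per_step: int = 640) -> np.ndarray:
    """Stage 3: decode audio tokens back to a waveform (here: noise of the right length)."""
    rng = np.random.default_rng(2)
    return rng.standard_normal(tokens.shape[1] * samples_per_step)

hidden = text_encoder("80s pop track with bassy drums")
tokens = decoder_predict_tokens(hidden)
waveform = encodec_decode(tokens)
print(tokens.shape, waveform.shape)  # (4, 16) (10240,)
```

The 640 samples-per-token figure follows from a 32 kHz sample rate at 50 tokens per second; the real model replaces each stand-in with a trained network.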
MusicGen highlights:
A single-stage Transformer LM with efficient token handling
Generates mono and stereo music samples that match a text description
Supports melody-conditioned control of tonal structure
Uses a codebook delay (interleaving) pattern
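The codebook delay pattern can be illustrated with a small NumPy sketch: each codebook k is shifted right by k steps, so all codebooks can be predicted in parallel at each step while preserving their ordering (a simplified illustration, not mindnlp's implementation):

```python
import numpy as np

PAD = -1  # stand-in for the special padding token

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook row k right by k steps, padding the gaps with PAD."""
    num_codebooks, seq_len = codes.shape
    out = np.full((num_codebooks, seq_len + num_codebooks - 1), PAD, dtype=codes.dtype)
    for k in range(num_codebooks):
        out[k, k:k + seq_len] = codes[k]
    return out

codes = np.arange(12).reshape(4, 3)  # 4 codebooks, 3 time steps
print(apply_delay_pattern(codes))
```

At generation time the pattern is inverted to realign the codebooks before EnCodec decoding.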
Downloading the model
Pretrained weight files:
Small: lowest audio quality, fastest; used in this example
Medium
Large
II. Environment Setup
%%capture captured_output
# mindspore==2.2.14 is preinstalled in this environment; to switch versions, change the version number below
!pip uninstall mindspore -y
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14

%%capture captured_output
# This example was adapted for mindnlp 0.3.1; if it fails to run, pin the version with `!pip install mindnlp==0.3.1 jieba soundfile librosa`
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindnlp jieba soundfile librosa

# Check the current mindspore version
!pip show mindspore
Output:
Name: mindspore
Version: 2.2.14
Summary: MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
Home-page: https://www.mindspore.cn
Author: The MindSpore Authors
Author-email: contact@mindspore.cn
License: Apache 2.0
Location: /home/nginx/miniconda/envs/jupyter/lib/python3.9/site-packages
Requires: asttokens, astunparse, numpy, packaging, pillow, protobuf, psutil, scipy
Required-by: mindnlp
# Load the pretrained MusicGen small checkpoint
from mindnlp.transformers import MusicgenForConditionalGeneration
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
Output:
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.027 seconds.
Prefix dict has been built successfully.
(download progress bars omitted; the model weights are about 2.20 GB)
III. Generating Music
Generation modes:
Greedy decoding
Sampling: usually gives better results; used in this example.
To explicitly enable sampling, set do_sample=True when calling MusicgenForConditionalGeneration.generate.
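The difference between the two modes at a single decoding step can be sketched as follows (illustrative NumPy only, not the library's decoding loop):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])  # scores for 4 candidate tokens

# Greedy decoding: always pick the highest-scoring token -- deterministic.
greedy_token = int(np.argmax(logits))

# Sampling: draw from the softmax distribution -- varied output,
# which tends to work better for music generation.
probs = np.exp(logits) / np.exp(logits).sum()
rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(logits), p=probs))

print(greedy_token, probs.round(3))
```

Greedy decoding would produce the same audio on every run, which is why sampling is the recommended mode here.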
1. Unconditional generation
MusicgenForConditionalGeneration.get_unconditional_inputs supplies random network inputs, and .generate then produces audio autoregressively; do_sample=True enables sampling:
%%time
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
Output:
CPU times: user 7min 17s, sys: 1min 16s, total: 8min 34s
Wall time: 10min 50s
The audio output is a MindSpore tensor of shape (batch_size, num_channels, sequence_length).
Save it to the file musicgen_out.wav with the third-party library scipy.
import scipy
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].asnumpy())
The max_new_tokens parameter sets the number of tokens to generate.
Use the EnCodec model's frame rate to compute the audio length in seconds:
audio_length_in_s = 256 / model.config.audio_encoder.frame_rate
audio_length_in_s
Output:
5.12
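Since 256 tokens correspond to 5.12 s, the EnCodec 32kHz model produces 50 tokens per second, and the conversion works in both directions (a small helper sketch; in practice frame_rate is read from model.config.audio_encoder.frame_rate):

```python
# EnCodec 32kHz runs at 50 audio tokens per second (frame_rate = 50),
# so token count and duration convert by simple arithmetic.
frame_rate = 50  # illustrative constant; read from the model config in practice

def seconds_for_tokens(tokens: int) -> float:
    return tokens / frame_rate

def tokens_for_seconds(seconds: float) -> int:
    return int(seconds * frame_rate)

print(seconds_for_tokens(256))  # 5.12
print(tokens_for_seconds(30))   # 1500, the default maximum generation length
```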
2. Text-prompted generation
The text prompt is preprocessed with AutoProcessor, and .generate then produces a text-conditioned audio sample:
do_sample=True enables sampling.
guidance_scale controls classifier-free guidance (CFG): it weights the conditional logits (predicted from the text prompt) against the unconditional logits (predicted from an unconditional or empty prompt). Higher values tie the generated audio more closely to the input text; setting it above 1 enables CFG.
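The combination that guidance_scale performs can be written in a few lines (the standard CFG formulation; the in-library computation may differ in details):

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Classifier-free guidance: push the unconditional prediction toward
    the text-conditioned one; scale > 1 amplifies the text signal."""
    return uncond + guidance_scale * (cond - uncond)

cond = np.array([2.0, 0.5, 0.1])    # logits predicted from the text prompt
uncond = np.array([1.0, 1.0, 1.0])  # logits predicted from an empty prompt
print(cfg_logits(cond, uncond, 1.0))  # scale 1: just the conditional logits
print(cfg_logits(cond, uncond, 3.0))  # scale 3: conditional signal amplified
```

At scale 3 the gap between conditional and unconditional scores is tripled, which is why higher values follow the prompt more closely.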
%%time
from mindnlp.transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="ms",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
Output:
CPU times: user 7min 18s, sys: 1min 16s, total: 8min 35s
Wall time: 11min 37s
scipy.io.wavfile.write("musicgen_out_text.wav", rate=sampling_rate, data=audio_values[0, 0].asnumpy())
from IPython.display import Audio
# To listen to the generated sample, play it in the notebook with Audio
Audio(audio_values[0].asnumpy(), rate=sampling_rate)
3. Audio-prompted generation
Load an audio prompt file, preprocess it, feed it to the model to generate audio, and save the result as musicgen_out_audio.wav:
%%time
from datasets import load_dataset
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]
# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]
inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="ms",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
Output:
CPU times: user 7min 17s, sys: 1min 14s, total: 8min 32s
Wall time: 12min 8s
scipy.io.wavfile.write("musicgen_out_audio.wav", rate=sampling_rate, data=audio_values[0, 0].asnumpy())
from IPython.display import Audio
# To listen to the generated sample, play it in the notebook with Audio
Audio(audio_values[0].asnumpy(), rate=sampling_rate)
4. Batched audio-prompted generation
Slice the source audio into two samples of different lengths.
Because the audio prompts have different lengths, the batch is padded to the longest sample.
Generate the audio data audio_values.
Strip the padding to recover the final audio.
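What processor.batch_decode does with the padding mask can be sketched like this (a simplified illustration, not the mindnlp implementation):

```python
import numpy as np

def trim_padding(batch: np.ndarray, padding_mask: np.ndarray) -> list:
    """Drop the padded tail of each batched waveform using its padding mask
    (1 = real sample, 0 = padding); returns variable-length arrays."""
    return [wave[mask.astype(bool)] for wave, mask in zip(batch, padding_mask)]

batch = np.array([[0.1, 0.2, 0.0, 0.0],   # shorter prompt, padded with zeros
                  [0.3, 0.4, 0.5, 0.6]])  # full-length prompt
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])
trimmed = trim_padding(batch, mask)
print([len(t) for t in trimmed])  # [2, 4]
```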
sample = next(iter(dataset))["audio"]
# take the first quarter of the audio sample
sample_1 = sample["array"][: len(sample["array"]) // 4]
# take the first half of the audio sample
sample_2 = sample["array"][: len(sample["array"]) // 2]
inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="ms",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
# post-process to remove padding from the batched audio
audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask)
Output:
Audio(audio_values[0], rate=sampling_rate)
IV. Generation Configuration
The default generation configuration:
model.generation_config
Output:
<mindnlp.transformers.generation.configuration_utils.GenerationConfig at 0xfffeb0dc0550>
Sampling mode is enabled by default (do_sample=True)
The guidance scale is 3
The maximum generation length is 1500 tokens (equivalent to 30 seconds of audio)
Changing the default generation parameters:
# increase the guidance scale to 4.0
model.generation_config.guidance_scale = 4.0
# set the max new tokens to 256
model.generation_config.max_new_tokens = 256
# set the softmax sampling temperature to 1.5
model.generation_config.temperature = 1.5
Regenerate with the new configuration:
audio_values = model.generate(**inputs)
Note that arguments passed to generate override the corresponding settings in the generation config.
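That precedence can be pictured as a simple dictionary merge (an illustrative sketch of the behavior, not the mindnlp internals):

```python
# Call-time kwargs take precedence over the stored generation_config defaults
# for that single call; the stored config itself is not changed.
generation_config = {"do_sample": True, "guidance_scale": 4.0, "max_new_tokens": 256}

def effective_config(config: dict, **call_kwargs) -> dict:
    merged = dict(config)
    merged.update(call_kwargs)  # call-time arguments win
    return merged

# Passing guidance_scale=2.0 overrides the stored 4.0 for this call only.
print(effective_config(generation_config, guidance_scale=2.0))
```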