Re-implementation of the paper “It’s Raw! Audio Generation with State-Space Models”.

Paper by:
Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
Re-implementation by:
İlker Işık, Muhammed Can Keleş
GitHub Repository

On this page, we showcase samples from our PyTorch re-implementation of the SaShiMi paper, which we developed as the semester project for the METU CENG 796 Deep Generative Models course in Spring 2023.

SaShiMi is a deep-learning model designed for generating raw audio. It is built around the recently introduced S4 model (from the paper “Efficiently Modeling Long Sequences with Structured State Spaces”), which targets sequence-to-sequence processing tasks. In SaShiMi, the authors modify S4 for better stability and achieve state-of-the-art results on multiple audio domains (music and speech), with fewer parameters and shorter inference time.
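For a rough sense of how the S4 layers are used, below is a minimal sketch of the kind of pre-norm residual block that deep sequence models like SaShiMi stack. It is illustrative only: the `inner` module (a pointwise MLP in the example) is just a stand-in for the actual S4 layer, which we do not reproduce on this page.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-norm residual wrapper: x + inner(LayerNorm(x))."""
    def __init__(self, dim, inner):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.inner = inner  # stand-in for an S4 layer in this sketch

    def forward(self, x):  # x: (batch, length, dim)
        return x + self.inner(self.norm(x))

# Illustration only: a pointwise MLP standing in for the S4 layer.
dim = 64
block = ResidualBlock(dim, nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)))
y = block(torch.randn(2, 1000, dim))  # -> (2, 1000, 64)
```

In the full model, several such blocks are stacked at multiple temporal resolutions (with pooling in between); the residual-plus-normalization pattern is what keeps the deep stack trainable.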

YouTube Mix

We trained an 8-layer SaShiMi model on the YouTube Mix dataset, a 4-hour recording of solo piano music from YouTube. The audio is sampled at 16 kHz and split into 8-second clips, and the samples are discretized with 8-bit μ-law encoding. 1
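As a rough illustration of this preprocessing step, here is a sketch using torchaudio's μ-law transforms. The file name and resampling details are placeholders, not our exact pipeline.

```python
import torchaudio

# Hypothetical input file; any long recording works for illustration.
waveform, sr = torchaudio.load("piano.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)
waveform = waveform.mean(dim=0)  # mix down to mono

clip_len = 8 * 16_000  # 8-second clips at 16 kHz
clips = waveform[: len(waveform) // clip_len * clip_len].reshape(-1, clip_len)

# 8-bit mu-law companding maps each audio sample to one of 256 integer classes.
encode = torchaudio.transforms.MuLawEncoding(quantization_channels=256)
decode = torchaudio.transforms.MuLawDecoding(quantization_channels=256)
tokens = encode(clips)  # integer values in [0, 255], used as targets for the model
recon = decode(tokens)  # approximate waveform back in [-1, 1]
```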

Some of the 8-second samples generated by this model are given below:

Furthermore, since this is an autoregressive model, we can use it to generate audio of unbounded length. Below, you can find 64-second samples generated by this model:
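For the curious, here is a bare-bones sketch of what such an autoregressive sampling loop looks like. It is illustrative only: it naively re-feeds the whole prefix at every step and assumes a hypothetical `model` that maps a token prefix of shape `(batch, length)` to logits of shape `(batch, length, 256)` over the μ-law classes. The actual implementation instead uses S4's recurrent view for fast per-sample generation.

```python
import torch

@torch.no_grad()
def sample_tokens(model, num_steps, temperature=1.0, device="cpu"):
    """Naive autoregressive sampling: one mu-law token (audio sample) per step."""
    tokens = torch.zeros(1, 1, dtype=torch.long, device=device)  # arbitrary start token
    for _ in range(num_steps):
        logits = model(tokens)[:, -1, :] / temperature   # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and continue
    return tokens[:, 1:]  # drop the start token; mu-law decode to get a waveform

# e.g. 64 seconds at 16 kHz corresponds to sample_tokens(model, 64 * 16_000)
```

Nothing in the loop depends on a fixed length, which is why the same model can produce 8-second or 64-second clips (or longer), at the cost of generation time.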

Ablation Experiments

We've trained several SaShiMi variants with just 2 layers. Some of the results are given below:

🍒 Suspicious of cherry-picking?

Don't worry! All of our generated audio samples are available at this link.

Here is a mirror of this page.

  1. See this post if you want empirical evidence showing why you should use μ-law encoding instead of linear encoding.