Re-implementation of the paper “It’s Raw! Audio Generation with State-Space Models”.

Paper by:
Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
Re-implementation by:
İlker Işık, Muhammed Can Keleş
GitHub Repository

On this page, we showcase samples from our PyTorch re-implementation of the SaShiMi paper, which we developed as the semester project for the METU CENG 796 Deep Generative Models course in Spring 2023.

SaShiMi is a deep-learning model designed for generating raw audio. It is built around the recently introduced S4 model (from the paper “Efficiently Modeling Long Sequences with Structured State Spaces”), which targets sequence-to-sequence processing tasks. In SaShiMi, the authors modify S4 for better stability and achieve state-of-the-art results on multiple audio domains (music and speech), with fewer parameters and shorter inference time.
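For a rough sense of how the S4 layers are used, below is a minimal sketch of the kind of pre-norm residual block that deep sequence models like SaShiMi stack. It is illustrative only: the `inner` module (a pointwise MLP in the example) is just a stand-in for the actual S4 layer, which we do not reproduce on this page.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-norm residual wrapper: x + inner(LayerNorm(x))."""
    def __init__(self, dim, inner):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.inner = inner  # stand-in for an S4 layer in this sketch

    def forward(self, x):  # x: (batch, length, dim)
        return x + self.inner(self.norm(x))

# Illustration only: a pointwise MLP standing in for the S4 layer.
dim = 64
block = ResidualBlock(dim, nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)))
y = block(torch.randn(2, 1000, dim))  # -> (2, 1000, 64)
```

In the full model, several such blocks are stacked at multiple temporal resolutions (with pooling in between); the residual-plus-normalization pattern is what keeps the deep stack trainable.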

YouTube Mix

We trained an 8-layer SaShiMi model on the YouTube Mix dataset, a 4-hour recording of solo piano music from YouTube. The audio is sampled at 16 kHz and split into 8-second clips, and the samples are discretized with 8-bit μ-law encoding. 1
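As a rough illustration of this preprocessing step, here is a sketch using torchaudio's μ-law transforms. The file name and resampling details are placeholders, not our exact pipeline.

```python
import torchaudio

# Hypothetical input file; any long recording works for illustration.
waveform, sr = torchaudio.load("piano.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)
waveform = waveform.mean(dim=0)  # mix down to mono

clip_len = 8 * 16_000  # 8-second clips at 16 kHz
clips = waveform[: len(waveform) // clip_len * clip_len].reshape(-1, clip_len)

# 8-bit mu-law companding maps each audio sample to one of 256 integer classes.
encode = torchaudio.transforms.MuLawEncoding(quantization_channels=256)
decode = torchaudio.transforms.MuLawDecoding(quantization_channels=256)
tokens = encode(clips)  # integer values in [0, 255], used as targets for the model
recon = decode(tokens)  # approximate waveform back in [-1, 1]
```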

Some of the 8-second samples generated by this model are given below:

Furthermore, since this is an autoregressive model, we can use it to generate audio of unbounded length. Below, you can find 64-second samples generated by this model:
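For the curious, here is a bare-bones sketch of what such an autoregressive sampling loop looks like. It is illustrative only: it naively re-feeds the whole prefix at every step and assumes a hypothetical `model` that maps a token prefix of shape `(batch, length)` to logits of shape `(batch, length, 256)` over the μ-law classes. The actual implementation instead uses S4's recurrent view for fast per-sample generation.

```python
import torch

@torch.no_grad()
def sample_tokens(model, num_steps, temperature=1.0, device="cpu"):
    """Naive autoregressive sampling: one mu-law token (audio sample) per step."""
    tokens = torch.zeros(1, 1, dtype=torch.long, device=device)  # arbitrary start token
    for _ in range(num_steps):
        logits = model(tokens)[:, -1, :] / temperature   # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and continue
    return tokens[:, 1:]  # drop the start token; mu-law decode to get a waveform

# e.g. 64 seconds at 16 kHz corresponds to sample_tokens(model, 64 * 16_000)
```

Nothing in the loop depends on a fixed length, which is why the same model can produce 8-second or 64-second clips (or longer), at the cost of generation time.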

Ablation Experiments

We've trained several SaShiMi variants with just 2 layers. Some of the results are given below:

🍒 Suspicious of cherry-picking?

Don't worry! All of our generated audio samples are available at this link.

Here is a mirror of this page.

  1. See this post if you want empirical evidence showing why you should use μ-law encoding instead of linear encoding.