Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

Roche, Fanny; Hueber, Thomas; Limier, Samuel; Girin, Laurent

arXiv:1806.04096 (eess)

[Submitted on 11 Jun 2018 (v1), last revised 24 May 2019 (this version, v2)]

Authors: Fanny Roche (1 and 2), Thomas Hueber (1), Samuel Limier (2), Laurent Girin (1 and 3) ((1) Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France, (2) Arturia, Meylan, France, (3) INRIA, Perception Team, Montbonnot, France)

Abstract: This study investigates the use of non-linear unsupervised dimensionality reduction techniques to compress a music dataset into a low-dimensional representation that can in turn be used to synthesize new sounds. We systematically compare (shallow) autoencoders (AEs), deep autoencoders (DAEs), recurrent autoencoders with Long Short-Term Memory cells (LSTM-AEs), and variational autoencoders (VAEs) with principal component analysis (PCA) for encoding the high-resolution short-term magnitude spectrum of a large and dense dataset of music notes into a lower-dimensional vector, which is then converted back to a magnitude spectrum used for sound resynthesis. Our experiments were conducted on the publicly available multi-instrument and multi-pitch database NSynth. Interestingly, and contrary to the recent literature on image processing, we show that PCA systematically outperforms shallow AEs. Only deep and recurrent architectures (DAEs and LSTM-AEs) lead to a lower reconstruction error. Because the optimization criterion in VAEs is the sum of the reconstruction error and a regularization term, VAEs naturally reach a lower reconstruction accuracy than DAEs, but we show that they still outperform PCA while providing a low-dimensional latent space with nice "usability" properties. We also provide corresponding objective measures of perceptual audio quality (PEMO-Q scores), which generally correlate well with the reconstruction error.
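
To make the comparison in the abstract more concrete, the following is a minimal sketch (not the authors' code) of the PCA baseline and of a VAE-style objective. It assumes the spectra are stored as a NumPy array of shape (frames, frequency bins); the function names `pca_fit`, `pca_encode`, `pca_decode`, `rmse` and `vae_loss` are hypothetical, the random data is only a stand-in for NSynth magnitude spectra, and the squared-error reconstruction term and standard-normal prior in `vae_loss` are common choices assumed here rather than details taken from the abstract.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean and the top principal directions of X (frames x bins)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_encode(X, mean, components):
    """Project spectra onto the low-dimensional latent space."""
    return (X - mean) @ components.T

def pca_decode(Z, mean, components):
    """Map latent vectors back to (approximate) magnitude spectra."""
    return Z @ components + mean

def rmse(X, X_hat):
    """Root-mean-square reconstruction error."""
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction error plus a regularization term (assumed form).

    The regularizer is KL(N(mu, diag(exp(log_var))) || N(0, I)).
    """
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return float(recon + kl)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for magnitude spectra: 200 frames, 64 bins.
    X = np.abs(rng.normal(size=(200, 64)))
    for k in (4, 16, 64):
        mean, comps = pca_fit(X, k)
        X_hat = pca_decode(pca_encode(X, mean, comps), mean, comps)
        print(f"latent dim {k:2d}: RMSE = {rmse(X, X_hat):.4f}")
```

The added regularization term in `vae_loss` is non-negative, which is one way to see why a model trained on this criterion trades some reconstruction accuracy for a better-structured latent space.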

Comments:	SMC 2019
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:1806.04096 [eess.AS]
	(or arXiv:1806.04096v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.1806.04096

Submission history

From: Fanny Roche
[v1] Mon, 11 Jun 2018 16:39:16 UTC (1,380 KB)
[v2] Fri, 24 May 2019 08:30:50 UTC (2,275 KB)
