VQ-VAE paper extension improving reconstruction error.

Developed a modification of the VQ-VAE model (image generation) that improves the reconstruction error on the ImageNet dataset by a factor of 2.

VQ-VAE

The Vector Quantised-Variational AutoEncoder (van den Oord et al., 2017) is a very popular model used for image generation. It is the basic idea behind popular image generation products such as Stable Diffusion or DALL·E.

Original VQ-VAE

Motivation

VQ-VAEs have grown in popularity in recent years given the strong results they have shown across generative AI tasks. They are based on an encoder-decoder architecture in which the latent variables are discretized. A discrete space is a more compact way of coding latent variables, and it seems a reasonable choice because real-life objects often have underlying discrete representations.

Nevertheless, discretizing the latent space with a finite codebook, as done in VQ-VAE, has limitations when generating new images: the discrete codes can capture the underlying structure of an image but miss most of its details. Moreover, sampling from such a latent space can only yield a finite set of generated images.

To tackle these issues, I wondered whether the latent space could be encoded in a way that keeps the VQ-VAE's discrete latent properties while combining them with the continuous latents produced by the encoder. This could make the latent space more varied and truer to reality. However, I have not found prior research that combines discrete and continuous latent variables to improve the generative process. In this report I study how such a combination could be done and, if so, whether it results in a better model.
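To make the discretization step concrete, here is a minimal sketch of the VQ-VAE bottleneck: each continuous encoder output is replaced by its nearest codebook vector. The function name `vector_quantize` and the toy shapes are illustrative, not part of the original implementation.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each continuous encoder output z_e (N, D) to its nearest
    vector in a finite codebook (K, D), as in the VQ-VAE bottleneck."""
    # Squared Euclidean distance between every latent and every code: (N, K).
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)      # discrete latent codes (indices into the codebook)
    z_q = codebook[idx]         # quantized latents passed on to the decoder
    return z_q, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K = 8 codes of dimension D = 4
z_e = rng.normal(size=(5, 4))        # 5 continuous encoder outputs
z_q, idx = vector_quantize(z_e, codebook)
```

Because `z_q` is always one of the K codebook rows, the latent space is finite, which is exactly the property the proposed combination of discrete and continuous latents tries to relax.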

Modifications of the original model

This is the main contribution: it changes the loss function used when optimizing the reconstruction of input images. All the details and motivation are given in this paper.
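For context, the loss that the modification targets is that of the original VQ-VAE (van den Oord et al., 2017), which combines a reconstruction term with a codebook term and a commitment term (sg denotes the stop-gradient operator, e the selected codebook vector, and β the commitment weight):

```latex
L = \log p\bigl(x \mid z_q(x)\bigr)
  + \bigl\lVert \mathrm{sg}\!\left[z_e(x)\right] - e \bigr\rVert_2^2
  + \beta \bigl\lVert z_e(x) - \mathrm{sg}\!\left[e\right] \bigr\rVert_2^2
```

The first term drives reconstruction quality, the second moves the codebook vectors toward the encoder outputs, and the third keeps the encoder outputs close to the codes they are assigned to.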

This modification of the algorithm improved the reconstruction error of the original model, yielding very good reconstructions that are in some cases almost identical to the original images.

On the left, a sample of the original and reconstructed images. On the right, the test and training reconstruction errors.

Results

Overall, the proposed method achieves the best reconstruction error, as shown in the following image.

Loss comparison between original method and the proposed method.

Limitations

Using a discrete latent space makes it easy to generate new images by sampling from that space. When projected latents are used instead of discrete ones, however, the projected space is continuous, which makes sampling and generating new images harder. To tackle this, one might fit a prior distribution, such as a Gaussian, to the latent space and use it to generate new images.
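The Gaussian-prior idea could be sketched as follows: fit a diagonal Gaussian to the continuous latents collected from the training set, then sample new latents from it for the decoder. This is a hypothetical illustration; the `latents` array stands in for real encoder outputs, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for continuous latents gathered from the training set
# (1000 examples, 16-dimensional latent space; values are synthetic).
latents = rng.normal(loc=1.5, scale=0.5, size=(1000, 16))

# Fit a diagonal Gaussian prior: per-dimension mean and standard deviation.
mu = latents.mean(axis=0)
sigma = latents.std(axis=0)

# Sample 4 new latent vectors from the fitted prior;
# a trained decoder would then map these to images.
z_new = mu + sigma * rng.standard_normal((4, 16))
```

A richer prior (e.g. a mixture of Gaussians or an autoregressive model over latents) could be substituted in the same place if a single Gaussian proves too restrictive.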

References

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6309–6318.