Bridging the inference gap in multimodal variational autoencoders
DOI: https://doi.org/10.52933/jdssv.v5i9.160

Keywords: Contrastive learning, multimodality, normalizing flows, variational autoencoders.

Abstract
From medical diagnosis to autonomous vehicles, many critical applications rely on the integration of multiple heterogeneous data modalities. Multimodal Variational Autoencoders offer versatile and scalable methods for generating unobserved modalities from observed ones. Recent models using mixture-of-experts aggregation suffer from theoretical limitations that reduce generation quality on complex datasets. In this article, we propose a novel interpretable model able to learn both joint and conditional distributions without introducing mixture aggregation. Our model follows a multistage training process: after learning the joint distribution with variational inference, we learn the conditional distributions using normalizing flows and a new, theoretically grounded objective function. Importantly, we also propose extracting the semantic content shared between modalities in a pre-training stage and incorporating these representations into the inference distributions to enhance generative coherence. Our method achieves state-of-the-art results on several benchmark datasets.
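To give a concrete sense of the change-of-variables principle behind the normalizing flows mentioned above, the toy sketch below fits a single affine flow z = a·x + b (base x ~ N(0, 1)) to samples by maximum likelihood. This is only an illustration of the flow objective, not the paper's actual architecture; the target samples, the one-dimensional latent space, and the plain gradient-descent loop are all hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for latent codes whose conditional
# distribution a flow would have to model.
target = rng.normal(loc=2.0, scale=0.5, size=5000)

# Affine flow z = a*x + b. By the change of variables,
# log p(z) = log N((z - b) / a; 0, 1) - log|a|.
a, b = 1.0, 0.0
lr = 0.05
for _ in range(2000):
    x = (target - b) / a  # inverse flow maps data back to the base
    # Gradients of the mean negative log-likelihood w.r.t. a and b:
    # NLL = 0.5*x**2 + 0.5*log(2*pi) + log|a|, with x = (z - b) / a.
    grad_a = np.mean(-x**2) / a + 1.0 / a
    grad_b = np.mean(-x) / a
    a -= lr * grad_a
    b -= lr * grad_b

print(round(a, 2), round(b, 2))  # recovers the target's scale and location
```

Maximizing the flow likelihood drives the inverse-mapped samples toward the base distribution, so the learned (a, b) approach the target's standard deviation and mean; deeper flows generalize this to richly structured conditionals.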
