SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Authors
Niu, Xinlei
Zhang, Jing
Walder, Christian
Martin, Charles Patrick
Abstract
We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD.
Source
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Entity type
Publication