SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Date

Authors

Niu, Xinlei
Zhang, Jing
Walder, Christian
Martin, Charles Patrick

Journal Title

Journal ISSN

Volume Title

Publisher

Access Statement

Research Projects

Organizational Units

Journal Issue

Abstract

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD1.

Description

Citation

Source

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Book Title

Entity type

Publication

Access Statement

License Rights

Restricted until