Authors: Niu, Xinlei; Zhang, Jing; Walder, Christian; Martin, Charles Patrick
Date available: 2026-02-27
ISSN: 1520-6149
ORCID: 0000-0001-5683-7529/work/206651274
Handle: https://hdl.handle.net/1885/733806737

Abstract: We present SoundLoCD, a novel text-to-sound generation framework that incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be trained efficiently under limited computational resources. The integration of a contrastive learning strategy further strengthens the connection between text conditions and the generated outputs, yielding coherent and high-fidelity results. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources, and a comprehensive ablation study validates the contribution of each component within SoundLoCD.

Language: en
Publisher Copyright: © 2024 IEEE.
Keywords: conditional discrete contrastive diffusion; LoRA; text-to-sound generation
Title: SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Year: 2024
DOI: 10.1109/ICASSP48485.2024.10446349
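The abstract attributes SoundLoCD's training efficiency to a LoRA-based adaptation. As a minimal sketch of the general LoRA idea (not the SoundLoCD implementation; all names, shapes, and hyperparameters below are illustrative assumptions): a frozen weight matrix W is augmented with a low-rank update (alpha/r) * B A, so only the small factors A and B are trained.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trained."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8                 # illustrative dimensions
W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # small random init
B = np.zeros((d_out, r))                     # zero init: update starts at zero

x = rng.standard_normal(d_in)
y = lora_forward(x, W, A, B)                 # equals W @ x before any training

full = W.size            # parameters updated by full fine-tuning
lora = A.size + B.size   # parameters updated by LoRA
print(full, lora)        # 262144 vs 8192: ~32x fewer trainable parameters
```

With B initialized to zero, the adapted layer reproduces the pretrained layer exactly at the start of training, which is the standard LoRA trick for stable fine-tuning.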