HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

dc.contributor.author: Niu, Xinlei (en)
dc.contributor.author: Zhang, Jing (en)
dc.contributor.author: Martin, Charles Patrick (en)
dc.date.accessioned: 2025-05-23T01:14:06Z
dc.date.available: 2025-05-23T01:14:06Z
dc.date.issued: 2024 (en)
dc.description.abstract: We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pretrained speaker encoder and optimises style text embeddings to align with the speaker style information through contrastive learning in parallel. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multimodal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice in various social media platforms. A comprehensive ablation study further validates the effectiveness of our method. (en)
dc.description.status: Peer-reviewed (en)
dc.format.extent: 5 (en)
dc.identifier.issn: 2308-457X (en)
dc.identifier.other: ORCID:/0000-0001-5683-7529/work/183586150 (en)
dc.identifier.scopus: 85208989191 (en)
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85208989191&partnerID=8YFLogxK (en)
dc.identifier.uri: https://hdl.handle.net/1885/733750653
dc.language.iso: en (en)
dc.relation.ispartofseries: 25th Interspeech Conference 2024 (en)
dc.rights: Publisher Copyright: © 2024 International Speech Communication Association. All rights reserved. (en)
dc.source: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (en)
dc.subject: contrastive learning (en)
dc.subject: hybrid prompt (en)
dc.subject: voice style (en)
dc.title: HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts (en)
dc.type: Conference paper (en)
dspace.entity.type: Publication (en)
local.bibliographicCitation.lastpage: 4372 (en)
local.bibliographicCitation.startpage: 4368 (en)
local.contributor.affiliation: Niu, Xinlei; ANU College of Systems and Society, The Australian National University (en)
local.contributor.affiliation: Zhang, Jing; School of Computing, ANU College of Systems and Society, The Australian National University (en)
local.contributor.affiliation: Martin, Charles Patrick; School of Computing, ANU College of Systems and Society, The Australian National University (en)
local.identifier.doi: 10.21437/Interspeech.2024-46 (en)
local.identifier.pure: 5019a6e7-48d5-4048-b95e-b75b8a277c0b (en)
local.identifier.url: https://www.scopus.com/pages/publications/85208989191 (en)
local.type.status: Published (en)
