Zero-Shot Cross-Lingual Text-to-Speech With Style-Enhanced Normalization and Auditory Feedback Training Mechanism
http://hdl.handle.net/10061/0002001296
| Item type | Journal Article(1) | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Release date | 2025-12-16 | |||||||||||
| Title | ||||||||||||
| Title | Zero-Shot Cross-Lingual Text-to-Speech With Style-Enhanced Normalization and Auditory Feedback Training Mechanism | |||||||||||
| Language | ||||||||||||
| Language | eng | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Adaptation models | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Data models | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Training | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Multilingual | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Diffusion models | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Decoding | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Data mining | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Vectors | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Speech enhancement | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Text to speech | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | Zero-shot adaptive TTS | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | cross-lingual TTS | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | diffusion model | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | high-resource languages | |||||||||||
| Keyword | ||||||||||||
| Subject scheme | Other | |||||||||||
| Subject | low-resource languages | |||||||||||
| Resource type | ||||||||||||
| Resource type | journal article | |||||||||||
| Access rights | ||||||||||||
| Access rights | open access | |||||||||||
| Authors | Tran, Chung; Luong, Chi Mai; Sakti, Sakriani | |||||||||||
| Abstract | ||||||||||||
| Description type | Abstract | |||||||||||
| Description | In an increasingly globalized and interconnected world, the ability to communicate in more than one language is a vital skill that can reduce language barriers and promote cultural interaction. However, mastering multiple languages requires a significant investment of time and effort. Here, zero-shot cross-lingual text-to-speech synthesis (TTS) offers benefits to augment human communication by producing high-quality speech in multiple languages while preserving the original speaker's vocal characteristics. However, building such a system presents several challenges, including ensuring high-quality synthesis and achieving similarity between the synthesized speaker and the reference speaker, especially when training a model for low-resource languages. In this study, we propose a novel technique known as Style-Enhanced Normalization TTS (STEN-TTS) to achieve two objectives: preserving synthesis quality while simultaneously enhancing the ability of zero-shot adaptation with just a few seconds of reference for the purpose of cross-lingual synthesis. The model itself can also be trained with low-resource data, but using data of only 10 or 20 minutes is a major challenge. To improve the quality of synthesized audio in low-resource languages, we propose a combination of STEN-TTS with different training methods, including unsupervised text encoding, knowledge distillation, and an auditory feedback mechanism. An experimental evaluation was conducted in five languages (English, Chinese, Indonesian, Japanese, and Vietnamese), considering high- and low-resource training data as well as seen and unseen speakers. The proposed approach has shown its effectiveness in a high-resource setting, achieving a remarkable similarity (SMOS) of 3.44±0.17 for cross-lingual conversion as well as verification scores of 93.4% and 80.5% for seen and unseen speakers, respectively. The results in a low-resource setting, measured by phoneme error rates, also indicate a substantial improvement, with enhancements of approximately 3-4%. In this case, the quality of speaker verification remains consistently high, achieving scores of 90.0% and 78.0% for seen and unseen speakers. | |||||||||||
| Bibliographic information | en: IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 1479-1492 (14 pages), issued 2025-03-05 | |||||||||||
| Publisher | ||||||||||||
| Publisher | IEEE | |||||||||||
| ISSN | ||||||||||||
| Source identifier type | EISSN | |||||||||||
| Source identifier | 2998-4173 | |||||||||||
| Publisher version DOI | ||||||||||||
| Relation type | isReplacedBy | |||||||||||
| Identifier type | DOI | |||||||||||
| Related identifier | https://doi.org/10.1109/TASLPRO.2025.3548429 | |||||||||||
| Publisher version URI | ||||||||||||
| Relation type | isReplacedBy | |||||||||||
| Identifier type | URI | |||||||||||
| Related identifier | https://ieeexplore.ieee.org/abstract/document/10910244 | |||||||||||
| Rights | ||||||||||||
| Rights information resource | https://creativecommons.org/licenses/by/4.0/ | |||||||||||
| Rights information | © 2025 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ | |||||||||||
| Author version flag | ||||||||||||
| Publication type | NA | |||||||||||
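| Funding information | ||||||||||||
| Funder name | Japan Society for the Promotion of Science (JSPS) | |||||||||||
| Award number | JP21H05054 | |||||||||||
| Award number URI | https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-21H05054/ | |||||||||||
| Award title | Research on multi-way automatic interpretation systems and evaluation methods, and their applications (多元自動通訳システムと評価法に関する研究とその応用展開) | |||||||||||
| Funding information | ||||||||||||
| Funder name | Japan Society for the Promotion of Science (JSPS) | |||||||||||
| Award number | JP23K21681 | |||||||||||
| Award number URI | https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-23K21681/ | |||||||||||
| Award title | Construction of low-resource multilingual Machine Speech Chain technology that overcomes language barriers (言語の壁を超える低資源多言語Machine Speech Chain技術の構築) | |||||||||||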