Japanese Neural Incremental Text-to-Speech Synthesis Framework With an Accent Phrase Input
http://hdl.handle.net/10061/0002000110
| Item type | Journal Article(1) |
|---|---|
| Release date | 2024-01-26 |
| Title | |
| Title | Japanese Neural Incremental Text-to-Speech Synthesis Framework With an Accent Phrase Input |
| Language | |
| Language | eng |
| Keyword | |
| Subject scheme | Other |
| Subject | Incremental speech synthesis |
| Keyword | |
| Subject scheme | Other |
| Subject | end-to-end |
| Keyword | |
| Subject scheme | Other |
| Subject | Japanese language |
| Keyword | |
| Subject scheme | Other |
| Subject | accent phrase unit |
| Resource type | |
| Resource type | journal article |
| Access rights | |
| Access rights | open access |
| Authors | Yanagita, Tomoya; Sakti, Sakriani; 中村, 哲 |
| Abstract | |
| Description type | Abstract |
| Description | Work in the development of neural incremental text-to-speech (iTTS), which is attracting increasing attention, has recently pursued low-latency processing by generating speech on the fly before reading complete sentences. Most current state-of-the-art iTTS systems use a prefix-to-prefix neural iTTS framework with look-ahead of 1-2 unit segments (i.e., phonemes or words). However, since the Japanese language is based on accent phrase units that are longer than words, using a prefix-to-prefix neural iTTS with a look-ahead approach increases latency. Here, we propose an alternative to the end-to-end neural iTTS architecture that does not apply look-ahead input when synthesizing speech chunks. We further propose a method to use information from the previous time step by connecting the synthesized vector and the model’s internal state to the current time step. We experimentally investigated the latency of various iTTS systems with different modeling and synthesis chunks. The experimental results show that, for Japanese, the proposed iTTS is able to synthesize better speech quality, with a similar latency range, than the conventional baseline prefix-to-prefix neural iTTS with word units. Moreover, we found that our proposed approach improved the prosodic naturalness among synthesized units in the Japanese language. Subjective evaluations also revealed that the proposed approach with an incremental unit of two accent phrases achieved the best scores in Japanese iTTS systems. |
| Bibliographic information | IEEE Access, vol. 11, pp. 22355-22363, issue date 2023-03-02 |
| Publisher | |
| Publisher | Institute of Electrical and Electronics Engineers |
| ISSN | |
| Source identifier type | EISSN |
| Source identifier | 2169-3536 |
| Publisher version DOI | |
| Relation type | isReplacedBy |
| Identifier type | DOI |
| Related identifier | https://doi.org/10.1109/ACCESS.2023.3251657 |
| Publisher version URI | |
| Relation type | isReplacedBy |
| Identifier type | URI |
| Related identifier | https://ieeexplore.ieee.org/document/10057419 |
| Rights | |
| Rights information resource | https://creativecommons.org/licenses/by-nc-nd/4.0/ |
| Rights information | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ |
| Author version flag | |
| Publication type | NA |