End-to-End Text-to-Speech for Minangkabau Pariaman Dialect Using Variational Autoencoder with Adversarial Learning (VITS)
DOI:
https://doi.org/10.30983/knowbase.v5i1.9909

Keywords:
Machine Learning, Natural Language Processing, Variational Inference with Adversarial Learning for End-to-End Text-to-Speech (VITS), Mean Opinion Score, Minangkabau, Pariaman

Abstract
License
Copyright (c) 2025 Muhammad Dzaki Fakhrezi, Yusra, Muhammad Fikry, Pizaini, Suwanto Sanjaya

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).

