End-to-End Text-to-Speech for Minangkabau Pariaman Dialect Using Variational Autoencoder with Adversarial Learning (VITS)
DOI: https://doi.org/10.30983/knowbase.v5i1.9909

Keywords: Machine Learning, Natural Language Processing, Variational Inference with Adversarial Learning for End-to-End Text-to-Speech, Mean Opinion Score, Minangkabau, Pariaman

Abstract
Language serves as a medium of human communication for conveying ideas, emotions, and information, both orally and in writing. Each language has a vocabulary and grammar adapted to its local culture. One of the regional languages that enriches Indonesian, the national language, is Minangkabau. It has four main dialects: Tanah Datar, Lima Puluh Kota, Agam, and Pesisir. The Pesisir dialect in turn comprises several variants, including Padang Kota, Padang Luar Kota, Painan, Tapan, and Pariaman. This study applies Text-to-Speech (TTS) technology to the Pariaman dialect of Minangkabau using the Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS) method. The dialect needs to be preserved to prevent its extinction, and technological development can broaden its use. VITS was chosen for its ability to produce natural, high-quality speech. The research stages comprise voice data collection and recording, VITS model training, and speech-quality evaluation using the Mean Opinion Score (MOS). The final result is a score of 4.72 out of 5, indicating that the generated speech closely resembles the natural utterances of native speakers. This TTS technology is expected to support the preservation and development of the Pariaman dialect of Minangkabau and to improve information accessibility for its speakers.
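The MOS evaluation mentioned above is, at its core, an average of subjective listener ratings on a 1–5 scale. The sketch below shows this calculation; the rating data is a hypothetical illustration, not the study's actual listener scores (which yielded 4.72).

```python
def mean_opinion_score(ratings):
    """Mean Opinion Score: the average of all listener ratings
    (1-5 scale) pooled across all synthesized utterances."""
    flat = [r for utterance in ratings for r in utterance]
    if not flat:
        raise ValueError("no ratings given")
    return sum(flat) / len(flat)

# Hypothetical example: three synthesized utterances,
# each rated by three listeners on a 1-5 naturalness scale.
example_ratings = [
    [5, 5, 4],
    [5, 4, 5],
    [5, 5, 5],
]
print(round(mean_opinion_score(example_ratings), 2))
```

In practice (e.g., per ITU-T P.800), ratings are collected from a panel of listeners under controlled conditions; a score near 5 indicates speech judged almost indistinguishable from a native speaker.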
License
Copyright (c) 2025 Muhammad Dzaki Fakhrezi, Yusra, Muhammad Fikry, Pizaini, Suwanto Sanjaya

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

