End-to-End Text-to-Speech for Minangkabau Pariaman Dialect Using Variational Autoencoder with Adversarial Learning (VITS)

Authors

  • Muhammad Dzaki Fakhrezi, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia
  • Yusra, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia
  • Muhammad Fikry, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia
  • Pizaini, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia
  • Suwanto Sanjaya, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia

DOI:

https://doi.org/10.30983/knowbase.v5i1.9909

Keywords:

Machine Learning, Natural Language Processing, Variational Inference with Adversarial Learning for End-to-End Text-to-Speech, Mean Opinion Score, Minangkabau, Pariaman

Abstract

Language serves as a medium of human communication for conveying ideas, emotions, and information, both orally and in writing. Each language possesses vocabulary and grammar adapted to its local culture. One of the regional languages that enriches Indonesian, the national language, is Minangkabau. It has four main dialects: Tanah Datar, Lima Puluh Kota, Agam, and Pesisir. The Pesisir dialect in turn comprises several variants, including Padang Kota, Padang Luar Kota, Painan, Tapan, and Pariaman. This study applies Text-to-Speech (TTS) technology to the Minangkabau language, specifically the Pariaman dialect, using the Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS) method. The dialect needs to be preserved to prevent its extinction, and technological development of this kind can broaden its use. VITS was chosen for its ability to produce natural, high-quality speech. The research stages comprise voice data collection and recording, VITS model training, and speech quality evaluation using the Mean Opinion Score (MOS). The final evaluation yields a score of 4.72 out of 5, indicating that the generated speech closely resembles the natural utterances of native speakers. This TTS technology is expected to support the preservation and development of the Pariaman dialect of Minangkabau and to improve information accessibility for its speakers.
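The abstract does not name the toolkit used for VITS training and synthesis. As a minimal sketch only, one common way to run a pretrained VITS model is the open-source Coqui TTS library; the checkpoint name below is a stock public English model used purely for illustration, not the authors' Pariaman-dialect model:

    # Minimal VITS inference sketch with Coqui TTS (pip install TTS).
    # The model name is a public English checkpoint, for illustration
    # only; the study's Pariaman model is not distributed on this page.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/vits")
    tts.tts_to_file(text="A short test sentence.", file_path="sample.wav")

The reported quality metric, the Mean Opinion Score, is the arithmetic mean of listener ratings on a 1-to-5 naturalness scale. A small illustration follows; the per-listener ratings are hypothetical, since the article reports only the aggregate 4.72:

    import statistics

    def mean_opinion_score(ratings):
        """Average listener ratings on the standard 1-5 MOS scale."""
        if not all(1 <= r <= 5 for r in ratings):
            raise ValueError("MOS ratings must lie in the range [1, 5]")
        return statistics.mean(ratings)

    # Hypothetical listening-test ratings, for illustration only.
    ratings = [5, 5, 4, 5, 5, 4, 5, 5, 5, 5]
    print(f"MOS = {mean_opinion_score(ratings):.2f} / 5")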

Published

2025-06-30

Section

Articles
