Transliteration and Byte Pair Encoding to Improve Tamil to Sinhala Neural Machine Translation

Details

09:45 - 10:00 | Thu 31 May | Seminar Room | T.1.3-4

Session: Big Data, Machine Learning, and Cloud Computing

Abstract

Neural Machine Translation (NMT) is the current state-of-the-art machine translation technique. However, the applicability of NMT to language pairs with high morphological variation is still debatable. A lack of language resources, especially a sufficiently large parallel corpus, causes additional issues and leads to very poor translation performance when NMT is applied to such languages. In this paper, we present three techniques to improve domain-specific NMT performance for the under-resourced, morphologically rich language pair Sinhala-Tamil. Of these three techniques, transliteration is a novel approach to improving domain-specific NMT performance for language pairs such as Sinhala and Tamil, which share a common grammatical structure and have moderate lexical similarity. We built the first transliteration systems for Sinhala to English and Tamil to English, which achieved an accuracy of 99.6% when tested on the parallel corpus used for NMT training. The other technique we employed is Byte Pair Encoding (BPE), which achieves open-vocabulary translation with a fixed vocabulary of subword symbols. Our experiments show that while translation based on independent BPE models and pure transliteration performs only moderately, integrating transliteration to build a joint BPE model for this language pair increases translation quality by 1.68 BLEU points.
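To make the BPE component concrete, the sketch below shows the standard merge-learning procedure in the spirit of Sennrich et al.'s subword approach: the most frequent adjacent symbol pair is merged repeatedly to build a subword vocabulary. This is an illustrative toy, not the authors' actual pipeline; the sample corpus, the number of merges, and the idea of pooling romanized (transliterated) Sinhala and Tamil tokens into one joint model are assumptions made for the example.

```python
# Minimal sketch of Byte Pair Encoding (BPE) merge learning.
# Assumption: a joint model is learned over transliterated (romanized) text from
# both languages, so shared character sequences can yield shared subwords.
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {space-joined-symbols: freq} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs


def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair in the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(words, num_merges):
    """Learn a list of BPE merge operations from a list of word tokens."""
    vocab = {" ".join(w) + " </w>": f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


if __name__ == "__main__":
    # Hypothetical pooled corpus of romanized Sinhala and Tamil tokens.
    corpus = "amma ammata potha pothak puththakam ammavukku".split()
    for merge in learn_bpe(corpus, num_merges=10):
        print(merge)
```

In practice such merges are learned on the full training corpus with a much larger merge budget; training the model jointly over the transliterated text of both languages, rather than independently per language, is what the abstract reports as yielding the 1.68 BLEU improvement.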