Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil

Fathima Farhath¹, Surangika Ranathunga¹, Sanath Jayasena¹, Gihan Dias¹

¹University of Moratuwa, Sri Lanka

Details

10:15 - 10:30 | Thu 31 May | Seminar Room | T.1.3-6

Session: Big Data, Machine Learning, and Cloud Computing

Abstract

The availability of quality parallel data is a major requirement for building a reasonably well-performing statistical machine translation (SMT) system. Developing a decent SMT system for a low-resourced language pair such as Sinhala and Tamil, which lacks a large parallel corpus, is therefore rather challenging. Past research on other language pairs has shown that various terminology/bilingual list integration methodologies can improve the quality of SMT systems, particularly for domain-specific SMT. In this paper, we explore whether such integration is effective for Sinhala-Tamil machine translation in the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, using four different integration methodologies (three static and one dynamic). Of the four methodologies, one gave notable improvements over the baseline for all three types of list.