Machine translation

Work package 4

In the fourth work package, we will develop a machine translation engine for the English-Slovenian and Slovenian-English language pairs. It will be available on the DSDE portal as a web application, as part of a voice translation pipeline, and as program code under an open-source license that also permits commercial use. We also plan to create a translation corpus comprising at least 3 million translation units.

Goals

  • We will increase the corpus of bilingual parallel texts for the EN-SL and SL-EN language pairs by at least 3 million translation units.
  • We will develop openly accessible support tools that facilitate text collection and processing, as well as an evaluation methodology that provides comprehensive insight into the quality of the reference translation engine.
  • We will develop a machine translation engine for the EN-SL and SL-EN language pairs, which will upgrade the existing engine developed at the Jožef Stefan Institute, and publish it on a purpose-built portal, where it will be available to all users.

Text collection

Would you like to help us with text collection and contribute to the development of Slovene in the digital environment?

Read more about how you can contribute on our frequently asked questions page.

Texts for the corpus of translations

The currently available bilingual corpus of Slovenian and English texts consists of approximately 34 million aligned sentences from various freely accessible corpora. In this work package, we will increase the size of the corpus by at least 3 million translation units. We will collect texts and their translations from public and private organisations that use computer-assisted translation tools, which means that the texts are already properly segmented and ready for use in training. During text selection, we will focus primarily on domains that are not yet represented in the existing corpus.
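
As an illustration of why already segmented material is convenient, the sketch below reads translation units from a TMX file, a common export format of computer-assisted translation tools. The choice of TMX, the function name read_translation_units and the sample data are assumptions made for this example, not a description of the project's actual tooling.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(tmx_text, src="en", tgt="sl"):
    """Return (source, target) segment pairs from a TMX document (hypothetical helper)."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Reduce regional codes such as "en-GB" to the base language "en".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Keep only units that contain both languages.
        if segments.get(src) and segments.get(tgt):
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Invented sample data for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="sl"><seg>Dobro jutro.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-GB"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="sl-SI"><seg>Hvala.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated segment.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_translation_units(SAMPLE_TMX)
```

Here the third unit is skipped because it has no Slovenian segment, so only complete pairs would reach the training data.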

Support tools and machine translation engine evaluation methodology

Collecting and processing texts for machine translation training requires several new tools. We will create tools for personal data anonymization, tools for (semi-)automatic bilingual text alignment, and tools for extracting suitable texts from larger databases.
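
To give a rough idea of what an anonymization step might do, the following minimal sketch masks e-mail addresses and phone-like numbers in a segment. The patterns, the placeholder labels and the function name anonymize are assumptions for this example. A real tool would also need to handle names, addresses and other personal data, most likely with named-entity recognition, and the phone pattern shown here can over-match other long digit sequences.

```python
import re

# Deliberately simple patterns (assumptions for this sketch, not a complete solution).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s/-]{6,}\d")


def anonymize(segment):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    segment = EMAIL.sub("[EMAIL]", segment)
    return PHONE.sub("[PHONE]", segment)


masked = anonymize("Contact ana.novak@example.si or call +386 1 234 56 78.")
```

Applying the same masking to both sides of a translation unit keeps the source and target segments parallel.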

In addition to support tools, we will also develop a methodology for evaluating the machine translation engine. The primary measure of engine quality, which is reflected in the quality of its translations, will be the automatic BLEU metric. However, as BLEU does not always provide comprehensive qualitative insight, we will develop an additional evaluation method based on manual review. The manual evaluation will be performed by MA translation students who will be trained for the task. The coordinator will review and analyse the evaluation results.
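
For readers unfamiliar with BLEU, the sketch below shows the core of the metric: clipped n-gram precision for n-gram lengths 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration that assumes whitespace tokenization, a single reference per sentence and no smoothing. In practice an established implementation would normally be used, and its scores can differ from this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (simplified sketch, one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a missing n-gram order gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


exact = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"])
```

Because the score depends only on surface overlap with the reference, an acceptable translation that uses different wording scores low, which is one reason a manual review is useful alongside it.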

Prior to the development of the machine translation engine, we will repeat the evaluation of the reference engine developed at the Jožef Stefan Institute within the TraMOOC project, funded by the European Union Horizon 2020 program. The engine is available in open access on the website www.translexy.com.

New models of neural machine translation

The current state of the art in MT is neural machine translation (NMT), which is based on deep neural networks. A number of publicly available platforms exist for training a neural machine translation engine. We will test some of them and ultimately choose one to work with. We will test various settings and adjustments in order to achieve the best possible results, evaluating each build with selected automatic metrics and with the manual evaluation methodology.

The final version of the new neural machine translation engine for the SL-EN and EN-SL language pairs will be made available to all users on a publicly accessible, purpose-built web portal. Users will be able to upload a text in Slovenian or English for translation into the other language of the pair. Once the translation, which will take a couple of minutes, is complete, the user will be able to download it or have it sent to their email address.

We will also prepare a long-term plan for the development of a machine translation engine for offline translation of lectures from Slovenian into English.

Learn more about other work packages

Terminology portal

Maintenance of the infrastructure centre for language resources and technologies

Language resources

© 2020. All rights reserved

Concept and implementation: ENKI, d.o.o.