Machine Translation Benchmark Dataset for Languages in the Horn of Africa


Check out the first release of HornMT on GitHub.


Machine translation (MT) systems are now able to provide very accurate results for high-resource language pairs such as English and German. However, for many low-resource languages, MT is still under active research. We propose to develop and publicly share an evaluation dataset to automatically quantify the quality of MT systems for five languages in the Horn of Africa, namely Amharic (am), Oromo (om), Somali (so), Tigrinya (ti) and Afar (aa). The goal is to create a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for these languages.

The purpose of this benchmark dataset is fourfold. First, we envision it as the standard benchmark for evaluating machine translation systems for these languages, in both research and production. It will be an integral part of the growing fields of multilingual machine translation, machine translation between similar languages, and zero-shot translation. Second, we believe this dataset will stimulate students and researchers, both at universities in these countries and elsewhere in the world, to work on machine translation for these impactful languages. These languages are spoken by millions of people, yet little work has been done on them because a proper evaluation dataset is lacking. Third, the design of the dataset makes it easy to build upon and to expand to more languages with significantly less effort than starting from scratch. For example, if the English source sentences used in this dataset are translated into a new language, say French or German, the result is automatically parallel with each of the local languages. Finally, and most importantly, this dataset will support real-world machine translation systems that contribute to the social, economic and political empowerment of millions of people in the Horn of Africa.
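The multi-way design described above can be illustrated with a minimal sketch. It assumes the corpus is held as one list of sentences per language, aligned by index; the function name `translation_pairs`, the in-memory layout and the placeholder strings are hypothetical and do not reflect the actual HornMT file format.

```python
from itertools import permutations


def translation_pairs(corpus):
    """Map every ordered (source, target) language pair to its aligned sentence pairs.

    `corpus` maps a language code to a list of sentences, where index i in
    every list is a translation of the same English source sentence.
    """
    lengths = {len(sentences) for sentences in corpus.values()}
    if len(lengths) > 1:
        raise ValueError("every language needs the same number of aligned sentences")
    return {
        (src, tgt): list(zip(corpus[src], corpus[tgt]))
        for src, tgt in permutations(corpus, 2)
    }


# Placeholder strings stand in for real translations (hypothetical data).
codes = ["en", "am", "om", "so", "ti", "aa"]
corpus = {code: [f"{code} sentence {i}" for i in range(2)] for code in codes}

pairs = translation_pairs(corpus)
print(len(pairs))  # 6 languages give 6 * 5 = 30 translation directions

# Translating only the English side into French adds 12 new directions,
# including French to and from each of the five local languages.
corpus["fr"] = [f"fr sentence {i}" for i in range(2)]
extended = translation_pairs(corpus)
print(len(extended) - len(pairs))  # 42 - 30 = 12
```

Because every sentence shares an index across languages, a single new translation of the English side is enough to obtain test pairs between the new language and every language already in the corpus.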

Figure: The Horn of Africa.


Currently we are working on the following languages: Afar, Amharic, Oromo, Somali and Tigrinya.


We are a group of researchers and language enthusiasts who are committed to advancing machine translation technology in the Horn of Africa. If you are as excited about this project as we are and have experience translating between English and any of the project languages, drop us a line.