Transliteration between Mongolian Scripts: Cyrillic and Latin

Claire Chen, Hannah Kang

Mongolian is spoken by 2.6 million people (Eberhard, Simons, & Fennig, 2021) and is officially written in two different scripts: the Mongolian script in Inner Mongolia and the Cyrillic script in Outer Mongolia. This difference has made it difficult for people in Inner and Outer Mongolia to communicate online, as most Mongolians know only one of the two scripts and cannot read text written in the other. Moreover, more and more young people in both Inner and Outer Mongolia are using the Latin script instead of the Cyrillic or the Mongolian script to communicate with each other online (Dovchin, 2015; Mikovic, 2019), adding a third script to the Mongolian language. To help bridge this gap in online communication between Inner and Outer Mongolia, we developed a transliteration tool between the Latin and Cyrillic scripts using deep learning.

Before we started developing our tool, we first needed to examine whether Mongolian speakers wanted or needed a transliteration tool at all. We therefore interviewed Baolier, a native Mongolian speaker from Inner Mongolia living in NYC, who speaks fluent Mongolian and can read and write the Mongolian script. Baolier told us that the Mongolian script is not used very often online, as typing in it requires downloading a WeChat plug-in that converts the text into an image file. She also showed us an alternative, which is to install a Mongolian keyboard on a phone, but it lays out the vertical script horizontally, so the text appears rotated by 90 degrees and is confusing to read. She also checked the existing Cyrillic-Mongolian transliteration tool and found some errors, which we will discuss in more detail later. Finally, she supported the idea of building a Cyrillic-Latin transliteration tool and commented that it would finally allow her to read news articles written in Cyrillic Mongolian.

In addition to consulting a Mongolian speaker, we conducted a survey to examine: 1) which scripts Mongolian speakers use the most in online spaces, and 2) which transliteration tool would be the most useful to them. Since neither of us speaks Mongolian, we sent the survey to the Mongolian studies Facebook group, a public group created for sharing academic posts related to historical and contemporary Mongolia, whose members speak both English and Mongolian. We put filtering questions at the beginning of the survey to make sure that only people who both speak Mongolian and use it in their daily lives to communicate online filled it out, and we gathered six responses in total. Most of our respondents were in their twenties, so our results reflect how young people communicate with each other online in Mongolian.

Our survey indicates that the Cyrillic and Latin scripts are the ones used most often to communicate online, and that the Latin script is used more frequently on social media than in text messages. The Mongolian script is used much less, and one of our respondents commented: “I am from Mongolia. I don’t use Mongolian traditional scripts in my daily life. But I can read and write it smoothly.”

Another question that we asked in our survey was which transliteration or translation tool would benefit Mongolian speakers the most. We included translation in this question because our Mongolian consultant informed us that people in Inner Mongolia also text each other in Chinese, which many Mongolians in Inner Mongolia speak as a second language. The Cyrillic-Latin and Mongolian-Latin transliteration tools received the most votes, which confirmed what we had learned from our interview with Baolier: Mongolian speakers would benefit from a Cyrillic-Latin transliteration tool.

Since the Latin script is only used in informal settings in both Inner and Outer Mongolia, only a few resources were available for parallel data. After connecting with the Mongolian community online, we were able to find 1,003 lines of song lyrics, 90,000 Mongolian family/clan names, 220,000 Mongolian personal names, and 192,000 Mongolian company names to use as parallel data between the Cyrillic and Latin scripts. The lyrics were transliterated from Cyrillic to Latin by Mongolian speakers in an online community (lyricstranslate.com), and the name data came from the Mongolian government’s open data project (opendata.burtgel.gov.mn). Since the family names in this project were written only in Mongolian, Cyrillic, and IPA, and not in the Latin script, we first converted the IPA forms to the Latin script using the Mongolian Latin alphabet mapping devised by Michael Fustumum (Fustumum, n.d.) before using the data to train our model. We then trained a character-level transformer model for 30 epochs on the collected data using Facebook’s fairseq toolkit. To test the accuracy and readability of our tool, we transliterated Cyrillic Mongolian lyrics using both our tool and Lexilogos, the only Cyrillic-Latin transliteration tool for Mongolian available online, and compared the outputs.
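To illustrate what training a character model involves, the sketch below shows one common way to prepare parallel lines for a toolkit that, like fairseq, treats whitespace-separated tokens as vocabulary units. It is a minimal sketch and not a record of our actual preprocessing: the function names and the “▁” placeholder for real spaces are assumptions made for this example.

```python
SPACE_SYMBOL = "▁"  # hypothetical placeholder that stands in for a real space


def to_char_tokens(text: str, space_symbol: str = SPACE_SYMBOL) -> str:
    """Split a line into space-separated characters, marking original spaces."""
    return " ".join(space_symbol if ch == " " else ch for ch in text.strip())


def from_char_tokens(tokens: str, space_symbol: str = SPACE_SYMBOL) -> str:
    """Invert to_char_tokens: join characters and restore original spaces."""
    return "".join(" " if tok == space_symbol else tok for tok in tokens.split())


# One Cyrillic/Latin pair, formatted as source and target lines.
cyrillic, latin = "сайн уу", "sain uu"
print(to_char_tokens(cyrillic))  # с а й н ▁ у у
print(to_char_tokens(latin))     # s a i n ▁ u u
```

With this formatting, each character becomes its own token, so the model learns letter-to-letter correspondences rather than word-level ones, and the model output can be turned back into ordinary text with `from_char_tokens`.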

Overall, our transliteration is more readable because it uses the regular Latin alphabet without diacritical marks. Since people use the regular English keyboard for Latin script input in informal settings such as texting and posting in online forums, diacritical marks place a burden on both reading and typing. Moreover, compared with the Turkish alphabet that Lexilogos uses, our transliteration appears more helpful for Mongolian language learners because of its simplicity. In fact, our Mongolian consultant Baolier noted that for Mongolian learners in China, the simpler version of the Latin script (without the diacritical marks) is precisely what is used in class.

There were still some errors and inconsistencies in our transliterations, though. One error involved the Cyrillic letter “р”: in some words it was rendered as “p” in the Latin script, when it should have been transliterated as “r” in the Latin alphabet. Moreover, some letters in our transliterated text, such as “ü” and “ö,” still included diacritical marks, so on Baolier’s recommendation, we replaced the “ü” with a “u” and the “ö” with an “o.”
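The diacritic replacement described above can be sketched as a small post-processing step. This is an illustrative sketch rather than our actual code: the function names are hypothetical, and the second helper, which lists any Cyrillic characters left in the output, is an added diagnostic that assumes some errors show up as untransliterated Cyrillic letters. It would not catch a wrongly chosen Latin “p.”

```python
import unicodedata

# Map the two diacritic letters mentioned above to their plain counterparts.
PLAIN_LATIN = str.maketrans({"ü": "u", "ö": "o", "Ü": "U", "Ö": "O"})


def simplify_latin(text: str) -> str:
    """Replace ü/ö with u/o so the text can be typed on a plain keyboard."""
    # Normalize first so that "u" + combining diaeresis is also matched.
    return unicodedata.normalize("NFC", text).translate(PLAIN_LATIN)


def leftover_cyrillic(text: str) -> list[str]:
    """Return any Cyrillic characters that remain in a transliterated line."""
    return [ch for ch in text if unicodedata.name(ch, "").startswith("CYRILLIC")]


print(simplify_latin("öndör üül"))     # ondor uul
print(leftover_cyrillic("sa\u0440n"))  # ['р'] (Cyrillic er, U+0440)
```

Running `leftover_cyrillic` over the model output is a cheap way to flag lines that need a manual check before they are shown to readers.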

In conclusion, our tool works better in informal scenarios where readability and efficiency are prioritized, whereas the Lexilogos tool, which uses the Turkish alphabet, is more helpful for scholars working in academia: diacritical marks such as “š” are not used in daily scenarios, but they may offer a more accurate representation of the sounds of the Mongolian language.