Multilingual Technologies and Language Diversity

A course taught in the Department of Computer Science, the Data Science Institute, and the Institute for Comparative Literature and Society, Columbia University, in Spring 2020 and 2021.

Syllabus: Multilingual Technologies and Language Diversity

Smaranda Muresan, Isabelle Zaugg

Transliteration between Mongolian Scripts: Cyrillic and Latin

2021-04-16 Claire Chen, Hannah Kang

Mongolian is written in two different scripts officially – Cyrillic script and Mongolian script – and in Latin script unofficially in online spaces. Because of these different scripts, it has been difficult for Mongolian speakers to communicate with each other, especially in online spaces. To help bridge this gap in communication, we built a transliteration tool between Cyrillic and Latin scripts using deep learning.

Building Armenian Sentiment Analysis Tools using mBERT and Neural Networks

2020-04-16 Ciaran Beckford, Newman Cheng, Selena Huang, Lingsi Kong

Sentiment analysis is a natural language processing technique that summarizes users’ attitudes toward a specific topic. As a low-resource language, Armenian lacks NLP tools for sentiment analysis in general, much less tools suited to social media posts. Our research aims to fill this gap by outlining and comparing two models for Armenian sentiment analysis: one based on multilingual BERT and the other on a pre-trained English sentiment model. We apply our sentiment analysis tool not only as a means of conducting research into the 2020 Nagorno-Karabakh conflict, but also as a validation component against prior work. Our results show that our model is significantly better than random guessing, but much more work is needed to improve its accuracy. Ultimately, building sentiment analysis models using native Armenian word embeddings is the ideal route, although future work on expanding labelled datasets and accounting for dialect differences is needed.

NLP Models for Aragonese: Next Word Predictor and Spell Checker

2021-04-16 David Giliver, Mihika Nadig, Soo Rin Lee

This project attempted to create educational technology tools for Aragonese, specifically a next word predictor and a spell checker, using Google’s BERT model.

Morphological Inflection for Oto-Manguean Languages

2021-04-22 Sahil Jayaram, Diana Abagyan, Tiansheng Sun

Leveraging lexicostatistical information for multilingual morphological inflection

ScriptKey: An Image-Based Keyboard for Unencoded Alphabetic Scripts

2020-04-21 Gabriela Arredondo, David Rosado, Megan St. Hilaire, Eve Washington

For our project, we sought to create a tool to crowdsource symbols for scripts that are not supported by Unicode. This tool could allow users to send messages in under-resourced scripts and languages, and could give developers access to a library of symbol images for future development of Unicode and fonts.

Improving Accessibility for Shanghainese Speakers with Screen Reader for Mandarin

2020-04-27 Akhila Chinepall, Havi Nguyen, Joan Martinez

The goal of our project was to create a screen reader extension that converts Mandarin Chinese characters into Shanghainese audio. This will help visually impaired Shanghainese speakers navigate the web in China.

Machine Translation for Tibetan Buddhist Texts

2021-04-16 Jacob Atkins, Chris Calloway, Kelsey Namba, Rahul Sehrawat

Our project aims to increase the accessibility of Tibetan religious texts by training a machine translation model specifically to translate them into English.