A course taught in the Department of Computer Science, the Data Science Institute, and the Institute for Comparative Literature and Society, Columbia University, in Spring 2020 and 2021.
Smaranda Muresan, Isabelle Zaugg
2021-04-16 Claire Chen, Hannah Kang
Mongolian is officially written in two scripts, Cyrillic script and Mongolian script, and unofficially in Latin script in online spaces. These different scripts make it difficult for Mongolian speakers to communicate with each other, especially online. To help bridge this communication gap, we built a deep-learning transliteration tool between Cyrillic and Latin scripts.
2020-04-16 Ciaran Beckford, Newman Cheng, Selena Huang, Lingsi Kong
Sentiment analysis is a natural language processing technique that summarizes users’ attitudes toward a specific topic. As a low-resource language, Armenian lacks NLP tools for sentiment analysis in general, let alone for social media posts. Our research aims to fill this gap by outlining and comparing two models for Armenian sentiment analysis: one based on multilingual BERT and the other on a pre-trained English sentiment model. We apply our sentiment analysis tool both as a means of researching the 2020 Nagorno-Karabakh conflict and as a validation check against prior work. Our results show that our model performs significantly better than random guessing, but much more work is needed to improve its accuracy. Ultimately, building sentiment analysis models on native Armenian word embeddings is the ideal route, although this will require future work on expanding labelled datasets and accounting for dialect differences.
2021-04-16 David Giliver, Mihika Nadig, Soo Rin Lee
This project attempted to build educational technology tools, specifically a next-word predictor and a spell checker, using Google’s BERT model.
2021-04-22 Sahil Jayaram, Diana Abagyan, Tiansheng Sun
Leveraging lexicostatistical information for multilingual morphological inflection
2020-04-21 Gabriela Arredondo, David Rosado, Megan St. Hilaire, Eve Washington
For our project, we sought to create a tool to crowdsource symbols for scripts that are not supported by Unicode. This tool could allow users to send messages in under-resourced scripts and languages, and give developers access to a library of symbol images for future development of Unicode and fonts.
2020-04-27 Akhila Chinepall, Havi Nguyen, Joan Martinez
The goal of our project was to create a screen reader extension that converts Mandarin Chinese characters into Shanghainese audio. This will help visually impaired Shanghainese speakers navigate the web in China.
2021-04-16 Jacob Atkins, Chris Calloway, Kelsey Namba, Rahul Sehrawat
Our project aims to increase the accessibility of Tibetan religious texts by training a machine translation model specifically to translate them into English.