Building Armenian Sentiment Analysis Tools using mBERT and Neural Networks

Ciaran Beckford, Newman Cheng, Selena Huang, Lingsi Kong

Introduction

Expanding digital support for low-resource languages, in natural language processing and beyond, has long been a challenge. Low-resource languages face not only social barriers, from a lack of extensive investment in their development, but also technological ones, as researchers tend to focus on trending topics. Here, we introduce a framework for building Armenian sentiment analysis tools so that Armenian can continue to move up the digital vitality scale. Our intention was to build Armenian sentiment analysis tools that we could use to study Armenian social media posts about the 2020 Nagorno-Karabakh conflict. We wanted to understand the collective emotions of Armenians regarding this conflict and to see whether sentiment would differ when analyzed in another language such as English. An equally important question we wanted to answer was which type of machine learning framework is best suited to conducting sentiment analysis in Armenian.

Sentiment Analysis Models

We implemented two different approaches to constructing Armenian sentiment analysis tools. The first took inspiration from Galeshchuk et al.'s work, in which the target language (Armenian, in our case) is translated to English before word embeddings are obtained. Our second approach follows the same idea of using word embeddings, but obtains Armenian embeddings directly from multilingual BERT (mBERT) rather than translating to English. At a high level, the intuition behind the two models is the same, but loss in translation between any two languages can degrade NLP tool quality, so we built both models and compared their results. The main obstacle before training either approach, however, was data. As with many other low-resource languages, pre-existing open-source datasets that can readily be used for NLP research are hard to find. For sentiment analysis, we needed sentences labeled with sentiment scores denoting whether the text is positive, negative, or neutral in nature. Without such labeled data in sufficient quantity, it is virtually impossible to train sentiment analysis models.

In our first approach, data collection posed no such challenge because all Armenian text was converted to English. Since the majority of NLP research has been conducted in English, we had no trouble finding a pre-trained BERT sentiment analysis model to score our translated Armenian text. In our second approach, we pursued several avenues to collect Armenian sentences and match them to sentiment labels. First, we manually scraped Twitter and YouTube for almost 500 tweets and comments about the Nagorno-Karabakh conflict in Armenian to form a test set corpus. We were unable to fully utilize Twitter's API due to limitations on historical query windows that we could not overcome. To extend our collection further, we gathered Armenian sentences from shopping websites and forums where review scores were associated with each comment. This technique is commonly used to build English sentiment-labeled datasets, where Amazon reviews are extracted and their scores mapped to sentiment labels. We applied the same idea to Armenian and were able to increase the size of our total dataset. Our last collection technique used Google Forms and Armenian Facebook groups: we created a Google Form questionnaire asking respondents to label Armenian sentences into the respective categories. In total, we received six responses and took the most common label for each sentence as its true label. We were unable to verify whether the respondents spoke Armenian, but we hoped that members of Armenian Facebook groups were invested in the language or culture and had some knowledge of it. Finally, since there were not enough labeled Armenian sentences to train our model, we utilized a labeled English dataset and translated it to Armenian.
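
To make the review-score technique concrete, here is a minimal sketch of the score-to-label mapping, assuming a five-star scale; the cutoffs shown are illustrative assumptions rather than the exact thresholds we used.

```python
def score_to_label(stars: int) -> str:
    """Map a review score to a sentiment label.

    Assumes a 1-5 star scale with 1-2 negative, 3 neutral, and
    4-5 positive; these cutoffs are illustrative assumptions.
    """
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

# Example: (review text, star score) pairs scraped from a shopping site.
scraped_reviews = [("sample glowing review", 5), ("sample angry review", 1)]
labeled = [(text, score_to_label(stars)) for text, stars in scraped_reviews]
```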

Our final dataset consisted of 18,000 sentiment-labeled sentences in Armenian, with a separate held-out test set of 500 tweets that we used for performance assessment. First, we used an 80/20 split on the dataset to create training and validation sets. To train our custom mBERT model, we first used mBERT to obtain word embeddings for the Armenian text and then fed them into a deep neural network to make predictions. The model was fairly simple, using recurrent layers and dropout followed by a softmax-activated layer to make a prediction. This multi-class classification model used categorical cross-entropy as its loss function and classification accuracy as its evaluation metric. Since we did not have to train our own model for our first approach, we utilized an open-source tool called VADER (Valence Aware Dictionary and sEntiment Reasoner), which is specifically designed for analyzing social media text.
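
Below is a minimal sketch of the Method 2 architecture described above, using TensorFlow/Keras with Hugging Face Transformers. The LSTM width, dropout rate, sequence length, and optimizer are our assumptions; the text only specifies recurrent layers, dropout, a softmax output, categorical cross-entropy loss, and accuracy as the metric.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MAX_LEN = 64  # assumed maximum sequence length

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = TFAutoModel.from_pretrained("bert-base-multilingual-cased")
mbert.trainable = False  # use mBERT purely as an embedding extractor

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Contextual word embeddings from mBERT's final hidden layer
embeddings = mbert(input_ids, attention_mask=attention_mask).last_hidden_state

x = tf.keras.layers.LSTM(128)(embeddings)                     # recurrent layer (width assumed)
x = tf.keras.layers.Dropout(0.3)(x)                           # dropout rate assumed
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)   # negative / neutral / positive

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # loss named in the text
              metrics=["accuracy"])             # evaluation metric named in the text

# Usage: tokenize a batch of Armenian sentences (placeholder text here)
enc = tokenizer(["example sentence"], max_length=MAX_LEN,
                padding="max_length", truncation=True, return_tensors="tf")
probs = model([enc["input_ids"], enc["attention_mask"]])
```

For Method 1, VADER can score the translated English text directly. A sketch of that step, using the ±0.05 compound-score cutoffs suggested in VADER's documentation:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(english_text: str) -> str:
    """Label translated text via VADER's compound score, using the
    conventional +/-0.05 cutoffs from the VADER documentation."""
    compound = analyzer.polarity_scores(english_text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```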

Results

Findings

Through our two approaches, we found that each had its advantages and disadvantages. Method 1, which first translated Armenian to English before conducting sentiment analysis, achieved a test set accuracy of 50.00%. The framework thus predicted sentiment better than a random guess, though not by a large margin: with three prediction classes (negative, neutral, and positive), a random guess would score around 33%. Method 2, which trained a model on Armenian word embeddings from mBERT, did worse, with a test set accuracy of 19.42%. Despite this, we believe that using Armenian word embeddings is far more promising in practice.

Through further tests, we realized that our training data lacked neutral-sentiment examples for the model using mBERT embeddings, so we re-ran the test set with neutral predictions excluded. With that change, Method 1 achieved a test set accuracy of 59.22% and Method 2 achieved 70.39%. In this setting, we see much better performance at predicting whether a sentence is positive or negative. With more training and validation data, both methods are bound to improve up to a point. In Figure 1 below, which shows the training and validation accuracies for the Method 2 model, both curves converge at around 85%. In part, the model cannot learn to predict neutral sentences better because of the data gap mentioned above. As expected, the held-out test set yields lower accuracies since the model has never seen that data, giving us a better sense of its true performance.
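
One way to realize the neutral-excluded evaluation described above is to mask the neutral class probability before taking the argmax; this is a hedged reconstruction, and the procedure we actually used may differ in detail.

```python
import numpy as np

CLASSES = ["negative", "neutral", "positive"]  # label order is an assumption

def predict_without_neutral(probs) -> list:
    """Zero out the neutral column of an (n_samples, 3) array of class
    probabilities so the argmax can only select negative or positive."""
    probs = np.asarray(probs, dtype=float).copy()
    probs[:, CLASSES.index("neutral")] = 0.0
    return [CLASSES[i] for i in probs.argmax(axis=1)]
```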

When applying our models to social media posts about the Nagorno-Karabakh conflict specifically, we found a generally negative sentiment. This is not surprising, as conflicts and wars are naturally discussed in negative terms regardless of which side is writing. In a way, we treated this verification of sentiment as a form of evaluation, since we could compare it with Bhattacharjee's work analyzing the sentiment of English tweets about the same conflict. Our work showed negative sentiment regarding the conflict, just as Bhattacharjee's did, suggesting that our model was performing to a reasonably acceptable standard.

Challenges

We have touched on the challenges of building our sentiment analysis models throughout, but would like to dive further into the most prominent ones: data, labeling, and dialect shifts.

Manually scraping tweets and comments was the only way we were able to collect data within the past few weeks. To build a robust and diverse dataset, however, this method would not be sustainable. Typical NLP models are trained on more than one million sentiment-labeled sentences, but we were only able to gather a fraction of that number. Manual collection is time-consuming and labor-intensive, an infeasible task without a cohort of individuals equally invested in language justice. Furthermore, our team had no knowledge of the Armenian language and no substantial access to an Armenian speaker. We were thus constrained to utilizing and augmenting the small dataset we were able to gather and label.

Closely related to the data challenge was the difficulty of verifying the accuracy of our labeled data. Both Method 1 and Method 2 yielded decent test set accuracies, but we were wholly unsure of their validity. Some of these challenges have direct explanations, such as our group's lack of knowledge about Armenian. We were unable to overcome them despite reaching out for help, although with more resources in the future it is certainly possible. It is an inherent challenge of building resources for low-resource languages that there is no widely accessible group of speakers; as Bird notes, groups advocating for language justice should work closely with native speakers to set community goals, a point that rings loudly in our scenario. Other factors were out of our control, such as verifying Google Form respondents and breaking ties among their responses.

Eastern Armenian is the standard dialect spoken in Armenia, alongside more than seven others, and still more dialects are spoken by the Armenian diaspora in the Middle East and America. Unlike language families such as Chinese, where many dialects (Mandarin, Cantonese, etc.) share one script, Armenian uses an alphabet whose spellings change with how speakers pronounce words. Furthermore, the Karabakh dialect is the one spoken in the conflict region, and we were unable to determine whether this affected the second model, since many English sentences were translated into the standard Eastern Armenian dialect. The Karabakh word հի՞նչ, meaning "what?", is clearly different from the Eastern Armenian ի՞նչ. In an ideal world, we would have a native Armenian researcher working with us to help build a model inclusive of all dialects, thereby representing more Armenians and improving the model's accuracy.

Conclusion

While neither model is fully mature, the precedent set by our work is important not only for developing Armenian sentiment analysis tools further, but also for other low-resource languages. The accuracy gains from building sentiment analysis models on native word embeddings make them a worthwhile tradeoff when developing such tools. For many other low-resource languages with mBERT support, such as Azerbaijani, Aragonese, or Catalan, this is feasible. Unfortunately, even lower-resource languages like Mongolian and Rohingya may not be supported by mBERT at all, calling for different approaches. Perhaps, then, translating such a language's text into English or another higher-resource language before performing sentiment analysis is the more attainable solution. Research shows that training a model on a high-resource language similar to the target language significantly improves accuracy. Armenian is rather unique in that it forms its own branch of the Indo-European family, so no languages closely resemble it, but lower-resource languages that do resemble higher-resource ones can benefit from this effect. In retrospect, although we initially set out to analyze sentiment around the Nagorno-Karabakh conflict, we have come to realize that our approach can be adapted to a much larger group of low-resource languages. We hope that our research provides a framework for Armenians, Azerbaijanis, or any researchers interested in Armenian sentiment from the Armenian perspective, whether regarding the Nagorno-Karabakh conflict or any other text they wish to study.