Syllabus: Multilingual Technologies and Language Diversity

Smaranda Muresan, Isabelle Zaugg


Instructors:  Prof. Smaranda Muresan and Dr. Isabelle Zaugg

Teaching assistant: Jonathan Reeve

Course Meeting Time: Fridays 10:10 AM – 12:40 PM

Office Hours: Smaranda Muresan, by appointment on Monday, 4-5pm

Isabelle Zaugg, by appointment on Monday and Thursday, 10-11am

Short Description

Innovations in digital technologies have shown their potential to be at times breathtakingly beneficial, and at others divisive or troubling. With regard to digital technologies’ impact on the ecosystem of language diversity, evidence suggests that new technologies are one contributor to the decline and predicted extinction of 50-90% of the world’s languages this century. Yet digital innovations supporting a growing number of languages also have the potential to bolster language diversity in ways unimaginable a few years ago. Will innovations in multilingual natural language processing bring about a renaissance of language diversity, as users no longer need to rely on English and other dominant languages? To address this question, this course will introduce a dual view on language diversity: 1) a typology of language vitality and endangerment and 2) a resource-centric typology (low-resource vs. high-resource) regarding the availability of data resources to develop computational models for language analysis. This course will address the challenge of scaling natural language processing technologies developed mostly for English to the rich diversity of human languages. The resource-centric typology will also contribute to the dialogue of what is “Data Science.” Much research has been dedicated to the “Big Data” scenario; however, “Small Data” poses equally challenging problems, which this course will highlight. This course brings data and computational literacy about multilingual technologies to humanities students, while also exposing computer science and data science students to ethical, cultural, business, and policy issues within the context of multilingual technologies.

This 4000-level course is cross-listed in the Department of Computer Science and the Institute for Comparative Literature and Society, and is open to both upper-level undergraduate and graduate students. The class will also provide relevant readings, along with open-source, state-of-the-art NLP tools and datasets for students to use. The final project will team up students from CS and ICLS.

Academic Integrity

Columbia’s intellectual community relies on academic integrity and responsibility as the cornerstone of its work. Graduate students are expected to exhibit the highest level of personal and academic honesty as they engage in scholarly discourse and research. In practical terms, you must be responsible for the full and accurate attribution of the ideas of others in all of your research papers and projects; you must be honest when taking your examinations; you must always submit your own work and not that of another student, scholar, or internet source. Graduate students are responsible for knowing and correctly utilizing referencing and bibliographical guidelines. When in doubt, consult your professor. Citation and plagiarism-prevention resources can be found at the GSAS page on Academic Integrity and Responsible Conduct of Research.

Failure to observe these rules of conduct will have serious academic consequences, up to and including dismissal from the university. If a faculty member suspects a breach of academic honesty, appropriate investigative and disciplinary action will be taken following Dean’s Discipline procedures.

Disabilities Accommodations

If you have been certified by Disability Services (DS) to receive accommodations, please either bring your accommodation letter from DS to your professor’s office hours to confirm your accommodation needs, or ask your liaison in GSAS to consult with your professor.  If you believe that you may have a disability that requires accommodation, please contact Disability Services at 212-854-2388.

Important: To request and receive an accommodation you must be certified by DS.

Percentage Breakdown of Grading:

 Class Sessions (reading list tentative; might be updated)

1.  Jan 15. ISABELLE & SMARA:  What is the relationship between language diversity and digital technologies?  Introduce syllabus.  Introduction to the historical trajectory of multilingual computing.  Introduction to rapidly diminishing language diversity.  Instructor and student introductions.

2.  Jan 22. ISABELLE (& SMARA):  Language Diversity: Vitality, Resource-centric, and Linguistic Typologies. High-level discussion of the impact on language communities if multilingual technologies are or are not developed (machine translation (MT), morphological analyzers, learning lexicons, part-of-speech (POS) taggers). Are there downsides to digital inclusion?  Introduce the Final Project.  Students share languages and NLP areas of interest to facilitate group-building.  Assign the Language Quick-fire exercise.

 

3.  Jan 29. SMARA:  A critical look at resource-centric language typologies & linguistic typologies with an eye on their use in developing NLP technologies. Students present the quick-fire exercise. Introduction to how resource-centric typology influences the approach to developing language technologies, and how to port resources and computational models across languages (based on similarities in morphology, syntax, etc.).  What type and size of data is needed (e.g. parallel data, comparable corpora, code-switched and mixed-language documents)?

- Learning and projecting morphological information
- Learning and projecting syntactic information
- Machine translation

4.  Feb 5.  ISABELLE:  Diminishing Language Diversity:  What do we lose when a language dies?  Discussion of readings.  Possible screening of the first portion of “Language Matters with Bob Holman,” focused on Indigenous languages and normalcy of multilingualism in Northern Australia, and/or possible guest speakers:  Daniel Kaufman or Ross Perlin, Directors of NYC Endangered Language Alliance.  Final Project proposal ideas (non-graded) due.

Readings:

Optional:

5.  Feb 12. ISABELLE:  Scripts.  Possible guest speaker Anshuman Pandey:  Lecture/discussion on script diversity in the digital sphere and Unicode’s role in promoting digital support for scripts, as well as his own experience identifying digitally-disadvantaged communities online and doing primary research and community connection work in South Asia and Southeast Asia to incorporate them into Unicode. Assign Essay on the first 6 weeks of the course: normative stance on “Digital Language Justice” (due next week, week 6).  Final Project proposal (one-pager) due.
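As an optional illustration for this session, the sketch below shows one rough way to see which scripts appear in a piece of text. It is a minimal sketch, not a course tool: the helper names `script_guess` and `script_profile` and the sample string are hypothetical, and taking the first word of each Unicode character name is only a heuristic, since Python's standard library does not expose the official Unicode Script property.

```python
import unicodedata
from collections import Counter


def script_guess(ch):
    """Return a rough script label: the first word of the Unicode character name.

    Assumption: this is a heuristic, not the official Unicode Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_profile(text):
    """Count rough script labels over the letters in `text`.

    Combining marks, digits, spaces, and punctuation are skipped
    because str.isalpha() is False for them.
    """
    return Counter(script_guess(ch) for ch in text if ch.isalpha())


# Hypothetical sample mixing Ethiopic, Latin, and Devanagari text.
sample = "ሰላም hello नमस्ते"
profile = script_profile(sample)
for script, count in profile.most_common():
    print(script, count)
```

A profile like this can hint at whether a tool or font handles a given script at all, which is a first practical question when discussing digital support for a language.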

Reading:

6.  Feb 19. SMARA: What’s a Word?  Relate to the morphological typology (agglutinative, polysynthetic). Talk about segmentation issues and how to do segmentation unsupervised and for different languages. Word classes: Part of Speech and cross-lingual approaches for POS tagging.  Guest Lecturer: Ramy Eskander.  Essay on first 6 weeks due (normative stance on “Digital Language Justice”).  Introduction of Final Project Lit Review expectations.

Readings:

7.  Feb 26. SMARA: Learning Word Meanings (Representations).  Talk about monolingual and multilingual dictionaries; monolingual word representations (word embeddings); contextualized word embeddings; multilingual word embeddings. How to evaluate?  Students provide updates on Final Projects and ask questions.
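As an optional illustration of the evaluation question, one common intrinsic check for multilingual embeddings is whether a word's nearest neighbor in the shared space is its translation. The sketch below assumes hand-made three-dimensional toy vectors (not real embeddings) and hypothetical helper names `cosine` and `nearest`; it shows only the idea of cosine-similarity nearest-neighbor lookup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`'s vector."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )


# Toy vectors invented for illustration only (assumption: a shared
# English-French space in which translations lie close together).
toy = {
    "dog": [0.9, 0.1, 0.0],
    "chien": [0.8, 0.2, 0.1],
    "car": [0.0, 0.2, 0.9],
}

print(nearest("dog", toy))
```

With real embeddings, the same lookup run over a bilingual dictionary gives a translation-retrieval accuracy, which is one way to compare embedding methods for low-resource languages.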

Readings:

Resources and Additional Readings:

8.  March 12. SMARA: Machine Translation (Technologies). History of Machine Translation. Brief intro to Statistical Machine Translation and Neural Machine Translation. Final Project Lit Review due.

Readings:

Optional:

9.  March 19. SMARA:  Machine Translation #2.  High Resource vs. Low Resource & Evaluation. Tentative Guest speaker:  Mona Diab: “Faithfulness in natural language generation in an era of heightened ethical AI awareness: opportunities for MT.”  Students provide Final Project updates and ask questions.

Readings:

Optional:

10.  March 26. ISABELLE:   Who are the stakeholders in creating a multilingual digital sphere?  Discussion of readings and role of international tech companies, local tech companies, governments, military/surveillance, users/volunteers, international governance institutions, language advocates, etc. The readings are assigned as a “jigsaw,” i.e. students are assigned to read one of the readings and make a 1-2 min. presentation on it.  Introduction of Stakeholder Investigation:  for the next class, students will examine one stakeholder in more depth, including mission statement, “true” interests (if they diverge from mission statement), and how it is working towards or against a multilingual digital sphere.

Readings (each student selects one to read and present - this list may be updated):

11.  April 2. ISABELLE:  What is the “clash of values” when it comes to “global language justice?”   Students each present 2-3 minutes on the digital stakeholder they chose last week, their mission statement or agenda, and how that intersects with support for language diversity.  If extra time remains, students are divided into teams and compete in a class debate about the ideal relationship between multilingual technologies and language diversity.  Check-in on final projects.

12.  April 9. ISABELLE & SMARA:  Final Project Presentations (last day of class):  Students present their final projects in a festive setting (organized as a poster session).  Final paper, including the computational project and an essay tying project together with themes in the course, is due a week after the end of class, April 16.