Four centuries ago, Yao women in the southern Chinese province of Hunan created a script called Nüshu—literally meaning “women’s writing” in Chinese—that women used for centuries to communicate with one another in secret.
After women gained greater access to formal education in the 1900s, the use of the script declined and many Nüshu texts were lost or destroyed over time. Since the turn of this century, however, there has been a sustained effort in China to save the script from becoming extinct.
Now, computer science graduate student Ivory Yang, Guarini, who remembers learning a few words of Nüshu from her grandmother as a child, is exploring how artificial intelligence models offer new ways to preserve and help revitalize the rare script.
Yang and her collaborators, Weicheng Ma, Guarini ’24, and Assistant Professor of Computer Science Soroush Vosoughi, built an AI-driven framework called NüshuRescue that can potentially be adapted to other “low-resource” languages, which have fewer written or translated materials available for training AI systems.
The tool used minimal data—just 35 pairs of matching Chinese and Nüshu sentences—to teach a large language model that had no prior knowledge of the script to translate from Chinese into Nüshu, expanding the database of text available in the script.
The researchers began with A Compendium of Chinese Nüshu, the most comprehensive, expert-validated collection of scanned Nüshu scripts and corresponding Chinese translations. They worked with expert annotators trained in computational linguistics to create a dataset of 500 digitized Chinese-Nüshu sentence pairs, which included newly mapped words in the two languages.
The researchers used samples from the manually translated dataset to train the GPT-4 Turbo large language model. They found that with just 35 samples, the model began to get a grasp of the script and was able to translate test phrases from Chinese to Nüshu that were not part of the training data. Their work in creating an expert-validated Nüshu-Chinese digital dataset is the first of its kind.
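The article describes teaching the model from a handful of examples; one plausible way to do that is in-context (few-shot) prompting. The sketch below, assuming the OpenAI Python client and placeholder sentence pairs, shows how such a Chinese-to-Nüshu prompt might be assembled; the actual NüshuRescue pipeline may be structured differently.

```python
# A minimal sketch of few-shot translation prompting, assuming the OpenAI
# chat API; the real NüshuRescue system may use a different setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical parallel examples: (Chinese sentence, Nüshu transcription).
# In practice these would be the expert-validated pairs from the dataset.
few_shot_pairs = [
    ("示例中文句子一", "对应的女书转写一"),  # placeholder pair 1
    ("示例中文句子二", "对应的女书转写二"),  # placeholder pair 2
]

def translate_to_nushu(chinese_sentence: str) -> str:
    """Ask the model to translate Chinese into Nüshu, conditioned on examples."""
    messages = [{"role": "system",
                 "content": "You translate Standard Chinese sentences into Nüshu script."}]
    for zh, ns in few_shot_pairs:
        messages.append({"role": "user", "content": zh})
        messages.append({"role": "assistant", "content": ns})
    messages.append({"role": "user", "content": chinese_sentence})

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content

print(translate_to_nushu("今天天气很好"))  # a held-out test sentence (placeholder)
```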
Among other things, Yang is keen to extend her model to different media. “There are handkerchiefs and folding fans that have Nüshu writings on them,” she says. “So the next step would be to build multimodal models that can use computer vision to capture these images and train a model to recognize and translate the characters for us.”
Their work, published recently in the proceedings of the 31st International Conference on Computational Linguistics, also demonstrates how the framework, which minimizes reliance on extensive human annotations, can be applied to other low-resource languages such as Cherokee.
“Our work demonstrates that generative AI and large language models significantly lower barriers to revitalizing endangered languages, rapidly producing valuable linguistic resources even from minimal data,” says Vosoughi. But, he says, despite their transformative potential, these models inherently carry the risk of introducing biases from dominant cultures, potentially distorting or oversimplifying nuanced cultural identities.
“Active participation from native speakers and linguists is essential to ensure linguistic authenticity and cultural fidelity. AI and community expertise are both fundamental for meaningful preservation efforts,” says Vosoughi.
Evaluating existing technologies
Besides creating new tools, researchers are also examining whether existing language technologies, which are built for and centered on mainstream languages, support endangered languages.
A notable case is Google Translate’s LangID, which does not support most Native American languages, including Navajo, one of the most widely spoken Indigenous languages in North America, says Yang. This means that these languages cannot even be detected online.
In a recent paper, which she will present at the conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, Yang and her collaborators found that LangID misidentified Navajo sentences as other, unrelated languages.
To address this, the researchers built a simple yet highly accurate language-identification model for Navajo and related Athabaskan languages that can reliably distinguish them from the languages erroneously suggested by LangID. Their work highlights the need for machine-learning technologies that better support underrepresented languages and cultural diversity.
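The article does not spell out the model’s internals, but a common recipe for a lightweight language identifier is a character n-gram classifier. The sketch below, using scikit-learn with placeholder training sentences and labels, illustrates that general approach rather than the researchers’ exact system.

```python
# A hedged sketch of a character n-gram language-identification model;
# training sentences and labels are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Yá'át'ééh abíní",               # Navajo (illustrative)
    "Shí éí Naabeehó nishłį́",        # Navajo (illustrative)
    "This is an English sentence.",   # non-Navajo (illustrative)
    "Une phrase en français.",        # non-Navajo (illustrative)
]
train_labels = ["navajo", "navajo", "other", "other"]

# Character n-grams capture diacritics and glottal-stop apostrophes,
# which help separate Navajo text from unrelated languages.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["T'áá hwó' ají t'éego"]))  # illustrative Navajo test sentence
```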
Tech tools to aid language preservation
Computer tools and AI models are valuable aids for documenting and reviving endangered languages, both for members of the communities that speak them and for researchers studying them, says Rolando Coto Solano, an assistant professor in the Department of Linguistics whose work focuses on creating computer models that can understand Indigenous languages.
“A lot of the work that we linguists do requires a lot of expertise and attention, but is also very repetitive and tedious, something that would be good for a computer to take care of,” says Coto Solano, who is also an adjunct assistant professor of computer science.

It was a meeting with Sally Nicholas, a linguist at the University of Auckland who works in the South Pacific nation of the Cook Islands, that motivated Coto Solano to blend his skills in linguistics and computer science and create new technologies.
“For her dissertation, Sally said that she had recorded dozens, maybe hundreds, of hours of recordings, and joked that she was going to die before she finished transcribing everything,” he says.
Working with Nicholas and other collaborators, Coto Solano built automatic speech-recognition models for Cook Islands Māori that use machine learning to identify speech patterns in audio recordings and transcribe them into text.
“Transcription is a very specialized and difficult task, especially in a language that very few people write,” says Coto Solano, who has also made speech-recognition models for the Costa Rican languages Bribri and Cabécar.
By accelerating transcription, speech recognition makes it possible to transcribe and document stories and cultures of communities that have a dwindling number of native speakers.
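The article does not name the underlying toolkit, but a typical way to run such a model today is through the Hugging Face transformers speech-recognition pipeline. The sketch below assumes a hypothetical fine-tuned Cook Islands Māori checkpoint and a placeholder audio file.

```python
# A minimal sketch of automated transcription with a fine-tuned
# speech-recognition model via the Hugging Face `transformers` pipeline;
# the model path and filename below are placeholders, not the researchers' actual assets.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="path/to/cook-islands-maori-asr",  # hypothetical fine-tuned checkpoint
    chunk_length_s=30,  # split long field recordings into 30-second chunks
)

# Transcribe one audio file from a fieldwork session.
result = asr("interview_rarotonga.wav")  # placeholder filename
print(result["text"])
```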
Coto Solano also uses techniques from natural language processing, a field of artificial intelligence that enables computers to understand language by analyzing text and speech data, to develop text-to-speech and machine translation tools.
These can open the doors to future applications that can potentially engage young people and motivate them to learn and use the language. “We can create learning tools that can have a voice or make it possible for children in the diaspora, for example, to have access to native language content through machine translation,” says Coto Solano.
Collaborating with communities and ultimately empowering them to drive language initiatives will pave the way for developing the most useful and impactful applications, says Coto Solano. Among other efforts in this direction, “We conducted a workshop last July to train Cook Islanders in linguistics and a little bit in natural language processing,” he says.
Coto Solano and his collaborators are also working with the Digital Applied Learning and Innovation Lab to create an easy-to-use interface for the speech-recognition tool, so that the community at large, and not just researchers, can use it to transcribe and document video and audio content.
The Department of Linguistics also conducts a foreign study program in Auckland, New Zealand, and Rarotonga in the Cook Islands, during which students take classes on Māori language and culture before conducting linguistics field research.
“In the Americas, and around the world, there are a lot of languages that are in danger of going dormant—they will no longer be spoken by anyone alive in the near future,” says Coto Solano. “Whichever tools we can use to help turn this tide are urgent, are necessary.”