Online language gaps in the United States are a pressing problem for Asian Americans


Chen says that while the content moderation policies of Facebook, Twitter, and other platforms have been successful in filtering out some of the more obvious misinformation in English, the systems often miss that content when it is in other languages. Instead, that work fell to volunteers like her team, who sought out misinformation and were trained to defuse it and minimize its spread. “These mechanisms that pick up certain words and things don’t necessarily pick up that misinformation and disinformation when they’re in a different language,” she says.
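The failure mode Chen describes can be seen in a minimal sketch, assuming a purely hypothetical English-only keyword blocklist (the phrases and posts below are invented for illustration): an exact-match filter catches the English post but lets the same claim through in another language.

    # Illustrative only: a naive keyword filter of the kind Chen describes.
    # The blocklist and example posts are hypothetical, not any platform's real rules.
    BLOCKLIST = {"miracle cure", "stolen election"}

    def is_flagged(post: str) -> bool:
        """Flag a post if it contains any blocklisted English phrase."""
        text = post.lower()
        return any(phrase in text for phrase in BLOCKLIST)

    print(is_flagged("Try this miracle cure today!"))  # True: caught in English
    print(is_flagged("Prueba esta cura milagrosa!"))   # False: the same claim in Spanish slips through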

Google’s translation services and technologies, such as Translatotron and its real-time translation headphones, use artificial intelligence to convert between languages. But Xiong finds these tools inadequate for Hmong, a deeply complex language in which context is incredibly important. “I think we’ve become really complacent and dependent on advanced systems like Google,” she says. “They claim to be ‘language accessible,’ and then I read it and it says something totally different.”

(A Google spokesperson acknowledged that smaller languages “pose a more difficult translation task,” but said the company has “invested in research that particularly benefits low-resource language translations,” using machine learning and community feedback.)

Down to the code

The challenges of language online go beyond the United States – and descend, literally, to the underlying code. Yudhanjaya Wijeratne is a researcher and data scientist at the Sri Lankan think tank LIRNEasia. In 2018, he began tracking botnets whose social media activity encouraged violence against Muslims: in February and March of that year, a series of riots by Sinhalese Buddhists targeted Muslims and mosques in the towns of Ampara and Kandy. His team documented “the hunting logic” of the bots, cataloged hundreds of thousands of Sinhala social media posts, and took the findings to Twitter and Facebook. “They were saying all kinds of nice, well-meaning things – basically canned statements,” he says. (In a statement, Twitter said it uses human review and automated systems to “apply our rules impartially to everyone on the service, regardless of their background, ideology, or position on the political spectrum.”)

Contacted by MIT Technology Review, a Facebook spokesperson said the company had commissioned an independent human rights assessment of the platform’s role in the violence in Sri Lanka, which was published in May 2020, and had made changes in the wake of the attacks, including hiring dozens of Sinhala- and Tamil-speaking content moderators. “We deployed proactive hate speech detection technology in Sinhala to help us identify potentially violating content faster and more effectively,” they said.

“What I can do with three lines of code in Python in English literally took me two years looking at 28 million Sinhala words”

Yudhanjaya Wijeratne, LIRNEasia

As the bot behavior continued, Wijeratne grew skeptical of the platitudes. He decided to look at the code libraries and software tools the companies were using, and found that mechanisms for monitoring hate speech in most languages other than English had not yet been built.

“Much of the research, in fact, for many languages like ours just hasn’t been done yet,” says Wijeratne. “What I can do with three lines of code in Python in English literally took me two years of looking at 28 million Sinhala words to build the basic corpora, to build the basic tools, and then to bring things up to the level where I could potentially do that level of text analysis.”
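For a sense of scale on the English side, here is an illustrative sketch (not Wijeratne’s actual code) of the kind of three-line analysis he means, leaning on NLTK’s pretrained English resources; building equivalent corpora and tools from scratch is the two-year Sinhala effort he describes.

    # English text analysis with NLTK's pretrained VADER sentiment model.
    # This works out of the box only because extensive English corpora already exist.
    import nltk
    nltk.download("vader_lexicon")  # fetch the pretrained English lexicon
    from nltk.sentiment import SentimentIntensityAnalyzer

    scores = SentimentIntensityAnalyzer().polarity_scores("They do not belong here.")
    print(scores)  # negativity/positivity scores, courtesy of English-only resources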

After suicide bombers targeted churches in Colombo, the Sri Lankan capital, in April 2019, Wijeratne built a tool to analyze hate speech and disinformation in Sinhala and Tamil. The system, called Watchdog, is a free mobile application that aggregates news and attaches warnings to false stories. The warnings come from volunteers trained in fact-checking.

Wijeratne emphasizes that this work goes far beyond translation.

“Many of the algorithms that we take for granted, and that are often cited in research, especially in natural-language processing, work very well for English,” he says. “And yet many identical algorithms, even used on languages that are only a few degrees of separation apart – whether in the West Germanic or the Romance language tree – can give completely different results.”

Natural-language processing is the basis of automated content moderation systems. In 2019, Wijeratne published a paper examining the discrepancies in their accuracy across languages. He argues that the more computational resources a language has, such as data sets and web pages, the better the algorithms can perform. Languages from poorer countries and communities are at a disadvantage.
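Wijeratne’s resource argument can be illustrated with a rough, hypothetical experiment in scikit-learn: the same bag-of-words classifier, trained on progressively larger slices of an English corpus, improves as the data grows. That scaling is the advantage resource-rich languages enjoy (the corpus and categories here are arbitrary stand-ins chosen for illustration).

    # Rough illustration: identical model, more training data, better accuracy.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    cats = ["talk.politics.misc", "rec.sport.hockey"]  # arbitrary stand-in categories
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    for n in (50, 500, len(train.data)):  # simulate low-, mid-, and high-resource settings
        model = make_pipeline(TfidfVectorizer(), MultinomialNB())
        model.fit(train.data[:n], train.target[:n])
        print(f"{n:>5} training docs -> accuracy {model.score(test.data, test.target):.3f}")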

“If you build, say, the Empire State Building for English, you have the blueprints. You have the materials,” he says. “You have it all on hand, and all you need to do is put it together. For every other language, you don’t have the blueprints.

“You have no idea where the concrete is going to come from. You don’t have steel, and you don’t have workers, either. So you’ll be sitting there laying one brick at a time and hoping that your grandson or your granddaughter might finish the project.”

Deep issues

The movement to provide those blueprints is known as language justice, and it is nothing new. The American Bar Association describes language justice as a “framework” that preserves people’s right “to communicate, understand, and be understood in the language in which they prefer and feel most articulate and powerful.”
