I find that Meta’s translations are very poor compared to others, at least for relatively obscure languages, which I figured was relevant considering the article.
Google Translate is a good default, but LLMs are really good at translations, as they’re better at understanding context and providing culturally appropriate translations.
So, LLMs are noticeably better in Khmer than Google Translate? I wonder why Google Translate doesn't use Gemini under-the-hood. Perhaps it's more prone to hallucinations.
I'm interested in finding some thorough testing of translations on different LLMs vs Translation APIs.
I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are really lacking quality data sources if you know where to look, and (2) the sources they have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a specific language, whereas many language-learning sources have language pairs, etc.).
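To make point (2) concrete, here is a minimal sketch of how a page-level label can hide a language pair. It is not how finepdfs works; it is a toy stand-in that guesses the writing system from Unicode character names, and the helper names (`script_of`, `dominant_script`, `label_page`, `label_lines`) are made up for illustration. Script is only a rough proxy for language, so treat this as a demonstration of granularity, not as real language identification.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script guess: first word of the Unicode name (e.g. 'KHMER', 'LATIN')."""
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no Unicode name; ignore it.
        return None


def dominant_script(text):
    """Most frequent script among the alphabetic characters, or None if there are none."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def label_page(page):
    """Page-level labelling: one label for the whole page."""
    return dominant_script(page)


def label_lines(page):
    """Line-level labelling: one label per non-empty line."""
    return [dominant_script(line) for line in page.splitlines() if line.strip()]


# A tiny phrasebook-style page: an English line followed by a Khmer line.
page = "hello world good morning\nសួស្តី"

print(label_page(page))   # the single page label only reflects the majority script
print(label_lines(page))  # per-line labels keep both sides of the pair visible
```

With a page-level label the minority side of the pair disappears, which is exactly the content a low-resource pipeline wants to keep.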
Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].
On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit in Fineweb 2 for my native language, which might inspire you [3].
Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.
It’s a small sample and not specifically ones we’re working on. It’s biased towards alternative scripts for visual interest.
Australian languages are definitely interesting! And I will say, from what I’ve seen, the Australian government (and other orgs) has done better than most to help document them (in recent years, at least).
IIRC Australia has around 250 surviving indigenous languages, not hundreds -- though "hundreds" was historically accurate before colonization drove so many to extinction.
Yes, there are government datasets, language "academies" (or "regulators") - organizations focused on preserving / teaching the language, and often smaller, local publishers that publish material in their local language.
I'm living in Guatemala, so have been focusing on the Mayan languages here (22 languages, millions of speakers).
It is not open weight as of today (unfortunately, for reasons outside the control of us, the authors, we weren’t able to release the weights). All we could release is part of the evaluation data.
I hope this will change in a while.
I haven't seen anywhere claiming they are open weight (although their last similar model, NLLB was).
They say their leaderboard and evaluation datasets are freely available. Closest statement I've seen in the paper, "Our translation models are built on top of freely available models."
We ran into the same thing on our internal docs pipeline — the meta tag og:description strips newlines too. Turns out the source is just plain textarea content. Nothing to fix, it's intentional.
Research orgs at these big companies are completely separate from product. Different teams, different incentives, different timelines. The people doing 1600-language MT have zero say over what features land in WhatsApp.
Off topic, but since the AI craze, MS's documentation translation has had ridiculous errors, like translating the try/catch keywords to "versuchen" and "fangen" on German pages.
That's a high count, but still a bit away from "Omni". Usual count is between 4k and 8k depending on the source. But the first 1k might be the hardest, certainly.
1.6k languages is for how many we were able to find more or less reliable evaluation data (mostly thanks to Bible translators and all those who contributed to BOUQuET).
Out of the remaining several thousand languages, we expect the OMT models to support understanding (but not generation) for a significant proportion, due to cross-lingual generalisation between similar languages.
So it’s not truly “omni” in the sense of supporting every single language on Earth, but it’s our best effort to do so, and probably the most “omni” models existing today.
Is there interest in benchmarking the proprietary LLMs for translation? Curious as I often use Gemini 3 Flash, but I have no idea how good it is for my language family. I prefer open models (in fact the smaller the better for offline), but it'd be useful to know how well the Big Three do.
So, hyperchilio-lingual would be more accurate, and myriad-lingual would go even beyond all documented existing human languages. But I guess the marketing team is not that fond of precision in philological considerations.
I’m very wary of celebrating Meta’s language work when the company was credibly found to have contributed to the genocide against the Rohingya in Myanmar, and separately, to human rights abuses against Tigrayans during the conflict in northern Ethiopia. Be careful whose sins you’re laundering.
I had the same reaction to this post. Mainly because one of Meta's explanations for the lapse was that they didn't have moderators who understood the local language.
The Hilux comparison is actually pretty sharp. Dual-use tech with obvious civilian value that also enables atrocities. IBM and the Holocaust is the canonical version of this debate, and it never gets resolved cleanly.
Meta released No Language Left Behind (NLLB) [1], I think in 2022. I wonder why this is not "NLLB 2.0"? These companies love introducing new names to confuse things.
This project is absolutely NLLB 2.0 in spirit. However, we decided to reserve the name “OMT-NLLB” for the subset of the new models that have an encoder-decoder architecture similar to the original NLLB-200. The other models are called “OMT-LLaMA” and have a classical LLM architecture.
The idea here (and we had to emphasize it to justify the project internally) is that we are developing not just new models but a recipe for massive multilinguality that can be integrated into general-purpose LLMs.
Evaluation is the hard part here. FLORES+ only covers ~200 languages, so for the tail end of 1,600 there's basically no standard test set. You're largely validating against other models or synthetic references, which is a bit circular.
Google Translate is a good default, but LLMs are really good at translations, as they’re better at understanding context and providing culturally appropriate translations.
(I live in Cambodia where they speak Khmer)