French AI research company Pleias and the GSMA have released CommonLingua, a language identification (LID) model that covers 334 languages, including 61 African languages, and is designed to address a foundational gap in AI systems that has caused African-language text to be routinely misidentified.
The Problem It Solves
Before any AI model can be built for a language, it first needs to correctly identify what language it is looking at. That step, language identification, has been quietly failing African languages for years.
Leading LID tools such as fastText, GlotLID and OpenLID were built primarily around European and Asian languages, meaning African-language text is frequently mislabelled as English or French. Even state-of-the-art AI models lose roughly 30 percentage points in accuracy on African languages compared to major world languages.
Africa is home to more than 2,000 living languages, many of which remain underrepresented in AI training data. One reason is that before language models for Swahili, Yoruba or Wolof can be built, the underlying text must first be correctly identified. CommonLingua is designed to make that identification step reliable.
What the Model Does
CommonLingua covers 61 African languages across seven groupings: Bantu with 21 languages, Niger-Congo and West African with 18, Afro-Asiatic and Semitic with 7, Cushitic and Chadic with 4, Berber with 3, Nilo-Saharan with 3, and pidgins, creoles and other languages with 5.
The two-million-parameter model achieves 83% accuracy in identifying African languages, a significant improvement over existing systems. Notably, it operates directly on UTF-8 byte sequences rather than relying on language-specific tokenisers, enabling consistent handling across scripts including Latin, Arabic, Ethiopic, N’Ko, and Tifinagh. That technical design choice matters: it means the model does not need to be retrained each time a new script is introduced.
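The byte-level design can be illustrated with a short sketch: once encoded as UTF-8, text in any script reduces to the same alphabet of 256 integer values, which is the kind of input a byte-level classifier consumes. The sample strings and labels below are illustrative only and do not reflect CommonLingua's actual interface.

```python
# Illustrative sketch: UTF-8 bytes as a script-agnostic input alphabet.
# Sample texts are examples; CommonLingua's real API is not shown here.

samples = {
    "Swahili (Latin script)": "Habari za asubuhi",
    "Amharic (Ethiopic script)": "ሰላም",
    "Arabic script": "سلام",
}

for label, text in samples.items():
    byte_seq = list(text.encode("utf-8"))
    # Every script, Latin or not, maps into the same 0-255 byte vocabulary,
    # so a byte-level model needs no per-language tokeniser.
    assert all(0 <= b <= 255 for b in byte_seq)
    print(f"{label}: {len(text)} chars -> {len(byte_seq)} bytes")
```

Non-Latin scripts encode to several bytes per character (each Ethiopic character takes three), but the vocabulary stays fixed at 256 symbols, which is why supporting a new script does not require building or retraining a tokeniser.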
The model is trained exclusively on open-licensed and public domain content aggregated through the Common Corpus project, including Wikipedia, scientific publications from OpenAlex, VOA Africa, WaxalNLP, and cultural heritage sources.
Part of a Larger Initiative
CommonLingua is the first joint release under the GSMA’s “AI Language Models in Africa, by Africa, for Africa” initiative, a coalition whose mandate is to move African-language AI from fragmented individual projects to shared, reusable infrastructure.
GSMA Director of AI Initiatives Louis Powell framed it as a foundational intervention: progress has long been held back by the lack of infrastructure, beginning with something as basic as language identification, and CommonLingua addresses this gap to enable the development of richer datasets and more representative AI systems at scale.
Pleias co-founder and CTO Pierre-Carl Langlais was direct about the stakes: African languages are the working languages of hundreds of millions of people, and CommonLingua is deliberately the first brick being laid, because you cannot curate what you cannot identify.
Why It Matters for Africa’s AI Future
The release comes as investment in African AI infrastructure accelerates, with governments and private players across the continent pushing to build locally relevant digital tools. But without reliable language identification, every downstream application is built on a flawed foundation.
The GSMA plans to continue the conversation at MWC26 Kigali, where partners will convene to accelerate progress on African-language AI. CommonLingua, small as it is at two million parameters, may end up being one of the more consequential releases in that effort.