Does Your AI Speak Your Language?

How aim2balance.ai Selects for Multilingual Performance

aim2balance.ai Research Team

March 2026

Language Support Is More Than a Checkbox

Most AI platforms list dozens of supported languages. What that claim rarely comes with is evidence. A model can be exposed to French text during training without being able to follow a French instruction reliably, handle a morphologically complex Polish question correctly, or reason through a Brazilian Portuguese legal problem coherently.

At aim2balance.ai, our users come from all over the globe and bring a linguistically diverse set of needs. They write in French, German, Czech, Italian, Arabic, Chinese, and more. To deliver consistent quality across those languages, we need to go beyond English-language benchmark scores. In this post, we walk through how we built a multilingual evaluation layer on top of the same pipeline that powers our five capability routes, what the data shows, and which models earned a place in the language-aware routing layer as a result.

Gathering Language Leaderboards

The same pipeline used to rank models on capability routes handles multilingual evaluation with one structural difference: instead of aggregating metrics by capability tag, it aggregates by language tag.

Key insight: We never limit ourselves to exact model matches. Each leaderboard contains variants of the same family, and no single model appears everywhere. Our system dynamically aggregates evidence from the closest available relatives to build the most complete capability profile possible, using family-level performance as an honest proxy when exact matches are sparse.
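To illustrate the family-level fallback, here is a minimal sketch of how evidence might be gathered from a model's closest relatives on one leaderboard, with string similarity standing in for name-match confidence. The function names and the `difflib.SequenceMatcher` heuristic are our illustrative choices for this post, not the production matcher:

```python
from difflib import SequenceMatcher

def name_match_confidence(query: str, candidate: str) -> float:
    """Crude name-similarity score in [0, 1]: 1.0 for an exact match,
    lower for family relatives such as a different parameter count."""
    if query == candidate:
        return 1.0
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

def family_evidence(query: str, leaderboard: dict, min_conf: float = 0.6) -> list:
    """Collect (score, confidence) pairs from the closest available
    relatives on one leaderboard; unrelated models are filtered out."""
    evidence = []
    for candidate, score in leaderboard.items():
        conf = name_match_confidence(query, candidate)
        if conf >= min_conf:
            evidence.append((score, conf))
    return evidence

# A 70B query with no exact match still inherits evidence from its 8B
# sibling, while an unrelated model family falls below the threshold.
rows = family_evidence(
    "Llama-3.1-70B-Instruct",
    {"Llama-3.1-8B-Instruct": 0.71, "Qwen2.5-72B": 0.90},
)
```

An exact match keeps full weight, a sibling enters with reduced weight, and unrelated names drop out entirely, which is what lets family-level performance act as an honest proxy rather than noise.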

As with the performance leaderboards, scores are normalised, confidence-weighted by name-match quality, and averaged per language per model to produce a clean per-language performance profile.
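A minimal sketch of that aggregation step, assuming scores arrive as (model, language, score, confidence) rows after min-max normalisation within each leaderboard. The helper names are illustrative, not our internal API:

```python
from collections import defaultdict

def normalise(scores: dict) -> dict:
    """Min-max normalise one leaderboard's raw scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {model: 1.0 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}

def language_profile(rows) -> dict:
    """rows: iterable of (model, language, score, confidence) tuples pooled
    from every leaderboard. Returns {model: {language: weighted mean}},
    where each score counts in proportion to its name-match confidence."""
    num, den = defaultdict(float), defaultdict(float)
    for model, lang, score, conf in rows:
        num[(model, lang)] += conf * score
        den[(model, lang)] += conf
    profile = defaultdict(dict)
    for (model, lang), total in num.items():
        profile[model][lang] = total / den[(model, lang)]
    return dict(profile)
```

A full-confidence French score of 0.8 combined with a half-confidence score of 0.6 yields (0.8·1.0 + 0.6·0.5)/1.5 ≈ 0.733, so weaker family-proxy evidence nudges the profile rather than dominating it.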

We draw on nine dedicated multilingual leaderboards covering 21+ languages: French, Italian, Portuguese, Spanish, Catalan, Basque, Galician, Polish, Czech, Chinese, to name a few. Together they provide something general capability benchmarks cannot: direct measurement of how well a model performs when the language changes. The same task asked in French, German, and Polish should produce comparable quality if the model genuinely supports those languages.
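One way to make "comparable quality across languages" concrete is a spread statistic over a model's per-language scores. A hypothetical sketch using the coefficient of variation, purely as an illustration rather than a metric we publish:

```python
from statistics import mean, pstdev

def consistency(per_language: dict) -> float:
    """Coefficient of variation of a model's per-language scores
    (assumes scores > 0). Lower = more even multilingual support."""
    vals = list(per_language.values())
    return pstdev(vals) / mean(vals)
```

A model scoring 0.8 in French, German, and Polish alike gets 0.0, while one scoring 0.9 in French but 0.3 in Polish is flagged as uneven even though its mean looks respectable.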

The Three Pillars of Multilingual Performance

Figure 1: Left: Overall multilingual ELO leaderboard. Right: Radar profile across individual language axes for the top-ranked models.

The radar on the right of Figure 1 shows which model leads in each language, revealing three distinct multilingual profiles.

Llama-3.1-70B-Instruct traces the broadest and most balanced polygon of the group. Its performance is uniformly strong across major European languages such as French, German, Spanish, and Italian, while also extending reliably into Arabic and high-resource Global South languages. This even coverage reflects Meta’s sustained investment in multilingual alignment within the Llama 3.1 generation and confirms its status as the most dependable general-purpose backbone in our catalog.

Qwen2.5-VL-72B-Instruct displays a narrower European profile but demonstrates exceptional strength in Chinese, Arabic, and several non-Western languages. This pattern directly results from Alibaba’s training corpus, built around one of the richest Chinese-language datasets available. More broadly, the Qwen family achieves strong multilingual grounding across Asian and Global South languages—a depth that also underpins its success in multimodal and agentic tasks. Its inclusion in our routing system ensures that tool-driven and multilingual queries are served fluently and contextually.

Gemma-3-27B-it competes effectively in the European sector despite its smaller size, maintaining robust performance in French, German, Spanish, and Italian. Its lighter architecture makes it particularly valuable for high-volume or resource-efficient deployments, offering reliable linguistic performance without requiring a 70B-scale model.

Key insight: All five SmartRouter routes are backed by models with good to excellent multilingual support. This is not a coincidence: the same training investment that makes a model capable across tasks tends to make it capable across languages. By selecting for performance breadth first, we ended up with a mix of models that holds up linguistically across the full range of languages our European users bring to it.

These language profiles directly inform model routing within aim2balance.ai. Llama models anchor general and multilingual queries through their consistent coverage across major language groups. The Qwen family, known for its strength in Chinese, Arabic, and Global South languages, drives agentic, creative, and technical routes, while Gemma models complement these strengths by providing a compact yet capable option for scientific tasks. Together, this multilingual balance ensures that aim2balance.ai dynamically selects the best model for each query while maintaining high-quality output across diverse linguistic contexts.
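To make the language-aware routing idea concrete, here is a deliberately simplified sketch: pick the candidate with the strongest measured profile for the detected query language, falling back to a neutral prior when no evidence exists. The numbers below are invented for illustration, and the real router also weighs capability route, cost, and latency alongside language fit:

```python
# Invented per-language scores for three catalog models (illustrative only).
PROFILES = {
    "llama-3.1-70b-instruct": {"fr": 0.92, "de": 0.91, "zh": 0.78, "ar": 0.85},
    "qwen2.5-vl-72b-instruct": {"fr": 0.84, "de": 0.82, "zh": 0.96, "ar": 0.90},
    "gemma-3-27b-it": {"fr": 0.88, "de": 0.87, "zh": 0.70, "ar": 0.72},
}

def route(detected_lang: str, candidates: dict = PROFILES,
          default: float = 0.5) -> str:
    """Return the candidate with the best score for the detected language;
    a model with no evidence for that language gets a neutral prior."""
    return max(candidates, key=lambda m: candidates[m].get(detected_lang, default))
```

Under these toy numbers a Chinese query lands on the Qwen model and a French one on Llama, mirroring the complementary radar profiles in Figure 1.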