Multilingual Isn’t Cross-Lingual: Inside My Benchmark of 11 LLMs on Mid- & Low-Resource Languages | HackerNoon

Last updated: 2025/11/20 at 10:11 PM

News Room Published 20 November 2025

I built an evaluation pipeline for multilingual and cross-lingual LLM performance on 11 mid/low-resource languages (e.g., Basque, Kazakh, Amharic, Hausa, Sundanese). I combined native-language datasets (KazMMLU, BertaQA, BLEnD), zero-shot chain-of-thought prompts, and a new metric, LASS (Language-Aware Semantic Score), that rewards both semantic correctness and answering in the requested language. Findings: (1) scale helps, but with diminishing returns; (2) reasoning-optimized models often beat larger non-reasoning models; (3) the best open-weight model is ~7% behind the best closed model; (4) “multilingual” models underperform on culturally specific cross-lingual tasks once evaluation moves beyond translated English content. Code & data: see GitHub link in Reproducibility.
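To make the idea of a language-aware score concrete, here is a minimal Python sketch. It is not the actual LASS formula, which this summary does not spell out. It assumes a semantic similarity in [0, 1] has already been computed (for example, from sentence embeddings) and that some language identifier has labelled the answer. The names `ScoredAnswer`, `language_aware_score`, and `lang_weight`, as well as the linear weighting itself, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    """One model answer with precomputed signals (all fields are illustrative)."""
    semantic_similarity: float  # assumed to be in [0, 1], e.g. from embeddings
    detected_lang: str          # language code reported by any language identifier
    requested_lang: str         # language the prompt asked the model to answer in


def language_aware_score(answer: ScoredAnswer, lang_weight: float = 0.2) -> float:
    """Blend semantic correctness with a bonus for answering in the right language.

    Assumption: a simple weighted average. The real LASS definition may differ.
    """
    if not 0.0 <= answer.semantic_similarity <= 1.0:
        raise ValueError("semantic_similarity must be in [0, 1]")
    if not 0.0 <= lang_weight <= 1.0:
        raise ValueError("lang_weight must be in [0, 1]")

    # 1.0 if the answer is in the requested language, otherwise 0.0.
    lang_match = 1.0 if answer.detected_lang == answer.requested_lang else 0.0

    return (1.0 - lang_weight) * answer.semantic_similarity + lang_weight * lang_match


if __name__ == "__main__":
    in_basque = ScoredAnswer(0.9, detected_lang="eu", requested_lang="eu")
    in_english = ScoredAnswer(0.9, detected_lang="en", requested_lang="eu")
    # Same semantic quality, but the English answer loses the language bonus.
    print(round(language_aware_score(in_basque), 2))
    print(round(language_aware_score(in_english), 2))
```

Under these assumptions, two answers with identical semantic quality score differently when only one of them is in the requested language, which is the behaviour the metric's description calls for. How strongly to penalise the wrong language is a design choice controlled here by the hypothetical `lang_weight`.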