Abstract
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet they remain fundamentally constrained by the severe imbalance in global language resources. While more than 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has a digital presence large enough to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and it uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve an approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU points, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and we make our codebase and models publicly available at https://github.com/microsoft/byol.