75447

Technology and Language

Технологии в инфосфере

2712-9934 18+

10.48417/technolang.2026.01.06

Technological Development Models in the Context of Speech Corpora Imbalance

Модели технологического развития в контексте дисбаланса речевых корпусов

0009-0002-3281-057X

Bairamova

Khumai

0000-0002-9917-6609

Gavrilov

Anton

0000-0001-6493-3801

Kharitonova

Anastassia

0000-0002-3224-3934

Nikolaev

Vladimir

ITMO University

31 03 2026

7 1

80 102

The development of speech and language technologies in the era of artificial intelligence depends critically on the availability of large-scale, high-quality linguistic data. While low-resource languages have been widely studied, less attention has been paid to data imbalances among languages that are considered digitally well-supported. This paper examines the uneven distribution of open speech corpora across languages that have an established speech technology infrastructure and available datasets, and shows that this disparity creates structural bottlenecks for sovereign AI development. We conduct a comparative analysis of open and non-commercial speech datasets, accounting for demographic factors, licensing conditions, and models of technological development. To quantify resource inequality, we propose the Digital Resource Saturation Index (DRSI), which relates the availability of speech data to the potential for content generation and consumption within language communities. Our findings reveal a strong dominance of English in open speech resources, while many non-Western languages, including Russian, remain systematically underrepresented. Interpreting these results through the lens of Western and non-Western models of technological modernization, we suggest that language inequality in AI is not merely a technical or demographic issue but a self-reinforcing, structurally reproduced outcome of data governance, institutional coordination, and political choices regarding openness and digital sovereignty. The study also provides practical recommendations for mitigating these imbalances and fostering a more equitable technological landscape.

Digital language divide; Speech corpora imbalance; Language inequality; Technological development models; Resource disparity analysis; Digital Resource Saturation Index (DRSI)