Technological Development Models in the Context of Speech Corpora Imbalance
The development of speech and language technologies in the era of artificial intelligence depends critically on the availability of large-scale, high-quality linguistic data. Low-resource languages have been widely studied, but less attention has been paid to data imbalances among languages that are considered digitally well supported. This paper examines the uneven distribution of open speech corpora across languages that already have an established speech technology infrastructure and available datasets, and shows that this disparity creates structural bottlenecks for sovereign AI development. We conduct a comparative analysis of open and non-commercial speech datasets, accounting for demographic factors, licensing conditions, and models of technological development. To quantify resource inequality, we propose the Digital Resource Saturation Index (DRSI), which relates the availability of speech data to the potential for content generation and consumption within language communities. Our findings reveal a strong dominance of English in open speech resources, while many non-Western languages, including Russian, remain systematically underrepresented. Interpreting these results through the lens of Western and non-Western models of technological modernization, we suggest that language inequality in AI is not merely a technical or demographic issue but a self-reinforcing, structurally reproduced outcome of data governance, institutional coordination, and political choices regarding openness and digital sovereignty. The study also provides practical recommendations for mitigating these imbalances and fostering a more equitable technological landscape.