<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "https://jats.nlm.nih.gov/publishing/1.3/JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xml:lang="en">
  <front xmlns:xlink="http://www.w3.org/1999/xlink">
    <journal-meta>
      <journal-id journal-id-type="elibrary">75447</journal-id>
      <journal-title-group>
        <journal-title>Technology and Language</journal-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Технологии в инфосфере</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2712-9934</issn>
    </journal-meta>
    <article-meta xmlns:xlink="http://www.w3.org/1999/xlink">
      <article-id pub-id-type="publisher-id">6</article-id>
      <article-id pub-id-type="doi">10.48417/technolang.2026.01.06</article-id>
      <title-group>
        <article-title>Technological Development Models in the Context of Speech Corpora Imbalance</article-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Модели технологического развития в контексте дисбаланса речевых корпусов</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0009-0002-3281-057X</contrib-id>
          <name>
            <surname>Bairamova</surname>
            <given-names>Khumai</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-9917-6609</contrib-id>
          <name>
            <surname>Gavrilov</surname>
            <given-names>Anton</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0001-6493-3801</contrib-id>
          <name>
            <surname>Kharitonova</surname>
            <given-names>Anastassia</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-3224-3934</contrib-id>
          <name>
            <surname>Nikolaev</surname>
            <given-names>Vladimir</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
      </contrib-group>
      <aff id="aff1">ITMO University</aff>
      <pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-03-31">
        <day>31</day>
        <month>03</month>
        <year>2026</year>
      </pub-date>
      <volume>7</volume>
      <issue>1</issue>
      <issue-id pub-id-type="publisher-id">22</issue-id>
      <fpage>80</fpage>
      <lpage>102</lpage>
      <self-uri xmlns:xlink="http://www.w3.org/1999/xlink" content-type="pdf" xlink:href="https://soctech.spbstu.ru/userfiles/files/articles/2026/1/80-102.pdf"/>
      <abstract xml:lang="en">
        <p>The development of speech and language technologies in the era of artificial intelligence depends critically on the availability of large-scale, high-quality linguistic data. While low-resource languages have been widely studied, less attention has been paid to data imbalances among languages that are considered digitally well supported. This paper examines the uneven distribution of open speech corpora across languages that already have established speech technology infrastructure and available datasets, and shows that this disparity creates structural bottlenecks for sovereign AI development. We conduct a comparative analysis of open and non-commercial speech datasets, accounting for demographic factors, licensing conditions, and models of technological development. To quantify resource inequality, we propose the Digital Resource Saturation Index (DRSI), which relates the availability of speech data to the potential for content generation and consumption within language communities. Our findings reveal a strong dominance of English in open speech resources, while many non-Western languages, including Russian, remain systematically underrepresented. Interpreting these results through the lens of Western and non-Western models of technological modernization, we suggest that language inequality in AI is not merely a technical or demographic issue but a self-reinforcing, structurally reproduced outcome of data governance, institutional coordination, and political choices regarding openness and digital sovereignty. The study also provides practical recommendations for mitigating these imbalances and fostering a more equitable technological landscape.</p>
      </abstract>
      <kwd-group xml:lang="en">
        <kwd>Digital language divide</kwd>
        <kwd>Speech corpora imbalance</kwd>
        <kwd>Language inequality</kwd>
        <kwd>Technological development models</kwd>
        <kwd>Resource disparity analysis</kwd>
        <kwd>Digital resource saturation index</kwd>
        <kwd>DRSI</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
