Linguistics for a Brazilian Artificial Intelligence (AI)

By Raquel Freitag


Not long ago, telephone companies implemented an automated customer service system, a precursor to the virtual assistants that dominate technology today. I wanted to resolve a simple issue: my internet bill had been issued with an incorrect amount. I called the customer service number but was unable to speak to a human representative.

“Please briefly state the reason for your call,” said a very friendly, but artificial voice. “Wrong bill,” I replied. “I understand, duplicate bill. Is that correct?” “No! It’s about the wrong amount!” “I’m sorry, I don’t understand your request. Let’s try again: tell me the reason for your call.”

And so it went for a while: “billing error,” “incorrect amount,” “wrong bill.” Whether I spoke slowly, angrily, or emotionally, the answer was always the same: “I’m sorry, I don’t understand.” In the end, I decided it was better to pay the extra R$20 on the bill than to keep getting angry.

Today, despite all the progress in generative artificial intelligence (Gen AI), the result would unfortunately be very similar if such a model were deployed to serve people at the INSS (Brazil’s National Social Security Institute, Instituto Nacional do Seguro Social) or to transcribe teleconsultations from the SUS (Brazil’s Unified Health System, Sistema Único de Saúde). The reason is that the language technologies behind these systems still depend on models translated from English.

Alongside developers, linguists can help train AI more efficiently and with more equity and social justice, in a way that is sensitive to Brazilian linguistic diversity. Linguists work not only on describing and theorizing about languages, but also on building linguistic samples that cover different language varieties and speaker profiles.

While we humans learn the rules of grammar from experience, generative artificial intelligence is based on large language models (LLMs), models of language rather than of speech, which are trained on linguistic data to identify statistical patterns of word occurrence in context. Reaching these patterns requires a very large volume of linguistic data. LLMs are trained on billions of words and have billions of parameters, and their responses are accurate enough that it can be difficult to tell whether they were written by a human or a machine, which is the criterion of the Turing test.

Currently, we do not know exactly which texts are selected for the training datasets or which parameters are controlled. Developers do not disclose this information, as the volume of data required far exceeds what is available in the public domain on the Internet.

In most cases, the data is collected without consent or in violation of copyright, which has led large media conglomerates to initiate legal action. The reliability of the responses we obtain, however, only reinforces that the training datasets are growing to encompass all dimensions of variability in human speech.

Ethical and copyright issues are not the only caveats in this process. The environmental costs involved in training models are very high and could be reduced by adopting structured data for supervised learning.

Model training can be performed with structured, labeled data (supervised learning) or with unstructured data (unsupervised learning). Unsupervised learning requires a large volume of data, which drives up computational costs and generates energy and environmental impacts. Supervised learning with structured and labeled data, such as the data produced by linguistic documentation projects, can reach better results with lower processing demands and therefore lower energy and environmental costs.

LLMs can be trained with language data and by calculating the probabilities of word co-occurrence, reaching patterns and inferring rules. And to reach these patterns, many, many words are needed. For example, the word “cobra” in Portuguese can be a noun, as in “A cobra mordeu João” (The snake bit João) or it can be a verb, as in “João cobra o serviço” (João charges for the service). To identify when “cobra” is a verb or a noun, the model needs a large number of contexts in which that word occurs to reach a generalization. This is, roughly speaking, unsupervised training.
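To make the idea of co-occurrence concrete, here is a minimal sketch in Python. It only counts the words that appear immediately to the left and right of “cobra” in a handful of sentences. The toy corpus and the function name `neighbor_counts` are assumptions made for illustration; real LLM training uses vastly more data and far richer statistics than neighbor counts.

```python
from collections import Counter

# Hypothetical toy corpus; a real model would see billions of words.
corpus = [
    "a cobra mordeu joão",
    "a cobra fugiu",
    "uma cobra apareceu",
    "joão cobra o serviço",
    "maria cobra a dívida",
]


def neighbor_counts(sentences, target):
    """Count the words immediately to the left and right of `target`."""
    left, right = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # "<s>" and "</s>" mark the start and end of a sentence.
            left[tokens[i - 1] if i > 0 else "<s>"] += 1
            right[tokens[i + 1] if i + 1 < len(tokens) else "</s>"] += 1
    return left, right


left, right = neighbor_counts(corpus, "cobra")
print("left neighbors: ", left.most_common())
print("right neighbors:", right.most_common())
```

Even in this tiny sample, the left neighbors split into articles (“a,” “uma”) and proper names (“joão,” “maria”), which is the kind of regularity a model can only generalize from when it has seen a large number of contexts.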

On the other hand, it is possible to train models with categorized data: each linguistic element receives a label describing some aspect of how it functions, so the model only has to follow the labels. In the previous examples, a morphological tagging would be:

A [ART] cobra [NOUN] mordeu [VERB] João [NOUN]

João [NOUN] cobra [VERB] o [ART] serviço [NOUN]

In the case of the word “cobra,” the noun or verb label is assigned by a morphosyntactic rule: if it has an [ART] element to the left, it is [NOUN]; if not, it is [VERB].
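The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real tagger: the list of articles and the function name `tag_cobra` are hypothetical, and the rule covers only the word “cobra.”

```python
# Assumed list of Portuguese articles, used only for this illustration.
ARTICLES = {"a", "o", "as", "os", "um", "uma", "uns", "umas"}


def tag_cobra(tokens):
    """Tag each occurrence of 'cobra' using the left-context rule:
    [NOUN] if the word to its left is an article, otherwise [VERB]."""
    tags = []
    for i, token in enumerate(tokens):
        if token.lower() != "cobra":
            continue
        previous = tokens[i - 1].lower() if i > 0 else None
        tags.append("NOUN" if previous in ARTICLES else "VERB")
    return tags


print(tag_cobra("A cobra mordeu João".split()))   # ['NOUN']
print(tag_cobra("João cobra o serviço".split()))  # ['VERB']
```

A rule this simple is deliberately a simplification. In “Ele a cobra” (He charges her), for example, the “a” to the left is a pronoun rather than an article, so the sketch would mislabel the verb; handling such cases is part of what makes labeling specialized work.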

The labeling of linguistic data is still a process that requires specialized human resources, which makes it, in principle, a costly investment. However, there is a large volume of structured data in Brazil sitting idle on flash drives, computer hard drives, and unsystematic repositories. This is the reality of the products of linguistic documentation and description research.

Linguistics is one of the most widespread fields in Brazil, with over 100 graduate programs and several research projects that result in collections of linguistic data. Some of these collections are more famous, such as the Cultured Urban Norm (Norma Urbana Culta, NURC) project, which was established in the late 1960s and 1970s with speech samples in three different stylistic situations in five Brazilian capitals, and provides the basis for contemporary Portuguese grammars.

The linguistic data collected for the NURC project has supported a huge body of scientific research on Brazilian Portuguese, contributing not only to the consolidation of linguistics, but also to the training of specialized human resources.

Other linguistic data collections are smaller and more specific, but no less important: they were created for a dissertation or thesis and then “forgotten” in some repository.

In a scenario where linguists hold collections of rigorously annotated linguistic data and developers are looking for any type of linguistic data to train their models, a synergistic partnership between the two areas is what the Brazilian Linguistic Diversity Platform (Plataforma da Diversidade Linguística Brasileira) proposes. The proposal was submitted to the call for proposals CNPq/SECTICS/CAPES/FAPs Nº 46/2024 – Programa Institutos Nacionais de Ciência e Tecnologia – INCT (approved on merit, but not funded) and shared in SciELO Preprints as Brazilian Linguistic Diversity Platform: Linguistic data for a Brazilian AI.¹

In Brazil, besides Portuguese and its varieties, there are more than 250 other languages (indigenous, immigrant, sign languages) that are neglected in digital inclusion due to a lack of structured data. Even Portuguese is neglected, since training LLMs with translations from English results in asymmetries and biases.

The consortium of laboratories and research groups that came together for the Brazilian Linguistic Diversity Platform proposed to work on the preparation of linguistic data for the training of LLMs, considering Brazilian linguistic diversity, with the development of a joint protocol for collecting linguistic data in the field, to be replicated in groups and laboratories longitudinally. Also within the scope of this proposal is the standardization of procedures for transcription, alignment, and labeling of linguistic data for the constitution of datasets that represent Brazilian linguistic diversity.

The Brazilian Linguistic Diversity Platform directly responds to the objective of the Brazilian Artificial Intelligence Plan (Plano Brasileiro de Inteligência Artificial, PBIA²) to “develop large-scale language models (LLM) for artificial intelligence in Portuguese, based on national data” (PBIA, 2025, p. 13).

Recently published, the final version of the Brazilian Artificial Intelligence Plan² proposes to improve the quality of life of Brazilians through technological innovations in strategic areas such as health, agriculture, the environment, and education. In this context, linguistic research plays a strategic role. Sociolinguistic and natural language processing studies help develop more inclusive technologies capable of dealing with Brazil’s linguistic diversity and avoiding biases in AI models.

Specifically, action 9 of the PBIA proposes an

AI based on national data (LLM in Portuguese), promoting the curation of national data sets and supporting the development of foundational models, in particular large-scale language models (LLM) specialized in Portuguese.² (PBIA, 2025, p. 70)

The proposal for a Brazilian Linguistic Diversity Platform directly responds to the PBIA’s challenge of creating and improving national databases for training AI models, with a focus on reducing dependence on foreign data and recognizing the linguistic and cultural specificities of Brazil, as recommended.

The curatorial proposal of the Brazilian Linguistic Diversity Platform, which brings together structured and documented data from different varieties of Brazilian Portuguese and other languages spoken in Brazil, is directly aligned with the goals of the initiative to expand the availability of national datasets and enable the development of an LLM that is sensitive to the real diversity of language use in Brazil.

Instead of replicating English translation patterns, the structured data curated by the Brazilian Linguistic Diversity Platform enables the training of LLMs that reflect the linguistic reality of Brazil, which is essential for the success of technological applications in health, education, justice, digital inclusion, and other strategic sectors.

Structured data from spoken linguistic documentation of varieties of Brazilian Portuguese is key to the success of PBIA Impact Action 1, the development of an AI system for automatic transcription of teleconsultations in the SUS.

Speech recognition is sensitive to regional, age, and social differences, so transcription models cannot achieve the necessary accuracy without data representing the linguistic diversity present in Brazil. If this diversity is missing from the training data, there is a high risk that the resulting system will be inaccurate or exclusionary, especially in regions where the Portuguese spoken deviates from the hegemonic norm.

For the development of an “AI system to automate the transcription of teleconsultations”² (PBIA, 2025, p. 47), linguistic documentation with structured annotation that marks pauses, intonation, hesitations, and speech overlap can improve the accuracy of models in real call center contexts, which involve spontaneous language and often unfavorable acoustic conditions, with noise and overlapping speech.

We must not forget that Libras (Língua Brasileira de Sinais), the official sign language of Brazil, is a legally recognized language, and that the law requires sign language services in public services. AI systems must also consider the sign languages of Brazil (Libras is just one of them), which requires structured linguistic documentation data for sign languages as well.

Besides AI systems for speech transcription and signing, the implementation of PBIA Impact Action 7, aimed at creating an AI platform to promote the health of older adults, requires structured data on this age group, considering not only regional and socioeconomic variation, but also the effects of cognitive difficulties resulting from aging.

Language models trained based on a dataset of this linguistic profile are essential for more empathetic, clear, and accurate communication between older adults and automated healthcare systems. Going even further, structured linguistic data can support the development of early screening tools for neurodegenerative diseases by identifying linguistic patterns associated with early symptoms of Alzheimer’s, Parkinson’s and other dementias, such as lexical impoverishment, hesitations, and changes in fluency and speech coherence.
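As a rough illustration of how such linguistic patterns can be quantified, the sketch below computes two simple measures over a transcribed utterance: the type-token ratio (distinct words divided by total words), a common proxy for lexical variety, and the proportion of filled pauses. The marker list, the function names, and the sample utterance are assumptions made for this example. It is not a screening tool, and the type-token ratio is known to be sensitive to text length.

```python
# Hypothetical filled-pause markers; a real annotation scheme would define its own.
FILLED_PAUSES = {"ah", "hum", "ahn", "hã"}


def type_token_ratio(tokens):
    """Distinct words divided by total words; lower values suggest less lexical variety."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hesitation_rate(tokens, markers=FILLED_PAUSES):
    """Proportion of tokens that are filled-pause markers."""
    return sum(token in markers for token in tokens) / len(tokens) if tokens else 0.0


# Invented utterance with repetitions and filled pauses.
sample = "hum eu fui ahn eu fui na na casa".split()

print(f"type-token ratio: {type_token_ratio(sample):.2f}")
print(f"hesitation rate:  {hesitation_rate(sample):.2f}")
```

Measures like these only become meaningful when they are computed over transcriptions that mark hesitations and repetitions consistently, which is exactly what structured linguistic documentation provides.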

As we can see, diversity in linguistic data is essential for LLMs to ensure social justice and equity, with speech from different regions and social groups represented.

The Brazilian Linguistic Diversity Platform is a proposal to bring together experts in structured data on Brazilian languages, in different situations and use contexts, and developers of LLM-based applications.

Given the demand from the PBIA, we decided to share the proposal as it was submitted to the CNPq/SECTICS/CAPES/FAPs call Nº 46/2024 – Programa Institutos Nacionais de Ciência e Tecnologia – INCT, together with its assessments, in order to stimulate and help improve other proposals, and to show that we linguists have something to offer to the Brazilian Artificial Intelligence Plan and can contribute to improving Brazilians’ quality of life.

Notes

1. FREITAG, R.M.K. Plataforma da Diversidade Linguística Brasileira: Dados linguísticos para uma IA brasileira. SciELO Preprints [online]. 2025. [viewed 18 July 2025]. https://doi.org/10.1590/SciELOPreprints.11957. Available from: https://preprints.scielo.org/index.php/scielo/preprint/view/11957/version/12598

2. Plano Brasileiro de Inteligência Artificial (PBIA) [online]. MCTI (Ministério da Ciência, Tecnologia e Inovação). 2025 [viewed 18 July 2025]. Available from: https://www.gov.br/mcti/pt-br/centrais-de-conteudo/publicacoes-mcti/plano-brasileiro-de-inteligencia-artificial/pbia_mcti_2025.pdf

References

BENDER, E., et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In: FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event, 2021 [viewed 18 July 2025]. https://doi.org/10.1145/3442188.3445922. Available from: https://dl.acm.org/doi/10.1145/3442188.3445922

CASTILHO, A.T. Gramática do Português Brasileiro: fundamentos, perspectivas. Cadernos de Linguística [online]. 2021, vol. 2, no. 1, e252, ISSN: 2675-4916 [viewed 18 July 2025]. https://doi.org/10.25189/2675-4916.2021.v2.n1.id252. Available from: https://cadernos.abralin.org/index.php/cadernos/article/view/252

FERRO, M. et al. Towards a sustainable artificial intelligence: A case study of energy efficiency in decision tree algorithms. Concurrency and Computation: Practice and Experience [online]. 2021, vol. 33, e6815, ISSN: 1532-0634 [viewed 18 July 2025]. https://doi.org/10.1002/cpe.6815. Available from: https://onlinelibrary.wiley.com/doi/10.1002/cpe.6815

FREITAG, R. Variação linguística: Diversidade e cotidiano. São Paulo: Contexto, 2025.

FREITAG, R., et al. Função na língua, generalização e reprodutibilidade. Revista da ABRALIN [online]. 2021, vol. 20, no. 1, pp. 1–27, ISSN: 0102-7158 [viewed 18 July 2025]. https://doi.org/10.25189/rabralin.v20i1.1827. Available from: https://revista.abralin.org/index.php/abralin/article/view/1827

FREITAG, R.M.K. Plataforma da Diversidade Linguística Brasileira: Dados linguísticos para uma IA brasileira. SciELO Preprints [online]. 2025. [viewed 18 July 2025]. https://doi.org/10.1590/SciELOPreprints.11957. Available from: https://preprints.scielo.org/index.php/scielo/preprint/view/11957/version/12598

FREITAG, R.M.K. Preconceito linguístico para humanizar as máquinas. Cadernos de Linguística [online]. 2021, vol. 2, no. 4, e495, ISSN: 2675-4916 [viewed 18 July 2025]. https://doi.org/10.25189/2675-4916.2021.v2.n4.id495. Available from: https://cadernos.abralin.org/index.php/cadernos/article/view/495

GALDINO, J.C. and OLIVEIRA JR, M. Prosódia e síntese da fala: uma revisão integrativa da literatura. Revista da ABRALIN [online]. 2023, vol. 22, no. 1, pp. 1–15 [viewed 18 July 2025]. https://doi.org/10.25189/rabralin.v22i1.2130. Available from: https://revista.abralin.org/index.php/abralin/article/view/2130

HÜBNER, L.C., et al. Nomeação e aprendizagem verbal na doença de Alzheimer, no comprometimento cognitivo leve e no envelhecimento sadio com baixa escolaridade. Arquivos de Neuro-Psiquiatria [online]. 2018, vol. 76, pp. 93–99, ISSN: 0004-282X [viewed 18 July 2025]. https://doi.org/10.1590/0004-282X2017019. Available from: https://www.scielo.br/j/anp/a/F6Kf9M7WVBsnpcFMKQXYcnC/

OLIVEIRA JR., M. NURC Digital: Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). Chimera: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos [online]. 2016, vol. 3, no. 2, pp. 149–174, ISSN: 2386-2629 [viewed 18 July 2025]. https://doi.org/10.15366/chimera2016.3.2.004. Available from: https://revistas.uam.es/chimera/article/view/6519

Plano Brasileiro de Inteligência Artificial (PBIA) [online]. MCTI (Ministério da Ciência, Tecnologia e Inovação). 2025 [viewed 18 July 2025]. Available from: https://www.gov.br/mcti/pt-br/centrais-de-conteudo/publicacoes-mcti/plano-brasileiro-de-inteligencia-artificial/pbia_mcti_2025.pdf

QUADROS, R.M., et al. Inventário Nacional de Libras. Fórum Linguístico [online]. 2020, vol. 17, no. 4, pp. 5457–5474, ISSN: 1984-8412 [viewed 18 July 2025]. https://doi.org/10.5007/1984-8412.2020.e77334. Available from: https://periodicos.ufsc.br/index.php/forum/article/view/77334

TORRENT, T. Plano brasileiro para turbinar IA ignora conceito básico da tecnologia. Tilt [online]. 2025 [viewed 18 July 2025]. Available from: https://www.uol.com.br/tilt/analises/ultimas-noticias/2025/06/23/plano-brasileiro-para-turbinar-ia-ignora-conceito-basico-da-tecnologia.htm

About Raquel Freitag

Raquel Freitag is a linguist and full professor at the Universidade Federal de Sergipe, where she teaches in the Graduate Programs in Language and Psychology. She holds a PhD in Linguistics from the Universidade Federal de Santa Catarina and researches linguistic variation, linguistic processing, and reproducibility in science. She is the coordinator of the Sociolinguistics Working Group at ANPOLL (2023-2025) and author of Variação linguística: Diversidade e cotidiano, published by Editora Contexto (2025).

Translated from the original in Portuguese by Lilian Nassi-Calò.

How to cite this post [ISO 690/2010]:

FREITAG, R. Linguistics for a Brazilian Artificial Intelligence (AI) [online]. SciELO in Perspective, 2025 [viewed ]. Available from: https://blog.scielo.org/en/2025/07/18/linguistics-for-a-brazilian-artificial-intelligence-ai/