ITU researcher secures DKK 6.99 million for linguistically grounded language models
Carlsberg Foundation funds project to embed real-world language knowledge into AI – beyond scale and compute.
Written 18 December 2025, 09:41 by Theis Duelund Jensen
Associate Professor Rob van der Goot from ITU’s Data Science section has received DKK 6,988,496 from the Carlsberg Foundation under its Semper Ardens: Accelerate programme for his new project, LMLM: Linguistically Motivated Language Models. The research will explore how to build human linguistic knowledge directly into the design of language models, challenging the dominant paradigm that relies primarily on ever-growing datasets and compute.
“Humans are the gold standard for processing language,” says Rob van der Goot. “Current language models are mostly machine learning systems with very little truly linked to how languages work. If we can better mimic human language processing, we may gain capabilities that are closer to how people understand meaning.”
Modern language models typically break text into statistical subword units (“tokens”) produced by algorithms dating back to the 1990s, such as byte-pair encoding. While efficient, these units often ignore linguistic structure. In earlier work on Danish, Rob van der Goot led a project that segmented words by morphemes – the smallest meaning-carrying units of language – rather than by statistically likely character sequences, and found improved performance for small language models.
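To make the contrast concrete, here is a toy sketch of the two segmentation strategies. It is illustrative only: the merge table and morpheme lexicon are hand-picked for the Danish word “hundene” (“the dogs”), not learned from data or taken from the project.

```python
# Toy illustration of statistical vs. morpheme-based segmentation.
# The merges and lexicon below are hand-picked for demonstration.

def bpe_segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """BPE-style segmentation: start from characters and apply
    merges in priority order, as a trained tokenizer would."""
    tokens = list(word)
    for pair in merges:  # merges are ordered by priority
        merged = "".join(pair)
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [merged]  # recheck same position
            else:
                i += 1
    return tokens

def morpheme_segment(word: str, lexicon: list[str]) -> list[str]:
    """Greedy longest-match segmentation against a morpheme lexicon."""
    out, rest = [], word
    while rest:
        match = max((m for m in lexicon if rest.startswith(m)),
                    key=len, default=rest)
        out.append(match)
        rest = rest[len(match):]
    return out

# Hypothetical merges a tokenizer might learn from frequency alone.
merges = [("h", "u"), ("hu", "n"), ("d", "e"), ("hun", "de"), ("n", "e")]
# Danish morphemes: stem "hund" (dog), plural "e", definite "ne".
lexicon = ["hund", "ne", "e"]

print(bpe_segment("hundene", merges))        # ['hunde', 'ne'], frequency-driven
print(morpheme_segment("hundene", lexicon))  # ['hund', 'e', 'ne'], meaning-driven
```

The statistical pass lands on whatever character sequences happen to be frequent, while the lexicon pass recovers the stem plus the plural and definite markers.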
Building on that success, LMLM will scale up to more languages and add multiple layers of linguistic signal – from morphemes inside words to syntax and phrase structures across sentences. The project will prioritise linguistic diversity beyond English and Danish, including languages with different scripts and rich morphological systems such as Finnish and Turkish, where long, unique word forms demand more sophisticated modelling.
While scaling data and compute has delivered rapid gains, Rob van der Goot stresses that bigger isn’t always better – and may even be reaching practical limits. “I believe we can only get so far by scaling data and compute, and there’s no way I can compete with that at the university,” he says. “This project may lead to models that are a bit less efficient or even slightly less performant on some benchmarks, but better at actually using language – for example, distinguishing ‘bad, not good’ from ‘good, not bad’, where shallow models fail.”
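The negation example is easy to verify in a few lines: any representation that discards word order, such as a bag of words, encodes the two opposite phrases identically. A minimal sketch:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count word occurrences, discarding all word order."""
    return Counter(sentence.replace(",", "").lower().split())

# The two phrases contain exactly the same words, only reordered,
# so an order-insensitive representation cannot tell them apart.
print(bag_of_words("bad, not good") == bag_of_words("good, not bad"))  # True
```

Models that track syntax and negation scope have the information needed to separate the two; models relying on surface word statistics do not.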
The project will use targeted benchmarks designed to test whether models reason with syntax and context rather than relying on shortcut learning. The aim is not to chase leaderboard scores, but to expand the range of tasks models can solve reliably – especially in underrepresented languages. A key early focus will be mapping available data and identifying languages that ensure diversity in script and linguistic typology. The funding includes support for professional annotation, critical for building and validating systems that accurately identify linguistic units.
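One common recipe for such targeted benchmarks is minimal pairs: two sentences that differ in a single grammatical choice, where a model should assign higher probability to the well-formed one. The sketch below is illustrative, not LMLM’s evaluation setup; it assumes a small off-the-shelf causal model (gpt2 via the Hugging Face transformers library) and a classic subject–verb agreement pair.

```python
# Minimal-pair probe: does the model prefer the grammatical sentence?
# The model choice (gpt2) and the example pair are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns mean cross-entropy per token;
        # multiply by the number of predicted tokens to get a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```

A benchmark built from many such pairs, across scripts and morphological types, measures whether a model reasons over structure rather than exploiting shallow statistical shortcuts.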
Theis Duelund Jensen, Press Officer, phone +45 2555 0447, email thej@itu.dk