ITU-led project will make automated translation more reliable

According to Leon Derczynski, associate professor at the IT University, the Danish Gigaword Project has the potential to improve everything from automated translation to misinformation detection.


Written 2 June, 2021 10:32 by Theis Duelund Jensen

In today’s world, we do a lot of text and language processing with computers, but compared to humans, computers need much more data to understand language. Here is a good example of the problem, and of why you should never rely solely on Google Translate:

Google Translate attempts to translate an idiom

Instead of sounding like a nutcase, you will sound like a “nøddetaske” (literally, “nut purse,” which is not a word in Danish), even though there are many appropriate Danish translations of the term, such as “galning” or “tosse.” There is a good explanation, though. Google Translate is working with a model – an algorithm trained on data to replicate a specific decision process, for instance, deciding on the proper translation of a sentence – whose Danish language data set is very limited. This is where the IT University-led Danish Gigaword Project comes into play.

The research project, led by associate professor at ITU, Leon Derczynski, and Manuel R. Ciosici, a research scientist at the University of Southern California Information Sciences Institute and visiting scholar at ITU, has compiled the first gigaword dataset with over a billion Danish words, which has the potential to make a service like Google’s much more accurate when translating Danish.

- For a language like English, there was a billion-word data set some thirty years ago. Even the 360,000 speakers of Icelandic have a gigaword project. Danish lags behind and the gap is widening. This matters because if you want to do any kind of language understanding for Danish, you need a large dataset to make proper tools, says Leon Derczynski.

And that is exactly the goal of the Danish Gigaword Project. In terms of Natural Language Processing, it takes Danish from a so-called low-resource language to a high-resource language, which ultimately means that we will see better machine translation, better speech recognition, and better search engine results once the dataset is in use.

Diverse input

But what exactly is a gigaword corpus? In short, it is a vast set of data on the Danish language as it appears in written sources. But to build a dataset that reflects all the nuances and complexities of written communication in a particular language, you need more than just a lot of data; you need a lot of data from a lot of different sources.

- If you train computers only on news text, then they can only understand news text, but in our day-to-day life we do not really communicate like, for instance, DR or Weekendavisen. We use text in much more varied ways. I wanted to get as many different types of digital Danish as I possibly could, says Leon Derczynski, who started the project in 2019 and has since led a group of volunteers from all corners of the Danish tech and research spheres.

The paper Leon Derczynski and his co-authors have written on the Danish Gigaword project (“DAGW”), which they are presenting today at the Nordic Conference on Computational Linguistics, details the many sources from which data has been gathered. They range from the Danish Parliament’s records of meetings and speeches and a research project on spontaneous speech to Danish Wikipedia pages and a digitized version of the Bible.

Copyright challenges

However, compiling a billion-word dataset comes with a set of unique challenges when you are working in a Danish context. For one thing, licensing is much more restrictive in Denmark than in, for instance, the USA.

- One of the big barriers to our work in Denmark is that people are very cautious about sharing data. In the USA, The New York Times, Associated Press, Xinhua News Agency, and Agence France-Presse donated a combined billion words’ worth of articles to become a public English corpus. Licensing is a significant issue in Denmark, so it has been harder to build this dataset and make it available – which is the core goal. It must be available to researchers as well as to companies, so they can develop new technologies, says Leon Derczynski.

Although copyright laws in Denmark are reasonable, they do present significant challenges to researchers, according to Leon Derczynski. However, he is in positive dialogue with major news outlets about data donations and TV2 Regionerne has already supplied DAGW with approximately 50,000 news articles published between 2010 and 2019.

Better data, better tools

Language technology is not always visible, but it is applied in almost every conceivable context, which is why a vast language corpus is necessary to develop good tools. The Danish billion-word corpus not only has the potential to improve translation services like Google’s and pave the way for automated grammar correction services like the ones that exist for English; it can also help us improve online discourse:

- A lot of my other research is about misinformation detection, online bullying, and harassment. This is hard to detect in a Danish context, because the models for Danish do not have adequate data. But now we have much more detailed information about Danish. This means we can have better misinformation detection and ultimately a better online discourse, says Leon Derczynski.

Ultimately, DAGW is all about providing researchers and developers with more data to work from. As such, the possibilities of the billion-word corpus are endless, and that is exactly what motivated Leon Derczynski to start the project in the first place:

- The big language models which occasionally make the news in relation to artificial intelligence only speak English; that sucks if the language you work in is Danish. Now that we have Danish Gigaword, we can train much more advanced models than before, and start to catch up.

More information:

Follow updates about the Danish Gigaword Project at gigaword.dk

Theis Duelund Jensen, Press Officer, Tel: +45 2555 0447, email: thej@itu.dk
