ITU led project will make automated translation more reliable

According to associate professor at the IT University, Leon Derczynski, the Danish Gigaword Project has the potential to improve everything from automated translation to misinformation detection.

Leon Derczynski, Computer Science Department, Research, algorithms, data science

Written 2 June, 2021 10:32 by Theis Duelund Jensen

In today’s world, we do a lot of text and language processing with computers, but computers need much more data than humans do to understand language. Here is a good example of the problem and the reason why you should never rely solely on Google Translate:

[Image: Google Translate attempts to translate an idiom]

Rather than sounding like a nutcase, you will sound like a “nøddetaske” (literally, “nut purse,” which is not a word in Danish), even though the term has many appropriate Danish translations, such as “galning” or “tosse.” There is a good explanation, though. Google Translate is working with a model – an algorithm trained on data to replicate a specific decision process, for instance, deciding on the proper translation of a sentence – whose Danish language data set is very limited. This is where the IT University-led Danish Gigaword Project comes into play.

The research project, led by associate professor at ITU, Leon Derczynski, and Manuel R. Ciosici, a research scientist at the University of Southern California Information Sciences Institute and visiting scholar at ITU, has compiled the first gigaword dataset with over a billion Danish words, which has the potential to make a service like Google’s much more accurate when translating Danish.

- For a language like English, there was a billion-word data set some thirty years ago. Even the 360,000 speakers of Icelandic have a gigaword project. Danish lags behind and the gap is widening. It is important, because if you want to do any kind of language understanding for Danish, you need a large dataset to make proper tools, says Leon Derczynski.

And that is exactly the goal of the Danish Gigaword Project. In terms of Natural Language Processing, it takes Danish from a so-called low-resource language to a high-resource language, which ultimately means that we will see better machine translation quality, better speech recognition, and better search results from search engines once the dataset is in use.

Diverse input

But what exactly is a gigaword corpus? In short, it is a vast set of data on the Danish language as it appears in written sources. But to build a dataset that reflects all the nuances and complexities of written communication in a particular language, you need more than just a lot of data; you need a lot of data from a lot of different sources.

Mapping abusive language

The Danish Gigaword Project also has the potential to improve abusive language detection on various digital platforms. Recently, the data analytics agency, Analyse & Tal, published a report on mapping abusive language discourses on Facebook which was enabled by the Danish Gigaword Project.

You can read the full report here (Danish).


- If you train computers only on news text, then they can only understand news text, but in our day-to-day life we do not really communicate like, for instance, DR or Weekendavisen. We use text in much more varied ways. I wanted to get as many different types of digital Danish as I possibly could, says Leon Derczynski, who started the project in 2019 and has since led a group of volunteers from all corners of the Danish tech and research spheres.

The paper that Leon Derczynski and his co-authors have written on the Danish Gigaword Project (“DAGW”), which they are presenting today at the Nordic Conference on Computational Linguistics, details the many sources from which data has been gathered. They range from the Danish Parliament’s records of meetings and speeches and a research project on spontaneous speech to Danish Wikipedia pages and a digitized version of the Bible.

Copyright challenges

However, compiling a billion-word dataset comes with a set of unique challenges when you are working in a Danish context. For one thing, licensing is much more restrictive in Denmark than in, for instance, the USA.

- One of the big barriers to our work in Denmark is that people are very cautious about sharing data. In the USA, The New York Times, Associated Press, Xinhua News Agency, and Agence France-Presse donated a combined billion words’ worth of articles to become a public English corpus. Licensing is a significant issue in Denmark, so it has been harder to build this dataset and make it available – which is the core goal. It must be available to researchers as well as to companies, so they can develop new technologies, says Leon Derczynski.

Although copyright laws in Denmark are reasonable, they do present significant challenges to researchers, according to Leon Derczynski. However, he is in positive dialogue with major news outlets about data donations and TV2 Regionerne has already supplied DAGW with approximately 50,000 news articles published between 2010 and 2019.

Better data, better tools

Language technology is not always visible, but it is applied in almost every conceivable context, which is why a vast language corpus is necessary to develop good tools. The Danish billion-word corpus not only has the potential to improve translation services like Google’s or pave the way for automated grammar correction services like the ones that exist for English; it can also help us improve online discourse:

- A lot of my other research is about misinformation detection, online bullying, and harassment. This is hard to detect in a Danish context, because the models for Danish do not have adequate data. But now we have much more detailed information about Danish. This means we can have better misinformation detection and ultimately a better online discourse, says Leon Derczynski.

Ultimately, the DAGW is all about providing researchers and developers with more data to work from. As such, the billion-word corpus’ possibilities are endless and that is exactly what motivated Leon Derczynski to start the project in the first place:

- The big language models which occasionally make the news in relation to artificial intelligence only speak English; that sucks if the language you work in is Danish. Now that we have Danish Gigaword, we can train much more advanced models than before, and start to catch up.

More information:

Follow updates about the Danish Gigaword Project at gigaword.dk

Theis Duelund Jensen, Press Officer, Tel: +45 2555 0447, email: thej@itu.dk



