ITU-led project will make automated translation more reliable

According to associate professor at the IT University, Leon Derczynski, the Danish Gigaword Project has the potential to improve everything from automated translation to misinformation detection.

Leon Derczynski, Computer Science Department, algorithms, big data

Written May 4, 2021 11:46 AM by Theis Duelund Jensen

In today’s world, we do a great deal of text and language processing with computers, but compared to humans, computers need much more data to understand language. Here is a good example of the problem and the reason why you should never rely solely on Google Translate:

Instead of sounding like a nutcase, you will sound like a “nøddetaske” (literally, “nut purse,” which is not a word in Danish), even though there are many appropriate Danish translations of the term, such as “galning” or “tosse.” There is a good explanation, though. Google Translate works with a model (an algorithm trained on data to replicate a specific decision process, for instance, deciding on the proper translation of a sentence) whose Danish-language dataset is very limited. This is where the IT University-led Danish Gigaword Project comes into play.

The research project, led by associate professor at ITU, Leon Derczynski, and Manuel R. Ciosici, a research scientist at the University of Southern California Information Sciences Institute and visiting scholar at ITU, has compiled the first gigaword dataset with over a billion Danish words, which has the potential to make a service like Google’s much more accurate when translating Danish.

- For a language like English, there was a billion-word dataset some thirty years ago. Even the 360,000 speakers of Icelandic have a gigaword project. Danish lags behind, and the gap is widening. It is important because, if you want to do any kind of language understanding for Danish, you need a large dataset to make proper tools, says Leon Derczynski.

And that is exactly the goal of the Danish Gigaword Project. In terms of Natural Language Processing, it takes Danish from a so-called low-resource language to a high-resource language, which ultimately means we will see better machine translation quality, better speech recognition, and better search results from search engines once the dataset is in use.

Diverse input

But what exactly is a gigaword corpus? In short, it is a vast set of data on the Danish language as it appears in written sources. But to build a dataset that reflects all the nuances and complexities of written communication in a particular language, you need more than just a lot of data; you need a lot of data from a lot of different sources.

- If you train computers only on news text, then they can only understand news text, but in our day-to-day life we do not really communicate like, for instance, DR or Weekendavisen. We use text in much more varied ways. I wanted to get as many different types of digital Danish as I possibly could, says Leon Derczynski, who started the project in 2019 and has since led a group of volunteers from all corners of the Danish tech and research spheres.

The paper Leon Derczynski and his co-authors have written on the DAGW, which they are presenting at the Nordic Conference on Computational Linguistics on May 31, details the many sources from which data has been culled, among them everything from the Danish Parliament’s records of meetings and speeches and a research project on spontaneous speech to Danish Wikipedia pages and a digitized version of the Bible.

Copyright challenges

However, compiling a billion-word dataset comes with a set of unique challenges when you are working in a Danish context. For one thing, licensing is much more restrictive in Denmark than, for instance, in the USA.

- One of the big barriers to our work in Denmark is the fact that people are very cautious about sharing data. In the USA, The New York Times, Associated Press, Xinhua News Agency, and Agence France-Presse donated a combined billion words’ worth of articles, and this became a public English corpus. But licensing is a significant issue in Denmark, which has made it harder to build this dataset and make it publicly available, which is the core goal. It must be available to researchers as well as to companies, so they can develop new technologies, says Leon Derczynski.

Although copyright laws in Denmark are reasonable, they do present significant challenges to researchers, according to Leon Derczynski. However, he is in positive dialogue with major news outlets about data donations and TV2 Regionerne has already supplied DAGW with approximately 50,000 news articles published between 2010 and 2019.

Better data, better tools

Language technology is not always visible, but it is applied in almost every conceivable context, which is why a vast language corpus is necessary to develop good tools. The Danish billion-word corpus not only has the potential to improve translation services like Google’s or pave the way for automated grammar correction services like the ones that exist for English; it can also help us improve online discourse:

- A lot of my other research is about misinformation detection, online bullying, and harassment. This is hard to detect in a Danish context, because the models for Danish do not have adequate data. But now we have much more detailed information about everything described in Danish. So, this means we can have better misinformation detection and ultimately a better online discourse, says Leon Derczynski.

Ultimately, the DAGW is all about providing researchers and developers with more data to work from. As such, the billion-word corpus’ possibilities are endless and that is exactly what motivated Leon Derczynski to start the project in the first place:

- These big language models that inform artificial intelligence, which we occasionally see make headlines in the news, only speak English, and that sucks if the language you work in is Danish. Now that we have this dataset, we can train much more advanced models than before.

More information:

Follow updates about the Danish Gigaword Project at www.gigaword.dk

Theis Duelund Jensen, Press Officer, Tel: +45 2555 0447, email: thej@itu.dk




