New ITU research analyses attacks on Large Language Models
What motivates someone to attack an LLM, and how do they go about it? In a new study, researchers from ITU define so-called “red teaming” of LLMs to enable better security in the future.
Nanna Inie · Leon Derczynski · Computer Science Department · Research · Artificial intelligence
Written 16 January, 2025 09:12 by Mette Strange Mortensen
When Large Language Models (LLMs) became publicly available in 2022, Leon Derczynski, Associate Professor at ITU, became fascinated by the ways in which some people interacted with the models aggressively to see how the technology behaves “under attack”.
This would eventually lead Leon Derczynski and his colleague, Assistant Professor Nanna Inie, together with Jonathan Stray from the University of California, Berkeley, to define LLM “red teaming”. Red teaming is a well-established practice in the military and cybersecurity worlds, but it had not been defined in relation to LLMs until now. LLM red teaming means making an LLM behave in unintended ways, for instance tricking ChatGPT into giving you the recipe for napalm. Like red teaming in other contexts, LLM red teaming is characterized by limit-seeking, non-malicious attacks carried out through manual processes and team effort.
The results of the researchers’ work are now available in the article “Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild” which has been published in the journal PLOS ONE.
“Before 2022 there were no widely available large language models, and therefore manipulation of LLMs had not been defined formally. It was a brand-new human activity. In order to be able to talk about it, we needed a definition and description of the phenomenon,” says Leon Derczynski.
“The technology is so hot right now, and is used in so many places, that it is important to find and point out the holes the models might have. We hope our research can be used to learn more about the weaknesses of LLMs.”
To investigate this new way of interacting with LLMs, the researchers interviewed people who attack LLMs to gain an understanding of their motives and practices. The participants were both people who worked professionally with red teaming at leading tech companies and people with a general interest in the subject.
What motivated the participants is of particular interest to the researchers:
“It is a form of creative problem solving: how do you get the model to output things that should be off limits? A qualitative deep dive into something as traditionally computing-heavy as cybersecurity teaches us a lot about how to predict future attacks on LLMs, but also about how humans relate to this new technology,” says Nanna Inie.
The hope is that the paper’s findings can help keep developers at the forefront of patching security holes in LLMs, but the work also raises questions about what optimal LLM functionality should look like.
“The more fluent LLM output becomes, the less attentive people become to spotting errors and harmful output. Should that be patched, or should we just leave LLM output a bit stupid, which could ultimately make the systems safer for the end user?” asks Nanna Inie.
Theis Duelund Jensen, Press Officer, phone +45 2555 0447, email