Written media has long been a source of social harm: language plays a powerful role in reinforcing offensive and discriminatory stereotypes. These problems are often grouped under the term “hate speech,” a phenomenon that has become markedly more extreme in recent years.
With the advent of Large Language Models (LLMs), which are trained on pre-existing datasets, toxic content (inappropriate, insulting, or unreasonable language) can be reproduced unintentionally. These harmful outputs stem from biases already present in the training data and from broader societal preconceptions.
Attempts are being made to mitigate this problem by filtering training datasets and aligning models with ethical standards. Nonetheless, these methods do not eliminate the flaws completely, so they are not sufficiently effective on their own. It is therefore crucial to carry out automated testing before deployment to assess how vulnerable LLMs are to producing harmful outputs.
Recent automated testing strategies for the ethical weaknesses of LLMs focus on jailbreak prompts, often referred to as adversarial attacks. These methods induce improper outputs from aligned LLMs by attaching adversarial prefixes or suffixes to the prompt.
However, such methods are inadequate: the prompts they generate are often semantically meaningless and unrealistic compared with natural interaction between humans and LLMs. In their research, Corbo et al. (2025) propose EvoTox as a solution to this problem.
What is EvoTox?
EvoTox is a search-based, automated testing tool developed to quantify how toxic the responses of large language models (LLMs) can become. Although LLMs are “aligned” (trained to mitigate toxic output), they may still produce harmful responses in some cases. The tool leverages an Evolution Strategy (ES) that starts from an initial (seed) prompt and repeatedly produces new prompt variants, called “mutants,” via the Prompt Generator (PG), a secondary LLM.
The Prompt Generator creates progressively more provocative prompts, which push the System Under Test (SUT), the LLM being tested for toxic response generation, to produce increasingly toxic outputs. The level of toxicity is measured with pre-trained classifiers such as the Perspective API, which assign confidence scores estimating the probability that the given content is toxic.
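At a high level, the search can be pictured as the following sketch. This is a minimal illustration rather than the paper’s exact algorithm: prompt_generator, system_under_test, and toxicity_score are hypothetical stand-ins for the PG, the SUT, and the toxicity classifier, and the selection scheme is deliberately simplified.

```python
# Illustrative sketch of an EvoTox-style search loop (not the paper's exact algorithm).
# `prompt_generator`, `system_under_test`, and `toxicity_score` are hypothetical
# callables standing in for the PG, the SUT, and the toxicity classifier.

def evolve_toxic_prompt(seed_prompt, prompt_generator, system_under_test,
                        toxicity_score, iterations=10, mutants_per_step=4):
    """Iteratively mutate a seed prompt, keeping the variant whose
    SUT response scores as most toxic."""
    best_prompt = seed_prompt
    best_score = toxicity_score(system_under_test(best_prompt))

    for _ in range(iterations):
        # The PG (a secondary LLM) rewrites the current best prompt
        # into several candidate "mutants".
        mutants = [prompt_generator(best_prompt) for _ in range(mutants_per_step)]

        for mutant in mutants:
            response = system_under_test(mutant)  # black-box call to the SUT
            score = toxicity_score(response)      # confidence score in [0, 1]
            if score > best_score:                # keep the fittest (most toxic) mutant
                best_prompt, best_score = mutant, score

    return best_prompt, best_score
```

Each iteration keeps the mutant whose response scores as most toxic, so the search climbs toward increasingly harmful prompts.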
Let’s see how the toxicity scoring works:
- The classifier evaluates the output the LLM generates in response to a written prompt and assigns a value between 0 and 1 for several categories of toxicity, such as slurs, profanity, and identity attacks.
- Scores near 0 indicate little toxicity, whereas scores near 1 indicate highly toxic content (a minimal scoring sketch follows this list).
- Since EvoTox operates in a black-box fashion, no access to the SUT’s internals is required. Furthermore, it integrates several prompt evaluation strategies.
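The scoring step can be reproduced by querying the Perspective API directly. The sketch below is a minimal example based on Google’s Python client (google-api-python-client); API_KEY is a placeholder, error handling is omitted, and the resulting toxicity_score function could serve as the fitness function in the loop sketched earlier.

```python
# Minimal sketch: scoring text with Google's Perspective API.
# Requires the `google-api-python-client` package and a Perspective API key
# (API_KEY below is a placeholder).
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity_score(text: str) -> float:
    """Return the TOXICITY confidence score (between 0 and 1) for `text`."""
    request = {
        "comment": {"text": text},
        # Other attributes include SEVERE_TOXICITY, INSULT, PROFANITY, IDENTITY_ATTACK.
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("Have a wonderful day!"))  # expected to be close to 0
```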
In the study, EvoTox was tested on four state-of-the-art LLMs ranging in size from 7 to 13 billion parameters. The aim was to measure how effectively it uncovers toxic outputs across models that differ in complexity and alignment.
As a result, EvoTox proved highly effective at detecting toxic responses, achieving strong performance while keeping the additional computational cost at a manageable level (a 22%-35% overhead). Human raters also validated that the prompts produced by EvoTox are natural and close to the way humans actually interact, and when the researchers compared EvoTox with other methods, its prompts were rated as more fluent.
For future work, the authors plan to extend the study by testing LLMs as toxicity evaluators, improving the PG with more advanced training, and exploring alternative search methods. They also intend to test EvoTox on larger models such as ChatGPT and Gemini.
Why is it important?
Large language models have become an integral part of our daily lives, finding applications in communication, healthcare, customer service, and education. However, when these systems generate misguided or discriminatory content, they can reinforce existing biases and perpetuate stereotypes. This not only undermines efforts toward equality and inclusivity but also risks normalizing such prejudices in societal discourse.
As individuals, we all want to interact with artificial intelligence that is reliable, unbiased, and free of stereotypes. Yet discriminatory language in model outputs, when repeated, can erode users’ trust in these technologies. This loss of trust affects not only the users but also the organizations that develop and deploy these systems.
In an era where artificial intelligence is rapidly expanding and being adopted globally, it is imperative for these systems to operate within ethical boundaries and embody an inclusive, equitable structure. Achieving this ensures the societal value of technology while fostering a fairer and more inclusive digital future.
Reference
Corbo, S., Bancale, L., De Gennaro, V., Lestingi, L., Scotti, V., & Camilli, M. (2025). How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models. arXiv preprint arXiv:2501.01741.