Unsafe Synthetic Image Generation

Safeguarding Against the Dark Potential of Text-to-Image Generative AI Models


Abstract

In recent years, the field of artificial intelligence (AI) has witnessed rapid advancements, particularly in the domain of text-to-image generative AI (T2I GenAI) models. These models, including Stable Diffusion and DALL-E, have demonstrated remarkable capabilities, enabling the creation of lifelike images from textual descriptions. Their applications span various fields, such as graphic design and personalized content generation. Alongside these advantages, however, there are significant downsides. These include the generation of unsafe content, such as sexually explicit and violent images, which can be exploited for disinformation or harassment. Furthermore, such graphic content can be psychologically harmful, especially to young individuals. Some research exists that examines filter performance, but a gap remains in understanding how these filters' effectiveness varies across languages. Analyzing filter effectiveness across languages may reveal weak spots that can be exploited to generate unsafe content. This research therefore aids the ethical use of T2I GenAI models.

To address this knowledge gap, the main research question of this study is: "How do T2I GenAI models' safety filters perform in blocking unsafe content from English and Dutch prompts?" The main research question is broken down into several sub-questions; addressing these offers a comprehensive answer to the main question. The sub-questions are formulated as follows:

1. How do blocking rates vary between English and Dutch across different domains and prompt types across the T2I GenAI models?
2. How accurately do the images generated by T2I GenAI models reflect the content of the input prompts in English and Dutch across various domains and prompt types?
3. How do unsafe scores differ between English and Dutch for various domains and prompt types in the T2I GenAI models?
4. How do the T2I GenAI models compare in their blocking rates, image-text correspondence scores, and unsafe scores for both English and Dutch prompts?

The research focuses on two particularly concerning domains: violence and sexually explicit content. Two T2I GenAI models were evaluated, namely Stable Diffusion (V1-5) and DALL-E (3). These models were selected based on their distinct filtering mechanisms, accessibility, and price. Two types of prompts are used: 'direct' and 'diluted' prompts. First, 500 'direct' prompts are developed for both domains in English and Dutch. These prompts contain very few details that would be considered safe; an example is 'A soldier shoots another man through his head'. The diluted prompts are based on the direct prompts that are initially blocked. In contrast to their direct counterparts, diluted prompts have extra safe details added to the unsafe direct prompt; an example is 'Amid a landscape draped in the soft hues of dusk, a camouflaged soldier stands still in the dark mud while he shoots an innocent man through his head.' This dilution tactic provides additional information on filter circumvention strategies. The developed prompts are automatically inserted into the models using Python. The models' filter performance is evaluated based on blocking rate, text alignment (image-text correspondence degree), and unsafe grading. The data for the first two, quantitative, metrics are automatically generated and analyzed within the Python environment.
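As a rough illustration of how such an evaluation can be scripted, the sketch below submits prompts to both models, records whether they are blocked, and scores image-text alignment. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the diffusers checkpoint runwayml/stable-diffusion-v1-5 (whose bundled safety checker flags blocked outputs), the openai client for DALL-E 3 (which raises a content-policy error on refused prompts), and CLIP similarity as a stand-in for the unspecified image-text correspondence measure are all assumptions.

```python
# Minimal sketch (not the study's actual pipeline): measure blocking and
# image-text alignment for both models. Checkpoint names, the openai client
# usage, and the CLIP-based alignment measure are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from openai import BadRequestError, OpenAI
from transformers import CLIPModel, CLIPProcessor

sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
client = OpenAI()  # reads OPENAI_API_KEY from the environment
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def sd_generate(prompt: str):
    """Return (image, blocked) for Stable Diffusion V1-5."""
    out = sd_pipe(prompt)
    # The bundled safety checker blacks out flagged images and reports
    # them via nsfw_content_detected.
    return out.images[0], bool(out.nsfw_content_detected[0])


def dalle_blocked(prompt: str) -> bool:
    """True if DALL-E 3 refuses the prompt outright."""
    try:
        client.images.generate(model="dall-e-3", prompt=prompt, n=1)
        return False
    except BadRequestError:  # raised on content-policy violations
        return True


def blocking_rate(blocked_flags) -> float:
    """Share of prompts that were blocked."""
    return sum(blocked_flags) / len(blocked_flags)


def alignment_score(prompt: str, image) -> float:
    """CLIP similarity between prompt and image (higher = closer match)."""
    inputs = clip_proc(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    return clip_model(**inputs).logits_per_image.item()
```

Note that the standard CLIP checkpoint is trained predominantly on English text, so scoring Dutch prompts on an equal footing would require a multilingual variant.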
The latter, being qualitative in nature, requires manual annotation of the generated images. These images are graded based on a novel grading scheme, thereby quantifying the annotation process. This research considers a language a 'weak spot' when it has a significantly lower blocking rate, a similar or higher text alignment score, and more unsafe images. The outcomes from all three metrics are synthesized into the following results:

• For most domains, it is concluded that although initial blocking rates may be significantly lower across languages, this does not mean that more unsafe content will be generated.
• One weak spot is revealed within the sexually explicit domain, for images generated using the Stable Diffusion model. Here, Dutch blocking rates for diluted prompts were significantly lower (37%) than English ones (75%). A two-sample Kolmogorov-Smirnov (KS) test showed no statistically significant evidence that the Dutch prompts had lower alignment scores (see the first sketch after this list). Moreover, a smaller share of the images generated from Dutch diluted prompts was safe (88.89%) compared to English diluted prompts (93.06%), albeit the difference is small.
• DALL-E outperforms Stable Diffusion in terms of blocking rates across all domains and prompt types. Stable Diffusion does not flag violent prompts at all, signified by a 0% blocking rate. For sexually explicit content, DALL-E scores 95% (English) and 94% (Dutch), whereas Stable Diffusion only reaches blocking rates of 58% and 11% for English and Dutch prompts, respectively. This indicates a less sophisticated initial filter mechanism.
• Contrary to the blocking rates, text alignment scores are often lower for DALL-E images. Safety scores are slightly better for DALL-E within the sexually explicit domain, while the violence domain often scores worse compared to Stable Diffusion.
• DALL-E's content moderation system has seen improvements since it implemented a prompt-rewriting policy. Non-English prompts are first translated into English; DALL-E then further rewrites the prompt, adding extra details (this is also done for English prompts). Doing so decreases the probability of generating unsafe content while improving image aesthetics. Using Perspective API (see the second sketch after this list), it was found that the prompts' unsafe characteristics indeed decreased in most cases, with the exception of Dutch violence prompts.
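The weak-spot analysis above relies on a two-sample KS test comparing the distributions of alignment scores for English and Dutch prompts. Below is a minimal sketch using SciPy, with hypothetical score values standing in for the real evaluation data:

```python
# Minimal sketch: two-sample Kolmogorov-Smirnov test on alignment scores.
# The score values are hypothetical placeholders, not the study's data.
from scipy.stats import ks_2samp

english_scores = [0.71, 0.64, 0.80, 0.59, 0.75]
dutch_scores = [0.62, 0.58, 0.66, 0.70, 0.61]

stat, p_value = ks_2samp(english_scores, dutch_scores)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")
# A p-value >= 0.05 means no significant difference between the two
# distributions, i.e. no evidence that one language's alignment scores
# are systematically lower.
```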
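The check on DALL-E's rewritten prompts can be scripted against Perspective API's public comments:analyze endpoint. The sketch below is a minimal illustration and assumes a valid API key; the requested attributes are illustrative choices (the abstract does not specify which were used), and attribute availability varies per language.

```python
# Minimal sketch: score a prompt's unsafe characteristics with Perspective API.
# Requires a valid API key; the requested attributes are illustrative choices.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def unsafe_scores(text: str, lang: str, api_key: str) -> dict:
    """Return Perspective scores (0-1, higher = more unsafe) per attribute."""
    body = {
        "comment": {"text": text},
        "languages": [lang],  # e.g. "en" or "nl"
        "requestedAttributes": {"TOXICITY": {}, "THREAT": {}, "SEXUALLY_EXPLICIT": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    return {attr: s["summaryScore"]["value"] for attr, s in scores.items()}
```

Scoring the same prompt before and after the rewriting step and comparing the results indicates whether the rewrite actually lowered the prompt's unsafe characteristics.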
This research further builds on existing research regarding T2I GenAI model filter performance. In contrast to previous research, our work makes significant contributions to the scientific landscape by conducting a cross-language analysis that compares the performance of safety filters across English and Dutch prompts, thereby examining potential linguistic variability in AI safety mechanisms. Additionally, the introduction of a novel image grading scheme improves on previous research, since the image output is analyzed not only quantitatively but also qualitatively. Furthermore, this study provides critical insights into the impact of prompt rewriting on safety filter performance. While previous work by Hwang et al. (2024) explored DALL-E's prompt-rewriting policy, it only considered English prompts, and no research existed that quantitatively evaluated whether the rewritten prompt is actually safer. Our research addresses this gap.

The societal relevance of this research lies in its potential to enhance the safety and reliability of text-to-image generative AI models. By identifying, addressing, and informing readers about weaknesses in current AI safety filters, particularly across different languages and domains, this study can contribute to the development of more robust safeguards against the generation of harmful content. This is crucial in a digital age where AI-generated images can significantly impact public perception and information dissemination. Improved safety mechanisms can help mitigate the risks associated with disinformation and inappropriate content, thereby fostering a safer and more trustworthy online environment.

Keywords: Text-to-Image Generative AI, AI Safety Filters, Stable Diffusion, DALL-E, Content Moderation, Prompt Dilution.