Haase, Jennifer and Hanel, Paul HP and Polkutta, Sebastian (2025) Has the Creativity of Large-Language Models peaked? Journal of Creativity. p. 100113. DOI https://doi.org/10.1016/j.yjoc.2025.100113 (In Press)
Abstract
Numerous studies have reported that large language models (LLMs) can match or even surpass human performance on creative tasks. However, it remains unclear whether LLMs have become more creative over time and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs (including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek) on two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). We found no evidence of increased creative performance over the past 18–24 months, with GPT-4 performing worse than in previous studies. On the more widely used AUT, all models performed better, on average, than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below average to genuinely original. This variability has important implications for both creativity research and practical applications: ignoring it risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompt also affected models differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using generative AI (GenAI) tools in creative contexts.
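For readers unfamiliar with the DAT mentioned in the abstract, the sketch below illustrates how it is typically scored: the mean pairwise cosine distance between embeddings of the named words, scaled by 100 (following the standard formulation of Olson et al., 2021). This is a minimal illustration, not the paper's own scoring code; the toy vectors are hypothetical stand-ins for the GloVe embeddings used in practice.

```python
import itertools
import numpy as np

# Hypothetical toy embeddings; the published DAT scorer uses GloVe vectors.
TOY_EMBEDDINGS = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "dog":    np.array([0.8, 0.2, 0.1]),
    "galaxy": np.array([0.0, 0.9, 0.4]),
    "spoon":  np.array([0.1, 0.2, 0.9]),
}

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dat_score(words: list[str]) -> float:
    """Mean pairwise cosine distance between word vectors, scaled by 100.

    Higher scores mean the words are more semantically distant,
    i.e. more divergent.
    """
    vectors = [TOY_EMBEDDINGS[w] for w in words]
    distances = [cosine_distance(u, v)
                 for u, v in itertools.combinations(vectors, 2)]
    return 100.0 * sum(distances) / len(distances)

print(dat_score(["cat", "dog", "galaxy", "spoon"]))  # higher = more divergent
```

Because the score is deterministic given the words, the intra-model variability the paper reports comes from the LLM producing different word lists across repeated runs of the same prompt, which is why repeated assessment matters.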
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Generative AI, benchmark testing, creativity, Large Language Models, LLMs |
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Psychology, Department of |
| Date Deposited: | 07 Nov 2025 14:17 |
| Last Modified: | 10 Nov 2025 12:27 |
| URI: | http://repository.essex.ac.uk/id/eprint/41909 |
Available files
Filename: C_of_LLMs___Version2_Rev_2.docx
Embargo Date: 1 January 2100