Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models

Authors	Organisciak, Peter; Acar, Selcuk; Dumas, Denis; Berthiaume, Kelly
Title	Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models
Source	(2023), (32 pages)
Language	English
Document type	print; online; monograph
Keywords	Automation; Computer Assisted Testing; Scoring; Creative Thinking; Creativity Tests; Semantics; Prompting; Test Items; Alternative Assessment; Models; Natural Language Processing; Effect Size; Robustness (Statistics); Evaluation; User Guidance; Test Content; Analogy Model; Natural Language; Resilience
Abstract	Automated scoring for divergent thinking (DT) seeks to overcome a key obstacle to creativity measurement: the effort, cost, and reliability of scoring open-ended tests. For a common test of DT, the Alternate Uses Task (AUT), the primary automated approach casts the problem as a semantic distance between a prompt and the resulting idea in a text model. This work presents an alternative approach that greatly surpasses the performance of the best existing semantic distance approaches. Our system, "Ocsai," fine-tunes deep neural network-based large language models (LLMs) on human-judged responses. Trained and evaluated against one of the largest collections of human-judged AUT responses, with 27 thousand responses collected from nine past studies, our fine-tuned large language models achieved up to r = 0.81 correlation with human raters, greatly surpassing current systems (r = 0.12-0.26). Further, learning transfers well to new test items and the approach is still robust with small numbers of training labels. We also compare prompt-based zero-shot and few-shot approaches, using GPT-3, ChatGPT, and GPT-4. This work also suggests a limit to the underlying assumptions of the semantic distance model, showing that a purely semantic approach that uses the stronger language representation of LLMs, while still improving on existing systems, does not achieve comparable improvements to our fine-tuned system. The increase in performance can support stronger applications and interventions in DT and opens the space of automated DT scoring to new areas for improving and understanding this branch of methods. [This paper was published in "Thinking Skills and Creativity" v49 Article 101356 2023.] (As Provided).
Recorded by	ERIC (Education Resources Information Center), Washington, DC
Update	2024/1/01
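
The abstract describes a baseline that scores an AUT response by its semantic distance from the prompt, and it reports system quality as a correlation (r) with human raters. The sketch below illustrates those two ideas in general terms only. It is not the paper's implementation: the three-dimensional vectors, the example ratings, and all function names are hypothetical, and a real system would take its vectors from a text embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_distance(prompt_vec, response_vec):
    """Distance-style score: 1 minus cosine similarity (higher = further from the prompt)."""
    return 1.0 - cosine_similarity(prompt_vec, response_vec)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: made-up embeddings for one prompt and three responses,
# plus made-up human originality ratings for the same three responses.
PROMPT_VEC = [1.0, 0.0, 0.0]
RESPONSE_VECS = [
    [0.9, 0.1, 0.0],  # close to the prompt (a common use)
    [0.5, 0.5, 0.0],  # moderately distant
    [0.1, 0.2, 0.9],  # far from the prompt (an unusual use)
]
HUMAN_RATINGS = [1.0, 2.5, 4.5]

# Automated scores from the distance baseline, then agreement with the raters.
distances = [semantic_distance(PROMPT_VEC, v) for v in RESPONSE_VECS]
r = pearson_r(distances, HUMAN_RATINGS)
print("semantic distances:", [round(d, 3) for d in distances])
print("correlation with human ratings:", round(r, 3))
```

With real data, `r` computed this way is the kind of figure the abstract compares across systems. The fine-tuned approach it describes instead learns to predict the human ratings directly from labeled responses.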