PCI Registered Reports

HUSSAIN Zak

Faculty of Psychology, University of Basel, Basel, Switzerland
Computer science, Social sciences

Recommendations: 0

Review: 1

Website https://zak-hussain.github.io/

Areas of expertise

Cognitive science, language models, natural language processing

Review: 1

STAGE 2

Language models accurately infer correlations between psychological items and scales from text alone

Björn E. Hommel, Ruben C. Arslan https://osf.io/preprints/psyarxiv/kjuce_v4

Using large language models to predict relationships among survey scales and items

Recommended by Matti Vuorre based on reviews by Hu Chuan-Peng, Johannes Breuer and Zak Hussain

How are the thousands of existing, and yet to be created, psychological measurement instruments related, and how reliable are they? Here, Hommel and Arslan (2025) trained a language model, SurveyBot3000, to answer these questions efficiently and without human intervention.

In their Stage 1 submission, the authors described the training and pilot validation of a statistical model whose inputs are psychological measurement items or scales, and outputs are the interrelationships between the items and scales, and their reliabilities. The pilot results were promising: SurveyBot3000's predicted inter-scale correlations were strongly associated with empirical correlations from existing human data.

In this Stage 2 report, the authors further examined the model's performance and validity. In accordance with their Stage 1 plans, they collected new data from 450 demographically diverse participants, and tested the model's performance fully out of sample. The model's item-to-item correlations correlated at r=.59 with corresponding item-to-item correlations from human participants. The scale-to-scale correlations were even more accurate at r=.83, indicating reasonable performance. Nevertheless, the authors remain justifiably cautious in their recommendation that the "synthetic estimates can guide an investigation but need to be followed up by human researchers with human data."
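To make the accuracy figures above concrete, here is a minimal sketch of one plausible way such a number can be computed: take the unique item pairs from a model-predicted correlation matrix and from the corresponding empirical matrix, then correlate the two sets of values. This is an illustration only and not the authors' analysis code; the helper names (`pearson`, `upper_triangle`, `accuracy`) and the toy matrices are hypothetical, and the report's own estimates may involve additional steps.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def upper_triangle(matrix):
    """Unique off-diagonal entries of a symmetric correlation matrix."""
    return [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]


def accuracy(synthetic, empirical):
    """Correlate predicted item-pair correlations with observed ones."""
    return pearson(upper_triangle(synthetic), upper_triangle(empirical))


# Toy 4-item example with invented values (NOT data from the report).
# 'synthetic' stands in for model-predicted correlations,
# 'empirical' for correlations estimated from human responses.
synthetic = [
    [1.00, 0.55, 0.10, -0.30],
    [0.55, 1.00, 0.20, -0.25],
    [0.10, 0.20, 1.00, 0.05],
    [-0.30, -0.25, 0.05, 1.00],
]
empirical = [
    [1.00, 0.48, 0.02, -0.41],
    [0.48, 1.00, 0.31, -0.12],
    [0.02, 0.31, 1.00, 0.15],
    [-0.41, -0.12, 0.15, 1.00],
]

print(round(accuracy(synthetic, empirical), 2))
```

Only the upper triangle is used so that each item pair is counted once and the diagonal (each item's perfect correlation with itself) does not inflate the result. The same function would apply to scale-to-scale matrices.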

The authors documented all deviations between Stage 2 execution and Stage 1 plans in extensive online supplements. These supplements also addressed other possible issues, such as data leakage (the results being present in the training data) and the robustness of the results across different exclusion criteria.

The authors' proposed psychometric approach and tool, which is freely available as an online app, could prove valuable for researchers looking to use or adapt existing scales or items, or to develop new ones. More generally, these results add to the growing literature on human-AI research collaboration and highlight a practical application of tools that remain novel to many researchers in the field. As such, this Stage 2 report and SurveyBot3000 promise to contribute positively to the field.

The Stage 2 report was evaluated by two reviewers who also reviewed the Stage 1 report, and a new expert in the field. On aggregate, the reviewers' comments were helpful but relatively minor; the authors improved their work in a resubmission, and the recommender judged accordingly that the manuscript met the Stage 2 criteria for recommendation.

URL to the preregistered Stage 1 protocol: https://osf.io/2c8hf

Level of bias control achieved: Level 6. No part of the data or evidence that was used to answer the research question was generated until after IPA.

References

Hommel, B. E. & Arslan, R. C. (2025). Language models accurately infer correlations between psychological items and scales from text alone [Stage 2]. Acceptance of Version 4 by Peer Community in Registered Reports. https://doi.org/10.31234/osf.io/kjuce_v4
