Recommendation

Using large language models to predict relationships among survey scales and items from text

Matti Vuorre, based on reviews by Hu Chuan-Peng, Johannes Breuer and 1 anonymous reviewer
A recommendation of:

Language models accurately infer correlations between psychological items and scales from text alone

Submission: posted 22 April 2024
Recommendation: posted 15 October 2024, validated 16 October 2024
Cite this recommendation as:
Vuorre, M. (2024) Using large language models to predict relationships among survey scales and items from text. Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=762

Recommendation

How are the thousands of existing, and yet to be created, psychological measurement instruments related, and how reliable are they? Hommel and Arslan (2024) have trained a language model, SurveyBot3000, to answer these questions efficiently and without human intervention.
 
In their Stage 1 submission, the authors describe the training and pilot validation of a statistical model that takes psychological measurement items or scales as input and returns the interrelations among those items and scales, along with their reliabilities. The pilot results are promising: SurveyBot3000's predicted inter-scale correlations were very strongly associated with empirical correlations from existing human data.
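To make the approach concrete, here is a minimal illustrative sketch (not the authors' pipeline or code): item texts are embedded with a generic sentence-transformer encoder, cosine similarities between the embeddings are treated as predicted inter-item correlations, and those predictions are then correlated with correlations observed in human response data. The encoder, example items, and "observed" values below are stand-ins.

```python
# Illustrative sketch only: a generic sentence transformer stands in for the
# trained SurveyBot3000 model; items and "observed" correlations are made up.
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

items = [
    "I am the life of the party.",
    "I don't talk a lot.",
    "I feel comfortable around people.",
]

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = encoder.encode(items, normalize_embeddings=True)
predicted = emb @ emb.T  # cosine similarities as predicted item correlations

# Accuracy check in the spirit of the pilot validation: correlate the
# predicted off-diagonal entries with empirically observed correlations.
observed = np.array([[1.00, -0.46, 0.52],
                     [-0.46, 1.00, -0.38],
                     [0.52, -0.38, 1.00]])  # illustrative values only
upper = np.triu_indices_from(predicted, k=1)
accuracy, _ = pearsonr(predicted[upper], observed[upper])
print(f"accuracy r = {accuracy:.2f}")
```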
 
The authors now plan a further examination of their model's performance and validity. They will collect new test data from a large number of participants and again evaluate the model's performance fully out of sample. Reviewers deemed these plans and the associated analyses suitable. The anticipated results, along with the existing pilot results, promise a very useful methodological innovation to aid researchers both in selecting and evaluating existing measures and in developing and testing new ones.
 
The Stage 1 submission was reviewed over two rounds by three reviewers with expertise in the area. All reviewers identified the initial submission as timely and important, and suggested mostly editorial improvements to the Stage 1 report. The relatively minor suggestions remaining after two rounds of review can be taken into account during preparation of the Stage 2 report.
 
URL to the preregistered Stage 1 protocol: https://osf.io/2c8hf
 
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
 
List of eligible PCI RR-friendly journals:
 
 
References
 
Hommel, B. E., & Arslan, R. C. (2024). Language models accurately infer correlations between psychological items and scales from text alone. In principle acceptance of Version 4 by Peer Community in Registered Reports. https://osf.io/2c8hf
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviewed by anonymous reviewer 1, 07 Oct 2024

Thank you for the opportunity to read through the comments from other reviewers and the revisions. The revisions are well done (thorough and thoughtful), and I have no further suggestions for the authors at this time.

Still, I will take the opportunity to repeat my earlier comment that I am quite excited to see this registered report move forward -- I think it will make a *major* contribution to the psychological literature in several respects/sub-disciplines!

Reviewed by Johannes Breuer, 25 Sep 2024

As before, I very much enjoyed reading the manuscript and am confident that the study as well as the paper will make a meaningful and valuable contribution to the field.

The authors have done a very good job in addressing my comments from the previous round of reviews and, I believe, also those from the other reviewers (but they are certainly better able to judge this than me).

I can say that I look forward to reviewing the Stage 2 manuscript. I only have a few remaining/further minor remarks that can be taken into account for the Stage 2 manuscript (so no need for another Stage 1 revision from my side).

Regarding my previous comment about (additional) data quality checks and the use of LLMs by crowdworkers, and the authors’ response to it: there is quite a bit of work on bot detection for survey studies (see, e.g., Xu et al., 2022; Yarrish et al., 2019; Zhang et al., 2022). Additionally, two recent preprints have looked at the use of generative AI a) “in crowdwork for web and social media research” (Christoforou et al., 2024) and b) by survey participants answering open-ended questions (Zhang et al., 2024). A study that I recently saw presented at a conference also investigated the detection of “AI-based bots in web surveys” (see https://www.destatis.de/DE/Ueber-uns/Kolloquien-Tagungen/Veranstaltungen/15-wissenschaftliche-tagung/Downloads/B2_06_hoehne.pdf?__blob=publicationFile). While I agree with the authors that the use of LLMs by survey respondents may not (yet?) be widespread, I believe that this is an important issue and potential limitation to be addressed in the Stage 2 report.

Also, as a follow-up to my previous comment regarding data quality checks and (participant) exclusion criteria (which the authors have very thoroughly addressed), the authors might be interested to hear that there is a new R package that can be used to check for response patterns and calculate response time indicators: https://matroth.github.io/resquin/ 

In the response letter, the authors write that they have rephrased “a representative US sample” to “a quota sample balanced for age, sex and ethnicity according to the 2021 US Census using Prolific's inbuilt quota sampling”. However, in the revised version of the manuscript that I received, both the text in the “Sampling Plan” section and the Design Table still use the term “representative”. I also think it is important to note that this is a non-probability sample.

Literature cited in this review

Christoforou, E., Demartini, G., & Otterbacher, J. (2024). Generative AI in Crowdwork for Web and Social Media Research: A Survey of Workers at Three Platforms. Proceedings of the International AAAI Conference on Web and Social Media, 18, 2097–2103. https://doi.org/10.1609/icwsm.v18i1.31452

Xu, Y., Pace, S., Kim, J., Iachini, A., King, L. B., Harrison, T., DeHart, D., Levkoff, S. E., Browne, T. A., Lewis, A. A., Kunz, G. M., Reitmeier, M., Utter, R. K., & Simone, M. (2022). Threats to Online Surveys: Recognizing, Detecting, and Preventing Survey Bots. Social Work Research, 46(4), 343–350. https://doi.org/10.1093/swr/svac023

Yarrish, C., Groshon, L., Mitchell, J. D., Appelbaum, A., Klock, S., Winternitz, T., & Friedman-Wheeler, D. G. (2019). Finding the signal in the noise: Minimizing responses from bots and inattentive humans in online research. The Behavior Therapist, 42(7), 235–242.

Zhang, S., Xu, J., & Alvero, A. (2024). Generative AI Meets Open-Ended Survey Responses: Participant Use of AI and Homogenization. https://doi.org/10.31235/osf.io/4esdp

Zhang, Z., Zhu, S., Mink, J., Xiong, A., Song, L., & Wang, G. (2022). Beyond Bot Detection: Combating Fraudulent Online Survey Takers. Proceedings of the ACM Web Conference 2022, 699–709. https://doi.org/10.1145/3485447.3512230

Reviewed by Hu Chuan-Peng, 18 Sep 2024

The authors addressed my questions well. I am looking forward to seeing the final results of this interesting project!


Evaluation round #1

DOI or URL of the report: https://osf.io/preprints/psyarxiv/kjuce

Version of the report: 2

Author's Reply, 05 Sep 2024

Decision by Matti Vuorre, posted 15 Jun 2024, validated 17 Jun 2024

Dear Dr Hommel & Dr Arslan,

Thank you for submitting your Stage 1 Registered Report "Language models accurately infer correlations between psychological items and scales from text alone" for evaluation at PCI RR. Thank you also to the three reviewers who submitted their evaluations. I concur with the reviewers that the manuscript is interesting and that you have already done exciting work with the pilot study; yet there are aspects of the proposed study that need additional work. I am therefore inviting you to submit a revised manuscript with responses to the reviewers' comments.

I would suggest you pay particular attention to ensuring data quality, and to further clarifying and justifying the US-based representative sampling plan (as suggested by JB). An anonymous reviewer also suggested sticking closer to a traditional psychology structure (introduction, Study 1 methods, findings, ...) to guide readers. Finally, if hypotheses must be stated, HC-P suggests that they be further clarified.

Thank you, and looking forward to the revision
Matti Vuorre

Reviewed by anonymous reviewer 1, 13 Jun 2024

Reviewed by Johannes Breuer, 24 May 2024

First of all, I want to say that I sincerely enjoyed reading the manuscript. It covers a highly relevant and timely topic and I am quite certain that the study can make a very valuable contribution to the field. The manuscript is also written in an engaging and accessible manner, which is especially relevant for a fairly technical and sophisticated methodological study such as this one. I also very much appreciate the provision of the ample and helpful supplementary materials (code & data on OSF & GitHub; statistical reports and interactive plots via website + app on huggingface). As a sidenote on this: I tested the app on huggingface with a scale from my area of research and it worked quite well.

That being said, I have a few (mostly minor) questions and suggestions that the authors might want to consider a) for the preregistration and the design of the study or b) for writing a full paper later on. I have divided my remarks accordingly in the following.

a) Preregistration and study design

The authors might want to consider additional data quality checks (besides attention check items and completion time) as recent research has shown that LLM use is fairly common among crowdworkers (Veselovsky et al., 2023). This could, e.g., be checks for patterns in the responses or contradictory responses to items that should be (positively) related (or, vice versa, same responses to items that should not be related).
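For illustration, a minimal sketch of two such checks is given below; the language (Python), thresholds, item pairings, and data are hypothetical and would need to be adapted and justified in the actual protocol.

```python
# Hypothetical sketch of response-pattern checks on a wide-format response
# matrix (rows = participants, columns = items on a 1-5 Likert scale).
import numpy as np
import pandas as pd

def longstring(row: pd.Series) -> int:
    """Length of the longest run of identical consecutive responses."""
    values = row.to_numpy()
    longest, current = 1, 1
    for prev, curr in zip(values[:-1], values[1:]):
        current = current + 1 if curr == prev else 1
        longest = max(longest, current)
    return longest

responses = pd.DataFrame(
    np.random.randint(1, 6, size=(5, 20)),
    columns=[f"item_{i:02d}" for i in range(1, 21)],
)

# Flag straightlining, e.g., 15 or more identical responses in a row.
responses["longstring"] = responses.apply(longstring, axis=1)
flag_straightline = responses["longstring"] >= 15

# Flag contradictions: strongly agreeing with an item and also with a
# (hypothetical) reverse-keyed counterpart.
flag_contradiction = (responses["item_01"] >= 4) & (responses["item_02"] >= 4)

print(responses.loc[flag_straightline | flag_contradiction].index.tolist())
```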

The "Measures" section states that the study will also include scales “from other social sciences”. While it makes sense to assess generalizability beyond psychology, maybe it may be better to work with smaller steps and focus on psychology first, and then separately (and maybe also more systematically) assess generalizability to other disciplines/areas within the social and behavioral sciences? It may also make a difference whether a scale assesses (personality) attributes or attitudes. On a related note, the authors may want to also consider the domain of the scales in the analyses (e.g., to compare prediction accuracies for different domains).

In the “Sampling Plan” section, the authors report that they plan to “collect a representative US sample of n = 450” (note: in Table 1, row 1, column 3, it says N = 400). Regardless of my personal issue with the term and concept of representativeness, it would be helpful to indicate what “representative” refers to here (e.g., the distributions of age, gender, etc. reflecting the respective distributions in the 2022 US Census data). Also, I assume that this is a non-probability sample, which is a detail that should be reported explicitly, I would say.

Is the exclusion of participants “who fail at least three out of five attention checks” based on recommendations from the literature or previous research by the authors? Relatedly: What is the threshold of a minimum survey completion time of 11 minutes based on?

Regarding the hypotheses as presented in Table 1: Are the hypotheses really meant to test exact point estimates (e.g., an accuracy of exactly r = .71 for inter-item correlations as in the 1st hypothesis)? This would then mean that finding an accuracy of .7 would not confirm this hypothesis. Hence, I was wondering whether it may be more feasible to test for a certain range, such as the 95%-CIs reported in the "Introduction" section. 

b) Full paper

The authors discuss several application possibilities for their work. Another area where this work could be useful is simulation studies (e.g., for a priori power analyses), for which the correlation and reliability estimates produced by the model could be used.
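As a rough illustration of this idea (a sketch with made-up numbers, not a recommendation of a specific implementation), one could draw synthetic data from a model-predicted correlation matrix and estimate power by simulation:

```python
# Hypothetical example: use a model-predicted correlation matrix to simulate
# data and estimate power for detecting one predicted association at size n.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2024)
predicted_corr = np.array([[1.00, 0.35, 0.20],
                           [0.35, 1.00, 0.10],
                           [0.20, 0.10, 1.00]])  # made-up model output

def estimate_power(n, pair=(0, 1), alpha=0.05, n_sims=2000):
    hits = 0
    for _ in range(n_sims):
        data = rng.multivariate_normal(np.zeros(3), predicted_corr, size=n)
        _, p = pearsonr(data[:, pair[0]], data[:, pair[1]])
        hits += p < alpha
    return hits / n_sims

for n in (50, 100, 200):
    print(n, round(estimate_power(n), 2))
```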

Could expert (or simply respondent) ratings or prediction markets serve as a further benchmark for assessing the model estimates?

Sampling error is mentioned and addressed in various places. As the focus of the manuscript and study is on survey data, it might make sense to refer and relate the work to the Total Survey Error (TSE) framework (see, e.g., Biemer, 2010).

For a full paper, I would also suggest discussing how this work relates to broader discussions of how AI can influence and perhaps also improve research in psychology and the social and behavioral sciences more generally (see, e.g., Bail, 2024; Demszky et al., 2023). In doing so, it would also be important to take into account and refer to more critical recent takes on the use of AI in science (e.g., Messeri & Crockett, 2024).

Finally, on a somewhat more political note: Can the use case of this study be (yet) a(nother) argument for not making psychological scales proprietary (esp. if their development is paid for by public funding)? The argument I see here is that we need open (source) models as well as open training data, and, in this case, open training data means open access scales.

Literature cited in this review

Bail, C. A. (2024). Can Generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21). https://doi.org/10.1073/pnas.2314021121

Biemer, P. P. (2010). Total Survey Error: Design, Implementation, and Evaluation. Public Opinion Quarterly, 74(5), 817–848. https://doi.org/10.1093/poq/nfq058

Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., Johnson, M., Jones, M., Krettek-Cobb, D., Lai, L., JonesMitchell, N., Ong, D. C., Dweck, C. S., Gross, J. J., & Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology. https://doi.org/10.1038/s44159-023-00241-5

Messeri, L., & Crockett, M. J. (2024). Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002), 49–58. https://doi.org/10.1038/s41586-024-07146-0

Veselovsky, V., Ribeiro, M. H., Cozzolino, P., Gordon, A., Rothschild, D., & West, R. (2023). Prevalence and prevention of large language model use in crowd work. https://doi.org/10.48550/ARXIV.2310.15683

Reviewed by Hu Chuan-Peng, 19 May 2024

I read this manuscript with great interest.

The goal of this RR is to test the generalizability of the predictive power of a language model, "SurveyBot3000", by collecting a new dataset. The "hard" part of this study is finished and has been tested in the pilot study. The sampling plan is clear and the strategy is straightforward.

The only critical issue is ambiguity in the hypothesis testing. In the design table, the authors describe the following situations: the accuracy "matches or exceeds that found in the pilot study", "deteriorates but is still substantial", is "halved", or is "reduced below the accuracy of the pre-trained model". These wordings are ambiguous; quantitative criteria would be more helpful, for example an equivalence test comparing the correlations from the confirmatory study to those of the pilot study.
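Purely as an illustration of one such quantitative criterion (the equivalence margin, alpha level, and sample sizes below are placeholders that the authors would need to justify), an equivalence test could be run as two one-sided tests on Fisher's z scale:

```python
# Hypothetical TOST equivalence test comparing two independent accuracy
# correlations (pilot vs. confirmatory) on Fisher's z scale.
import numpy as np
from scipy.stats import norm

def tost_correlations(r_pilot, n_pilot, r_new, n_new, margin_z=0.1):
    """Return the TOST p-value; equivalence is claimed if it is below alpha."""
    diff = np.arctanh(r_pilot) - np.arctanh(r_new)
    se = np.sqrt(1 / (n_pilot - 3) + 1 / (n_new - 3))
    p_lower = 1 - norm.cdf((diff + margin_z) / se)  # H0: diff <= -margin
    p_upper = norm.cdf((diff - margin_z) / se)      # H0: diff >= +margin
    return max(p_lower, p_upper)

# Illustrative numbers only.
print(tost_correlations(r_pilot=0.71, n_pilot=300, r_new=0.68, n_new=450))
```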

There are also a few minor issues:
(1) One of the OSF links is invalid (https://osf.io/xbp8v); please re-check it.
(2) The logical flow of the Methods section would be clearer if there were a roadmap for the whole study (model training → pilot study → confirmatory study). Such a roadmap may help readers chunk the first few technical parts (pre-training and domain adaptation) together and focus more on the confirmatory study, which is the main body of the current RR.
(3) Constrain future statements about "generalizability": given that the current training is largely based on English data, the utility of the model for other languages/samples needs further validation.
