Does data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects
Convenience Samples and Measurement Equivalence in Replication Research
Recommendation: posted 21 March 2023, validated 21 March 2023
Logan, C. (2023) Does data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects. Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=349
Related Stage 2 preprints:
Lindsay J. Alley, Jordan Axt, Jessica Kay Flake
https://osf.io/s5t3v
Recommendation
The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #3
DOI or URL of the report: https://osf.io/32unb
Version of the report: v1
Author's Reply, 18 Mar 2023
Decision by Corina Logan, posted 16 Mar 2023, validated 16 Mar 2023
Dear Lindsay, Jordan, and Jessica,
Thank you for your revision, which appropriately addressed the remaining comments.
In drafting my recommendation text, I went back through your answers to the questions at the submission page and determined that your answer to question 7 was incorrect. This is a quantitative study that tests a hypothesis (see research questions 1 and 2 in your introduction), therefore it needs a study design table. Please make a study design table according to the author guidelines (https://rr.peercommunityin.org/help/guide_for_authors#h_27513965735331613309625021), reference it in your main text, and include the table in your manuscript document either in the main text or as supplementary material. Note that research question 2 tests a hypothesis because it uses a significance test that tests for the existence of something, not just the amount of something. Research question 1 is an estimation problem, but would also benefit from being included in the study design table.
Also, please update your answer to question 7 in the report survey to choose the first option: "YES - THE RESEARCH INVOLVES AT LEAST SOME QUANTITATIVE HYPOTHESIS-TESTING AND THE REPORT INCLUDES A STUDY DESIGN TEMPLATE". If you need help with this, or would like me to change your answer for you, just let me know.
All my best,
Corina
Evaluation round #2
DOI or URL of the report: https://osf.io/32unb
Version of the report: v1
Author's Reply, 15 Mar 2023
Decision by Corina Logan, posted 28 Feb 2023, validated 28 Feb 2023
Thank you for your comprehensive revision and for responding to all of the comments so completely. I sent the manuscript back to one of the reviewers, who is very positive about the revision. Please see their review, which contains the answer to your question about where to place the additional terms. I have one additional note...
You wrote: “On the advice of a colleague, we have switched from using Bartlett’s factor scores for the sensitivity analysis to regression factor scores”
Please justify why this change was needed and why the new analysis is better suited to your study.
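For context (this is not part of the authors' materials), both score types can be extracted from the same fitted model, for example with lavaan; the model, data, and item names below are placeholders:

```r
library(lavaan)

# Hypothetical single-factor model; "dat" and the item names are placeholders.
fit <- cfa(' f =~ item1 + item2 + item3 + item4 ', data = dat)

scores_regression <- lavPredict(fit, method = "regression")
scores_bartlett   <- lavPredict(fit, method = "Bartlett")

# The two sets of scores are typically very highly correlated, but only
# Bartlett scores are conditionally unbiased estimates of the factor.
cor(scores_regression, scores_bartlett)
```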
Once these minor revisions are made and resubmitted, I will be ready to give IPA.
All my best,
Corina
Reviewed by Shinichi Nakagawa, 28 Feb 2023
Evaluation round #1
DOI or URL of the report: https://osf.io/32unb
Version of the report: v1
Author's Reply, 17 Feb 2023
Decision by Corina Logan, posted 29 Dec 2022, validated 29 Dec 2022
Dear Lindsay, Jordan, and Jessica,
Many thanks for submitting your registered report, which is on a topic that is very important for replication studies, as well as for other research that recruits students and/or crowdsourced online participants as subjects. I have received two expert reviews and both are positive with comments to address. Please revise your registered report and address all comments, then submit your revision. I will then send it back to the reviewers.
I'm looking forward to your revision,
Corina
Reviewed by Benjamin Farrar, 27 Dec 2022
In this Stage 1 Registered Report submission, Alley et al. propose to test for measurement non-equivalence between online and student samples across ManyLabs samples, including student samples, MTurk samples and a sample from Project Implicit. The analysis to decide which scales to include for equivalence testing has already been performed, and this is appropriately outlined in the level of bias control section of the RR. I was able to fully reproduce the inclusion analysis from the provided code.
Overall, I consider this to be a well thought-out and very well written submission that will be widely read, and it is great that the authors have chosen to submit this as a RR. The code annotation, in particular, was very clear - which is appreciated. The research question is valid and there is a strong rationale for testing measurement equivalence between convenience samples. While I am not an expert in testing measurement equivalence, the analysis pipeline was clear and, combined with the annotated code, there is sufficient methodological detail to closely replicate the proposed study procedures and to prevent undisclosed flexibility in the procedures and analyses. I could not find the individual analysis code mentioned in the paper (pg. 15: “Code for the following analyses can be found in the supplementary materials. A separate R file for each measure, and the files are named for the measure analyzed”), but this may be a placeholder for when each individual analysis has been performed, and I consider the “Planned Analysis Code” to sufficiently outline the analysis pipeline for all measures, which I was able to fully reproduce.
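For readers less familiar with the general approach, the following is a minimal sketch of the standard configural/metric/scalar invariance hierarchy in R with lavaan; it is not necessarily the authors' exact pipeline, and the model, data, and grouping-variable names are placeholders.

```r
library(lavaan)

# Hypothetical single-factor model for one included scale; "dat" and the
# item/group names are placeholders, not the authors' actual variables.
model <- ' f =~ item1 + item2 + item3 + item4 + item5 '

# Fit increasingly constrained multi-group models (student vs. online samples).
configural <- cfa(model, data = dat, group = "sample")
metric     <- cfa(model, data = dat, group = "sample",
                  group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "sample",
                  group.equal = c("loadings", "intercepts"))

# A significant chi-square difference (or a notable change in fit indices)
# between successive models flags non-equivalence at that level.
lavTestLRT(configural, metric, scalar)
```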
Below I list minor comments and questions for the authors, but I am confident this is a scientifically valid, thorough and very valuable submission that I look forward to reading once it is complete.
General:
Throughout the report, there was not much discussion of the power of the planned analyses to detect measurement inequivalence. This may be a moot point as the sample sizes in the current project are quite large, and the exact analyses to be conducted are contingent on the hierarchical analysis plan. However, it would be a good addition to have a power curve to make explicit the range of effect sizes that would be detected in a “typical” analysis, and their theoretical meaning – although to some extent this is already achieved in the sensitivity analysis, which is great.
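To illustrate the kind of power curve meant here, below is a rough simulation-based sketch, assuming a single-factor scale, a between-group loading difference as the effect of interest, and the lavaan package; all names and parameter values are illustrative rather than taken from the submission.

```r
library(lavaan)

# Population models: equal loadings in the "student" group, one weaker loading
# in the "online" group (the non-equivalence to be detected). Values are illustrative.
pop_student <- ' f =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.7*x4 '
pop_online  <- ' f =~ 0.7*x1 + 0.5*x2 + 0.7*x3 + 0.7*x4 '

analysis_model <- ' f =~ x1 + x2 + x3 + x4 '

one_rep <- function(n_per_group) {
  d1 <- simulateData(pop_student, sample.nobs = n_per_group)
  d2 <- simulateData(pop_online,  sample.nobs = n_per_group)
  d1$sample <- "student"; d2$sample <- "online"
  dat <- rbind(d1, d2)
  configural <- cfa(analysis_model, data = dat, group = "sample")
  metric     <- cfa(analysis_model, data = dat, group = "sample",
                    group.equal = "loadings")
  # "Success" = the equal-loadings constraint is rejected
  lavTestLRT(configural, metric)[2, "Pr(>Chisq)"] < .05
}

power_at <- function(n_per_group, reps = 200) mean(replicate(reps, one_rep(n_per_group)))

# Rough power curve over per-group sample sizes
sapply(c(100, 250, 500, 1000), power_at)
```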
Ordered minor comments:
Page 6: This contains a very clear and accessible description of measurement equivalence. Could a more explicit formal definition also be provided after the first sentence, with any appropriate references?
Page 8: “It would also be ideal for replication researchers to test for ME between original and replication studies (Fabrigar & Wegener, 2016), but this is difficult in practice: for the most part, the original studies that are replicated do not have publicly available data.” Presumably the small sample sizes of many original studies would also make it difficult to detect meaningful measurement non-equivalence between samples.
Page 9: If the authors want a reference on between-platform differences in participant responses: Peer, E., Rothschild, D., Gordon, A., Evernden, Z., & Damer, E. (2022). Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54(4), 1643-1662.
Page 9/10: “While mean differences on a measure of a construct do not necessarily indicate non-equivalence, any of these differences in sample characteristics could potentially contribute to non-equivalent measurement.” Could this be unpacked? If there are substantial differences in some sample characteristics, is absolute equivalence plausible, or would you expect some non-equivalence but perhaps with very small effect sizes?
Page 11: RQ2: it could be made explicit here that this will be tested by looking at effect sizes as well as significance.
Page 12: “which is useful more broadly than just replication research” – good justification; if anything, you are underselling the importance of these research questions for interpreting the results of any experiments!
Page 12: “which requires a series of preliminary analyses to determine if the data satisfy assumptions”. It would be useful to list all of the assumptions here.
Page 13: “Type I error rates for equivalence tests may be inflated when the baseline model is misspecified”. For accessibility, could an example of model misspecification increasing Type I error rates be given here, even if hypothetical?
Page 17: The authors may be more aware of this than I am and this may be a non-issue, but some concerns have been raised about using scaled χ2-difference tests to compare nested models (see http://www.statmodel.com/chidiff.shtml).
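As a point of reference (not part of the submission), the scaled difference test is what lavTestLRT produces in lavaan when a robust estimator is used, and the strictly positive 2010 variant can be requested explicitly; the model, data, and group names below are placeholders:

```r
library(lavaan)

# Robust (scaled) test statistics via the MLR estimator; names are placeholders.
fit_free  <- cfa(model, data = dat, group = "sample", estimator = "MLR")
fit_equal <- cfa(model, data = dat, group = "sample", estimator = "MLR",
                 group.equal = "loadings")

# The 2001 scaled difference can occasionally be negative; the 2010 variant
# is one commonly used alternative.
lavTestLRT(fit_free, fit_equal, method = "satorra.bentler.2010")
```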
Results: The proposed data presentation looks appropriate, and thank you to the authors for including these.
Finally, I don’t think this is necessary to address in the current project, but I thought about this several times while reading the manuscript: while the focus is on detecting measurement inequivalence between samples, there is some ambiguity about what the authors consider the sample to be – is it the experimental units (individual participants) only, the experimental units plus the setting (participants + online/location), or are the experimenters included too? Replications vary across experimental units, treatments, measurements and settings (Machery, E. [2020]. What is a replication? Philosophy of Science, 87(4), 545-567), so it might be worth considering the degree to which measurement non-equivalence could be introduced from other sources, e.g. different experimenters, that would not normally be associated with the sample but would nevertheless differ between the samples.
Best of luck with the project!
Benjamin Farrar