Do data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects

Based on reviews by Benjamin Farrar and Shinichi Nakagawa
A recommendation of:

Convenience Samples and Measurement Equivalence in Replication Research


Submission: posted 29 November 2022
Recommendation: posted 21 March 2023, validated 21 March 2023
Cite this recommendation as:
Logan, C. (2023) Do data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects. Peer Community in Registered Reports.

Related stage 2 preprints:


Comparative research is how evidence is generated to support or refute broad hypotheses (e.g., Pagel 1999). However, the foundations of such research must be solid if one is to arrive at the correct conclusions. Determining the external validity (the generalizability across situations/individuals/populations) of the building blocks of comparative data sets allows researchers to place appropriate caveats around the robustness of their conclusions (Steckler & McLeroy 2008).
In this registered report, Alley and colleagues plan to tackle the external validity of comparative research that relies on subjects who are either university students or participants recruited via an online platform (Alley et al. 2023). They will determine whether data from these two types of subjects have measurement equivalence - whether the same trait is measured in the same way across groups. Although they use data from studies involved in the Many Labs replication project to evaluate this question, their results will be of crucial importance to other comparative researchers whose data are generated from these two sources (students and online crowdsourcing). If Alley and colleagues show that these two types of subjects have measurement equivalence, then equivalence is more likely to hold for other studies relying on these types of subjects as well. If measurement equivalence is not found, then it is a warning to others to evaluate their experimental design to improve validity. In either case, the study gives researchers a way to test measurement equivalence for themselves, because the code is well annotated and openly available for others to use.

The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 2. At least some data/evidence that will be used to answer the research question has been accessed and partially observed by the authors, but the authors certify that they have not yet observed the key variables within the data that will be used to answer the research question AND they have taken additional steps to maximise bias control and rigour (e.g. conservative statistical threshold; recruitment of a blinded analyst; robustness testing, multiverse/specification analysis, or other approach) 
List of eligible PCI RR-friendly journals:
Alley, L. J., Axt, J., & Flake, J. K. (2023). Convenience Samples and Measurement Equivalence in Replication Research, in principle acceptance of Version 4 by Peer Community in Registered Reports.
Pagel, M. (1999). Inferring the historical patterns of biological evolution. Nature, 401, 877-884.
Steckler, A., & McLeroy, K. R. (2008). The importance of external validity. American Journal of Public Health, 98, 9-10.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #3

DOI or URL of the report:

Version of the report: v1

Author's Reply, 18 Mar 2023

Decision by the recommender, posted 16 Mar 2023, validated 16 Mar 2023

Dear Lindsay, Jordan, and Jessica,

Thank you for your revision, which appropriately addressed the remaining comments. 

In drafting my recommendation text, I went back through your answers to the questions on the submission page and determined that your answer to question 7 was incorrect. This is a quantitative study that tests a hypothesis (see research questions 1 and 2 in your introduction); therefore it needs a study design table. Please make a study design table according to the author guidelines, reference it in your main text, and include the table in your manuscript document either in the main text or as supplementary material. Note that research question 2 tests a hypothesis because it uses a significance test that tests for the existence of something, not just the amount of something. Research question 1 is an estimation problem, but would also benefit from being included in the study design table.

Also, please update your answer to question 7 in the report survey to choose the first option: "YES - THE RESEARCH INVOLVES AT LEAST SOME QUANTITATIVE HYPOTHESIS-TESTING AND THE REPORT INCLUDES A STUDY DESIGN TEMPLATE". If you need help with this, or would like me to change your answer for you, just let me know.

All my best,


Evaluation round #2

DOI or URL of the report:

Version of the report: v1

Author's Reply, 15 Mar 2023

Decision by the recommender, posted 28 Feb 2023, validated 28 Feb 2023

Thank you for your comprehensive revision and for responding to all of the comments so completely. I sent the manuscript back to one of the reviewers, who is very positive about the revision. Please see their review, which contains the answer to your question about where to place the additional terms. I have one additional note...

You wrote: “On the advice of a colleague, we have switched from using Bartlett’s factor scores for the sensitivity analysis to regression factor scores”

Please justify why this change was needed and why the new analysis is better suited to your study.
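For readers unfamiliar with the distinction raised here, the two scoring methods weight the observed indicators differently. The sketch below uses a hypothetical one-factor model (the loadings and uniquenesses are made up for illustration, not taken from the authors' fitted model) to contrast Bartlett and regression factor scores: Bartlett weights are conditionally unbiased (applied to the loadings they recover exactly 1), while regression weights shrink scores toward the mean.

```python
import numpy as np

# Hypothetical standardized one-factor model (NOT the authors' model):
lam = np.array([0.8, 0.7, 0.6])              # factor loadings
theta = 1.0 - lam**2                          # unique variances
Sigma = np.outer(lam, lam) + np.diag(theta)   # model-implied covariance matrix

# Bartlett (weighted least squares) weights:
# w_B = (lam' Theta^-1 lam)^-1 lam' Theta^-1
w_bartlett = (lam / theta) / np.sum(lam**2 / theta)

# Regression (Thomson) weights for a single standardized factor:
# w_R = lam' Sigma^-1
w_regression = np.linalg.solve(Sigma, lam)

# Bartlett scores are conditionally unbiased (weights times loadings = 1);
# regression weights shrink toward zero (weights times loadings < 1).
print(w_bartlett @ lam)    # 1.0
print(w_regression @ lam)  # < 1 (shrinkage)
```

For a one-factor model the two weight vectors are proportional, so the resulting scores are perfectly correlated and differ only in scale; the usual justification for choosing one over the other rests on the bias/efficiency trade-off rather than on rank-ordering of respondents.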

Once these minor revisions are made and resubmitted, I will be ready to give IPA.

All my best,


Reviewed by , 28 Feb 2023

Evaluation round #1

DOI or URL of the report:

Version of the report: v1

Author's Reply, 17 Feb 2023

Decision by the recommender, posted 29 Dec 2022, validated 29 Dec 2022

Dear Lindsay, Jordan, and Jessica,

Many thanks for submitting your registered report, which is on a topic that is very important for replication studies, as well as for other research that uses students and/or online crowdsourcing as their subjects. I have received two expert reviews and both are positive with comments to address. Please revise your registered report and address all comments, then submit your revision. I will then send it back to the reviewers.

I'm looking forward to your revision,


Reviewed by Benjamin Farrar, 27 Dec 2022

In this Stage 1 Registered Report submission, Alley et al. propose to test for measurement non-equivalence between online and student samples across ManyLabs samples, including student samples, MTurk samples and a sample from Project Implicit. The analysis to decide which scales to include for equivalence testing has already been performed, and this is appropriately outlined in the level of bias control section of the RR. I was able to fully reproduce the inclusion analysis from the provided code.

Overall, I consider this to be a well thought-out and very well written submission that will be widely read, and it is great that the authors have chosen to submit this as a RR. The code annotation, in particular, was very clear - which is appreciated. The research question is valid and there is a strong rationale for testing measurement equivalence between convenience samples. While I am not an expert in testing measurement equivalence, the analysis pipeline was clear and, combined with the annotated code, there is sufficient methodological detail to closely replicate the proposed study procedures and to prevent undisclosed flexibility in the procedures and analyses. I could not find the individual analysis code mentioned in the paper (pg. 15: “Code for the following analyses can be found in the supplementary materials. A separate R file for each measure, and the files are named for the measure analyzed”), but this may be a placeholder for when each individual analysis has been performed, and I consider the “Planned Analysis Code” to sufficiently outline the analysis pipeline for all measures, which I was able to fully reproduce.

Below I list minor comments and questions for the authors, but I am confident this is a scientifically valid, thorough and very valuable submission that I look forward to reading once it is complete.


Throughout the report, there was not much discussion of the power of the planned analyses to detect measurement inequivalence. This may be a moot point as the sample sizes in the current project are quite large, and the exact analyses to be conducted are contingent on the hierarchical analysis plan. However, it would be a good addition to have a power curve to make explicit the range of effect sizes that would be detected in a “typical” analysis, and their theoretical meaning – although to some extent this is already achieved in the sensitivity analysis, which is great.

Ordered minor comments

Page 6: This contains a very clear and accessible description of measurement equivalence. Please also provide a more explicit formal definition after the first sentence, with any appropriate references.

Page 8: “It would also be ideal for replication researchers to test for ME between original and replication studies (Fabrigar & Wegener, 2016), but this is difficult in practice: for the most part, the original studies that are replicated do not have publicly available data.” Presumably the small sample size of many original studies would also mean it would be difficult to detect meaningful measurement non-invariance between samples.

Page 9: If the authors want a reference on between-platform differences in participant responses: Peer, E., Rothschild, D., Gordon, A., Evernden, Z., & Damer, E. (2022). Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54(4), 1643-1662.

Page 9/10: “While mean differences on a measure of a construct do not necessarily indicate non-equivalence, any of these differences in sample characteristics could potentially contribute to non-equivalent measurement.” Could this be unpacked? If there are substantial differences in some sample characteristics, is absolute equivalence plausible, or would you expect some non-equivalence but perhaps with very small effect sizes?

Page 11: RQ2, it could be made explicit here that this will be tested by looking at effect sizes as well as significance.

Page 12: “which is useful more broadly than just replication research” – good justification, if anything you are underselling the importance of these research questions for interpreting the results of any experiments!

Page 12: “which requires a series of preliminary analyses to determine if the data satisfy assumptions”. It would be useful to list all of the assumptions here.

Page 13: “Type I error rates for equivalence tests may be inflated when the baseline model is misspecified”. For accessibility, could an example of model misspecification increasing Type I error rates be given here, even if hypothetical?

Page 17: The authors may be more aware of this than I am and this may be a non-issue, but some concerns have been raised about using scaled χ2-difference tests to compare nested models (see
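For context, the widely used Satorra-Bentler (2001) scaled difference statistic is computed from the two nested models' uncorrected chi-squares, degrees of freedom, and scaling correction factors, and one known concern is that the pooled correction can go non-positive in small samples, making the test undefined. A minimal sketch of the computation (the fit statistics below are hypothetical, not from the present study):

```python
from scipy.stats import chi2

def sb_scaled_diff(T0, df0, c0, T1, df1, c1):
    """Satorra-Bentler (2001) scaled chi-square difference test.

    T0, df0, c0: uncorrected chi-square, degrees of freedom, and scaling
                 correction factor of the more constrained (nested) model.
    T1, df1, c1: the same quantities for the less constrained model.
    Returns (scaled difference statistic, difference in df, p-value).
    """
    ddf = df0 - df1
    cd = (df0 * c0 - df1 * c1) / ddf  # pooled scaling correction
    if cd <= 0:
        # Known failure mode of the 2001 statistic; Satorra & Bentler
        # (2010) propose a strictly positive variant for this case.
        raise ValueError("non-positive pooled scaling correction")
    Td = (T0 - T1) / cd
    return Td, ddf, chi2.sf(Td, ddf)

# Hypothetical fit statistics for a constrained (e.g., metric) vs a
# less constrained (e.g., configural) multi-group model:
Td, ddf, p = sb_scaled_diff(T0=112.4, df0=48, c0=1.20,
                            T1=95.1, df1=42, c1=1.15)
```

With these made-up numbers the pooled correction is 1.55, the scaled difference is about 11.16 on 6 degrees of freedom, and the test would not reject equivalence at the .05 level, which illustrates why the scaling step (and its potential failure) matters for the conclusions drawn.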

Results: The proposed data presentation looks appropriate, and thank you to the authors for including these.

Finally, I don’t think this is necessary to address in the current project, but I thought about this several times while reading the manuscript: While the focus is on detecting measurement inequivalence between samples, there is some ambiguity about what the authors consider the sample to be – is it the experimental units (individual participants) only, the experimental units plus the setting (participants + online/location), are the experimenters included too? Replications vary across experimental units, treatments, measurements and settings (Machery, E. [2020]. What is a replication? Philosophy of Science, 87(4), 545-567), and so it might be worth considering the degree to which measurement non-invariance could be introduced from other sources, e.g. different experimenters, that would not normally be associated with the sample, but nevertheless would differ between them.

Best of luck with the project!

Benjamin Farrar

Reviewed by Shinichi Nakagawa, 29 Dec 2022

User comments

No user comments yet