Recommendation

Does data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects

Corina Logan based on reviews by Benjamin Farrar and Shinichi Nakagawa

A recommendation of:

STAGE 1

Convenience Samples and Measurement Equivalence in Replication Research

Lindsay J. Alley, Jordan Axt, Jessica Kay Flake https://osf.io/32unb version 4

Abstract

Convenience Samples and Measurement Equivalence in Replication Research

A great deal of research in psychology employs either university student or online crowdsourced convenience samples (Chandler & Shapiro, 2016; Strickland & Stoops, 2019) and there is evidence that these groups differ in meaningful ways (Behrend et al., 2011). This could result in the presence of unaccounted-for measurement differences across convenience sample sources, which may bias results when these groups are compared, or the resulting data are pooled. In this registered report, we used the openly available data from the Many Labs replication projects to test for measurement equivalence across different convenience sample sources. We examined 9 measures that showed acceptable baseline model fit and tested them for non-equivalence across convenience samples from different sources, including university participant pools, MTurk, and Project Implicit. We then examined whether replication results are robust to non-equivalence by fitting partial invariance models and sensitivity analyses of replication results. [Results and discussion summarized here.]

measurement, psychometrics, equivalence, invariance, metascience, replication

Submission: posted 29 November 2022
Recommendation: posted 21 March 2023, validated 21 March 2023

Cite this recommendation as:
Logan, C. (2023) Does data from students and crowdsourced online platforms measure the same thing? Determining the external validity of combining data from these two types of subjects. Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=349

Related stage 2 preprints:

Convenience Samples and Measurement Equivalence in Replication Research
Lindsay J. Alley, Jordan Axt, Jessica Kay Flake
https://osf.io/s5t3v

Recommendation

Comparative research is how evidence is generated to support or refute broad hypotheses (e.g., Pagel 1999). However, the foundations of such research must be solid if one is to arrive at the correct conclusions. Determining the external validity (the generalizability across situations/individuals/populations) of the building blocks of comparative data sets allows one to place appropriate caveats around the robustness of their conclusions (Steckler & McLeroy 2008).

In this registered report, Alley and colleagues plan to tackle the external validity of comparative research that relies on subjects who are either university students or participating in experiments via an online platform (Alley et al. 2023). They will determine whether data from these two types of subjects have measurement equivalence, that is, whether the same trait is measured in the same way across groups. Although they use data from studies involved in the Many Labs replication project to evaluate this question, their results will be of crucial importance to other comparative researchers whose data are generated from these two sources (students and online crowdsourcing). If Alley and colleagues show that these two types of subjects have measurement equivalence, then it is more likely that equivalence also holds for other studies relying on these types of subjects. If measurement equivalence is not found, then it is a warning to others to evaluate their experimental design to improve validity. In either case, it gives researchers a way to test measurement equivalence for themselves because the code is well annotated and openly available for others to use.
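
As a rough formal sketch of what measurement equivalence means (the notation is a common textbook formulation and an assumption here, not taken from the Stage 1 manuscript): let $X$ be the observed item responses, $\eta$ the latent trait, and $G$ the sample source (e.g., student pool or online platform). Equivalence holds when the response distribution depends on the trait but not on the source. Under a linear factor model, this is usually examined as a sequence of increasingly strict equality constraints across groups.

```latex
% Assumed notation: x_{ig} = item responses of person i in group g,
% \tau_g = intercepts, \Lambda_g = loadings, \Theta_g = residual covariance.
\begin{align*}
  &\text{General definition:} && f(X \mid \eta, G = g) = f(X \mid \eta) \quad \text{for all } g \\
  &\text{Factor model:}       && x_{ig} = \tau_g + \Lambda_g \eta_{ig} + \varepsilon_{ig},
                                 \qquad \operatorname{Cov}(\varepsilon_{ig}) = \Theta_g \\
  &\text{Configural:}         && \text{same pattern of zero and nonzero loadings in every } \Lambda_g \\
  &\text{Metric:}             && \Lambda_g = \Lambda \\
  &\text{Scalar:}             && \Lambda_g = \Lambda, \quad \tau_g = \tau \\
  &\text{Strict:}             && \Lambda_g = \Lambda, \quad \tau_g = \tau, \quad \Theta_g = \Theta
\end{align*}
```

Partial invariance models, as mentioned in the abstract, relax these equality constraints for a subset of items rather than for the whole measure.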

The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol: https://osf.io/7gtvf

Level of bias control achieved: Level 2. At least some data/evidence that will be used to answer the research question has been accessed and partially observed by the authors, but the authors certify that they have not yet observed the key variables within the data that will be used to answer the research question AND they have taken additional steps to maximise bias control and rigour (e.g. conservative statistical threshold; recruitment of a blinded analyst; robustness testing, multiverse/specification analysis, or other approach)

References

Alley, L. J., Axt, J., & Flake, J. K. (2023). Convenience Samples and Measurement Equivalence in Replication Research, in principle acceptance of Version 4 by Peer Community in Registered Reports. https://osf.io/7gtvf

Steckler, A. & McLeroy, K. R. (2008). The importance of external validity. American Journal of Public Health, 98, 9-10. https://doi.org/10.2105/AJPH.2007.126847

Pagel, M. (1999). Inferring the historical patterns of biological evolution. Nature, 401, 877-884. https://doi.org/10.1038/44766

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #3

DOI or URL of the report: https://osf.io/32unb

Version of the report: v1

Author's Reply, 18 Mar 2023

https://doi.org/10.24072/pci.rr.100349.ar3

Decision by Corina Logan, posted 16 Mar 2023, validated 16 Mar 2023

Dear Lindsay, Jordan, and Jessica,

Thank you for your revision, which appropriately addressed the remaining comments.

In drafting my recommendation text, I went back through your answers to the questions at the submission page and determined that your answer to question 7 was incorrect. This is a quantitative study that tests a hypothesis (see research questions 1 and 2 in your introduction), therefore it needs a study design table. Please make a study design table according to the author guidelines (https://rr.peercommunityin.org/help/guide_for_authors#h_27513965735331613309625021), reference it in your main text, and include the table in your manuscript document either in the main text or as supplementary material. Note that research question 2 tests a hypothesis because it uses a significance test that tests for the existence of something, not just the amount of something. Research question 1 is an estimation problem, but would also benefit from being included in the study design table.

Also, please update your answer to question 7 in the report survey to choose the first option: "YES - THE RESEARCH INVOLVES AT LEAST SOME QUANTITATIVE HYPOTHESIS-TESTING AND THE REPORT INCLUDES A STUDY DESIGN TEMPLATE". If you need help with this, or would like me to change your answer for you, just let me know.

All my best,

Corina

https://doi.org/10.24072/pci.rr.100349.d3

Evaluation round #2

DOI or URL of the report: https://osf.io/32unb

Version of the report: v1

Author's Reply, 15 Mar 2023

https://doi.org/10.24072/pci.rr.100349.ar2

Decision by Corina Logan, posted 28 Feb 2023, validated 28 Feb 2023

Thank you for your comprehensive revision and for responding to all of the comments so completely. I sent the manuscript back to one of the reviewers, who is very positive about the revision. Please see their review, which contains the answer to your question about where to place the additional terms. I have one additional note...

You wrote: “On the advice of a colleague, we have switched from using Bartlett’s factor scores for the sensitivity analysis to regression factor scores”

Please justify why this change was needed and why the new analysis is better suited to your study.

Once these minor revisions are made and resubmitted, I will be ready to give IPA.

All my best,

Corina

https://doi.org/10.24072/pci.rr.100349.d2

Reviewed by Shinichi Nakagawa, 28 Feb 2023

https://doi.org/10.24072/pci.rr.100349.rev21

Evaluation round #1

DOI or URL of the report: https://osf.io/32unb

Version of the report: v1

Author's Reply, 17 Feb 2023

https://doi.org/10.24072/pci.rr.100349.ar1

Decision by Corina Logan, posted 29 Dec 2022, validated 29 Dec 2022

Dear Lindsay, Jordan, and Jessica,

Many thanks for submitting your registered report, which is on a topic that is very important for replication studies, as well as for other research that uses students and/or online crowdsourcing as their subjects. I have received two expert reviews and both are positive with comments to address. Please revise your registered report and address all comments, then submit your revision. I will then send it back to the reviewers.

I'm looking forward to your revision,

Corina

https://doi.org/10.24072/pci.rr.100349.d1

Reviewed by Benjamin Farrar, 27 Dec 2022

In this Stage 1 Registered Report submission, Alley et al. propose to test for measurement non-equivalence between online and student samples across Many Labs samples, including student samples, MTurk samples and a sample from Project Implicit. The analysis to decide which scales to include for equivalence testing has already been performed, and this is appropriately outlined in the level of bias control section of the RR. I was able to fully reproduce the inclusion analysis from the provided code.

Overall, I consider this to be a well thought-out and very well written submission that will be widely read, and it is great that the authors have chosen to submit this as a RR. The code annotation, in particular, was very clear, which is appreciated. The research question is valid and there is a strong rationale for testing measurement equivalence between convenience samples. While I am not an expert in testing measurement equivalence, the analysis pipeline was clear and, combined with the annotated code, there is sufficient methodological detail to closely replicate the proposed study procedures and to prevent undisclosed flexibility in the procedures and analyses. I could not find the individual analysis code mentioned in the paper (pg. 15 “Code for the following analyses can be found in the supplementary materials. A separate R file for each measure, and the files are named for the measure analyzed”), but this may be a placeholder for when each individual analysis has been performed, and I consider the “Planned Analysis Code” to sufficiently outline the analysis pipeline for all measures, which I was able to fully reproduce.

Below I list minor comments and questions for the authors, but I am confident this is a scientifically valid, thorough and very valuable submission that I look forward to reading once it is complete.

General:

Throughout the report, there was not much discussion of the power of the planned analyses to detect measurement inequivalence. This may be a moot point as the sample sizes in the current project are quite large, and the exact analyses to be conducted are contingent on the hierarchical analysis plan. However, it would be a good addition to have a power curve to make explicit the range of effect sizes that would be detected in a “typical” analysis, and their theoretical meaning – although to some extent this is already achieved in the sensitivity analysis, which is great.

Ordered minor comments

Page 6: This contains a very clear and accessible description of measurement equivalence. Please also provide a more explicit formal definition after the first sentence, with any appropriate references.

Page 8: “It would also be ideal for replication researchers to test for ME between original and replication studies (Fabrigar & Wegener, 2016), but this is difficult in practice: for the most part, the original studies that are replicated do not have publicly available data.” Presumably the small sample size of many original studies would also mean it would be difficult to detect meaningful measurement non-invariance between samples.

Page 9: If the authors want a reference on between-platform differences in participant responses: Peer, E., Rothschild, D., Gordon, A., Evernden, Z., & Damer, E. (2022). Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54(4), 1643-1662.

Page 9/10: “While mean differences on a measure of a construct do not necessarily indicate non-equivalence, any of these differences in sample characteristics could potentially contribute to non-equivalent measurement.” Could this be unpacked? If there are substantial differences in some sample characteristics, is absolute equivalence plausible, or would you expect some non-equivalence but perhaps with very small effect sizes?

Page 11: RQ2, it could be made explicit here that this will be tested by looking at effect sizes as well as significance.

Page 12: “which is useful more broadly than just replication research” – good justification; if anything, you are underselling the importance of these research questions for interpreting the results of any experiments!

Page 12: “which requires a series of preliminary analyses to determine if the data satisfy assumptions”. It would be useful to list all of the assumptions here.

Page 13: “Type I error rates for equivalence tests may be inflated when the baseline model is misspecified”. For accessibility, could an example of model misspecification increasing Type I error rates be given here, even if hypothetical?

Page 17: The authors may be more aware of this than I am and this may be a non-issue, but some concerns have been raised about using scaled χ2-difference tests to compare nested models (see http://www.statmodel.com/chidiff.shtml).

Results: The proposed data presentation looks appropriate, and thank you to the authors for including these.

Finally, I don’t think this is necessary to address in the current project, but I thought about this several times while reading the manuscript: While the focus is on detecting measurement inequivalence between samples, there is some ambiguity about what the authors consider the sample to be. Is it the experimental units (individual participants) only, or the experimental units plus the setting (participants + online/location), and are the experimenters included too? Replications vary across experimental units, treatments, measurements and settings (Machery, E. [2020]. What is a replication? Philosophy of Science, 87(4), 545-567), and so it might be worth considering the degree to which measurement non-invariance could be introduced from other sources, e.g. different experimenters, that would not normally be associated with the sample but would nevertheless differ between samples.

Best of luck with the project!

Benjamin Farrar

https://doi.org/10.24072/pci.rr.100349.rev11

Reviewed by Shinichi Nakagawa, 29 Dec 2022

https://doi.org/10.24072/pci.rr.100349.rev12
