Recommendation

Data from students and crowdsourced online platforms do not often measure the same thing

Corina Logan, based on reviews by Benjamin Farrar and Shinichi Nakagawa
A recommendation of:

Convenience Samples and Measurement Equivalence in Replication Research

Submission: posted 31 August 2023
Recommendation: posted 13 November 2023, validated 13 November 2023
Cite this recommendation as:
Logan, C. (2023) Data from students and crowdsourced online platforms do not often measure the same thing. Peer Community in Registered Reports, 100551. https://doi.org/10.24072/pci.rr.100551

Recommendation

Comparative research is how evidence is generated to support or refute broad hypotheses (e.g., Pagel 1999). However, the foundations of such research must be solid if one is to arrive at the correct conclusions. Determining the external validity (the generalizability across situations, individuals, and populations) of the building blocks of comparative data sets allows researchers to place appropriate caveats around the robustness of their conclusions (Steckler & McLeroy 2008).

In the current study, Alley and colleagues (2023) tackled the external validity of comparative research that relies on subjects who are either university students or participants recruited via online platforms. They determined whether data from these two types of subjects have measurement equivalence, that is, whether the same trait is measured in the same way across groups.

Although they used data from studies in the Many Labs replication project to evaluate this question, their results are of crucial importance to other comparative researchers whose data are generated from these two sources (students and online crowdsourcing). The authors show that these two types of subjects do not often have measurement equivalence, which is a warning to others to evaluate their experimental designs to improve validity. They provide useful recommendations for researchers on how to implement equivalence testing in their studies, and they facilitate the process by providing well-annotated code that is openly available for others to use.
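For readers new to the concept, the following toy sketch illustrates the core idea (this is our illustration, not the authors' openly available code; the sample labels and loading values are hypothetical): fit a one-factor model separately in a "student" sample and an "online" sample, then compare the item loadings across groups.

```python
# Toy illustration of (metric) measurement equivalence: fit a one-factor
# model separately in two simulated samples and compare item loadings.
# Formal equivalence testing instead uses multi-group confirmatory factor
# analysis with nested model comparisons (configural -> metric -> scalar).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 2000  # respondents per sample

def simulate(loadings):
    """Generate item responses from a one-factor model with unit-variance noise."""
    latent = rng.normal(size=(n, 1))
    noise = rng.normal(size=(n, len(loadings)))
    return latent @ np.atleast_2d(loadings) + noise

students = simulate([0.8, 0.7, 0.6, 0.5])  # hypothetical student sample
online = simulate([0.8, 0.7, 0.6, 0.1])    # item 4 relates to the trait differently

for name, data in (("students", students), ("online", online)):
    fa = FactorAnalysis(n_components=1).fit(data)
    loadings = fa.components_.ravel()
    loadings *= np.sign(loadings.sum())  # resolve sign indeterminacy
    print(name, np.round(loadings, 2))
# Diverging estimates for item 4 signal metric non-equivalence: the same
# item measures the latent trait differently across the two samples.
```

A formal test would compare a model constraining loadings to be equal across groups against one that frees them; the sketch above only shows why unequal loadings make cross-sample comparisons of scale scores hazardous.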

After one round of review and revision, the recommender judged that the manuscript met the Stage 2 criteria and awarded a positive recommendation.

URL to the preregistered Stage 1 protocol: https://osf.io/7gtvf
 
Level of bias control achieved: Level 2. At least some data/evidence that was used to answer the research question had been accessed and partially observed by the authors prior to Stage 1 IPA, but the authors certify that they had not yet observed the key variables within the data that were used to answer the research question AND they took additional steps to maximise bias control and rigour.
 
References
 
1. Pagel, M. (1999). Inferring the historical patterns of biological evolution. Nature, 401, 877-884. https://doi.org/10.1038/44766
 
2. Steckler, A. & McLeroy, K. R. (2008). The importance of external validity. American Journal of Public Health, 98, 9-10. https://doi.org/10.2105/AJPH.2007.126847
 
3. Alley, L. J., Axt, J., & Flake, J. K. (2023). Convenience Samples and Measurement Equivalence in Replication Research [Stage 2 Registered Report]. Acceptance of Version 2 by Peer Community in Registered Reports. https://osf.io/s5t3v
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #1

DOI or URL of the report: https://doi.org/10.17605/OSF.IO/HT48Z

Version of the report: 1

Author's Reply, 06 Nov 2023

Decision by Corina Logan, posted 27 Sep 2023, validated 27 Sep 2023

Dear authors,

Congratulations on completing your study and finishing your Stage 2 article! The unexpected bumps that came up along the way were normal, and your solutions to the problems upheld the scientific integrity of the registered report - nice work. The same two reviewers who evaluated the Stage 1 came back to evaluate Stage 2, and both found that your manuscript meets the PCI RR Stage 2 criteria. I am ready to issue IPA after you revise per my and Nakagawa’s comments. Note that my comments are so minor that you do not need to address them if you feel they are not useful, but please do make sure to address Nakagawa’s comment.

To answer your question, there are no space constraints at PCI RR so you don’t need to move anything to supplementary material.

Here are my comments on the manuscript…

1) Results: I found it extremely useful that you clarified the size of the effects in relation to what your tests were powered for (e.g., “Item 1 (“I find satisfaction in deliberating hard for long hours”) was the only item above the cut-off for a medium effect, all others were small or negligible”). I noticed that some paragraphs discussed a small effect as the cut-off, while others discussed a medium effect. It might be even clearer if you noted in each paragraph that the effect size cut-off related to the power/sensitivity analyses you conducted at Stage 1 for each analysis, which is why it differed.

2) Discussion: “power in ME testing is impact by the strength of inter-item correlations” - change “impact” to “impacted”

3) Discussion: “For this reason, researchers should not assume that different crowdsourced samples will be equivalent to each other, or even student samples collected in different settings”. Could you please clarify what “different settings” refers to? Different countries/languages/etc.?

4) Study design table: you could add a column to the right that shows your findings.

I'm looking forward to receiving your revision.

All my best,

Corina

Reviewed by Benjamin Farrar, 12 Sep 2023

2A. Whether the data are able to test the authors’ proposed hypotheses (or answer the proposed research question) by passing the approved outcome-neutral criteria, such as absence of floor and ceiling effects or success of positive controls or other quality checks.

Yes. The authors clearly outline when equivalence testing was unsuitable due to failing to find an appropriate anchor item, and also clearly discuss the potential limits to the generalisability of their results.

2B. Whether the introduction, rationale and stated hypotheses (where applicable) are the same as the approved Stage 1 submission. This can be readily assessed by referring to the tracked-changes manuscript supplied by the authors. (Note that this is the DIRECT link to the tracked changes version: https://osf.io/download/hgku8/?direct%26mode=render)

Yes, the small number of tracked changes relate to textual edits or to transparent and justified changes to the methods.

2C. Whether the authors adhered precisely to the registered study procedures.

As above, yes. On the one occasion the authors added a criterion for determining configural equivalence after the Stage 1 process, this was clearly stated in the manuscript and accompanied by a footnote explaining the nature of, and reason for, the deviation from the planned methodology. It was reported that this decision was made by an author blind to the results, which is appropriate.

2D. Where applicable, whether any unregistered exploratory analyses are justified, methodologically sound, and informative.

As above, yes. 

2E. Whether the authors’ conclusions are justified given the evidence.

In my view, the authors have conducted a very thorough study that followed the Stage 1 approval. The conclusions are justified by the evidence. The writing is of a very high standard throughout, and the discussion of the generalisability of their results is fully appropriate.

I re-ran the code for one comparison (EMA implicit vs MTurk) as a reproducibility check, and I was able to fully reproduce the equivalence test results for this comparison, although I did note that the mean age for MTurk was 34.98400 (35.0) rather than the reported 34.0. The code is clear and excellently commented throughout.
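A minimal sketch of this kind of spot-check (the file and column names here are hypothetical, not the study's actual data):

```python
# Hypothetical spot-check of a reported summary statistic; the file name
# "mturk_sample.csv" and the column "age" are assumed for illustration.
import pandas as pd

df = pd.read_csv("mturk_sample.csv")
mean_age = df["age"].mean()
print(f"mean age = {mean_age:.5f}, which rounds to {round(mean_age, 1)}")
# A computed mean of 34.98400 rounds to 35.0, so a reported 34.0 looks
# like a transcription slip rather than a failure to reproduce the value.
```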

Yours sincerely, 
Ben Farrar

 

Reviewed by Shinichi Nakagawa, 26 Sep 2023

I reviewed Stage 1 of this MS, very much enjoyed it, and was looking forward to reading Stage 2. I first acknowledge that I am a quantitative ecologist, so I do not know the relevant field and literature; I am, however, able to check whether the statistical analyses conducted were sound. Also, this is my first time reviewing a Stage 2, but my understanding is that I should check whether the authors followed the Stage 1 plan and look for deviations. The authors conducted the study with very minor deviations. I liked that the Discussion section had limitation and recommendation sections, which are very clearly and honestly written. Overall, I think this is a great Stage 2. I have one question, though. Reading this work, I got the impression that the authors encourage caution about mixing samples. Yet some papers in biology encourage the mixing of samples despite known non-equivalence (differences, e.g., sex and strains). I wondered what the authors make of this, and there should be some related discussion. I note this mixing process is called "heterogenization", which is encouraged by an increasing number of grant agencies. Here is an example paper:

Voelkl, B., et al. (2020). Reproducibility of animal research in light of biological variation. Nature Reviews Neuroscience, 21(7), 384-393.