Recommendation

A reliable measure of physical closeness in interpersonal relationships?

Moin Syed based on reviews by Jacek Buczny and Ian Hussey

A recommendation of:

STAGE 1

Test-Retest Reliability of the STRAQ-1: A Registered Report

Olivier Dujols; Richard A. Klein; Siegwart Lindenberg; Hans IJzerman https://psyarxiv.com/392g6 version 5

Abstract

Test-Retest Reliability of the STRAQ-1: A Registered Report

This Registered Report provides the first test of measurement invariance across time points and the first estimates of test-retest reliability for the Social Thermoregulation and Risk Avoidance Questionnaire (STRAQ-1, Vergara et al., 2019). The scale was developed and validated to understand the physiological drives underlying interpersonal bonding, measured by four constructs: the desire to socially regulate one’s temperature, the desire to solitarily regulate one’s temperature, the sensitivity to higher temperatures, and the desire to avoid risk. Previous studies with large samples across 12 countries showed that the STRAQ-1 has a stable factorial structure, satisfactory internal consistencies for the temperature subscales, and expected correlations in its nomological network. However, to date, this instrument has no estimates of test-retest reliability. Over four academic years (from 2018 to 2021), N = 184 French student participants took the STRAQ-1 at least twice. Out of the four STRAQ-1 subscales, X were longitudinally [non-invariant/invariant] across two time points. The constructs and latent scores were thus [dissimilar/similar] and [incomparable/comparable] across time. We then assessed test-retest reliability using the intraclass correlation coefficient (ICC) for the Social Thermoregulation, Solitary Thermoregulation, High-Temperature Sensitivity, and Risk Avoidance subscales. The ICC estimates for agreement and consistency were, respectively: XX, XX overall [excellent/good/moderate/poor], XX, XX overall [excellent/good/moderate/poor], XX, XX overall [excellent/good/moderate/poor], and XX, XX overall [excellent/good/moderate/poor]. We discuss our findings with regard to the relatively long time between the repeated measures (minimum one year).

Test-Retest, Longitudinal Measurement Invariance, Attachment Theory, Social Thermoregulation, Registered Report

Submission: posted 01 March 2023
Recommendation: posted 17 July 2023, validated 18 July 2023

Cite this recommendation as:
Syed, M. (2023) A reliable measure of physical closeness in interpersonal relationships? Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=419

Related stage 2 preprints:

Test-Retest Reliability of the STRAQ-1: A Registered Report
Olivier Dujols, Siegwart Lindenberg, Caspar J. van Lissa, Hans IJzerman
https://doi.org/10.31234/osf.io/392g6

Recommendation

Attachment and interpersonal relationships are a major subject of research and clinical work in psychology. There is, accordingly, a proliferation of measurement instruments to tap into these broad constructs. The emphasis in these measures tends to be on the emotional dimensions of relationships: how people feel about their partners and the support that they receive. However, that is not all there is to relationship quality. Increasing attention has been paid to the physical and physiological aspects of relationships, but there are few psychometrically sound measures available to assess these dimensions.

In the current study, Dujols et al. (2023) seek to assess the psychometric properties of the Social Thermoregulation and Risk Avoidance Questionnaire (STRAQ-1), a measure of physical relationships that targets social thermoregulation, or how physical proximity is used to promote warmth and closeness. The proposed project will be a thorough assessment of the measure’s reliability over time—that is, the degree to which the measure assesses the construct similarly across administrations. The authors will assess the test-retest reliability and longitudinal measurement invariance of the STRAQ-1, providing much-needed psychometric data that can build confidence in the utility of the measure.

The Stage 1 manuscript was evaluated over two rounds of in-depth review, the first round consisting of detailed comments from two reviewers and the second round consisting of a close read by the recommender. Based on the authors' detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol: https://osf.io/pmnk2

Level of bias control achieved: Level 3. At least some data/evidence that will be used to answer the research question has been previously accessed by the authors (e.g. downloaded or otherwise received), but the authors certify that they have not yet observed ANY part of the data/evidence.

List of eligible PCI RR-friendly journals:

References

1. Dujols, O., Klein, R. A., Lindenberg, S., Van Lissa, C. J., & IJzerman, H. (2023). Test-Retest Reliability of the STRAQ-1: A Registered Report. In principle acceptance of Version 2 by Peer Community in Registered Reports. https://osf.io/pmnk2

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #1

DOI or URL of the report: https://psyarxiv.com/392g6

Version of the report: https://psyarxiv.com/392g6

Author's Reply, 30 Jun 2023

Revision & Response letter PCI-RR STRAQ-1 Stage 1

Dear Prof. Moin Syed,

We would like to thank you and the two reviewers for your comments on the manuscript, which directly contributed to improving it. We have revised the manuscript and the analytic code and provided a point-by-point response to the reviewers’ comments. To facilitate the review, we provide a version of the article with Track Changes enabled, and a clean copy without tracking.

Overall, our two biggest changes consisted of (1) better justifying the analytic choices we made in the main text: a) providing greater detail about the procedure, b) providing detailed analytic choices throughout, c) providing specific cut-offs along with their interpretation, and (2) providing reproducible analysis code that now better constrains the planned analysis. In the following, we separate the comments by theme and then by reviewer to make the comments easier to process.

Major Points

Justifying our subjective decisions and improving our analytic code

A measurement project such as this involves many subjective decisions that must be made on issues for which there is no clear answer. The reviewers provide some suggestions for how to resolve the issues that they identified, but in many cases there may be other defensible options. Thus, a general message that you should take from the reviewer comments is that you need to provide much more detail and better justify your decisions throughout. (Moin Syed)

One specific and important issue to attend to is that both reviewers had some trouble accessing your analytic code but in different ways. Please be sure that you are providing the correct links for this Stage 1 review process; having full access to your code will facilitate the review process. (Moin Syed)

Please note that while the OSF project containing the code for the analyses etc is linked in the manuscript, it is currently not public. As such, I couldn't review the code. (Ian Hussey)

Under the link https://osf.io/ab73e/, I could not find the code indicated in the manuscript. I found it here: https://osf.io/mr8n3/. The latter script seems incomplete (by the way: why a pdf not Rmd was submitted?). For instance, it is not clear what package(s) will be used in data analysis – lavaan, I presume. (Jacek Buczny)

Response: We apologize for the unclear links and the incomplete code. We now better justify our decision criteria and cut-off values throughout the manuscript. We also provide a public OSF link with the complete planned analytic code. To that end, we completely rewrote the R analysis script in RMarkdown. The code should now be clearer and reproducible, and it reflects the choices that we justify in the manuscript.

About our sample and statistical power

Data loss was substantial. Between 2018 and 2021, more than 1000 participants provided their responses, but only 184 will be included in the analyses. Data loss could be systematic, the sample of 184 may somehow deviate from the subpopulation (students), and thus, the generalizability of the results may be very limited. (Jacek Buczny)

Response: We agree that generalizability is limited, and part of that lack of generalizability could indeed be due to systematic data loss. But that is, of course, not the only constraint on generalizability: our sample consists of psychology students, who mostly identify as women. In the discussion section of the manuscript, and as is our custom, we will commit to adding a “Constraints On Generality” section (see Simons et al., 2017) that will include a statement with the information above about our sample (psychology students, mostly identified as women, and data loss).

The justification for the sample size is not convincing given that there are powerful tools that are very helpful in deciding on the minimum sampling size: https://github.com/moshagen/semPower/tree/master/vignettes, https://sempower.shinyapps.io/sempower/. Because data have been collected, running post hoc power analysis would seem the only sensible option. Such analysis should account for the fact that various types of invariance will be tested. Models' identification will impact the degrees of freedom for each tested model. Sensitivity analysis should provide detailed information regarding power, tested RMSEA values, and the like. (Jacek Buczny)

Response: Thank you for mentioning these tools to compute power for our analyses. As a result of this comment, we now use semPower to calculate the sensitivity of our planned analyses (see our code on OSF: https://osf.io/mr8n3/).

We do not think that we should restrict ourselves to post hoc power analyses. We can already run sensitivity analyses now and run post hoc analyses after the manuscript is completed:

A priori sensitivity analyses using semPower: We computed an a priori sensitivity power analysis for general configural models (but not specific power for metric, scalar, and residual invariance). The result of this analysis was that 164 participants would be required for the model (our previous calculation suggested 160). We believe we can do these analyses a priori, as they rely on only a limited number of assumptions:

We computed power for a general confirmatory factor analysis (CFA) model.
We set the amount of misfit to correspond to an RMSEA of at least .05.
We set power at 80%.
We set the degrees of freedom at 100.
We set alpha at .05.

We changed the power analysis section accordingly: “We also computed power for general configural longitudinal measurement invariance (CFA) models. We set power to 80%, alpha to .05, the amount of misfit to correspond to an RMSEA of at least .05, and the degrees of freedom to 100. The result of this analysis was that 164 participants would be required.” (page 11).
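The RMSEA-based calculation described above can be sketched in a few lines. The block below is a minimal, standard-library Python illustration and not the authors' semPower script. It assumes the usual convention that the noncentrality parameter is (N - 1) x df x RMSEA^2, and the function names (`rmsea_power`, `required_n`) are hypothetical. The noncentral chi-square CDF is computed as a Poisson mixture of central chi-square CDFs.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= x / (a + n)
        total += term
    log_prefactor = a * math.log(x) - x - math.lgamma(a)
    return min(1.0, total * math.exp(log_prefactor))


def chi2_cdf(x, df):
    """CDF of the central chi-square distribution."""
    return reg_lower_gamma(df / 2.0, x / 2.0)


def chi2_quantile(p, df):
    """Quantile of the central chi-square distribution, by bisection."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def noncentral_chi2_cdf(x, df, ncp):
    """Noncentral chi-square CDF as a Poisson mixture of central CDFs."""
    half = ncp / 2.0
    if half <= 0.0:
        return chi2_cdf(x, df)
    j_max = int(half + 10.0 * math.sqrt(half) + 30.0)  # truncate the tail
    total = 0.0
    for j in range(j_max + 1):
        weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += weight * reg_lower_gamma(df / 2.0 + j, x / 2.0)
    return total


def rmsea_power(n, df, rmsea, alpha=0.05):
    """Power to reject a model whose misfit equals the given RMSEA."""
    crit = chi2_quantile(1.0 - alpha, df)
    ncp = (n - 1) * df * rmsea ** 2  # assumed convention, see lead-in
    return 1.0 - noncentral_chi2_cdf(crit, df, ncp)


def required_n(df, rmsea, alpha=0.05, power=0.80, n_max=5000):
    """Smallest N that reaches the target power, or None if above n_max."""
    crit = chi2_quantile(1.0 - alpha, df)
    for n in range(2, n_max + 1):
        ncp = (n - 1) * df * rmsea ** 2
        if 1.0 - noncentral_chi2_cdf(crit, df, ncp) >= power:
            return n
    return None


if __name__ == "__main__":
    # Inputs taken from the response letter: df = 100, RMSEA = .05.
    print(required_n(df=100, rmsea=0.05, alpha=0.05, power=0.80))
```

If these assumptions match the ones semPower uses, the result should land close to the 164 participants reported above; small differences could arise from rounding or from a different noncentrality convention.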

Post hoc sensitivity analyses using semPower.powerLI: To compute the power for metric, scalar, and residual longitudinal invariance, one should specify and quantify the change in the loadings, intercepts, and residuals between measurements at Time 1 and Time 2. Unfortunately, we do not have such specific a priori hypotheses. Depending on how large the change is, we may or may not be sufficiently powered to detect it. For example, we currently have a power of 90% (with our longest subscale, 8 items) and a power of 80% (with our shortest subscale, 3 items) to detect a change from .50 to .74 in the loadings of the items (metric longitudinal invariance between T1 and T2). The R code to compute post hoc sensitivity has been included in the supplementary material on our OSF page (https://osf.io/mr8n3/). We will run the code for each subscale of the STRAQ-1. We will also report all the steps and tests for the longitudinal invariance and their associated power in the manuscript. We will not claim to have reached longitudinal invariance for any subscale if one of the tests (metric or scalar) is underpowered, even if longitudinal measurement invariance holds based on our criteria.

Consider swapping ICC power analyses from detection of non-zero scores to estimation precision. The current power analyses for the ICCs are based on their ability to detect a non-zero ICC. Many, including - I think - Parsons, Kruijt & Fox (2019) have argued that zero is not a particularly meaningful reference point for reliability estimates. I.e., detectable non-zero reliability does not tell us much about whether reliability is practically adequate, in my opinion. I would argue that a better approach is to estimate the 95% CI width that your sample size will provide as a function of different ICC values (as CI width and estimate are related for ICC). This can be done using the R package ICC.Sample.Size. I have a working example of this here, which I've taken from an in progress test-retest study I'm running. Feel free to use or adapt this yourself: https://gist.github.com/ianhussey/03a3d3940b93d79191a3926a09bcfc2b (Ian Hussey)

Response: Thank you for recommending ICC power analyses of estimation precision and for sharing your personal code. We revised the power analysis section to estimate the 95% CI width that our sample will provide as a function of different ICC values (an adaptation of your code for the power analysis is available in the Supplementary Materials via our OSF page). Of course, the higher the ICC value, the narrower the 95% CI that our sample can provide. Based on this power analysis, we will specify in the discussion which ICC estimates are estimated precisely enough to reach a 95% CI width of 0.2 (and 0.1 if applicable).

We rewrote the power analysis section accordingly and have added this sentence: “Because researchers have argued that detection of non-zero ICC scores may not be sufficient and meaningful (see for example, Parsons, Kruijt & Fox, 2019), we also conducted a power analysis to estimate the 95% CI width that our sample will provide as a function of different ICC values*. (*Footnote: The R code associated with this power analysis is available in the Supplementary Materials via our OSF page: https://osf.io/mr8n3/.) This power analysis suggested that based on our sample size N = 184, we could estimate any ICC above .30 with a 0.2 width of the 95% CI, and any ICC above .80 with a 0.1 width of the 95% CI.” (pages 11-12).
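For orientation, the two ICC variants named in the abstract (agreement and consistency) can be illustrated for a two-occasion design. The block below is a minimal Python sketch, not the authors' R code. It assumes the single-measurement, two-way forms usually labeled ICC(A,1) and ICC(C,1) (McGraw & Wong, 1996); the letter does not state which ICC model the manuscript uses. The function name `icc_two_way` and the scores are hypothetical.

```python
from statistics import mean


def icc_two_way(scores):
    """Return (agreement, consistency) ICCs for an n x k score table.

    Rows are participants, columns are measurement occasions.
    Assumed forms: ICC(A,1) and ICC(C,1), single measurement, two-way model.
    """
    n = len(scores)
    k = len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-participant variance
    ms_cols = ss_cols / (k - 1)               # between-occasion variance
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual variance

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )
    return agreement, consistency


if __name__ == "__main__":
    # Hypothetical scores: every participant scores one point higher at T2.
    t1_t2 = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0], [5.0, 6.0]]
    agreement, consistency = icc_two_way(t1_t2)
    print(round(agreement, 3), round(consistency, 3))
```

In this toy example the rank order is perfectly preserved, so the consistency ICC is 1, while the uniform shift between occasions lowers the agreement ICC. This is why reporting both variants is informative for test-retest designs.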

About Longitudinal measurement Invariance (LI)

Be more precise about your inferential method and its correspondence with Mackinnon et al. From page 13, the manuscript states that "we followed the procedure provided by Mackinnon et al. (2022)" and then goes on to specify your inferential method (e.g., which model fit indices, hypothesis tests, and cut-offs). However, most of the specifics that the current manuscript mentions are not what Mackinnon et al specify in their preprint. Indeed, my reading of Mackinnon et al is that they are purposefully non specific about the method they employ for comparisons (e.g., between configural, metric and scalar models). To take one example, Mackinnon et al. discuss the results of Cheung & Rensvold's (2002) highly cited simulation studies, which studied the utility of examining change in CFI between models (ΔCFI), but the current manuscript doesn't employ this method. I would also raise the point that the method the manuscript currently suggests - a combination of both the chi-square tests' p values, the CFI, and the RMSEA - has not, to my knowledge, been assessed within any of the well cited simulation studies, and therefore has an unknown sensitivity-specificity tradeoff. (Ian Hussey)

Response: We indeed plan to rely on the ΔCFI to assess measurement invariance and decide which model to retain in the iterative procedure that evaluates the model fit of the configural model, metric model, scalar model and residual model for each subscale of the STRAQ-1. We now better define our criteria in the main text of the manuscript: “Mackinnon et al. (2022) provided several criteria to assess model fit for measurement invariance; one of these is the ΔCFI (of .01), which is also recommended by a simulation study (Cheung & Rensvold, 2002). We decided to rely only on a ΔCFI of -.01 or more to conclude that the model with the largest CFI should be chosen. This means that if the ΔCFI is smaller than or equal to -.01 we will choose the more parsimonious model and conclude that longitudinal invariance holds at that specific level (metric, or scalar, or residual).” (see pages 14-15). Additionally, we acknowledge in the manuscript that there is a lack of norms in the field: “Before pre-registration, we made choices about the metrics and cut-offs on which we would base our conclusions and interpretation of the subscale’s performance. But we acknowledge a lack of clear norms in the field about which metric to choose for our planned analyses. So, in addition to our pre-registered metric and cut-offs, we reported the results of other fit metrics even though we did not plan to use them for inferences and did not preregister any cut-off value for them. This process will allow other researchers, who would prefer other indicators or cut-offs than ours, to be able to evaluate our models according to their criteria.” (see page 15).
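The stepwise ΔCFI procedure can be made concrete with a small sketch. The Python block below is an illustration only, not the authors' analysis code. It assumes the common reading of the Cheung and Rensvold (2002) criterion, in which a level of invariance is retained when CFI drops by no more than .01 relative to the less constrained model. The function names and all fit values are hypothetical; in practice the CFI values would come from the fitted models (with a WLSMV estimator, typically from a scaled test statistic).

```python
LEVELS = ["configural", "metric", "scalar", "residual"]


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index from model and baseline chi-square statistics."""
    misfit_model = max(chi2_model - df_model, 0.0)
    misfit_baseline = max(chi2_baseline - df_baseline, 0.0)
    denominator = max(misfit_model, misfit_baseline)
    if denominator == 0.0:
        return 1.0
    return 1.0 - misfit_model / denominator


def highest_invariance_level(cfi_by_level, max_drop=0.01):
    """Walk configural -> metric -> scalar -> residual.

    A more constrained model is retained while its CFI is no more than
    `max_drop` below the CFI of the previously retained model.
    (Assumed reading of the delta-CFI rule, see lead-in.)
    """
    retained = LEVELS[0]
    for level in LEVELS[1:]:
        delta = cfi_by_level[level] - cfi_by_level[retained]
        if delta < -max_drop:
            break  # fit worsened too much: stop at the previous level
        retained = level
    return retained


if __name__ == "__main__":
    # Hypothetical CFI values for one subscale.
    fits = {"configural": 0.970, "metric": 0.965,
            "scalar": 0.950, "residual": 0.948}
    print(highest_invariance_level(fits))
```

With these made-up values, the metric model is retained (CFI drops by .005) and the scalar model is not (CFI drops by a further .015), so the procedure stops at metric invariance.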

It is not clear why residual invariance will not be tested. For a helpful tutorial, see this: https://doi.org/10.15626/MP.2020.2595. (Jacek Buczny)

Response: We chose to rely on the procedure from the Mackinnon et al. (2022) tutorial, but we did not want to test for residual invariance because (1) residual invariance has been described to be hard to reach for most psychological measurement instruments (Kline, 2016; Van de Schoot et al., 2015), and (2) originally we wanted to make the ICC analysis contingent on the longitudinal invariance. But based on Ian Hussey’s comment regarding the interpretation of the measure even when the measure is not invariant (please see the section on ICC), we changed our strategy: we decided to make the ICC analysis non-contingent on measurement invariance, and we will also test the residual invariance of all the subscales of the STRAQ-1. We changed the wording about the longitudinal measurement invariance, included a description of how to test residual invariance throughout the manuscript, and added this step to the analysis code in the supplementary materials on our OSF page: https://osf.io/mr8n3/. We will now consider the measure invariant even if longitudinal residual invariance does not hold, provided that scalar invariance holds. In that case, we should still be able to conclude that between T1 and T2 the constructs are the same and that the scores are comparable; ICC estimates should then indeed be interpretable.

Why did you plan to use WLS but not ML or MLR? Besides, I recommend including a script for the analysis of multivariate normality as it might be helpful in determining the correct estimator. (Jacek Buczny)

Response: In our measurement models, we decided to use the Diagonally Weighted Least Squares (WLSMV) estimator over the maximum likelihood (ML) or robust maximum likelihood (MLR) estimators. Our data are ordinal (5-point Likert-type scale; see for example Liddell & Kruschke, 2018), and WLSMV is specifically designed for ordinal data. Furthermore, WLSMV makes no assumption of multivariate normality. Finally, in simulation work, WLSMV was less biased and more accurate than MLR in estimating the factor loadings (Flora & Curran, 2004; Kline, 2016; Li, 2016).

We decided to include a test of the multivariate normality of our dataset to inform the reader about whether the data are normal or not, but – a priori – we do not expect them to be multivariate normal. In our analyses, this should not present a problem, as multivariate normality is not a requirement for the WLSMV estimator.

We added the following information to the manuscript: “To investigate whether the variables in our dataset followed a multivariate normal distribution, we used the function `mvnorm.etest` from the energy package. The analysis showed that our data [does/does not] follow a multivariate normal distribution (E = XX, p = .XX). A priori, we had already decided to use the WLSMV estimator instead of ML or MLR as arguments in the cfa function in lavaan to compute our CFA model, irrespective of the outcome of the test for multivariate normality. The WLSMV is the preferred solution when (a) the data is ordinal, and (b) if data is potentially not normally distributed, as it makes no distribution assumptions (see Flora and Curran, 2004; Kline, 2016; Li, 2016).” (pages 13-14).

I suggest providing a better justification for why obtaining time invariant measurement would be important. Invariance is important in cross-time trait measurement because it ensures that intra-individual changes are not due to the disturbance in construct validity, regardless of when a measurement was taken. Would you expect that STRAQ-1 would capture any changes within the tested period (2021 minus 2018, so, three years)? Is that the period in which the levels of the measure constructs could even change? I do not think that the problem of within-person changes was adequately discussed in the manuscript. (Jacek Buczny)

Response: Thank you for pointing this out, we rewrote the manuscript adding a better justification about why measurement invariance is important. “In longitudinal studies, the meaning of a construct may change over time, resulting in longitudinal measurement non-invariance (Chen, 2008).”(page 13)

“The levels of longitudinal measurement invariance have different implications for the construct: (a) if the configural level holds, then the structure of the measure is similar between T1 and T2; (b) if the metric level holds, then the structure of the measure and the constructs are similar between T1 and T2; (c) if the scalar level holds, then the structure of the measure and the constructs are similar and the mean differences between T1 and T2 can be compared. Longitudinal scalar invariance is thus the minimal level required for our planned ICC analysis that uses the mean scores of T1 and T2 (Kline, 2016; Mackinnon et al., 2022).” (page 13). We also added to the manuscript an elaboration on how non-invariance could impact the ICC estimates (see our reply to Ian Hussey in the following ICC section).

Concerning the topic of within-person changes, we will add this to the discussion: “According to Vergara et al. (2019), the STRAQ-1 is supposed to measure stable, trait-like constructs that are unlikely to change rapidly in adulthood. A recent meta-analysis about personality trait development across the lifespan showed (similarly to a previous meta-analysis, see Roberts & DelVecchio, 2000) that – after young adulthood – traits are indeed stable: they found the average rank-order stability to be r = .60, but with a large heterogeneity across studies (Bleidorn et al., 2022). Nevertheless, life events (for instance, attachment traumas) are known to introduce changes in personality traits and can be linked differently to different traits (Bleidorn et al., 2018). But because no test-retest of a scale to assess this has been conducted yet, we did not have any strong a priori hypothesis (a) about how life events could induce changes in participant responses to the STRAQ-1, and (b) about the timeframe in which such change in the measured personality traits could occur.”

About the ICCs

The manuscript notes that you will report ICCs. There are multiple variants of ICC, and multiple ways of labelling these variants. Please report which variant you intend on reporting, ideally with some justification of your choice (although the intent of the creators of the variants are not necessarily universal interpretations, making conceptual justification tricky or less objective at times), even if this simply means going with the modal ICC2. Weir (2005) "Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM" provides a useful overview of the different methods and labels. As I note above, it may also be useful to report more than one version. Specificity in the code and written prereg might become important here. Please also specify which code implementation you are using for these, e.g., which R library and version. (Ian Hussey)

However, it is perhaps also worth considering that (at least some forms of) ICC are argued to account for differences in means between time points. The current manuscript makes a hard binary re their interpretability, but depending on the ICC choice (which I return to below), these estimates may still be meaningful. Indeed, it may be useful to report multiple versions (e.g., ICC2 and ICC3) and consider their utility in light of the tests of invariance of means. (Ian Hussey)

Response: Thank you very much for this interesting suggestion. We will report and interpret ICC(2,1), which is used to evaluate absolute agreement, and ICC(3,1), which is used to evaluate consistency. Both of these ICCs are calculated through 2-way mixed models. The ICC(2,1) accounts for systematic and random error by specifying the time of measurement as a random effect in the model. The ICC(3,1) only accounts for random error because the time of measurement is not specified as a random effect in the model. We will include the code implementation in the text below, where the first three comments of this section “About the ICC” are resolved.

The current manuscript plans to report estimates of ICC only if temporal measurement invariance is found. I understand the logic of this, e.g., that the meaningfulness of the ICCs is undermined if the means are not invariant. However, it is quite likely that the scale will end up being used whether or not the means are invariant, and that readers and users might be better informed by the abstract including ICC estimates either way. I suggest that you include the ICC estimates non contingent on the results of the invariance tests, as their results will likely be relevant to future users of the scale either way. You could instead specify that the interpretability of the ICCs is partially contingent on the invariance. (Ian Hussey)

Response: Based on your feedback, we made the ICC analysis (and the report of the estimates) non-contingent on the results of the invariance tests. If metric longitudinal invariance is not reached for one or more scales, we will discuss the possibility that the ICC estimates for those scales may be unreliable because of a lack of longitudinal invariance.

Altogether, we indeed agree that reporting both the ICC(2,1) as well as the ICC(3,1) in conjunction with our tests of longitudinal invariance will allow us to better evaluate potential systematic differences and/or stability between time points.

To report the ICCs in the main text, we rely on norms set forth by Parsons, Kruijt, and Fox (2019): “We computed and reported ICC(2,1) to evaluate absolute agreement between the two time points and ICC(3,1) to evaluate consistency. Both of these ICCs are calculated through 2-way mixed models. ICC(2,1) accounts for systematic and random error by specifying the time of measurement as a random effect in the model. ICC(3,1) only accounts for random error because the time of measurement is not specified as a random effect in the model (Koo & Li, 2016). We estimated the STRAQ-1 subscales’ test-retest reliability between the 2-time points through intraclass correlation coefficients (ICCs) using the psych package in R (Revelle, 2018). The analysis code is available on the OSF: https://osf.io/mr8n3/.” (page 17). “For the Sensitivity subscale, the estimated agreement was .XX, 95% confidence interval (CI) = [.XX, .XX], and the estimated consistency was .XX, 95% CI = [.XX, .XX]. For the Social Thermoregulation subscale, the estimated agreement was .XX, 95% confidence interval (CI) = [.XX, .XX], and the estimated consistency was .XX, 95% CI = [.XX, .XX]. For the Solitary Thermoregulation subscale, the estimated agreement was .XX, 95% confidence interval (CI) = [.XX, .XX], and the estimated consistency was .XX, 95% CI = [.XX, .XX]. Finally, for the Risk Avoidance subscale, the estimated agreement was .XX, 95% confidence interval (CI) = [.XX, .XX], and the estimated consistency was .XX, 95% CI = [.XX, .XX].” (page 17).
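As an illustration of how the two coefficients differ, the following Python sketch computes ICC(2,1) and ICC(3,1) from the usual two-way ANOVA mean squares. It is not the psych package implementation that the authors use; the function name and the toy scores are hypothetical and serve only to show the arithmetic.

```python
def icc_agreement_and_consistency(scores):
    """Return (ICC(2,1), ICC(3,1)) for complete data.

    scores: one row per participant, one column per time point.
    Uses the two-way ANOVA mean squares for participants (rows),
    time points (columns), and residual error.
    """
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of time points
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Agreement: between-time-point variance counts against reliability.
    icc_2_1 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    # Consistency: a constant shift between time points is ignored.
    icc_3_1 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc_2_1, icc_3_1


# Toy data: every participant scores exactly one point higher at T2.
toy_scores = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
agreement, consistency = icc_agreement_and_consistency(toy_scores)
print(round(agreement, 3), round(consistency, 3))  # prints: 0.833 1.0
```

In the toy data the rank order is perfectly preserved, so consistency is 1, while the uniform one-point shift lowers agreement. This is the kind of systematic difference between time points that comparing the two coefficients can reveal.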

We will interpret the analyses in the following ways:

If longitudinal measurement invariance does not hold (or is underpowered) and if ICC(3,1) shows a high level of consistency between the two time points, we will not claim that the subscale is reliable across time.
If longitudinal measurement invariance does not hold, the mean of T1 and T2 cannot be compared because the meaning of the constructs may have changed between T1 and T2 for participants, but if ICC(2,1) shows a high level of agreement between T1 and T2 we could show that there is a systematic (probably measurement) error between T1 and T2. In the latter case, the measure remains reliable.
If the measure is longitudinally non-invariant ICC(2,1) will allow us to further investigate whether the error is systematic or not depending on the level of agreement between T1 and T2.
In other words, a lack of longitudinal measurement invariance impacts the interpretation of the mean differences between T1 and T2, and thus impacts the interpretation of both ICC(2,1) and ICC(3,1) estimates (Mackinnon et al., 2022). We think that the impact of a lack of longitudinal measurement invariance is more problematic for the interpretation of ICC(3,1).

If measurement invariance does not hold, high reliability with ICC(3,1) could hide the fact that participants interpret the construct differently between the first time they took the survey and the second time they took it (for example, the pattern of loading of the measure would not be the same between T1 and T2, but the mean scores would be the same). We think that ICC(2,1) is less problematic because it can account for a systematic difference between T1 and T2, so if longitudinal measurement invariance holds, ICC2 can show a systematic effect between the two time points (for example, a learning effect or measurement error). In other words, if the measure is longitudinally non-invariant ICC(2,1) will allow us to further investigate whether the error is systematic, or not.

The manuscript uses cut-offs in several places, e.g., "X subscales to provide [excellent/good/moderate/poor] test-retest reliability". Please provide the cut-offs you're using and their citations, as these can be contentious, especially in light of Watson (2004) who noted that test-retest reliability studies “almost invariably conclude that their stability correlations were ‘adequate’ or ‘satisfactory’ regardless of the size of the coefficient or the length of the retest interval” (p. 326). (Ian Hussey)

Response: The cut-off values that we will use for the interpretation of the ICC results are the following: ICC values less than 0.5 will be reported as indicative of poor, values between 0.5 and 0.75 as indicative of moderate, values between 0.75 and 0.9 as indicative of good, and values greater than 0.90 as indicative of excellent reliability. At the same time, we recognize that these cut-offs are still contentious. In our last draft, we already had the following text: “Koo and Li (2016) defined standards for the ICC with reliability being poor at ICC < 0.5; moderate at 0.5 < ICC < 0.75; good at 0.75 < ICC < 0.9; and excellent at ICC > 0.9.” We have now clarified that these will be our cut-off values, by adding: “These are the cut-off values that we used for labeling our results. If the 95% confidence interval of an ICC estimate was in between two labels, we used both (for example, if the 95% CI would have been [0.83-0.94], the level of reliability would have been regarded as “good” to “excellent”; see Koo & Li, 2016).” (pages 17-18).
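The labelling rule described above can be sketched in a few lines of Python. This is an illustration rather than the authors' code. The function names are hypothetical, and the treatment of values that fall exactly on a cut-off (0.5, 0.75, 0.9) is an assumption, since the text does not say which label they receive.

```python
def koo_li_label(icc):
    """Map a single ICC value to a Koo and Li (2016) label.

    Assumption: 0.5 counts as moderate, 0.75 as good, and 0.9 as good
    (only values strictly greater than 0.9 are excellent).
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


def label_from_ci(lower, upper):
    """Label an estimate by the bounds of its 95% confidence interval.

    If both bounds fall under the same label, that label is returned;
    otherwise both labels are reported, lower bound first.
    """
    low_label = koo_li_label(lower)
    high_label = koo_li_label(upper)
    if low_label == high_label:
        return low_label
    return f"{low_label} to {high_label}"


# The interval used as an example in the response above.
print(label_from_ci(0.83, 0.94))  # prints: good to excellent
```

Basing the label on the interval bounds rather than on the point estimate keeps the verbal label from suggesting more precision than the data support.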

Following Parsons et al. (2019), we added a footnote saying that “We recognize that the discussion around cut-offs is contentious and that cut-offs are often arbitrarily chosen, which may make our values equally arbitrary (see e.g., Watson, 2004). The resulting labels (e.g., “good”) are considered as one of many means to assess the validity of a measure (Rodebaugh et al., 2016) and a first step towards defining a normative range of reliability estimates for a scale that will be applied across samples or contexts.” (page 18).

Minor points

Factor structure of the STRAQ-1 and unidimensionality of the subscales

Define methods of testing for unidimensionality. The method section includes templates for the assessment of unidimensionality, but the method and results do not define how this will be discussed. The assessment of unidimensionality, like say convergent and discriminant validity, can be a tricky and multifaceted process that is difficult to infer from a single metric. Perhaps this language needs to be toned down as well as the method explicated? E.g., stating that X metric produced Y result, which is (in)consistent with unidimensionality, while noting that this is just one metric? I'm generally arguing for more explication of your method here, both to constrain researcher degrees of freedom and for the reader's understanding, and also some slightly more cautious language. (Ian Hussey)

What is the point of running exploratory factor analysis by means of the EMPKC() function? I do not find running this analysis necessary. Besides, N = 184 is too small to perform a conclusive EFA. (Jacek Buczny)

It is not clear why each scale is going to be tested separately. By my lights, the instrument should be tested as a whole, so each test should be based on a four-factor model. I understand that due to the small sample size (N = 184), testing four unidimensional models seems the only option, but in my opinion, such a procedure is doubtful, at best. All the subscales were tested in the same survey, so they should be analyzed jointly. (Jacek Buczny)

Response: Based on both reviewers’ comments, we have decided not to report on unidimensionality, as our sample is too underpowered for a comprehensive test. The goal of the paper is not to test the factor structure of the scale as this has already been done by Vergara et al. (2019). The current goal instead is to investigate the scale’s reliability through ICC. Initially, we wanted to test unidimensionality of the subscales in our sample, only using the EMPKC() function (aiming for a 1-factor solution for each subscale) as a first extra step before running our main analysis, but we decided to remove this test and to not report it in the manuscript. We will thus not investigate the convergent and discriminant validity and not run a confirmatory factor analysis on the full STRAQ-1 model, as our test would be underpowered and thus not informative.

Nomological network of the scale

Report of the magnitudes of all correlations. Footnote 4 notes "All the reported correlations are significant, and the interested reader can refer to Vergara et al. (2019) for more details about all the correlations investigated in the original development paper." and table 1 reports some correlations as "n.s.". I think readers would be better informed by reporting all correlations (e.g., the min and max in the footnote, and the estimates in the table), and noting their significance (or not) separately. I'm not positive off the top of my head, but I think this is in line with APA guidelines, and provides useful detail for the reader's understanding. (Ian Hussey)

Response: Thank you for this comment. We now report all the correlations and their significance levels in tables in the Supplementary Materials on our OSF page: https://osf.io/86qdx. We do not report them in the running text since the number of correlations is large and is best presented using several tables.

Citation of R Packages

Please specify and cite all key R packages you use for your analyses. This doesn't have to be exhaustive, and could even be in supplementary materials, but it gives credit where credit is due, and helps with reproducibility.

Response: We agree. For transparency, reproducibility, and credit to the authors who wrote the packages, we wrote the following section in our main text: “We used the following R packages to conduct the analysis: rio (Chan et al., 2021), janitor (Firke, 2021), tidyverse (Wickham et al., 2019), psych (Revelle, 2022), GPArotation (Coen et al., 2005), EFA.dimensions (O'Connor, 2022), lavaan (Rosseel, 2012), semPlot (Epskamp, 2022), semTools (Jorgensen, 2021), energy (Rizzo & Szekely, 2022), semPower (Moshagen & Erdfelder, 2016), ICC.Sample.Size (Zou, 2012).” (page 11). We also added the relevant packages in our reference section.

Rewriting suggestions that we accepted

"This Registered Report provides the first measurement invariance across time points and test-retest reliability of the Social Thermoregulation, Risk Avoidance Questionnaire". Perhaps this should be "This Registered Report provides the first test of measurement invariance across time points and estimates of test-retest reliability for the Social Thermoregulation, Risk Avoidance Questionnaire"? (Ian Hussey)

Response: We changed the writing to “This Registered Report provides the first test of measurement invariance across time points and estimates of test-retest reliability for the Social Thermoregulation, Risk Avoidance Questionnaire". Thank you for pointing this out. (page 2).

"However, to date, this instrument has no test-retest reliability." Perhaps this should be "However, to date, this instrument has no estimates of test-retest reliability." (Ian Hussey)

Response: We changed the writing to "However, to date, this instrument has no estimates of test-retest reliability.” Again, thank you for pointing this out. (page 2).

https://doi.org/10.24072/pci.rr.100419.ar1

Decision by Moin Syed, posted 19 Apr 2023, validated 19 Apr 2023

April 18, 2023

Dear Authors,

Thank you for submitting your Stage 1 manuscript, “Test-Retest Reliability of the STRAQ-1: A Registered Report,” to PCI RR.

The reviewers and I were all in agreement that you are pursuing an important project, but that the Stage 1 manuscript would benefit from some revisions. Accordingly, I am asking that you revise and resubmit your Stage 1 proposal for further evaluation.

The reviewers provided thoughtful, detailed comments that align with my own reading of the proposal, so I urge you to pay close attention to them as you prepare your revision.

In my view, there is nothing major that needs to be revised in your proposal, but rather there are many small to moderate issues (detailed by the reviewers). A measurement project such as this involves many subjective decisions that must be made on issues for which there is no clear answer. The reviewers provide some suggestions for how to resolve the issues that they identified, but in many cases there may be other defensible options. Thus, a general message that you should take from the reviewer comments is that you need to provide much more detail and better justify your decisions throughout.

One specific and important issue to attend to is that both reviewers had some trouble accessing your analytic code, but in different ways. Please be sure that you are providing the correct links for this Stage 1 review process; having full access to your code will facilitate the review process.

When submitting a revision, please provide a cover letter detailing how you have addressed the reviewers’ points.

Thank you for submitting your work to PCI RR, and I look forward to receiving your revised manuscript.

Moin Syed

PCI RR Recommender

https://doi.org/10.24072/pci.rr.100419.d1

Reviewed by Ian Hussey, 15 Mar 2023

Dear Olivier, Richard, Lindenberg, Caspar, and Hans,

Thank you for the opportunity to review your stage 1 RR. Test-retest reliability is a topic that I think about more and more these days, and estimates of it are unfortunately underreported in the literature relative to its importance to substantive claims. I'm therefore glad to see you plan to provide estimates of test-retest reliability for this scale.

Please note that while the OSF project containing the code for the analyses etc is linked in the manuscript, it is currently not public. As such, I couldn't review the code.

Larger points

Contingent analysis and reporting of ICCs in the abstract

The current manuscript plans to report estimates of ICC only if temporal measurement invariance is found. I understand the logic of this, e.g., that the meaningfulness of the ICCs is undermined if the means are not invariant. However, it is quite likely that the scale will end up being used whether or not the means are invariant, and that readers and users might be better informed by the abstract including ICC estimates either way. I suggest that you include the ICC estimates non contingent on the results of the invariance tests, as their results will likely be relevant to future users of the scale either way. You could instead specify that the interpretability of the ICCs is partially contingent on the invariance.

However, it is perhaps also worth considering that (at least some forms of) ICC are argued to account for differences in means between time points. The current manuscript makes a hard binary re their interpretability, but depending on the ICC choice (which I return to below), these estimates may still be meaningful. Indeed, it may be useful to report multiple versions (e.g., ICC2 and ICC3) and consider their utility in light of the tests of invariance of means.

Report of the magnitudes of all correlations

Footnote 4 notes "All the reported correlations are significant, and the interested reader can refer to Vergara et al. (2019) for more details about all the correlations investigated in the original development paper." and table 1 reports some correlations as "n.s.". I think readers would be better informed by reporting all correlations (e.g., the min and max in the footnote, and the estimates in the table), and noting their significance (or not) separately. I'm not positive off the top of my head, but I think this is in line with APA guidelines, and provides useful detail for the reader's understanding.

Clarify choice of ICC variant

The manuscript notes that you will report ICCs. There are multiple variants of ICC, and multiple ways of labelling these variants. Please report which variant you intend on reporting, ideally with some justification of your choice (although the intent of the creators of the variants are not necessarily universal interpretations, making conceptual justification tricky or less objective at times), even if this simply means going with the modal ICC2. Weir (2005) "Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM" provides a useful overview of the different methods and labels. As I note above, it may also be useful to report more than one version. Specificity in the code and written prereg might become important here. Please also specify which code implementation you are using for these, e.g., which R library and version.

Consider swapping ICC power analyses from detection of non-zero scores to estimation precision

The current power analyses for the ICCs are based on their ability to detect a non-zero ICC. Many, including - I think - Parsons, Kruijt & Fox (2019) have argued that zero is not a particularly meaningful reference point for reliability estimates. I.e., detectable non-zero reliability does not tell us much about whether reliability is practically adequate, in my opinion. I would argue that a better approach is to estimate the 95% CI width that your sample size will provide as a function of different ICC values (as CI width and estimate are related for ICC). This can be done using the R package ICC.Sample.Size. I have a working example of this here, which I've taken from an in progress test-retest study I'm running. Feel free to use or adapt this yourself: https://gist.github.com/ianhussey/03a3d3940b93d79191a3926a09bcfc2b
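The idea of judging the sample by the precision it affords, rather than by power against zero, can be illustrated with a small Monte Carlo sketch in Python. This is not the ICC.Sample.Size approach mentioned above and not the linked gist. It swaps in a simulation that approximates how wide the central 95% range of ICC(3,1) estimates would be at N = 184 for several assumed true ICC values. Normally distributed true scores and errors, the function names, the number of replications, and the seed are all assumptions made for the illustration.

```python
import random


def icc_consistency_two_timepoints(t1, t2):
    """ICC(3,1) for two measurements per person, from two-way mean squares."""
    n = len(t1)
    grand = (sum(t1) + sum(t2)) / (2 * n)
    mean_t1, mean_t2 = sum(t1) / n, sum(t2) / n
    row_means = [(a + b) / 2 for a, b in zip(t1, t2)]
    ss_rows = 2 * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * ((mean_t1 - grand) ** 2 + (mean_t2 - grand) ** 2)
    ss_total = sum((x - grand) ** 2 for x in t1 + t2)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / (n - 1)
    # With k = 2 time points, MSR + (k - 1) * MSE reduces to MSR + MSE.
    return (ms_rows - ms_error) / (ms_rows + ms_error)


def simulated_interval_width(true_icc, n=184, reps=300, seed=1):
    """Width of the central 95% range of simulated ICC(3,1) estimates."""
    rng = random.Random(seed)
    sd_true = true_icc ** 0.5          # true-score SD (total variance = 1)
    sd_error = (1 - true_icc) ** 0.5   # error SD at each time point
    estimates = []
    for _ in range(reps):
        true_scores = [rng.gauss(0, sd_true) for _ in range(n)]
        t1 = [s + rng.gauss(0, sd_error) for s in true_scores]
        t2 = [s + rng.gauss(0, sd_error) for s in true_scores]
        estimates.append(icc_consistency_two_timepoints(t1, t2))
    estimates.sort()
    lower = estimates[int(0.025 * reps)]
    upper = estimates[int(0.975 * reps) - 1]
    return upper - lower


for assumed_icc in (0.5, 0.75, 0.9):
    print(assumed_icc, round(simulated_interval_width(assumed_icc), 3))
```

Because the range narrows as the true ICC increases, a table of widths across plausible ICC values says more about what the sample can support than a single power figure against zero.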

Define methods of testing for unidimensionality

The method section includes templates for the assessment of unidimensionality, but the method and results do not define how this will be discussed. The assessment of unidimensionality, like say convergent and discriminant validity, can be a tricky and multifaceted process that is difficult to infer from a single metric. Perhaps this language needs to be toned down as well as the method explicated? E.g., stating that X metric produced Y result, which is (in)consistent with unidimensionality, while noting that this is just one metric? I'm generally arguing for more explication of your method here, both to constrain researcher degrees of freedom and for the reader's understanding, and also some slightly more cautious language.

Be more precise about your inferential method and its correspondence with Mackinnon et al.

From page 13, the manuscript states that "we followed the procedure provided by Mackinnon et al. (2022)" and then goes on to specify your inferential method (e.g., which model fit indices, hypothesis tests, and cut-offs). However, most of the specifics that the current manuscript mentions are not what Mackinnon et al specify in their preprint. Indeed, my reading of Mackinnon et al is that they are purposefully non-specific about the method they employ for comparisons (e.g., between configural, metric and scalar models). To take one example, Mackinnon et al. discuss the results of Cheung & Rensvold's (2002) highly cited simulation studies, which studied the utility of examining change in CFI between models ($\Delta$CFI), but the current manuscript doesn't employ this method. I would also raise the point that the method the manuscript currently suggests - a combination of both the chi-square tests' p values, the CFI, and the RMSEA - has not, to my knowledge, been assessed within any of the well cited simulation studies, and therefore has an unknown sensitivity-specificity tradeoff.

On the one hand, I am extremely sympathetic that you are trying to simultaneously (1) create a tight preregistration and yet (2) assess (temporal) measurement invariance, for which there are few commonly agreed upon inference methods with the level of specificity in cut-offs etc required for a prereg. This creates a tricky situation for you. However, on the other hand, I think it's important to be more explicit about your choice of inference method, how well it corresponds with the method specified in Mackinnon et al (which you cite as following but arguably diverge from in some ways that have unknown importance), and indeed how much agreement there is in the field about these methods. I think it would be best - and indeed pragmatic - to include greater acknowledgement of the lack of norms in the field in this regard.

I would also suggest that you follow the recommendations of Putnick & Bornstein (2016) and report the results of other fit metrics even though you don't use them for inferences, especially given that the lack of norms for inference methods for invariance testing mean that some (or even many) readers may look at the same fit metric values and come to different conclusions.

I should qualify the above section and recommendations by saying that I find the literature on these recommendations confusing and frequently miscited, and I myself may have gotten some of the details of previous studies incorrect here or there. My suggestions here are well intentioned even if I get a detail wrong - My general desire is for some more detail on your methods, their sources, acknowledgement of the lack of agreement in the field, and some bullet proofing / maximising the informativeness of your results with other metrics for future readers' benefit.

Interpretation of the magnitude of test-retest reliability

The manuscript uses cut-offs in several places, e.g., "X subscales to provide [excellent/good/moderate/poor] test-retest reliability". Please provide the cut-offs you're using and their citations, as these can be contentious, especially in light of Watson (2004) who noted that test-retest reliability studies “almost invariably conclude that their stability correlations were ‘adequate’ or ‘satisfactory’ regardless of the size of the coefficient or the length of the retest interval” (p. 326).

Interpretative guidelines

In light of Watson (2004), quoted in the previous paragraph, could the authors give some thought to preregistering how they might interpret, in their discussion, different results that they might find? Under what circumstances would the test-retest reliability, for example, be problematically low, or problematic for what type of analysis in what way? This is very tricky to do and not a hard requirement. However, if tricky to do before seeing the results, this is no less difficult after seeing the results. Perhaps it would be useful to then preregister what won't be claimed contingent on results, if no such contingencies can be specified ahead of time. For example, that no hard claims about the performance of the measure in terms of its reliability will be made in the discussion. In saying this, I realise that I am raising the bar for this manuscript relative to published work. I have been thinking for a few years about how preregistration interacts with measurement work such as this, and it is certainly not a solved problem. I'm not looking to be difficult here and wouldn't hinder the publication of this work based on this point - I'm more raising the question for you (we, us, collectively) about how high quality, preregistered measurement studies like this can think about how their claims are linked to their results ahead of time. Any thoughts or ideas or attempt at this would be interesting and welcome.

Smaller points

Please specify and cite all key R packages you use for your analyses. This doesn't have to be exhaustive, and could even be in supplementary materials, but it gives credit where credit is due, and helps with reproducibility.

In the abstract, you write:

1. "This Registered Report provides the first measurement invariance across time points and test-retest reliability of the Social Thermoregulation, Risk Avoidance Questionnaire". Perhaps this should be "This Registered Report provides the first test of measurement invariance across time points and estimates of test-retest reliability for the Social Thermoregulation, Risk Avoidance Questionnaire"?

2. "However, to date, this instrument has no test-retest reliability." Perhaps this should be "However, to date, this instrument has no estimates of test-retest reliability."

https://doi.org/10.24072/pci.rr.100419.rev11

Reviewed by Jacek Buczny, 18 Apr 2023

Dear Authors,

This is an interesting project. The structure of the manuscript is clear and the argumentation is concise. You may find my recommendations below.

Major Comments

(1) Data loss was substantial. Between 2018 and 2021, more than 1000 participants provided their responses, but only 184 will be included in the analyses. Data loss could be systematic: the sample of 184 may deviate in some way from the subpopulation (students), and thus the generalizability of the results may be very limited.

(2) The justification for the sample size is not convincing given that there are powerful tools that are very helpful in deciding on the minimum sample size: https://github.com/moshagen/semPower/tree/master/vignettes, https://sempower.shinyapps.io/sempower/. Because the data have already been collected, running a post hoc power analysis would seem the only sensible option. Such an analysis should account for the fact that various types of invariance will be tested. Model identification will affect the degrees of freedom of each tested model. A sensitivity analysis should provide detailed information regarding power, tested RMSEA values, and the like.

(3) It is not clear why each scale is going to be tested separately. By my lights, the instrument should be tested as a whole, so each test should be based on a four-factor model. I understand that due to the small sample size (N = 184), testing four unidimensional models seems the only option, but in my opinion, such a procedure is doubtful, at best. All the subscales were tested in the same survey, so they should be analyzed jointly.

(4) Under the link https://osf.io/ab73e/, I could not find the code indicated in the manuscript. I found it here: https://osf.io/mr8n3/. The latter script seems incomplete (by the way: why was a PDF submitted rather than an Rmd file?). For instance, it is not clear what package(s) will be used in data analysis – lavaan, I presume. What is the point of running exploratory factor analysis by means of the EMPKC() function? I do not find running this analysis necessary. Besides, N = 184 is too small to perform a conclusive EFA.

(5) It is not clear why residual invariance will not be tested. For a helpful tutorial, see this: https://doi.org/10.15626/MP.2020.2595.

(6) Why did you plan to use WLS but not ML or MLR? Besides, I recommend including a script for the analysis of multivariate normality as it might be helpful in determining the correct estimator.

Minor Comments

(7) I suggest providing a better justification for why obtaining time-invariant measurement would be important. Invariance is important in cross-time trait measurement because it ensures that intra-individual changes are not due to a disturbance in construct validity, regardless of when a measurement was taken. Would you expect the STRAQ-1 to capture any changes within the tested period (2021 minus 2018, so three years)? Is that a period in which the levels of the measured constructs could even change? I do not think that the problem of within-person changes was adequately discussed in the manuscript.

Kindest regards,
Jacek Buczny
Vrije Universiteit Amsterdam

https://doi.org/10.24072/pci.rr.100419.rev12
