Recommendation

Developing differential item functioning (DIF) testing methods for use in forced-choice assessments

Amanda Montoya based on reviews by Timo Gnambs and 2 anonymous reviewers

A recommendation of:

STAGE 1

Detecting DIF in Forced-Choice Assessments: A Simulation Study Examining the Effect of Model Misspecification

Jake Plantz, Anna Brown, Keith Wright, Jessica K. Flake https://osf.io/5d9ph?view_only=cccd50cce05a4dcea4df3a31fe963f2d version 3

Read report on server

Abstract

EN

AR

ES

FR

HI

JA

PT

RU

ZH-CN

Detecting DIF in Forced-Choice Assessments: A Simulation Study Examining the Effect of Model Misspecification

On a forced-choice (FC) questionnaire, the respondent must rank two or more items instead of indicating how much they agree with each of them. Research demonstrates that this format can reduce response bias. However, the data are ipsative, resulting in item scores that are not comparable across individuals. Advances in Item Response Theory have made scoring FC assessments possible, as well as evaluating their psychometric properties. These methodological developments have spurred increase use of FC assessments in applied educational, industrial, and psychological settings. Yet, a reliable method for testing differential item functioning (DIF), necessary for evaluating test bias, has not been established. In 2021, Lee and colleagues examined a latent-variable modelling approach for detecting DIF in forced-choice data and reported promising results. However, their research was focused on conditions where DIF items were known, which is not likely in practice. To build upon their work, we carried out a simulation study to evaluate the impact of model misspecification, using the Thurstonian-IRT model, on DIF detection, i.e., treating DIF items as non-DIF anchors. We manipulated the following factors: Sample size, whether the groups being tested for DIF had equal or unequal sample size, the number of traits, DIF effect size, the percentage of items with DIF, the analysis approach, the anchor set size, and the percent of DIF blocks in the anchor. Across 336 simulated conditions, we found [Results and discussion summarized here].

psychometrics, DIF, IRT

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

اكتشاف DIF في تقييمات الاختيار القسري: دراسة محاكاة تدرس تأثير سوء مواصفات النموذج

في استبيان الاختيار القسري (FC)، يجب على المستجيب ترتيب عنصرين أو أكثر بدلاً من الإشارة إلى مدى موافقته على كل منها. توضح الأبحاث أن هذا التنسيق يمكن أن يقلل من تحيز الاستجابة. ومع ذلك، فإن البيانات غير قابلة للمقارنة، مما يؤدي إلى درجات غير قابلة للمقارنة بين الأفراد. إن التقدم في نظرية الاستجابة للعنصر جعل من الممكن إجراء تقييمات FC، بالإضافة إلى تقييم خصائصها السيكومترية. وقد حفزت هذه التطورات المنهجية على زيادة استخدام تقييمات FC في البيئات التعليمية والصناعية والنفسية التطبيقية. ومع ذلك، لم يتم إنشاء طريقة موثوقة لاختبار أداء العناصر التفاضلية (DIF)، اللازمة لتقييم تحيز الاختبار. في عام 2021، قام لي وزملاؤه بفحص نهج نمذجة المتغير الكامن للكشف عن DIF في بيانات الاختيار القسري، وأبلغوا عن نتائج واعدة. ومع ذلك، ركزت أبحاثهم على الظروف التي كانت فيها عناصر DIF معروفة، وهو أمر غير مرجح في الممارسة العملية. للبناء على عملهم، أجرينا دراسة محاكاة لتقييم تأثير التحديد الخاطئ للنموذج، باستخدام نموذج Thurstonian-IRT، على اكتشاف DIF، أي معاملة عناصر DIF على أنها نقاط ارتساء غير DIF. لقد تعاملنا مع العوامل التالية: حجم العينة، وما إذا كانت المجموعات التي يتم اختبارها من أجل DIF لها حجم عينة متساوٍ أو غير متساوٍ، وعدد السمات، وحجم تأثير DIF، والنسبة المئوية للعناصر التي تحتوي على DIF، ونهج التحليل، وحجم مجموعة الارتساء، ونسبة العناصر التي تحتوي على DIF. النسبة المئوية لكتل DIF الموجودة في المرساة. عبر 336 حالة محاكاة، وجدنا [تلخيص النتائج والمناقشة هنا].

القياسات النفسية، DIF، IRT

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Detección de DIF en evaluaciones de elección forzada: un estudio de simulación que examina el efecto de la especificación errónea del modelo

En un cuestionario de elección forzada (FC), el encuestado debe clasificar dos o más elementos en lugar de indicar cuánto está de acuerdo con cada uno de ellos. Las investigaciones demuestran que este formato puede reducir el sesgo de respuesta. Sin embargo, los datos son ipsativos, lo que da como resultado puntuaciones de ítems que no son comparables entre individuos. Los avances en la teoría de la respuesta al ítem han hecho posible calificar las evaluaciones de FC, así como también evaluar sus propiedades psicométricas. Estos desarrollos metodológicos han estimulado un mayor uso de evaluaciones de FC en entornos educativos, industriales y psicológicos aplicados. Sin embargo, no se ha establecido un método confiable para probar el funcionamiento diferencial de ítems (DIF), necesario para evaluar el sesgo de la prueba. En 2021, Lee y sus colegas examinaron un enfoque de modelado de variables latentes para detectar DIF en datos de elección forzada e informaron resultados prometedores. Sin embargo, su investigación se centró en condiciones en las que se conocían los elementos DIF, lo que no es probable en la práctica. Para aprovechar su trabajo, llevamos a cabo un estudio de simulación para evaluar el impacto de la especificación errónea del modelo, utilizando el modelo Thurstoniano-IRT, en la detección de DIF, es decir, tratando los elementos DIF como anclajes no DIF. Manipulamos los siguientes factores: tamaño de la muestra, si los grupos evaluados para DIF tenían un tamaño de muestra igual o desigual, el número de rasgos, el tamaño del efecto del DIF, el porcentaje de elementos con DIF, el enfoque de análisis, el tamaño del conjunto de anclaje y el porcentaje de bloques DIF en el ancla. En 336 condiciones simuladas, encontramos [Los resultados y la discusión se resumen aquí].

psicometría, DIF, TRI

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Détection du DIF dans les évaluations à choix forcé : une étude de simulation examinant l'effet d'une mauvaise spécification du modèle

Dans un questionnaire à choix forcé (FC), le répondant doit classer deux éléments ou plus au lieu d'indiquer dans quelle mesure il est d'accord avec chacun d'eux. La recherche démontre que ce format peut réduire les biais de réponse. Cependant, les données sont ipsatives, ce qui entraîne des scores aux éléments qui ne sont pas comparables entre les individus. Les progrès de la théorie de la réponse aux éléments ont rendu possible la notation des évaluations FC, ainsi que l'évaluation de leurs propriétés psychométriques. Ces développements méthodologiques ont stimulé une utilisation accrue des évaluations FC dans les contextes éducatifs appliqués, industriels et psychologiques. Pourtant, une méthode fiable pour tester le fonctionnement différentiel des éléments (DIF), nécessaire à l’évaluation du biais des tests, n’a pas été établie. En 2021, Lee et ses collègues ont examiné une approche de modélisation à variable latente pour détecter le DIF dans les données à choix forcé et ont rapporté des résultats prometteurs. Cependant, leurs recherches se sont concentrées sur des conditions dans lesquelles les éléments du DIF étaient connus, ce qui est peu probable dans la pratique. Pour nous appuyer sur leurs travaux, nous avons réalisé une étude de simulation pour évaluer l'impact d'une mauvaise spécification du modèle, à l'aide du modèle Thurstonian-IRT, sur la détection DIF, c'est-à-dire le traitement des éléments DIF comme des ancres non-DIF. Nous avons manipulé les facteurs suivants : la taille de l'échantillon, si les groupes testés pour le DIF avaient une taille d'échantillon égale ou inégale, le nombre de traits, la taille de l'effet DIF, le pourcentage d'éléments avec DIF, l'approche d'analyse, la taille de l'ensemble d'ancrage et le pourcentage de blocs DIF dans l’ancre. Sur 336 conditions simulées, nous avons trouvé [Résultats et discussion résumés ici].

psychométrie, DIF, IRT

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

फ़ोर्स्ड-चॉइस असेसमेंट में डीआईएफ का पता लगाना: मॉडल मिसस्पेसिफिकेशन के प्रभाव की जांच करने वाला एक सिमुलेशन अध्ययन

जबरन विकल्प (एफसी) प्रश्नावली पर, उत्तरदाता को यह इंगित करने के बजाय दो या दो से अधिक आइटमों को रैंक करना होगा कि वे उनमें से प्रत्येक से कितना सहमत हैं। अनुसंधान दर्शाता है कि यह प्रारूप प्रतिक्रिया पूर्वाग्रह को कम कर सकता है। हालाँकि, डेटा ipsative है, जिसके परिणामस्वरूप आइटम स्कोर अलग-अलग व्यक्तियों में तुलनीय नहीं हैं। आइटम रिस्पांस थ्योरी में प्रगति ने स्कोरिंग एफसी मूल्यांकन को संभव बना दिया है, साथ ही उनके साइकोमेट्रिक गुणों का मूल्यांकन भी किया है। इन पद्धतिगत विकासों ने व्यावहारिक शैक्षिक, औद्योगिक और मनोवैज्ञानिक सेटिंग्स में एफसी मूल्यांकन के उपयोग में वृद्धि को प्रेरित किया है। फिर भी, परीक्षण पूर्वाग्रह के मूल्यांकन के लिए आवश्यक डिफरेंशियल आइटम फंक्शनिंग (डीआईएफ) के परीक्षण के लिए एक विश्वसनीय विधि स्थापित नहीं की गई है। 2021 में, ली और उनके सहयोगियों ने फ़ोर्स्ड-चॉइस डेटा में डीआईएफ का पता लगाने के लिए एक गुप्त-चर मॉडलिंग दृष्टिकोण की जांच की और आशाजनक परिणामों की सूचना दी। हालाँकि, उनका शोध उन स्थितियों पर केंद्रित था जहां डीआईएफ आइटम ज्ञात थे, जो व्यवहार में संभव नहीं है। उनके काम को आगे बढ़ाने के लिए, हमने डीआईएफ का पता लगाने पर, यानी डीआईएफ आइटमों को गैर-डीआईएफ एंकर के रूप में मानते हुए, थर्स्टनियन-आईआरटी मॉडल का उपयोग करके मॉडल मिसस्पेसिफिकेशन के प्रभाव का मूल्यांकन करने के लिए एक सिमुलेशन अध्ययन किया। हमने निम्नलिखित कारकों में हेरफेर किया: नमूना आकार, क्या डीआईएफ के लिए परीक्षण किए जा रहे समूहों का नमूना आकार समान या असमान था, लक्षणों की संख्या, डीआईएफ प्रभाव आकार, डीआईएफ वाले आइटम का प्रतिशत, विश्लेषण दृष्टिकोण, एंकर सेट आकार, और एंकर में डीआईएफ ब्लॉक का प्रतिशत। 336 अनुरूपित स्थितियों में, हमने पाया [परिणाम और चर्चा यहां संक्षेप में प्रस्तुत की गई है]।

साइकोमेट्रिक्स, डीआईएफ, आईआरटी

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

強制選択評価における DIF の検出: モデルの仕様ミスの影響を検証するシミュレーション研究

強制選択 (FC) アンケートでは、回答者は各項目にどの程度同意するかを示すのではなく、2 つ以上の項目をランク付けする必要があります。研究により、この形式により応答の偏りを軽減できることが実証されています。ただし、データは推測的なものであるため、項目スコアは個人間で比較できないものになります。アイテム反応理論の進歩により、FC 評価のスコアリングが可能になり、心理測定特性の評価も可能になりました。これらの方法論の発展により、応用教育、産業、心理学の現場で FC 評価の使用が増加しています。しかし、テストバイアスの評価に必要な差分項目機能（DIF）をテストするための信頼できる方法はまだ確立されていません。 2021 年に、Lee らは強制選択データ内の DIF を検出するための潜在変数モデリングアプローチを検討し、有望な結果を報告しました。ただし、彼らの研究は、DIF 項目が既知である条件に焦点を当てていましたが、実際にはそうではありません。彼らの研究に基づいて、サーストニアン-IRT モデルを使用して、DIF 検出に対するモデルの誤った仕様の影響、つまり DIF アイテムを非 DIF アンカーとして扱うことの影響を評価するシミュレーション研究を実行しました。次の要素を操作しました: サンプルサイズ、DIF についてテストされるグループのサンプルサイズが等しいか等しくないか、形質の数、DIF 効果サイズ、DIF を持つ項目の割合、分析アプローチ、アンカーセットサイズ、およびアンカー内の DIF ブロックの割合。 336 のシミュレーション条件にわたって、[結果と議論はここにまとめられています] が見つかりました。

心理測定、DIF、IRT

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Detectando DIF em avaliações de escolha forçada: um estudo de simulação que examina o efeito da especificação incorreta do modelo

Em um questionário de escolha forçada (FC), o respondente deve classificar dois ou mais itens em vez de indicar o quanto concorda com cada um deles. A pesquisa demonstra que este formato pode reduzir o viés de resposta. No entanto, os dados são ipsativos, resultando em pontuações de itens que não são comparáveis entre indivíduos. Os avanços na Teoria de Resposta ao Item tornaram possível a pontuação das avaliações da FC, bem como a avaliação de suas propriedades psicométricas. Esses desenvolvimentos metodológicos estimularam o aumento do uso de avaliações de FC em ambientes educacionais, industriais e psicológicos aplicados. No entanto, um método confiável para testar o funcionamento diferencial dos itens (DIF), necessário para avaliar o viés do teste, não foi estabelecido. Em 2021, Lee e colegas examinaram uma abordagem de modelagem de variável latente para detectar DIF em dados de escolha forçada e relataram resultados promissores. Contudo, a sua investigação centrou-se em condições em que os itens do DIF eram conhecidos, o que não é provável na prática. Para desenvolver seu trabalho, realizamos um estudo de simulação para avaliar o impacto da especificação incorreta do modelo, usando o modelo Thurstonian-IRT, na detecção de DIF, ou seja, tratando itens de DIF como âncoras não-DIF. Manipulamos os seguintes fatores: tamanho da amostra, se os grupos testados para DIF tinham tamanho de amostra igual ou desigual, o número de características, tamanho do efeito do DIF, a porcentagem de itens com DIF, a abordagem de análise, o tamanho do conjunto de âncoras e o porcentagem de blocos DIF na âncora. Em 336 condições simuladas, encontramos [Resultados e discussão resumidos aqui].

psicometria, DIF, IRT

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Обнаружение DIF в оценках с принудительным выбором: симуляционное исследование, изучающее влияние неправильной спецификации модели

В анкете принудительного выбора (FC) респондент должен ранжировать два или более пунктов вместо того, чтобы указывать, насколько он согласен с каждым из них. Исследования показывают, что этот формат может уменьшить предвзятость ответов. Однако данные являются ипсативными, в результате чего баллы по пунктам несопоставимы между отдельными людьми. Достижения в теории ответов на вопросы сделали возможным выставление оценок FC, а также оценку их психометрических свойств. Эти методологические разработки стимулировали более широкое использование оценок FC в прикладных образовательных, производственных и психологических условиях. Тем не менее, надежный метод тестирования дифференциального функционирования заданий (DIF), необходимый для оценки систематической ошибки теста, не разработан. В 2021 году Ли и его коллеги изучили подход к моделированию скрытых переменных для обнаружения DIF в данных принудительного выбора и сообщили об многообещающих результатах. Однако их исследование было сосредоточено на условиях, когда были известны элементы DIF, что на практике маловероятно. Основываясь на их работе, мы провели моделирование, чтобы оценить влияние неправильной спецификации модели, используя модель Терстона-IRT, на обнаружение DIF, т. е. рассматривая элементы DIF как привязки, не относящиеся к DIF. Мы манипулировали следующими факторами: размер выборки, независимо от того, имели ли группы, проверяемые на DIF, одинаковый или неравный размер выборки, количество признаков, размер эффекта DIF, процент элементов с DIF, метод анализа, размер набора привязок и процент блоков DIF в привязке. Мы обнаружили 336 смоделированных условий [результаты и обсуждения суммированы здесь].

психометрия, ДИФ, ИРТ

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

在强制选择评估中检测 DIF：一项检验模型错误指定影响的模拟研究

在强制选择 (FC) 问卷中，受访者必须对两个或多个项目进行排名，而不是表明他们对每个项目的同意程度。研究表明，这种格式可以减少响应偏差。然而，数据是自比的，导致项目得分在个体之间不具有可比性。项目反应理论的进步使得对 FC 评估进行评分以及评估其心理测量特性成为可能。这些方法的发展刺激了 FC 评估在应用教育、工业和心理环境中的使用。然而，尚未建立评估测试偏倚所需的测试差异项目功能（DIF）的可靠方法。 2021 年，Lee 和同事研究了一种用于检测强制选择数据中 DIF 的潜变量建模方法，并报告了有希望的结果。然而，他们的研究重点是 DIF 项目已知的情况，而这在实践中是不可能的。为了以他们的工作为基础，我们进行了一项模拟研究，使用 Thurstonian-IRT 模型评估模型错误指定对 DIF 检测的影响，即将 DIF 项目视为非 DIF 锚点。我们操纵了以下因素：样本大小、接受 DIF 测试的群体是否具有相等或不相等的样本大小、性状数量、DIF 效应大小、具有 DIF 的项目百分比、分析方法、锚定集大小以及锚点中 DIF 块的百分比。在 336 种模拟条件下，我们发现了[此处总结的结果和讨论]。

心理测量学、DIF、IRT

Submission: posted 06 September 2023
Recommendation: posted 14 February 2024, validated 14 February 2024

Cite this recommendation as:
Montoya, A. (2024) Developing differential item functioning (DIF) testing methods for use in forced-choice assessments. Peer Community in Registered Reports, . https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=554

Recommendation

Traditional Likert-type items are commonly used but can elicit response bias. An alternative approach, the forced-choice question, required respondents to rank order all items. Forced-choice questions boast some advantages but required advanced item response theory analysis to generate scores which are comparable across individuals and to evaluate the properties of those scales. However, there has been limited discussion of how to test differential item functioning (DIF) in these scales. In a previous study, Lee et al. (2021) proposed a method for testing DIF.

Here, Plantz et al. (2024) explore the implications of incorrect specification of anchors in DIF detection for forced choice items. The study proposes to use a Monte Carlo simulation which manipulates sample size, equality of sample size across groups, effect size, percentage of differentially functioning items, analysis approach, anchor set size, and percent of DIF blocks in the anchor set. This study aims to answer research questions about the type I error and power of DIF detection strategies under a variety of circumstances, both evaluating whether the results from Lee et al. (2021) generalize to misspecified models and expanding to evaluate new research questions. Results of this study will provide practical implications for DIF testing with forced-choice questions. An important limitation of the study is that it does not explore non-uniform DIF, only uniform DIF. Additionally, as with all simulation studies not all results can only apply to conditions which are simulated and so rely on the realistic selection of simulation conditions. The authors have selected conditions to match reality in circumstances where data is available, but relied on previous simulations in cases when data is not available.

This Stage 1 manuscript was evaluated over two rounds of review by two reviewers with expertise in psychometrics. An additional round of review was completed by the recommender only. Based on the merits of the original submission and responsiveness of the authors to requests from the reviewers, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol: https://osf.io/p8awx

Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.

List of eligible PCI RR-friendly journals:

References

1. Lee, P., Joo, S.-H. & Stark, S. (2021). Detecting DIF in multidimensional forced choice measures using the Thurstonian Item Response Theory Model. Organizational Research Methods, 24, 739–771. https://doi.org/10.1177/1094428120959822

2. Plantz, J. W., Brown, A., Wright, K. & Flake, J. K. (2024). Detecting DIF in Forced-Choice Assessments: A Simulation Study Examining the Effect of Model Misspecification. In principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/p8awx

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #2

DOI or URL of the report: https://osf.io/h6qtp?view_only=cccd50cce05a4dcea4df3a31fe963f2d

Version of the report: 2

Author's Reply, 29 Jan 2024

Download author's reply Download tracked changes file https://doi.org/10.24072/pci.rr.100554.ar2

Decision by Amanda Montoya, posted 29 Jan 2024, validated 29 Jan 2024

Thank you for completing your recent revision. Both reviewers indicated that your revisions addressed their concerns very well. There are a few remaining, small comments from Reviewer 1. I believe these can be addressed in a minor revision, and I do not expect to send this out again for peer review.

https://doi.org/10.24072/pci.rr.100554.d2

Reviewed by Timo Gnambs, 22 Dec 2023

The manuscript describes a planned simulation study investigating uniform differential item functioning (DIF) in Thurstonian item response models (TIRT). I have previously commented on the manuscript and am pleased to read the revised version. I would like to thank the authors for addressing my suggestions. Therefore, I have only a few remaining comments, mainly related to some minor clarifications.

1) In Figure 2, all latent trait factor means and variances (eta) have identification constraints. Therefore, the comment in the top box may be somewhat misleading because it refers to “one” mean and variance, although all means and variances are constrained. I was wondering whether the latent means in the first-order TIRT (Figure 3) also require zero constraints on the latent means of the latent traits.

2) On Page 13, it should correctly read mu_t1, mu_t2, and mu_t3 and not t1, t2, and t3.

3) It might be helpful to readers if the introduction (e.g., in the section “Present Study”) clarified that the focus of the simulation study is on uniform DIF and that non-uniform DIF is not addressed.

4) I did not understand how the authors derived the percentages for RQ4 on Page 20. If 10% of the items in the 5-trait condition with 20 blocks and 60 items exhibit DIF, then 60 / 10 = 6 DIF items will be simulated. Since each block contains only 1 DIF item (see Table 1), there are 6 DIF blocks. This is therefore 6 / 20 = 30% and not 40% as stated in the manuscript.

5) In my opinion, simulation studies need to justify their choice of simulation conditions in order to emphasize for which real world scenarios they might be representative of. Therefore, I would hope that the authors could provide some information on the applied settings in which the chosen values of, for example, sample sizes and DIF effects are typical. Simply stating that previous simulations have used these values is not very convincing (Page 20).

https://doi.org/10.24072/pci.rr.100554.rev21

Reviewed by anonymous reviewer 2, 08 Jan 2024

I have carefully examined the authors' responses to editor/reviewer comments as well as the revised manuscript. Overall, I appreciate the authors' attentiveness and responsiveness, and believe the authors have sufficiently addressed the comments and suggestions raised in the first round of reviews.

In my opinion, the revised manuscript is significantly stronger, and I look forward to the findings of the study. Therefore, I would like to express my support for the acceptance of the revised manuscript.

https://doi.org/10.24072/pci.rr.100554.rev22

Evaluation round #1

DOI or URL of the report: https://osf.io/rwqba?view_only=cccd50cce05a4dcea4df3a31fe963f2d

Version of the report: 1

Author's Reply, 10 Nov 2023

Download author's reply Download tracked changes file https://doi.org/10.24072/pci.rr.100554.ar1

Decision by Amanda Montoya, posted 19 Oct 2023, validated 20 Oct 2023

Thank you for your submission to PCI-RR! The submitted manuscript shows promise but needs additional revision prior to further consideration. We received comments from two very engaged reviewers, and I think their comments clearly identify some steps forward that should be considered. I want to emphasize a few of their comments and provide a few of my own which I do below.

Commentary on Reviewer Comments

I agree with Reviewer 1 that incorporating non-uniform DIF into the simulation would strengthen the contribution of the study. I understand this could be a large undertaking within the study, so it may not be feasible to extend the study in this dimension. Instead perhaps it would be worthwhile to comment on whether the researchers hypothesize the results to be similar or dissimilar for uniform vs. non-uniform DIF (i.e., whether you think it's safe to generalize the results to non-uniform DIF, or whether further studies may be needed to explore this phenomenon in non-uniform DIF contexts [or mixed contexts]).

Similarly, I found the Reviewer 1's comment on sample size (equal vs. unequal) to be quite interesting. It is common to test DIF across demographic characteristics which are often unbalanced. This may have some effect on the results of the study and should potentially be incorporated or discussed. Note that Reviewer 2 suggested testing multiple sample sizes or to cite literature which suggests how sample size might impact the outcome.

I agree with Reviewer 2 that the term ipsative could be more clearly defined.

Editor Comments:

A couple things to consider because this is a registered report: Review carefully the introduction and methods as these sections will not be able to be changed after Stage 1 IPA (other than light editing). Many parts of the manuscript are written in future tense, but other parts are in past tense. I recommend revising everything to past tense because you’ll need to make these edits for Stage 2 anyway. For your simulation you should consider some positive checks that you might conduct prior to doing your analysis, that would help evaluate that the simulation has been conducted correctly. These could be any evaluation of factors that might suggest failures of the data generation or data analysis process. In simulations this might include things like, reporting rates of convergence, checking parameter bias for conditions where bias should be zero, etc.

Appendix 1. Design Table.

Please add an analysis plan column to the study design table describing how the research team will determine whether each hypothesis is supported. This could involve some test statistic or effect size, a visualization with a specific pattern, a table with a specific pattern. However, it should be clearly unambiguous how to interpret the result of the test, and not based on the judgement of the research.

In general, the interpretations column does not sufficiently account for all possible outcomes of the research results. Especially when there are multiple outcome variables and results could be mixed across these outcomes.

RQ1 should be broken into two hypotheses, as there are two outcomes. The interpretation should acknowledge the potential for contradictory results across outcomes and how such a result would be interpreted. For example, what if TIE is lower and power is also lower?

RQ2 it’s unclear how Type I Error rates will be compared. The threshold is set to .01, but is this of the estimates from each condition, or based on some kind of uncertainty estimate. In addition, it’s not clear if the second part of the interpretation is specific only to Type I Error or also power. These could be letters (a, b, or ab) to indicate which claims would be made based on the outcomes of which tests.

RQ4 This section I found very confusing. It seems that in this case a Type I Error would be when DIF is not present, but it is detected. But the interpretation suggests that if Type I Error is constant or decreases this suggests that DIF can be correctly identified, which sounds more like power than Type I error. Similar issues arise with the other interpretations, so perhaps clearly defining what is meant by a type I error in this case would be helpful.

RQ6. Please clarify what is an acceptable type I error rate and power. I found the second section of the interpretation difficult to understand.

Introduction

Equation 1 is not fully defined because the value of y is not defined if y* < 0

In Equation 2/3 is there an assumption that e_j and e_k are independent, or can they be dependent?

In general I found it difficult to follow the language of "first order" and "second order" models. This likely stems from my lack of background in this area, but I think it could be more clearly explained for interested yet less embedded audience members.

There is some discussion of the modeling distinction of DIF for an item vs. a pairwise comparison, but it was not really clear what the practical implications of this would be. Perhaps providing an example of when one vs. the other would be expected?

I found the Model Identification section to be quite difficult to read. Perhaps incorporating the notation which is introduced in the prior pages to clarify what is meant in this section would be helpful. A reviewer also suggested incorporating some of this information in a figure, which is an idea I support.

While there are obviously a lot of abbreviations used throughout the paper, I think they could be avoided in section headers to make the contents more clear to someone who might be skimming the paper.

While DIF testing by groups is somewhat of a norm, DIF can be defined and tested by continuous variables. For example, using the MNLFA approach by Bauer. It is a limitation of the study that the exploration of DIF is only by groups, and this should be acknowledged. The definition of DIF and discussion of DIF seems to imply that it only applies to groups, which is not quite right. So differentiating the kind of DIF testing this paper is doing vs. what a broader definition of DIF is.

I found the paragraph at the bottom of page 8 quite difficult to understand. I was confused because I thought the utility means had been fixed as part of the identification section, but maybe I’m not understanding which means are fixed. Similarly, the explanation of nonuniform DIF sounds backwards. Perhaps part of this is that I’m unfamiliar with using the term preferability. If preferability is the same as discrimination, would not differences on preferability correspond to uniform (rather that nonuniform DIF)? I do really struggle with the term preferability, because it seems reverse to difficulty. For example, options that many people endorse I would describe as highly preferable, but that would be low in difficulty. Again, this might be my out-of-field experience showing, but I didn’t find the language intuitive, and a cursory search did not turn up any papers using this term.

In general, I found it difficult to determine whether the IRT Models for DIF section was describing the performance of the models based on prior research, or rather whether the claims were speculative. There are very few citations in this section, suggesting perhaps there is little research in this area. But the claims seem somewhat strong. It should be more clear if these claims are merely based on the researcher’s hypotheses, or whether they are founded in prior research. Additionally, as a smaller note I found myself wondering if there are any advantages of a constrained-baseline approach. Only disadvantages were mentioned.

I also found the end of the first paragraph of page 11 to be a bit confusing. I don’t seem to understand why there is a test for the difference between the means of t1 and t2 in each group, as opposed to a test of the difference between the t1 means across groups and the difference between the t2 means across groups. This is what I was expecting but what not what was described. Perhaps this reflects a larger issue with the explanation of the setup of the model or perhaps could require some additional explanation. In the same section, I found myself feeling unclear about how exactly using the blocks accounts for multidimensionality, so perhaps this could be explained more.

Within the area of psychology that I work in, forced choice and rank order questions are not terribly popular. While I think this paper is still valuable I felt there could be more of an accounting of where these types of items are commonly used, or perhaps even more a pitch for using these types of items in contexts where they are not currently used. Perhaps prior research on these response types has identified important pockets of research communities that specifically use this kind of item and that could be described a little more in-depth.

I’m curious if the authors have any kind of hypothesis about why the free-baseline approach performed so poorly with large sample sizes in Lee et al (2021). This seems particularly relevant to the current study, as the poor performance was at the sample size selected for the current study. Was that done on purpose? Is there a rationale for why such a method would get worse with large sample sizes?

The Present Study

In general, I found the research questions to be a little difficult to understand. The wording of RQ2 is a bit awkward, I think because the question connection between the beginning and the end is lost given the length of the middle clause which summarized the result from Lee et al. 2021). In RQ3 it’s not clear what the question is asking, “more accurate” than what? RQ 5 and 6 are specific to the free-baseline approach, though it’s not clear why (as compared to the other methods). It seems like it’s likely a foregone conclusion that getting anchors wrong will harm the accuracy of the statistical models. Which I think limits how interesting RQ6 is. It seems like the underlying question, when reading the paper, might be more about whether certain methods are more robust to misspecification of anchor items than others. But that is not really how the question is framed. The latter RQs could be more specific that they are looking only at the second-order models and not the first order models.

Methods

There are still some elements of the methods which do not seem particularly clear. Specifically it’s not clear what is meant by saying that “trait correlations will be mixed across all conditions.” I initially thought this meant that trait correlations were in some way permuted (mixed) but I think what is meant is that the sign of the correlation is sometimes positive sometimes negative. The term mixed is used for so many things, I’m not sure it’s an apt descriptor on its own. Similarly, I found the information about the correlation matrix somewhat contradictory, where in one place it says that it will be matched exactly, and in other places it says it will be randomly generated.

It’s unclear from the description of the study design whether the analysis model is a within-factor (i.e., each method is used on each dataset that is analyzed) as compared to a between-factor (data generated is unique to each analysis model). My understanding that the within- method is frequently used to reduce computation time, but either would be acceptable. It should just be clear in the methods.

I do not understand Step 4 of the data generation. Measurement errors should be unique to each person, but the equation given is fixed based on lamba (a parameter that does not vary by person), so it’s not clear if this is meant to reflect the variance of the measurement errors. But as written I’m not sure this describes a process that would generate measurement errors.

Analysis Plan

Using \beta as power us somewhat unintuitive as \beta typically denotes type II errors rates, and so power is typically 1-\beta.

https://doi.org/10.24072/pci.rr.100554.d1

Reviewed by anonymous reviewer 1, 17 Oct 2023

The manuscript describes a planned simulation study investigating uniform differential item functioning (DIF) in Thurstonian item response models (T-IRT). Designed as a replication and extension of Lee et al. (2021), the authors describe several simulation conditions to evaluate the effects of, among others, anchor set size and model misspecification on DIF detection. The manuscript is well written and addresses an important topic for psychological assessments with forced-choice items. Also, the planned simulation study seems reasonable to me and likely will allow to answer the research questions. Therefore, I have a few comments that might help the authors improve their work.

1) The abstract would be informative if it emphasized that DIF was evaluated for Thurstonian IRT models that overcome the limitations of traditional scoring schemes which result in ipsative scores.

2) The figure presenting the second-order T-IRT is Figure 2 (and not Figure 1). Moreover, I believe the observed scores should be denoted by “y” and not “y*” to correspond to Equation 1. The figure would be more informative if the mean structure would be included as well. It could also be helpful to include the identification constraints described in the text, for example, the fact that 1 uniqueness and 1 utility mean per block are fixed to 1 and 0, respectively.

3) Because the authors plan to use a multiple-constraint Wald test to identify DIF, it might be informative to formally describe the respective test. I was also wondering whether the authors plan to correct the p-values for multiple testing.

4) Although the authors describe the necessary identification constraints for the single-group T-IRT, it might be informative to also describe the respective constraints in the multi-group context. I assume in the multi-group models some latent factor means and variances will be freely estimated to acknowledge between-group differences.

5) The authors plan to use Mplus for their simulation. I was wondering whether the model might also be estimated using an R package (e.g., lavaan). I am not suggesting that this would be preferable in the present situation. I was just curious whether Mplus is currently the only software to estimate these types of models.

6) The authors should adopt a consistent terminology. Currently, “Thurstonian-IRT”, “T-IRT”, and “TIRT” are used interchangeably.

7) A serious limitation of the planned study is the exclusive focus on uniform DIF. The simulation could be substantially strengthened if non-uniform DIF was evaluated as well (similar to Lee et al., 2021).

8) The authors plan to simulate two different sizes of DIF that were chosen based on previous simulation studies (see Page 17). I was wondering whether the chosen values are also representative of applied settings. Are these effect sizes common, for example, in educational assessments or other contexts?

9) The description of the simulation design does not inform about the direction of DIF. Thus, will DIF always favor one specific group or will DIF for different items favor different groups?

10) The simulation design does not specify the sample sizes of the two groups. Are the authors planning to simulate equal sample sizes in the two groups?

11) On Page 17, the authors describe how they plan to simulate DIF in the blocks. Because each block will only include 1 item with DIF (see Table 1), it might be helpful if the description emphasized that the authors plan to simulate 10% to 20% of blocks with DIF. Currently, the description focuses on items with DIF which is certainly correct but probably not what the authors want to emphasize. Moreover, the simulation conditions are inconsistently described. On Page 17, the authors report that they plan to simulate 10%, 15%, and 20% of items with DIF. In contrast, Table 2 gives values of 40%, 50%, and 60%.

12) It was unclear to me how the percentage of DIF blocks (RQ4) and misspecification (RQ6) will be implemented in the simulation. If for example, 10% of DIF blocks are simulated (RQ4) will part of these DIF blocks be included in the anchor block set (RQ6) or will additional DIF blocks be simulated specifically for the anchor block set?

https://doi.org/10.24072/pci.rr.100554.rev11

Reviewed by anonymous reviewer 2, 14 Oct 2023

The current manuscript involves a simulation study that evaluates the impact of model misspecification on DIF detection. The goal of the simulation is to extend the work of Lee et al. (2021) by evaluating the performance of the second-order T-IRT model in detecting DIF under more realistic conditions (e.g., when anchors are unknown).

I thought the authors did a nice job of laying out the motivation for the paper, providing enough background, and highlighting the current gap in the literature in a compelling way. My primary concern is with where the focus is placed in regard to the interpretation of simulation results. In my opinion, the current interpretations are a bit black-and-white, and I think shifting the focus to more nuanced aspects of the results would strengthen the authors’ arguments as well as offer more prescriptive guidance to applied researchers. Below, I have included more detailed comments for refocusing interpretations along with more minor comments. The comments below are ordered based on my sense of their importance.

1. In my opinion, RQ6 (p. 31) is the most important and most interesting research question. However, Hypothesis B and the first interpretation are hard to follow (for me, anyway). First, Hypothesis B is referring to the size of the anchor set, which seems to be more in line with RQ5 (p.30). Second, for the first interpretation (if there is support for both hypotheses), the second sentence (“However, even in case of complete misspecification…”) seems to contradict the first sentence (“If we find support for both…”).

Given the importance of RQ6, I think it would be worth giving more consideration to the potential nuances that may emerge from the simulation results and dedicate an appropriate amount of space in the discussion to the interpretation of the results regarding this research question. Currently, the interpretations seem a bit limited. There is one for if both hypotheses are supported, one for if Hypothesis A is supported, and one for if Hypothesis B is supported. For this 72-cell design, there may be some interaction effects worth mentioning.

In regard to one of the potential interpretations for RQ6, the authors state that reducing model misspecification is optimal. I find this interpretation, in and of itself, to be quite uninteresting and one that most methodologists would already agree with. In line with my comments above, I think it would be more compelling to shift the focus of the interpretation to expected interactions or highlighting which specific conditions researchers should be particularly vigilant about when testing for DIF.

2. In the authors’ proposed interpretations for RQ1 (p. 29), they state that if a smaller number of traits results in improved DIF detection, this would support limiting the number of traits measured by an FC assessment. I would avoid making recommendations of this nature that encourage researchers to add or remove traits from an assessment with the goal of improving DIF detection. I think most psychometricians would agree that adding unnecessary traits or removing necessary traits would negatively impact the quality of a measure. For example, it is known that increasing the length of a test will generally improve reliability (i.e., internal consistency). However, it is not recommended to add items to a test with the goal of (artificially) improving reliability. To avoid spuriously inflating reliability, it is good practice to consult content experts or reference substantive theory, for example. I think the same logic applies to the relationship of trait size and the accuracy of DIF detection. Regardless of the simulation results, “assessments [should] be designed more intentionally only to measure what is needed rather than adding additional factors to examine DIF accurately.” This is something the authors state as a potential interpretation on p. 29, but again, I think this should always be the case. If the authors do happen to find an effect of trait size, I suggest presenting it as a consideration researchers should make when evaluating DIF as opposed to it being a motivation for altering trait size.

3. Hypothesis A and the first interpretation for RQ5 (p. 30) seem to be something that most psychometricians would already agree upon. In cases where the anchor is known to be pure (simulation studies), I think it is safe to say that a larger anchor set will always be better. Given that this set of items is used to estimate the means and SD differences between potential DIF groups, having more DIF-free items should result in better DIF detection. This has been shown to be the case for more traditional IRT models even when an anchor is not pure (e.g., Kopf et al., 2015). Given this, and similar to my comments for RQ6, I think the interpretation for the results of the research question should be refocused (e.g., potential interactions, conditions that are particularly impactful for DIF detection).

4. In the first interpretation for RQ2 (p. 29), the authors state that improved DIF detection when the DIF effect size is larger would support the use of the latent scoring approach. Although I agree that this provides further evidence for the viability of the approach, I think the study would benefit from including a comparison with at least one other model/approach. As the authors mention in the second interpretation for RQ2, findings in the opposite direction would suggest the need for a different method. Incorporating this in the current study and having results that show how the method the authors currently propose performs in relation to another DIF method would provide a useful frame of reference.

Minor comments:

5. Sample size is constant in the simulation. Although the authors provide a reasonable justification for this, I think readers will be left to wonder how much sample size could have affected DIF detection. For example, what if doubling the sample size to 2000 counteracts including DIF items in the anchor? If the authors choose not to add another sample size condition to the simulation, I would suggest providing evidence from previous work on the effect of sample size on DIF detection and/or mentioning sample size being kept constant in the limitations.

6. Although Figure 1 and the context that surrounds the use of the term “ipsative” are helpful, it might be worth providing a more explicit definition of “ipsative.” This would be especially helpful if the target audience is not likely to be familiar with FC assessments and how they are analyzed.

7. I think the path diagram labeled “Figure 1” is supposed to be labeled “Figure 2.” In addition to the figure title, there is at least one reference to this path diagram in the introduction that will need to be revised accordingly.

8. The path diagram labeled “Figure 1,” which may actually be “Figure 2,” has no caption. Including a detailed caption that allows the figure to stand on its own may be very helpful for readers, especially because it will allow readers to understand the figure without having to go back to the text (some readers may not understand the path diagram without some context).

9. On p. 5, third line from the bottom, the authors refer to “equations 2/3” in a comparison between the unit of analysis for the first-order model and the second-order model. Are Equations 2 and 3 the equations the authors meant to reference?

References

Kopf, J., Zeileis, A., & Strobl, C. (2015). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2), 83-103. https://doi.org/10.1177/0146621614544195

https://doi.org/10.24072/pci.rr.100554.rev12

User comments

No user comments yet

or Register
Submit a report