Recommendation

Bug detection in software engineering: which incentives work best?

Chris Chambers based on reviews by Edson OliveiraJr and 1 anonymous reviewer

A recommendation of:

STAGE 1

Registered Report: A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation

Jacob Krüger, Gül Çalıklı, Dmitri Bershadskyy, Robert Heyer, Sarah Zabel, Siegmar Otto https://doi.org/10.48550/arXiv.2202.10985 version v3

Abstract

Empirical studies in software engineering are often conducted with open-source developers or in industrial collaborations. Seemingly, this resulted in few experiments using financial incentives (e.g., money, vouchers) as a strategy to motivate the participants’ behavior; which is typically done in other research communities, such as economics or psychology. Even the current version of the SIGSOFT Empirical Standards does mention payouts for completing surveys only, but not for mimicking the real-world or motivating realistic behavior during experiments. So, there is a lack of understanding regarding whether financial incentives can or cannot be useful for software-engineering experimentation. To tackle this problem, we plan a survey based on which we will conduct a controlled laboratory experiment. Precisely, we will use the survey to elicit incentivization schemes we will employ as (up to) four payoff functions (i.e., mappings of choices or performance in an experiment to a monetary payment) during a code-review task in the experiment: (1) a scheme that employees prefer, (2) a scheme that is actually employed, (3) a scheme that is performance-independent, and (4) a scheme that mimics an open-source scenario. Using a between-subject design, we aim to explore how the different schemes impact the participants’ performance. Our contributions help understand the impact of financial incentives on developers in experiments as well as real-world scenarios, guiding researchers in designing experiments and organizations in compensating developers.

Software Engineering, Financial Incentives

Submission: posted 23 February 2022
Recommendation: posted 15 July 2022, validated 15 July 2022

Cite this recommendation as:
Chambers, C. (2022) Bug detection in software engineering: which incentives work best? Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=186

Related stage 2 preprints:

A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation
Dmitri Bershadskyy, Jacob Krüger, Gül Çalıklı, Siegmar Otto, Sarah Zabel, Jannik Greif, Robert Heyer
https://arxiv.org/pdf/2202.10985

Recommendation

Bug detection is central to software engineering, but what motivates programmers to perform at their best? Despite a long history of economic experiments on incentivisation, there is surprisingly little research on how different incentives shape software engineering performance. In the current study, Krüger et al. (2022) propose an experiment to evaluate how the pay-off functions associated with different financial incentives influence the performance of participants in identifying bugs during code review. The authors hypothesise that performance-based incentivisation will result in higher average performance, as defined using the F1-score, and that different incentivisation schemes may also differ in their effectiveness. As well as testing confirmatory predictions, the authors will explore a range of ancillary strands, including how the different incentivisation conditions influence search and evaluation behaviour (using eye-tracking), and the extent to which any effects are moderated by demographic factors.

The Stage 1 manuscript was evaluated over one round of in-depth review. Based on detailed responses to the recommender and reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol: https://osf.io/s36c2

Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.

List of eligible PCI RR-friendly journals:

References

1. Krüger, J., Çalıklı, G., Bershadskyy, D., Heyer, R., Zabel, S. & Otto, S. (2022). Registered Report: A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation, in principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/s36c2

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #1

DOI or URL of the report: https://doi.org/10.48550/arXiv.2202.10985

Author's Reply, 08 Jul 2022

https://doi.org/10.24072/pci.rr.100186.ar1

Decision by Chris Chambers, posted 21 May 2022

Thank you for your patience as we have gathered evaluations of your submission. I have now obtained two expert reviews and I have also read your manuscript myself. As you will see, the reviews are broadly positive about your proposal, while also noting several areas that would benefit from improvement, particularly concerning the level of methodological detail, threats to validity, and specifics of the analysis plans. I have also read your manuscript carefully with a view toward ensuring that it meets the Stage 1 criteria.

Overall, I believe your proposal is a promising candidate for an RR, and should be suitable for Stage 1 in-principle acceptance once a variety of issues are addressed. I am therefore happy to invite a major revision. In revising, please include a point-by-point response to each comment of the reviewers (making clear what has changed in the manuscript, or providing a suitable rebuttal), as well as to my own points below (Recommender comments). When resubmitting please also include a tracked-changes version of the manuscript.

Recommender comments

1. Justification of N=30 for the survey study

Given the importance of the survey study for choosing the input parameters for the MAIT and MPIT interventions, the precision of these estimates in the survey study seems crucial. At the moment, the sample size justification for this part of the design is defined too imprecisely and arbitrarily. Instead, please provide a formal sampling plan based on the required level of precision (for guidance see the section “Planning for Precision” in https://psyarxiv.com/9d3yf/). This could be achieved analytically or through simulations.
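As a rough illustration of what a precision-based sampling plan could look like, the sketch below computes how many respondents are needed for the confidence interval around a mean to reach a target half-width. The standard deviation, the target half-width, and the 0-100 scale are hypothetical placeholders rather than values from the manuscript, and the normal approximation is an assumption.

```python
import math
from statistics import NormalDist


def n_for_precision(sd: float, half_width: float, confidence: float = 0.95) -> int:
    """Smallest n for which the normal-approximation confidence interval
    of a mean has a half-width of at most `half_width`, given an assumed SD."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / half_width) ** 2)


# Hypothetical inputs: a weight rated on a 0-100 scale with an assumed SD of 20,
# to be estimated to within +/- 5 points at 95% confidence.
required_n = n_for_precision(sd=20.0, half_width=5.0)
print(required_n)
```

A simulation-based plan would follow the same logic but draw repeated samples from an assumed response distribution instead of relying on the closed-form approximation.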

2. Sub-samples within the survey study

On p3 you note: “In addition, we will distribute a second version (to distinguish both populations) of our survey through our social media networks.” How will this be taken into account in generating the parameter estimates? Will the different samples be distinguished or collapsed to produce the payoff functions?

3. Clarification of the statistical sampling plans for the experiment.

There are two issues to address in relation to the sampling plans.

First, you plan on recruiting 20 participants per group but also reserve the option to collect additional participants. In order to control the Type I error rate, standard power analysis requires a fixed stopping rule, which in turn requires committing to a specific sample size. If you want to employ a flexible stopping rule then you will need to implement a sequential design that involves regular inspection of the data between the minimum and maximum N, with the error rate corrected en route (see https://onlinelibrary.wiley.com/doi/abs/10.1002/ejsp.2023).

Second, at present the only reference to statistical power is in the design table: “Furthermore, we will conduct an a posteriori power analysis to reason on the power of our tests.” Power is a pre-experimental concept, and post hoc power analysis (or “observed power”) is inferentially meaningless because it simply reflects the outcomes. A formal prospective power analysis is required, to either define the sample size required to detect a smallest effect size of interest (a priori power analysis), or to define the smallest effect that can be detected given a maximum resource limit (so-called sensitivity power analysis). At present, given N=20 per group, and a strictest Holm-Bonferroni corrected alpha of .0083 for the lowest ranked p-value (assuming you apply the H-B correction for 6 tests across both hypotheses), your design has 90% power to detect d = 1.3. Any d > 1 (i.e., 1 standard deviation) is in the conventionally-defined “large” range. Unless you would be happy to miss an effect smaller than d=1.3, the sample size needs to be substantially increased. To progress I would suggest the following: (1) try to establish what the smallest effect size of interest is for H1 and H2, either based on theory, or the smallest practical benefit of your intervention in an applied setting, or based on prior software-engineering experiments; then justify the rationale for this smallest effect of interest in the manuscript. (2) if you have no upper resource limit on sample size then perform an a priori power analysis to determine the sample size necessary to correctly reject H0 for this effect size with no less than 90% power. If you do have an upper resource limit on sample size (which is very reasonable) then instead perform a sensitivity power analysis (see section 3.1.2 here if using G*Power) to determine what effect size you have 90% power to reject at your maximum feasible sample size, and then justify why this effect size is sufficiently small for your experiment to provide a sufficiently sensitive test of your hypotheses (H1 and H2).
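To make the sensitivity calculation concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation, so it yields a detectable effect of roughly d = 1.2, slightly below the t-based figure of d = 1.3 quoted above. The function names are illustrative, and a dedicated tool such as G*Power should be preferred for the actual sampling plan.

```python
import math
from statistics import NormalDist


def holm_strictest_alpha(alpha: float, n_tests: int) -> float:
    """Alpha applied to the smallest p-value under Holm-Bonferroni."""
    return alpha / n_tests


def detectable_d(n_per_group: int, alpha: float, power: float) -> float:
    """Approximate smallest Cohen's d detectable in a two-sided,
    two-sample comparison with equal group sizes (normal approximation)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * math.sqrt(2 / n_per_group)


def n_per_group_for(d: float, alpha: float, power: float) -> int:
    """Approximate per-group sample size needed to detect effect size d."""
    z = NormalDist()
    return math.ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


alpha_strict = holm_strictest_alpha(0.05, 6)      # six tests across H1 and H2
d_min = detectable_d(20, alpha_strict, 0.90)      # sensitivity at N = 20 per group
n_medium = n_per_group_for(0.5, alpha_strict, 0.90)  # d = 0.5 is a hypothetical target
print(round(alpha_strict, 4), round(d_min, 2), n_medium)
```

The last call shows how quickly the required sample grows once the smallest effect size of interest drops below the "large" range.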

4. Clarification of which specific outcomes will confirm or disconfirm the hypotheses.

For H1: In the design table you state: “We find support for H1, if our participants’ performance in NPIT is worse AND if the tests between any of our experimental treatments are significant with p < 0.05 (after correcting with the Holm-Bonferroni method).” Does “worse” here refer to each of the pairwise comparisons, or does it mean that NPIT must be numerically worse than all of (or the average of) the other conditions (OSIT, MAIT and MPIT)? If I understand correctly, the second part of your specification means that H1 is supported if any of the following contrasts is statistically significant: (NPIT < MPIT) OR (NPIT < MAIT) OR (NPIT < OSIT). If so, I suggest making this crystal clear by adding italics and including these in the “interpretation” cell of the table: “We find support for H1 if our participants’ performance in NPIT is significantly lower than in any one of our experimental treatments at p<.05 (after correcting with the Holm-Bonferroni method): (NPIT < MPIT) OR (NPIT < MAIT) OR (NPIT < OSIT)”.

For H2: If I understand correctly, any significant difference in any direction between OSIT, MAIT and MPIT would be considered support for H2. So H2 is supported if: (MPIT < > MAIT) OR (MAIT < > OSIT) OR (OSIT < > MPIT). If so, please make this clear in the interpretation cell of the design table.
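The decision rules for H1 and H2 described above can be sketched as follows. This is an illustration only: the p-values are invented, the contrast labels are shorthand, and it assumes that all six contrasts are corrected together as one family. The direction of each H1 contrast (NPIT being lower) would still need to be checked separately from the p-value.

```python
def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Return, for each named contrast, whether its null hypothesis is
    rejected under the Holm-Bonferroni step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return rejected


# Hypothetical p-values, for illustration only.
h1_contrasts = {"NPIT<MPIT": 0.004, "NPIT<MAIT": 0.030, "NPIT<OSIT": 0.200}
h2_contrasts = {"MPIT<>MAIT": 0.011, "MAIT<>OSIT": 0.600, "OSIT<>MPIT": 0.045}

rejected = holm_bonferroni({**h1_contrasts, **h2_contrasts})

# A hypothesis counts as supported if ANY of its contrasts survives correction.
h1_supported = any(rejected[name] for name in h1_contrasts)
h2_supported = any(rejected[name] for name in h2_contrasts)
print(h1_supported, h2_supported)
```

In this made-up example only the smallest p-value clears its threshold (.05/6), so H1 would be supported and H2 would not, even though several uncorrected p-values fall below .05.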

5. Definition of the F1-score.

Please provide a precise explanation and definition of the F1-score (including a worked example of how it is calculated), and make clear that it is the only outcome measure that will be used to evaluate H1 and H2.
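For reference, a worked example under the standard definition of the F1-score (the harmonic mean of precision and recall) might look like the sketch below. How the manuscript maps a participant's review comments onto true positives, false positives, and false negatives is not specified here, so the counts are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; defined as 0 when no bug is found."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical review: the code contains 5 bugs; the participant flags 4 locations,
# of which 3 are real bugs (TP = 3), 1 is not a bug (FP = 1), and 2 bugs are missed (FN = 2).
score = f1_score(true_positives=3, false_positives=1, false_negatives=2)
print(round(score, 3))  # precision 0.75, recall 0.60
```

Here precision is 3/4 and recall is 3/5, giving F1 = 2 × 0.75 × 0.60 / 1.35 = 2/3.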

6. Clarification of exclusion criteria.

On p7: “We do not plan to remove any outliers or data unless we identify a specific reason for which we believe the data would be invalid.” For a Registered Report, the precise rules for excluding data must be exhaustively specified, both within participants and also at the level of participants themselves. Where participants are excluded, make clear that they will be replaced to ensure that the target sample size is reached.

7. Eye-tracking acquisition and analyses

Please provide additional details on preprocessing (e.g. filtering, smoothing) of eye-tracking data to ensure that the procedures are fully reproducible. Presumably eye-tracking analyses are reserved for exploratory analyses (with no prospective hypotheses) and will therefore be reported in the “Exploratory outcomes” section of the Results at Stage 2. If so, please note this explicitly in the revised manuscript. Alternatively, if you have specific hypotheses for the effect of incentivization on the eye-tracking measures, ensure that they are fully elaborated in the main text and study design table.

8. Robustness analyses

On p7 you state: “Though the share of participants who will use eye trackers will be constant among all treatments, and thus should not affect treatment effects, we will further check whether the presence of eye trackers affected performance. To increase the statistical robustness, we will also conduct a regression analysis using the treatments as categorical variables and NPIT as base. As exogenous variables, we include: age, gender, experience, and arousal of the participants.” Make clear that these are exploratory analyses.

9. Other points

p7: "We will first check whether the assumptions required for parametric tests are fulfilled, and if not proceed with non-parametric tests." Make clear which assumptions (e.g. normality) you are going to test for, and how, and then specify the alternative tests that will be used (e.g. presumably the Mann-Whitney U test?).

p7: "For the significance analyses, we will apply a confidence interval of p < 0.05 and correct for multiple hypotheses testing using the Holm-Bonferroni method." Do you mean "alpha level of .05" instead of "confidence interval of p < 0.05"?
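If the non-parametric fallback is indeed the Mann-Whitney U test, the statistic itself is simple to state. The sketch below computes U by direct pair counting on invented F1-scores. It does not compute a p-value, which would come from the exact distribution of U or a normal approximation, typically via a statistics package.

```python
def mann_whitney_u(sample_a: list[float], sample_b: list[float]) -> float:
    """U statistic for sample_a: the number of (a, b) pairs with a > b,
    counting each tie as one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical F1-scores for two treatments, for illustration only.
npit_scores = [0.40, 0.50, 0.55]
mpit_scores = [0.50, 0.70, 0.80]

u_mpit = mann_whitney_u(mpit_scores, npit_scores)
u_npit = mann_whitney_u(npit_scores, mpit_scores)
print(u_mpit, u_npit)  # the two U values always sum to len(a) * len(b)
```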

https://doi.org/10.24072/pci.rr.100186.d1

Reviewed by anonymous reviewer 1, 07 Apr 2022

The authors have identified a significant gap in the literature. Current studies, in general, do not consider the impact of financial incentives on behaviour and performance when developing software.

It should be explicitly stated how they plan to mitigate the threat to validity of having colleagues perform code reviews.

In general, the report is very well written. One thing I would change is “Seemingly, this resulted” to “This has resulted” in the abstract though.

Given that the experiment will be conducted in a controlled laboratory setting, the authors should state what threats this could present in terms of the results being transferred to industry and how such threats could be mitigated. Cost functions are discussed solely in terms of motivating participants. The authors could add a discussion of the different types of organisational objective functions that may be at play in an industrial setting, such as the organisational culture, the degree to which code quality is important to the software being developed, and the extent to which the organisation would be willing to compensate employees in this manner.

The authors should state the sample size or the number of people that will partake in the study to justify the potential statistical results.

https://doi.org/10.24072/pci.rr.100186.rev11

Reviewed by Edson OliveiraJr, 20 May 2022

This RR presents a two-part study on how financial incentivization might impact code review. The first study is a survey with practitioners in which researchers will observe the most applied and the preferred payoff methods. From this survey, they will define a set of such methods (four a priori) to conduct a controlled experiment with students and, potentially, practitioners. In such an experiment, the researchers will analyze how different payoff schemes impact the performance of software developers during code review.

This is a relevant research topic. The protocols of the studies are generally well-designed and explained. However, I have some points to discuss towards improving such studies:

* Survey:

- A major concern is open-source vs. non-open-source projects. The literature clearly emphasizes that open-source projects have different motivations and activities from general software engineering projects. I didn't see any discussion of these potential threats in the survey protocol. How can you account for such threats, given that you will derive a set of payoff methods for the controlled experiment, which will be performed with students and, potentially, practitioners (non-open projects)?

- Why exactly do you expect at least 30 participants? Is this because of the Central Limit Theorem? If so, please make this explicit.

- What is the minimum experience time expected for the participants' profiles?

- I'd suggest running the instrument evaluation tests with practitioners rather than students, as students are not the target audience of the survey.

- As you will use the mean value for the weights, how will outliers or extreme values be treated?

- I'd suggest providing complete feedback on the results to participants at the end of the study, as a way to motivate them to take other surveys.

- It is not clear to me whether the participants may choose more than one payoff method in the survey questionnaire. If so, 30 participants seems OK; otherwise, the sample size should be considerably larger.

* Lab Experiment

- I'd like to see clearly the declaration of independent and dependent variables in the "Metrics" section. This is essential for readers to understand the chosen Experimental Design.

- In the "Experimental Design" section, please state the chosen design in terms of factors and treatments, for example, 2x2, 1xn, etc.

- In the "Inferential Statistics" section, I'd suggest reporting an effect-size measure to convey the strength of the results beyond the p-values of the null-hypothesis tests.
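One common effect-size measure for a two-group comparison is Cohen's d with a pooled standard deviation. The sketch below is only an illustration of that measure on invented F1-scores; the review does not say which effect size the authors should report, so the choice of Cohen's d is an assumption.

```python
import math
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical F1-scores for two treatments, for illustration only.
mpit_scores = [0.6, 0.7, 0.8]
npit_scores = [0.4, 0.5, 0.6]

d = cohens_d(mpit_scores, npit_scores)
print(round(d, 2))
```

With these made-up values the means differ by 0.2 and the pooled standard deviation is 0.1, so d is 2.0. For a non-parametric analysis, a rank-based effect size would be the more natural companion.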

All in all, the RR is well-written and easy to follow. Congrats on it and success on the studies.

https://doi.org/10.24072/pci.rr.100186.rev12