Recommendation

Bug detection is central to software engineering, but what motivates programmers to perform as optimally as possible? Despite a long history of economic experiments on incentivisation, there is surprisingly little research on how different incentives shape software engineering performance. In the current study, Krüger et al. (2022) propose an experiment to evaluate how the pay-off functions associated with different financial incentives influence the performance of participants in identifying bugs during code review. The authors hypothesise that performance-based incentivisation will result in higher average performance, as defined using the F1-score, and that different incentivisation schemes may also differ in their effectiveness. As well as testing confirmatory predictions, the authors will explore a range of ancillary strands, including how the different incentivisation conditions influence search and evaluation behaviour (using eye-tracking), and the extent to which any effects are moderated by demographic factors.

The Stage 1 manuscript was evaluated over one round of in-depth review. Based on detailed responses to the recommender and reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

**URL to the preregistered Stage 1 protocol:** https://osf.io/s36c2

1. Krüger, J., Çalıklı, G., Bershadskyy, D., Heyer, R., Zabel, S. & Otto, S. (2022). Registered Report: A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation, in principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/s36c2

Chris Chambers (2022) Bug detection in software engineering: which incentives work best?

DOI or URL of the report: **https://doi.org/10.48550/arXiv.2202.10985**

Thank you for your patience as we have gathered evaluations of your submission. I have now obtained two expert reviews and I have also read your manuscript myself. As you will see, the reviews are broadly positive about your proposal, while also noting several areas that would benefit from improvement, particularly concerning the level of methodological detail, threats to validity, and specifics of the analysis plans. I have also read your manuscript carefully with a view toward ensuring that it meets the Stage 1 criteria.

Overall, I believe your proposal is a promising candidate for an RR, and should be suitable for Stage 1 in-principle acceptance once a variety of issues are addressed. I am therefore happy to invite a major revision. In revising, please include a point-by-point response to each comment of the reviewers (making clear what has changed in the manuscript, or providing a suitable rebuttal), as well as to my own points below (Recommender comments). When resubmitting, please also include a tracked-changes version of the manuscript.

Given the importance of the survey study for choosing the input parameters for the MAIT and MPIT interventions, the precision of these estimates in the survey study seems crucial. At the moment, the sample size justification for this part of the design is defined too imprecisely and arbitrarily. Instead, please provide a formal sampling plan based on the required level of precision (for guidance see the section “Planning for Precision” in https://psyarxiv.com/9d3yf/). This could be achieved analytically or through simulations.

On p3 you note: “In addition, we will distribute a second version (to distinguish both populations) of our survey through our social media networks.” How will this be taken into account in generating the parameter estimates? Will the different samples be distinguished or collapsed to produce the payoff functions?

There are two issues to address in relation to the sampling plans.

- You plan on recruiting 20 participants per group but also reserve the option to collect additional participants. In order to control the Type I error rate, standard power analysis requires a fixed stopping rule, which in turn requires committing to a specific sample size. If you want to employ a flexible stopping rule then you will need to implement a sequential design that involves regular inspection of the data between the minimum and maximum N, with the error rate corrected en route (see https://onlinelibrary.wiley.com/doi/abs/10.1002/ejsp.2023).
- At present the only reference to statistical power is in the design table: “Furthermore, we will conduct an a posteriori power analysis to reason on the power of our tests.” Power is a pre-experimental concept, and post hoc power analysis (or “observed power”) is inferentially meaningless because it simply reflects the outcomes. A formal prospective power analysis is required, either to define the sample size required to detect a smallest effect size of interest (*a priori* power analysis), or to define the smallest effect that can be detected given a maximum resource limit (so-called *sensitivity* power analysis). At present, given N=20 per group, and a strictest Holm-Bonferroni corrected alpha of .0083 for the lowest ranked p-value (assuming you apply the H-B correction for 6 tests across both hypotheses), your design has 90% power to detect d = 1.3. Any d > 1 (i.e. 1 standard deviation) is in the conventionally defined “large” range. Unless you would be happy to miss an effect smaller than d = 1.3, the sample size needs to be substantially increased. To progress I would suggest the following: (1) try to establish the smallest effect size of interest for H1 and H2, based either on theory, on the smallest practical benefit of your intervention in an applied setting, or on prior software-engineering experiments; then justify the rationale for this smallest effect of interest in the manuscript. (2) If you have no upper resource limit on sample size, then perform an *a priori* power analysis to determine the sample size necessary to correctly reject H0 for this effect size with no less than 90% power. If you *do* have an upper resource limit on sample size (which is very reasonable), then instead perform a sensitivity power analysis (see section 3.1.2 here if using G*Power) to determine what effect size you have 90% power to reject at your maximum feasible sample size, and then justify why this effect size is sufficiently small for your experiment to provide a sufficiently sensitive test of your hypotheses (H1 and H2).
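To illustrate the sensitivity calculation, here is a minimal sketch in pure Python using the normal approximation to the two-sample t-test (the exact t-based value will be slightly larger; the sample size and alpha are the figures discussed above):

```python
from statistics import NormalDist

def sensitivity_d(n_per_group, alpha, power):
    """Smallest standardized effect size d detectable with the given power
    in a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * (2 / n_per_group) ** 0.5

# n = 20 per group; strictest Holm-Bonferroni alpha for 6 tests (0.05/6)
d = sensitivity_d(n_per_group=20, alpha=0.05 / 6, power=0.90)
print(round(d, 2))  # ≈ 1.24; the exact t-based value is close to 1.3
```

A dedicated tool such as G*Power will give the exact t-distribution result, but the approximation makes the order of magnitude of the detectable effect plain.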

Please provide a precise explanation and definition of the F1-score (including a worked example of how it is calculated), and make clear that it is the *only* outcome measure that will be used to evaluate H1 and H2.
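To illustrate the kind of worked example I have in mind, here is a minimal sketch in pure Python; the specific counts are hypothetical, not taken from the manuscript:

```python
# Hypothetical review task: 9 seeded bugs; the reviewer flags 10 lines,
# of which 6 are real bugs.
true_positives = 6   # flagged lines that are real bugs
false_positives = 4  # flagged lines that are not bugs
false_negatives = 3  # real bugs the reviewer missed

precision = true_positives / (true_positives + false_positives)  # 6/10 = 0.60
recall = true_positives / (true_positives + false_negatives)     # 6/9  ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)               # harmonic mean
print(round(f1, 3))  # 0.632
```
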

On p7: “We do not plan to remove any outliers or data unless we identify a specific reason for which we believe the data would be invalid.” For a Registered Report, the precise rules for excluding data must be exhaustively specified, both within participants and also at the level of participants themselves. Where participants are excluded, make clear that they will be replaced to ensure that the target sample size is reached.

Please provide additional details on preprocessing (e.g. filtering, smoothing) of eye-tracking data to ensure that the procedures are fully reproducible. Presumably eye-tracking analyses are reserved for exploratory analyses (with no prospective hypotheses) and will therefore be reported in the “Exploratory outcomes” section of the Results at Stage 2. If so, please note this explicitly in the revised manuscript. Alternatively, if you have specific hypotheses for the effect of incentivization on the eye-tracking measures, ensure that they are fully elaborated in the main text and study design table.

On p7 you state: “Though the share of participants who will use eye trackers will be constant among all treatments, and thus should not affect treatment effects, we will further check whether the presence of eye trackers affected performance. To increase the statistical robustness, we will also conduct a regression analysis using the treatments as categorical variables and NPIT as base. As exogenous variables, we include: age, gender, experience, and arousal of the participants.” Make clear that these are exploratory analyses.

p7: "We will first check whether the assumptions required for parametric tests are fulfilled, and if not proceed with non-parametric tests." Make clear which assumptions (e.g. normality) you will test for and how, and then specify the alternative tests that will be used (presumably the Mann-Whitney U test?).

p7: "For the significance analyses, we will apply a confidence interval of p < 0.05 and correct for multiple hypotheses testing using the Holm-Bonferroni method." Do you mean "alpha level of .05" instead of "confidence interval of p < 0.05"?
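For reference, the Holm-Bonferroni step-down procedure can be sketched in a few lines of pure Python (the example p-values are hypothetical; with 6 tests, the smallest p-value is compared against 0.05/6 ≈ .0083):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject (True) / retain (False) decision for each p-value
    under the Holm-Bonferroni step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # thresholds: a/m, a/(m-1), ...
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

print(holm_bonferroni([0.004, 0.02, 0.03, 0.20, 0.50, 0.01]))
# → [True, False, False, False, False, True]
```
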

The authors have identified a significant gap in the literature. Current studies, in general, do not consider the impact of financial incentives on behaviour and performance in software development.

It should be explicitly stated how they plan to mitigate the threat to validity of having colleagues perform code reviews.

In general, the report is very well written. One thing I would change is “Seemingly, this resulted” to “This has resulted” in the abstract though.

Given that the experiment will be conducted in a controlled laboratory setting, the authors should state what threats this could present in terms of the results being transferred to industry and how such threats could be mitigated. Cost functions are discussed solely in terms of motivating participants. The authors could add a discussion of the different types of organisational objective functions that may be at play in an industrial setting, such as the organisational culture, the degree to which code quality is important to the software being developed, and the extent to which the organisation would be willing to compensate employees in this manner.

The authors should state the sample size (the number of people who will take part in the study) to justify the potential statistical results.

This RR presents a two-part study on how financial incentivization might impact code review. The first study is a survey of practitioners in which the researchers will observe the most applied and the preferred payoff methods. From this survey, they will define a set of such methods (4 a priori) to conduct a controlled experiment with students and, potentially, practitioners. In this experiment, the researchers will analyze how different payoff schemes impact the performance of software developers during code review.

This is a relevant research topic. The protocols of the studies are generally well-designed and explained. However, I have some points to discuss towards improving such studies:

* Survey:

- A major concern is the contrast between open-source and non-open-source projects. The literature clearly emphasizes that open-source projects have different motivations and activities from general software-engineering projects. I didn't see any discussion of these potential threats in the survey protocol. How can you address such threats, given that the survey will provide the set of payoff methods for the controlled experiment, which will be performed with students and, potentially, practitioners (non-open projects)?

- Why exactly do you expect at least 30 participants? Is this because of the Central Limit Theorem? If so, please make this explicit.

- What is the minimum experience time expected for the participants' profiles?

- I'd suggest running the instrument evaluation tests with practitioners rather than students, as students are not the target audience of the survey.

- As you will use the mean value for the weights, how will outliers or extreme values be treated?

- I'd suggest providing complete feedback on the results to participants at the end of the study, as a way to motivate them to take part in other surveys.

- It is not clear to me whether participants may choose more than one payoff method in the survey questionnaire. If so, 30 participants seems OK; otherwise, the sample size should be considerably larger.

* Lab Experiment

- I'd like to see a clear declaration of the independent and dependent variables in the "Metrics" section. This is essential for readers to understand the chosen experimental design.

- In the "Experimental Design" section, please state the design chosen in terms of factors and treatments, for example, 2x2, 1xn, etc.

- In the "Inferential Statistics" section, I'd suggest reporting an effect size alongside each p-value, to indicate the strength of the evidence against the null hypothesis.
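For instance, Cohen's d with a pooled standard deviation is a common companion to a two-sample t-test; a minimal sketch in pure Python (the F1-scores below are purely illustrative) could be:

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Illustrative F1-scores for two hypothetical incentivization treatments
d = cohens_d([0.70, 0.65, 0.80, 0.75], [0.55, 0.60, 0.50, 0.65])
print(round(d, 2))
```
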

All in all, the RR is well-written and easy to follow. Congrats on it and success on the studies.