Bug detection in software engineering: which incentives work best?
Registered Report: A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation
Recommendation: posted 15 July 2022, validated 15 July 2022
URL to the preregistered Stage 1 protocol: https://osf.io/s36c2
List of eligible PCI RR-friendly journals:
Chris Chambers (2022) Bug detection in software engineering: which incentives work best?. Peer Community in Registered Reports, . https://rr.peercommunityin.org/articles/rec?id=186
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #1
DOI or URL of the report: https://doi.org/10.48550/arXiv.2202.10985
Author's Reply, 08 Jul 2022
Decision by Chris Chambers, posted 21 May 2022
- You plan on recruiting 20 participants per group but also reserve the option to collect additional participants. In order to control the Type I error rate, standard power analysis requires a fixed stopping rule, which in turn requires committing to a specific sample size. If you want to employ a flexible stopping rule then you will need to implement a sequential design that involves regular inspection of the data between the minimum and maximum N, with the error rate corrected en route (see https://onlinelibrary.wiley.com/doi/abs/10.1002/ejsp.2023).
- At present the only reference to statistical power is in the design table: “Furthermore, we will conduct an a posteriori power analysis to reason on the power of our tests.” Power is a pre-experimental concept, and post hoc power analysis (or “observed power”) is inferentially meaningless because it simply reflects the outcomes. A formal prospective power analysis is required, to either define the sample size required to detect a smallest effect size of interest (a priori power analysis), or to define the smallest effect that can be detected given a maximum resource limit (so-called sensitivity power analysis). At present, given N=20 per group, and a strictest Holm-Bonferroni correct alpha of .0083 for the lowest ranked p-value (assuming you apply the H-B correction for 6 tests across both hypotheses), your design has 90% power to detect d = 1.3. Any d > 1 (i.e 1 standard deviation) is in the conventionally-defined “large” range. Unless you would be happy to miss an effect smaller than d=1.3, the sample size needs to be substantially increased. To progress I would suggest the following: (1) try to establish what the smallest effect size of interest is for H1 and H2, either based on theory, or the smallest practical benefit of your intervention in an applied setting, or based on prior software-engineering experiments; then justify the rationale for this smallest effect of interest in the manuscript. (2) if you have no upper resource limit on sample size then perform an a priori power analysis to determine the sample size necessary to correctly reject H0 for this effect size with no less than 90% power. If you do have an upper resource limit on sample size (which is very reasonable) then instead perform a sensitivity power analysis (see section 3.1.2 here if using G*Power) to determine what effect size you have 90% power to reject at your maximum feasible sample size, and then justify why this effect size is sufficiently small for your experiment to provide a sufficiently sensitive test of your hypotheses (H1 and H2).
Reviewed by anonymous reviewer, 07 Apr 2022
The authors have identified a significant gap in the literature. Current studies, in general, do not consider the impact of financial incentives in affecting behaviour and performance developing software.
It should be explicitly stated how they plan to mitigate the thread to validity of having colleagues perform code reviews.
In general, the report is very well written. One thing I would change is “Seemingly, this resulted” to “This has resulted” in the abstract though.
Given that the experiment will be conducted in a controlled laboratory setting, the authors should state what threats this could present in terms of the results being transferred to industry and how such threats could be mitigated. Cost functions are discussed solely in terms of motivating participants. The authors could add a discussion on the different types of organisational objective functions that may be at play in an industrial setting. Such as, the organisational culture and the degree to which code quality is important to the software being the developed and the extent the organisation would be willing to compensate employees in this manner.
The authors should state the sample size or the number of people that will partake in the study to justify the potential statistical results.
Reviewed by Edson OliveiraJr, 20 May 2022
This RR presents a two-package study on how financial-incentivization might impact in code review. The first study is a survey with practitioners in which researchers will observe the most applied and the preferred payoff methods. From this survey, they will define a set of such methods (4 a priori) to conduct a controlled experiment with students and, potentially, practitioners. In such an experiment, the researchers will analyze how different payoff schemes impact the performance of software developers during code review.
This is a relevant research topic. The protocols of the studies are generally well-designed and explained. However, I have some points to discuss towards improving such studies:
- a major concern is on the open vs. non-open source projects. Literature clearly emphasizes they have different motivations and activities from general software engineering projects. I didn´t see any discussion on these potential threats in the survey protocol. How do you can extrapolate such threats as you will provide a set of payoff methods to the controlled experiment, which will be performed with students and, potentially, practitioners (non-open projects)?
- Why do exactly you expect at least 30 participants? Is this because of the probability's Central Limit Theorem? If so, please make this explicit.
- What is the minimum experience time expected for the participant's profiles?
- I'd suggest to run the instrument evaluation tests with practitioners rather than students, as students are not the target audience of the survey.
- As you will use the mean value for the weights, how will outliers or extreme values be treated?
- I'd suggest providing a complete feedback on results for participants at the end of the study, as a way to motivate them to take other surveys.
- It is not clear to me, if the participants may choose more than one payoff method in the survey questionnaire. If so, 30 participants seem ok, otherwise, the sample size should be considerably larger.
* Lab Experiment
- I'd like to see clearly the declaration of independent and dependent variables in the "Metrics" section. This is essential for readers to understand the chosen Experimental Design.
- In the "Experimental Design" section, please provide the design chosen in terms of factors and treatments, for example, 2x2, 1xn, etc..
- in the "Inferential Statistics" I'd suggest running an effect size test to provide the strength of the p-value over the null hypothesis results.
All in all, the RR is well-written and easy to follow. Congrats on it and success on the studies.