Bug detection in software engineering: which incentives work best?
A Laboratory Experiment on Using Different Financial-Incentivization Schemes in Software-Engineering Experimentation
Abstract
Recommendation: posted 11 September 2024, validated 11 September 2024
Chambers, C. (2024) Bug detection in software engineering: which incentives work best?. Peer Community in Registered Reports, 100746. 10.24072/pci.rr.100746
This is a Stage 2 manuscript based on:
Jacob Krüger, Gül Çalıklı, Dmitri Bershadskyy, Robert Heyer, Sarah Zabel, Siegmar Otto
https://doi.org/10.48550/arXiv.2202.10985
Recommendation
URL to the preregistered Stage 1 protocol: https://osf.io/s36c2
List of eligible PCI RR-friendly journals:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #1
DOI or URL of the report: https://arxiv.org/pdf/2202.10985.pdf
Version of the report: v6
Author's Reply, 13 Aug 2024
Decision by Chris Chambers, posted 31 May 2024, validated 31 May 2024
- Abstract. Re the sentence: “Due to the small sample sizes, our results are not statistically significant, but we can still observe clear tendencies.” Statistically non-significant results can arise either because the null hypothesis is true or because the test is insensitive. When relying on null hypothesis significance testing, we cannot know for certain which is the case (at least not without adding frequentist equivalence testing or Bayesian hypothesis testing); therefore I would like you to consider replacing the sentence above with a less deterministic statement about the potential reasons for non-significance: “Our results are not statistically significant, possibly due to small sample sizes and consequent lack of statistical power, but with some notable trends [that may inspire future hypothesis generation].” (The section in square brackets is a stylistic addition that you are free to omit, but I replaced “clear” with “notable” because non-significant trends are by definition unclear, and it is a potential source of interpretative bias to overstate their importance.)
- Table 1 (p4). Please add a column to the far right of this table called “observed outcome” that briefly summarises the results and, in particular, states the degree of support for each hypothesis (H1, H2) in the second section (i.e. supported, not supported, with the statement based strictly on the outcomes of the preregistered analyses rather than any additional exploratory analyses). To make room for this column, I suggest moving the content in the “disproved theory” column to the table caption (since it applies generally to all aspects of the study); that column can then be removed from the table to make space for the “observed outcome” column.
- p15: Explain p_adjusted in a footnote the first time it is used. I believe it is simply the alpha-corrected value following Holm-Bonferroni correction (?), in which case p_adjusted should actually be reported as >.99 rather than =1, because a p value can never equal exactly 1.
- p15: replace “insignificantly” with “non-significantly”, consistent with standard statistical parlance.
Once these issues are addressed, I will issue final Stage 2 acceptance without further peer review. In anticipation of the next version being the final preprint, once you have made these revisions please update the latest version of the preprint to be a clean version (rather than tracked changes), but upload a tracked-changes version (that shows only these latest revisions) in the PCI RR system when you resubmit.
Reviewed by Edson OliveiraJr, 16 May 2024
Congratulations on the rigor of the submitted Stage 2 manuscript and its fidelity to the approved RR. I answered the recommended questions regarding this submission as follows.
=========================================
Have the authors provided a direct URL to the approved protocol in the Stage 2 manuscript? Did they stay true to their protocol? Are any deviations from protocol clearly justified and fully documented?
R.: The authors provided the arXiv DOI, which resolves directly to the Stage 2 manuscript rather than to the exact URL of the approved RR, which is https://arxiv.org/abs/2202.10985v4; this should be noted to avoid misunderstanding by readers.
=========================================
Is the Introduction in the Stage 1 manuscript (including hypotheses) the same as in the Stage 2 manuscript? Are any changes transparently flagged?
R.: The introduction sections of the Stage 1 and Stage 2 manuscripts are the same, with no relevant changes.
=========================================
Did any prespecified data quality checks, positive controls, or tests of intervention fidelity succeed?
R.: The pre-specified checks succeeded as part of the survey and experiment that were performed.
=========================================
Are any additional post hoc analyses justified, performed appropriately, and clearly distinguished from the preregistered analyses? Are the conclusions appropriately centered on the outcomes of the preregistered analyses?
R.: Yes. There was an additional iteration of the survey with 8 participants to achieve the stipulated sample size. Another subtle change was the merging of the MAIT and MPIT metrics, which turned out to be identical; this possibility was anticipated in the RR survey design. As a consequence of this merge, the experimental design changed from a 4x1 to a 3x1 experiment, thus modifying the treatment function. No other deviations were detected.
=========================================
Are the overall conclusions based on the evidence?
R.: The conclusions are solid and based on the evidence provided by the analyses of the survey and experiment.