Thank you for your patience while we gathered evaluations of your submission. I have now obtained two expert reviews and have also read your manuscript carefully myself, with a view toward ensuring that it meets the Stage 1 criteria. As you will see, the reviews are broadly positive about your proposal, while also noting several areas that would benefit from improvement, particularly concerning the level of methodological detail, threats to validity, and specifics of the analysis plans.
Overall, I believe your proposal is a promising candidate for an RR, and should be suitable for Stage 1 in-principle acceptance once a variety of issues are addressed. I am therefore happy to invite a major revision. In revising, please include a point-by-point response to each comment of the reviewers (making clear what has changed in the manuscript, or providing a suitable rebuttal), as well as to my own points below (Recommender comments). When resubmitting, please also include a tracked-changes version of the manuscript.
1. Justification of N=30 for the survey study
Given the importance of the survey study for choosing the input parameters for the MAIT and MPIT interventions, the precision of the estimates it produces seems crucial. At the moment, the sample size justification for this part of the design is too imprecise and arbitrary. Instead, please provide a formal sampling plan based on the required level of precision (for guidance see the section “Planning for Precision” in https://psyarxiv.com/9d3yf/). This could be achieved analytically or through simulations.
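As one illustration of the analytic route, the sketch below computes the sample size at which a 95% confidence interval for a mean reaches a chosen half-width, and the half-width implied by N=30. It assumes a normal approximation with a known standard deviation; the function names and the example values (SD of 1, half-width of 0.25) are placeholders for illustration only, not values taken from your manuscript.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_for_halfwidth(sd, halfwidth, confidence=0.95):
    """Smallest N whose CI for a mean has at most the target half-width.

    Hypothetical helper: normal approximation with an assumed, known SD.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sd / halfwidth) ** 2)


def halfwidth_for_n(sd, n, confidence=0.95):
    """Expected CI half-width for a mean at a given N (same assumptions)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)


# Placeholder inputs: SD = 1 scale unit, target half-width = 0.25 units.
required_n = n_for_halfwidth(sd=1.0, halfwidth=0.25)
# Precision implied by the currently planned N = 30 under the same SD.
current_halfwidth = halfwidth_for_n(sd=1.0, n=30)
print(required_n, round(current_halfwidth, 3))
```

A simulation-based plan would follow the same logic, replacing the closed-form interval with the interval width observed across repeated simulated surveys.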
2. Sub-samples within the survey study
On p3 you note: “In addition, we will distribute a second version (to distinguish both populations) of our survey through our social media networks.” How will this be taken into account in generating the parameter estimates? Will the different samples be distinguished or collapsed to produce the payoff functions?
3. Clarification of the statistical sampling plans for the experiment.
There are two issues to address in relation to the sampling plans.
- You plan on recruiting 20 participants per group but also reserve the option to collect additional participants. In order to control the Type I error rate, standard power analysis requires a fixed stopping rule, which in turn requires committing to a specific sample size. If you want to employ a flexible stopping rule then you will need to implement a sequential design that involves regular inspection of the data between the minimum and maximum N, with the error rate corrected en route (see https://onlinelibrary.wiley.com/doi/abs/10.1002/ejsp.2023).
- At present the only reference to statistical power is in the design table: “Furthermore, we will conduct an a posteriori power analysis to reason on the power of our tests.” Power is a pre-experimental concept, and post hoc power analysis (or “observed power”) is inferentially meaningless because it simply reflects the outcomes. A formal prospective power analysis is required, to either define the sample size required to detect a smallest effect size of interest (a priori power analysis), or to define the smallest effect that can be detected given a maximum resource limit (so-called sensitivity power analysis). At present, given N=20 per group, and a strictest Holm-Bonferroni corrected alpha of .0083 for the lowest ranked p-value (assuming you apply the H-B correction for 6 tests across both hypotheses), your design has 90% power to detect d = 1.3. Any d > 1 (i.e., 1 standard deviation) is in the conventionally-defined “large” range. Unless you would be happy to miss an effect smaller than d=1.3, the sample size needs to be substantially increased. To progress I would suggest the following: (1) try to establish what the smallest effect size of interest is for H1 and H2, either based on theory, or the smallest practical benefit of your intervention in an applied setting, or based on prior software-engineering experiments; then justify the rationale for this smallest effect of interest in the manuscript. (2) If you have no upper resource limit on sample size then perform an a priori power analysis to determine the sample size necessary to correctly reject H0 for this effect size with no less than 90% power.
If you do have an upper resource limit on sample size (which is very reasonable) then instead perform a sensitivity power analysis (see section 3.1.2 here if using G*Power) to determine what effect size you have 90% power to reject at your maximum feasible sample size, and then justify why this effect size is sufficiently small for your experiment to provide a sufficiently sensitive test of your hypotheses (H1 and H2).
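To make the two kinds of prospective power analysis concrete, here is a rough sketch for a two-sided comparison of two independent groups. It uses a normal approximation, so an exact t-based calculation (as in G*Power) gives slightly larger values, which should be preferred for the manuscript. The function names are hypothetical, and d = 0.5 is only a placeholder for whatever smallest effect size of interest you justify.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()


def detectable_d(n_per_group, alpha, power):
    """Sensitivity analysis: smallest d detectable with the given power.

    Two-sided test, two independent groups, normal approximation.
    """
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


def required_n_per_group(d, alpha, power):
    """A priori analysis: per-group N needed to detect effect size d."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Strictest Holm-Bonferroni alpha for the lowest-ranked p-value of 6 tests.
strictest_alpha = 0.05 / 6

# Sensitivity: what can N = 20 per group detect with 90% power?
d_at_20 = detectable_d(n_per_group=20, alpha=strictest_alpha, power=0.90)

# A priori: per-group N for a placeholder smallest effect of interest, d = 0.5.
n_for_half_sd = required_n_per_group(d=0.5, alpha=strictest_alpha, power=0.90)

print(round(d_at_20, 2), n_for_half_sd)
```

Under this approximation the detectable effect at N=20 per group remains well above 1 standard deviation, while a medium-sized effect would require a per-group sample several times larger than currently planned.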
4. Clarification of which specific outcomes will confirm or disconfirm the hypotheses.
For H1: In the design table you state: “We find support for H1, if our participants’ performance in NPIT is worse AND if the tests between any of our experimental treatments are significant with p < 0.05 (after correcting with the Holm-Bonferroni method).” Does “worse” here refer to each of the pairwise comparisons, or does it mean that NPIT must be numerically worse than all of (or the average of) the other conditions (OSIT, MAIT and MPIT)? If I understand correctly, the second part of your specification means that H1 is supported if any of the following contrasts is statistically significant: (NPIT < MPIT) OR (NPIT < MAIT) OR (NPIT < OSIT). If so, I suggest making this crystal clear by adding italics and including these in the “interpretation” cell of the table: “We find support for H1 if our participants’ performance in NPIT is significantly lower than in any one of our experimental treatments at p<.05 (after correcting with the Holm-Bonferroni method): (NPIT < MPIT) OR (NPIT < MAIT) OR (NPIT < OSIT)”.
For H2: If I understand correctly, any significant difference in any direction between OSIT, MAIT and MPIT would be considered support for H2. So H2 is supported if: (MPIT < > MAIT) OR (MAIT < > OSIT) OR (OSIT < > MPIT). If so, please make this clear in the interpretation cell of the design table.
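The decision rules as I have understood them for H1 and H2 can be sketched as follows. This is only my reading of your plan, assuming a single Holm-Bonferroni family of six tests; the p-values and direction flags are invented for illustration, and the function and variable names are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Takes a dict of {test label: p-value} and returns a dict of
    {test label: True if the null hypothesis is rejected}.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {label: False for label in p_values}
    for rank, (label, p) in enumerate(ordered):
        # Compare the k-th smallest p-value with alpha / (m - k + 1).
        if p <= alpha / (m - rank):
            rejected[label] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected


# Invented p-values for the six planned contrasts (illustration only).
p_values = {
    "NPIT vs MPIT": 0.004,
    "NPIT vs MAIT": 0.030,
    "NPIT vs OSIT": 0.200,
    "MPIT vs MAIT": 0.012,
    "MAIT vs OSIT": 0.450,
    "OSIT vs MPIT": 0.090,
}
# Invented direction flags: is NPIT performance numerically lower?
npit_lower = {"NPIT vs MPIT": True, "NPIT vs MAIT": True, "NPIT vs OSIT": False}

rejected = holm_bonferroni(p_values)
h1_tests = ["NPIT vs MPIT", "NPIT vs MAIT", "NPIT vs OSIT"]
h2_tests = ["MPIT vs MAIT", "MAIT vs OSIT", "OSIT vs MPIT"]

# H1: any NPIT contrast significant AND in the predicted direction.
h1_supported = any(rejected[t] and npit_lower[t] for t in h1_tests)
# H2: any difference, in either direction, among the incentive treatments.
h2_supported = any(rejected[t] for t in h2_tests)
print(h1_supported, h2_supported)
```

Writing the rule out in this form also makes it easy to check that the interpretation cell of the design table covers every possible pattern of results.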
5. Definition of the F1-score.
Please provide a precise explanation and definition of the F1-score (including a worked example of how it is calculated), and make clear that it is the only outcome measure that will be used to evaluate H1 and H2.
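To illustrate the kind of worked example I have in mind, the sketch below uses the standard definition of the F1-score as the harmonic mean of precision and recall. How true positives, false positives and false negatives map onto your task is for the manuscript to specify; the counts here are invented and the function name is hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2 * precision * recall / (precision + recall).

    Returns 0.0 when there are no true positives, where precision
    and recall are zero or undefined.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 correct answers, 2 incorrect answers given,
# and 4 correct answers missed.
# precision = 8 / 10 = 0.8, recall = 8 / 12 = 0.667
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(round(example_f1, 3))
```

An equivalent form is 2TP / (2TP + FP + FN), which for these counts is 16 / 22.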
6. Clarification of exclusion criteria.
On p7: “We do not plan to remove any outliers or data unless we identify a specific reason for which we believe the data would be invalid.” For a Registered Report, the precise rules for excluding data must be exhaustively specified, both within participants and also at the level of participants themselves. Where participants are excluded, make clear that they will be replaced to ensure that the target sample size is reached.
7. Eye-tracking acquisition and analyses
Please provide additional details on preprocessing (e.g. filtering, smoothing) of eye-tracking data to ensure that the procedures are fully reproducible. Presumably eye-tracking analyses are reserved for exploratory analyses (with no prospective hypotheses) and will therefore be reported in the “Exploratory outcomes” section of the Results at Stage 2. If so, please note this explicitly in the revised manuscript. Alternatively, if you have specific hypotheses for the effect of incentivization on the eye-tracking measures, ensure that they are fully elaborated in the main text and study design table.
8. Robustness analyses
On p7 you state: “Though the share of participants who will use eye trackers will be constant among all treatments, and thus should not affect treatment effects, we will further check whether the presence of eye trackers affected performance. To increase the statistical robustness, we will also conduct a regression analysis using the treatments as categorical variables and NPIT as base. As exogenous variables, we include: age, gender, experience, and arousal of the participants.” Make clear that these are exploratory analyses.
9. Other points
p7: "We will first check whether the assumptions required for parametric tests are fulfilled, and if not proceed with non-parametric tests." Make clear which assumptions (e.g. normality) you are going to test for, and how, and then specify the alternative tests that will be used (e.g. presumably the Mann-Whitney U test?).
p7: "For the significance analyses, we will apply a confidence interval of p < 0.05 and correct for multiple hypotheses testing using the Holm-Bonferroni method." Do you mean "alpha level of .05" instead of "confidence interval of p < 0.05"?