DOI or URL of the report: https://osf.io/rfsah
Version of the report: v2
Thank you for your careful work addressing review comments for this Stage 1 RR. The revisions have been evaluated by one of the original reviewers, who is happy with the changes. (The other original reviewer was not available at this time.)
I have looked over the paper myself, and think it is considerably improved, but there are a few more issues that I would like you to give attention to before IPA is issued for this experiment. You are not obliged to follow any suggestions made, but should provide a rationale for the course of action that you decide upon.
1) "Power analysis". In developing you study plan, you present both frequentist power analyses (for prior studies) and power analyses developed within the Bayesian framework for your planned study. Given that you seem to be applying a criterion threshold BF (>3) to support a binary claim, it may be legitimate to talk about 'power', but I think it is nonetheless potentially confusing for you to use the same language of 'power' to apply both to frequentist and Bayesian appoaches.
One key reason is that power concerns only the probability of detecting true effects (of a given size) when they are present. But your Bayesian analysis is not only asking this question: it is also configured to return evidence in favour of the null hypothesis (BF < 1/3). Thus, you state in the abstract that "... a power of .90 for detecting moderate evidence (Bayes Factor above 3 or below 1/3)", but your sample size planning actually seems to be predicated only on sensitivity to a true effect when present, without considering your sensitivity to the null when the null is true. A small tweak of your code would allow you to make more complete statements about the probability that your Bayesian analysis will return sensitive evidence for H1 or H0, and about the rates of misleading evidence (using the language of Bayes Factor Design Analysis).
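To illustrate the kind of calculation I mean, here is a rough simulation sketch (in Python rather than your own code, using a default JZS Bayes factor and placeholder values for sample size and effect size that are not taken from your design). It estimates, under both a true effect and a true null, how often the analysis would yield evidence for H1 (BF > 3), evidence for H0 (BF < 1/3), or an inconclusive result:

import numpy as np
from scipy import stats, integrate

def jzs_bf10(t, n, r=0.707):
    # Default (JZS) Bayes factor BF10 for a one-sample / paired t-test,
    # following Rouder et al. (2009): delta ~ Cauchy(0, r) under H1.
    nu = n - 1
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t ** 2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (r ** 2 / (2 * np.pi)) ** 0.5 * g ** -1.5
                * np.exp(-r ** 2 / (2 * g)))
    m1, _ = integrate.quad(integrand, 0, np.inf)
    m0 = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)
    return m1 / m0

def bfda(n=80, delta=0.3, n_sims=1000, threshold=3, seed=1):
    # Simulate the design repeatedly under H1 (true effect = delta) and
    # under H0 (no effect), and record how often each evidence category occurs.
    rng = np.random.default_rng(seed)
    results = {}
    for label, true_effect in (("H1 true", delta), ("H0 true", 0.0)):
        bfs = np.empty(n_sims)
        for i in range(n_sims):
            x = rng.normal(true_effect, 1.0, size=n)  # simulated standardised difference scores
            t_value = stats.ttest_1samp(x, 0.0).statistic
            bfs[i] = jzs_bf10(t_value, n)
        results[label] = {
            "evidence for H1 (BF > 3)": np.mean(bfs > threshold),
            "evidence for H0 (BF < 1/3)": np.mean(bfs < 1 / threshold),
            "inconclusive": np.mean((bfs >= 1 / threshold) & (bfs <= threshold)),
        }
    return results

print(bfda())

Under a true H1, the "evidence for H0" rate is the rate of misleading evidence (and vice versa under a true H0); reporting these figures alongside the .90 figure would give a complete picture of the design's sensitivity.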
2) At present, your sample size is predicated on the smallest effect size of interest (SESOI) for H3, and the tests of H1 and H2 inherit their sensitivity from this design. Strictly speaking, this means that your experimental design has not been shown to be adequate to test H1 and H2, whereas the RR format requires that you demonstrate your required standard of evidence for all hypotheses. Given that your SESOI for H3 is so small, it would seem a small extra step to make the (easy) argument that an effect size smaller than this would be similarly uninteresting for H1 and H2, which would then allow you to assert the required level of sensitivity for all hypotheses.
3) However, for your experiment to really have the required level of sensitivity for all hypotheses, then your stopping rule cannot be based on a sensitive outcome for one hypothesis only - you could only stop the experiment (prior to n-max) if a sensitive BF were found for all three hypotheses. If your plan is to terminate the experiment based on H3 only, then your plan does not have the desired level of sensitivity for H1 and H2, and you would need to relegate these hypotheses to secondary, exploratory status (i.e. remove them from the Stage 1 plan).
4) You have now added the Odd Effect as a positive control/manipulation check. However, your logical chain here simply states that it is a robust effect in parity judgements, and that you expect to find it and will be surprised if you find evidence against it (you do not state what will happen if the BF is insensitive). This does not constitute a meaningful manipulation check, because it does not seem to have any implications for your main hypothesis tests. Normally, a manipulation check is an effect that should definitely be present in the data, so that, if it is not found, there is evidence that your task has not worked as intended. Normally, when a manipulation check fails, the conclusion is that the experiment is deemed incapable of returning a clear answer on the experimental hypotheses. For this reason, the manipulation check is normally first in the list of inferential tests, to establish the adequacy of the task to the question. Like other inferential tests, it requires a power/sensitivity analysis. If the Odd Effect has this status, then you need to make this clear. If it does not, then it is not a manipulation check.
5) On the other hand, your H1 is a check that the SNARC effect is present in all number ranges. This seems much more to me like a conventional (and relevant) manipulation check, and yet you simply state that you will be surprised if you don't find it in all ranges, but do not indicate that this would limit your ability to test further experimental hypotheses in any way. I would at least have thought that finding the SNARC effect in a given range was a requirement for testing the other hypotheses with respect to that range (i.e. any tests in which that range is involved for H2 or H3). If this is not the case, then it would seem that your experiment has been configured such that you could be making claims about the range dependency of the SNARC effect even if your data showed no evidence of SNARC effects per se. I realise that this outcome is unlikely, but it is the logical coherence of your analysis plan that is at stake.
6) The manuscript is long and complex. There are no word limits at PCI-RR, but with the aim of future publication, I would strongly encourage you to try to be more concise wherever possible (obviously, without omitting any essential material).
7) The new title is rather unwieldy: “One and only SNARC? How Flexible are The Flexibility of Spatial-Numerical Associations? A Registered Report on the SNARC’s Range Dependency”.
Should the first part be: "One and only one SNARC?"? The second and third parts seem to be two alternate sub-titles; there should probably be one and only one sub-title.
I hope that these comments are helpful. You have done a lot of good work to sharpen this plan up, but I think that a little further sharpening and streamlining is required before IPA.
Best wishes,
Rob McIntosh
The authors have carefully revised their report. I highly appreciate their efforts and I am fully satisfied with how they addressed my concerns and incorporated my suggestions.
DOI or URL of the report: https://osf.io/heq6p
Version of the report: v1
Thank you for your patience, and I apologise that it has taken so long to return reviews for your manuscript. It proved tricky to find reviewers, but once suitable and willing people were found, the reviews were thorough, and I think you will find them very helpful in guiding a revision of this Stage 1 plan. Reviewer#2 in particular is very familiar with the logic and rigours of Registered Reports, and provides excellent guidance on related considerations.
Both reviewers are generally positive about the proposed study, but have substantive concerns and suggestions for improvement. You should consider (and respond to) all of these points carefully. I would emphasise the following in particular, adding some comments of my own.
Reviewer#1 (Melina Mende) makes a number of requests for clarity, and emphasises the need for a full justification for the approach to data trimming. The approach to treatment of Reaction Times is absolutely critical (since it is the basis for the core dependent measure). It is described in detail, but it is not rationalised. The treatment of RT is a complex issue, and decisions about whether (and how) to exclude outliers and/or to transform data and/or to use robust estimators of central tendency per cell (i.e. medians) are complex and ideally should be informed by a good working knowledge of the characteristics of the data in your experiment. (In general, of course, pre-registered approaches may tend towards more conservative and robust approaches, because the final form of the data cannot be known in advance.)
On this point, although this Stage 1 RR seems to have been thought through carefully, I do not see direct evidence that the tasks have been fully piloted, with pilot data used to exercise the full proposed analysis pipeline. It may be that you have done such piloting, or perhaps you have used the same data collection approach/platform in a previous study, so your analysis pipeline is well established. If so, then you should describe this relevant history in the present RR. If not, then I honestly think it is necessary to conduct a reasonably-sized pilot in order to debug and optimise your analysis plan.
This relates also to the points made by Reviewer#2 regarding your quality checks; establishing confidence that the experiment is capable of testing the hypotheses of interest seems essential here.
I also agree strongly with this reviewer that the purpose and inferential role of all parts of the analysis must be clear (e.g. how will the outcome of the dropout analysis be used to inform interpretation of findings), and that the exploratory analyses should be removed from the Stage 1 plan. The ‘follow-up’ analyses should be elevated to full inferential status and specified fully or, if not essential to the main conclusions, relegated to exploratory status and omitted (the latter approach may be preferred as simpler, given that your analysis plan is already rather complex).
I also concur with the idea that combining frequentist and Bayesian approaches seems unnecessarily complex and ambiguous. If these approaches do not lead to the same outcomes then which approach will you be guided by? (And then why should you bother to include the other approach at all?) It is of course possible to include parallel frequentist and Bayesian analyses in an RR, but specifying unambiguously which theoretical conclusions will follow for the full range of possible outcomes becomes very complex.
With regard to the frequentist analysis, I have some concerns about the approach to apha levels and (non-)adjustment for multiple comparisons. In the text you state: “For each test described below, a significance level of α = .01 will be used. The reason for using a rather conservative significance level is that we will conduct multiple tests per hypothesis… Importantly, the significance level does not need to be corrected for the total number of conducted tests in this study, because the tests belong to different test families and because different theoretical inferences can be drawn from their results (Lakens, 2016). Moreover, we will look at each result individually and not generalize from one single significant result within a test family to the presence of an effect in both experiments and in all possible number ranges, so that our interpretations will not inflate the familywise error rate.”
This sounds very thorough, but I am not sure it is sound/coherent. First you state that you adopt a conservative significance level so that you don’t have to adjust for multiple comparisons – it would be more transparent to state what the significance criterion is, and how it has been adjusted for (how many) comparisons. Without this, it is unclear what your effective significance threshold is. In apparent contradiction to the above you then go on to state that the threshold does not need to be adjusted because the individual tests are all testing independent hypotheses, and you will interpret each individually. This logic is repeated in the design table.
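For example (with purely illustrative numbers, not a prescription for your analysis), a statement of the form

α_per-test = α_familywise / m = .05 / 5 = .01

would make explicit both the effective threshold and the number of comparisons (m) for which it has been adjusted.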
Although this approach may seem appealing, I am not sure that it is convincing in the present case. As far as I am aware, there is no theory proposing that functionally independent SNARC effects may exist for your different number ranges. In any case, you also state that “not finding it [the SNARC effect] in one of the four ranges despite our large sample would speak against the robustness of the SNARC effect”. This means that the results are not really being evaluated independently for each number range, but are considered together to bear on the same theoretical question. Moreover, it is not convincing to state that the failure to find the result in one of the four ranges would speak against the robustness of the SNARC effect, because 90% power implies a 10% chance of a false negative in any one range (and thus a 1 − .9^4 ≈ 34% chance of at least one false negative across the four ranges).
In general, I think that your statistical approach needs better justification and specification, and that it might benefit from simplification (e.g. by deciding on either a frequentist or Bayesian approach). In passing, I note that you refer to another paper (Roth, Lukács, et al., 2022) for your power calculations (which, confusingly, seems to be an earlier version of this same RR plan). In any case, power calculations are an integral part of a Stage 1 RR plan and so should be described fully in the RR itself.
I hope that the reviewers' helpful comments will be useful to you in taking this project forward, and if you decide to revise this Stage 1 RR, then you should indicate how you have responded to each of the comments made, including the additional ones above.
The article is well-written and targets an interesting and relevant issue. The researchers aim to investigate the RM-dependency and AM-dependency of the SNARC effect. I think that this work is a positive example of a well-designed study in which many careful considerations were made, starting with the optimal number of trials per cell and the power calculations. Further, the planned data analysis is also well described, with useful measures to improve the quality of the statistical analysis. Overall, I have just some minor suggestions to further improve this work.
p.7
“In that study, the observed result pattern looked like Scenario 5 in Figure 1”.
Figure 1 is too far away from this claim. I suggest either introducing the figure earlier or not referring to it at this stage.
p.7
The content of footnote 1 would be better placed in the main text, together with the previous explanation of how to calculate the SNARC effect.
p. 10/11
The scenarios are very hard to understand, even though they are illustrated in Figure 1; one needs to scroll up and down a lot. Maybe you could divide the figure into parts, so that each scenario is explained and the corresponding part of the figure is then shown directly.
p. 12
“namely 0 to 5 and 4 to 9 in Experiment 1, and 1 to 5 (excluding 3) and 4 to 8 (excluding 6) in Experiment 2”
Which study are you referring to?
p. 16
I do understand your design approach and I think that the two experiments are well elaborated. Nonetheless, I do not understand the content of Table 2. What do you mean by “Parity +0.5”/”Parity -0.5”?
p.17
Why not visualize the time course of the stimulus presentation with a figure?
p. 18
“This figure shows the four between-subjects conditions”
Why don’t you want to use a fully within-subject design?
p. 18
“handedness, and finger-counting habits”
How will these be measured?
p. 18
“Participants may choose response keys for the experimental task which are to be located in the same row and about one hand width apart from each other on their keyboard”
Even if the keys are one hand width apart from each other, how do you make sure that participants do not use just one hand for giving their responses?
p. 19
“Only trials with RTs between 200 and 1500 ms will be included in the analysis. Further outliers will be removed in an iterative trimming procedure for each participant separately, such that only RTs that are maximum 3 SDs above or below the individual mean RT of all remaining trials will be considered. Finally, only datasets of participants with at least 75% valid remaining trials and without any empty experimental cell (number magnitude per response side) in both number ranges will be considered.”
Please specify how and why you selected these criteria. Such data-trimming criteria are often similar in the literature, but not identical, so I would like to learn about your justification for using these particular criteria.
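To make sure I have understood the procedure correctly: in my reading, the quoted criteria correspond to something like the following sketch (Python/pandas with placeholder column names; this is only my reconstruction, not the authors' code):

import numpy as np
import pandas as pd

def trim_rts(df, lo=200, hi=1500, sd_crit=3.0):
    # Step 1: absolute RT window (200-1500 ms).
    df = df[(df["rt"] >= lo) & (df["rt"] <= hi)].copy()
    # Step 2: iterative +/- 3 SD trimming per participant, recomputing the mean
    # and SD of the remaining trials until no further trials are excluded.
    kept = []
    for pid, sub in df.groupby("participant"):
        rts = sub["rt"].to_numpy()
        keep = np.ones(len(rts), dtype=bool)
        while keep.sum() > 1:
            m, s = rts[keep].mean(), rts[keep].std(ddof=1)
            new_keep = keep & (np.abs(rts - m) <= sd_crit * s)
            if new_keep.sum() == keep.sum():
                break
            keep = new_keep
        kept.append(sub[keep])
    # Step 3 (participant-level exclusion: at least 75% valid trials and no empty
    # magnitude-by-response-side cell in either number range) would follow here.
    return pd.concat(kept, ignore_index=True)

If this is the intended pipeline, stating in the manuscript why each criterion was chosen (and whether the conclusions are expected to be robust to reasonable alternatives) would address my concern.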
The authors of the present Stage 1 Registered Report aim to investigate the flexibility of spatial-numerical associations by means of two experiments, one being a close replication and the other a conceptual replication of previous studies in the field. I think the topic is highly timely, given the accumulating evidence on SNAs and their implications. Overall, the authors have obviously taken great care in reviewing the existing literature and in assessing the current methodological limitations. However, the implementation of this study as a Registered Report is still suboptimal; my main concerns are outlined below.