Is the SNARC effect modulated by absolute number magnitude?
One and only SNARC? A Registered Report on the SNARC Effect’s Range Dependency
Recommendation: posted 28 November 2023, validated 28 November 2023
McIntosh, R. (2023) Is the SNARC effect modulated by absolute number magnitude? Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=352
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
- Advances in Cognitive Psychology
- Collabra: Psychology
- Experimental Psychology
- Journal of Cognition
- Peer Community Journal
- Psychology of Consciousness: Theory, Research, and Practice
- Royal Society Open Science
- Studia Psychologica
- Swiss Psychology Open
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #2
DOI or URL of the report: https://osf.io/rfsah
Version of the report: v2
Author's Reply, 22 Nov 2023
Decision by Robert McIntosh, posted 08 Sep 2023, validated 08 Sep 2023
Thank you for your careful work addressing review comments for this Stage 1 RR. The revisions have been evaluated by one of the original reviewers, who is happy with the changes. (The other original reviewer was not available at this time.)
I have looked over the paper myself, and think it is considerably improved, but there are a few more issues that I would like you to give attention to before IPA is issued for this experiment. You are not obliged to follow any suggestions made, but should provide a rationale for the course of action that you decide.
1) "Power analysis". In developing your study plan, you present both frequentist power analyses (for prior studies) and power analyses developed within the Bayesian framework for your planned study. Given that you seem to be applying a criterion threshold BF (>3) to support a binary claim, it may be legitimate to talk about 'power', but I think it is nonetheless potentially confusing for you to use the same language of 'power' to apply both to frequentist and Bayesian approaches.
One key reason is that power concerns only the probability of detecting true effects (of a given size) when present. But your Bayesian analysis is not only asking this question, but is configured also to return evidence in favour of the null hypothesis (BF < 1/3). Thus, you state in the abstract that "... a power of .90 for detecting moderate evidence (Bayes Factor above 3 or below 1/3)", but actually your sample size planning seems to be predicated only on sensitivity to a true effect when present, without considering your sensitivity to the null when the null is true. A small tweak of your code would allow you to make more complete statements about the probability that your Bayesian analysis returns sensitive evidence for H1 or H0, and the rates of misleading evidence (using the language of Bayes Factor Design Analysis).
2) At present, your sample size is predicated on the smallest effect size of interest for H3, and the tests of H1 and H2 inherit their sensitivity from this design. Strictly speaking, this means that your experimental design has not been shown to be adequate to test H1 and H2, whereas the RR format requires that you demonstrate your required standard of evidence for all hypotheses. Given that your SESOI for H3 is so small, it would seem to be a small extra step for you to make the (easy) argument that an effect size smaller than this would be similarly uninteresting for H1 and H2, which would then allow you to assert the required level of sensitivity for all hypotheses.
3) However, for your experiment to really have the required level of sensitivity for all hypotheses, then your stopping rule cannot be based on a sensitive outcome for one hypothesis only - you could only stop the experiment (prior to n-max) if a sensitive BF were found for all three hypotheses. If your plan is to terminate the experiment based on H3 only, then your plan does not have the desired level of sensitivity for H1 and H2, and you would need to relegate these hypotheses to secondary, exploratory status (i.e. remove them from the Stage 1 plan).
4) You have now added the Odd Effect as a positive control/manipulation check. However, your logical chain here simply states that it is a robust effect in parity judgements, and that you expect to find it and will be surprised if you find evidence against it (you do not state what will happen if the BF is insensitive). This does not constitute a meaningful manipulation check, because it does not seem to have any implications for your main hypothesis tests. Normally, a manipulation check is an effect that should definitely be present in the data so that, if it is not found, there is evidence that your task has not worked as intended. When a manipulation check is failed, the conclusion is that the experiment is deemed incapable of returning a clear answer on the experimental hypotheses. For this reason, the manipulation check is normally first in the list of inferential tests, to establish the adequacy of the task to the question. Like other inferential tests, it requires a power/sensitivity analysis. If the Odd Effect has this status, then you need to make this clear. If it does not, then it is not a manipulation check.
5) On the other hand, your H1 is a check that the SNARC effect is present in all number ranges. This seems much more to me like a conventional (and relevant) manipulation check, and yet you simply state that you will be surprised if you don't find it in all ranges, but do not indicate that this would limit your ability to test further experimental hypotheses in any way. I would at least have thought that finding the SNARC effect in a given range was a requirement for testing the other hypotheses with respect to that range (i.e. any tests in which that range is involved for H2 or H3). If this is not the case, then it would seem that your experiment has been configured such that you could be making claims about the range dependency of the SNARC effect even if your data showed no evidence of SNARC effects per se. I realise that this outcome is unlikely, but it is the logical coherence of your analysis plan that is at stake.
6) The manuscript is long and complex. There are no word limits at PCI-RR, but with the aim of future publication, I would strongly encourage you to try to be more concise wherever possible (obviously, without omitting any essential material).
7) The new title is rather unwieldy: “One and only SNARC? How Flexible are The Flexibility of Spatial-Numerical Associations? A Registered Report on the SNARC’s Range Dependency”.
Should the first part be: "One and only one SNARC?"? The second and third parts seem to be two alternate sub-titles; there should probably be one and only one sub-title.
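As an illustration of points 1 and 3 above, the following is a minimal sketch of a Bayes Factor Design Analysis, not the authors' actual pipeline. It assumes a one-sample test on standardised, normally distributed scores, and it swaps in the BIC approximation to the Bayes factor in place of whatever prior the authors specify. The function names (`log_bf10_bic`, `simulate_rates`, `may_stop`), the sample size, and the effect size of 0.3 are all hypothetical.

```python
import math
import random
import statistics

LOG_3 = math.log(3.0)  # evidence threshold BF > 3 (or BF < 1/3), on the log scale


def log_bf10_bic(sample):
    """Log BF10 for a one-sample test of mean = 0.

    Assumption: uses the BIC approximation,
    BF01 ~= sqrt(n) * (1 + t^2 / (n - 1)) ** (-n / 2),
    as a stand-in for a default-prior Bayes factor.
    """
    n = len(sample)
    t = statistics.fmean(sample) / (statistics.stdev(sample) / math.sqrt(n))
    log_bf01 = 0.5 * math.log(n) - (n / 2) * math.log1p(t * t / (n - 1))
    return -log_bf01


def classify(log_bf10, log_threshold=LOG_3):
    """Label a Bayes factor as sensitive for H1, sensitive for H0, or inconclusive."""
    if log_bf10 > log_threshold:
        return "supports_H1"
    if log_bf10 < -log_threshold:
        return "supports_H0"
    return "inconclusive"


def simulate_rates(true_effect, n, n_sims, seed=0):
    """Proportion of simulated studies ending in each evidence category."""
    rng = random.Random(seed)
    counts = {"supports_H1": 0, "supports_H0": 0, "inconclusive": 0}
    for _ in range(n_sims):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        counts[classify(log_bf10_bic(sample))] += 1
    return {label: count / n_sims for label, count in counts.items()}


def may_stop(log_bf10_by_hypothesis, n, n_max):
    """Stop early only if every hypothesis has a sensitive BF, or n_max is reached."""
    all_sensitive = all(
        classify(value) != "inconclusive"
        for value in log_bf10_by_hypothesis.values()
    )
    return all_sensitive or n >= n_max


if __name__ == "__main__":
    # Hypothetical design: n = 200, true effect either 0 (H0 true) or 0.3 (H1 true).
    print("H0 true:", simulate_rates(0.0, n=200, n_sims=300, seed=1))
    print("H1 true:", simulate_rates(0.3, n=200, n_sims=300, seed=1))
```

When the null is true, the `supports_H1` proportion is the rate of misleading evidence; when the effect is real, `supports_H0` plays that role. Reporting both sets of rates covers sensitivity to H1 and to H0, and `may_stop` expresses a stopping rule that requires a sensitive outcome for all hypotheses rather than for H3 alone.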
I hope that these comments are helpful. You have done a lot of good work to sharpen this plan up, but I think that a little further sharpening and streamlining is required before IPA.
Reviewed by anonymous reviewer 1, 27 Aug 2023
Evaluation round #1
DOI or URL of the report: https://osf.io/heq6p
Version of the report: v1
Author's Reply, 25 Jul 2023
Decision by Robert McIntosh, posted 04 Apr 2023, validated 04 Apr 2023
Thank you for your patience, and I apologise that it has taken so long to return reviews for your manuscript. It proved tricky to find reviewers, but once suitable and willing people were found, the reviews were thorough, and I think you will find them very helpful in guiding a revision of this Stage 1 plan. Reviewer#2 in particular is very familiar with the logic and rigours of Registered Reports, and provides excellent guidance on related considerations.
Both reviewers are generally positive about the proposed study, but have substantive concerns and suggestions for improvement. You should consider (and respond to) all of these points carefully. I would emphasise the following in particular, adding some comments of my own.
Reviewer#1 (Melina Mende) makes a number of requests for clarity, and emphasises the need for a full justification for the approach to data trimming. The approach to treatment of Reaction Times is absolutely critical (since it is the basis for the core dependent measure). It is described in detail, but it is not rationalised. The treatment of RT is a complex issue, and decisions about whether (and how) to exclude outliers and/or to transform data and/or to use robust estimators of central tendency per cell (i.e. medians) are complex and ideally should be informed by a good working knowledge of the characteristics of the data in your experiment. (In general, of course, pre-registered approaches may tend towards more conservative and robust approaches, because the final form of the data cannot be known in advance.)
On this point, although this Stage 1 RR seems to have been thought through carefully, I do not see direct evidence of the tasks having been fully piloted, where pilot data would allow the proposed analysis pipeline to be tested in full. It may be that you have done such piloting, or perhaps you have used the same data collection approach/platform in a previous study, so your analysis pipeline is well established. If so, then you should describe this relevant history in the present RR. If not, then I honestly think it is necessary to conduct a reasonably-sized pilot in order to debug and optimise your analysis plan.
This relates also to the points made by Reviewer#2 regarding your quality checks; checks that give confidence that the experiment is capable of testing the hypotheses of interest seem essential here.
I also agree strongly with this reviewer that the purpose and inferential role of all parts of the analysis must be clear (e.g. how will the outcome of the dropout analysis be used to inform interpretation of findings), and that the exploratory analyses should be removed from the Stage 1 plan. The ‘follow-up’ analyses should be elevated to full inferential status and specified fully or, if not essential to the main conclusions, relegated to exploratory status and omitted (the latter approach may be preferred as simpler, given that your analysis plan is already rather complex).
I also concur with the idea that combining frequentist and Bayesian approaches seems unnecessarily complex and ambiguous. If these approaches do not lead to the same outcomes then which approach will you be guided by? (And then why should you bother to include the other approach at all?) It is of course possible to include parallel frequentist and Bayesian analyses in an RR, but specifying unambiguously which theoretical conclusions will follow for the full range of possible outcomes becomes very complex.
With regard to the frequentist analysis, I have some concerns about the approach to alpha levels and (non-)adjustment for multiple comparisons. In the text you state: “For each test described below, a significance level of α = .01 will be used. The reason for using a rather conservative significance level is that we will conduct multiple tests per hypothesis… Importantly, the significance level does not need to be corrected for the total number of conducted tests in this study, because the tests belong to different test families and because different theoretical inferences can be drawn from their results (Lakens, 2016). Moreover, we will look at each result individually and not generalize from one single significant result within a test family to the presence of an effect in both experiments and in all possible number ranges, so that our interpretations will not inflate the familywise error rate.”
This sounds very thorough, but I am not sure it is sound/coherent. First you state that you adopt a conservative significance level so that you don’t have to adjust for multiple comparisons – it would be more transparent to state what the significance criterion is, and how it has been adjusted for (how many) comparisons. Without this, it is unclear what your effective significance threshold is. In apparent contradiction to the above you then go on to state that the threshold does not need to be adjusted because the individual tests are all testing independent hypotheses, and you will interpret each individually. This logic is repeated in the design table.
Although this approach may seem appealing, I am not sure that it is convincing in the present case. As far as I am aware, there is no theory proposing that functionally independent SNARC effects may exist for your different number ranges. In any case, you also state that “not finding it [the SNARC effect] in one of the four ranges despite our large sample would speak against the robustness of the SNARC effect”. This means that the results are not really being evaluated independently for each number range, but considered together to bear on the same theoretical question. Moreover, it is not convincing to state that the failure to find the result in one of the four ranges would speak against the robustness of the SNARC effect, because 90% power implies a 10% chance of a false negative in any one range (~34% chance of at least one false negative result across the four ranges, if the tests are independent).
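The false-negative arithmetic in the paragraph above can be checked with a short calculation. This is a sketch under the simplifying assumption that the four range-wise tests are independent and each has exactly 90% power; the function name is hypothetical.

```python
def prob_at_least_one_miss(power, n_tests):
    """Chance that at least one of n_tests independent tests misses a true effect."""
    return 1.0 - power ** n_tests


if __name__ == "__main__":
    # Four number ranges, each tested with 90% power (assumed independent).
    print(round(prob_at_least_one_miss(0.90, 4), 3))  # 1 - 0.9**4 = 0.344
```

Under these assumptions the chance of at least one false negative is roughly one in three, so a single missing SNARC effect among four ranges would not be a surprising outcome even if the effect were present in every range.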
In general, I think that your statistical approach needs better justification and specification, and that it might benefit from simplification (e.g. by deciding on either a frequentist or Bayesian approach). In passing, I note that you refer to another paper (Roth, Lukács, et al., 2022) for your power calculations (which, confusingly, seems to be an earlier version of this same RR plan). In any case, power calculations are an integral part of a Stage 1 RR plan and so should be described fully in the RR itself.
I hope that the reviewers' helpful comments will be useful to you in taking this project forward, and if you decide to revise this Stage 1 RR, then you should indicate how you have responded to each of the comments made, including the additional ones above.