Is the SNARC effect modulated by absolute number magnitude?
One and only SNARC? A Registered Report on the SNARC Effect’s Range Dependency
Recommendation: posted 28 November 2023, validated 28 November 2023
McIntosh, R. (2023) Is the SNARC effect modulated by absolute number magnitude? Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=352
Related stage 2 preprints:
Lilly Roth, John Caffier, Ulf-Dietrich Reips, Hans-Christoph Nuerk, Annika Tave Overlander, Krzysztof Cipora
https://osf.io/ajqpk
Recommendation
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
- Advances in Cognitive Psychology
- Collabra: Psychology
- Experimental Psychology
- Journal of Cognition
- Peer Community Journal
- PeerJ
- Psychology of Consciousness: Theory, Research, and Practice
- Royal Society Open Science
- Studia Psychologica
- Swiss Psychology Open
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #2
DOI or URL of the report: https://osf.io/rfsah
Version of the report: v2
Author's Reply, 22 Nov 2023
Decision by Robert McIntosh, posted 08 Sep 2023, validated 08 Sep 2023
Thank you for your careful work addressing review comments for this Stage 1 RR. The revisions have been evaluated by one of the original reviewers, who is happy with the changes. (The other original reviewer was not available at this time.)
I have looked over the paper myself and think it is considerably improved, but there are a few more issues that I would like you to attend to before IPA is issued for this experiment. You are not obliged to follow any suggestions made, but you should provide a rationale for whatever course of action you decide upon.
1) "Power analysis". In developing you study plan, you present both frequentist power analyses (for prior studies) and power analyses developed within the Bayesian framework for your planned study. Given that you seem to be applying a criterion threshold BF (>3) to support a binary claim, it may be legitimate to talk about 'power', but I think it is nonetheless potentially confusing for you to use the same language of 'power' to apply both to frequentist and Bayesian appoaches.
One key reason is that power concerns only the probability of detecting true effects (of a given size) when present. Your Bayesian analysis, however, is not asking only this question: it is also configured to return evidence in favour of the null hypothesis (BF < 1/3). Thus, you state in the abstract that "... a power of .90 for detecting moderate evidence (Bayes Factor above 3 or below 1/3)", but your sample size planning actually seems to be predicated only on sensitivity to a true effect when present, without considering your sensitivity to the null when the null is true. A small tweak of your code would allow you to make more complete statements about the probability of your Bayesian analysis returning sensitive evidence for H1 or H0, and about the rates of misleading evidence (using the language of Bayes Factor Design Analysis); a minimal illustration is sketched below.
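For illustration only, a fixed-n design analysis of this kind might look like the sketch below. This is not your code: the sample size, effect size, prior width, and the simple one-sample test on standardized scores are placeholder assumptions standing in for your actual SNARC-slope analyses, and the scipy/pingouin calls are just one convenient way to obtain a default JZS Bayes factor.

```python
# Illustrative fixed-n Bayes Factor Design Analysis (placeholders throughout):
# simulate standardized scores under H1 (true effect = d) or H0 (d = 0), run a
# one-sample t-test, convert t to a default JZS Bayes factor, and tally how often
# the evidence is sensitive (BF10 > 3 or BF10 < 1/3) versus inconclusive.
import numpy as np
from scipy.stats import ttest_1samp
from pingouin import bayesfactor_ttest

rng = np.random.default_rng(2023)

def bfda_rates(n, d, n_sims=5000, r=0.707):
    counts = {"BF10 > 3": 0, "BF10 < 1/3": 0, "inconclusive": 0}
    for _ in range(n_sims):
        x = rng.normal(loc=d, scale=1.0, size=n)   # one standardized score per participant
        t = ttest_1samp(x, 0.0).statistic
        bf10 = bayesfactor_ttest(t, nx=n, r=r)     # default Cauchy prior width r
        if bf10 > 3:
            counts["BF10 > 3"] += 1
        elif bf10 < 1 / 3:
            counts["BF10 < 1/3"] += 1
        else:
            counts["inconclusive"] += 1
    return {k: v / n_sims for k, v in counts.items()}

# Under H1 (d = 0.2): the BF10 > 3 rate is the 'power'; BF10 < 1/3 would be misleading evidence.
print(bfda_rates(n=270, d=0.2))
# Under H0 (d = 0): the BF10 < 1/3 rate is the sensitivity to the null; BF10 > 3 would be misleading.
print(bfda_rates(n=270, d=0.0))
```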
2) At present, your sample size is predicated on the smallest effect size of interest for H3, and the tests of H1 and H2 inherit their sensitivity from this design. Strictly speaking, this means that your experimental design has not been shown to be adequate to test H1 and H2, whereas the RR format requires that you demonstrate your required standard of evidence for all hypotheses. Given that your SESOI for H3 is so small, it would seem a small extra step to make the (easy) argument that an effect size smaller than this would be similarly uninteresting for H1 and H2, which would then allow you to assert the required level of sensitivity for all hypotheses.
3) However, for your experiment to really have the required level of sensitivity for all hypotheses, your stopping rule cannot be based on a sensitive outcome for one hypothesis only: you could stop the experiment (prior to n-max) only if a sensitive BF were found for all three hypotheses. If your plan is to terminate the experiment based on H3 only, then your plan does not have the desired level of sensitivity for H1 and H2, and you would need to relegate these hypotheses to secondary, exploratory status (i.e. remove them from the Stage 1 plan).
4) You have now added the Odd Effect as a positive control/manipulation check. However, your logical chain here simply states that it is a robust effect in parity judgements, and that you expect to find it and will be surprised if you find evidence against it (you do not state what will happen if the BF is insensitive). This does not constitute a meaningful manipulation check, because it does not seem to have any implications for your main hypothesis tests. Normally, a manipulation check is an effect that should definitely be present in the data so that, if it is not found, there is evidence that your task has not worked as intended. Normally, when a manipulation check fails, the conclusion is that the experiment is deemed incapable of returning a clear answer on the experimental hypotheses. For this reason, the manipulation check is normally first in the list of inferential tests, to establish the adequacy of the task to the question. Like other inferential tests, it requires a power/sensitivity analysis. If the Odd Effect has this status, then you need to make this clear. If it does not, then it is not a manipulation check.
5) On the other hand, your H1 is a check that the SNARC effect is present in all number ranges. This seems much more to me like a conventional (and relevant) manipulation check, and yet you simply state that you will be surprised if you don't find it in all ranges, but do not indicate that this would limit your ability to test further experimental hypotheses in any way. I would at least have thought that finding the SNARC effect in a given range was a requirement for testing the other hypotheses with respect to that range (i.e. any tests in which that range is involved for H2 or H3). If this is not the case, then it would seem that your experiment has been configured such that you could be making claims about the range dependency of the SNARC effect even if your data showed no evidence of SNARC effects per se. I realise that this outcome is unlikely, but it is the logical coherence of your analysis plan that is at stake.
6) The manuscript is long and complex. There are no word limits at PCI-RR, but with the aim of future publication, I would strongly encourage you to try to be more concise wherever possible (obviously, without omitting any essential material).
7) The new title is rather unwieldy: “One and only SNARC? How Flexible are The Flexibility of Spatial-Numerical Associations? A Registered Report on the SNARC’s Range Dependency”.
Should the first part be: "One and only one SNARC?"? The second and third parts seem to be two alternate sub-titles; there should probably be one and only one sub-title.
I hope that these comments are helpful. You have done a lot of good work to sharpen this plan up, but I think that a little further sharpening and streamlining is required before IPA.
Best wishes,
Rob McIntosh
Reviewed by anonymous reviewer 1, 27 Aug 2023
The authors have carefully revised their report. I highly appreciate their efforts and I am fully satisfied with how they addressed my concerns and incorporated my suggestions.
Evaluation round #1
DOI or URL of the report: https://osf.io/heq6p
Version of the report: v1
Author's Reply, 25 Jul 2023
Decision by Robert McIntosh, posted 04 Apr 2023, validated 04 Apr 2023
Thank you for your patience, and I apologise that it has taken so long to return reviews for your manuscript. It proved tricky to find reviewers, but once suitable and willing people were found, the reviews were thorough, and I think you will find them very helpful in guiding a revision of this Stage 1 plan. Reviewer#2 in particular is very familiar with the logic and rigours of Registered Reports, and provides excellent guidance on related considerations.
Both reviewers are generally positive about the proposed study, but have substantive concerns and suggestions for improvement. You should consider (and respond to) all of these points carefully. I would emphasise the following in particular, adding some comments of my own.
Reviewer#1 (Melinda Mende) makes a number of requests for clarity, and emphasises the need for a full justification of the approach to data trimming. The approach to the treatment of reaction times is absolutely critical (since it is the basis for the core dependent measure). It is described in detail, but it is not rationalised. The treatment of RT is a complex issue: decisions about whether (and how) to exclude outliers, transform data, and/or use robust estimators of central tendency per cell (i.e. medians) should ideally be informed by a good working knowledge of the characteristics of the data in your experiment. (In general, of course, pre-registered approaches may tend towards more conservative and robust approaches, because the final form of the data cannot be known in advance.)
On this point, although this Stage 1 RR seems to have been thought through carefully, I do not see direct evidence that the tasks have been fully piloted, with pilot data used to exercise the proposed analysis pipeline. It may be that you have done such piloting, or perhaps you have used the same data collection approach/platform in a previous study, so that your analysis pipeline is well established. If so, then you should describe this relevant history in the present RR. If not, then I honestly think it is necessary to conduct a reasonably-sized pilot in order to debug and optimise your analysis plan.
This relates also to the points made by Reviewer#2 regarding your quality checks and positive controls; some demonstration that the experimental setup is capable of testing the hypotheses of interest seems essential here.
I also agree strongly with this reviewer that the purpose and inferential role of all parts of the analysis must be clear (e.g. how will the outcome of the dropout analysis be used to inform interpretation of findings), and that the exploratory analyses should be removed from the Stage 1 plan. The ‘follow-up’ analyses should be elevated to full inferential status and specified fully or, if not essential to the main conclusions, relegated to exploratory status and omitted (the latter approach may be preferred as simpler, given that your analysis plan is already rather complex).
I also concur with the idea that combining frequentist and Bayesian approaches seems unnecessarily complex and ambiguous. If these approaches do not lead to the same outcomes then which approach will you be guided by? (And then why should you bother to include the other approach at all?) It is of course possible to include parallel frequentist and Bayesian analyses in an RR, but specifying unambiguously which theoretical conclusions will follow for the full range of possible outcomes becomes very complex.
With regard to the frequentist analysis, I have some concerns about the approach to alpha levels and (non-)adjustment for multiple comparisons. In the text you state: “For each test described below, a significance level of α = .01 will be used. The reason for using a rather conservative significance level is that we will conduct multiple tests per hypothesis… Importantly, the significance level does not need to be corrected for the total number of conducted tests in this study, because the tests belong to different test families and because different theoretical inferences can be drawn from their results (Lakens, 2016). Moreover, we will look at each result individually and not generalize from one single significant result within a test family to the presence of an effect in both experiments and in all possible number ranges, so that our interpretations will not inflate the familywise error rate.”
This sounds very thorough, but I am not sure it is sound/coherent. First, you state that you adopt a conservative significance level so that you don’t have to adjust for multiple comparisons – it would be more transparent to state what the significance criterion is, and how it has been adjusted for (how many) comparisons. Without this, it is unclear what your effective significance threshold is. In apparent contradiction to the above, you then go on to state that the threshold does not need to be adjusted because the individual tests are all testing independent hypotheses, and you will interpret each individually. This logic is repeated in the design table.
Although this approach may seem appealing, I am not sure that it is convincing in the present case. As far as I am aware, there is no theory proposing that functionally independent SNARC effects may exist for your different number ranges. In any case, you also state that “not finding it [the SNARC effect] in one of the four ranges despite our large sample would speak against the robustness of the SNARC effect”. This means that the results are not really being evaluated independently for each number range, but considered together to bear on the same theoretical question. Moreover, it is not convincing to state that the failure to find the result in one of the four ranges would speak against the robustness of the SNARC effect, because 90% power implies a 10% chance of a false negative in any one range (roughly a one in three chance of at least one false negative across the four ranges, assuming independent tests).
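For the record, the arithmetic behind that figure, assuming four independent tests each run at 90% power:

$$P(\text{at least one false negative}) = 1 - (1 - 0.10)^{4} = 1 - 0.9^{4} \approx 0.34$$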
In general, I think that your statistical approach needs better justification and specification, and that it might benefit from simplification (e.g. by deciding on either a frequentist or Bayesian approach). In passing, I note that you refer to another paper (Roth, Lukács, et al., 2022) for your power calculations (which, confusingly, seems to be an earlier version of this same RR plan). In any case, power calculations are an integral part of a Stage 1 RR plan and so should be described fully in the RR itself.
I hope that the reviewers' helpful comments will be useful to you in taking this project forward, and if you decide to revise this Stage 1 RR, then you should indicate how you have responded to each of the comments made, including the additional ones above.
Reviewed by Melinda Mende, 04 Feb 2023
The article is well-written and targets an interesting and relevant issue. The researchers aim to investigate RM-dependency and AM-dependency of the SNARC effect. I think that this work is a positive example of a well-designed study in which a lot of considerations were made, starting with the optimal number of trials per cell and power calculations. Furthermore, the planned data analysis is well described, with useful measures to improve the quality of the statistical analysis. Overall, I have just some minor suggestions to further improve this work.
p.7
“In that study, the observed result pattern looked like Scenario 5 in Figure 1”.
Figure 1 is too far away from this claim. I suggest either introducing the figure earlier or not referring to it at this stage.
p.7
The content of footnote 1 would be better placed in the main text, together with the previous explanation of how to calculate the SNARC effect.
p. 10/11
Scenarios are very hard to understand, even though they are illustrated in Figure 1. One needs to scroll up and down a lot. Maybe you could divide the figure into parts and explain each of the scenarios and then directly show the figure.
p. 12
“namely 0 to 5 and 4 to 9 in Experiment 1, and 1 to 5 (excluding 3) and 4 to 8 (excluding 6) in Experiment 2”
Which study are you referring to?
p. 16
I do understand your design approach and I think that the two experiments are well elaborated. Nonetheless, I do not understand the content of Table 2. What do you mean by “Parity +0.5”/”Parity -0.5”?
p.17
Why not visualize the time course of the stimulus presentation with a figure?
p. 18
“This figure shows the four between-subjects conditions”
Why don’t you want to use a fully within-subject design?
p. 18
“handedness, and finger-counting habits”
How will these be measured?
p. 18
“Participants may choose response keys for the experimental task which are to be located in the same row and about one hand width apart from each other on their keyboard”
Even if the keys are one hand width apart from each other, how do you make sure that participants do not use just one hand for giving their responses?
p. 19
“Only trials with RTs between 200 and 1500 ms will be included in the analysis. Further outliers will be removed in an iterative trimming procedure for each participant separately, such that only RTs that are maximum 3 SDs above or below the individual mean RT of all remaining trials will be considered. Finally, only datasets of participants with at least 75% valid remaining trials and without any empty experimental cell (number magnitude per response side) in both number ranges will be considered.”
Please specify how and why you selected these criteria. Such data trimming criteria are often similar in the literature but not entirely equal. Thus, I would like to learn about your justification for using these criteria.
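For concreteness, one possible reading of the trimming procedure quoted above is sketched below; the function name and the assumption of a plain per-participant vector of RTs in milliseconds are illustrative, not taken from the authors' materials.

```python
# One possible reading of the trimming procedure quoted above (illustrative only):
# absolute 200-1500 ms cutoffs first, then iterative removal of RTs more than
# 3 SDs from the mean of the remaining trials, recomputed until nothing changes.
import numpy as np

def trim_rts(rts, lower=200, upper=1500, sd_cutoff=3):
    rts = np.asarray(rts, dtype=float)
    kept = rts[(rts >= lower) & (rts <= upper)]        # absolute RT window
    while kept.size >= 3:
        m, sd = kept.mean(), kept.std(ddof=1)          # mean/SD of remaining trials
        inliers = kept[np.abs(kept - m) <= sd_cutoff * sd]
        if inliers.size == kept.size:                  # converged: no further outliers
            break
        kept = inliers
    return kept

# Example: raw RTs in ms; 150 and 2100 fall outside the absolute window.
print(trim_rts([150, 420, 455, 480, 510, 530, 600, 1480, 2100]))
```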
Reviewed by anonymous reviewer 1, 02 Apr 2023
The authors of the present Stage 1 Registered Report aim to investigate the flexibility of spatial-numerical associations by means of two experiments, one being a close replication and one a conceptual replication of previous studies in the field. I think the topic is highly timely, given the accumulating evidence on SNA and its implications. Overall, the authors have obviously taken great care in reviewing the existing literature and in assessing the current methodological limitations. However, the implementation of this study as a Registered Report is still suboptimal; my main concerns are outlined below.
- The existing section “How could absolute magnitude affect…” left me wondering whether this is all really necessary, or whether this part could be shortened, focusing on Table 1, which seems to be the element that readers can most easily link to the rest of the manuscript.
- Statistical power. The authors opted for d = 0.2 as the minimal effect size of interest and explained how estimating this effect from the existing literature could be biased by various factors. As a reference, it would nevertheless be useful to report the typical effect size in this literature. What was the original effect size in the studies they aim to replicate? Also, when reporting the a priori calculation, the authors refer to a specific standard deviation but do not report any value - please add.
- Participants. The authors report only a minimum age (18) as a requirement. Since the experiments measure reaction times, for which age differences might exist, wouldn’t it be more sensible to also add a maximum age? What’s the aim of giving not just full but also partial compensation?
- Procedure. The experiment will be implemented in the Wextor online platform. Since the expected effects are very small, do the authors have any information regarding the measurement accuracy of this tool (e.g., compared to lab-based experiments)?
- Quality check. The authors report a seriousness check that will be used prior to the beginning of the procedure, and a self-assessment to be filled in right afterwards (e.g., the participants will rate the condition in which the experiment took place, etc.). However, they do not report any concrete quality check to assess correct implementation of the procedure and participants’ compliance with instructions. In line with this, the authors do not appear to have implemented any positive control. These aspects need to be carefully addressed for any registered report, especially in online procedures.
- Demographic questions. What’s the rationale behind allowing the “I prefer not to answer” option? It seems rather essential to collect complete answers from all participants. Also, why explicitly use the term “finger counting habit” in these questions? This might seem rather obscure to the participants.
- Response keys. The phrasing here is rather obscure to me. Why allow the participants to use any other key than the two that were assigned by default? Especially if no check is put in place, e.g. how will they check that the distance between the two keys is optimal?
- Dropout rates. The authors plan to further investigate the reasons for dropouts, but it’s unclear how the results of this analysis will affect subsequent analyses (e.g., in case they show significantly different dropout rates in some conditions?)
- Statistical approach. The authors aim to combine null-hypothesis testing with the estimation of Bayes factors. I assume the former was chosen because of earlier studies; however, since the main analyses will employ t-tests, why not opt directly for a full Bayesian approach? Combining the two approaches always appears rather complex to manage in a registered report - especially when outlining the interpretations based on different outcomes. A plus of opting for a Bayesian approach is that it would allow the authors to use sequential analyses for a more efficient recruitment and sampling plan (e.g., Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, 25, 128–142); a minimal sketch of such a sequential design is given after this list.
- Analyses. The authors report three types of analyses: main, follow-up, and exploratory. However, by definition, a Stage 1 submission that will later become a pre-registration cannot include exploratory analyses. While they could be generically referred to in the analysis plan, they cannot be outlined in detail and included in the design planner - otherwise they’d be pre-registered as well. I’m more uncertain regarding the follow-up analyses, which have a more nuanced status - I invite the authors to reconsider whether these analyses should be pre-registered or not.
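For illustration, a sequential design of the kind suggested under "Statistical approach" above might be sketched as follows. This is not the authors' plan: the effect size, minimum and maximum sample sizes, batch size, prior width, and the simple one-sample test on standardized scores are all placeholder assumptions.

```python
# Illustrative sequential Bayes factor design (in the spirit of Schönbrodt & Wagenmakers, 2018);
# all values (d, n_min, n_max, batch size, prior width r) are placeholders, and the simple
# one-sample test on standardized scores stands in for the authors' actual analyses.
import numpy as np
from scipy.stats import ttest_1samp
from pingouin import bayesfactor_ttest

rng = np.random.default_rng(2023)

def sequential_bf_run(d, n_min=50, n_max=500, step=25, r=0.707):
    x = list(rng.normal(loc=d, scale=1.0, size=n_min))
    while True:
        t = ttest_1samp(x, 0.0).statistic
        bf10 = bayesfactor_ttest(t, nx=len(x), r=r)
        # Stop once the evidence is sensitive in either direction, or n_max is reached.
        if bf10 > 3 or bf10 < 1 / 3 or len(x) >= n_max:
            return len(x), bf10
        x.extend(rng.normal(loc=d, scale=1.0, size=step))

# Example: stopping sample sizes and outcomes over 200 simulated studies under H1 (d = 0.2).
runs = [sequential_bf_run(d=0.2) for _ in range(200)]
print("median stopping n:", int(np.median([n for n, _ in runs])))
print("proportion ending with BF10 > 3:", float(np.mean([bf > 3 for _, bf in runs])))
```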