Review for To see, not to see or when to see: An #EEGManyLabs replication study with a twist
This is one of the best pieces of work I have been asked to review! It aims to replicate an important study with foundational implications for influential theories of oscillatory brain function. There have been previous replication attempts in this area, including a Registered Report, but I am not aware of a multi-lab attempt with potentially such high power (but see point 4 below). The outcome, I expect, will be of great interest to the field.
Below I list some suggestions for adjustments in order of appearance in the manuscript, none of which I see as fundamentally critical.
1. Title: would it be possible for the title to be a bit more specific and informative? Something along the lines of ‘Testing the replicability of alpha phase determining visual perception’. I realise that’s not as catchy, but I found the current title a little vague.
2. The third sentence assumes subjective experience is a continuous flow, but phenomenology suggests this is not straightforwardly the case (see e.g., Husserl, The Phenomenology of Internal Time-Consciousness; Busch and VanRullen, 2014, Is visual perception like a continuous flow or a series of snapshots?, in Subjective Time, MIT Press; or Dainton, 2010/2023, Stanford Encyclopaedia of Philosophy, Temporal Consciousness). This could be simply resolved by dropping the first part of that sentence.
3. In the hypothesis at the end of the introduction, I thought the statement under a negative finding that “there is no evidence for visual perception to operate in cycles” was a little strong, and perhaps should be rephrased to refer to this study's evidence. This also relates to point 9 below. Relatedly, I thought it would be potentially more informative if the decision criteria at the end of the introduction (which seem primary) were based on the outcome of the Bayesian analysis rather than the frequentist statistics.
4. I found the first and second sentences of the participants section a little ambiguous. It could be read that the 7 labs will collectively provide a 35-participant data set, i.e., 5 each, OR, each lab would aim to contribute a 35-participant data set. My confusion was not helped by other sections seeming to imply both that the full data set has n=35 (e.g., “final sample size (N=35)”) and that each lab is to produce a complete data set (e.g., “compute effect sizes (Cohen’s d) for each individual lab”). If it is the former, and given the reasons well specified in the introduction for the effect sizes reported by Mathewson et al. being larger than we might expect in replication, then I do not think the study as it stands should be described as “high-powered”. If there is a real effect but it is smaller, as the authors suggest is likely, the study would be underpowered. Would it be possible to recalculate power estimates based on smaller effect sizes? Perhaps, based on the estimates of VanRullen et al., 2016 described in the introduction? I appreciate the intention behind the use of the Mathewson et al. effect sizes, but if they are used, I suggest explicitly describing the limitations in interpreting a negative outcome (also see point 9).
If, however, each lab is to contribute a 35-participant data set, then great, and I think this should be made clearer and the total minimum n should be stated. It might also help if the smallest effect size reliably detectable was stated; this would be informative either way. I also wondered whether a simpler concatenation of the data across labs might complement the meta-analytic approach while being slightly more powerful. I also thought it might be possible to combine evidence across labs more efficiently by taking advantage of Bayes Factors being transitive (Morey and Rouder, Psychological Methods, 2011).
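To illustrate the power concern, a rough calculation using the noncentral t distribution (my own sketch; the effect size of d = 0.3 and the 90% power target are purely illustrative, not values from the manuscript) shows how quickly the required sample size grows once the expected effect shrinks below the Mathewson et al. estimates:

```python
from scipy import stats

def power_one_sample(d, n, alpha=0.05):
    """Power of a two-sided one-sample t-test for standardised effect size d."""
    df = n - 1
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    nc = d * n ** 0.5  # noncentrality parameter
    return (1 - stats.nct.cdf(tcrit, df, nc)) + stats.nct.cdf(-tcrit, df, nc)

# Power available at n = 35 for an illustrative small-to-medium effect
print(power_one_sample(0.3, 35))

# Smallest n reaching 90% power for the same effect
n = 2
while power_one_sample(0.3, n) < 0.9:
    n += 1
print(n)  # well above 35
```

On this sketch, an n of 35 would be clearly underpowered for a d of 0.3, which is why I think either the per-lab interpretation of the sample size or a recalculated power analysis is needed.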
5. I didn’t see the need to restrict the age of participants to a 12-year range.
6. Stimuli and procedure. I thought there should be a sentence qualifying the timings as approximate, or specifying two sets of timings, as differences in monitor refresh rates across the labs mean these exact timings cannot be followed by all labs.
7. The sentence “The experimental session counts 16 blocks of 72 trials each”, should that be “The experimental session consists of 16 blocks of 72 trials each”?
8. Should an impedance target be prescribed? I understand this can be equipment-dependent, but then maybe it could be lab-specific.
9. I would have liked to see more weight given in the main text to adjudicating between support for the null vs. lack of evidence in case of a negative finding. I think differentiating between these two potential outcomes could be important information for the field. I understand that the Bayesian meta-analytic approach offers the potential to do this, so I was wondering whether the Bayesian analyses could receive a greater emphasis in terms of decision criteria, for example at the end of the introduction. This might mean using Bayesian equivalents for the primary t-tests (which can be derived from T statistics) and recalculating the sample size estimation based on Bayesian simulations, but in my experience, the outcome invariably aligns between Bayesian and frequentist estimations. Alternatively, frequentist statistics can assess equivalence (Lakens et al., 2018, AMPPS). This relates to point 4: to fairly assess equivalence, a smaller expected effect size may be required.
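For concreteness, the Lakens-style equivalence test amounts to two one-sided t-tests against pre-specified bounds; a minimal one-sample sketch (the bounds of ±0.5 and the simulated data are hypothetical, purely for illustration):

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """Two one-sided tests (TOST): is the mean of x credibly
    inside the equivalence interval [low, high]?"""
    n = len(x)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    p_low = 1 - stats.t.cdf((m - low) / se, n - 1)   # H0: mean <= low
    p_high = stats.t.cdf((m - high) / se, n - 1)     # H0: mean >= high
    return max(p_low, p_high)  # equivalence concluded if this is < alpha

# Hypothetical data tightly clustered around zero
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.1, 50)
print(tost_one_sample(x, -0.5, 0.5))  # small p: mean equivalent to zero
```

A significant TOST result would licence the claim that any effect is smaller than the bounds, rather than merely that no effect was detected, which is exactly the negative-outcome distinction I would like the manuscript to make.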
10. Related to the above point and based on the expectation of small effects described in the introduction, I thought the prior applied in the Bayesian analyses should probably reflect this aspect of the hypothesis and be smaller than the default 0.707 scaling factor (see Dienes, 2011, Perspectives on Psychological Science).
11. Table 1 describes decision criteria based on p<0.05, whereas the text describes p<0.02 criteria (also true in the final table). Can this be checked or reasons for the discrepancy given? Also, I did not find the description of Hp b1 in this table very easy to understand.
12. Table 2 and the final table seem to repeat much of the same information; I realise the final one conforms to guidelines, but I still wondered whether they could be integrated. I also thought the final table would benefit from basing criteria on Bayesian tests to more straightforwardly differentiate between negative and inconclusive outcomes, and this might help align the tables.