This is a well-thought-out study that will contribute to the literature. It is closely based on a previous study, so the general idea is not novel, but it will provide a better-controlled test of the previously reported effect.
The authors do a good analysis of previous studies, and present a compelling case for revisiting the findings of Chang et al (2016). They discuss theoretical and empirical reasons to doubt that masked information could allow future disambiguation of two-tone images, and identify limitations of the methods of Chang et al (2016). The background and motivation for the study are clearly presented, with good arguments.
The authors have also given careful thought to the methodology. The primary challenge is ensuring that "unconscious" stimuli are truly unconscious, and I think the authors do a good job of meeting it. Trials will be classified based on multiple measures in a graded manner. The criterion for "fully unconscious" is more conservative than in the previous study, so we can be more confident that any post-exposure effects will not be due to some conscious awareness. I think this is the main strength of the new study. The authors have also given good consideration to details like attention checks, exclusion criteria, and statistical power. The fact that it is a pre-registered RR is a positive feature in itself.
I question whether Experiment 2 is needed. Experiment 1 already implements blind ratings and specifies a coding plan, and any errors or biases in rating responses would just add noise or shift the baselines. As the authors point out, adopting the forced-choice method also has drawbacks, which might end up increasing the variability. More data are always welcome, so if the authors want to repeat the study with this variation in method, that is fine, but it seems like a lot of extra data collection to address an issue that is unlikely to affect the results and might introduce new problems.
If Experiment 2 is going to be included, the authors should say more about what they would conclude if the results from the two experiments are not entirely consistent. What if Experiment 1 finds strong evidence for unconscious priming but Experiment 2 finds only a weak trend? Would they conclude that there was experimenter bias in Experiment 1, and that the effect may not be reliable? Or conclude that the data in Experiment 2 was noisier due to methodological issues, so it should be discounted?
I think the third experiment makes more sense as a follow-up, because it addresses an alternative explanation that is both more likely and more problematic, and that might be ruled out by the Experiment 1 results. If the main trials and catch trials do not show a difference, the follow-up experiment will be important; if there is a clear difference between main trials and catch trials, it is not needed. This is more important than the issue of subjective ratings. In fact, I suggest reversing the order: if the evidence suggests that the effects in Experiment 1 are due to spontaneous disambiguation, it would be better to know this before conducting the proposed Experiment 2, which would otherwise share the same confound.
For the third experiment (the results-contingent follow-up), the authors should say something about the conclusions that would be drawn from the different possible outcomes. What if Experiment 1 appears to show disambiguation from unconscious stimuli, but the follow-up study does not?
To evaluate the planned analysis and presentation of the results, I would like to see some sort of draft of the results section. The authors could use simulated data or placeholders for statistical results. The authors describe the planned analyses in the study design table, but there are a lot of hypotheses and analyses, and it is a bit hard to follow. Presenting the planned analyses in the format of a results section will make it easier to check that the analyses make sense and nothing is missing, and also provides an opportunity for reviewers to give feedback about the presentation.
I am not a fan of the "study design table" required by PCI-RR. Answering all the questions in a single row for each hypothesis requires a table that spans multiple pages, with narrow text blocks. The sampling plan is generally the same for all hypotheses, so that column has redundant information. The space limitation encourages enumeration of hypotheses, so a reader has to keep track of many non-descriptive labels (H1a, H1b, …). Given the limitations of the format, I think the authors did a reasonable job conveying the information. I hope that PCI-RR changes this requirement, or allows some flexibility in how the information is organized. In the meantime, it would be helpful to see the analysis plan presented as a results section.
Using the Bayesian sequential sampling procedure is a good idea, and the proposed stopping criteria should provide good power for a range of possible effects. I have some suggestions.
For the computation of Bayes factors, the authors propose using a Cauchy prior with scale parameter r = 1/sqrt(2). Schönbrodt & Wagenmakers (2018), following Rouder et al (2009), recommend a scale parameter of r = 1. They note that smaller scale parameters take longer to reach the H0 criterion when the null is true. Their simulations of a BF > 6 stopping criterion also found that the Type I error rate is slightly inflated with r = 1/sqrt(2), but not with r = 1. I suggest that the authors follow Schönbrodt & Wagenmakers (2018) and use r = 1.
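To illustrate the point, here is a minimal sketch (my own, not taken from the manuscript) of how the choice of Cauchy scale affects the Bayes factor for the same data. It assumes the Python pingouin package; the t-value and sample size are arbitrary placeholders, not the study's data.

```python
# Sketch: effect of the Cauchy scale r on the JZS t-test Bayes factor.
# Illustrative numbers only.
import numpy as np
import pingouin as pg

n = 60          # hypothetical sample size
t_null = 0.3    # a t-value close to zero, as expected under H0

for r in (1 / np.sqrt(2), 1.0):
    bf10 = float(pg.bayesfactor_ttest(t_null, nx=n, paired=True, r=r))
    print(f"r = {r:.3f}: BF10 = {bf10:.3f}, BF01 = {1 / bf10:.2f}")

# With a near-zero t, the wider prior (r = 1) gives a smaller BF10, i.e.
# stronger evidence for H0, so the BF01 > 6 boundary is reached with fewer
# participants -- the point made by Schönbrodt & Wagenmakers (2018).
```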
I also think the authors should justify the choice of boundary criterion in terms of the expected effect size, and describe the power for one or more plausible effect sizes. The methods section includes a statement about the boundary criterion: "A BF of 6 (or 1/6), taken to indicate moderate evidence (Lee & Wagenmakers, 2014, as cited in Quintana & Williams, 2018), was chosen as an estimated equivalent for a medium effect size." That helps connect the BF criterion to effect size, but it does not say why a medium effect size is targeted. Later, the authors report estimated effect sizes from the previous study, but these are not connected to the choice of stopping criterion.
The boundary criterion and the prior determine the range of effect sizes that can be reliably detected, so a given BF criterion implies a target effect size. For example, the simulations of Schönbrodt & Wagenmakers (2018) found that a criterion of BF > 6 with r = 1 gives 86% power for d = 0.4 in a between-subjects design, so this criterion corresponds to targeting an effect size of d ≥ 0.4. In the present study, using BF > 6 will allow detection of smaller effects because the design is within-subjects. Reporting the minimum effect size that could be reliably detected would make it easy for readers to see that the study is well powered (even if they are not familiar with BFs); a simulation along the lines sketched below would provide this.
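A rough Monte Carlo sketch of the kind of simulation I have in mind is below. This is my own illustration, not the authors' planned analysis, and the design parameters (paired design with normally distributed differences, minimum N of 30, checking the BF every 10 participants, pingouin and scipy for the computations) are assumptions for the example.

```python
# Sketch: power of a sequential design with symmetric BF boundaries (6, 1/6)
# for a within-subjects (paired) effect of size d. Illustrative parameters.
import numpy as np
from scipy import stats
import pingouin as pg

rng = np.random.default_rng(1)

def sequential_bf_run(d, r, n_min=30, n_max=200, step=10, bound=6):
    """Simulate one sequential run; return 'H1', 'H0', or 'inconclusive'."""
    diffs = rng.normal(d, 1, size=n_max)      # paired differences, sd = 1
    for n in range(n_min, n_max + 1, step):
        t = stats.ttest_1samp(diffs[:n], 0).statistic
        bf10 = float(pg.bayesfactor_ttest(t, nx=n, paired=True, r=r))
        if bf10 >= bound:
            return "H1"
        if bf10 <= 1 / bound:
            return "H0"
    return "inconclusive"

runs = [sequential_bf_run(d=0.4, r=1.0) for _ in range(1000)]
print("P(stop at H1 | d = 0.4):", runs.count("H1") / len(runs))
```

Running this for a few values of d would give the smallest effect that the BF > 6 boundary detects with, say, 90% probability, which is the number I think readers would find most useful.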
The lower bound on sample size, N=60, seems higher than necessary. Sequential procedures are more efficient because they can stop early when the data provide clear evidence one way or the other, and this efficiency is lost if the lower bound is higher than needed. An effect size of d = 0.5 only needs N=44 for 90% power. In the case of no effect, N=30 would be enough for reasonably tight confidence intervals around zero in the not-recognized condition (SE = 5.3%/sqrt(30) ≈ 0.97%). I suggest that the authors use a smaller lower bound, N=30-40, so they can take advantage of the efficiency of sequential testing. The sample size will still go past N=60 if the data are ambiguous, but not if the true effect turns out to be large or zero. If the authors want to ensure power for smaller effects, the BF stopping criterion could be slightly increased, which would be more efficient than using a large minimum sample size.
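For what it is worth, the two numbers above can be checked quickly (again a sketch of my own; the 5.3% figure is my reading of the variability reported for the previous study, and statsmodels is assumed):

```python
# Quick check of the SE at N = 30 and the classical power calculation.
import numpy as np
from statsmodels.stats.power import TTestPower

# Standard error of the not-recognized-condition mean at N = 30
print(5.3 / np.sqrt(30))                                          # ~0.97%

# Minimum N for 90% power at d = 0.5 (paired/one-sample t-test, alpha = .05)
print(TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.9))  # ~44
```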
I am not sure that the abbreviated labels "C1", "C2", etc. are needed. Descriptive labels ("Fully Unconscious", "Mostly Unconscious", etc.) could be used without adding too much clutter to the text. Alternatively, "U" and "C" could be used in the abbreviations to make it easy to remember which categories are unconscious vs conscious: "U", "MU", "MC", "C", or "U1", "U2", "C2", "C1".
This topic sentence in the introduction is awkward: "Another relevant literature is the one referring to longer-term learning effects, and pertains to the increase in accuracy following repeated exposure to some stimuli over time." I suggest rewording it more simply, perhaps breaking the second part into a separate sentence.
Another line that could be simplified: "In a conceptually similar context to that adopted by Chang and colleagues (2016), we aim to study whether the visual system can organise two-tone images into meaningful percepts after masked greyscale image exposure." Maybe something like: "Using a method similar to that of Chang and colleagues (2016), we test whether the visual system can organise two-tone images into meaningful percepts after masked greyscale image exposure."
The use of catch trials is listed as a difference in method, but Chang et al (2016) also had catch trials. Are the catch trials in the proposed study different in some way?