Michotte’s studies are an excellent target for replication, and I read the manuscript with interest. Although I think this will be a very important piece of work given the influence of Michotte’s research, I found keeping track of 14 studies (some with multiple manipulations) quite burdensome. Once the results are in, the Stage 2 manuscript will be considerably longer, and attempting to replicate this many effects may be too much for one paper unless the Introduction and Methods can be made considerably more concise. Perhaps some of Michotte’s methodological details could be relegated to supplementary materials, with key information placed in a table or figure?
The methodology is described in considerable detail; however, the analysis plan is not specified in enough detail to fully restrict researcher degrees of freedom. For example:
i. What data quality checks will there be, if any? For example, if a participant gives a high rating to both “The initially moving rectangle made the other rectangle move by bumping into it” and “The initially moving rectangle passed across the other rectangle, which moved little or not at all”, what will happen to that trial and/or that participant’s data? (One possible exclusion rule is sketched after this list.)
ii. Apologies if I have missed this, but I couldn’t figure out what exactly the DV will be. Will it be the mean rating? Table 1 suggests that there will be one ANOVA per study; however, when there are ratings on multiple statements, surely there will be multiple DVs and therefore multiple ANOVAs? What correction for multiple tests will be applied? Alternatively, if the ratings are to be combined into one DV, how will this be done?
iii. Following from the above, might it be more straightforward, in terms of analysis, to have participants give a categorical report of their impression (“What was your impression: A, B, or C?”) and follow that up with a continuous intensity rating (“Please rate the intensity of your impression from 1 to 10”)?
iv. Will any sphericity corrections be used?
v. Given the number of tests that will be conducted, and the consequent inflation of the Type 1 error rate, how will unexpected interactions (e.g. between speed and width in Experiment 1) be interpreted?
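Returning to point i: below is a minimal sketch in R, with entirely hypothetical column names and an arbitrary cut-off, of the kind of exclusion rule that could be pre-registered. It is offered only as an illustration of the level of specificity I have in mind, not as a recommendation of this particular rule.

    # Flag trials where two mutually exclusive statements both get high ratings
    # (the statement labels and the cut-off of 8 are illustrative only).
    library(dplyr)
    flagged <- ratings %>%
      group_by(participant_id, trial) %>%
      summarise(contradictory = rating[statement == "launching"] >= 8 &
                                rating[statement == "passing"]   >= 8,
                .groups = "drop")
    # Exclude participants with a high proportion of contradictory trials
    exclusions <- flagged %>%
      group_by(participant_id) %>%
      summarise(prop = mean(contradictory)) %>%
      filter(prop > 0.2)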
I also have some concerns about the proposed analyses as described in Table 2, though it is possible I have misunderstood them. If so, I apologise; in that case, could the analyses to be conducted please be clarified in the text?
Experiment 1: the table suggests that ratings on different statements will be directly compared – “Transition from high passing ratings at low width to high launching ratings at high width would be successful replication”. Could you explain how this “transition” will be tested for?
Experiment 2: “Significantly higher launching ratings for standard than for camouflage stimuli would be successful replication. All other results would be failure to replicate. Reported effect of fixation were not interpreted by Michotte; interpretation here will depend on results”. Could these possible interpretations please be spelled out?
Also, the text here and in Table 1 suggests that the Experiment 2 analyses involve running a one-way Fixation (yes, no) ANOVA and a separate one-way Stimulus (camouflage, standard) ANOVA for each of the 5 camouflage stimuli, i.e. 10 ANOVAs (potentially doubled if each statement is analysed separately). Is that correct? If so, why ANOVAs rather than t-tests? To avoid running 10-20 ANOVAs, a linear mixed model that permits different means for the 5 different stimuli would probably be more appropriate, something like Rating ~ Fixation + Stimulus + (1 | participant ID) + (1 | stimulus ID).
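For concreteness, a minimal sketch of that model in R/lme4 (column names are hypothetical; stimulus_type codes camouflage vs standard, and stimulus_id indexes the 5 camouflage stimuli):

    library(lme4)
    # Random intercepts for participants and stimuli absorb their baseline
    # differences, so Fixation and Stimulus can be tested in one model.
    m <- lmer(rating ~ fixation + stimulus_type +
                (1 | participant_id) + (1 | stimulus_id),
              data = exp2_data)
    summary(m)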
Experiment 3: “Significant effect of size of either object on launching ratings would discnfirm Michotte's claim. Non-Significant effect would be consistent with it”. If the null is being predicted, then Bayes factors would be more appropriate, so that you can make inferences about H0.
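For example, using the BayesFactor package in R (vector names hypothetical), BF01 can be reported directly:

    library(BayesFactor)
    # Paired comparison of launching ratings across the two object sizes
    bf10 <- ttestBF(x = ratings_small, y = ratings_large, paired = TRUE)
    1 / extractBF(bf10)$bf   # BF01: evidence in favour of the null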
Experiments 4, 5, 6, 13, 14: I’m not sure what the advantage of using Tukey post hocs is here. As I understand it, the appropriate test is a single contrast testing for a linear trend.
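A sketch of how that could look in R (names hypothetical; speed stands in for whichever factor is manipulated parametrically):

    library(lme4)
    library(emmeans)
    d$speed <- ordered(d$speed)                      # levels in increasing order
    m <- lmer(rating ~ speed + (1 | participant_id), data = d)
    contrast(emmeans(m, "speed"), "poly")            # linear, quadratic, ... components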
Experiments 7, 8, 9, 10: An ANOVA won’t be able to test this hypothesis. Multiple one-sample t-tests against 5, one for each condition, may be more appropriate.
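For instance (hypothetical names; Holm shown as one option for correcting across the conditions):

    # One-sample t-tests against the reference rating of 5, one per condition
    pvals <- sapply(split(d$rating, d$condition),
                    function(x) t.test(x, mu = 5)$p.value)
    p.adjust(pvals, method = "holm")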
Experiments 11, 12: The ANOVA seems to test for effects of motion and speed (i.e. those are the factors), but the hypotheses pertain to differences between the scale ratings (launching vs pulling, etc.).
i. Will each experiment recruit a new (and non-overlapping) sample, or will some participants take part in multiple studies? I think it’s important that each participant takes part in only one.
ii. P4-5, L95-102: “The perceptual nature of the launching effect is shown by evidence that it can influence other contemporaneous perceptual processing. […] Detection occurred sooner for launching stimuli than for non-causal controls, supporting the hypothesis that causality is constructed at an early stage of perceptual interpretation.”
I don’t think this is correct – a difference in breakthrough time does not mean the effect is perceptual. Participants must decide when to report a percept as present versus absent, and in that sense variation in breakthrough times may well reflect differences in decision thresholds.
iii. P11, L275-277: “It is, however, important to the replication study that participants should, as far as possible, report perceptual impressions and not products of post-perceptual processing”
I think the point being made here is clearer later in the manuscript, but on first reading it sounded as though participants were being asked the impossible – to report on some pre-decisional state of perceptual processing. Could this be reworded to make clear that participants are being asked to report what they saw, not what they think after conscious deliberation?
iv. Power analysis: In my view it’s important to conduct a power analysis for each study. Most or all of the hypotheses can be tested with a t-test, so the calculations should be far more straightforward than powering for the multifactorial ANOVAs (see the sketch after this list).
v. Table 2 (Rationale column): why does the smallest effect of interest change from study to study? Is this the effect size that would be powered for given n = 50?
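On point iv: if the key tests reduce to paired t-tests, base R suffices; for example, assuming (purely for illustration) a smallest effect of interest of dz = 0.4:

    # With type = "paired", delta/sd is Cohen's dz, so setting sd = 1 lets
    # delta be read directly as dz; the output gives the required n per study.
    power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.90,
                 type = "paired")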