DOI or URL of the report: https://osf.io/bfztd
Version of the report: https://osf.io/bfztd
Dear Maciej Behnke and co-authors,
Thank you for all the careful revisions and responses. I have now received all reviews, and the reviewers collectively agree that the work is almost ready for in-principle acceptance. There are a few minor reviewer comments that I encourage you to consider in your final revision. Note that this round we had one more expert who was unable to join the review in the first round – this ensures that, at Stage 2, we will have experts who are familiar with the Stage 1 plan even if someone is unavailable next year. I leave a few brief notes of my own.
1. As a follow-up to my earlier comment #3 where I referred to the gaming disorder scale as an exclusion criterion, I think it’s worth giving it a bit more thought. Since participants are paid for playing games and you might learn that some meet gaming-related diagnostic criteria, it could strengthen the study to have a more explicit plan regarding participants in this hypothetical risk group.
2. I suggest a minor revision for the justification of exclusions (p. 13). The age limit is clear, but the other exclusions don’t seem to follow logically: “We will recruit Polish-speaking players as the study will be run in Poland. We will recruit only male players due to their predominance (76%) among first-person shooter gamers.” I believe in both cases the justification is feasibility, along the following lines: “Because including non-Polish and non-male participants would entail producing and testing different sets of group-specific research materials, the study will include only Polish male players” (just an example, feel free to rephrase as you see best or rebut).
3. This comment doesn’t need a response, but I want to leave it at Stage 1 in case it is discussed at Stage 2. Note that in the design table column “theory that could be shown wrong” you only name the synergistic mindset model. I agree it’s good to be very careful and selective about theoretical inference, but at the same time I wonder whether the results might be theoretically informative beyond this single model – after all, the synergistic model stands on other established theory. Preregistered meta-theoretical inference for the upcoming discussion section could be informative for the research program’s development at a larger scientific scale.
I am aware that your time is scarce, so I promise to deliver the final decision letter within 48 hours of receiving the next version. This should give you one week for revisions while still allowing you to receive the decision this month.
Sincerely,
Veli-Matti Karhulahti
Thanks to the authors for considering my suggestions. As I already expressed in my review of the first submitted version, I think that the proposed study will be informative and may serve as one of the good-practice examples in the field. Therefore, neither then nor now do I see any “disqualifying factors”, to use the authors’ words. Being pragmatic about research has its merits. Every study has strengths and weaknesses, and that is completely fine as long as the writing is transparent about them and the weaknesses of the design do not disproportionately warp the reflection of the underlying studied phenomena. My two main worries about the reliability of the measurement and the unbalanced demand characteristics of the experimental conditions remain, but I also see the other side of the coin, namely that the authors chose to optimize for a greater cumulative potential of the proposed research with respect to the existing evidence in the field. Either of these two potentially biasing factors (measurement error does not have to be just random noise), or their interaction, can lead to false positive results, so laying out these possible weak links in the limitations section seems important to me.
Below, I offer a few follow-up comments/responses on the authors’ edits and replies. Regarding the issues that I do not return to, I was either satisfied or okay with the proposed revision/rebuttal. Overall, I think that there are no outstanding issues that should prevent the authors from running the study as proposed, and I am happy to hand over the final call on any further revisions to the discretion of the editor.
1. “As requested, we have included a critical interpretation of the existing literature in the Introduction. However, we respectfully disagree that psychophysiological challenge/threat or affect regulation research provides “weakly informative designs.” But, we acknowledge that you might have a different opinion on this topic.”
As I said, I have no expertise in the literature on reappraisal. Why did I make such a bold claim? Some time ago, I was doing an internal review of a protocol of one large multi-site study examining the effect of a similar reappraisal intervention. The lead authors explicitly chose this type of affect regulation intervention because it proved to have the strongest effect in the meta-analysis by Webb et al. (2012). In my view, this is always a poor strategy, for various reasons. This is also the meta-analysis used in the present RR to obtain an expected effect size. For the sake of the review and protocol revision in that past study, I looked at the included studies in detail and carried out a re-analysis using arguably more state-of-the-art methods. I also looked at studies from another systematic review of reappraisal interventions by Cohen & Ochsner (2018). What I found and documented was an array of studies having various methodological issues (a sizeable proportion of experiments lacking a control group), mostly feeble manipulations on tiny samples yielding huge effects, indications of selective reporting (p-values < .01 completely lacking, e.g., six out of seven available focal tests of the claims for mixed reappraisal had a p-value ~ .04), or study-level data patterns inconsistent with expectations under either H0 or H1. Since you are dealing with the reappraisal literature, that is why I tried to voice my concerns about the robustness of the given literature and the need for a more critical appraisal of the evidence reported in it. Of course, I looked only at a slice of that literature, but that slice did not inspire much confidence, to say the least.
If interested, I’m putting this part of my review dealing with the re-analysis of Webb et al. here: https://docs.google.com/document/d/1Q_8134QurWIKdUEzmJRjs7SZN9aJpUiPnZxbTe_3KiA/edit?usp=sharing
Analytic output is available here: https://rpubs.com/ivanropovik/592468
Code with data are available here: https://github.com/iropovik/PSAcovid002review
No need to react to that on your part, I just wanted to back up the bold claim from my review of your RR.
2. Response letter: “The PCI RR community might be up to date with the novel approaches for sample estimation” … expected effect sizes based on the scientific literature are often also effects of interest to scholars in these fields (Lakens, 2022).
Offering an overview of effect sizes for the given effect from past literature is fine. Trying to provide a sample size justification (in a field where it is still rare) is great. So I’m not insisting that it be removed. But I cannot help seeing a “power estimation” or “sample estimation” based on previous findings or expectations in general as just bad practice that exacerbates the issue of studies having low power to detect effect sizes that would be considered theoretically relevant. Power is a pre-data concept, just like alpha. Specifically, it is the sensitivity of a given statistical test to reliably detect a range of *hypothetical* population effects of interest. Btw, that is the definition also put forward by Morey & Lakens (2016). This paper lists 4 misconceptions of power, and the treatment of power in the present RR is in fact an example of misconceptions #3 (Sample size choice should be based on previous results) and #4 (Sample size choice should be based on what the effect is believed to be). Implicitly also misconception #1 (Experiments have actual/observed/post hoc power). From the paper: “the power of a test depends not on what the effect size is, but rather on all the hypothetical values it could be. … Attempts to “estimate” the power of an experiment, or a field, are based on a misunderstanding of power and are uninformative and confusing (Goodman & Berlin, 1994; Hoenig & Heisey, 2001; O’Keefe, 2007).” (p. 17). There is also still conflation (in the manuscript and response letter) of power analysis with sample size determination (the issue of powering for primary vs secondary hypotheses).
Anyway, if the sample size determination was mainly based on resource constraints (beta = .22 would seem a strange target to many readers), I think it is completely fine (and the most honest way) to say so and just let the reader see the sensitivity of the design across a range of effect sizes (or alternatively, across Ns as you do). IMHO, that would be better than reinforcing misconceptions about power and walking the reader through a determination of the SESOI and a wide array of past ESs, only to arrive at the inability to formally reconcile the two approaches and the setting of an arbitrary target of r = .22. That said, all this is inconsequential w.r.t. the informativeness of the present design, so let’s leave it at that.
3. Response letter: “We added the attention check in the questionnaire sets to screen for careless responding. Now item #59 states: ‘Please select "Strongly disagree" for this item to show that you are paying attention.’”
Just reiterating from my review: exclusion of careless responders should only be applied using pre-treatment measures, as carelessness itself may have been affected by the treatment, and exclusions based on it would induce bias into your model. You may also consider the methods for detecting careless responding patterns in the “careless” R package. One attention check is better than nothing, but there are more advanced methods.
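For illustration, a minimal sketch of what such pre-treatment screening could look like with the “careless” package; the data object, item names, and cutoffs below are hypothetical and would need to be adapted to the actual questionnaires:

```r
library(careless)

# Pre-treatment items only (hypothetical data frame and item-name pattern)
pre_items <- baseline_data[, grep("^item_", names(baseline_data))]

ls_out  <- longstring(pre_items)   # longest run of identical responses per person
irv_out <- irv(pre_items)          # intra-individual response variability (SD of responses)

# Flag respondents with extreme values on either index (cutoffs here are arbitrary)
flagged <- ls_out > 10 | irv_out < 0.5
table(flagged)
```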
4. Response letter: “Our software (Microsoft Forms) does not allow us to randomize the order of the items within the scales. … As some studies suggest, the order in which items are presented or listed is not associated with any significant negative consequences (Schell et al., 2013) and does not cause differences in average scores (Weinberg et al., 2018).”
There may or may not be order effects, as it is a highly idiosyncratic effect tied to the actual content of what is being measured. Within-block randomization is the way to play it safe. You can always use, say, three different forms. For instance, just three forms have been found to perform relatively well compared to full randomization in planned missingness designs, if the latter is not possible (e.g., paper-pencil data collection). The downside is a somewhat more complicated administration and joining of the data. I leave it to the authors’ discretion whether they think it is worth it.
5. “If we find biologically impossible values, we will delete them. We will report the number of outliers for a given variable.”
Not necessary. I’d rather be very conservative with the exclusion of outliers (or specific values), just as you propose, and only remove very improbable or impossible values. What is more important than, e.g., counting the number of outliers is to run the entire analysis twice, once with the MAD > 3 outliers in and once with them out, and thus check whether the decision to remove outliers has any material impact on any of the main substantive findings. Reporting even that in a short paragraph in the results would be nice, IMO.
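A minimal sketch of this in/out comparison (hypothetical variable names, and a simple lm stand-in for the actual model):

```r
# Flag values more than 3 MADs from the median (hypothetical variable)
z_mad <- abs(dat$hr_reactivity - median(dat$hr_reactivity, na.rm = TRUE)) /
         mad(dat$hr_reactivity, na.rm = TRUE)
mad_flag <- !is.na(z_mad) & z_mad > 3

fit_all  <- lm(performance ~ condition, data = dat)               # outliers in
fit_trim <- lm(performance ~ condition, data = dat[!mad_flag, ])  # outliers out

# Compare the focal estimate across the two runs
rbind(all = coef(summary(fit_all))[2, ], trimmed = coef(summary(fit_trim))[2, ])
```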
6. “The support for the hypotheses will be provided if the models fit the data well, i.e., RMSEA < .06; SRMR < .0 08, CFI > .95, χ2 > .05 (Bentler, 1990)”
It is the p-value of the chi^2 test that should be > .05. Your model will fit very well if the actual chi^2 value approximates the degrees of freedom. Having a threshold for the chi^2 value itself would be uncommon. There is also a typo in the SRMR threshold.
7. “If the fit indices suggest model misfit, we will not be interpreting effect sizes.”
A df = 37 model does not offer a hell of a lot of dimensions of data space along which the model could be rejected, but from my experience, there is a pretty decent chance that the chi^2 will point to significant global misfit between the model and the data. This is a really risky little note in an RR :D
Anyway, this is a terribly strict requirement. I’m definitely not saying to disregard evidence against exact or even approximate fit. The chi^2 test is the only formal test of a model and the best guard against misspecified models. A significant chi^2 only tells you that there may be a misspecification in the model and that you need to take a closer look; taking that deeper look is essential in such a case. I think it is reasonable to plan the following:
If the exact model-data fit hypothesis is rejected, a set of careful diagnostic procedures to identify the possible local sources of causal misfit will be carried out (examining the matrix of residuals and modification indices; see the sketch after this comment). The fit would therefore be regarded as adequate if either (1) the exact fit test (chi^2 test) did not signal significant discrepancies between the data and the model or (2) there was no larger pattern of substantial residuals (say > .1) indicating systematic local misfit.
With some reservations, a disconfirmed model can still be useful and its estimates can still have interpretational value, provided that the fit of the model is not very bad – CFI or TLI way below .9, the chi^2 value being several times the model df. You also need a contingency plan for model modification if this is the case. Half data-driven, half theory-driven careful modifications are less of an evil than interpreting a badly fitting model, where serious misspecifications propagate through the entire model, or than throwing the data away.
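A minimal sketch of such diagnostics, assuming the model were (re)fit in R with lavaan (the model syntax and data object are placeholders; Mplus provides analogous residual and modification-index output):

```r
library(lavaan)

fit <- sem(model_syntax, data = dat)   # model_syntax and dat are placeholders

# Global fit summary
fitMeasures(fit, c("chisq", "df", "pvalue", "rmsea", "srmr", "cfi", "tli"))

# Correlation residuals: look for a pattern of absolute values > .1
resid(fit, type = "cor")

# Modification indices, largest first, as pointers to possible local misspecification
head(modificationIndices(fit, sort. = TRUE), 10)
```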
8. “We did not find strong enough evidence on whether these factors moderate the effects of synergistic mindset intervention on cardiovascular and performance outcomes (Yeager et al., 2022) to include them in the primary model.”
Effectively, only examining effects that proved to be significant in past research goes against the nature of scientific inquiry. If you are not interested in testing a moderation hypothesis, that is completely fine, but I think it should be said directly – that you chose not to test moderation within your primary model. Period. The current justification is a bit awkward.
9. Response letter: “However, we would like to keep the option of using the overall negative and positive affective experience scores by averaging the four negative affective experiences in the exploratory analysis. Although it might not be a pure robustness check for our conclusions, in this way, we will be able to observe the difference between the most popular operationalizations of affective experience and statistically superior options.”
Such a contingency is completely fine and even desirable! What I was objecting to was only qualifying the robustness of the results by using a psychometrically inferior model – an unweighted sum score.
10. Response letter: “If we cannot use multiverse analysis, we will run multiple models and report the results in supplementary materials. After eliminating the different operationalizations of affective experience, we counted 72 possible models (3 options for affective experience x 8 options for cardiovascular measures x 3 options for game measures). This analysis aims to describe the range of effect estimates based on all reasonable data analytical decisions.”
Completely agree that planning things like a multiverse analysis is not a concern at this stage.
That said, what you are describing is in fact a form of multiverse analysis! You don’t need any expert on that, IMO. It’s fine just to run all these possible models representing different design options and report the distribution of effect sizes you find for a few selected focal estimates. Sure, having a script that runs through all combinations and spits out a nice multiverse visualization is a great feature, but you can easily do it later on “by hand” too. E.g., estimate the effect size and SE (or CIs) for each model, order the models by effect size, and plot them using a forest plot (see the very convenient forest() function in the metafor package). The reader would then easily see the distribution of effect sizes, what proportion of the CIs cross zero, etc. Or, alternatively, even briefly describing the distribution of effect sizes verbally would be far better than nothing.
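A minimal sketch of this “by hand” approach (est and se are hypothetical vectors holding the focal effect size and its standard error from each of the fitted model specifications):

```r
library(metafor)

# One row per model specification (hypothetical est/se vectors)
res <- data.frame(spec = paste("model", seq_along(est)), est = est, se = se)
res <- res[order(res$est), ]

# Forest plot of the focal estimate across all specifications
forest(x = res$est, sei = res$se, slab = res$spec,
       xlab = "Focal effect size (95% CI)", refline = 0)

# Proportion of specifications whose 95% CI excludes zero
mean(abs(res$est / res$se) > 1.96)
```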
Thanks again for the opportunity to discuss the design of this important study with you. Good luck with the study.
Best wishes,
Ivan Ropovik
The authors have done a good job of revising the registered report based on my feedback and suggestions (although it is worth them reading my published work more closely when describing how they will score the demand and resource evaluation data – i.e., subtract evaluated demands from resources to get a score ranging from -5 to +5). While I could quibble with one or two of the authors’ responses, overall, the registered report is excellent and describes what will be a highly rigorous and superb piece of research in a comprehensive, accurate, and replicable way. It has been an interesting process reviewing this registered report, so thanks for the opportunity to be involved. I wish the authors all the best with the data collection and analysis phase, and I look forward to seeing the final write-up in due course.
As I have joined the review process after a very extensive and well-articulated first round of revisions, I have very few additional comments. I have read the manuscript, materials, and response to round 1 reviews thoroughly. My overall assessment is that this is a well-designed study which has been thoroughly described in the revised Stage 1 Report. The authors have also thoughtfully responded to the comments in the first round of reviews. The application of the synergistic mindset intervention to optimizing esports performance is an innovative idea and I am sure the results of the study will have substantive theoretical and practical value.
One minor point is that I note the concerns raised by Reviewer 2 about the control condition. I agree that the authors’ decision to retain the original procedure for the control condition is reasonable. This will allow comparability, and the procedure has been used with a large number of participants in the prior studies testing the synergistic mindset intervention. However, on page 12 of the revised manuscript, the control condition is still referred to as a “placebo control”. I recommend dropping the placebo wording, as the Yeager et al. (2022) paper did not describe the control conditions as a placebo control and, as has been discussed in prior reviewer comments, it doesn’t appear that there are matched expectancies across conditions.
I wish the authors all the best with conducting their study, and I look forward to reading about the results in the full paper.
The authors have incorporated my comment into the code, so I have no further comments on the code at this stage.
BR
DOI or URL of the report: https://osf.io/9zmp8
Version of the report: https://osf.io/9zmp8
Dear Authors,
Thank you for submitting a highly rigorous Stage 1 proposal to PCI RR and giving us the opportunity to assess it. I have now received three reviews, two with highly detailed feedback regarding various aspects of the study and one specifically verifying computational reproducibility. We had one unfortunate reviewer cancellation in the process, which delayed the decision, but I believe the present reviews were worth the wait and provide highly useful comments that will help you make final improvements to your plan.
We fully agree with the reviewers that it’s important to have this study carried out, so please utilise the rich feedback in a way that is most useful for your purposes (considering practical limits). I add a few comments as well; again, take what is valuable and skip the rest.
1. As the reviewers point out, it would be good to have explicit inclusion/exclusion criteria. It is mentioned on p. 11 that inclusion requires 6 h/week of CSGO, but how is this measured (baseline item #5?) and is this the only criterion? Does this mean veterans cannot participate if they don’t train/play anymore? What about age and language?
2. Related to the above, I am wondering whether it would be better to recruit participants based on rank or another performance indicator, rather than hours of training/play, considering that performance is a hypothesis. It is possible that rank also partially explains affective experiences, so having a min/max rank could help. Additionally, it feels important to control for previous experience with the bot deathmatch used in the intervention; individuals who have already learned how to produce high scores in this mod could generate unwanted data (see another comment later).
3. Still about participants: it is mentioned that individuals with significant health problems will be excluded. Does this refer to standard cut-offs with the applied scales? I’m specifically looking at gaming disorder, which you measure with a DSM-5 scale; although I’m sceptical about the scale’s clinical validity, it would be a concern to recruit (or continue having) participants who meet gaming disorder criteria at screening. Consider moving this scale to baseline registration measures as an exclusion screener.
4. Some existing Polish scale translations are cited. It would be good to report whether you will use your own translations (when you do) or English versions (if you do). The supplementary materials are informative, but it would help to have this information clearly in the manuscript.
5. The health scale from Ware Jr & Sherbourne (1992) is reported to ask about physical health (supplement p. 4), but the corresponding item of the original scale measures general health. Is the modification to “physical” part of the Polish translation?
6. I don’t want to further complicate scale selection (reviewers already address that in detail), but it also seems theoretically plausible that the synergistic mindset intervention works specifically for individuals with low self-esteem (I believe this was part of Dweck’s original reasoning). I wonder if, e.g., Rosenberg’s Self-Esteem Scale might be informative for exploratory or future analyses.
7. On p. 13 it is noted that the SESOI would be d = .07 rather than .03 or .05, but I don’t see it explained why the former and not one of the latter. This has no pragmatic relevance, but you may wish to elaborate unless I've missed something.
8. Deathmatch AI is used (p. 21). For future work (no sense in making last-minute changes here), the Aim Botz mod could provide performance data in a more standardizable setting. Although deathmatch is more organic, scores are influenced by weapon choice, bot behavior, etc. That said, I wonder if it would be possible to further standardize the deathmatch conditions, e.g., by fixing participants to one weapon.
9. Related to the above, it might be good to stress that the performance situation is human-AI and not human-human. This may be a highly relevant component in the production of a competitive challenge/threat response. I don’t know if the following study was ever replicated, but see e.g. Kätsyri et al. 2013: https://doi.org/10.1093/cercor/bhs259 (perhaps to be considered more at the Stage 2 discussion).
10. Related to the above still, I’m also wondering to what degree gender affects this intervention. Especially in CSGO, competitive women players have always been a small minority, which has affected how they experience competitive situations (e.g., Balakina et al. 2022: https://doi.org/10.1145/3569219.3569393 ). I believe the support for stereotype threat effects is currently weak at best, but I would consider, e.g., including images of and quotes from top women players in the materials for women participants (i.e., having separate sets of materials for men and women). This could be a simple way to improve intervention effectiveness.
11. I ran a face validity check with the materials through a Polish player, and one additional note (see also reviewer feedback) came up: the term “gamer” (graczy) addresses a specific subgroup of players, which has a strong identity connotation in this cultural context and tends to exclude some potential participants (in the same way as “scientists” would likely exclude e.g., philosophers among researchers). It’s totally ok if this is your target group, but just ensuring you’re aware of that (as it would be very easy to use different terminology).
12. Regarding the baseline measures on p. 29 and supplement p. 6, I note the following. #5: consider using “playing” instead of “training”, as people interpret “training” in many ways, e.g., ranked play isn’t training? (especially if this item is used as an inclusion criterion) #6: Some people have multiple accounts and have played other mods of (almost identical) CS, so maybe provide an option to self-estimate total hours played in CS?
13. P. 36 data availability says that all data will be made available, but I assume e.g., all video data will not be made available? Any other exceptions?
I was informed during the process that your reservation for the lab is now in April. Let’s make sure you have a decision before that. I know it's a lot of feedback. I can be contacted directly at any point with all concerns and questions, and if possible, please let me know a few days in advance when you’re about to resubmit so that I can prearrange time to fully prioritize this and provide a rapid turnaround. This is an important study and it’s a privilege to help you with it.
Best wishes,
Veli-Matti Karhulahti
Thanks to the authors for the opportunity to read their manuscript (ms). Overall, I think that the proposed study would be informative. One of its main selling points is that it will produce a rich dataset, making it possible to examine the effect of the designed intervention on the longitudinal trend in performance and affective measures, while not relying entirely on self-reports but also collecting physiological measurements. Thanks to being an RR, the present study has the potential to bring evidence that would likely be much more robust than a modal study published in this field. That said, I also have some critical takes and suggestions for improvement.
An acknowledgment upfront: I am not a social psychologist and don't have much expert knowledge about the substantive aspects targeted by the present study. In my review, I will mainly focus on the measurement, design, and analysis side of things. As my role as a reviewer is mainly to provide critical feedback, I provide it in the form of comments below, not ordered by importance but rather chronologically, as I read the paper. I leave it to the authors’ discretion which suggestions they find sensible and choose to incorporate. I hope that the authors find at least some of the suggestions below helpful.
1. In the introduction, I am missing a bit more critical interpretational viewpoint. As usual, in most research studies, the presented past research is all taken at face value. Especially in such literature, where weakly informative designs yielding very heterogeneous findings are rather the norm, I think it makes sense to identify which of the past studies presented in the intro are vitally important for informing the theoretical underpinnings of the present study and qualify the strength of their conclusions by the methodological robustness of the design they utilized.
2. The data will not support very wide generalization, so I would suggest revising the title to something more specific like Optimizing *Esports* Performance Using a Synergistic Mindsets Intervention. The same applies to the abstract.
2. My hunch is that affective response patterns may be rather stable characteristics that will be difficult to structurally alter with a self-administered, one-shot type of intervention. Getting *a* significant effect is not that hard. Finding *the* effect using a rigorous method (even though the intervention seems face valid) is far less likely, IMO. The good thing about the design of the present study is that it attempts to partly stretch the intervention over a week. Kudos to the authors for choosing an RR format to give it a try.
3. “The other participants will be assigned to a validated placebo intervention focused on learning about the brain (Yeager et al., 2022).” Well, this appears to be a stretch too far for me. The cited study did not carry out *any* validation of this control condition. Being used before (even if in a Nature paper) is not equal to being validated. I’d suggest removing that remark. I’ll have more comments on the control condition below.
4. Again, a rather loose use of a validity claim. “In sum, our study will provide a unique combination of high internal and external validity levels”. I think I get what you mean (using a controlled experiment & a real-world outcome?), but I am not sure about the “high” part (more on that later – I see issues with the measurement and the comparability of the control condition). Anyway, instead of this slightly hyped-up language, I’d stick to being more descriptive.
5. The targeted population could have been described in more detail. The description does not even mention that the sample will be entirely Polish (true?). Fully ok, but these details need to be acknowledged. How are the participants going to be sampled/recruited? Predominantly, what kind of people will they be? Students? General population? Is >6 hours of playing time the only inclusion criterion? Is there an expectation regarding the sample composition?
6. Part Sampling Plan/Expected Effect Sizes, the first paragraph could be structured more clearly. It currently mixes substantive assertions with generic stats meta-talk.
7. I think that the Expected Effect Sizes part is a conceptually relatively weakly informative part of the research design justification. I understand that the authors wanted to provide a solid ground for the sample size determination, but I think it misses the point. To elaborate, I see the following. I like the determination of a SESOI that is grounded in some reality. That part is fine. But I don’t get the need for an “expected ES”. Trying to inform the design of a study based on (probably) non-systematic picking among relatively idiosyncratic, heterogeneous effects from the published (thus subject to publication bias) literature is unfruitful, IMO. The meta-analysis by Webb et al. (2012) is, IMO, not helpful either, for that matter (more on that later). Even if there were no bias in the literature, considering an “expected ES” is incompatible with the frequentist notion of power, a *pre-data*, theoretical concept (just like α), i.e., the sensitivity of a given statistical test to reliably detect a range of hypothetical population effects of interest (see Morey & Lakens, 2017). Instead of such a long, numbers- and references-laden part, I, as the reader, would prefer to see the power curve (given the specific model) for the hypothetical range of effect sizes. That range would, of course, also include the SESOI. A figure would say more than a thousand words, not forcing the reader to appraise the informativeness of the present design at fixed points.
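A minimal sketch of such a curve for a simple bivariate correlation test (the sample size, alpha, and range of effects are placeholder values; power for the actual SEM parameters would come from simulation, e.g., a Monte Carlo study in Mplus):

```r
library(pwr)

rs    <- seq(.05, .50, by = .01)   # hypothetical population correlations
power <- sapply(rs, function(r) pwr.r.test(n = 200, r = r, sig.level = .05)$power)

plot(rs, power, type = "l", ylim = c(0, 1),
     xlab = "Hypothetical population correlation", ylab = "Power")
abline(h = .80, lty = 2)   # conventional 80% power benchmark
```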
8. The unfruitfulness of the SESOI & Expected ES combo can, IMO, be seen in the Sample Size Determination part. Absent any formal mechanism (or common conceptual footing) to reconcile the two, the authors are pushed to conclude that the SESOI is unfeasible, while the “surprisingly large” ES from the past literature also did not pass some of the internal checks of a skeptical reasoner. So the outcome of the several-paragraphs-long justification is an arbitrary set of ESs. There’s inherently nothing wrong with setting an arbitrary target, or one that is doable given some budget. My point is only that looking at a wider hypothetical range would be more informative. That way, the reader would gain a comprehensive outlook of what power the given design/test provides for any given ES. Btw, I liked the justification for the target ratio of type I/II error rates.
9. It is fine to compute power for individual SEM parameter estimates (not “for the structural equation model” as currently put), but I think it always makes sense to report whether the SEM has decent power to pick up significant model-data discrepancies if these are present. That can be done for an approximate fit hypothesis using the RMSEA (see https://www.quantpsy.org/rmsea/rmsea.htm) and be reported at least in the supplementary materials.
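A minimal sketch of that computation in the MacCallum-Browne-Sugawara style; the df, n, and both RMSEA values below are placeholder assumptions to be replaced by the model's actual values:

```r
# Power to reject close fit (RMSEA0 = .05) when the assumed true RMSEA is .08
df <- 37; n <- 200; alpha <- .05
ncp0  <- (n - 1) * df * .05^2              # noncentrality under H0 (close fit)
ncpA  <- (n - 1) * df * .08^2              # noncentrality under the assumed misfit
crit  <- qchisq(1 - alpha, df, ncp = ncp0) # critical value of the close-fit test
power <- 1 - pchisq(crit, df, ncp = ncpA)
power
```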
10. Re: assuming factor loadings of .50… This is a serious design blunder, IMO. If the employed scales have such an abysmal overrepresentation of error variance (loading of .50 implies 75% of the total variance being error), it has serious consequences for the efficiency, precision, and likely also the accuracy of the design. Yes, in most of the social science research, measurement properties of the measures are hidden away behind convenient sum scores, so I don’t want to scold the authors for paying higher-than-usual attention to some of their auxiliaries. But still, if it is the case, this should be discussed.
11. Just an idea always worth considering, IMO. Maybe it would make sense to try to screen out careless responders (e.g., based on longstring detection or some sort of insufficient variance in responding pattern, or being a multivariate outlier indicating random response pattern). If done, it should only be applied to pre-treatment measures, as carelessness itself may have been affected by the treatment, and exclusions based on that would induce bias.
12. For the description of the stages, I found it hard to understand at times what follows what. E.g., "participants will provide informed consent and fill out baseline questionnaires"... "Next, the researcher will apply sensors to obtain cardiovascular measurements, and participants will fill in the baseline questionnaires". What comes first? Cardiovascular measurements or questionnaires? The figure with the procedure workflow is clear, but the text description was sometimes difficult to follow. Maybe it’s because I’m not acquainted with the subject matter, but it was difficult for me to keep track of what was measured, when, and for what purpose.
13. In general, there is little information on how the measures will be ordered. Within the questionnaire block, why not randomize to minimize the order effects? The same with items within scales. Will their order be randomized?
14. In stage I, is gaming for 2 minutes enough to provide a reliable picture? Just asking.
15. It is fine to have confirmatory RQs, with all other things lumped in exploratory analyses. But the reasoning on p.24 does not make sense to me. This one: “We treat them as secondary because we did not include them in the power analysis, and we may not have enough statistical power to infer about the effects of synergistic mindset intervention on them.”. Why? Power seems to be an independent issue to me.
16. The number of items for the RESS-EMA scales is not reported. Also, why a different response scale at baseline and in stage III?
17. Several scales are based on only 2 items. I guess the alpha will also be very low.
18. Re data reduction contingencies: I’d say that with factor loadings below .40, the modeled latent is at a higher risk of being “hijacked” by some idiosyncratic factor (or of poor construct validity in general), but that is also the case with factor/component scores or sum scores. The latter two only hide that problem. So although far from ideal, I think it would be reasonable to model the latents as long as the model converges and to fall back to some observed scores only if that is not the case.
19. Regarding the use of difference scores, I think it is statistically superior to use residual scores, e.g., regress the pre-match baseline score on the resting baseline and take the residual.
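A minimal sketch with hypothetical variable names:

```r
# Residualized score: pre-match value adjusted for the resting baseline
fit_resid <- lm(hr_prematch ~ hr_resting, data = dat, na.action = na.exclude)
dat$hr_resid <- resid(fit_resid)   # NA-padded thanks to na.exclude

# dat$hr_resid can then replace the raw difference score in subsequent models
```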
20. Re missing data: the plan is to exclude participants with missing data on a per-analysis basis. If you use SEM, that makes little sense. Why not use full-information maximum likelihood to handle the missing values? Mplus can do that easily. Deletion means discarding information and is only ok when the data are missing completely at random.
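For illustration, a minimal lavaan sketch (the model syntax and data object are placeholders); Mplus offers the same functionality under maximum likelihood estimation:

```r
library(lavaan)

fit_fiml <- sem(model_syntax, data = dat,
                missing = "fiml",     # full-information maximum likelihood
                estimator = "MLR")    # robust ML, a common companion choice
summary(fit_fiml, fit.measures = TRUE)
```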
21. Is there any conceptual or statistical reason to exclude outliers with such a number of observations and this type of variables?
22. After controlling for the effect of the intervention, your model implicitly assumes that the covariances between the residuals of mediators are all zero. Is that what you want? If untrue, this represents a model misspecification that will propagate throughout the model and bias the other estimates.
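For concreteness, a minimal lavaan-style sketch (hypothetical variable names) of explicitly freeing the residual covariance between two parallel mediators:

```r
model_syntax <- '
  # structural part
  med1    ~ a1 * condition
  med2    ~ a2 * condition
  outcome ~ b1 * med1 + b2 * med2 + c * condition

  # residual covariance between the mediators
  med1 ~~ med2
'
```

In Mplus, the analogous statement is a WITH statement between the two mediators (med1 WITH med2;).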
23. RMSEA and CFI are okay approximate fit indices, but why is the plan for model fit evaluation missing the only formal model test, the chi^2 test? I’d definitely want to see that. Maybe also the SRMR.
24. Is it correct that you won’t be interpreting effect sizes if there is a significant model-data misfit? Btw, ironically, low loadings help lower the chi^2 value.
25. “We do not plan to use the same approach for hypothesis 2 because we did not find a way to operationalize the smallest effect of interest for cardiovascular responses”. Alternatively, it is fairly easy to compute Bayes factors for individual model parameters by a model selection approach, using just the BIC (Bayesian information criterion) approximation – comparing the BIC of models with and without the given parameter (see Wagenmakers, 2007). No SESOI or prior needs to be specified (a weakly informative unit information prior is implicitly assumed) and BIC can easily be extracted for any model. Presenting the continuous BFs alongside equivalence tests may be informative for the readers.
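A minimal sketch of that BIC-based comparison (fit_with and fit_without are placeholders for two fits of the same model with and without the focal parameter):

```r
# BIC approximation to the Bayes factor (Wagenmakers, 2007)
bic_with    <- BIC(fit_with)      # model including the focal parameter (H1)
bic_without <- BIC(fit_without)   # model with the parameter constrained to zero (H0)

# BF01 > 1 favours the model without the parameter; BF01 < 1 favours the model with it
bf01 <- exp((bic_with - bic_without) / 2)
bf01
```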
26. “We will test the robustness of our findings by adding to the primary model the moderation of the negative prior mindsets, negative appraisals, and gaming experience.” Robustness? How? The target causal effect is identified regardless. Btw, I think it would make things clearer if you framed your research goals as either testing causal effects (intervention –> mediators, outcome) or examining mechanisms through which the causal effects operate (mediation effects).
27. Exploratory Analyses section: “We treat the moderations as exploratory analysis because the initial studies were inconclusive on whether the prior mindsets moderate the effects of synergistic mindset intervention on cardiovascular and performance outcomes (Yeager et al., 2022).” I see your point, but the fact that "initial studies were inconclusive" is irrelevant with respect to the inclusion of a moderator in the model. The thing is that the inclusion of a moderator and the modeling of (incoming and outgoing) paths to the treatment and outcome nodes is an act of expressing ignorance about the presence of the given paths. Meaning, there may or may not be an effect. It is fine to choose your confirmatory research aims, but justifying the choice based on the inconclusiveness of prior research appears conceptually weak to me.
28. Also in the Exploratory Analyses section: “We will also test the robustness of our findings by testing alternative operationalizations of the variables used in the model. For positive/negative affect, we will use the sum of the positive/negative items instead of the latent factor.” The sum score is only a special case of a latent variable model, in which you assume equal factor loadings and reliabilities of all measured indicators equal to 1. Therefore, from a measurement perspective, I personally wouldn't qualify the interpretation of the robustness of the conclusions based on employing psychometrically inferior measurement models. Instead, I would only plan using observed sum scores as a fallback if the latent models were locally under-identified (say, in case of collinearity issues) or produced estimation issues due to the violation of the local independence assumption (large residual covariances). That is far from ideal, as sum scores in such a case hide serious measurement issues, but it at least provides you with an opportunity to empirically address your target research questions, albeit more tentatively. Btw, if you really were to resort to observed scores as a fallback, I'd use a PCA component score, where you at least don't assume equal component loadings.
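A minimal sketch of that component-score fallback (hypothetical item names; complete cases assumed):

```r
items <- dat[, c("na1", "na2", "na3", "na4")]   # hypothetical negative-affect items
pc1   <- prcomp(items, scale. = TRUE)$x[, 1]    # first principal component score
```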
29. Third, “For positive/negative affect, we will also try the single difference score (sum of negative emotions subtracted from the sum of positive emotions)”. Do you mean the difference in *mean* scores, given that they use the same response scale? If there is some missing data, subtracting sums will not work.
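For illustration (pos_items and neg_items are hypothetical item-name vectors): item-level means tolerate occasional missing responses, whereas sums do not.

```r
pos_mean <- rowMeans(dat[, pos_items], na.rm = TRUE)   # mean of positive-affect items
neg_mean <- rowMeans(dat[, neg_items], na.rm = TRUE)   # mean of negative-affect items
affect_diff <- pos_mean - neg_mean
```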
30. I am a fan of testing the robustness of findings by employing alternative operationalizations. But with so many, how are you going to do that specifically? There will be quite a few combinations. Maybe you should consider doing a multiverse analysis for these robustness checks.
31. It is practically not feasible for me to review the measures, as only links to the items are provided.
31. The last one, and an important one. When I read the control condition instructions, it seems obvious to me that this condition likely doesn’t elicit the same degree of expectancy as the reappraisal manipulation. The control condition needs to seem smart and face valid to the participants, but be inert w.r.t. the outcomes. I encourage the authors to think about how to make the control condition far more believable and applicable, because at the moment, any post-treatment difference in the outcome between the experimental groups may be due to a likely substantial difference in the strength of demand characteristics perceived by the participants. Apart from that, it does not help if only “the synergistic mindsets group will report the adherence and progress in scheduled affect regulation training” (got that right?).
Just a few examples follow. Maybe it's just me, but this sounds trivial to me, and I definitely wouldn't expect any effect on my esports performance or affect from reading about Phineas Gage and suchlike:
"What I didn't expect, however, was that the brain is so involved when I'm playing! In fact, everything I need to play - seeing the map, hearing my teammates on headphones and speaking to them, moving around the map with the mouse and keyboard, thinking about my next move, anticipating my teammates' and opponent's moves - are all made possible by different areas of my brain!"
“I now know that my eyes are not responsible for my vision and that I know how to go to the local store is due to my temporal lobe. I learned that when I am stressed, my behavior depends on the cooperation of two parts of the nervous system - one responsible for normal functioning and the other responsible for immediate reactions."
"On the other hand, it is sad how much brain damage can impair further functioning. Fortunately, thanks to such injuries, scientists are learning more and more about how the brain works and how to treat various diseases."
To sum up, I think the present proposal has merits and can provide rich data and relatively robust findings. What worries me most are (1) the prospect of poor measurement and (2) the pretty obvious differences in the demand characteristics of the treatment and control conditions. I also feel the study underutilizes the data, in that the authors are depriving themselves of the opportunity to examine interesting questions (specifically, using the longitudinal measurements, modeling more complex models, and looking at the follow-up). But I am a fan of the principle that authors should be free to study what they want. Anyway, this got way too long (sorry for that), so please feel free to integrate what you see fit and react only to what requires a reaction.
Good luck with the revision!
Best wishes,
Ivan Ropovik
Thank you for the opportunity to review the Mplus scripts for this RR.
Power analysis script: In this script, the semicolon at the end of the line (NAMES = ...) is missing, so the code initially reports an error, but after this correction the code works well and produces the reported results. I can therefore confirm the reproducibility of the calculations listed in the supplements with this script in Mplus.
Primary analysis script: The mediation code is correct and on simulated data produces the expected results without errors or warnings.