Dependence of power and type I error on model misspecification for mediated moderation

Zoltan Dienes

Dependence of power and type I error on model misspecification for mediated moderation

Zoltan Dienes based on reviews by Mijke Rhemtulla, Pier-Olivier Caron and Reny Baykova

A recommendation of:

STAGE 1

How Does Model (Mis)Specification Impact Statistical Power, Type I Error Rate, and Parameter Bias in Moderated Mediation?

Jessica L. Fossum, Amanda K. Montoya, and Samantha F. Anderson https://osf.io/bauzn version 3

Read report on server

Abstract

ZH-CN

Submission: posted 04 February 2023
Recommendation: posted 06 December 2024, validated 31 December 2024

Cite this recommendation as:
Dienes, Z. (2024) Dependence of power and type I error on model misspecification for mediated moderation. Peer Community in Registered Reports, . https://rr.peercommunityin.org/articles/rec?id=398

Recommendation

Researchers are often interested in moderated mediation. A predictor variable, such as number of counselling sessions, may predict an outcome, such as approach to a feared object, by way of a mediator, for example number of times the object was described in counselling. The strength of mediation in turn may depend on a moderator, such as vividness of imagery: Counselling reduces fear by way of imaginative exposure, particularly in those with vivid imagery. There may be a number of mediators ("indirect" paths), and any or all of these mediators may be moderated. In testing moderated mediation, a statistical model is specified which may or may not match the data generating process; in particular, there may or may not be moderators in the model corresponding to moderators that may or may not exist in the real data generating process, resulting in overspecification (more moderators of the indirect paths in the model than reality), underspecification (less moderators of indirect paths in the model than reality) or complete misspecification (where the moderated indirect paths in the model are not moderated in reality, and vice versa).

Researchers rely on the validity of tests (correct type I error rates), if they use frequentist statistics. Model misspecification may impact the validity of inferential tests for moderated mediation. Similarly, researchers need to be able to assess power for any analysis. In simulating power for mediated moderation, it may be important to know the possible extent to which the model is misspecified, and take this into account in planning numbers of participants. Fossum et al. (2024) will address this important problem with a series of simulations to determine if power is reduced meaningfully with over or under specification, or type I error and parameter estimates are biased for complete misspecification.

The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to reviewers’ and the recommender’s comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance.

URL to the preregistered Stage 1 protocol: https://osf.io/8gwfu

Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.

List of eligible PCI-RR-friendly journals:

References

Fossum, J. L., Montoya, A. K., & Anderson, S. F. (2024). How Does Model (Mis)Specification Impact Statistical Power, Type I Error Rate, and Parameter Bias in Moderated Mediation? A Registered Report. In principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/8gwfu

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #2

DOI or URL of the report: https://osf.io/vgkdt/

Version of the report: 2

Author's Reply, 26 Nov 2024

Download author's reply Download tracked changes file

Decision by Zoltan Dienes, posted 31 Oct 2024, validated 02 Nov 2024

Thank you for your extensive changes - the revewiers are now largely happy.

One major point: You appear to have lost your study design template.

One crucial point of inference: You decide not to use equivalence testing because your hypothesis does not predict equivalence. But a test of a hypothesis is not severe unless it could show the hypothesis wrong; so it is precisely because the hypothesis predicts a difference that is inferentially desirable equivalence can be concluded (that is one reason why I prefer to call the inferential procedure using null interval hypotheses and confidence intervals "inference by intervals"). Otherwise how do results show the hypothesis wrong? Non-significance in itself does not do that. You say "We also report 99.9% confidence intervals and odds ratios to contextualize the results further." The motivation for this is of course that non-significance by itself does not do the job; indeed, for high N simulations, significance may not do the job either (if an irrelevant effect size is found significant). The problem in terms of a Registered Report is that statement that results will be contextualized allows inferential flexibility in. Just how will you do the contextualizing?

This is where the Design Table comes in: One is asked to nail down the whole inferential chain - including how one may draw different conclusions dependent on different patterns of data.

I am not saying you should use inference by intervals; just that if you test a hypothesis, its severity should be at least known. One alternative is just to estimate with CIs and drop p-values; then the hypotheses (Still listed in the DesignTable) are not about what effect exists, just a question about how big each effect is. (Then one must be scrupulous in not drawing any existential conclusions.)

You have done a good job of specifying how you will draw inferences in many cases; you just need to make sure you now make everything watertight.

Reviewed by Mijke Rhemtulla, 17 Sep 2024

Thanks to the authors for their detailed revisions and responses to my comments. I'm satisfied with the revised stage 1 manuscript and have no further comments to raise. Good luck with data collection!

Reviewed by Reny Baykova , 29 Oct 2024

I appreciate all the updates that the authors have made to the manuscript. I have some outstanding comments. Most of the related to the clarity of writting.

1. Including a statement which specifies which parts of the study have been completed and which have not is good idea. However, this should also already be clear as one is reading the stage 1 report - everything that has already been done should be in past tense, and everything that has not been done should be in future tense. Verb tenses can be changed in the stage 2 registered report.

2. On page 6, the first sentence of the last paragraph is unclear - "When the indirect effect is moderated...". Here, it is not clear what "indirect effect" is being discussed.

3. On page 11, hypothesis H1c states: "we hypothesized that parameter bias for over-specified models would be acceptable (<10%) in each condition". Up to this point, no conditions have been discussed explicitly, so it is not clear what is meant by "each condition".

4. On page 12, hypothesis H3a states: "We hypothesized that the type I error rate would be too high (liberal) in completely misspecified models". It is not clear what would be considered"too high". Hypothesis H3b should also be described in more detail - it is not clear what amount of bias would be considered "unacceptably high".

5. On page 17, the conditions under which H1b would be supported are unclear. The text states: "H1b would be fully supported if all four coefficients for the number of moderated paths are
significant... " . Here it is not clear what the "four coefficients for the number of moderated paths" refer to. If there is one model for dichotomous W and a second model for continuous W, then there will be 2 coefficients related to the number of moderated paths. It is not clear where the other 2 coeffients are coming from.

6. On page 17, the conditions under which hypothesis H1b would be partially supported need to be specified in more detail. The text says "If only some of the coefficients are significant in the predicted direction, H1b would be partially supported". "Some" is not specific enough.

7. On page 17, it is not clear which models will be used to assess hypothesis H1c. It is also not clear what is meant by "conditions".

8. On page 17, it is not clear how hypothesis H2b will be assessed. Same as above, it is not clear what is meant by "conditions". As a general comment, the analysis plan will be much clearer and eaiser to followig if the tests corresponding to each hypothesis are described in detail.

Evaluation round #1

DOI or URL of the report: https://osf.io/vgkdt/?view_only=e57e1c45183d48979b9a1d0a335d0a97

Version of the report: 1

Author's Reply, 03 Sep 2024

Download author's reply Download tracked changes file

Decision by Zoltan Dienes, posted 06 Jun 2023, validated 06 Jun 2023

I found your submission clearly written and well motivated. The three reviewers are very positive; they also provide detailed feedback on areas in need of clarification.

One additional point on your inferential procedure. You say that "only effects with an odds ratio greater than 1.68 will be considered meaningful," but also that you will reject a hypothesized effect on power if the effect is non-significant. In itself, non-significance is consistent with any population effect size, hence one over 1.68. So this leads to a possible inferential contradiction. Why not set up an equivalence region/null interval of +- ln 1.68 for ln OR (or make it one sided if you like), and if a suitable CI is wholly outside the null interval conclude power is meaningfully affected, wholly inside it conclude that power is not meaningfully affected, and otherwise suspend judgment? In any case, as you have provided no grounds for asserting non-significance is meaningful, some justified inferential procedure is needed to allow the conclusion that power is not affected by a variable/interaction.

On another point of inferential procedure, to clarify a point by Baykover, for a registered report, pre-register only analyses where the full inferential chain is nailed down; that is, to keep things clear, exploratory analyses are not mentioned in the Stage 1. They can of course be reported in the Stage 2 in a separate exploratory results section.

best

Zoltan

Reviewed by Mijke Rhemtulla, 03 Jun 2023

I enjoyed reading this Stage 1 report! It’s clearly written and laid out, and I noted no major flaws in the logic or plan. I have a concern about the under-specified condition that I describe below, a number of clarifying questions about the design of the simulation and the analysis plan, and then a bunch of tiny typo-level comments.

First, my only concern about the proposed simulation is how to interpret results in the under-specified condition. This concern stems from the fact that this paper focuses on power and not on the consequences of missing out on true moderation effects. So in the underspecified condition, suppose you find high power, would you conclude that under-specification is a good thing? Or no big deal? Or that a minimal strategy is good? I guess I’m a little worried that the study is set up to find that over-specification (which, to my mind, is not really a misspecification, because nothing is missed – we’re simply estimating zeroes rather than fixing them at their true unknown population values) is worse than underspecification, in which the model is not just inefficient but actually wrong. In the analysis section, the authors write that “We believe that in some cases underspecification may increase power and in other cases decrease power.” That’s my intuition as well – for this reason I think it’s important to clarify why you think it’s worthwhile to examine these conditions, and how the results will be interpreted or translated into recommendations for practice. It might be worthwhile to consider also reporting bias in addition to power, because bias may be able to reveal the consequence of the misspecification even when power is high. I think the paper could benefit from some discussion of whether missed moderation effects are something we should worry about – although the complete misspecification condition will at least shed some light on the possible consequences of missed moderation effects on type-I error rates for other effects.

Simulation Method:

I’m not clear on the total number of simulation conditions. The design is described as 6x9x6x2x2x6, but not every one of those cells exists, and some factors are not included in this design – like the 9 sample sizes. (There is a sentence reading “Effect size on the interaction term and sample size were varied”, which made me think that effect size on the interaction term is a different factor than the 2-level “effect size” factor that is described as a between-level factor in the simulation. At the bottom of page 17, the interaction is said to account for 1%, 3% or 5% of explained variance – so I guess this isn’t the 2-level effect size factor, because it has 3 levels. (Also at some later point it was mentioned that the interaction effect is sometimes negative but I’m not sure I saw that in the design). In short, I’m a bit confused about the setup. A table would be helpful, as well as a clearer description of what was varied and how it was varied, and what the total number of simulation conditions is.

“Interaction terms such as XW also had a variance of one” – how was this ensured? Was the product term rescaled to have variance of 1? More generally, I wanted to read about how the data were simulated – was it piecemeal (e.g., you first generated data on X and W and errors and then computed M and Y from those?) or did you use a joint distribution (in which case, how did the interaction terms work)? What was the correlation between the interaction terms and X and W? etc.

Is nonconvergence ever an issue with mixed effect logistic regression models? If it might be, please specify what you intend to do with nonconverged replications (will they be included in the analyses? will they be reported? will another estimator be used as a backup?)

“We chose this effect size metric instead of statistical significance due to the large amount of data likely favoring statistical significance.” I agree that significance tests are inappropriate as a way to interpret results of a simulation, and that effect sizes are more appropriate, though my personal preference would be for you to simply plot all the results (with confidence intervals) and describe the trends that you observe. But I’m confused here because this sentence suggests that significance tests will not be used, but the next sentence goes on to give a p-value threshold and the remainder of the analysis section describes how results will be interpreted on the basis of their statistical significance. At a minimum, the rationale for the analytic choices should be clarified. I would be very happy to see a strategy with no p-values involved.

I’m unclear on why the effect size variable is sometimes coded sequentially and sometimes as two variables (valence and absolute value). It would be helpful to see an explanation of why the coding of this factor changes.

Hypothesis 1b describes an interaction, which will be tested by including an interaction term in the analysis model, but then it sounds like only the main effect will be interpreted as evidence for the hypothesis (“a significant effect of number of moderated paths in the analysis model”), and no mention is made of the interaction effect.

For the analyses, I wanted to see more detail on how the models were fit (in R? using lmer or brms? etc.). Including equations for the linear mixed effects models, or lmer code (if that’s what will be used to run them) would help to make the analyses totally transparent. I wasn’t sure, for example, if “random intercepts only” means random intercept for condition, or if you intended a more complicated set of random intercepts.

There seems to be a disconnect between using the percentile bootstrap confidence interval at 95%, as described, and using a Wald test (mentioned several times as a method for determining significance) and using a p-value threshold of .001.

Figures/Tables:

Figure 1: it took me a minute to find the conceptual vs. statistical models – the description “(right)” had me thinking that the right column of the figure would have the statistical models, but instead it’s the right figure within each cell. It would be very helpful if these figures were re-drawn to have larger arrowheads and larger coefficient labels – I had to zoom in on the pdf quite a lot to decipher these.

Table 2: Consider using boxplots here to show the range of sample sizes in addition to the median for each model? I love this systematic review and think it would be neat to see a little bit more detail in these sample sizes that are presented.

I found it hard to keep straight the research questions and associated hypotheses – maybe a succinct table that lays these out would be valuable? I’m not sure to what extent these are just laid out like this for the stage 1 RR vs. the final paper.

Equations: I found the verbal description of all possible equations for M and Y given in the text around Equations (1) – (6) not very easy to read. I wondered if putting these 6 equations in a table, or include them in the Figure that shows all 8 models, with the path labels corresponding to the equation coefficients, would be both clearer and more succinct.

Typos and other super minor things:

p. 2 “whether a proposed mediator” à “whereby a proposed mediator”

p.2 re: the WebofScience count increasing from 2020-2022, is there a denominator number that can be pulled from WebofScience to indicate whether moderated mediation models are increasing as a proportion of published articles vs. following the trend of increasing publicaations overall?

p. 2 “it’s” à its

p. 3 “detect a small effect” à should this be a small mediation effect? or is that effect size for a main effect? (or is power the same?)

p. 3 “foul play” is pretty judgey – I think many proponents of science reform encourage the narrative that p-hacking is usually unintended, not malicious.

p. 4 “methods” à “method’s”

p. 4 “types I”

p. 5 “estimated with [a] commonly used [SPSS?] macro”

p. 9 “model being estimates” à estimated

Table 1 note: “assume dichotomous moderated is with” à moderator has?

p. 10 “dichotomous vs. continuous predictor variables” does predictor variables include the moderator?

p. 10 “tools available to [do] sample size planning”

p. 11 “researcher[s] may need”

p. 11 “Statistical power [analysis] for moderated mediation…”

p. 12 “at least one additional path is moderated” à “allowed to be moderated”?

p. 12 “can introduce excessive collinearity, especially with the interactions” à Is this true? I’m not overly familiar with the debate, but I was under the impression that the thought the idea that not centering interaction terms can result in multicollinearity had been debunked?

p. 12 “Under-specification … may also add unnecessary parameters” this sentence confused me – I thought it meant that by omitting one parameter, that would produce another phantom effect (i.e., the whack-a-mole effect whereby leaving something out causes that omitted effect to go somewhere else) but I think you just meant that by under-specification you include situations in which a parameter is under-specified but it may also be over-specified w.r.t. other parameters.

p. 12 “where none of the paths included in the indirect effect are correctly moderated” is unclear – does “included in the indirect effect” refer to the DGP or the analysis model? does “correctly moderated” mean that they are correctly allowed to be moderated? The end of this sentence “should be 0 according to the DGP” also totally confused me for a while – I eventually figured out what you mean by complete misspecification but this description is not clear. It might be helpful to describe that for “complete misspecification” to occur, (a) the DGM contains exactly one moderated path in the indirect effect, that is left out of the analysis model, and (b) the DGM contains exactly one unmoderated path that is included in the analysis model. (Is that right?)

p. 13 “base don”

p. 13 “By incorrectly specifying where the moderation occurs in the model, researchers may get biased estimates of the paths, coming to incorrect conclusions about which paths are moderated.” à this can also be a consequence of some kinds of underspecification, right?

p. 14, “This study examined the effect of model specification (over-, under-, or correctly specified)” – also completely misspecified.

p. 14, “continuous moderators and X focal predictor variables” – is the X meant to be there?

p. 14, “Power was only assessed for over-specified models because … and under-specified models because…” the “only” threw me off here. I first read it as “only for over-specified models” but then you added under-specified models. Then I read it as “only because”, as in “we only assessed power for this reason”. I think it may be clearer to say that only power was assessed for over- and under-specified models, and only type-I error was assessed for completely misspecified models

p. 15 “type I error rate will increase as the number of incorrectly moderated paths increases” – clarify incorrectly moderated?

p. 19 “a path with an interaction term from the DGP must be included in the data analysis model”: clarify that “must” follows from the way you defined over/under-specification.

p. 19 “we use the criteria from Bradley (1978)” clarify what these were (.025 /.075?)

p. 21 “will be supported would be supported”

Download the review

Reviewed by Pier-Olivier Caron, 05 Jun 2023

This project is about investigating power and type I error rate of misspecied moderated mediation. The method is very interesting and answers its own research questions and hypotheses. Put simply, it simulates a population model among six different ones, and the rejection rate of the index of moderated mediation of all of them. They will then look at what factors influences rejection rates and to which degree.

Regarding the key issues to consider at Stage 1:

· The research question is clearly defined. The hypotheses will answer the research questions.

· Authors have clearly stated the novelty of their approach.

· The protocol is sufficiently detailed to enable replication. I am sure I could do it myself.

· The proposed statistical analysis will answer the research questions and is in line with recommendations on design, analysis, and interpretation of Monte Carlo experiments.

· It is clear how a hypothesis will or will not be confirmed.

· Authors adequately use mull hypothesis testing. They proposed their significance threshold to .001 and proposed a minimal effect size of an odd ratio of 1.68 (small effect).

· 5000 replication is probably enough to answer the hypotheses. Even though, with my own experience with moderation analysis, 50000 would be preferable to estimate parameters and their variability (which is not an objective of the project).

As such, I can only recommend the project as it is. The authors certainly know their domain and how to carry their research.

I have two substantial concerns though. I am not sure if they are relevant for a Stage 1 report.

I am unsure what will be the outcome, more specifically, the guideline that will emerge from this study. I am worried the message might misleading for applied researchers: to emphasize on rejection rates rather than on correctly specifying a model can send the inappropriate message that “model specification is less important than mere statistical power”. I see this future article to be cited specifically for this purpose.

The study did not vary mediated paths, and many paths are kept null. Mediated paths (X to M, M to Y) are all fixed to 7% (.26 when standardized), the direct path X to Y is null. They are known to have differential effects on rejection rates. Keeping them fixed may be desirable at this stage, but it greatly reduces the scope of the project.

Reviewed by Reny Baykova , 01 Jun 2023

Download the review