Examining attentional retraining of threat as an intervention in pathological worry

T. Meyer, based on reviews by Thomas Gladwin, Jakob Fink-Lamotte and 1 anonymous reviewer
A recommendation of:

The Efficacy of Attentional Bias Modification for Anxiety: A Registered Replication

Submission: posted 15 September 2023
Recommendation: posted 15 January 2024, validated 17 January 2024
Cite this recommendation as:
Meyer, T. (2024) Examining attentional retraining of threat as an intervention in pathological worry. Peer Community in Registered Reports.


Cognitive models ascribe a pivotal role to cognitive biases in the development and maintenance of mental disorders. For instance, attentional biases that prioritize the processing of threat-related stimuli have been suggested to be causally involved in the development and maintenance of anxiety disorders, including generalized anxiety disorder (GAD), which is marked by pathological worry. Therefore, these biases have garnered significant interest as potential diagnostic indicators and as targets for modification.
The idea that attention bias modification (ABM) can serve as a therapeutic intervention for GAD and other disorders was fueled by a seminal study by Hazen et al. (2009). In this study, 23 individuals experiencing high levels of worry underwent a computerized attentional retraining of threat stimuli (ARTS) or placebo control training during five training sessions. Relative to control, attention retraining was found to reduce preferential attention to threat, as well as depression and anxiety symptoms. However, as Pond et al. (2024) highlight in their review of the literature, the evidence endorsing the efficacy of ABM in alleviating anxiety disorders is still inconclusive. Moreover, some researchers contend that early positive findings might have been inflated due to demand effects.
Based on these considerations, Pond et al. (2024) propose a direct replication of Hazen et al. (2009) by subjecting a high-worry sample to five sessions of ARTS or placebo control. Departing from the frequentist analyses used in the original study, the authors will employ Bayesian analyses, which permit a more nuanced interpretation of the results, including consideration of evidence in support of the null hypothesis. The sampling plan will adhere to a Bayesian stopping rule, whereby the maximal sample size will be set at n=200. Furthermore, the authors extend the original study by addressing potential demand effects. For this purpose, they include a measure of phenomenological control (i.e., the ability to generate experiences that align with the expectancies of a given situation) and evaluate its potential moderating impact on the attention bias training.
The Stage 1 manuscript was evaluated by three expert reviewers in two rounds of in-depth review. Following responses from the authors, the recommender determined that Stage 1 criteria were met and awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA. 
List of eligible PCI RR-friendly journals:
References:
1. Hazen, R. A., Vasey, M. W., & Schmidt, N. B. (2009). Attentional retraining: A randomized clinical trial for pathological worry. Journal of Psychiatric Research, 43, 627-633.
2. Pond, N., Meeten, F., Clarke, P., Notebaert, L., & Scott, R. B. (2024). The efficacy of attentional bias modification for anxiety: A registered replication. In principle acceptance of Version 5 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #3

DOI or URL of the report:

Version of the report: 4

Author's Reply, 15 Jan 2024

Response to Review

  1. I’ve once more carefully looked at the revised Stage 1 Registered Report and the good news is that I believe Stage 1 IPA can be issued shortly. I have just one more request for clarification concerning the procedure to obtain PC scores on p.22: "PC scores will not be collected during the experimental procedure, as most of the participants will already have their PC scores in a PC database maintained by researchers at the University". This appears to include the possibility that PC scores cannot be obtained for individual participants. If this is the case, you might want to specify what will be done in these cases, e.g. exclusion or administration of the PC scale?

Author Response: Thank you, we are glad you find our Stage 1 RR to be of high quality. As now clarified on P.21, any participants for whom we don't have PC scores already on the University database will be excluded from the PC analysis. 

Decision by T. Meyer, posted 13 Jan 2024, validated 14 Jan 2024

I’ve once more carefully looked at the revised Stage 1 Registered Report and the good news is that I believe Stage 1 IPA can be issued shortly. I have just one more request for clarification concerning the procedure to obtain PC scores on p.22: "PC scores will not be collected during the experimental procedure, as most of the participants will already have their PC scores in a PC database maintained by researchers at the University". This appears to include the possibility that PC scores cannot be obtained for individual participants. If this is the case, you might want to specify what will be done in these cases, e.g. exclusion or administration of the PC scale?

Evaluation round #2

DOI or URL of the report:

Version of the report: 3

Author's Reply, 12 Jan 2024

Decision by T. Meyer, posted 07 Dec 2023, validated 07 Dec 2023

I have received reviews from all three experts from the first round of reviews. All three are positive and once again, I completely share their overall positive evaluation, both regarding the proposed study and the revisions. Only a small number of remaining/additional questions have been raised, and I’m looking forward to your point-by-point response. I would only add the small observation that the formulation of hypothesis 3 under “final hypotheses” could be changed to specify the direction of the effect, and I wonder whether the word “significant” can/should be removed from the statistical hypotheses.

Reviewed by Thomas Gladwin, 03 Dec 2023

Thanks to the authors for their responsiveness. I only have a few comments and questions that I thought might be useful to consider.

- "While it is certainly true that to go through on an individual trial basis would be an inappropriate way to analyse such data and would certainly increase the risk of bias, Clarke et al. (2014) observed this effect at the group level, suggesting there is sound evidence for this argument." I didn't understand this sentence, in relation to the counterarguments involving effects of ABM (which otherwise now seem very well described!). As I understand it, the criticism of Kruijt etc *is* about the interpretation of the pattern of results over studies (which is "at the group level" if I understand the phrase here correctly). The issue is that this pattern involves a kind of indirect cherry-picking - i.e., to select studies for an effect on bias at post-test could be to select studies for an effect on (e.g., a clinical) outcome with *any* association with the bias, without this implying specifically that causal relationship that runs from bias to outcome. That's merely one possible interpretation - but, e.g., a sceptical observer could equally posit the possibility that p-hacking will tend to generate pairs of false positives for both bias and outcome that tend to occur together in particular sets of studies; or, perhaps improvements in outcome over time tend to cause changes in bias over time, even if the effect of ABM on outcome was a false positive, so selecting studies on change in bias means implicitly picking out false positives on outcome.

However, I feel like the literature has been presented clearly and sufficiently, so making this argument is up to the authors - whether or not it's a good or bad argument can be judged by readers. I'd just suggest that perhaps the issue is best explicitly described in terms of a high degree of uncertainty and speculation given the available (lack of) evidence - it could well be possible that the pattern of results indeed reflects only some ABM experiments causing a change in bias, and this factor causing a change in outcome; but the pattern of results doesn't provide evidence for that particular interpretation of it over other possibilities.

- "We agree with the response raised by Parsons (2018), whom argues that" - "whom" should be "who".

- "However, in line with advice from a discussion with Professor Zoltan Dienes, we will retain our final analytical decision threshold at BF >= 3 as evidence for H1, and BF <= 1/3 as evidence for H0. This is because if you have the same threshold on your stopping rule as you have on the analytical decision threshold, then the Robustness Regions reported will show no robustness (essentially by design) as you stopped data collection the moment it reached that point." I wasn't sure I understood the argument here. Does "final" in "final analytical decision threshold" mean the threshold used if the maximum sample size is reached? If the stopping criteria are 30 and 1/6, then the thresholds of 3 and 1/3 will be irrelevant except in the case the maximum sample size is reached, but I'm not sure what the problem with "robustness" mentioned in the response would be if the 30 and 1/6 were maintained. However, as above, I'm not an expert: if the authors are comfortable that this is correct and will be clear enough to readers in the text, that is fine; otherwise it might be helpful to try to clarify the rationale.

- "For all Bayes Factors we will adopt the conventional thresholds of values greater than 3 indicating evidence for the alternate hypothesis and values less than 1/3rd indicating evidence for the null." and "Robustness regions will be reported as: RRconclusion [x1, x2], where x1 is the smallest and x2 is the largest SD that gives the same conclusion: B < 1/3; 1/3 < B < 3; B > 3." Possibly related to the above, is this still correct / will this be clear given the proposed changes to 30 and 1/6?

- "In using the procedure detailed by Palfi & Dienes (2019, Version 3, p. 15), it was determined that given a long-term relative frequency of good enough evidence of 50%, the proposed sample size allows for a discriminating Bayes factor (B > 30 if H1 is true, and a B < 1/6 if H0 is true)." Is this still correct, since the numbers in brackets changed while the rest of the sentence didn't?
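To make the robustness-region notation quoted above concrete, here is a minimal Python sketch. It assumes a Dienes-style Bayes factor with a normal likelihood (an effect estimate and its standard error) and a half-normal H1 prior scaled by an SD. These modelling choices, the function names, and the grid of candidate SDs are illustrative assumptions, not the authors' registered analysis code. The thresholds default to the 3 and 1/3 named in the quoted passage.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bf_half_normal(effect: float, se: float, prior_sd: float) -> float:
    """Bayes factor for H1: theta ~ half-normal(0, prior_sd) against H0: theta = 0.

    Uses the closed form for a normal likelihood combined with a
    half-normal prior (assumed model; positive effects are predicted by H1).
    """
    total_var = se ** 2 + prior_sd ** 2
    post_mean = effect * prior_sd ** 2 / total_var
    post_sd = math.sqrt(se ** 2 * prior_sd ** 2 / total_var)
    marginal_h1 = (2.0 * normal_pdf(effect, 0.0, math.sqrt(total_var))
                   * normal_cdf(post_mean / post_sd))
    return marginal_h1 / normal_pdf(effect, 0.0, se)


def conclusion(bf: float, upper: float = 3.0, lower: float = 1.0 / 3.0) -> str:
    if bf > upper:
        return "H1"
    if bf < lower:
        return "H0"
    return "insensitive"


def robustness_region(effect, se, prior_sd, sd_grid, upper=3.0, lower=1.0 / 3.0):
    """Return (conclusion, x1, x2): the contiguous range of prior SDs around
    prior_sd (searched on sd_grid) that yields the same conclusion."""
    grid = sorted(set(sd_grid) | {prior_sd})
    target = conclusion(bf_half_normal(effect, se, prior_sd), upper, lower)
    lo = hi = grid.index(prior_sd)
    while lo > 0 and conclusion(
            bf_half_normal(effect, se, grid[lo - 1]), upper, lower) == target:
        lo -= 1
    while hi < len(grid) - 1 and conclusion(
            bf_half_normal(effect, se, grid[hi + 1]), upper, lower) == target:
        hi += 1
    return target, grid[lo], grid[hi]
```

Passing `upper=30.0, lower=1/6` to `robustness_region` would report the region against the stricter stopping thresholds instead. Which pair should be used for reporting is exactly the question raised in the comments above.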

Reviewed by Jakob Fink-Lamotte, 01 Dec 2023

The authors have answered my comments in detail and satisfactorily - thank you very much. In my view, this is an exciting and methodologically sound study. I am very much looking forward to the results!

Reviewed by anonymous reviewer 1, 04 Dec 2023

I have reviewed the responses made by the authors with regard to my comments. I am overall happy with the responses they gave. One concern remains with regards to the data analysis plan. I understand the considerations of the authors, however, the ABM field would benefit greatly from taking into account the many random factors that come into play and that can have quite a substantial effect on the outcomes. I do however agree with the added value of the Bayesian approach and can see that not all limitations in a field can be addressed in one study. I would recommend acceptance of the stage 1 report at this point. 

Evaluation round #1

DOI or URL of the report:

Version of the report: 2

Author's Reply, 29 Nov 2023

Decision by T. Meyer, posted 02 Nov 2023, validated 02 Nov 2023

I have now received the detailed and helpful evaluations of three experts. They all welcome the proposed replication study as a relevant contribution to the field of ABM research. I share their overall positive evaluation and believe that this submission is a promising candidate for eventual Stage 1 in-principle acceptance. I will not attempt to reiterate all of the detailed and constructive points that have been raised, especially as the reviewers point out specific ways in which these concerns can be addressed. I would only like to highlight a few issues that appear particularly important. 

First, with respect to the adequacy of the sampling plan, I agree with the observation by Dr. Gladwin that the combination of low minimum N (n_min=11 per condition) and a lenient stopping rule (BF>=3) may be perceived as concerning. With these parameters, the risk of false positive evidence appears to be avoidably high, while the achieved evidential standard is only weak to moderate. Regarding this issue, Schönbrodt and Wagenmakers (2018) write: “False positive evidence happens when the H1 boundary is hit prematurely although H0 is true. As most misleading evidence happens at early terminations of a sequential design, the FPE rate can be reduced by increasing n_min (say, n_min = 40). Furthermore, the FPE rate can be reduced by a high H1 threshold (say, BF10>=30). With an equally strong threshold for H0 (1/30), however, the expected sample size can easily go into thousands under H0 (Schönbrodt et al. 2015). To avoid such a protraction, the researcher may set a lenient H0 threshold of BF10<1/6”. Thus, I encourage you to carefully revisit your sampling plan according to these considerations.
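As an illustration of how such a sequential rule operates, the following Python sketch encodes a stopping decision with a minimum N, asymmetric Bayes factor thresholds, and a maximum N. The default values are taken from the figures mentioned in this review history (n_min = 40 and the 30 and 1/6 thresholds from the quoted passage, n = 200 as the maximal sample size). The class and function names are hypothetical, and how N is counted (per condition or in total) is left as an assumption for the sampling plan to define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoppingRule:
    n_min: int = 40                   # no stopping on evidence before this N
    n_max: int = 200                  # data collection ends here regardless
    h1_threshold: float = 30.0        # stop for H1 when BF10 >= this value
    h0_threshold: float = 1.0 / 6.0   # stop for H0 when BF10 < this value


def sequential_decision(bf10: float, n: int,
                        rule: StoppingRule = StoppingRule()) -> str:
    """Decide whether to continue sampling after observing BF10 at sample size n."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if n < rule.n_min:
        # Early Bayes factors are ignored to limit false positive evidence.
        return "continue"
    if bf10 >= rule.h1_threshold:
        return "stop: evidence for H1"
    if bf10 < rule.h0_threshold:
        return "stop: evidence for H0"
    if n >= rule.n_max:
        return "stop: maximum N reached"
    return "continue"
```

For comparison, `StoppingRule(n_min=11, h1_threshold=3.0, h0_threshold=1/3)` would approximate the more lenient parameters that prompted the concern, under the same counting assumption.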

Second, regarding the analysis plan, the reviewers also noted that some clarification is needed regarding the precise statistical methods and the mapping between hypotheses and statistical tests. Other points of note include potential limitations of the operationalization of demand characteristics, and that the presentation of the literature underpinning the research question can be strengthened further. You may also find helpful the suggestion to complement the sampling and analytical approach with the frequentist analyses used by Hazen et al. (2009) and/or a power analysis for the smallest effect size of interest (e.g. to determine n_min).

Reviewed by Thomas Gladwin, 22 Oct 2023

Thank you for the opportunity to review the Stage 1 Registered Report "The Efficacy of Attentional Bias Modification for Anxiety: A Registered Replication".

### Criterion 1A. The scientific validity of the research question(s).

Under this heading, I primarily have some concerns about the presentation of the literature underpinning the research question.

In terms of the literature, the debate around ABM seems to deemphasize arguments from one side, expressed in particular in:

- Kruijt & Carlbring (2018), "Processing confusing procedures in the recent re-analysis of a cognitive bias modification meta-analysis"

- Cristea (2018), "Author’s reply"

E.g., from Cristea's reply: "Yet a larger and more crucial problem relies in the central claim of Grafton et al, echoed by many leading CBM advocates: the effectiveness of these interventions should only be weighed if they successfully modified bias. Kruijt & Carlbring adeptly liken this to familiar arguments for homeopathy. However, it also reflects a fundamental misunderstanding of how causal inferences and confounding function in a randomised design. Identifying the trials in which both bias and outcomes were successfully changed is only possible post hoc, as these are both outcomes measured after randomisation; reverse engineering the connection between the two is subject to confounding. Bias and symptom outcomes are usually measured at the same time points in the trial, thus making it impossible to establish temporal precedence [Kazdin, ref. 4]. Circularity of effects, reverse causality (i.e. bias change causes symptom change or vice versa) and the distinct possibility of third variable effects (i.e. another variable causing both symptom and bias changes) further confound this relationship [Kazdin, ref. 4]. For instance, trials where both bias and symptom outcomes were successfully modified could also be the ones with higher risk of bias, conducted by allegiant investigators, maximising demand characteristics or different in other, not immediately obvious, ways from trials where neither bias nor symptoms changed. Randomised controlled studies can only show whether an intervention to which participants were randomised has any effects on outcomes measured post-randomisation [Kaptchuk, ref. 5]. Disentangling the precise components causally responsible for such effects is speculative and subject to confounding. To this point, randomised studies show CBM has a minute, unstable and mostly inexistent impact of any clinically relevant outcomes."
While this is all in the context of a debate with clearly varying opinions on the merits of different positions and analyses, it does seem to me important to accurately represent all sides and present any strengths of their arguments as well as possible.

I'd additionally suggest that another elephant in the room that would be worth mentioning, especially given the advantages of the current approach of writing a registered report, is the replication crisis and the potential role of questionable research practices in general, to which ABM/CBM research hasn't necessarily been immune.

However, also with an arguably fuller representation of the debates, I still think the research questions of the registered report remain scientifically valid.

### 1B. The logic, rationale, and plausibility of the proposed hypotheses, as applicable. 

I have no concerns with the hypothesis of an effect of the ABM training.

The secondary hypothesis, on demand characteristics, seems only partly sound. The issue is how strong and one-to-one the auxiliary assumptions would have to be to work back from a possible null effect on the current measure of Phenomenological Control to a conclusion on demand characteristics as envisioned, in particular, by Cristea et al. (2015).

### 1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis or alternative sampling plans where applicable). 

I am not an expert in Bayesian methods, so these comments are only intended as observations for consideration in case they're helpful.

First, I think a replication of Hazen et al. (2009)'s statistical approach would be very helpful to include, even if the authors specify that their Bayesian approach takes precedence for their conclusions. If there's a discrepancy, say, a non-significant effect using significance testing but evidence considered supportive with the Bayesian analysis, then I think readers might want to know and evaluate what could explain that.

Second, relatedly, I'd be concerned if the current method produced a sample that would be considered underpowered from other perspectives; in principle, as per the current method, this could potentially end up being N=23. This also perhaps relates to the Bayes Factor cut-offs proposed here (i.e., the analogue to the .05 p-value) of 3 and 1/3, which are only just past what would be considered "weak" and into a "moderate" range (see, e.g., van Doorn et al., 2021, The JASP guidelines for conducting and reporting a Bayesian analysis). It seems that the approach, depending on how the first few dozen observations work out, might allow a "support-refute" decision that would easily be overstated given the evidence. E.g., from van Doorn et al. (2021), "The strength of evidence in the data is easy to overstate: a Bayes factor of 3 provides some support for one hypothesis over another, but should not warrant the confident all-or-none acceptance of that hypothesis."
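The verbal categories referred to above can be sketched as a small Python helper. It assumes the commonly used convention in which a Bayes factor between 1 and 3 counts as weak, between 3 and 10 as moderate, and above 10 as strong evidence, applied symmetrically to the reciprocal when the null is favoured. The function name and the treatment of values falling exactly on a boundary are assumptions made for illustration.

```python
def evidence_label(bf10: float) -> tuple[str, str]:
    """Return (favoured hypothesis, verbal strength) for a Bayes factor BF10."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 == 1:
        return ("neither", "none")
    favoured = "H1" if bf10 > 1 else "H0"
    # Express the evidence as a ratio >= 1 in favour of the preferred hypothesis.
    strength = bf10 if bf10 > 1 else 1.0 / bf10
    if strength < 3:
        label = "weak"
    elif strength < 10:
        label = "moderate"
    else:
        label = "strong"
    return (favoured, label)
```

Under this convention a study stopped at a Bayes factor of exactly 3 sits at the very bottom of the moderate band, which is the point being made about the proposed cut-offs.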

### 1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses. 

As above, I'm not very qualified to comment here.

### 1E. Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls; other quality checks) for ensuring that the obtained results are able to test the stated hypotheses or answer the stated research question(s). 

As noted above, it doesn't seem like a null effect for the secondary hypothesis would be very meaningful, at the design/measures level; i.e., even if very strong Bayesian evidence for the null were found, this wouldn't address whether the one particular operationalization adequately represents the effect of demand characteristics. This potentially could be mitigated by creating a more meaningful test of demand characteristics, e.g., by including additional measures and concepts. Or, this test could be acknowledged to be quite weak and not to be overinterpreted. Maybe it would even be useful to take a more exploratory, qualitative view and use interviews asking participants about experiences related to demand characteristics.

Reviewed by Jakob Fink-Lamotte, 30 Oct 2023

In the proposed Stage 1 replication study, the work of Hazen et al. (2009) will be directly replicated.

In my view, it makes absolute sense to replicate ABM studies. In particular, I think that the variance in the findings is due to the fact that many researchers repeatedly change the experimental paradigms in such a way that comparability is more difficult (e.g., different presentation times, image sizes, designs, etc.). This is a point that the authors can also gladly include in their argumentation. Besides that, I have the following other aspects that the authors should take up in the theory in order to derive the research question and hypotheses more clearly:

  • Delimitation and connection between attention and interpretation bias should be described in more detail in the theory. Concerning this aspect, it would be great if the authors would explain why it is worth looking more closely at only one bias and not at the connection, e.g. in the context of a combined cognitive bias hypothesis (Everaert et al., 2012)? (Everaert, J., Koster, E. H., & Derakshan, N. (2012). The combined cognitive bias hypothesis in depression. Clinical psychology review, 32(5), 413-424.)
  • It is somewhat irritating that the authors highlight in great detail and appropriately the previous meta-analytic effects on ABM in different disorders, but then propose a replication of a study in which precisely this distinction does not matter. Perhaps it would be sufficient for the authors to focus more on the results for GAD on pages 6 to 8.
  • Could the authors give a direct example of phenomenological experiences? (p. 10)

The design is very detailed and accurately presented. It would be helpful if the authors could present the central hypothesis again more clearly on p. 9, describe here which outcomes exactly confirm the hypotheses and also take up the hypothesis again explicitly in the context of the presentation of the Bayesian stopping rule (p. 10). It would also be helpful for the "mapping between hypothesis and statistical tests" if the authors would present the hypothetical predictions again in a formal-statistical way.

For a better understanding, a figure presenting the trial procedure would be helpful.

Were the word pairs validated with respect to valence and arousal?

Further notes:

  • The planned analyses seem very appropriate and adequate to answer the research question.
  • The sample size is sufficiently planned - especially with regard to the hypothesis.
  • By using Bayesian hypothesis testing, the authors will not infer evidence of absence from null results.
  • There are already positive ethical votes for the study.

Reviewed by anonymous reviewer 1, 25 Oct 2023

Review of ‘the Efficacy of Attentional Bias Modification for Anxiety: A Registered Report’
The authors’ pre-registered report describes a relevant replication in the ABM field with a valuable addition; namely addressing the demand effects within the laboratory setting. The report is well-written and incorporates a clear theoretical overview of the ABM literature. I have some concerns specifically pertaining to the data analysis strategy that was chosen that should be addressed before the manuscript can be resubmitted. If this is addressed, the study will make a valuable addition to the literature.
- The authors address attention bias for GAD. It would be helpful, especially for generalized anxiety, to give an example of attention bias for GAD.
- Where the procedure of ABM is introduced in the introduction, please make clear that the target probe replacement is manipulated so that it more often replaces the neutral stimulus.
- Can the authors still add a reference for the study on demand effects in the lab on p. 5?
- In the replication of the study by Hazen et al., the authors choose to follow the choice for a composite of anxiety and depression as primary dependent variable. Even though comorbidity with depression is high for individuals with GAD, in a high worry, subclinical sample, this won’t be relevant for all individuals and may obscure results. I wonder whether this composite outcome variable is also the standard in other ABM trials for general anxiety. I would like to see a short discussion on this in the introduction to help the reader place this choice adequately in the literature. I would suggest to at least also analyze these two constructs (anxiety and depression) separately (if necessary, in a supplementary file).
- I wonder about the role of baseline attention bias levels. This varies considerably in the literature (and probably specifically in a subclinical sample) and has also led to mixed results in the CBM field. It would thus make sense to at least control for this in the analyses. 
- My main concern is with the data-analysis part. It does not become entirely clear to me what specific analyses are being conducted. The authors describe their reasons for conducting Bayesian analyses instead of the original analyses, which are sound. However, which specific type of Bayesian analyses (e.g., based on ANOVAs, mixed effects models?) will be conducted, and with which type of program (e.g., how will the Bayes factor be computed, with which program)? Please clarify this.
I would suggest (if not already implied in the data-analysis section) to conduct mixed effects models considering the nestedness and random factors inherently present in dot-probe/ABM designs (e.g., trials nested within persons, training sessions nested within persons, random slope for stimuli etc.). Mixed effects models can also be conducted ‘Bayesian style’ (see the brms package, which is very user-friendly). Further, I would suggest, in the interest of replication, to conduct the original analysis of the Hazen et al. study as well to be able to make fair comparisons.
- It would be helpful for the authors to explicitly state whether certain choices are in line with the study by Hazen et al. For example, it is unclear whether the decision to schedule trainings twice a week is in line with the study by Hazen et al.
Some additional small points: 
- Please add a reference for the Bayesian analyses on p.11
- What is the PSWQ >60 score based on? Please include a reference.
- Is the maximum of N=200 based on previous studies?
- Some small spelling/punctuation errors were found. The authors should check the text again for these errors. For example, on p.3 in the Wittchen et al. reference and on p. 7 (‘disorder’ instead of ‘disorders’).
