Exploring determinants of test-retest reliability in fMRI

Based on reviews by Xiangzhen Kong and 2 anonymous reviewers
A recommendation of:

Test-Retest Reliability in Functional Magnetic Resonance Imaging: Impact of Analytical Decisions on Individual and Group Estimates in the Monetary Incentive Delay Task

Submission: posted 17 April 2023
Recommendation: posted 22 July 2023, validated 23 July 2023
Cite this recommendation as:
Bishop, D. (2023) Exploring determinants of test-retest reliability in fMRI. Peer Community in Registered Reports.


Functional magnetic resonance imaging (fMRI) has been used to explore brain-behaviour relationships for many years, with proliferation of a wide range of sophisticated analytic procedures. However, rather scant attention has been paid to the reliability of findings. Concerns have been growing following failures to replicate findings in some fields, but it is hard to know how far this is a consequence of underpowered studies, flexible analytic pipelines, or variability within and between participants. 
Demidenko et al. (2023) plan a study that will be a major step forward in addressing these issues. They take advantage of the availability of three existing datasets, the Adolescent Brain Cognitive Development (ABCD) study, the Michigan Longitudinal Study, and the Adolescent Risk Behavior Study, which all included a version of the Monetary Incentive Delay task measured in two sessions. This gives ample data for a multiverse analysis, which will consider how within-subject and between-subject variance varies according to four analytic factors: smoothing (5 levels), motion correction (6 levels), task modelling (3 levels) and task contrasts (4 levels).  They will also consider how sample size affects estimates of reliability. This will involve a substantial amount of analysis.
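The factorial scale of this multiverse can be made concrete with a short sketch. The level names below are placeholders, not the authors' actual option labels (those are specified in the Stage 1 protocol), but the combinatorics match the four factors described above:

```python
from itertools import product

# Placeholder level names; the actual options are listed in the protocol.
smoothing  = [f"smooth_{i}" for i in range(5)]  # 5 smoothing levels
motion     = [f"motion_{i}" for i in range(6)]  # 6 motion-correction options
task_model = [f"model_{i}"  for i in range(3)]  # 3 task models
contrast   = [f"con_{i}"    for i in range(4)]  # 4 task contrasts

# Crossing all factors yields every analytic pipeline in the multiverse.
pipelines = list(product(smoothing, motion, task_model, contrast))
print(len(pipelines))  # 5 * 6 * 3 * 4 = 360 pipelines
```

Each of the 360 pipelines must then be run on every dataset and session, which is what makes the planned analysis so substantial.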
The study is essentially focused on estimation, although specific predictions are presented regarding the combinations of factors expected to give optimal reliability.  The outcome will be a multiverse of results which will allow us to see how different pipeline decisions for this task affect reliability. In many ways, null results – finding that at least some factors have little effect on reliability – would be a positive finding for the field, as it would mean that we could be more relaxed when selecting an analytic pathway. A more likely outcome, however, is that analytic decisions will affect reliability, and this information can then guide future studies and help develop best practice guidelines. As one reviewer noted, we can’t assume that results from this analysis will generalise to other tasks, but this analysis with a widely-used task is an important step towards better and more consistent methods in fMRI.
The researchers present a fully worked out plan of action, with supporting scripts that have been developed in pilot testing. The Stage 1 manuscript received in-depth evaluation from three expert reviewers and the recommender. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 2. At least some data/evidence that will be used to answer the research question has been accessed and partially observed by the authors, but the authors certify that they have not yet sufficiently observed the key variables within the data to be able to answer the research questions AND they have taken additional steps to maximise bias control and rigour.
List of eligible PCI RR-friendly journals:


1. Demidenko, M. I., Mumford, J. A., & Poldrack, R. A. (2023). Test-Retest Reliability in Functional Magnetic Resonance Imaging: Impact of Analytical Decisions on Individual and Group Estimates in the Monetary Incentive Delay Task. In principle acceptance of Version 3 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #2

DOI or URL of the report:

Version of the report: 1

Author's Reply, 21 Jul 2023

Currently both read as very generic investigations of reliability in task fMRI, and the specific task focus is only mentioned quite briefly/late (pg. 7 in the introduction and in parentheses in the abstract). A clearer and earlier statement regarding the focus on the MID task and why would be helpful for appropriately contextualizing the findings.

  • We have revised the title of the manuscript to include "Monetary Incentive Delay task".

Decision by Dorothy Bishop, posted 21 Jul 2023, validated 21 Jul 2023

Dear Dr Demidenko

I sent your revised manuscript to the three original reviewers and all were impressed with the way you had responded to the issues raised. There is one outstanding issue noted by reviewer 3, who felt it was still not clear enough that your analysis focused on a specific task, so results may not generalise. I think that the simplest and most effective way of dealing with that would be to change the title to:

Impact of analytic decisions on test-retest reliability of individual and group estimates in functional magnetic resonance imaging: a multiverse analysis using the monetary incentive delay task

If you are happy with that (or other wording that includes the specific task in the title), then I am happy to go ahead and recommend in-principle acceptance.

Best wishes


Reviewed by anonymous reviewer 1, 05 Jul 2023

I would like to thank the authors for their detailed responses and useful clarifications. I understand that they have to make some choices, given the many parameters that might impact on the reliability of test-retest fMRI data. I look forward to reading the main results in the next stage.  


Reviewed by Xiangzhen Kong, 01 Jul 2023

The authors have addressed my questions well. I have no further comments. 

Reviewed by anonymous reviewer 2, 18 Jul 2023

I appreciate the authors’ revisions of their registered report. I generally found them quite receptive and responsive to the suggested adjustments made by the reviewers, and that the report is substantially improved. I am eager to see the results of this work!

My only remaining comment regards the generalizability of the results from the current study and how the authors plan to address this issue. I thought the authors’ motivation to limit analyses to the MID task was reasonable (given the broad range of analyses already proposed and the ubiquity of this task in the field). However, I also think that it is very unclear whether the results of this study are likely to generalize to other task contrasts in very different domains. As such, it seems important to me that the authors acknowledge their specific focus on the MID task further in their abstract and introduction. Currently both read as very generic investigations of reliability in task fMRI, and the specific task focus is only mentioned quite briefly/late (pg. 7 in the introduction and in parentheses in the abstract). A clearer and earlier statement regarding the focus on the MID task and why would be helpful for appropriately contextualizing the findings.

Evaluation round #1

DOI or URL of the report:

Version of the report: 1

Author's Reply, 30 Jun 2023

Decision by Dorothy Bishop, posted 06 Jun 2023, validated 06 Jun 2023

Dear Dr Demidenko

I now have three reviews of your submission to PCI:RR by experts in the field. All reacted favourably to the protocol, and I agree this paper covers an important topic – it’s also very well-written and the analysis plan is coherent.

Nevertheless, the reviewers have a number of queries and suggestions for improvement.  Most of these are straightforward, so I will not repeat what is said in the reviews, but there are a few issues where I have additional thoughts.

Assessment of reliability

The question of how to assess reliability has prompted queries from reviewers 1 and 2: given that even experts find this confusing, my recommendation is that you take the opportunity here to expand the introduction a little to say a bit more about how the different methods work. You do cite excellent primers – I was pleased to be pointed to the papers by Liljequist et al (2019) and Noble et al (2021) – but I think that a point that does not come across, and can be confusing, is the key role of systematic change across sessions. I suggest you don’t mention the six ways of computing ICC: as Liljequist notes, most are irrelevant for the current context. An easier way to explain the ICC is to say that in a test-retest design it is equivalent to the Pearson correlation if there are no systematic differences between the two sessions (i.e. no practice effects). The larger the mean differences between sessions, the greater the difference between ICC and Pearson r, with the ICC being lower. If introduced this way, then I think it may be easier for people to grasp the point about between-subjects and within-subjects sources of variance, as most readers will be familiar with the idea that r is affected by restriction of range.
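The point about session mean differences can be checked numerically. The sketch below uses simulated data with illustrative noise levels (not values from the manuscript) and computes an absolute-agreement ICC(2,1), following the Shrout & Fleiss mean-squares formula, alongside Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
true = rng.normal(0, 1, n)                # stable between-subject signal
s1 = true + rng.normal(0, 0.5, n)         # session 1 scores
s2 = true + rng.normal(0, 0.5, n) + 0.8   # session 2, with a practice effect

def icc_2_1(x, y):
    """Two-way random-effects, absolute-agreement ICC(2,1) (Shrout & Fleiss)."""
    data = np.column_stack([x, y])
    n_sub, k = data.shape
    grand = data.mean()
    ms_rows = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n_sub - 1)
    ms_cols = n_sub * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = data - data.mean(axis=1, keepdims=True) - data.mean(axis=0) + grand
    ms_err = (resid ** 2).sum() / ((n_sub - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n_sub)

r = np.corrcoef(s1, s2)[0, 1]
icc_shift = icc_2_1(s1, s2)           # practice effect included
icc_noshift = icc_2_1(s1, s2 - 0.8)   # same data, session shift removed
print(f"Pearson r = {r:.3f}, ICC = {icc_shift:.3f}, "
      f"ICC (shift removed) = {icc_noshift:.3f}")
```

With the 0.8 practice effect in place the ICC falls clearly below r, because the session shift counts as disagreement; with the shift removed the two estimates nearly coincide, as described above.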

Reviewer 2 makes the point that behavioural variability is also important. I would also add that, if we see between-subjects variation as important, then it is important to have a bit more information on the participants – do we have any information on their demographics? Are they very uniform in terms of social background, ethnicity, scholastic attainment? A few key details, including gender, and method of recruitment of participants, should be reported, even if they are not used in analysis, just to make it possible to compare with other samples. Again, with the two adult samples, some demographic information is needed. The “Neuropsychological Risk” cohort of MLS is to be used: how was that risk defined and how were they recruited?

Given that the brain is changing rapidly during adolescence, one might expect greater between-subject variability in task performance for the ABCC sample than the young adult samples of MLS and AHRB. This is briefly alluded to on line 155, but it’s not clear exactly what the prediction is here. (i.e. what is ‘more the case’?).


I can see that there is a limit to the number of factors that you can consider in this kind of analysis, but it did seem a bit surprising that you did not consider thresholding, which is likely to have a key effect. Reviewer 1 queried your choice of threshold. And like Reviewer 2, I wondered what we would learn from analysis of ‘task irrelevant voxels’. If you do want to stick with the proposed analysis, it needs to be better justified. (I would also question whether being below threshold makes a voxel “task-irrelevant” – what about voxels that show significant task-related *deactivation*? – I’m not suggesting you do analyses of that, but am flagging up the need for caution in use of language here.)

Aim 2

Like reviewer 1 I found the account of the MSBS and MSWS analysis hard to follow. If the ICC is positive, then MSBS will always be greater than MSWS. Won’t the ratio between these be predictable from the ICC? I tried a little simulation, and for the range I looked at, they were monotonically related, though in a nonlinear fashion. I think if one reviewer and I find this confusing, so will other readers, so it is worth double checking this aspect. And, given that the ICC will be lower if there are differences in mean between sessions 1 and 2, it may be worth considering how far low ICCs are driven by practice effects.
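The simulation mentioned above is easy to reproduce, and for a one-way ICC with two sessions the relationship is in fact exact: MSBS/MSWS = (1 + ICC)/(1 − ICC), a monotone but nonlinear mapping. A sketch with illustrative variance values (not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)

def msbs_msws_icc(sd_between, sd_within, n=20000):
    """Simulate n subjects x 2 sessions; return (MSBS/MSWS, one-way ICC)."""
    true = rng.normal(0.0, sd_between, n)
    data = true[:, None] + rng.normal(0.0, sd_within, (n, 2))
    msbs = 2 * np.var(data.mean(axis=1), ddof=1)        # between-subject MS
    msws = ((data[:, 0] - data[:, 1]) ** 2 / 2).mean()  # within-subject MS
    icc = (msbs - msws) / (msbs + msws)                 # ICC(1) for k = 2
    return msbs / msws, icc

# Hold between-subject SD at 1 and vary the session (within-subject) noise.
sd_levels = (0.3, 0.7, 1.5)
results = [msbs_msws_icc(1.0, sd_w) for sd_w in sd_levels]
for sd_w, (ratio, icc) in zip(sd_levels, results):
    print(f"sd_within = {sd_w}: MSBS/MSWS = {ratio:6.2f}, ICC = {icc:.3f}")
```

Because the mapping is one-to-one, the mean-square ratio carries no information beyond the ICC itself in this one-way case, which is the redundancy the paragraph above asks the authors to double-check.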

Adaptive design?

It’s great to see large datasets being used to address the question of reliability, but your plan involves a huge amount of analysis, and I wondered if it would make sense to adopt a more adaptive approach. For instance, suppose you did an initial analysis of data from 500 of the ABCC participants, and found that some of the analysis factors had no material effect on reliability. Rather than slogging on through the next 1500 participants, you might then decide to drop that variable – and perhaps substitute another (e.g. threshold).

It is possible to incorporate an adaptive design in a Registered Report, though it does add complexity. I suspect, though, that you have already put some thought into which factors to vary in your analysis, and may have discarded some promising candidates. So you could have a prioritised list to work through. E.g. with 1st 500 participants, do study as planned. Then for next 500 continue with optimal level of factors you have tested and vary other steps in the analytic pipeline from your prioritised list.

Please regard my comments in this section as suggestions rather than requirements. You are free to disregard if you prefer.

I was also thinking about the logic of specification curve analysis in this context. I think the value of doing this is greatest when you have a number of predictors that may interact, and so you need to look at all combinations. You are including predictors of two different types: it seems feasible that the smoothing and motion correction processes could interact and so it makes sense to look at all 5 x 6 combinations of those. But the task modelling and contrasts seem a different type of predictor, where it is likely that there may be just a main effect, with one level of each predictor being optimal. Task modeling and contrasts may interact with one another, but it doesn’t seem likely that their impact will depend on the smoothing/movement correction settings. So I wondered if a more economical approach would be to start with specification curve analysis for the 12 task model x task contrast combinations, using the smoothing and motion correction settings you anticipate as optimal. This may allow you to identify clear winners for task analysis settings, and these could then be set as constant for exploring the smoothing/motion settings. I don’t insist you do this – I have little experience with this method, and am happy to be persuaded that the full specification curve analysis with 360 combinations should be done. But if you can save time and energy by reducing the amount of analysis without compromising the value of the study, that would be good.

A related point concerns Aim 3, i.e. finding the sample size at which the ICC stabilizes. Since the standard error shrinks with the square root of N, won’t this just follow the trajectory of a plot of 1/sqrt(n) by n? (See below for an illustrative figure.)

If so, then N = 2000 is overkill, as the marginal benefits of additional subjects get pretty small after about 500. This might be another justification for adopting a more adaptive approach to allow more flexible use of this large sample, and possibly to explore other factors affecting reliability.
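The diminishing-returns argument can be made concrete with a few lines, using 1/√n as a proxy for the relative standard error of the estimate:

```python
import numpy as np

# Relative standard error proxy: precision scales roughly with 1 / sqrt(n).
ns = np.array([100, 250, 500, 1000, 2000])
se = 1 / np.sqrt(ns)
for n, s in zip(ns, se):
    print(f"n = {n:5d}: 1/sqrt(n) = {s:.4f}")
# Going from 100 to 500 participants cuts the proxy by more than half;
# going from 500 to 2000 only halves it again, at the cost of 1500 more
# participants' worth of analysis.
```

This is only a heuristic for how the curve should behave, not a power analysis, but it illustrates why the marginal benefit of additional subjects becomes small after roughly 500.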

The three samples

I may have missed this, but I could not find a clear rationale for running the analysis on 3 samples. I assume this is to see if there is generalisability of findings across samples/scanners etc?


Reviewers 2 and 3 have particular concerns about generalisability, noting that any recommendations for optimising reliability that come from this analysis may not generalise to other tasks/samples. While this is indubitably the case, my sense is that you have to start somewhere, and just doing this for one task across different samples involves a substantial amount of work. So I think what you have proposed here is worthwhile, and could set a standard for others working with different tasks to follow. However, it is clear that this will be a criticism of the study that you should anticipate, and as noted by reviewer 3, the limitations on generalisability do need to be emphasised.

Evidence for accepting/rejecting hypotheses

We could do with a bit more clarity in how you propose to interpret the output from hierarchical linear models. What would be the criterion for deciding a particular processing choice (or combination of choices) was optimal? What would be the criterion for deciding that it made no difference? Would these interpretations be influenced by the overall level of reliability?

Open scripts and data

I see problems with the statements about accessing the MLS and AHRB datasets, as these depend on the permission of named individuals who are not immortal and could become unavailable. I appreciate that when using data provided by others one cannot dictate the terms of data access, but it is problematic if the analyses planned here are not reproducible. I’m not sure what the solution is, but hope that there is one. Ideally, the owners of these datasets might be persuaded of the value of depositing the BIDS compliant data on a repository such as Open Neuro or Neurovault.

In terms of the scripts, it is good to see these openly available; before publication, we will need a version with a DOI, but I gather this is possible with GitHub integration with Zenodo.


Minor points

Line 25 – ‘scripts for’

Line 100- I would reword for clarity: ‘…relative to individual differences, and to assess….’

References: we aren’t fussy about formatting and you can ignore this, but when you come to submit to a journal they may prefer uniform case (there’s a mix of sentence case and title case).

In sum, I am pleased to invite you to submit a revision. Please include a response letter where you address each comment by reviewers and those raised above point-by-point. To facilitate a quick turnaround, please also include a version with changes tracked/highlighted. Also please ensure that the link on your submission leads directly to the manuscript. This can be the version with highlighted changes - upon in-principle acceptance of the Stage 1 manuscript the highlights can be removed.

Yours sincerely


Dorothy Bishop

Reviewed by Xiangzhen Kong, 26 Apr 2023

The aim of this registered report is to investigate how certain analytic decisions affect individual and group-level reliability in task fMRI using three independent datasets and 360 different modeling permutations. While the report is well-written, I have a few minor comments.

For estimating reliability at the group level, the authors plan to use "significance thresholded (p< 0.001, uncorrected) group [binary] estimates." This is not a common thresholding strategy in published studies. To enhance the applicability of the results, the authors may consider using a more commonly used group analysis thresholding strategy, e.g., one that combines an uncorrected p threshold (e.g. 0.001) with a cluster size threshold.

The authors may clarify why two metrics (Jaccard's similarity coefficient and tetrachoric correlation) are planned to be used and whether they provide complementary information on reliability.

“the Aim 2 Hypothesis is that, on average, there will be higher between-subject than within-subject variance across the three samples within task-relevant regions.” The wording of the Aim 2 hypothesis is a bit confusing. E.g., we may never expect between-subject variance to be lower than within-subject. The authors may consider rephrasing it to make it clearer. Additionally, it would be better to clarify how the "task-relevant regions" will be defined. 

“Aim 3 evaluates at what sample the ICC stabilizes using the most optimal pipeline (e.g., highest ICC) from Aim 2.” Is it correct that all samples will be used for Aim 2, and only smaller subsets of the samples will be used for Aim 3? 

Year 1 and year 2 data were used for each of the three datasets. As reliability estimates may decrease with the time lag between two scans, I would suggest the authors report the exact time lag in each dataset.

“Structural and functional MRI preprocessing is performed using (?) 23.0.0rc0”. The package name is missing. Is it fMRIPrep? 

The meaning of A/B/C/D in Figure 1c and Equation 3 is unclear. In addition, there seems to be some confusion regarding the two "Ses-1" in the first row. The authors may consider clarifying these points to enhance the readability of the manuscript. 


Reviewed by anonymous reviewer 2, 05 Jun 2023

In this Stage 1 registered report, Demidenko and colleagues propose to examine how different options in the fMRI analysis pipeline affect the reliability of task activation results. Specifically, the authors propose to examine how different levels of smoothing, forms of motion correction, methods of task parameterization, and types of task contrasts influence the reliability of results. The authors state they will examine this question using two runs of Monetary Incentive Delay task data (anticipatory period), from 3 different datasets, with sample sizes of 65, 150, and 2000. A combination of ICC and Jaccard overlap similarity will be used to assess reliability. Aim 1 proposes to determine which “multiverse” analytical set of options produces the most reliable results, Aim 2 proposes to separately examine how analysis steps affect the size of within- and between-subject variance, and Aim 3 proposes to investigate how sample size affects the reliability of the optimal analysis pipeline.


In the past 5 years or so, a number of papers have noted the low reliability of many fMRI results. This low reliability likely limits the ability to interpret fMRI data and use this data to assess brain-behavior relationships, a topic that has also received recent scrutiny. The current planned project represents an important contribution in this domain, in terms of addressing the extent to which different analysis options will influence reliability. I generally found the report to be well written and analyses to be carefully explained, with hypotheses and tests well laid out. I have a few suggestions, however, that I think would increase the impact of this work.


Major suggestions

-       My biggest concern is that the current work is based on only a very narrow set of data: anticipatory cue periods from the monetary incentive delay task. This will strongly limit the generalizability of the current results – it is unclear if the findings from this particular work would have any bearing on many other potential tasks that could be run. Given the premise of the article, it’s not clear why the authors chose to limit themselves in this way. This work would have a much bigger impact if the authors were to test the pipelines across a range of tasks (e.g., each of the tasks in the ABCD, HCP). It may not be possible to apply all of the combinations of analysis steps in the new tasks (e.g., the different versions of task parametrization), but even if only a subset of analysis options were repeated, it would be very helpful for understanding the degree of generalizability of the current results to other contexts. If the authors choose to keep their focus narrowly on the MID, this should be reflected more clearly in the title and abstract, and in discussing the types of conclusions that can be drawn from the current work.

-       The authors discussed reliability issues at length, but did not touch on the topic of validity. These usually go hand in hand, as methods to increase reliability at times reduce the validity or utility of a dataset – for example, consistency of datasets is likely to increase monotonically with smoothing, although this will come at the cost of being able to distinguish meaningful/valid results from distinct areas of the cortex. Similarly, different task contrasts will have different levels of validity in terms of the underlying cognitive constructs that they can be connected with. At a minimum, I think this bears additional discussion in the final manuscript. If the authors could also propose a confirmatory test of validity (e.g., using a motor task to find distinct components of the motor system), that would also help with the interpretation of the results.

-       I was surprised that the authors did not consider the amount of per subject data (rather than the total number of subjects) in their evaluation of the reliability, given that recent reports have consistently found this to be a primary way of improving reliability of the findings. This might make sense if the focus of the work is only on analytic means to improve reliability, but given the proposed analyses in Aim 3, it seems reasonable to ask whether the amount of data per subject or number of subjects makes a bigger difference for reliability (I would guess the first).

-       In a similar vein, the current work is based on a very small amount of data from each participant (comparison of single runs), and for a set of contrasts that likely have relatively small effects (e.g., relative to sensory/motor stimulation). I am concerned that the authors will be at floor in terms of reliability, preventing them from finding meaningful differences across analysis strategies. Is there any evidence that this will not be the case? Including a positive control dataset with more data per participant would help to alleviate this issue. The hypothesis that between subject variance is larger than within subject variance may not be true in the context of this very small amount of data.


Minor suggestions

-       Effects of younger vs. older samples may be influenced by study design (e.g., fMRI imaging parameters, block lengths, etc.) that are not matched across datasets. How will this be addressed? It seems like it may be challenging to interpret differences across datasets.

-       Table 1: #5 says “censor high motion frames” but does not say what the cut-off for these will be; this should be specified in the report.

Reviewed by anonymous reviewer 1, 31 May 2023

Title: Test-Retest Reliability in Functional Magnetic Resonance Imaging: Impact of Analytical Decisions on Individual and Group Estimates

By: Demidenko and colleagues

This is an interesting and well-written report that addresses an important question about the reliability of fMRI-based brain mapping with respect to the relative size of intra-subject (inter-session) and inter-subject variability. The main strength of this paper is the systematic comparison of multiple combinations of different methodological choices (almost 360 different parametrizations) in fMRI data analysis while task is held constant. The authors also considered the use of different datasets with variable sample sizes. Overall, the topic of this paper is important and relevant to current debate(s) about the consistency of fMRI data analysis procedures.

I have the following comments about the current version.

First, while the topic is important, the novelty of this work is not clear. There is already a massive literature about test-retest reliability in fMRI in different domains (visual, motor, language, memory, decision making…etc.). Many previous test-retest studies are not cited in the current version. I believe it would be useful to discuss reliability with respect to its origin/source and its implications in fMRI: e.g. differences due to situational factors (e.g. Dubois and Adolphs 2016 TICS) or ‘meaningful’ differences that reflect inherent cognitive and behavioral differences (Seghier and Price 2018 TICS). For instance, if task performance would vary across sessions, should fMRI activations vary as well to reflect such differences in task performance? My point here is that test-retest reliability in task performance (behavioral data) should not be overlooked when explaining test-retest reliability in fMRI (as long as there is variability in behavioral data there will be variability in fMRI data). 

In the same way, performance during the selected MID task varied across runs and subjects. The targeted accuracy was set to 60%, which is relatively low (is the task challenging?). How will the first-level GLM models be defined? The authors mentioned on Page 12 that the design matrix included 15 task-relevant regressors (5 cue and 10 feedback types), but it is not clear whether all trials will be included. For example, it is likely that correct trials might show different activations (and different reliability) than incorrect trials. Again, task performance is a huge confounder in this test-retest type of analysis and thus the authors should clarify how correct versus incorrect trials will be modelled.

As the authors know, test-retest reliability may vary with task. For instance, it is likely that tasks involving primary sensory cortices might show higher reliability than tasks activating high-level processing regions. Even for the same domain, reliability might vary with activation condition (e.g. see for the language domain: Otzenberger et al. 2005). In this context, I’m not sure what is the exact impact of the expected findings on current practices; i.e. the best parameterization (or combination of methodological choices) might not generalize to other contexts that (1) use different tasks and cognitive domains, (2) different populations, (3) different time points between repeated sessions, (4) different magnetic field strengths, (5) different paradigms (block, event-related, mixed…etc.), (6) different acquisition protocols (high temporal or spatial resolution EPI), (7) variable run/session length (e.g. Genovese et al. 1997 MRM)…etc. There are so many factors that can impact upon the BOLD signal and its reliability. Also, there are already many guidelines (recommendations) about optimal pipelines and analysis protocols to improve reliability and thus it would be nice if the authors could spell out the expected implications for current practices.

I find the inclusion of task-irrelevant voxels slightly confusing. Basically, task-irrelevant voxels are just the complement of the set of task-relevant voxels (assuming relevant voxels + irrelevant voxels = the whole brain as a constant). So, I’m not sure what would be gained by that inclusion, in particular when using binary maps to calculate similarity metrics like Jaccard. Note also that the definition of task-irrelevant voxels is contingent on the selected threshold and the baseline/control condition used.

Some statements need to be integrated in a more coherent way; e.g. Line 161 “there will be higher between-subject than within-subject variance across the three samples within task-relevant regions” and Line 329 “the reliability within sessions would be hypothesized to be greater than between sessions” versus Line 395 “To test whether the between-subject variance is lower than within-subject variance consistently across the three samples”. 

In Line 735, I didn’t get what the authors meant by this statement: “so it could be the case that a less efficient model captures the data variability better and the estimated variance of contrast estimates is reduced.” How should this statement be understood with respect to the modelling efficiency shown in Figure S2?

In the same way, Line 134 “However, there is little empirical research on whether the culprit in the reportedly low reliability of fMRI signal across measurement occasions is a decreased between-subject and/or an increased within-subject variability.” If there is a decrease in between-subject variability, it would then make sense to expect a subsequent decrease in reliability. I feel there is a confusion in the current version (at least to me) about the exact impact of low/high variability on low/high reliability that should be explained beyond the issue of reliability quantification with ICC. Put another way, ICC is just one type of objective measure used for the assessment of reliability, but the relationship between intra/inter-subject variability and reliability in general is much broader than that.

In the study of Kennedy et al. (2022) with the ABCD dataset, the authors assessed reliability and stability over the short term (within-session) and long term (between-session) and reported overall poor reliability and stability. Please add a statement about how the current study can help make sense of Kennedy et al.’s findings.

Jaccard similarity index might show low values for small activated volumes, which means that Jaccard values can decrease with more conservative thresholds. The issue of thresholding is thus critical when calculating Jaccard.   

Equations (1), (2) and (3): please explain what the abbreviations/parameters stand for (in addition to what is already mentioned in Figure 1).

Line 456: how was the sample size N=60^5 defined?

Explain what the tetrachoric correlation will add on top of what one can get with Jaccard (both indices use binary data). 

Figure 1b: Jaccard index (i.e. Intersection over Union) can be simplified as Red divided by the sum of Red, Blue and Green (no need for the two extra Red squares in the denominator of 1b). 
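Both points about binary-map metrics – the simplification of the Jaccard index, and what the tetrachoric correlation adds – can be illustrated with toy voxel counts. The Jaccard index never uses voxels that are sub-threshold in both sessions, whereas the tetrachoric correlation does; the sketch below uses Pearson's classic cos-π approximation to the tetrachoric correlation rather than a maximum-likelihood fit:

```python
import numpy as np

def binary_map_similarity(t, fa, fb, d):
    """Similarity of two binary maps summarised by their 2x2 voxel counts:
    t  = supra-threshold in both sessions ("Red" in Figure 1b),
    fa = session 1 only ("Blue"), fb = session 2 only ("Green"),
    d  = sub-threshold in both sessions."""
    jaccard = t / (t + fa + fb)  # intersection over union: d never enters
    # Pearson's cos-pi approximation to the tetrachoric correlation,
    # which does use the joint sub-threshold cell d.
    tetrachoric = np.cos(np.pi / (1 + np.sqrt((t * d) / (fa * fb))))
    return jaccard, tetrachoric

# Sparse activation: 150 supra-threshold voxels out of 10,000.
j, r = binary_map_similarity(t=50, fa=50, fb=50, d=9850)
print(f"Jaccard = {j:.3f}, tetrachoric (approx.) = {r:.3f}")
```

With these illustrative counts the Jaccard index is a modest 0.333 while the tetrachoric correlation is near 1, because the maps agree on the vast majority of (inactive) voxels – one concrete sense in which the two metrics can give complementary pictures, and why Jaccard is so sensitive to threshold for small activated volumes.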

Table S3: TE of the BOLD run for MLS/GE Sigma looks odd (3.6ms !). 
