# PCI RR review

Reviewer-guide questions used as headers below: https://rr.peercommunityin.org/help/guide_for_reviewers#h_7586915642301613635089357

# Summary of submission and impact

This registered report describes an experiment to assess the impact of defacing on manual and automated assessments of MR (brain) image quality. A lack of information about image quality can impact downstream processing and introduce bias into further analysis. Understanding the impact of defacing is important because this practice is essential to maintain the privacy of our participants while maximising reuse potential. The registered report follows the publication of a pilot investigation (Provins, 2022) and extends the sample size from n = 10 to n = 185. The hypotheses are clear for both measures, but no directionality or Bayesian tests are described for the analysis of the MRIQC IQMs.

The publication of the pilot study and code, the reuse of an open dataset, and the submission of a registered report demonstrate exemplary use of this scientific process. The authors should be commended for their commitment. This is an important and interesting question which is described well. My comments relate in places to the positioning and reporting of the study itself, rather than the research methodology, which I think is sound.

Below I have commented on the submission in the order in which the material was presented. I have reviewed the main submission and supplementary materials, but I was not able to review the code.

# Reviewer comments

Sample size: The analysis is to be conducted on a subsample of all images in the IXI Dataset (n = 185 / 580). References to "n = 580" in the abstract and elsewhere should be removed as they are misleading.

Defacing has become necessary "in compliance with privacy regulations" (lines 30-31). Not in all regions. Perhaps caveat with "local" and "in some areas", or better still list a few regulations, e.g. GDPR and equivalents. This is an important inclusion to give context to the motivation to understand these effects.

How are the images perceived to be altered?

- It would be useful to highlight more explicitly how the raters (and MRIQC) are *interpreting* the removal of data by defacing. The removal of the eyes as a source of artifact is proposed as one mechanism. Are there any other signals of image quality that are used or altered? It would be helpful to include a more explicit discussion of this and of what 'removal of information' actually means in this context.
- The MRIQC rating widget contains a list of artefacts which raters can identify. How was this list developed/tested? How will the data acquired here be analysed? Please state that there will be no analysis of these data, or only exploratory analysis.

Hypothesis 2: "Defacing [will] introduce bias in vectors of IQMs computed by MRIQC between the defaced and the non-defaced conditions." (lines 82-83). Do you have hypotheses regarding directionality or any specific IQMs?

Data: Please state the licence or data usage agreement under which the IXI Dataset is made available.

Manual assessment procedure

- Really good description of the manual assessment protocol, and it appears to be a robust procedure.
- It would be useful to have an estimate of how long each rater will be working for, to gauge the workload involved in a replication (a rough back-of-envelope sketch follows this list).
- Great extensions of MRIQC to collect interval ratings and randomise the presentation.
- Could the authors add anything about the training of raters, how experienced they are, or how they will be recruited?
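To illustrate the kind of workload estimate I mean, here is a minimal back-of-envelope calculation. The per-image rating time and the assumption that each rater assesses both conditions of all 185 images are my own assumptions, not figures taken from the submission.

```python
# Reviewer's back-of-envelope workload estimate (assumed numbers, purely illustrative).
n_images = 185                  # images in the analysed subsample
n_conditions = 2                # assumed: each rater sees defaced and non-defaced versions
seconds_per_rating = (30, 60)   # assumed time spent per image

for s in seconds_per_rating:
    hours = n_images * n_conditions * s / 3600
    print(f"~{hours:.1f} h per rater at {s} s per image")
# -> roughly 3-6 hours per rater under these assumptions
```

Even a rough statement of this kind in the protocol would help others judge the feasibility of a replication.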
3T imaging site only included: While I accept the reasoning to avoid field strength as a variable, it would be possible to run a separate experiment on the 1.5T data. There are many experienced radiographers who would be able to conduct a skilled manual quality assessment of 1.5T data, and MRIQC biases would be systematic. It would be interesting to discuss/consider the possibility of repeating the protocol for the 1.5T data, given that there is so much more available (n = 395 over two scanners). I understand this may be outside the scope of this project, but it could perhaps be discussed in the supplementary material as a potential follow-up.

Experiments:

- Good description of the rm-ANOVA and test of assumptions.
- How is the raters' confidence going to be investigated or controlled for? Or state that there will be no analysis, or only exploratory analysis.

Visualisation: I find the Bland-Altman (BA) plots difficult to read. This doesn't really impact the quality of the registration, but it would improve the reporting and publication if these were easier to engage with, e.g. by shading different areas of significance or using some other visualisation.

Analysis of MRIQC IQMs: Please state which statistical packages will be used to perform the PCA and describe all analytic choices.

Conclusion

- The authors state that data will be shared CC-BY. The original IXI Dataset is shared under CC-BY-SA-3.0. The CC-BY-SA-4.0 licence requires that any adapted or remixed data are shared under the same (CC-BY-SA-4.0) licence. Please confirm whether the same ShareAlike condition applies to CC-BY-SA-3.0 and update your proposed data sharing licence accordingly.
- I was not able to access the Code Ocean capsule ("You don't have permission to access this Capsule."). Please could access be provided to my email address?

Supplementary material

- Really good discussion of the implications ("Confirming our hypothesis would build a strong support to the idea of sharing QA/QC assessments..."). I would like to see part of this included in the introduction to the main RR material. More explicitly, this is an important step to mitigate some of the impact of the necessary and appropriate pseudonymisation processing.
- I would be interested to see a brief discussion of https://www.nitrc.org/projects/mri_reface on recovering/replacing some of the lost quality assessment signal.

# PCI RR Reviewers guide questions

### Does the research question make sense in light of the theory or applications? Is it clearly defined? Where the proposal includes hypotheses, are the hypotheses capable of answering the research question?

Yes. The theory ("defacing removes information that might be relevant for humans and for automated decision agents in their assessment of image quality") is only lightly explored. I would like to see more about the explicit features of the images which are used by raters to make a quality assessment, how they are impacted by data removal, and how they relate to the MRIQC measures. A more complete exploration of this would strengthen conclusions drawn at Stage 2.

### Is the protocol sufficiently detailed to enable replication by an expert in the field, and to close off sources of undisclosed procedural or analytic flexibility?

Yes, although some computational details on the PCA are missing (a sketch of the kind of specification I have in mind follows this answer). I was not able to access the code. I hope/assume the modified MRIQC toolbox code with the manual rating widget has been shared under an appropriate licence for others to replicate the procedure.
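As an illustration of the analytic choices I would like to see pre-specified for the IQM PCA, here is a minimal sketch. The package (scikit-learn), the per-IQM standardisation, and the variance-retention rule are all my own placeholders, not the authors' stated plan; the point is only that each of these choices should be written down at Stage 1.

```python
# Reviewer sketch of the PCA choices to pre-specify (package, scaling, component rule,
# and which data the PCA is fitted on). All specifics here are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_iqm_pca(iqm_matrix: np.ndarray, variance_to_keep: float = 0.95):
    """iqm_matrix: rows = images, columns = MRIQC IQMs (illustrative layout)."""
    scaled = StandardScaler().fit_transform(iqm_matrix)  # choice: z-score each IQM?
    pca = PCA(n_components=variance_to_keep)             # choice: how many components to keep?
    scores = pca.fit_transform(scaled)                   # choice: fit on which condition(s)?
    return pca, scores
```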
### Is there an exact mapping between the theory, hypotheses, sampling plan (e.g. power analysis, where applicable), preregistered statistical tests, and possible interpretations given different outcomes?

The sample size is convenience driven rather than derived from a power calculation. I would like to see this discussed as a limitation, along with the potential to increase the sample size by working with the 1.5T data. Statistical tests are appropriate and well described.

### For proposals that test hypotheses, have the authors explained precisely which outcomes will confirm or disconfirm their predictions?

Yes.

### Is the sample size sufficient to provide informative results?

See the discussion of the sampling plan above.

### Where the proposal involves statistical hypothesis testing, does the sampling plan for each hypothesis propose a realistic and well justified estimate of the effect size?

See the discussion of the sampling plan above. There was no clear link between the effect sizes observed in the pilot (Provins, 2022) and the effect sizes expected here.

### Have the authors avoided the common pitfall of relying on conventional null hypothesis significance testing to conclude evidence of absence from null results? Where the authors intend to interpret a negative result as evidence that an effect is absent, have authors proposed an inferential method that is capable of drawing such a conclusion, such as Bayesian hypothesis testing or frequentist equivalence testing?

Bayesian analysis is described for the human rater assessments (hypothesis 1). Could Bayes factors or some other form of evidence be used to support the MRIQC analysis (a short sketch of one option, frequentist equivalence testing, is included below after the manipulation-check answer)? This might also be important because of the [apparent] lack of hypothesised directionality in the MRIQC IQM effects, and the high number of IQMs which will be returned by MRIQC.

### Have the authors minimised all discussion of post hoc exploratory analyses, apart from those that must be explained to justify specific design features? Maintaining this clear distinction at Stage 1 can prevent exploratory analyses at Stage 2 being inadvertently presented as pre-planned.

I have suggested that all acquired data sources should be explicitly disclosed, including the artefact descriptions obtained during the manual assessment via MRIQC, and the rater confidence ratings as shown in the figures. I have suggested the authors state that there would be no planned analysis of those measures, or only exploratory analysis. I feel acknowledging these data sources at Stage 1 is important as a matter of transparency, and is a higher priority than the risk of misreporting at Stage 2. I am happy to defer to the Editor on this if my viewpoint is not consistent with PCI recommendations.

### Have the authors clearly distinguished work that has already been done (e.g. preliminary studies and data analyses) from work yet to be done?

Yes. Provins et al. 2022 is referenced as the pilot. This work is a replication of Provins et al. 2022 on a larger dataset. There is a clear description of how the test data have already been interacted with.

### Have the authors prespecified positive controls, manipulation checks or other data quality checks? If not, have they justified why such tests are either infeasible or unnecessary? Is the design sufficiently well controlled in all other respects?

Limited discussion of how the manipulation check of defacing will be assessed ("Only if PyDeface fails resulting in the preservation of substantial facial features from the original image, will images be excluded from the analysis", lines 101-103). The authors could be more explicit about how/why images will be "failed".
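To make the evidence-of-absence point for the IQMs concrete, here is a minimal per-IQM equivalence-testing (TOST) sketch. The paired structure, the use of SciPy, and the equivalence margin (expressed in SD units of the paired differences) are all my assumptions and would need to be justified by the authors; a Bayesian paired test would serve the same purpose.

```python
# Reviewer sketch: two one-sided tests (TOST) on the paired defaced-vs-original
# differences for a single IQM. The margin is arbitrary and purely illustrative.
import numpy as np
from scipy import stats

def tost_paired(iqm_defaced: np.ndarray, iqm_original: np.ndarray, margin_sd: float = 0.1):
    """Return the TOST p-value that the mean paired difference lies within +/- margin."""
    diff = iqm_defaced - iqm_original
    margin = margin_sd * diff.std(ddof=1)
    p_lower = stats.ttest_1samp(diff, -margin, alternative="greater").pvalue
    p_upper = stats.ttest_1samp(diff, margin, alternative="less").pvalue
    return max(p_lower, p_upper)   # correct across the many IQMs for multiplicity
```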
### When proposing positive controls or other data quality checks that rely on inferential testing, have the authors included a statistical sampling plan that is sufficient in terms of statistical power or evidential strength?

Unsure.

### Does the proposed research fall within established ethical norms for its field? Regardless of whether the study has received ethical approval, have the authors adequately considered any ethical risks of the research?

The authors will be using and resharing data acquired in UK NHS settings under GDPR, and those data will by necessity contain identifiable facial images at some stages of the processing. I would like to see a reference to the authors' institutional policy or governance describing how the data are being handled in accordance with local data security and processing guidelines, or with any terms of use imposed by the data originator or their local restrictions. Secondary data analysis of this type does not usually require ethical approval; however, it should be described in a Data Privacy [Impact] Assessment or equivalent outside of the EU. This is not yet a well understood and established norm in this field, but given that this research specifically sets out to explore the effects of privacy-related processing on the data, it would be valuable to show that the data are being handled with high regard for participant privacy.