DOI or URL of the report: https://doi.org/10.1101/2024.06.13.598769
Version of the report: 3
Dear authors
Your Stage 1 manuscript has been reviewed by two experts now. Before I can recommend IPA, could you please address the outstanding minor points raised by reviewers? There should be no need for another full round of review after this.
I’m Haiyang Jin and I always sign my review.
Review of “Is subjective perceptual similarity metacognitive?” (PCI-RR_Stage1).
Thank you for thoroughly addressing my feedback. The authors have clearly invested significant effort into the revision process, and the manuscript is now in much better shape. I sincerely appreciate their hard work.
Minor points:
1. I would suggest including references to support the definition of “metacognition” (Line 72) and the hypothesis (or perhaps assumption) that “similarity judgments involve a type of implicit metacognition” (Line 73).
2. It is possible to directly evaluate the evidence for the null result. The authors may refer to equivalence tests (Lakens et al., 2018).
Reference:
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
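For concreteness, a minimal sketch of the TOST (two one-sided tests) equivalence procedure in the spirit of Lakens et al. (2018); the simulated data and the ±0.3 equivalence bounds are invented for illustration, not values from the manuscript:

```python
# Illustrative TOST equivalence test sketch (Lakens et al., 2018).
# All numbers are hypothetical; the +/-0.3 bounds stand in for an assumed
# smallest effect size of interest, which the authors would need to justify.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=60)   # simulated effect scores centred on zero

low, high = -0.3, 0.3               # assumed equivalence bounds (raw units)

# Two one-sided tests: is the mean reliably above `low` AND below `high`?
t_low, p_low = stats.ttest_1samp(x, low, alternative="greater")
t_high, p_high = stats.ttest_1samp(x, high, alternative="less")

p_tost = max(p_low, p_high)         # TOST p-value is the larger of the two
equivalent = p_tost < 0.05          # reject effects outside the bounds
```

If both one-sided tests are significant, effects at least as large as the bounds can be rejected, which is the positive evidence for "no meaningful effect" that a plain non-significant result cannot provide.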
The authors have provided thoughtful responses to all of my comments and I am satisfied with their answers. I have no further suggestions.
DOI or URL of the report: https://doi.org/10.1101/2024.06.13.598769
Version of the report: 2
Dear authors
Your Stage 1 RR manuscript has now been reviewed by two experts in the field. While they are generally enthusiastic about your proposed research, they raised several points I'd like you to consider. I have also included a few comments of my own:
I’m Haiyang Jin and I always sign my review.
Review of “Is subjective perceptual similarity metacognitive?” (PCI-RR_Stage1).
The manuscript presents an interesting study aiming to test whether subjective perceptual similarity is metacognitive. Specifically, the study plans to measure participants’ similarity judgements among faces and their near-threshold face discrimination abilities, and to test the correlation between these two measures, with the hypothesis that they are positively correlated. It is proposed that the potential findings could provide insights into the main research question, i.e., whether subjective perceptual similarity is metacognitive.
The research question is interesting; the introduction is well documented, and it considers some potential concerns (e.g., the potential alternative hypotheses). However, I do have grave concerns about some theoretical and methodological aspects of the manuscript.
Although the main research question concerns whether similarity is “metacognitive”, the manuscript surprisingly does not seem to explain what “metacognitive” means here, and the measures used in this study do not seem to tap into the common understanding of the term. To my knowledge (I am not an expert in metacognition), “metacognition of face ability” refers to whether participants know how good their own face recognition ability is. For example, a person with poor face recognition who knows their ability is poor, and a person with good face recognition who knows their ability is good, both have high metacognition about their face recognition ability. But this does not seem to be what the tests in this manuscript measure. As such, the meaning of “metacognitive” in the manuscript (and its relationship to the understanding above) should be further explained and clarified.
Throughout the manuscript, “perceptual similarity” is emphasized as subjective (e.g., the term “subjective perceptual similarity”), whereas “discrimination ability” seems to be treated as objective/“quasi-objective”. But both “perceptual similarity” and “discrimination ability” are reported by participants subjectively. Thus, it remains unclear why there is such a difference (subjective vs. objective) between the two.
It is highly appreciated that the introduction discusses potential alternative hypotheses, which to some extent addresses my concern about what other hypotheses may account for the correlations between perceptual similarity and ability judgements. However, (1) it remains elusive whether these alternative hypotheses are mutually exclusive with the main hypothesis; (2) if not, it is unclear why these alternative hypotheses are not tested in the manuscript; (3) a related issue is that the alternative hypotheses are too vague to test in practice. Since this is a registered report, there is a possibility that the main hypothesis will not be supported (see later for potential issues in employing statistical evidence disconfirming the main hypothesis). In this case, it should be clarified what we could conclude from the findings (and the potential specific results).
The analysis of Hypothesis 1 seems to be related to Simpson's paradox, and therefore it is necessary to explain how the obtained group-level correlation differs from the usual correlation between two tasks. For instance, when we talk about a correlation between two tasks, the common understanding is that participants complete both tasks and their performance on the two tasks is correlated across participants. But this is different from the group-level correlation calculated in the manuscript, which should be clarified.
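To make the pooled-vs-within distinction concrete, here is a minimal synthetic sketch (all data invented for illustration) in which every simulated participant shows a negative within-participant relation, yet the pooled correlation across all trials is positive:

```python
# Simpson's-paradox illustration: within each "participant" the x-y relation
# is negative (slope -1), but participant means are positively aligned, so
# pooling all trials yields a positive correlation. Purely synthetic data.
import numpy as np

rng = np.random.default_rng(1)
xs, ys, within_rs = [], [], []
for mean in [0.0, 2.0, 4.0]:                          # three hypothetical participants
    x = mean + rng.normal(0, 0.5, 50)                 # 50 trials per participant
    y = mean - (x - mean) + rng.normal(0, 0.1, 50)    # within-participant slope: -1
    within_rs.append(np.corrcoef(x, y)[0, 1])
    xs.append(x)
    ys.append(y)

pooled_r = np.corrcoef(np.concatenate(xs), np.concatenate(ys))[0, 1]
# Every within_rs entry is strongly negative, yet pooled_r is strongly positive.
```

The two statistics answer different questions, which is why the manuscript should state explicitly which one each analysis estimates.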
The test of the second hypothesis seems biased. To test it, only face pairs with large standard deviations (SDs) were included. Since different participants are more likely to give different responses to these pairs, using face pairs with large SDs has a higher probability of yielding evidence in support of the second hypothesis. Face pairs with small SDs, on the other hand, are more likely to show that different participants made the same or similar responses to the same face pairs, which would not support the second hypothesis; instead, it may suggest that perceptual similarity responses are not unique to individuals. Considering only pairs with large SDs is therefore not appropriate. Perhaps pairs with small SDs could also be used to test the potential alternative hypotheses.
It is great that the manuscript includes a sample size planning section, but some key information is missing, and some procedures do not make sense in practice.

First, as no statistical power analysis is conducted, the proposed sample size (i.e., 12) and the procedure of adding more participants do not guarantee sufficient statistical power. Although CI precision is used as the stopping rule, it remains unclear why a CI width of 1 was chosen. Moreover, even assuming a CI width of 1 is somehow sufficient, the procedure has no upper limit on the sample size, which brings the risk that the study would never end.

Second, with a stopping rule based on a CI width of 1, a non-significant result could be obtained; in this case, it remains unclear what conclusions could be drawn. As a registered report, tests capable of supporting null hypotheses should also be included.

Third, the procedure of adding participants conflicts with the experimental procedure. It was stated that all participants will complete Task 2 only after all participants have completed Task 1, and that Hypothesis 1 can only be tested once both tasks have been completed by all participants. But if the criterion is not met (e.g., the CI for H1 is wider than 1), more participants would be added. (1) It remains unclear how many additional participants will be recruited before the next round of analysis: only 1, or more? (2) Since more participants would be recruited, it is possible that the face pairs for Task 2 would change. Would the first 12 participants then re-complete Task 2? The specific procedure remains elusive.

Fourth, it is unclear whether and how the 95% HDI would be used for hypothesis testing. For instance, what conclusions could be drawn if the 95% HDI includes 0? Also, what is the prior for calculating the Bayes factor?
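A quick Fisher-z back-of-envelope sketch illustrates how the CI-width criterion behaves (the observed r = 0.5 is an invented value, not one from the manuscript): a 95% CI narrower than 1 is already reached at a very small n, underscoring how lenient a width of 1 is for a statistic bounded between -1 and 1.

```python
# Approximate 95% CI width for a Pearson correlation via the Fisher z
# transform, as a function of n. Illustrative only; r = 0.5 is assumed.
import numpy as np

def ci_width(r, n):
    """Approximate 95% CI width for r using Fisher z (requires n > 3)."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
    return hi - lo

r = 0.5
# Smallest n at which the 95% CI is narrower than 1 (about a dozen):
n_needed = next(n for n in range(4, 500) if ci_width(r, n) < 1.0)
```

Because the criterion is met at such a small n, it provides little assurance of a precise estimate; a simulation based on the pilot data would give a better-motivated threshold.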
Minor points:
1. Since the correlation between the measures is the main interest, and both tests were conducted on two separate days for each participant, the reliability of each task should be reported.
2. The steps to obtain the dissimilarity matrix seem quite complicated; perhaps a figure alongside the text explanation would help. (I made all my comments under the assumption that all the steps to obtain the dissimilarity matrix are appropriate.)
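A reliability check of the kind suggested in point 1 could be as simple as the following sketch: correlate session-1 and session-2 scores across participants, then apply the Spearman-Brown correction to estimate the reliability of the combined (full-length) score. The data are simulated, and the noise level and sample size are invented for a stable illustration:

```python
# Test-retest reliability sketch: correlate the two sessions of one task
# across participants, then Spearman-Brown-correct for doubling the length.
# Simulated data; a large n is used only to keep the illustration stable.
import numpy as np

rng = np.random.default_rng(2)
n = 200                                       # hypothetical participants
true_score = rng.normal(0, 1, n)              # latent ability
s1 = true_score + rng.normal(0, 0.5, n)       # session 1 measurement
s2 = true_score + rng.normal(0, 0.5, n)       # session 2 measurement

r_tt = np.corrcoef(s1, s2)[0, 1]              # test-retest reliability
r_full = 2 * r_tt / (1 + r_tt)                # Spearman-Brown, length x2
```

Reporting such reliabilities matters here because the correlation between the two tasks is bounded above by the square root of the product of their reliabilities.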
Ali and colleagues plan to test the hypothesis that judgments of perceptual similarity reflect a metacognitive awareness of one’s own perceptual capacities. They propose to test this by comparing perceptual judgments of similarity for a set of faces with threshold measurements of perceptual discriminability between pairs of faces (using a morphed continuum).
There are two key hypotheses: 1) a clear association between similarity judgments and perceptual discriminability, and 2) individual differences such that the association is stronger within than between participants.
This is a well-motivated proposal and the rationale for the work is clearly and comprehensively laid out. The experiments have been thoughtfully designed and the pilot data demonstrates feasibility and provides preliminary results that are supportive of the hypotheses.
Overall, this is a very solid proposal, but I have a few comments/suggestions/questions that the investigators might want to consider:
1) The nature of the subjective similarity task, with multiple target faces presented below the sample face, means that participants are not just comparing the sample with one target but considering all faces simultaneously. Might this introduce strong context effects, and would it be better to present triplets to minimize such effects? I assume that part of the rationale is to speed up data collection, but perhaps this also leads to less stable data?
2) The investigators propose running 12 participants with four sessions per participant (2 similarity judgment, 2 threshold discrimination) with, for example, 24 pairs of faces for the perceptual discrimination tasks. The rationale for all these numbers is partly based on the pilot data, but the numbers seem arbitrary.
Lines 145-147 – “Each participant performs four sessions on different days with each session taking more than 60 minutes. This provides us with enough data to perform our statistical analysis at the individual-level.”
Lines 262-264 – “Our decision to select 24 pairs is supported by our pilot study, as we achieved reasonably robust results by examining only 13 pairs, almost half of our planned 24 pairs.”
The basis for these statements is not clear. I was wondering if the authors could use the pilot data to run simulations to estimate how much data they actually need, both for each participant to reliably estimate their performance on each task and at the group level to estimate the relationship between performance on the two tasks. This would help increase confidence in the proposed plan and potentially avoid collecting too little or too much data. I’m a little bit concerned about the latter and the burden currently placed on each participant – overly taxing the participants could actually lead to less reliable data.
3) Do the investigators have any sense of how stable/reliable the similarity judgments and perceptual discrimination judgments are? What is the test/retest reliability across days? In the context of the similarity ratings, they suggest that “subjective similarity ratings may be made based on whatever visual features that happen to be more salient, depending on one’s fluctuating attentional states, or arbitrary preferences that aren’t necessarily related to one’s own performance in near-threshold psychophysical tasks.” (Lines 71-74). To the extent that performance on the different tasks fluctuates, combining sessions across days may be worth reconsidering.
4) Lines 197-198 – can the investigators give some intuitive sense of how the trials are selected based on the embeddings?
5) I like the idea of using precision as the basis for the stopping criterion, but what is the rationale for choosing <1 as the desired 95% confidence interval? Might it be worth setting an upper limit for the number of participants that will potentially be recruited in case the precision does not converge as the investigators anticipate?