The relationship between perceptual discriminability and subject similarity
Is subjective perceptual similarity metacognitive?
Recommendation: posted 16 October 2024, validated 17 October 2024
Schwarzkopf, D. (2024) The relationship between perceptual discriminability and subject similarity. Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=844
Recommendation
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
- Advances in Cognitive Psychology
- Collabra: Psychology
- Cortex
- Experimental Psychology *pending editorial consideration of disciplinary fit
- Journal of Cognition
- Peer Community Journal
- PeerJ
- Psychology of Consciousness: Theory, Research, and Practice
- Royal Society Open Science
- Studia Psychologica
- Swiss Psychology Open
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #2
DOI or URL of the report: https://doi.org/10.1101/2024.06.13.598769
Version of the report: 3
Author's Reply, 16 Oct 2024
Decision by D. Samuel Schwarzkopf, posted 10 Oct 2024, validated 11 Oct 2024
Dear authors
Your Stage 1 manuscript has been reviewed by two experts now. Before I can recommend IPA, could you please address the outstanding minor points raised by reviewers? There should be no need for another full round of review after this.
Reviewed by Haiyang Jin, 10 Oct 2024
I’m Haiyang Jin and I always sign my review.
Review of “Is subjective perceptual similarity metacognitive?” (PCI-RR_Stage1).
Thank you for thoroughly addressing my feedback. The authors have clearly invested significant effort into the revision process, and the manuscript is now in much better shape. I sincerely appreciate their hard work.
Minor points:
1. I would suggest including references to support the definition of “metacognition” (Line 72) and the hypothesis (or perhaps assumption) that “similarity judgments involve a type of implicit metacognition” (Line 73).
2. It is possible to directly evaluate the significance of the null result. Authors may refer to equivalence tests (Lakens et al., 2018).
Reference:
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
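As a rough illustration of the equivalence-testing suggestion in point 2, the sketch below applies two one-sided tests (TOST) to a Pearson correlation using the Fisher z approximation. The function name `tost_correlation` and the equivalence bound of 0.3 are hypothetical choices made for this example only; the manuscript's actual test statistic and smallest effect size of interest would need to be specified by the authors.

```python
import math
from statistics import NormalDist


def tost_correlation(r, n, bound):
    """TOST p-value for a Pearson r against equivalence bounds (-bound, +bound).

    Uses the Fisher z approximation (standard error 1 / sqrt(n - 3)).
    Returns the larger of the two one-sided p-values; equivalence is
    declared when this value is below the chosen alpha.
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher z approximation")
    se = 1.0 / math.sqrt(n - 3)
    z_r = math.atanh(r)
    z_bound = math.atanh(bound)
    norm = NormalDist()
    # H0a: rho <= -bound; rejected when r lies sufficiently above the lower bound.
    p_lower = 1.0 - norm.cdf((z_r + z_bound) / se)
    # H0b: rho >= +bound; rejected when r lies sufficiently below the upper bound.
    p_upper = norm.cdf((z_r - z_bound) / se)
    return max(p_lower, p_upper)


# Illustrative values only (bound = 0.3 is an assumption, not from the manuscript).
p_large_sample = tost_correlation(r=0.0, n=100, bound=0.3)
p_small_sample = tost_correlation(r=0.0, n=12, bound=0.3)
print(p_large_sample, p_small_sample)
```

With these illustrative inputs, an observed correlation of zero supports equivalence at n = 100 but not at n = 12, which is one reason the choice of bounds and the sampling plan would need to be considered together.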
Reviewed by anonymous reviewer 1, 12 Sep 2024
The authors have provided thoughtful responses to all of my comments and I am satisfied with their answers. I have no further suggestions.
Evaluation round #1
DOI or URL of the report: https://doi.org/10.1101/2024.06.13.598769
Version of the report: 2
Author's Reply, 02 Sep 2024
Decision by D. Samuel Schwarzkopf, posted 25 Jul 2024, validated 29 Jul 2024
Dear authors
Your Stage 1 RR manuscript has now been reviewed by two experts in the field. While they are generally enthusiastic about your proposed research, they raised several points I'd like you to consider. I also included a few comments of my own:
- Reviewer Haiyang Jin raises questions about the use of the term 'metacognitive' in the context of this study. While I agree with you that you are investigating metacognition here, he is right that it would be worth spelling out clearly why this is about metacognition rather than simply comparing objective thresholds with subjective ratings.
- It is not clear whether excluded participants will be replaced. For your stopping criterion based on confidence intervals, this doesn't technically matter. But for your minimum sample size of 12 it is important to clarify. I assume that's what you mean but please state this explicitly.
- Note that both reviewers raise the issue that you did not specify a maximum sample size. Please either do so or at least show (e.g. through simulations) that you can feasibly obtain your stopping criterion with a realistic sample size. It is important to minimise the risk of inconclusive results.
- You are determining the 95% CI via bootstrapping. I assume that this is done by using the 2.5th and 97.5th percentile. Please clarify, because there are various adjustment methods available. As the exact limits here will affect your stopping decision and statistical inference it is critical to ensure there is no flexibility.
- Perhaps I'm misunderstanding something, but in several places you state that higher #JND corresponds to "higher capacity," i.e., better performance. This sounds incorrect. Your staircase converges on a ~71% correct threshold. Wouldn't fewer morphing steps therefore correspond to better performance (i.e. higher capacity)?
- Design Table, Hypothesis 1, Sampling plan: When you write "execution" criteria, I assume you mean "exclusion"?
- A Discussion section isn't required in a Stage 1 manuscript. It is fine to include one, and it will be acceptable if this is changed completely at Stage 2 if the results suggest different interpretations at that stage. However, for simplicity, and to avoid reviewer time being spent on a part of the text that isn't needed, I'd suggest removing it.
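To make the point about bootstrap confidence limits concrete, the following is a minimal sketch of the plain percentile method (2.5th and 97.5th percentiles of the bootstrap distribution), without any bias correction or acceleration. The function name `percentile_bootstrap_ci`, the use of the mean of per-participant values as the statistic, and the toy input are assumptions for illustration; they are not taken from the manuscript.

```python
import numpy as np


def percentile_bootstrap_ci(values, statistic=np.mean, n_boot=10000,
                            alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic of per-participant values.

    Resamples participants with replacement and returns the alpha/2 and
    1 - alpha/2 percentiles of the bootstrap distribution. No BCa or other
    adjustment is applied.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed so the limits are reproducible
    n = values.size
    # Each row of idx is one resample of participant indices.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = np.array([statistic(values[row]) for row in idx])
    lower, upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Hypothetical per-participant effect estimates for 12 participants.
toy_values = np.linspace(-1.0, 1.0, 12)
ci_low, ci_high = percentile_bootstrap_ci(toy_values)
print(ci_low, ci_high, ci_high - ci_low)
```

Fixing the number of resamples, the percentile definition, and the random seed in advance removes the flexibility in the exact limits that could otherwise affect the stopping decision.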
Reviewed by Haiyang Jin, 05 Jul 2024
I’m Haiyang Jin and I always sign my review.
Review of “Is subjective perceptual similarity metacognitive?” (PCI-RR_Stage1).
The manuscript presents an interesting study aiming to test whether subjective perceptual similarity is metacognitive. Specifically, the study plans to measure participants’ similarity judgements among faces and their near-threshold face discrimination abilities, as well as test the correlation between these two measures with the hypothesis that these two measures are correlated positively. It is proposed that the potential findings could provide insights on the main research question, i.e., whether subjective perceptual similarity is metacognitive.
The research question is interesting; the introduction is well documented, and it considers some potential concerns (e.g., the potential alternative hypotheses). However, I do have grave concerns about some theoretical and methodological aspects of the manuscript.
Although the main research question is about “metacognitive”, the manuscript surprisingly does not seem to explain what “metacognitive” means here, and the measures used in this study do not seem to tap into the popular understanding of the term. To my knowledge (I’m not an expert in metacognition), “metacognition of face ability” refers to whether participants know how good their face recognition ability is. For example, both a person with poor face recognition ability who knows that it is poor and a person with good face recognition ability who knows that it is good have high metacognition of face recognition ability. But this does not seem to be measured by the tests in this manuscript. As such, the meaning of “metacognitive” in the manuscript (and its relationship to the understanding above) should be explained and clarified further.
Throughout the manuscript, “perceptual similarity” is emphasized as subjective (e.g., the term “subjective perceptual similarity”), whereas “discriminability ability” seems to be treated as objective/“quasi-objective”. But both “perceptual similarity” and “discriminability ability” are reported by participants subjectively. Thus, it remains unclear why there is such a difference (subjective vs. objective) between “perceptual similarity” and “discriminability ability”.
It is highly appreciated that the introduction discusses the potential alternative hypotheses, which to some extent addresses my concern about what other hypotheses may account for the correlations between perceptual similarity and ability judgements. However, (1) it remains unclear whether these alternative hypotheses are mutually exclusive with the main hypotheses; (2) if not, it is unclear why these alternative hypotheses are not tested in the manuscript; (3) a potentially related issue is that the alternative hypotheses are too vague to test in practice. Since this is a registered report, there is a possibility that the main hypothesis will not be supported (see later for potential issues in employing statistical evidence disconfirming the main hypotheses). In this case, it should be clarified what we can conclude from the findings (and the potential specific results).
The analysis of Hypothesis 1 seems to be related to Simpson's paradox, and therefore it is necessary to explain how the obtained group-level correlation results differ from the correlations between two tasks. For instance, when we talk about correlations between two tasks, the popular understanding is that participants complete two tasks and their performance in the two tasks is correlated across participants. But this is different from the group-level correlation calculated in the manuscript, which should be clarified.
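A toy example may help show why the level at which a correlation is computed matters. The sketch below uses made-up numbers (three hypothetical participants, four hypothetical items) in which the association is perfectly negative within every participant, yet strongly positive when all observations are pooled. It is not a reconstruction of the manuscript's analysis; the function name `pooled_and_within_correlations` is introduced here for illustration only.

```python
import numpy as np


def pooled_and_within_correlations(x, y):
    """Compare the pooled correlation with per-participant correlations.

    x, y: arrays of shape (n_participants, n_items).
    Returns (pooled_r, within_r), where within_r has one value per participant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.corrcoef(x.ravel(), y.ravel())[0, 1]
    within = np.array([np.corrcoef(xi, yi)[0, 1] for xi, yi in zip(x, y)])
    return pooled, within


# Made-up data: participants differ in overall level (offsets), and within
# each participant y decreases as x increases.
offsets = np.array([0.0, 5.0, 10.0])[:, None]
base = np.array([0.0, 1.0, 2.0, 3.0])
x = offsets + base
y = offsets - 0.5 * base

pooled_r, within_r = pooled_and_within_correlations(x, y)
print("pooled:", pooled_r)
print("within:", within_r)
```

Here the pooled correlation is driven by differences between participants, while the within-participant correlations reflect the item-level relationship, so the two can even have opposite signs.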
The test for the second hypothesis seems biased. To test the second hypothesis, only selected face pairs with large standard deviations (SDs) were included. Since different participants are more likely to give different responses to the same face pairs in these trials, using face pairs with large SDs has a higher probability of providing evidence supporting the second hypothesis. On the other hand, face pairs with smaller SDs are more likely to provide evidence that different participants made the same or similar responses to the same face pairs, which does not seem to support the second hypothesis. Instead, it may suggest that the perceptual similarity responses are not unique to individuals. But the method considers more trials with large SDs, which is not appropriate. Maybe trials with smaller SDs can also be used to test the potential alternative hypotheses.
It is great that the manuscript includes a sample size planning section. But some key information is missing, and some procedures do not make sense in practice. First, as no statistical power analysis is conducted, the proposed sample size (i.e., 12) and the procedure of adding more participants do not guarantee sufficient statistical power. Although precision of the CI is used as the stopping rule, it remains unclear why a CI size of 1 is used. Even assuming a CI of 1 is somehow sufficient, there is no upper limit on the sample size in the procedure, which brings the risk that the study would never end. Second, with the stopping rule of a CI size of 1, a non-significant result could be obtained; in this case, it remains unclear what conclusions could be drawn. As this is a registered report, tests for supporting null hypotheses should also be included. Third, the procedure of adding participants conflicts with the experimental procedures. It was stated that all participants will complete task 2 only after all participants complete task 1, and Hypothesis 1 can only be tested when both tasks 1 and 2 have been completed by all participants. But if the criteria are not met (e.g., the CI for H1 is larger than 1), more participants will be added. 1) It remains unclear how many additional participants will be recruited before the next round of analysis. Only 1, or more? 2) Since more participants would be recruited, it is possible that the face pairs for task 2 would change. Will the first 12 participants then re-complete task 2? It remains unclear what the specific procedures would be. Fourth, it is unclear whether and how the 95% HDI would be used for hypothesis testing. For instance, what conclusions could be drawn if the 95% HDI includes 0? Also, what is the prior for calculating the Bayes factor?
Minor points:
1. Since the correlation between the measures is the main interest, and both tests were conducted on two separate days for each participant, the reliability of each task should be reported.
2. The steps to get the dissimilarity matrix seem to be quite complicated, and maybe a figure together with the text explanation would help. (I made all my comments with the assumption that all the steps to get the dissimilarity matrix are appropriate.)
Reviewed by anonymous reviewer 1, 25 Jul 2024
Ali and colleagues plan to test the hypothesis that judgments of perceptual similarity reflect a metacognitive awareness of one’s own perceptual capacities. They propose to test this by comparing perceptual judgments of similarity for a set of faces with threshold measurements of perceptual discriminability between pairs of faces (using a morphed continuum).
There are two key hypotheses: 1) clear association between similarity judgments and perceptual discriminability, and 2) individual differences such that the association is stronger within than between participants.
This is a well-motivated proposal and the rationale for the work is clearly and comprehensively laid out. The experiments have been thoughtfully designed and the pilot data demonstrates feasibility and provides preliminary results that are supportive of the hypotheses.
Overall, this is a very solid proposal, but I have a few comments/suggestions/questions that the investigators might want to consider:
1) The nature of the subjective similarity task, with multiple target faces presented below the sample face, means that participants are not just comparing the sample with one target, but considering all faces simultaneously. Might this introduce strong context effects and would it be better to present triplets to minimize such effects? I assume that part of the rationale is to speed up data collection, but perhaps this also leads to less stable data?
2) The investigators propose running 12 participants with four sessions per participant (2 similarity judgment, 2 threshold discrimination) with, for example, 24 pairs of faces for the perceptual discrimination tasks. The rationale for all these numbers is partly based on the pilot data, but the numbers seem arbitrary.
Lines 145-147 – “Each participant performs four sessions on different days with each session taking more than 60 minutes. This provides us with enough data to perform our statistical analysis at the individual-level.”
Lines 262-264 – “Our decision to select 24 pairs is supported by our pilot study, as we achieved reasonably robust results by examining only 13 pairs, almost half of our planned 24 pairs.”
The basis for these statements is not clear. I was wondering if the authors could use the pilot data to run simulations to estimate how much data they actually need, both for each participant to reliably estimate their performance on each task and at the group level to estimate the relationship between performance on the two tasks. This would help increase confidence in the proposed plan and potentially avoid collecting too little or too much data. I’m a little bit concerned about the latter and the burden currently placed on each participant – overly taxing the participants could actually lead to less reliable data.
3) Do the investigators have any sense of how stable/reliable the similarity judgments and perceptual discrimination judgments are? What is the test/retest reliability across days? In the context of the similarity ratings, they suggest that “subjective similarity ratings may be made based on whatever visual features that happen to be more salient, depending on one’s fluctuating attentional states, or arbitrary preferences that aren’t necessarily related to one’s own performance in near-threshold psychophysical tasks.” (Lines 71-74). To the extent that performance on the different tasks fluctuates, combining sessions across days may be worth reconsidering.
4) Lines 197-198 – can the investigators give some intuitive sense of how the trials are selected based on the embeddings?
5) I like the idea of using precision as the basis for the stopping criterion, but what is the rationale for choosing <1 as the desired 95% confidence interval? Might it be worth setting an upper limit for the number of participants that will potentially be recruited in case the precision does not converge as the investigators anticipate?
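The kind of simulation that could inform an upper limit on the sample size might look like the sketch below. It repeatedly simulates sequential recruitment from a minimum of 12 participants, checks after each added participant whether a percentile bootstrap 95% CI is narrower than 1, and records how often the criterion is reached before a cap. Everything apart from the minimum of 12 and the width criterion of 1 is an assumption made for illustration: the function name `simulate_stopping`, the cap of 60, the normally distributed per-participant values, the use of their mean as the statistic, and the between-participant SDs.

```python
import numpy as np


def simulate_stopping(true_sd, n_min=12, n_max=60, target_width=1.0,
                      n_sims=100, n_boot=300, seed=0):
    """Simulate a precision-based stopping rule with a sample-size cap.

    Returns (fraction of simulated studies that reached the target CI width
    at or before n_max, array of final sample sizes).
    """
    rng = np.random.default_rng(seed)
    final_n = np.empty(n_sims, dtype=int)
    converged = np.zeros(n_sims, dtype=bool)
    for s in range(n_sims):
        # Hypothetical per-participant values, recruited in this order.
        data = rng.normal(0.0, true_sd, size=n_max)
        for n in range(n_min, n_max + 1):
            sample = data[:n]
            idx = rng.integers(0, n, size=(n_boot, n))
            boot_means = sample[idx].mean(axis=1)
            lo, hi = np.percentile(boot_means, [2.5, 97.5])
            if hi - lo < target_width:
                converged[s] = True
                break
        final_n[s] = n  # stopping point, or n_max if never converged
    return converged.mean(), final_n


for sd in (1.0, 5.0):  # assumed between-participant SDs, for illustration
    frac, sizes = simulate_stopping(true_sd=sd, n_sims=50)
    print(f"sd={sd}: reached criterion in {frac:.0%} of runs, "
          f"median final n = {np.median(sizes):.0f}")
```

Replacing the assumed data-generating process with resampled pilot data would give a more realistic picture of whether the precision criterion can be met with a feasible number of participants.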