Recommendation

How does perceptual and contextual information influence the recognition of faces?

Rob McIntosh, based on reviews by Lisa DeBruine and Haiyang Jin
A recommendation of:

The importance of consolidating perceptual experience and contextual knowledge in face recognition

Submission: posted 09 September 2022
Recommendation: posted 26 February 2023, validated 01 March 2023
Cite this recommendation as:
McIntosh, R. (2023) How does perceptual and contextual information influence the recognition of faces? Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=313

Recommendation

When we familiarise ourselves with new faces over repeated exposures, it is typically in situations that have meaning for us. Here, Noad and Andrews (2023) ask whether meaningful context during exposure matters for the consolidation of faces into long-term memory. Participants will be shown video clips from TV shows that are ordered either in their original chronological sequence, preserving meaningful context, or in a scrambled sequence. It is expected that the original sequence will provide a better understanding of the narrative. The critical question is whether this will also be associated with differences in memory for the faces. Memory will be tested with images of the actor from the original clips (‘in-show’) or images of the same actor from another show (‘out-of-show’), both immediately after exposure and following a four-week delay. It is predicted that memory for faces will be better retained across the delay when the original exposure was in a meaningful context, and that this benefit will be enhanced for ‘in-show’ images, where the person’s appearance matches the original context. The pre-registered predictions and the targeted effect sizes for this study are informed by pilot data reported within the manuscript.
 
The Stage 1 manuscript was evaluated through an initial round of editorial review, followed by a further round of external review, after which the recommender judged that it met the Stage 1 criteria for in-principle acceptance (IPA).
 
URL to the preregistered Stage 1 protocol: https://osf.io/8wp6f
 
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA. 
 
List of eligible PCI RR-friendly journals:
 
 
References
 
1. Noad, K. & Andrews, T. J. (2023). The importance of consolidating perceptual experience and contextual knowledge in face recognition. In principle acceptance of Version 4 by Peer Community in Registered Reports. https://osf.io/8wp6f
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #2

DOI or URL of the report: https://osf.io/c3uzk?view_only=a59c340099d44a8db190a1b382b3b4d8

Version of the report: Perceptual_Contextual_Reg_Report_PCIRR_v3

Author's Reply, 22 Feb 2023

Decision by Rob McIntosh, posted 15 Jan 2023, validated 16 Jan 2023

Thank you for your revised Stage 1 RR, which has taken account of the initial triage comments. I have now received two expert (signed) external reviews on this version, from Haiyang Jin (HJ) and Lisa DeBruine (LDB). These reviews should be helpful to you in developing and clarifying your experimental and analysis plan further.

HJ has provided a number of useful comments for clarification, and LDB provided an exceptionally detailed review of the analysis plan in the form of RMarkdown code with HTML output. LDB has also generously offered that she would be happy to correspond (or zoom chat) directly (Lisa.Debruine@glasgow.ac.uk) about any points that need clarification.

Looking over these reviews, I anticipate that you could feel uncertain how best to respond where different perspectives have been taken by different reviewers, and also where these may differ from advice given in the initial triage comments. In general, if you feel uncertain about how best to resolve such conflicts then I am happy to consult on potential solutions to prevent further back-and-forth at a later stage.

For instance, both reviewers ask questions about how your interpretation of a targeted hypothesis test would be affected by the full pattern of interaction (which you are not formally testing). You should think carefully about this issue and respond to it in whichever way you see fit, but at an editorial level I would judge the targeted comparisons that you have proposed to be sufficient to test the hypotheses as stated, and so I would not require you to add additional tests unless you wish to.

LDB has advised adopting a mixed-effects approach to analysis, and has provided guidance on this (including code). I would encourage you to consider the possible merits of this approach. Again, I would not require that you change your analysis strategy, unless you wish to, particularly as your approach is informed by triage comments recommending it. (Note, though, that a mixed-effects approach might also offer advantages in terms of power, which is worth weighing here too.)
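For illustration only, a minimal sketch of a mixed-effects analysis of this kind might look like the following, assuming trial-level recognition data; the data frame and column names (trials, accuracy, condition, time, subject, item) are placeholders, not anything taken from your plan or from LDB's code:

```r
# Minimal mixed-effects sketch (assumed data layout, not the authors' plan):
# one row per test trial, with a binary accuracy response, a between-groups
# condition, a within-subject time point, and crossed random effects for
# subjects and face items.
library(lme4)

m <- glmer(
  accuracy ~ condition * time +    # fixed effects of interest
    (1 + time | subject) +         # by-subject intercepts and time slopes
    (1 | item),                    # by-item intercepts
  data = trials,
  family = binomial                # logistic link for binary responses
)
summary(m)
```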

LDB has suggested that you could present a sensitivity analysis, showing the expected power for your target sample size across a range of effect sizes, with the effect sizes from your pilot data superimposed (with 95% CIs). This would be a good way to present a broader consideration of power, but the critical feature (at least for an outlet such as Cortex, which I think is your target journal) is that you convincingly establish .9 power (at alpha .02) for all relevant effect sizes, and that testing these effect sizes would provide an important message for the field (whatever the outcome).
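As a concrete illustration, a sensitivity analysis of the kind LDB suggests could be sketched as below, assuming a between-groups t-test as the critical comparison; the sample size and effect-size grid are placeholders:

```r
# Power across a range of effect sizes at a fixed n (placeholder values).
library(pwr)

d <- seq(0.1, 1.0, by = 0.01)    # candidate effect sizes (Cohen's d)
power <- sapply(d, function(es)
  pwr.t.test(n = 60, d = es, sig.level = .02)$power)   # n = 60 per group

plot(d, power, type = "l",
     xlab = "Effect size (Cohen's d)", ylab = "Power")
abline(h = 0.9, lty = 2)         # the .9 power criterion
# Pilot effect-size estimates and their 95% CIs could be superimposed
# here with points() and segments().
```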

This means that the justification for the tested effect size is critical. LDB has noted that drawing your target effect size directly from the pilot data can be misleading, and recommends powering for a smaller effect size. This concurs with the comment made at triage that the effect size from the pilot data may be biased upward, because you have selectively chosen to follow this effect up on the grounds that it seemed encouraging. This suggests that a more conservative approach to the effect size of interest would be wise. You refer to the ‘smallest effect size of interest’, and my impression is that you are using this to mean the smallest of the effect size estimates targeted by any of your tests. If so, this use of the term would be misleading, because it should properly refer to the smallest effect that you would consider to be of theoretical or practical importance (i.e. an effect smaller than this would not be of interest to the field). However you choose to address this, the critical point is that you should make your rationale clear for the effect size that you are targeting, and explain how you are addressing or offsetting possible bias in the pilot data (or alternatively why you do not think this is a problem).

Above I have just highlighted some issues where I anticipate that you could see a conflict between recommendations, but you should respond in your revision to all of the comments raised by HJ and LDB, and I am sure that their advice will help you to optimise your plan for this RR.

Yours sincerely,

Rob McIntosh

PCI-RR

 

Reviewed by Lisa DeBruine, 13 Jan 2023

I wrote my review as an RMarkdown file with HTML output, so will send that along to the editor by email. This is because I ended up creating a full simulation and attempt at constructing the specified analyses from the description, but ran into a few parts where I was confused.

Note from Managing Board: for technical reasons it is currently not possible to attach .rmd and .html files to reviews. Therefore, we have instead created a private OSF page under the PCI RR OSF account and have uploaded the files there. They may be downloaded from: https://osf.io/qvk7g/?view_only=da800a96722b4d1ea6c5eb4faf148711

Reviewed by Haiyang Jin, 14 Nov 2022


Evaluation round #1

DOI or URL of the report: https://osf.io/6hmtv?view_only=a59c340099d44a8db190a1b382b3b4d8

Author's Reply, 06 Oct 2022

Decision by Rob McIntosh, posted 16 Sep 2022

Thank you for submitting your Stage 1 manuscript to PCI-RR. As the recommender assigned to this manuscript, it is my role to perform an initial triage assessment to determine whether the submission is ready to be sent for external review, or requires some revision. This assessment is primarily with respect to the RR aspects of the proposal, rather than its specific topical content, on which I am not an expert. With RR still relatively new for many people, we most commonly find that some revisions are required at this initial stage. That is the case with your manuscript too.

The study that you propose seems to me a clever and elegant design to get at an interesting question regarding the effect of context on memory consolidation for faces. The sample size of your pilot data is also impressive, and puts you in a strong position to propose an empirically well-informed RR. However, the design is complex, and I believe that the logical links between your theoretical hypotheses (and proposed conclusions) and the actual analyses proposed are not sufficiently precise. Your approach to the critical issue of power also needs further consideration.

The design and the issues it raises are complex, and I have had limited time to draft this letter, so it is possible that there are some misunderstandings in what I have written. I am happy to discuss or clarify any such points by email. I should also note that, as with any review comments, you are not required to make the changes suggested, but you do always need to provide a suitable rebuttal or rationale for your chosen course of action.

I shall first list the major issues, as I see them, with respect to the requirements of the RR format. In addition, I will add a number of more minor points noted in passing. These include comments from a fellow recommender who has also read your manuscript. I hope that these comments are helpful, and I look forward to seeing a revised version of this proposal if you choose to address them.

Major (RR) points

1) Table 1 lists specific targeted hypotheses, which you divide into 3 main questions (Hypotheses 1-3), with further sub-questions. However, the dependence of your conclusions upon different possible combinations of outcomes is not clear. For example, you state in the paper that “The comparison between In Show and Out of Show face images will be important to determine whether the effect of context on face memory is specific to the visual context in which the images were originally shown”, but your analysis plan does not include any direct comparison between in show and out of show faces.

With specific regard to Table 1, I will unpack just one example (Hypothesis 1), but the same general issues apply to all hypotheses.

Hypothesis 1 is that your video manipulations lead to differences in understanding the narrative or context. (It is not clear what role “or context” plays here, because your tests seem to be focused on testing understanding of the narrative, but maybe these terms just require more precise definition.) There are two sub-hypotheses, relating to free recall and structured questions; a significant effect would confirm an influence of the manipulation, and the direction of the difference would tell you the direction of that effect. I note in passing that, given the theoretical background of the experiment and your pilot data, it would seem perfectly reasonable to propose one-tailed critical tests here, though you are of course free to choose two-tailed tests if you wish to be able to interpret an effect in either direction.

However, it is not clear how your conclusions depend upon the combination of possible outcomes for H1.1 and H1.2. Will you conclude that there is an effect of video manipulation if EITHER test is significant, or only if both are? This needs to be explicit, and it may have implications for alpha correction for multiple comparisons. Arguably, if a conjunction of significant outcomes is required (both tests significant), then alpha correction is not necessary, but if a disjunction is sufficient (either test significant), you need to protect against inflation of Type I error (see https://doi.org/10.1007/s11229-021-03276-4).
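The practical consequence for alpha can be made concrete; here is a minimal sketch, assuming two sub-tests at your proposed alpha of .02 and using a simple Bonferroni correction for the disjunction case (other corrections are possible):

```r
alpha <- .02
k <- 2                                     # number of sub-tests (H1.1, H1.2)

# Disjunction (either test significant suffices): familywise error must be
# protected, e.g. by Bonferroni correction.
alpha_per_test_disjunction <- alpha / k    # .01 per test

# Conjunction (both tests must be significant): each test can be run at the
# nominal alpha, since requiring both to pass does not inflate Type I error.
alpha_per_test_conjunction <- alpha        # .02 per test
```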

2) At a finer grain, a similar issue arises within each sub-hypothesis. You propose a one-way ANOVA to test for differences between conditions, but the actual inference depends upon the outcome of post-hoc tests, and so the overall ANOVA is in fact redundant. I believe that you would do better to pre-register the planned contrasts of interest directly as your critical test(s) and skip the ANOVA. You state that planned comparisons showing lower scores for the Scrambled and German conditions compared to the Original condition will confirm Hypothesis 1.1. This suggests that you are proposing a conjunction test: you will conclude in favour of the hypothesis only if BOTH tests are significant, not if only one is. Is that correct?

I do wonder whether, in the economy of your experiment, the German condition is earning its keep. It is very thorough, but it seems expensive, in terms of participant numbers, to have a second control group, particularly given the very similar results between the Scrambled and German groups in your pilot data. This is up to you, but if you do have two control groups, you have to specify explicitly whether your conclusions depend on a difference between the treatment group and either control, both controls (or perhaps even both combined).

3) The role of Hypothesis 1 is not entirely clear to me. You interpret it as a test of the role of context in episodic memory, but the way that the hypothesis is actually stated reads more like a manipulation check. That is, given that we know that context leads to changes in understanding, Hypothesis 1 tests whether your video manipulation has been successful in manipulating context. A null result would not challenge schema formation theory, but rather would suggest that your manipulation did not do what you intended. In other words, I think that Hypothesis 1 may be better understood and presented as your outcome-neutral condition, or manipulation check, whereby a significant result confirms that your experiment is capable of testing its key experimental hypotheses (2 & 3). A null result would mean that you would not be able to interpret the outcome of the experimental hypotheses in terms of the effects of context. I may be wrong about this, but it seems more sensible to me.

4) The above point relates to the issue of how theoretical interpretations will be informed by the combination of outcomes across your hypotheses. This also applies between Hypotheses 2 and 3. The proposed theoretical interpretation of a significant advantage for the Original condition would be in terms of the role of consolidation, but this would be legitimate only if the advantage for the Original condition at the delayed test were greater than the advantage immediately after exposure. That is, your ability to test for a role of consolidation depends on the change over time between immediate and delayed testing (i.e. a comparison between the effects of condition at times 1 and 2), not on a static snapshot of performance after a delay.

If so, this needs to be what your critical test targets. I suggest that it would be simpler and more straightforward not to propose a condition-by-time ANOVA with follow-up of the interaction term, but to isolate the interaction of interest directly by an appropriate subtraction (e.g. delayed minus immediate performance within each participant) and then to run a between-groups t-test on that difference score. This would also allow you to specify your effect size of interest as Cohen’s d (or similar).
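A minimal sketch of this difference-score approach, assuming one recognition score per participant at each time point and a comparison between two of the groups (e.g. Original vs Scrambled); the data frame and column names (scores, d_immediate, d_delayed, condition) are placeholders:

```r
# One row per participant; 'retention' isolates the condition-by-time
# interaction as a single per-participant value.
scores$retention <- scores$d_delayed - scores$d_immediate

# Between-groups comparison of the change across the delay
# (assumes 'condition' has exactly two levels in this subset).
t.test(retention ~ condition, data = scores)

# Cohen's d for the same comparison, via the effectsize package.
effectsize::cohens_d(retention ~ condition, data = scores)
```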

5) You state that an ICC greater than 0.75 will indicate good reliability between raters. Be clear about what role this value plays in the logic of your experiment. Is this an outcome-neutral criterion that must be passed in order for the experiment to be deemed capable of testing the key hypotheses? If so, this should be stated explicitly.
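For concreteness, such a reliability check could be computed as sketched below, assuming a matrix of ratings with one row per rated item and one column per rater; the model/type/unit choices are assumptions that would need to match your actual rating design:

```r
library(irr)

# ratings: numeric matrix, rows = rated items, columns = raters (assumed layout)
icc(ratings, model = "twoway", type = "agreement", unit = "single")
# An ICC point estimate above 0.75 would meet the stated criterion.
```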

6) You propose an admixture of frequentist and Bayesian tests. Either is legitimate, but if you are using Bayesian tests, then these need to be fully specified and justified (in terms of the priors used), and supported by an appropriate sensitivity analysis, which will differ from the power analysis for frequentist tests. (More guidance on this can be provided if necessary.) In general, you might be better to settle on a frequentist or Bayesian approach: if you wish to take a Bayesian approach to quantify evidence for H0, then why not also use it for H1? Or, conversely, if you wish to use a frequentist method to test H1, then you could consider a Two One-Sided Tests (TOST) approach to test H0, specified in terms of a smallest effect size of interest.
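To illustrate the frequentist alternative: a TOST equivalence test can be built from two ordinary one-sided t-tests. This is a minimal sketch, assuming a two-group comparison and an equivalence bound ('sesoi') expressed as a raw difference; all names and values are placeholders:

```r
sesoi <- 0.2    # placeholder smallest effect size of interest (raw units)

# Equivalence is concluded only if BOTH one-sided tests are significant,
# ruling out effects at least as large as the bound in either direction.
lower <- t.test(score ~ condition, data = dat, mu = -sesoi,
                alternative = "greater")
upper <- t.test(score ~ condition, data = dat, mu = sesoi,
                alternative = "less")
c(lower$p.value, upper$p.value)   # both must fall below alpha
```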

7) The effect size that you target should relate to the specific comparisons that you will conduct, which requires you to specify the smallest expected effect size, or the smallest effect size of interest for each of your critical comparisons, and show that each of your tests has sufficient power to meet your targeted thresholds.

Your sample size is predicated on an effect size of η2 = .103 (Cohen’s f = 0.338) from the critical delayed In Show face recognition ANOVA in the pilot data (the corresponding power calculation is sketched after this list). There are several problems here:

(i) this is not the unique critical test of your hypothesis, and it relates to only one of the hypotheses under test.

(ii) Based on the fact that you are following these pilot data up because they showed an encouraging effect, the effect size estimate you are using is likely to be inflated. Should you be taking this into account?

(iii) The effect size estimate could be inflated for a second reason: the In Show faces in your pilot experiment were actually seen in the viewed clips, which will not be the case in your proposed RR. You should consider whether this is liable to be consequential for the size of your effect and, if it is, try to adjust your targeted effect size accordingly.
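For reference, the conversion and power calculation implied by the values above can be sketched as follows; the η2 is as quoted from your manuscript, while the balanced three-group, one-way design and the .9/.02 criteria are taken from the discussion above:

```r
library(pwr)

eta2 <- 0.103
f <- sqrt(eta2 / (1 - eta2))   # Cohen's f ~= 0.339, consistent with the quoted 0.338

# Per-group n for .9 power at alpha .02 in a balanced one-way, 3-group ANOVA.
pwr.anova.test(k = 3, f = f, sig.level = .02, power = .90)

# A smaller, bias-adjusted effect size (points ii and iii above) would imply
# a larger required n; e.g. rerun with f = 0.25 to see the difference.
```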

Major (more general) points

8) Your hypotheses are stated as questions (e.g. “Is there a difference in face recognition at a delay following stimulus encoding?”), which is a little unusual; hypotheses are conventionally framed as directional, testable statements.

9) In Figure 2 (assuming that these images are representative), it would appear that the in-show and out-of-show stimuli differ quite dramatically in terms of lighting conditions and backgrounds. It may be that your design means this does not matter, but you should make a more specific argument for why this is the case (or adjust your in-show and out-of-show stimuli to be better matched).

10) You do not provide any description or examples of the foil stimuli, but these details are critical for judging whether the foils are valid.

11) You do not seem to provide information about how long the clips are, how many there are, and so on. I can only find the statement that “Three 20-minute (1170s) movies constructed from audio-visual clips from the first episode of BBC TV series Life on Mars will be used as stimuli.” In general, your methods need to be described precisely and unambiguously (and, ideally, making the materials open to reviewers might help in the evaluation of the proposal).

12) As far as I can tell (maybe I missed it?), you do not even specify the number of trials completed in the face-recognition tests, nor the dimensions, timing, or duration of the stimuli.

13) Related to the above, is the number of trials sufficient to support a valid calculation of d’ (or would a simpler metric, e.g. HR - FAR, be more suitable)?

14) Your calculation of d’ builds in compensation for floor and ceiling effects, but how common do you expect floor and ceiling effects to be? If they are very rare this may be fine, but if they are more common it could be problematic, and you might need to consider adjusting your stimuli to avoid them and/or adding exclusion criteria for performance level.
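One common way to build in such compensation is the log-linear adjustment, which adds 0.5 to each response count; a minimal sketch follows, where it is an assumption that this matches your intended correction, and the counts are purely illustrative:

```r
# d' with a log-linear correction for hit/false-alarm rates of 0 or 1.
dprime <- function(hits, misses, fas, crs) {
  hr  <- (hits + 0.5) / (hits + misses + 1)   # corrected hit rate
  far <- (fas + 0.5) / (fas + crs + 1)        # corrected false-alarm rate
  qnorm(hr) - qnorm(far)
}

dprime(hits = 18, misses = 2, fas = 3, crs = 17)
# With few trials, extreme rates are common, and the correction exerts more
# influence on the resulting d' values.
```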

15) The above points would be very usefully informed by your pilot data. The pilot experiment is good and very relevant, and I think it might work better to give a fuller report of the pilot experiment before the description of the planned study (rather than after it). This is particularly so given that the original n of the pilot (albeit not the final sample) is in fact larger than that of the planned RR. Perhaps it could even be styled as an exploratory Experiment 1, rather than as a mere pilot reported at the end. The present description of the pilot study is rather telegraphic about details (e.g. what do Tests 1 and 3 refer to, and what happened to Test 2?). Your report of these data could usefully include more insight into general levels of performance and the distribution of scores on the tests. The bar charts with standard errors may demonstrate salient patterns, but they do not provide much insight into the underlying data.

16) Make it clear that the target sample size refers to complete, valid datasets after exclusions.

17) How likely is it that participants will complete the 24h and long-term sessions on time? Relying on responses to email at short notice seems potentially unreliable.

18) Will subject assignment be balanced across conditions in some way?

19) Do you have any strategies to minimise the potential for participants to lie, e.g. about German language skills or knowledge of Life on Mars?
