Manipulating what is believed about what is remembered
The effects of memory distrust toward commission and omission on recollection-belief correspondence and memory errors
Abstract
Recommendation: posted 11 June 2024, validated 11 June 2024
Dienes, Z. (2024) Manipulating what is believed about what is remembered. Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=564
Recommendation
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
- Advances in Cognitive Psychology
- Collabra: Psychology
- Cortex
- Experimental Psychology
- Journal of Cognition
- Peer Community Journal
- PeerJ
- Psychology of Consciousness: Theory, Research, and Practice
- Royal Society Open Science
- Studia Psychologica
- Swiss Psychology Open
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #4
DOI or URL of the report: https://osf.io/5mb8d
Version of the report: 4
Author's Reply, 10 Jun 2024
Decision by Zoltan Dienes, posted 08 Jun 2024, validated 08 Jun 2024
Thanks for your thorough revisions, into which you have put a lot of thought. Before recommending your paper, two little points:
1) You have used 80% CIs, but most journals require significance levels of 5%. If you report, for the case where the population value is zero, the proportion of times the 80% CI falls entirely above the minimal effect of interest, and this proportion is less than 5%, then no journal can complain that you haven't controlled error rates to their usual satisfaction. I mention this to make sure you have as much choice of journals as is actually appropriate.
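A minimal sketch of this check in R, assuming a simple two-group comparison; the sample size, SD, and smallest effect of interest below are placeholders, not the study's values:

set.seed(1)
n_sims <- 10000; n <- 100; sd_y <- 1; sesoi <- 0.2    # all placeholder values
false_positive <- replicate(n_sims, {
  g1 <- rnorm(n, mean = 0, sd = sd_y)                 # population effect is zero
  g2 <- rnorm(n, mean = 0, sd = sd_y)
  ci <- t.test(g2, g1, conf.level = 0.80)$conf.int    # 80% CI on the difference
  ci[1] > sesoi                                       # whole CI above the minimal effect?
})
mean(false_positive)                                  # should come out below .05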
2) I don't think you state what results would actually count against any H1. Explicitly state that you are using the inferential rules (which your power calculations imply you are using) that if the CI is inside the equivalence region you reject H1 and accept H0 (and if the CI only partially overlaps the equivalence region, extending into H1 values, you suspend judgment).
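In code, the rule might look like this, assuming a directional prediction with the equivalence region below the smallest effect of interest (function and argument names are illustrative):

interval_decision <- function(ci_lower, ci_upper, sesoi) {
  if (ci_lower >= sesoi) {
    "CI above the equivalence region: accept H1"
  } else if (ci_upper <= sesoi) {
    "CI inside the equivalence region: reject H1, accept H0"
  } else {
    "CI extends into H1 values: suspend judgment"
  }
}
interval_decision(ci_lower = 0.05, ci_upper = 0.35, sesoi = 0.20)  # suspend judgment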
best
Zoltan
Evaluation round #3
DOI or URL of the report: https://osf.io/rgqps
Version of the report: 3
Author's Reply, 03 Jun 2024
Decision by Zoltan Dienes, posted 23 Apr 2024, validated 24 Apr 2024
Thank you for your revision. You have done a great job of scientifically motivating the predicted and smallest effects of scientific interest, which is the hardest part of the process!
1) The main thing is that the Stage 1 should not discuss exploratory analyses as that muddies the waters as to what was pre-registered with analytic flexibility removed and what was not. You can simply report the exploratory analyses in the Stage 2 in a non-pre-registered results section.
2) You use inference by intervals for the first two rows of the design table, then hypothesis testing against no difference for the following rows. I have just published a piece arguing against mixing inferential practices in the same paper (https://osf.io/preprints/psyarxiv/2dtwv). As the different tests use different DVs, the issues I raise there don't apply strongly, so you may keep the mixed inferential strategy if on reflection you think it best. But it is at least inelegant: why use inference by intervals in some cases and hypothesis testing against no difference in others? There is no inferential reason I can make out. You have all the figures (smallest effects/predicted effects) you need for using either method. For what it is worth, I think inference by intervals is the superior of the two: having defined a minimally interesting effect, one may, because of high power, reject H0 with a smaller effect than the minimally interesting one. Yet the correct procedure under hypothesis testing is still to reject H0. So high-powered hypothesis testing can lead to embarrassing inferential contradictions which won't occur with inference by intervals. So just think about what you really want to do here.
3) When I said report your units, I meant e.g. "regression slope = .017 c units/Likert unit of MC question". I know psychologists never do this, but it makes things clearer. Same when reporting means.
4) When using inference by intervals, report the probabilities of both a) obtaining the 90% CI within the null interval if there is no effect; and b) obtaining the 90% CI outside the null interval if the predicted effect size under H1 is true. a) gives the severity of the test for the theory that H1 is true (i.e. how likely it is to get evidence against the theory if it were wrong). I typically do this with Monte Carlo simulations.
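For instance, a minimal Monte Carlo sketch in R, assuming a two-group design; the n, SD, null-interval bound, and predicted effect below are placeholders, not the study's values:

set.seed(1)
n_sims <- 5000; n <- 100; sd_y <- 1; sesoi <- 0.2; predicted <- 0.5  # placeholders
ci90 <- function(true_effect) {
  g1 <- rnorm(n, 0, sd_y)
  g2 <- rnorm(n, true_effect, sd_y)
  t.test(g2, g1, conf.level = 0.90)$conf.int
}
# (a) P(90% CI entirely within the null interval | no effect): severity
inside_null <- replicate(n_sims, { ci <- ci90(0); ci[1] > -sesoi && ci[2] < sesoi })
# (b) P(90% CI entirely outside the null interval | predicted effect): power
above_null <- replicate(n_sims, { ci <- ci90(predicted); ci[1] > sesoi })
mean(inside_null); mean(above_null)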
Evaluation round #2
DOI or URL of the report: https://osf.io/4zecf
Version of the report: 2
Author's Reply, 18 Apr 2024
Decision by Zoltan Dienes, posted 27 Mar 2024, validated 27 Mar 2024
Thank you for your revision: three of the reviewers are entirely happy with how you dealt with their points. Wright has some optional further points. I just request some tidying up of the Design Table, making each row self-contained, so that it has, for example, its own power analysis. My points in detail follow.
1) First two rows of design table:
State in the table itself the DV in the analysis - the SSMQ for distrust toward omission and the MDS for distrust toward commission.
The Table refers to a 90% CI. State in the table itself for what comparison: for omission (omission vs no feedback) and for commission (commission vs no feedback). Each could be tested as part of a 2×2 ANOVA.
Most informative would be to express each scale as an average over items, so that the raw units are the same as the response scale subjects actually use, e.g. -4 to +4. Then convert the Cohen's d (0.8) to these raw units so it is clear how much change in units of the original response scale is involved. The CI can then be in raw units.
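For example, with a hypothetical pooled SD (the real value must come from the authors' scales):

d <- 0.8
sd_raw <- 1.5                # hypothetical pooled SD on the -4 to +4 item-average scale
raw_effect <- d * sd_raw     # 1.2 raw scale units: the change the CI should be judged against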
"If there are interactions between manipulation check order and feedback, we will examine the effects of feedback on manipulation checks separately for the before and after the second test conditions, and decide if it is appropriate to report the two conditions together based on the directions of the effect in each condition."Three is some analytic flexibility here. Be explicit about the precise conditions for each if-then decision. (It might be simplest to average over order and explore order later in exploratory analyses?)
Power/sensitivity needs to be determined for each row of the Design Table. That means for each confidence interval you need the power to detect the lower bound of the CI falling above the minimal cut-off. This entails two estimates: what is minimally interesting, to define the lower cut-off, plus an estimate of the likely effect - I am not sure if you can get this from your past studies?
2) For your third row, specify how you will calculate the simple effects. As the conclusions actually follow from the simple effects, each simple effect could itself be the crucial test you jump straight to. This would give you more power. You then need a power calculation for the simple effect.
3) For the fourth row, can you state what raw regression slopes this corresponds to in meaningful units, so a plausibility judgment can be made about it really being minimally interesting?
4) For the fifth row, as a power analysis is difficult, consider treating it as an exploratory analysis and removing it from the table.
5) For the last row, likewise state in raw units the change in regression slope you are powered to pick up.
It is unusual to be asked to present results in raw units, I know; but psychologists are so quick to remove all units from their analyses, and thereby lose contact with their data, as if the meaning of the units of their measurement scale made no difference whatsoever. As long as one cares about what size of difference really matters, the units must matter.
Reviewed by Dan Wright, 15 Mar 2024
Looking through mine and the other reviews, the authors have addressed most of the issues. My comments here are minor and/or suggestive. Most relate to the analysis.
1. Lines starting about 142. This was an issue in the first version too. The authors bring up SDT. It is important that they relate this to a probit/logit regression, since SDT (in the way they describe it) is just a probit regression for each individual, and the flexibility of the latter approach often makes it more useful (e.g., DeCarlo, https://www.columbia.edu/~ld208/psymeth98.pdf; Wright et al., https://pubmed.ncbi.nlm.nih.gov/19363166/).
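A minimal sketch of that equivalence, simulating one participant with d' = 1 and c = 0.25; the values are arbitrary, and the variable names (sayold, isold) follow the models quoted later in this review:

set.seed(1)
isold <- rep(0:1, each = 100)                         # 0 = new item, 1 = old item
sayold <- rbinom(200, 1, pnorm(-0.25 - 0.5 + isold))  # z(FA) = -c - d'/2 = -0.75
fit <- glm(sayold ~ isold, family = binomial(link = "probit"))
b <- coef(fit)
d_prime <- unname(b["isold"])                         # slope estimates d'
c_bias <- -unname(b["(Intercept)"] + b["isold"] / 2)  # -(z(FA) + d'/2) estimates c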
2. Randomization is used in several places (e.g., when the manipulation check is administered; it appears the order of the pictures is randomized, and I think even which pictures are used in which phase). This just adds variation. Why is it being done? Further, how are these orders going to be taken into account in the models?
3. Exclusions and sample size. The exclusions are listed, but it will be helpful to give an estimate of how many participants will likely be excluded and then use this to adjust the sample size. You also need to take into account the number of people who do not complete phase 2, and increase the sample sought accordingly.
4. I like the use of the OASIS picture database. It has variables on which the photos differ that likely relate to memorability. These should be included in the models.
5. The old/new, R, and K judgements will all be related. Given that much of the introduction is about where these do not match (e.g., I remember clearly but know it did not occur), it should be described how the relationships among these will be examined. Also, it is important to discuss whether there is an issue if there are not many of the discrepancies, since the introduction seems to be focused on discrepancies. I worry that the stats on the R and K judgements (after conditioning on old/new and each other) become more about how people use the scale than about their memories. Are there patterns of these that would lead the authors to stop the analyses?
6. The authors include a random effect for pictures when analysing the R and K responses, but not for the old/new response. Why? This inconsistency requires explanation, since a picture random effect can easily be included when predicting individual old/new responses. Also, as noted above, there may be variables from OASIS related to memorability.
7. It was not clear to me how the items on the two distrust scales will be used. I assume many of the individual items relate to items across the scales. Will these be combined, and the psychometrics of the combined scale reported? Obviously you would not treat them as distinct measures and analyse independently how they relate to the effects of interest.
I hope that these comments are useful. As I said at the start, I think these are minor and/or suggestive.
Reviewed by Romuald Polczyk, 17 Mar 2024
The authors responded satisfactorily to all my comments. I have no further comments and recommend accepting the project.
Reviewed by Iwona Dudek, 18 Mar 2024
Thank you to the Authors for sending their responses for review. I have no further comment and recommend this draft for acceptance.
Reviewed by Greg Neil, 07 Mar 2024
The responses to my points were all very well considered, and I am satisfied that the manuscript is now clearer and more understandable. Happy for this to be fully accepted now, and I have no further comments.
Dr Greg J Neil.
Evaluation round #1
DOI or URL of the report: https://osf.io/674ft?view_only=a663f2e3619545edafe86b0aee885603
Version of the report: 1
Author's Reply, 01 Mar 2024
Decision by Zoltan Dienes, posted 04 Dec 2023, validated 04 Dec 2023
First of all, my sincere apologies for the long delay between receiving the referees' reports and making a decision. On the bright side, I have four expert reviews, and they are on balance very positive. In responding to the reviewers' points I want you also to take into account the following:
You should determine that you have a satisfactory number of participants (e.g. via power, if you are using power, as you are) for each row of the design table. That is, you should justify an effect size for each test, namely the test specified in each row of the design table, that tests the theory or hypothesis listed.
As power is the attempt not to miss out on interesting effects in the long run, you should specify the effect you just don't want to miss out on. Given that a previous study found an effect of a certain size, and that an effect a bit smaller would still be interesting, you should not simply use the effect from a past study. For ideas on how to approach this problem see: https://doi.org/10.1525/collabra.28202. It is often easier to think about this problem in raw effect sizes; Dan Wright illustrates by thinking about the actual numbers of words you are dealing with. Another example would be using raw regression slopes from your past studies, so you can say what change in bias should follow from a unit change in distrust on the scales you use. Your past study may give you this. The bottom limit of the 80% CI on that slope may be (you would have to judge) roughly the smallest plausible effect size that is still interesting. It would also tell you how much you would need to change distrust by, in raw units, to get a meaningful change in bias. And that might help answer the reviewer's question about how big the change in distrust should be to pass the manipulation check. (You can continue working with power and minimal effects; or you could switch to Bayes factors and expected effects.)
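For example, a sketch of reading a candidate smallest interesting slope off a past study, with toy simulated data standing in for the authors' earlier dataset (bias_c and distrust are hypothetical variable names):

set.seed(1)                                           # toy data for illustration only
past_data <- data.frame(distrust = rnorm(80))
past_data$bias_c <- 0.02 * past_data$distrust + rnorm(80, sd = 0.3)
fit <- lm(bias_c ~ distrust, data = past_data)
confint(fit, "distrust", level = 0.80)[1]             # lower 80% bound: candidate smallest interesting slope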
In sum, make sure you have considered the relevant effect for *each* test in the design table, so you have power calculations specifically designed to address each test, in a scientifically justified way.
best
Zoltan
Reviewed by Dan Wright, 31 Oct 2023
The submission describes the two main error types in memory recognition: false alarms (saying old to a new item) and misses (saying new to an old item). Lots of work has been done showing that it is harder to convince someone that something they remember did not occur than to convince them that something they do not remember did occur (often explained by there being at least two types of memory). The proposed study uses false feedback during one recognition test, stressing one of these two types of errors (misses or false alarms), to see how that affects the error patterns in a second recognition test. The assumed causal chain is feedback -> a particular pattern of memory distrust (focusing more on either misses or false alarms) -> future memory recognition bias. There are lots of ways to change the response criterion in memory studies (e.g., telling people to do so), but this study examines whether an indirect approach of feedback influences an asymmetric type of memory distrust and whether this in turn has an impact on the memory recognition pattern. This is an interesting idea.
Understanding the feedback is therefore important to know whether the manipulation will have a large enough influence on memory distrust. Here is what the document says about commission feedback: "they will receive false feedback on target items and true feedback on filler items. For each correctly recognized target, there is a 20% probability that participants will receive false feedback that this is actually a new scene. For each incorrectly recognized filler, participants will receive true feedback that this is a new scene." For example, suppose most people are about 75% accurate on hits and correct rejections. This would be 15 hits, so on 3 of these the person receives false feedback, and presumably correct feedback on all 5 of the misses; but on all 5 of the false alarms they would receive correct feedback. Is this correct? So if the hit and CR rates are equal, then 8 of the 13 (62%) concern a commission error. The opposite would be true for the other group (with these assumptions). These values will move around, but it is this percentage that presumably may affect the person's belief in their asymmetry. If it only has a small effect, then it would be less likely to produce a detectable effect at the next stage.
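One reconstruction of this arithmetic, on the review's assumptions of 20 targets, 20 fillers, and 75% accuracy on both (these numbers come from the review, not the proposal):

hits <- 15; misses <- 5; false_alarms <- 5            # out of 20 targets and 20 fillers
false_fb_on_hits <- 0.20 * hits                       # 3 hits falsely called "new"
error_signals <- false_fb_on_hits + misses + false_alarms  # 13 trials signal an error
commission_signals <- false_fb_on_hits + false_alarms      # 8 of those signal a commission error
commission_signals / error_signals                    # 0.615, the 62% above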
My concern at this point is whether the manipulation will have the desired effect on memory distrust that the authors believe. If I read their power analysis correctly, they believe it will almost completely account for memory distrust, because they base their analysis on the memory distrust to response bias effect, if the causal chain above is how they believe this works. This makes the manipulation check critical.
As such, details of what counts as the manipulation check working are important. The section on this in the proposal is too vague for me to know how this will be done. For example, they say "if the manipulation to increase distrust toward commission errors is successful", so they need to define successful. They kind of define it in their power analysis. They use the r from Zhang et al. (2023c) of .19 between memory distrust and response bias within this three-group ANOVA design power calculation. If they are assuming the feedback effect is completely mediated by memory distrust (and if I read their lit review accurately this is at least the primary mechanism assumed), the "successful" manipulation check should require r values approaching one. This is unless I missed something. There is also the question of what to do if the manipulation is "successful" for one of the two manipulation check variables but the other only has, say, r = .7 (which I assume would be unsuccessful for them, given their power analysis). It may be that the authors want to define "successful" at a lower level, but then they might not want to use the .19.
In summary, I think the effect of feedback may be smaller than the authors believe it will be. Therefore I would want to see that it has the size of effect they are assuming before accepting the whole proposal. I focused on this aspect rather than on the lit review and other aspects of the methods because, in an experiment, the manipulation is clearly critical.
I did look at some of the code for the simulated data, out of interest. Consider the main results on recognition by condition. From what I could tell, this is split by condition and/or isold, rather than (in lme4/R language)
m1 <- glmer(sayold ~ condition * isold + (1 | item) + (1 | person), family = binomial, data = dat)  # dat: trial-level data, one row per picture per person
which allows this to be directly compared with the model without the interaction. Am I reading this correctly? Are the estimates of the response bias from the model
m2 <- glmer(sayold ~ isold + (1 | item) + (1 | person), family = binomial, data = subset(dat, condition == "commission"))  # one condition at a time; the condition label is illustrative
the conditional modes for the person random variable, with those then being used in subsequent analyses?
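A sketch of both steps in lme4, with m0 and dat as illustrative names rather than the authors' code:

library(lme4)
m0 <- glmer(sayold ~ condition + isold + (1 | item) + (1 | person),
            family = binomial, data = dat)            # same data, no interaction term
anova(m0, m1)                                         # likelihood-ratio test of the condition x isold interaction
person_modes <- ranef(m2)$person                      # conditional modes of the per-person intercepts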
There is much work in memory using multilevel logistic models; I think this should be in the text of the proposal. Sifting through a webpage is difficult for me to follow without text explaining the steps, and I am a stats nerd. I wasn't going to go through this in detail, since I think I first need to be convinced about the feedback having a big effect. It might also be useful to include two conditions where people are specifically told to raise or lower their response threshold, so that the size of the feedback effects can be compared with a more direct manipulation.
I hope you find these comments useful.
Reviewed by Romuald Polczyk, 17 Nov 2023
This project is about complex relationships of memory distrust with memory commission and omission errors and recollection beliefs.
I like the derivation of the predictions - from theoretical considerations about autobiographical memory and autobiographical beliefs, to memory distrust and specific hypotheses. The hypotheses are based on both theoretical considerations and empirical data. I especially like that the hypotheses go beyond looking for relationships between variables to include more complex moderation analyses.
In sum, I believe that the project is very well written, and I encourage acceptance of this proposal. The logic and plausibility of the hypotheses are excellent, as is the description of the planned procedure and statistical analyses. I especially appreciate being given access to the final Qualtrics procedure.
Below I include some remarks which perhaps may help to even improve the proposal:
I encourage the Authors to state more precisely how they will determine whether the manipulation check produced successful results - is a statistically significant effect enough, or should the effect be of a certain size? For example, is a statistically significant change from an average score of about 2 to about 3 (on a scale of 1 to 10) enough?
I suggest including information about the reliability of the memory distrust scales.
p. 11: “After reading the information letter and giving informed consent, participants will first answer demographic questions about their age, gender, and education level, followed by the SSMQ and the MDS.” - will the order of the SSMQ and MDS be randomized or fixed? Please specify.
Minor points
p. 4: “For some events, people can hold strong beliefs about their occurrence without any recollections about them, such as the celebration of your first birthday (believed-but-not-remembered events).” - I doubt this relates to autobiographical memory at all; this is just knowledge about some past events, not remembering them. Like, say, knowledge about appendicectomy surgery - we may know that it took place, but we cannot remember it because we were under anesthesia.
I very much appreciate being given access to the Qualtrics procedure as a reviewer. I just suggest enabling reviewers to proceed without answering the questionnaires, as completing them takes time.
I could not locate any information about funding for the research. Although I don’t think this is important and I have no doubts that the Authors will find funds, I mention this as reviewers are required to do so by PCI.
Although I by no means believe that every scientific project needs to be 'applied' to some degree, I encourage the Authors to think about possible applications of the results, or to state clearly that they expect none.
Reviewed by Iwona Dudek, 17 Nov 2023
The proposed research concerns the effect of manipulating two types of memory distrust (i.e. concern about making omission errors or commission errors) on memory reporting (recognition, recollection, and belief judgments). The research problem, formulated on the basis of the previous research results of the first author and their colleagues, is very interesting. The findings of the study will expand our knowledge of the psychology of memory and may also have implications for the psychology of eyewitness testimony. The question and hypotheses are clearly presented and justified. The manuscript is concise and well-structured. The research procedure is presented in detail, and the surveys from the Qualtrics research platform are included. Extremely careful and detailed data analyses are planned for each research question, and the interpretation of possible results is indicated.
Meeting the Stage 1 RR criteria:
1A. The scientific validity of the research question(s) – Yes. The research question aligns coherently with previous research findings and Blank’s (2017) theory of recollection–belief divergences and validation.
1B. The logic, rationale, and plausibility of the proposed hypotheses, as applicable. – Yes. The research questions and the hypotheses logically derive from the theoretical introduction presented.
1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis or alternative sampling plans where applicable). – Yes. The planned analyses will answer the questions posed. The sample size is sufficient to provide high statistical power (90%). Data exclusion criteria are specified, and the smallest effect size of interest is defined based on previous research.
1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses. – Yes. The research procedure is thoroughly described and made available. The analysis plan includes verification of all research hypotheses and states precisely which outcomes will confirm predictions.
1E. Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls; other quality checks) for ensuring that the obtained results are able to test the stated hypotheses or answer the stated research question(s). – Yes. The authors planned to test the effectiveness of the experimental manipulation, and this manipulation was also tested in a pilot study.
I only have a few minor comments:
1. p. 7, l. 20-22: It is stated that the trait of memory distrust is related to the objective functioning of memory. However, it may be worth considering a slight modification to this statement, as some studies have not demonstrated a clear association (e.g., Kuczek, Szpitalak, & Polczyk, 2018).
2. p. 15, Outliers and Exclusions section: It might be worth considering the exclusion of participants who guessed the research hypotheses, given that the description of the procedure suggests participants will be asked about this. Consider clarifying, for instance, what would count as a correct guess of the study's purpose.
3. In relation to SDT in the context of response bias, it could be beneficial to explain what the c indicator is, similar to the explanation provided for the β indicator on page 6.
4. There appears to be an inconsistency in the citation of Nash et al. between the text (2023) and the referenced literature list (2022).