DOI or URL of the report: https://osf.io/6fwh2
Version of the report: 4
Revised manuscript: https://osf.io/9cqp6
All revised materials uploaded to: https://osf.io/8zhcj/, updated manuscript under sub-directory "PCIRR Stage 1\PCI-RR submission following R&R 3"
Thank you for the revisions you provided. In my reading of the manuscript and the response letter, only one minor issue remains to be addressed. Please clarify the wording regarding the target sample size. In the manuscript, you write that the minimum sample size is 800 participants. Because of this wording, it remains unclear when exactly sampling will be stopped and under which conditions data will be used. In the response letter you clarified that you set the target of 800 participants on Prolific, but that there will be timed-out responses that will be reimbursed but not counted towards the target sample size, and presumably not used in the analyses. Please ensure that the wording used in the manuscript is unambiguous about which data will be collected and which data will and will not be used in the subsequent analyses.
DOI or URL of the report: https://osf.io/y94ug
Version of the report: 2
Revised manuscript: https://osf.io/6fwh2
All revised materials uploaded to: https://osf.io/8zhcj/, updated manuscript under sub-directory "PCIRR Stage 1\PCI-RR submission following R&R 2"
I have now received three re-reviews. There are a few remaining issues to integrate:
Please consider this point in a final revision and response, and we will then be ready to move forward with Stage 1 IPA.
The authors addressed all my comments with detailed responses and adjustments to the manuscript. Therefore, I have no further questions or suggestions regarding Stage 1.
Best Regards.
Thank you again for the opportunity to review this RR. I greatly appreciate the authors’ detailed consideration of feedback and critiques from all reviewers, with the goal of making this project stronger. I have almost no comments for this round. I believe this project is ready for data collection.
Follow-ups to first-round comments:
Second-round minor comments:
Comments on the revision.
The authors' replies address my comments one by one, so here are my corresponding responses to each.
.1. I am happy with the revisions made in response to this.
.2 I perhaps didn't express myself very well there. I wasn't trying to argue that there wasn't any need or value in replicating the studies, only that the eventual publication should have a sound justification for it, which means setting it in the context of subsequent developments in that research area. If there is real doubt as to whether the findings would replicate or not, then certainly the replication attempt is justified, and the subsequent literature does seem to justify feeling some doubt about it. I think I just wanted the authors to have a clear idea about what it would mean if the findings were replicated, and also what it would mean if they were not replicated.
.3 The authors' reply is satisfactory - I just wanted to be sure that readers of the eventual publication would get a clear understanding from the paper of why the replication matters.
.4 O.K.
.5 I don't have any idea of a measure of confidence that would be trustworthy but I stand by my original comment that explicit judgments of confidence are prone to response biases. Replicability is not a guide to trustworthiness because it might mean only that the same response biases were operating in both the original and the replication. I'm not saying the authors shouldn't obtain confidence judgments, just that they should perhaps include some nuanced discussion of the results when they get them. What they have said in their reply to my comment is the right sort of thing, in my view.
.6 O.K.
.7 In the original submission it reads "scores supposedly either representing academic achievement, mental concentration, and sense of humour". The two choices are (i) remove "either" and (ii) change "and" to "or". The authors can go for whichever of those they prefer. Apologies for being a bit pedantic about this.
.8 O.K.
.9 O.K., that is very useful. The research I do pretty much has to be done face-to-face so I have never explored online alternatives. The use of Prolific is probably more common in some areas of psychology than others.
.10 It is the "If things fail..." that concerns me. A simple way to deal with the problem would be to analyse separately for each experiment the data from the participants for whom it was the first one they saw - at that point their judgments could not be affected by the other studies because they haven't done them yet. If the results for that sub-sample resemble those for the full sample, then no problem.
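For illustration only, a minimal sketch of that sub-sample analysis in R, assuming a long-format data file; "dat", "presentation_order", and "experiment" are hypothetical names, not the authors' actual variables:

    # Illustration only; "dat", "presentation_order", and "experiment" are
    # hypothetical names for a long-format data file.
    library(dplyr)

    # Keep only each participant's first-seen experiment: judgments here cannot
    # have been influenced by the other studies.
    first_seen <- dat %>%
      filter(presentation_order == 1)

    # Re-run the planned analysis for a given study on this sub-sample, e.g.
    s1_first <- filter(first_seen, experiment == "study1")
    # ... same analysis code as used for the full Study 1 sample ...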
.11 O.K.
.12 O.K.
.13 I sympathise with the authors and I think their discussion of the issue is intelligent and appropriate. I agree with the decision to set alpha at .001 for exploratory analyses. For the analyses where .005 is used, I would suggest that, if they get results significant at .01 but not at .005, they could discuss these or at least list them, so that readers could get a feel for whether there is any likelihood of type 2 errors, but I'm happy for the authors to go with their own judgment on this.
.14 O.K.
Overall. I would like the authors to bear my comment on .4 in mind when writing up the results. The grammatical error commented on in .7 should be corrected. The authors should consider the suggestion for further analysis in .10, but I will leave it to them to decide whether to do it or not. They should also consider the suggestion made in .13, but again it is up to them whether they do it or not. I have no further requests for changes.
Peter White
DOI or URL of the report: https://osf.io/4rjpm
Version of the report: 1
Revised manuscript: https://osf.io/y94ug
All revised materials uploaded to: https://osf.io/8zhcj/, updated manuscript under sub-directory "PCIRR Stage 1\PCI-RR submission following R&R"
Invitation to Revise: PCI RR 609
Dear Dr. Feldman,
I have now received three reviews of your submission on a replication project addressing Kahneman and Tversky (1973). In line with my own reading of your manuscript, the reviewers highlight important strengths of your outlined approach, but also note some areas for further improvement. In light of these suggestions, I would like to invite you to revise the manuscript.
Most salient is the need to clarify questions regarding the 7-in-1 approach of conducting multiple replication attempts in one study, the sampling plan, and the nature of the replication and the evaluation of replication success. These issues fall within the normal scope of a Stage 1 evaluation and can be addressed in a careful and comprehensive round of revisions.
Warmest,
Rima-Maria Rahal
Review uploaded.
Overall notes:
I commend the authors for taking on this large registered report replication attempt. It is very obvious that much time and consideration has been put into this project, and while PCI reviews do not evaluate the importance of submissions, my personal opinion is that this is an important replication to undertake.
At its current stage, I do not believe that this project is ready for data collection. I have a number of concerns, suggestions, and comments that I believe will greatly improve the end product. Even though this is my first time reviewing a registered report (although I have been involved in and am leading one right now), it is my understanding that the review process is meant to be constructive and collaborative. If it seems that any of my comments below are blunt, please do not take them as indicative of anything other than my quick jotting down of ideas.
While my point-by-point comments can be found below, I wish to draw out a few themes and mention some concerns I have.
First, I believe that the manuscript can be organized in a much clearer way. I found that there was a great deal of repeated information due to the way that sections were outlined, and a lot of jumping back and forth was required of the reader. I left some suggestions below, but largely, I would try to work on cutting a great deal of text and consolidating repeated information. One major step would be going through all of the components of any one study in order (i.e., Study 1 manipulations, measures, deviations; Study 2 manipulations, measures, deviations; etc.), as opposed to going through each component across studies (i.e., manipulations for Study 1, Study 2, Study 3, etc., then measures for Study 1, Study 2, Study 3, etc.). While I did not note all occurrences of this, I do believe restructuring the entire manuscript would help with readability and reduce total length.
Second, I think that there should be more consideration put towards sample size, sensitivity analyses, and power analyses. There was a recent article in PSPR around considerations of power (Giner-Sorolla et al., 2024). Pulling from that article, and related research, I urge the authors to consider what power and sensitivity mean. Power is the probability of detecting an effect if there is an effect there, and is effect specific rather than study specific. The same holds true for sensitivity analyses (which are really just a different mathematical representation of the same equation): sensitivity relates to an effect, not to a study. Multiple comparisons and analyses each have their own sensitivity analyses (or power levels). I would encourage a consideration of how multiple comparisons and analyses for any given study may affect the reliability of the sensitivity analyses reported, and what can be done about them.
Third, I do not see this as a close replication, but rather a conceptual one. I discuss this more below but if the authors wish to argue that this is a close replication, I think there are some changes that have to be made. One of those is my fourth major concern.
Fourth, I do not love the design of this replication, for two reasons. I am strongly opposed to all six (or seven, depending on how you want to count it) studies being completed by each participant. Much of the cognitive biases and heuristics literature does show how knowledge and practice reduce the effects that biases have on individuals. I would venture a guess that having participants run through a number of prediction tests may influence their answers on later questions, and even if order is randomized and order effects are examined, I think that this reduces the ability to detect an effect if an effect is there. Additionally, the population in question (Prolific participants) is not a naïve group. I just downloaded some of my own data from Prolific. This study included U.S. residents who were over 18 and proficient in English and were not involved in our pilot test. Out of 1226 participants, the mean number of Prolific studies completed was 2194 (median 1719, SD = 1914). I think there is something to be said for the idea that this population has had more experience and exposure to heuristics and biases, and may not respond the same way. I would argue that a participant on Prolific who has completed 300 studies is not representative of the population, and over 75% of participants in my sample have completed at least that many. While there is nothing that can be done about the non-naïveté of Prolific participants, I would suggest that participants do not run through multiple studies (or only run through a subset of studies). If there is flexibility with the budget, I could even envision some participants running through only one study, some running through 2-4, and some running through all of them, if there are interesting questions there. But I think that having participants run through all six studies is a bad idea.
Thank you for the opportunity to review this project. I am looking forward to seeing the next version, and to seeing this project through. If there is anything I can clarify, let me know. Additionally, please understand that everything in this review is my opinion, so take things with a grain of salt.
Minor suggestions/comments:
· Page 7: First sentence feels like it’s trying to define both representativeness heuristic (hereafter RH) and give results from 1973 study all at once. I might suggest splitting it into two sentences, one that gives a clear one-sentence definition of RH followed by the 1973 seven studies example.
· Page 8: The groups in S1 are unclear; I would name them (as K&T did) to clearly indicate that the key results were between-subjects correlations between similarity and likelihood (and likelihood vs. base rates). As written, it could be read as within-subjects.
· End of page 8/beginning of page 9: page number for direct quote would be helpful
· Page 9: As opposed to wording “representativeness heuristic predicts…”, give the actual findings, and say “…, [in]consistent with what the representativeness heuristic would predict.”
· Page 11: The sentence starting with “At the time of writing…” is a run-on; I would split it up for readability.
· Table 2 Exp. 6: Indicate from where N was inferred (presumably df of t-test + 1?)
· Page 21: I would choose another phrase than “Bayesian line” to describe S3. There is no one Bayesian line, maybe pull from original article wording which states “The curved line displays the correct relation according to Bayes’ rule.”
· Page 21: for S4, should read “either adjectives or reports”
· Page 21: S5 is unclear if it is between or within subjects. I would edit to read: “representing academic achievement, mental concentration, or sense of humor”
· Table 3: Regarding geographic origin, participants in K&T (1973) were recruited via the student paper at the University of Oregon unless stated otherwise (see the footnote on page 238 of the original study).
· I would remove all text mentioning “HIT” in the Qualtrics survey, as HIT is specific to MTurk and not Prolific.
· Related to the point above, the Qualtrics mentions “MTurk/Prolific/Cloudresearch” at the end. Are data being collected from multiple sources? If not, I would remove any references to other platforms.
· I would encourage consideration of any exclusion criteria (i.e. US residents 18+, proficient in English, etc.) and to include those in text, including the attention checks
· Page 49: How was the prior of 0.707 generated?
Major suggestions:
· Intro: I think that all of the information is there, but I don’t love how it’s organized. While reading through the first time, I found myself jumping back and forth to remind myself of what was said previously. I would suggest a reorganization along the following lines: Start by overviewing the findings from all studies in KH 1973. Something similar to table 1, but outline them all at the start. Then, move into different replication attempts (primarily around S1 and S3) and inconsistencies there and the theories for why replications are less/more successful (salience of randomness, etc.). Move into importance of replicating this specific paper, and then include the overview of replication/extension. End with a section outlining deviations/extensions in replications.
· Page 10: unclear if actually testing reproducibility, given that reproducibility is the reliability of a prior finding using the same data and same analysis (Nosek et al., 2022)
· Why was feedback accuracy not manipulated in S1/2 as it was in K&T (1973)? I understand the importance of including self-perceived accuracy, but I think there’s something to be said about being told that you either were or were not accurate, and how that might influence usage of representativeness on likelihood. While I get that the original study found null effects of this manipulation, I feel that it is not true to the replication to remove it because the deception was found “unnecessary and unconvincing.” I would suggest keeping it in to remain truer to the original study (which would also allow for assurance that there is not an unequal distribution in perceived accuracy). I might have perceived accuracy first, followed by (deceptive) feedback.
· Similar to the introduction, I would consider a restructuring of the methods section to go study by study as opposed to a section for manipulations, one for measures, and one for extensions. The cognitive load of switching back and forth between studies might be lessened if it’s organized by study, with sub headers for manipulations/measures. This would also cut down significantly on the text.
· Page 43: I am unconvinced by the determination of successful/mixed/failed replication. I suggest considering whether certain studies should carry more or less weight, and what that might mean for the representativeness heuristic. I would argue that this determination can be made almost only post hoc, due to the vast differences in potential outcomes (7 studies, multiple analyses, etc.). Perhaps a better metric could be whether a specific study (or analysis) replicated, and whether the evidence in aggregate seems to indicate that RH as a whole replicates. I don’t have an answer on how to generate metrics for the second option (whether RH as a whole replicates), but the cutoffs given are unconvincing to me. What would happen if every study gave non-significant findings in the same direction as hypothesized? Someone might argue that power was too low but that on aggregate the effect exists… Or what would happen if there are order effects or something of that sort?
· Page 43: I would not argue that this is a close replication by LeBel et al.’s criteria. The population is different, the context is different, the setting is different, the procedure is different, and there are differences in operationalization and stimuli. To me, this is a conceptual replication. There is nothing wrong with that, but I would represent it as it is. LeBel et al. state that a close replication is when the IV or DV stimuli are different, but a conceptual (far) replication is when the IV or DV operationalization or population is different.
· I found the discussion of power/sensitivity/sample size to be somewhat removed from the analyses: the sensitivity analyses do not take into account the multiple effects generated. I would suggest careful consideration of which effect each sensitivity analysis is computed for, and of how to adjust for multiple comparisons; a brief sketch follows this list.
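To make the multiple-comparisons point concrete, here is a minimal sensitivity sketch in R using the pwr package; the per-cell n of 400 and the count of ten focal tests are placeholders, not the authors' actual values:

    # Minimal sensitivity sketch; n per cell and the number of focal tests are
    # placeholders to be replaced with the actual design values.
    library(pwr)

    alpha_adj <- .05 / 10            # Bonferroni-style correction for 10 tests
    pwr.t.test(n = 400,              # per-cell sample size (placeholder)
               sig.level = alpha_adj,
               power = .95,
               type = "two.sample",
               alternative = "two.sided")
    # Returns the smallest effect size d detectable with 95% power at the
    # corrected alpha, for this single comparison; each planned effect would
    # need its own calculation.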
Stats Comments
· I would highly recommend using the groundhog package (https://groundhogr.com/) to ensure reproducibility of all code. This would allow for version control of packages (see the sketch after these comments).
· I would suggest not using “99” as a code for missing age if that is a valid age in the dataset. Consider an obviously implausible value such as “999”, or commonly used codes such as “NA” or “.”.
· I would recommend installing tidyverse instead of individually installing ggplot2, haven, knitr, dplyr, purrr, etc.
· If the data is collected via Qualtrics, why is it imported via a .sav file as opposed to an open format such as .csv?
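A rough sketch of the setup the bullets above suggest; the groundhog snapshot date, file path, and age variable name are placeholders, not the authors' actual values:

    # Rough sketch; the snapshot date, file path, and variable name are placeholders.
    library(groundhog)
    groundhog.library("tidyverse", "2024-06-01")   # version-pinned package loading

    # Import the Qualtrics export as .csv rather than .sav
    dat <- readr::read_csv("data/replication_data.csv")

    # Treat the missing-age marker as a genuine NA rather than 99/999
    dat <- dat %>%
      mutate(age = na_if(age, 99))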
References:
Giner-Sorolla, R., Montoya, A. K., Reifman, A., Carpenter, T., Lewis Jr., N. A., Aberson, C. L., ... & Soderberg, C. (2024). Power to detect what? Considerations for planning and evaluating sample size. Personality and Social Psychology Review. https://doi.org/10.1177/10888683241228328
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., ... & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719-748.