DOI or URL of the report: https://osf.io/5g6vy/
Revised manuscript: https://osf.io/vkp3q/
All revised materials uploaded to: https://osf.io/fujsv/ , updated manuscript under sub-directory "PCIRR Stage 1\PCI-RR submission following R&R"
Two reviewers, including one of the authors of the target study for the replication, have now kindly evaluated the Stage 1 submission. At a broad level, the submission has many promising characteristics, but you will see that the reviews raise a number of substantial issues that will need to be addressed to achieve in-principle acceptance (IPA). I will highlight some of the headline points. In terms of the framing of the replication, a broader consideration of the literature would be useful (as noted by Craig Fox), as well as clarification of the rationale for certain predictions and extensions of the original methodology. Other key points include consideration of deviations from protocol and potential negative consequences of doing so (e.g. use of a single sample for all three studies), and, alternatively, whether keeping certain features of the original methods the same might inadvertently reduce validity due to changes in how the measures will be received now compared to 20 years ago. Keeping a replication study as faithful as possible to the original methods while also ensuring that it provides a theoretically valid replication can be a challenging balancing act -- there are often no perfect solutions, but one option would be to run parallel studies in different samples using original vs updated methods to assess robustness. The reviewers also highlight concerns with possible knock-on effects of completing different measures, and raise queries regarding the patriotism intervention (and the statistical power of that analysis as well as other extensions, noted by Leonardo Cohen).
The authors propose to replicate three studies in Fox et al. (2005) that demonstrate "partition dependence", and to extend them with new measurements to better explain the drivers behind the observed phenomenon. Partition dependence has been explored in many areas of psychology research and is also known as the 1/N or "naive diversification" effect. I thank the authors for providing such detailed and open experimental materials.
While it is important to replicate studies, in particular those conducted on such a specific "WEIRD" original population, I am unfortunately not sure that an exact replication of this specific set of studies, conducted more than 15 years ago, makes sense today without further changes to ensure its generalizability. Perhaps a conceptual replication is better suited.
For example, in Study 1, students at Duke University were asked about the distribution of financial aid to students at Duke University. There must be an effect of students making choices about financial aid at their own institution: for example, were those students more likely to believe that their answers could affect the actual distribution of financial aid (and perhaps even affect themselves or their colleagues), as opposed to a broader Mechanical Turk audience who are not necessarily students and are not at Duke? I would go so far as to suggest that many Mechanical Turk participants will not even be aware of what Duke University is. I believe that the proposed study setting makes the task much more hypothetical than in the original study, and therefore not a true replication. I also propose that the stipend and the income brackets should at least be adjusted for inflation.
Unfortunately, Study 2 is also not a true replication, because the authors change one of the categories from Durham funds (where Duke is located) to United States funds. While this change is understandable, since they can no longer guarantee that participants will be local to Durham, individuals are surely more likely to be emotionally attached and loyal to their local community (in-group) than to a country-wide community. Can the authors filter MTurk to a more specific locality, for example a state, and verify location via IP geolocation so that it cannot be faked? (The authors should also be aware that their information sheet does not inform participants that their IP address is being recorded.) Even at the state level, though, this would not be an exact replication of the county-level in-group affiliation in the original study.
In Study 5, I am not a wine expert, but I am aware that wines can improve (or indeed spoil, as some wines are meant to be consumed closer to harvest than others) over time, depending on the variety and at different speeds. Using the original list of wines more than 15 years later does not make sense to me: perhaps these wines no longer exist, perhaps they would taste horrible, or perhaps they are now extremely rare and valuable. The authors should consider updating the list of wines. The prices of the wines also need to be adjusted for inflation.
There were also methodological flaws in the original study that do not warrant replication and could be improved. For example, the original authors use a t-test to analyse percentage data, which is bounded between 0 and 1. This is not appropriate, and might explain the inexplicably large effect size. Perhaps with the limited computing power of 15 years ago this was acceptable, but nowadays much better methods are easily available. For example, a beta regression could be considered (a transformation might be needed to remove zeroes and ones, for example y' = y*0.998 + 0.001). If there are many zeroes and ones, a zero-one-inflated beta regression could be used instead, and no transformation is needed. A t-test should still be conducted for the sake of replication and comparison with the original results, but a more appropriate analysis should also be run to confirm the findings.
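To illustrate, here is a minimal sketch in Python of the transformation and a beta regression. The column names `allocation` and `condition` and the simulated data are hypothetical, and the `BetaModel` class is available in recent versions of statsmodels; this is an illustration of the approach, not the authors' planned analysis.

```python
# Illustrative sketch only: column names and data are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.othermod.betareg import BetaModel  # statsmodels >= 0.13

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "allocation": rng.beta(2, 5, 200),    # allocations as proportions in (0, 1)
    "condition": np.repeat([0, 1], 100),  # e.g. packed vs unpacked framing
})

# Squeeze the response off the boundaries: y' = y*0.998 + 0.001
# maps [0, 1] onto [0.001, 0.999], so a standard beta model applies.
df["allocation_sq"] = df["allocation"] * 0.998 + 0.001

result = BetaModel.from_formula("allocation_sq ~ condition", data=df).fit()
print(result.summary())
```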
I also believe that the hierarchical approach in Study 2 can be very misleading. Participants are asked to allocate 100% between national and international funds, then asked again to allocate 100% among the national funds. I believe that the majority of participants will not understand that the second allocation of 100% is a subset of whatever percentage they first allocated to the national funds. Previous work on numeracy has consistently shown that individuals are very bad at understanding percentages. Instead, the authors should state that the second allocation must sum to the percentage entered for US funds in the first allocation: for example, a participant who gives 60% to US funds should then distribute that 60%, not a fresh 100%, among the individual US funds. As a smaller point, will participants be aware of what United Way is? Perhaps the original authors knew that students at Duke would know, but I can envisage many respondents being unaware of its work.
When analysing Study 5, the original authors discard a considerable amount of data (participants who chose {2,2} or {3,3} combinations of wines). I do not understand how that can be acceptable: data cannot simply be discarded for the sake of analytical simplification. I understand that the cells are not independent, which violates one of the assumptions of contingency tables and logistic regression. Perhaps the stimuli need to be updated so that the cells are independent, which I believe could be done with a different set of wines, or a better analytical method must be identified.
I also do not understand the rationale for the original measurement of "expertise" in Study 5, based on the number of bottles of wine consumed. Is this a validated measure of expertise? What is the cutoff point that makes one an expert? And how can we justify that an individual who buys more wine than another is thereby an expert? Surely there are better measurements of expertise that the current authors could explore. I see the authors have added self-reported questions about wine to their questionnaire, but they do not explain how these will be incorporated into the analysis, or how they have been validated as true measurements of expertise.
In fact, upon looking at the questionnaire that the authors plan to deploy, I noticed *many* additional measurements that are not mentioned in the registered report. For example, there are many questions about wine. I guess these might be used for later exploratory analysis, and I imagine the additional variables being captured might be considered for further extensions? These analyses *must* all be pre-registered. It is not clear, for example, whether choice diversity is added to the model as an interaction, as a main effect only, or both. The way these measurements are calculated is also important. And are the measurements going to be centered or standardized when entered into the model?
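To make the point concrete, the pre-registration could pin down the specification along these lines. This is a sketch with hypothetical column names and simulated data, and OLS stands in here for whatever model the authors actually plan to use:

```python
# Sketch of competing model specifications (hypothetical column names;
# OLS is a placeholder for the authors' pre-registered model).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "allocation": rng.uniform(0, 1, 200),  # outcome (placeholder)
    "condition": np.repeat([0, 1], 100),   # experimental condition
    "diversity": rng.normal(50, 10, 200),  # choice-diversity score
})

# Center the continuous moderator so the condition main effect is
# interpretable at the average diversity score.
df["diversity_c"] = df["diversity"] - df["diversity"].mean()

# Main effects only vs. main effects plus interaction: the registered
# report should state which of these is the planned model.
main_effects = smf.ols("allocation ~ condition + diversity_c", data=df).fit()
with_interaction = smf.ols("allocation ~ condition * diversity_c", data=df).fit()
print(with_interaction.params)
```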
I do appreciate the authors' attempt to create a choice diversity measurement. However, one potential problem comes to mind that should be addressed: how might answering the earlier questions in the questionnaire, which show participants many choices and ask them to diversify among them, influence answers to the diversity metric? Can the latter be measured without influence from the former, given that they are sequential in time (and closely follow each other)? Could the diversity metric be measured separately somehow, perhaps during another session?
I know that the original studies were published in a high-quality journal and that all these methodological aspects have probably been discussed and covered already by other reviewers. However, I would propose that instead of an exact replication, a conceptual replication that improves on the original methods could be employed.
Minor points:
On page 12 the authors state that Study 5 is a "mixed design". I do not understand how that is the case, as it appears to be a between-subjects design.
On page 16, the authors list which filters they are using, plus "etc", twice. That is very ambiguous: a comprehensive list must be provided, otherwise the work cannot be replicated in the future.
On page 17 they describe the participants as US American, which to me implies citizenship. It should perhaps instead say US-located.
The patriotism questionnaire, in my view, has many flaws. Just because participants are physically located in the US does not mean they are US citizens (see my comment below on citizenship), so questions like "How strong is your love for your country" are very ambiguous, as the respondent might not be thinking of the United States. I also think that the flag-burning question might trigger concepts of law and criminality: even though burning the US flag has not been a crime since 1989 (according to a quick search; I am not an expert), participants might believe that burning flags is illegal, and therefore wrong regardless of their patriotism. This also applies to the question about sharing government secrets: it does not specify which government, so it might not be the US, and it is also a crime, so in someone's view it could be seen as wrong regardless of patriotism. I believe these questions might be measuring different concepts.
The treatment of outliers strikes me as strange. The authors say that outliers will be excluded if they fail to replicate the original results. But what if it is the other way around: the replication was driven by outliers, and removing them would remove the replication? I think the authors need a better strategy for outliers, either removal or no removal, regardless of the findings. It is also important to note that the 3 SD rule is not the best approach for percentage data, and I am not sure how it would apply to count data (Study 5).
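As a quick illustration of why the 3 SD rule misbehaves on bounded percentage data, the following sketch (with simulated, purely hypothetical allocation data) shows that the upper cutoff can exceed 100%, so the rule can only ever trim one tail:

```python
# Simulated data: why mean +/- 3 SD misbehaves on bounded percentages.
import numpy as np

rng = np.random.default_rng(0)
pct = rng.beta(5, 2, 500) * 100  # left-skewed allocations on [0, 100]

lo = pct.mean() - 3 * pct.std()
hi = pct.mean() + 3 * pct.std()
print(f"cutoffs: [{lo:.1f}, {hi:.1f}]")  # upper cutoff exceeds 100, so no
                                         # high value can ever be flagged
```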
The questionnaire mentions to participants that there are attention checks. What are they? And how will participants who fail the attention checks be treated?
You ask the following question: "This survey is only intended for native English speakers born and raised in the United States." I am not sure this was part of the original research, as it is not stated in the original paper. Why are only those born in the US allowed to participate? And how can you guarantee that individuals will answer this truthfully? Furthermore, what are the data privacy implications of collecting potential immigration data from participants? Is there a danger that participants might lie about their citizenship?
How are the authors going to analyse the data from open-ended text questions? There are many of these, such as "Briefly describe your reasoning for choosing the three wines you selected on the previous page".
The authors do not conduct a power analysis for their extensions, for example for patriotism. What is the minimum sample size needed to observe an interaction with patriotism? I also do not understand why the authors asked Qualtrics to randomly generate data when they could have generated data based on the findings from the original study. Alternatively, given that the original effect size is probably uncharacteristically high, the authors could have generated data based on their understanding of what the difference between groups could realistically be (perhaps with a smaller effect size). This would help determine power more accurately, as well as establish what sample size they would need to detect a significant effect for their extensions.
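One option would be a simulation-based power analysis along the following lines. Every effect size, error SD, and sample size in this sketch is a placeholder assumption, not an estimate from the original study, and OLS stands in for whatever model the authors pre-register:

```python
# Simulation-based power sketch for a condition x patriotism interaction.
# All parameter values below are placeholder assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

def one_simulation(n_per_group=200, b_interaction=0.10):
    condition = np.repeat([0, 1], n_per_group)
    patriotism = rng.normal(0, 1, 2 * n_per_group)  # standardized score
    y = (0.50 + 0.15 * condition + 0.05 * patriotism
         + b_interaction * condition * patriotism
         + rng.normal(0, 0.25, 2 * n_per_group))
    df = pd.DataFrame({"y": y, "condition": condition,
                       "patriotism": patriotism})
    fit = smf.ols("y ~ condition * patriotism", data=df).fit()
    return fit.pvalues["condition:patriotism"] < 0.05

# Power = proportion of simulated experiments detecting the interaction.
power = np.mean([one_simulation() for _ in range(1000)])
print(f"Estimated power at n = 400: {power:.2f}")
```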
Leonardo Cohen