I am satisfied with the authors' reply to my concerns/suggestions. I have no further comments.
Ethan A. Meyers.
I thank the authors for responding to my points. I have no further comments and I am looking forward to seeing results of this project!
The authors sufficiently addressed the remaining points that I had.
DOI or URL of the report: https://osf.io/j4xd8
Version of the report: 2
Revised manuscript: https://osf.io/vrsd3
All revised materials uploaded to: https://osf.io/phym3/ updated manuscript under sub-directory "PCIRR Stage 1\PCIRR-S2 submission following R&R 2"
I have now received re-reviews from three of the reviewers who evaluated your manuscript in the previous round. All of the reviewers judge the manuscript to be substantially improved, and we are now much closer to Stage 1 IPA. You will find some remaining points to address, principally in clarifying materials and rationale and further strengthening the study design. In responding, please also address points (c) and (d) in the original review by Marek Vranka, which you appear to have overlooked in your previous response. I have extracted and pasted the relevant points below.
c) As mentioned above, for testing H3 and H4, participants who favor the minority candidate will be excluded. I believe this makes sense, but the decision is not discussed or explained in any detail whatsoever. Is there not a risk of bias? For example, if those with high reputational concerns in the condition without credentials favor the minority candidate, they will be excluded. Only those with low values will remain, and there will likely be no association with the DV. In the condition with credentials, even those with high concerns remain, and thus there will be the expected negative association. The exclusion would thus lead to the interaction, but in the opposite direction to the one predicted.
d) This is really a minor point, but since the authors ask participants to select the applicant by writing his or her name, it is likely that there will be mistakes / typos. It could be mentioned how this will be handled. (Alternatively – and I am not 100% sure whether it is possible in Qualtrics, but I guess it should be – on the page with all profiles, the authors could upload each profile as a clickable image and ask participants to click on the selected applicant – instead of circling it – and then write his or her name. See e.g., https://community.qualtrics.com/XMcommunity/discussion/1596/make-pictures-as-answers-clickable)
...
There is a typo in the last sentence before the beginning of the Results section: „... on this measure as we (sic) the ones we ran on prejudiced preferences“.
The authors have addressed most of my concerns. However, I still have a few comments (comments #2, #3, and #4 from my previous review have not been addressed by the authors):
1) H4 talks about correlations, but the analysis uses multiple linear regression. Note that a difference in correlations is not the same thing as a difference in slopes in regression (Rohrer & Arslan, 2021).
2) If preference for males/Whites is influenced by the manipulation of credentials, excluding all participants with some preference for sex/race in the analysis of H3 and H4 may result in the problem of conditioning on a post-treatment variable (see Montgomery et al., 2018). It seems that Reviewer 1 made the same comment (#4) and the authors disagreed with him, yet they removed the mention of excluding “participants who favor females/Blacks in the sexist/racist scenarios”. I am now not sure whether the authors will no longer exclude these participants, or whether the change means that they will exclude them from the tests of all hypotheses and therefore no longer mention the exclusion specifically when describing the tests of H3 and H4. In the latter case, I believe that the exclusion possibly introduces a bias and that it is better to analyze the data including all the participants.
3) I am not entirely sure, but I believe that using non-centered variables and their interactions in the test of H3 makes the main effects uninterpretable (see Dalal & Zickar, 2012; a brief sketch of what I mean follows this list).
4) The hypotheses are directional; will the statistical tests be one-tailed?
5) It is not clear from the description of the pretest (“The data of these 30 participants were not analyzed separately [...]”) whether the participants from the pretest will be included in the analysis of the main study.
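To make point 3 concrete, here is a minimal sketch of what I mean by centering, assuming a Python/statsmodels workflow with hypothetical variable names (the same logic applies in R or any other software): the interaction estimate is unchanged by centering, but the lower-order coefficients become interpretable as effects at the mean of the other predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 'rep_concern' is the reputational-concern score,
# 'credential' is the dummy-coded credential condition (0 = no, 1 = yes),
# and 'preference' is the dependent variable.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "rep_concern": rng.normal(4, 1, n),
    "credential": rng.integers(0, 2, n),
})
df["preference"] = 0.3 * df["credential"] - 0.2 * df["rep_concern"] + rng.normal(0, 1, n)

# Mean-center the continuous predictor before forming the product term.
df["rep_concern_c"] = df["rep_concern"] - df["rep_concern"].mean()

uncentered = smf.ols("preference ~ credential * rep_concern", data=df).fit()
centered = smf.ols("preference ~ credential * rep_concern_c", data=df).fit()

# The interaction coefficient is identical; the 'main effect' of credential
# now refers to participants at the mean level of reputational concern.
print(uncentered.params)
print(centered.params)
```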
References:
Dalal, D. K., & Zickar, M. J. (2012). Some common myths about centering predictor variables in moderated multiple regression and polynomial regression. Organizational Research Methods, 15(3), 339-362.
Montgomery, J. M., Nyhan, B., & Torres, M. (2018). How conditioning on posttreatment variables can ruin your experiment and what to do about it. American Journal of Political Science, 62(3), 760-775.
Rohrer, J. M., & Arslan, R. C. (2021). Precise answers to vague questions: Issues with interactions. Advances in Methods and Practices in Psychological Science, 4(2), 25152459211007368.
Štěpán Bahník
I think the authors did a superb job responding to my review. Unless explicitly mentioned below, the authors can assume that each of my previous concerns has been remedied. Below, I note my outstanding concerns as well as a couple of new ones introduced in this revision.
Major
First off, thanks to the authors for correcting many of my misunderstandings regarding the “anti” vs “non”-sexism distinction, as well as what is and what is not required for moral licensing to occur (e.g., no clearly moral initial act is required).
In their response, the authors targeted one of my weaknesses: as an empiricist at heart, I am inclined to favor a response of "let's test to see if it matters" by default, and this is no exception. I support the authors' proposed solutions, namely to include the following questions: (1) whether participants viewed a task judgment as sexist/racist, (2) how morally good the decision to hire the best candidate was, and (3) whether participants believed the preferences to be prejudiced or not. I have strong preferences as to exactly how these questions should be presented and worded (resulting in only minor tweaks to the authors' intended method), but if the authors disagree with my preferences, then I wouldn't necessarily expect to see any sort of rebuttal/reply as to why. I'll detail these below, but first, I'd like to note that I find these questions especially helpful.
One concern I had while reading the revised work is that the way the authors discuss how moral credentials work is somewhat at odds with how it is tested. (Before I go any further, it’s possible that this concern too is a result of my misunderstanding something, but I don’t think it is). For example, the authors write:
“As a result, subsequently morally questionable behaviors (e.g., making conceivably prejudiced comments against ethnical minorities) are less attributed to genuine prejudice (but more to, for instance, situational factors) and may appear less wrong. Importantly, the credentials license morally dubious behaviors by altering how people interpret them” (page 8)
And
“Because moral credentials license by altering interpretations of behaviors, in theory, they work best when the behaviors are morally ambiguous, which, due to their ambiguity, afford multiple interpretations” (page 9)
The authors claim that people's interpretations of morally dubious behaviors can change when they have moral credentials, as opposed to when they do not. Yet, this was not tested in the original experiment – people's willingness to express an aggregate sex or racial preference for a specific occupation is not the same as altering the interpretation of a behavior. Thus, I am not convinced the original experiment assessed moral credentials if it lacked a test of whether interpretations of the morally dubious behavior were different across the licensing conditions. Indeed, in their response to my review, the authors acknowledge that the original work "may not be inherently appropriate for testing moral licensing, or specifically, the moral credential effect."
However, the additional questions proposed by the authors lend themselves to explicitly testing whether people's interpretations differ across credential conditions. In fact, my judgment is that these questions cut to the heart of the matter. By asking about the morality of the act, the authors can observe whether people's moral judgments of the act do indeed change. I find this to be informative regardless of whether H1 replicates or not.
As for how the questions should be presented, I advocate for the following presentation format:
1) Providing a label for each option (see below), especially a midpoint when relevant
2) Including a short definition of prejudice when participants are going to judge prejudice
3) Including these judgments as early as possible in the design.
I think labelling all points of the scale is especially important for these questions. Participants should be given the option to state they do not have enough information to make a judgment as the moral credential effect implies they do have enough information (how else could one’s moral impression of an act change?).
I also believe everyone should be on the same page about what is meant by "prejudiced". I support providing a definition for "racism" and "sexism" as well when relevant, but view this as a bit less important. Finally, I believe that these questions should be presented as soon as possible in the design (before the reputational concern measure), because I view them as essential to the main question of the study. Furthermore, I doubt that responding to these questions would affect participants' responses to later measures (e.g., reputational concern). However, if the authors are concerned about potential order effects, they could counterbalance the presentation order or randomize it. Frankly, my judgment is that the order is most likely not too important, so I would be okay with the authors ignoring my suggestion. I recommend using the following response options for each question:
Sexism/Racism:
1 = very unlikely to be sexist/ racist
2 = somewhat unlikely to be sexist/ racist
3 = not enough information to judge
4 = somewhat likely to be sexist/ racist
5 = very likely to be sexist/ racist
(The authors could also organize this as a 1 - 4 scale with 5 as the neutral "not enough information to judge".)
Morally Good:
1 = strongly disagree
2 = disagree
3 = neither agree nor disagree
4 = agree
5 = strongly agree
I would also recommend the authors include one more question that is essentially a foil to this question:
“Selecting anyone but [candidate’s name] for the position is a morally bad decision”, with the same response options as above. I think this would be helpful because it is possible that people might not differ in the extent to which they endorse the moral goodness of hiring the best candidate (i.e., a ceiling effect), but they may view not hiring the best candidate in the non-racism and non-sexism conditions as morally worse than not hiring the best candidate in the no-credential condition.
Prejudice
Include the following in the instructions: “by ‘prejudice’ we mean ‘an adverse opinion or leaning formed without just grounds or before sufficient knowledge’” (from the Merriam-Webster dictionary: https://www.merriam-webster.com/dictionary/prejudice)
1 = not at all prejudiced
2 = slightly prejudiced
3 = moderately prejudiced
4 = very prejudiced
5 = extremely prejudiced
Minor:
On page 18, the authors describe the comprehension checks in a way that suggests that participants can pass the check even if they did not comprehend the vignette. It is not clear whether failing the check would lead to exclusion from the study. To address this issue, the authors could consider pre-determining a reasonable filter (e.g., participants who fail more than two attempts are filtered out) to increase the odds of excluding participants who did not comprehend the vignette.
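For illustration only, a pre-determined rule of this kind could be as simple as the following sketch (the column names are hypothetical and would depend on how the Qualtrics export is structured):

```python
import pandas as pd

# Hypothetical export: one row per participant, with the number of failed
# attempts on the comprehension check recorded in 'failed_attempts'.
df = pd.DataFrame({
    "participant_id": [1, 2, 3, 4, 5],
    "failed_attempts": [0, 1, 3, 0, 4],
})

# Pre-registered rule: exclude anyone who failed more than two attempts.
analysed = df[df["failed_attempts"] <= 2].copy()
print(f"Excluded {len(df) - len(analysed)} of {len(df)} participants")
```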
On page 16 and throughout the manuscript, the authors use the terms "gender preference" and "non-sexist credential". To promote consistency, it would be better to use either "sex preference and sex credential" or "gender preference and gender credential". Given that the authors use "male" and "female" throughout the manuscript and are specifically studying non-sexism, I suggest they replace the word "gender" with "sex" wherever relevant.
On page 19, the authors include numbers as part of the scale, but they plan to deviate from the initial method by tweaking the values from negative to positive. One suggestion is to remove the numbers altogether since each scale point has a clear text label associated with it. Removing the numbers would eliminate any concern about the pluses/minuses being associated with ethnicities/sexes. Alternatively, the authors could counterbalance the scale order to test whether this could have affected the original results. Otherwise, I suggest the authors not proceed with their subtle sign change unless they can provide reasonable justification as to why their concern (that some people might be bothered by minuses being associated with Blacks/ females) wouldn’t have also applied to the participants of the original design.
On page 19 and throughout the manuscript, I suggest the authors avoid using the term "gender/ethnicity preference" as it could imply that someone generally prefers Whites or Blacks or Men or Women, even though the authors do not mean it this way. One alternative is to use the language of "hiring preference," where the preference is to hire the best candidate, and any differences in hiring preference across the sexism or racism conditions reflect some expression of sexism or racism.
On page 21, the authors deviated from circling and writing down the full name of the candidate to typing just the last name, citing the lack of a straightforward way to implement circling on Qualtrics. One suggestion is to consider using radio buttons underneath the profiles or some variant of that. Additionally, it is unclear whether the inability to circle has any bearing on whether the participant must enter the full name vs. just the last name.
Signed: Ethan A. Meyers
Thank you for considering my comments and providing thoughtful responses.
Overall, I feel that my concerns have been adequately addressed. However, there are two remaining points:
While I understand and accept your focus on replicating the original design without aiming for well-powered extensions, I believe it is important to discuss this aspect in the paper, particularly in the discussion section as a limitation. Because this is a Registered Report (RR), potential readers might assume that all tests are well powered, and non-significant findings for the extensions could inadvertently hinder further research in this area. I am confident that you are aware of these limitations, and I simply wish to ensure that they are clearly communicated in the manuscript. That being said, I find your rationale for the decision sound and do not have any issues with it.
It appears that you may have overlooked two of my suggestions and a postscript note. Your response concludes with point b) from my suggestions, but points c) and d) seem to have been missed. Additionally, you have not corrected the typo mentioned in the postscript, which indicates that you likely overlooked these items as well.
Signed,
Marek Vranka (vranka.marek@gmail.com)
DOI or URL of the report: https://osf.io/fy5aw/
Revised manuscript: https://osf.io/j4xd8
All revised materials uploaded to: https://osf.io/phym3/ , updated manuscript under sub-directory "PCIRR Stage 1\PCI-RR submission following R&R"
It was my pleasure to review this Stage 1 RR. Overall, I find all materials well prepared and the authors' writing easy to follow.
The rationale behind the replication study is clearly stated and its importance is well documented by citing the popularity of the original findings, the low power of existing studies, meta-analytic evidence suggesting overestimation of the effect size, and the mixed results of already conducted direct and conceptual replications. The target study of this replication is well chosen, as it has not been previously replicated and it also allows for an elegant extension, namely exploring the effect of „mixed credentials“. The authors propose an additional extension, that is, to explore the effect of the „reputational concern“ trait. Neither extension interferes with the design of the replicated study, so it can still be considered a very close replication. Ethical concerns are adequately considered.
H1 derives directly from the original study, H2 is related to the first extension, and H3 and H4 relate to the second extension. All hypotheses are clearly stated and mapped to sensible, clearly defined research questions. The preregistered statistical tests are suitable for testing the hypotheses, and the interpretation of the results is unambiguous, as the authors explicitly describe the results that would support their predictions. There is a possible issue in the analysis: the authors plan to first test hypotheses H1 and H2 separately, using two 2x2 ANOVAs, but then also to test them jointly using one 3x2 ANOVA. This could in theory lead to a situation in which the results of the two approaches are mismatched, and it is not clear what the conclusion would be in such a case.
The sample size calculation is reasonable and, even though one would ideally prefer a larger sample size, the available funding is understandably not unlimited. For the direct replication, the study has sufficient power to detect an effect of d = 0.25, which is roughly half of the effect size reported in the original study. The first extension is likely looking for a smaller effect (the difference between the effects of matched and mismatched credentials), and thus the test could be underpowered. The same applies to the second extension, where only part of the sample (those not favoring the minority candidate) will be used and an interaction effect is tested (for H4). The RR could thus be strengthened by adding a more detailed discussion of the statistical power for the extensions of the original study.
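To illustrate, a rough sketch of such a power calculation, assuming a simple two-group comparison via statsmodels (the smaller effect size and the subsample fraction below are only placeholders, and the interaction test for H4 would fare worse still):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group n needed for d = 0.25 at 90% power (the replication target)...
n_replication = analysis.solve_power(effect_size=0.25, power=0.90, alpha=0.05)

# ...versus the power that same n would give for a smaller extension effect
# (here, illustratively, half the size) or for a reduced subsample
# (here, illustratively, 60% of participants retained for H3/H4).
power_small_effect = analysis.solve_power(effect_size=0.125, nobs1=n_replication, alpha=0.05)
power_subsample = analysis.solve_power(effect_size=0.25, nobs1=n_replication * 0.6, alpha=0.05)

print(round(n_replication), round(power_small_effect, 2), round(power_subsample, 2))
```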
The methodological part is written in a great detail, providing enough information for any possible future replications. In this regard, only a few minor points:
a) In the “Participants” section (p. 16), it is not clear that only participants from and currently in the U.S. will take part in the study, as evident from the exclusion criterion (p. S12). Moreover, the exclusion criterion itself (“Participants who are not from or currently in the U.S.”) could be rewritten to make clear whether only one or both conditions must be satisfied.
b) It is not clear what will happen when participants fail the comprehension checks (p. 19). Will the survey end or will they be able to examine the scenario and attempt to answer correctly for a second time?
c) In the description of the funneling section at the end of the questionnaire, the item about a previous encounter with the materials used in the study is omitted (p. 20). It could be mentioned for the sake of completeness (but see point b) in the section below).
Some additional questions and suggestions:
a) Is it possible that measuring the trait reputational concern after the experimental manipulation may bias the analysis of H3 and H4? (Participants may answer differently depending on whether they had matched, mismatched, or no credentials in the preceding part of the study; see e.g., Montgomery et al., 2018.) I understand that it is not feasible to measure it before the experiment in order to keep the design close to the original study. Maybe it would be possible to contact participants again and have them fill in the questionnaire (this could be easy with Prolific, for example, but probably also with MTurk).
b) One of the proposed exclusion criteria (no. 4, p. S12) is “Participants who indicate that they have seen or done similar surveys”. I am worried that the question asking about this is too vague and could be answered “yes” by anyone who has ever taken part in a study in which they selected a job applicant. As far as I know, this criterion was not used in the original study, and there are no strong theoretical reasons why previous participation in a similar study should matter. I recommend either making the question specifically about a study with this scenario, or dropping the criterion completely.
c) As mentioned above, for testing H3 and H4, participants who favor the minority candidate will be excluded. I believe this makes sense, but the decision is not discussed or explained in any detail whatsoever. Is there not a risk of bias? For example, if those with high reputational concerns in the condition without credentials favor the minority candidate, they will be excluded. Only those with low values will remain, and there will likely be no association with the DV. In the condition with credentials, even those with high concerns remain, and thus there will be the expected negative association. The exclusion would thus lead to the interaction, but in the opposite direction to the one predicted.
d) This is really a minor point, but since the authors ask participants to select the applicant by writing his or her name, it is likely that there will be mistakes / typos. It could be mentioned how this will be handled. (Alternatively – and I am not 100% sure whether it is possible in Qualtrics, but I guess it should be – on the page with all profiles, the authors could upload each profile as a clickable image and ask participants to click on the selected applicant – instead of circling it – and then write his or her name. See e.g., https://community.qualtrics.com/XMcommunity/discussion/1596/make-pictures-as-answers-clickable)
I hope at least some of the comments above will be useful and help to improve the RR.
Signed,
Marek Vranka (vranka.marek@gmail.com)
PS:
There is a typo in the last sentence before the beginning of the Results section: „... on this measure as we (sic) the ones we ran on prejudiced preferences“.
References:
Montgomery, J. M., Nyhan, B., & Torres, M. (2018). How conditioning on posttreatment variables can ruin your experiment and what to do about it. American Journal of Political Science, 62(3), 760-775.
A big hello to everyone who is reading this.
I have not reviewed a registered report before. Instead of trying to implement my typical format of reviewing, I decided to follow the guidelines provided by the PCI RR. Aside from an initial major point that I raise below, my review will consist of my answers to the “Key issues to consider at Stage 1”. I recognize that this will make responding to my letter a bit more difficult, but I trust that the authors will be able to figure out a reasonable solution.
Major Comment:
I think the work suffers from only one major flaw: Study 2 from Monin and Miller (2001) is not a conceptually (or methodologically) strong study. If one aim of the present work is to further our understanding of moral licensing effects, then I think the authors ought to select an experiment that more clearly assesses moral licensing, or fix the issues that I raise below. Even though the authors want to specifically offer a replication of a seminal study of moral licensing, there remain plenty of other suitable choices.
As provided by the authors, the definition of moral licensing is “when moral acts liberate individuals to engage in behaviors that are immoral, unethical, or otherwise problematic, behaviors that they would otherwise avoid for fear of feeling or appearing immoral.” Based on this, there are two required elements of moral licensing: (1) a clear initial moral act and (2) a subsequent immoral act.
In my view, the decision vignettes are lacking in both elements.
The proposed experiment features a hiring task and then a job description with an opinion question. The hiring task, supposedly the source of the initial moral act, asks participants whom they should hire out of 20 possible candidates. In each condition there are 19 white male applicants and one star applicant who is either a black male, a white female, or a white male. The idea is that choosing to hire the best candidate in each condition might afford a moral license to those who hired the black male or the white female, but not to the person hiring the white male. This is because it might be seen as “anti-sexist” to hire the female or “anti-racist” to hire the black male. These moral credentials seem vaguely possible, but they don’t make much sense. If your goal was to hire the best person for the job and that person also happened to be black, in what world are you being anti-racist? Hiring a worse candidate because they are white and not black would be an expression of racism. Hiring the best candidate who happens to be black because they are the best candidate is not an expression of anti-racism. Hiring a worse candidate because they are black and not white might be an expression of anti-racism, as is sometimes argued (I would argue it is still just racism), and thus might be the only scenario in which I see an “anti-racist” credential being possible. To summarize, it is not clear to me what moral credential the average person might obtain from the anti-racist condition.
My thoughts and arguments are identical for the anti-sexism condition – hiring a woman who is the best candidate is not anti-sexism, it is hiring the best candidate. Hiring a worse man instead of a better woman is sexism, as is hiring a worse woman instead of a better man (sometimes argued to be “anti-sexist”). So, it is not clear to me what moral credential the average person might obtain from the anti-sexist condition.
To further complicate matters, it also isn’t clear to me how “hiring the best person” might not count as a moral credential. A person might value meritocracy highly. In this case, the moral act is to hire the absolute best person regardless of what they look like. In this sense, it shouldn’t matter to me whether the person I hired is female or black -- if I hired the best, I’ve done the right thing. Under this view, couldn’t each condition be granted a moral license? Unless this can be ruled out, I don’t think a license vs. no-license comparison (the primary test of the replication) would be meaningful or possible.
So, participants complete the hiring task, possibly picking up some sort of vague credential, maybe not, or maybe they are feeling extra moral for having hired the best person. They are then presented with another hiring scenario. This one is a bit different, however. Instead of hiring a candidate, participants are provided lots of details about a job (e.g., that it requires exuding confidence) and are told that they have already hired someone. They are then asked the key question: whether the job is particularly suited to one ethnicity or gender (scenario dependent). Unlike the first part of the task where participants had to actively select the individual candidate they wished to hire, in this portion of the task they are expressing a group level preference. It’s already a stretch to label answering this question a “behavior”, but to claim it is an expression of prejudice makes little sense to me. Allow me to elaborate. Consider the question asked in the anti-racist condition:
“You wonder whether ethnicity should be a factor in your choice. Do you feel that this specific position (described above) is better suited for any one ethnicity?”
It isn’t clear to me how a belief that any single ethnicity might be better suited for this role would necessarily be prejudiced. First and foremost, what is not being asked for here is some sort of individual-level evaluation. There is no set of applicants one must choose between, weighing all sorts of factors. Instead, the question simply asks whether the person expects any sort of difference between any two possible ethnicities. The comparisons are endless! Moreover, it seems completely plausible that a person might think that ethnicity imperfectly tracks culture. Cultures differ and thus produce people of different values, some of whom will be more or less suited for any occupation. Answering something akin to “it seems possible” or “it might” or “it could” despite holding this perspective will be labelled by the authors as prejudiced. This makes little sense to me, especially if we consider that prejudice means an opinion not based on reason or actual experience; it is not clear to me how one could even begin to infer prejudice without making an earnest attempt to seriously understand the reasoning or experiences of each individual participant.
The anti-sexism case is similar. The participant is told that one needs to exude confidence in the job (among other things) and is then asked whether they think that, broadly speaking, one gender might be more suited for this role. If I hold the belief that men are a bit more confident (and especially overconfident) than women, why wouldn’t I think that at a group level men might be more suited for this job? Believing this would certainly never preclude you from hiring a woman. Indeed, I could simultaneously endorse the truth that men tend to be a bit more confident AND hire only women for the position, because, at the level of the individual, where hiring decisions actually take place, I am selecting the best candidate as they appear. Despite group averages, individual women can be extremely confident and, likewise, individual men can be extremely unconfident.
To summarize: Moral licensing first requires a moral act and then requires an immoral one. In the scenarios to be presented, the original moral act isn’t clear and neither is the immoral one.
I hope these arguments illustrate the conceptual problems with Study 2 of Monin and Miller (2001). To be perfectly clear about my position: I don’t think any researcher should consider Study 2 to be evidence of anything. Meta-analyses that have included this study should promptly exclude it, and similar criteria should be applied to evaluate the rest of the included studies.
In my view, it is not a requirement that the authors “fix” the problems of the original work. However, without doing so, I would be highly doubtful that this work would provide anything valuable. This also applies to potential extensions of this flawed work. I think the authors have two options. One is to attempt to resolve these confounds and present an even stronger test than was originally offered. The other is to simply pick a different experiment to replicate.
Key Issues:
And now for the key issues (bulleted) and my responses (normal)
The researchers do a good job in setting up the importance for replicating moral licensing work. Despite several replication attempts already undertaken, the number of recent meta-analyses on the subject suggest that there is a debate in the literature large enough to warrant further replication attempts. Overall, I felt convinced that replicating moral licensing work would be worthwhile if it could be used to help determine whether the effect is real or not.
I found the proposed extensions both clear and unclear and raise a potential concern about each of them.
First, the reputational concern hypothesis is very sensible. The authors did a great job of pointing out the role that reputational concern could play in this effect (consulting the work of Rotella et al.) and of highlighting the importance of studying it further. However, I’m afraid they might have done “too good a job” at this. I can’t help but wonder if manipulating reputational exposure rather than measuring reputational concern would be a much better test of this hypothesis…
According to the work of Rotella et al., studies with explicit observation produced larger effects than those with only some or no observation. I would think the most natural extension of this finding is to manipulate whether participants are being observed (or think they are being observed) or not. The correlation that the authors are proposing would indeed help to answer this question. But we tend to think of experiments as stronger tests of mechanisms than observational studies, and in my opinion that certainly applies here. I also don’t think running a study of this variety would suffer from many practical (especially resource) constraints. For instance, the authors are already planning to recruit 350 participants to assess the correlation anyway. These participants could be used for the experiment instead. This would also allow the authors to run fewer participants in the replication experiment, as the moral licensing task would remain the same across both experiments (so the samples could be combined for the strongest estimate of the effect).
Second, I am not completely convinced by the rationale of the domain extension. While I understand the idea of “racist” and “sexist” domain-specific moral credentials, I wonder why the rationale could not also apply to a “hiring decision” domain. That is, the first judgment takes place in a hiring situation, and so does the second. Could it be the case that each decision in this experiment takes place in a “hiring domain”? I think it’s plausible. Because it is plausible, the authors risk finding no effect of domain based on their definition despite there being a real effect of domain at a level other than the one they were considering. To remedy this problem, the domains should be made further apart to minimize alternative explanations for the observed effects. Without this, I am not sure the domain test is convincing enough, which might call into question the value of this extension as proposed.
I also wonder to what extent the domain hypothesis is already tested in the original experiment and observable in the planned analyses of the replication attempt. Given that the test is a 2 Credential (Yes/No) x 2 Scenario (sexism/racism) design, wouldn’t a significant two-way interaction imply an effect of domain? Similarly, wouldn’t a null interaction imply that, if there is an effect of domain, these two domains aren’t far enough apart? I could be wrong in my reasoning here…
Given the detail provided by the authors, I believe that I could run their proposed experiment and analyses right now if they would be willing to fund it ;) (and I wouldn’t necessarily call myself an expert).
For the replication hypothesis, the authors state that their evaluation of a non-replication will follow the criteria of LeBel et al. (2019). So, aside from any grey area this contains, yes.
In terms of the rest of their hypotheses the authors do not seem to have indicated how they will interpret different results. For the most part the interpretation seems to be clear when the results are in line with their prediction. However, it is not clear to me (as I raised above in the domain case) how they might handle null results here. Some further detail would be appreciated.
The sample size is sufficient for the replication hypothesis according to the power analysis conducted by the authors. I think the proposed power analysis (d = .25, power = 90%, alpha = .05) is reasonable. However, I wonder whether the best decision might be to power the study to detect d = .18, the smallest value in the estimated range of the uncorrected moral licensing effect size according to the meta-analyses cited in the introduction. But I understand the potential practical (i.e., resource) constraints facing the authors, so I leave it to them to decide.
The extensions are less clear. As stated above, it seems like the authors calculated power for the replication attempt and are essentially assuming/ hoping that the effects of the other tests will be at least as large. I don’t think this is unreasonable, but it should probably be made explicit. Otherwise, I would request that the authors justify their smallest effect size of interest for each of the extensions and ensure that they are sufficiently powered to test those hypotheses.
It is not clear. I believe the authors need to first establish what a null result means conceptually before being concerned about its statistical evaluation.
Yes.
The authors do a good job at pointing out work already done. However, I think they should spend a bit more time explaining some of the replication attempts (especially Blanken et al.,) that have been undertaken for moral licensing. Right now, I think the intro gives the impression that “some work has been done” without providing any further understanding about how, when, or why the work was conducted or what it found.
I am not certain what the comprehension questions are (I could not find them in the supplement). Otherwise, their exclusion criteria are prespecified and, in my evaluation, reasonable.
N/A
It is my impression that the authors have considered the ethical risks to the research.
Minor Comments:
Page 8 “moral debits” I think should be “moral debts”
Page 11, paragraph 2 first sentence “has” should be “Have”.
Page 11, paragraph 2 in parentheses “…might not help when the person engage…” should be “engages”
Page 11 near the bottom “…concluded no support for idea…”
Page 13, bottom paragraph, the authors claim they are testing moderation but write it as if they are testing mediation.
Participants
The first paragraph has “etc” after both sets of options. I’m not too sure what was meant by this… what are the other options you are planning to employ?
“For example, 5-8 minutes survey would be paid 1USD per participant” was a confusing sentence to read, but I understood what was meant.
Design and Procedure
Paragraph 1 final sentence – “… who did not ” should be “who would not” and “… pay attention but only” should be “… pay attention and only”
Deviations
I get the rationale behind wanting to deviate from the original design and present the profiles individually at first. But I cannot shake the concern that this change could have a completely unexpected effect (perhaps in the opposite direction), which may harm the replication component of the work. If Monin & Miller were able to find a genuine result with their method, I lean toward replicating it as closely as possible (with fixes to the confounds I raised), meaning removing this addition.
The rest of the deviations seem sensible.
Confirmatory Analyses
These seem reasonable. My only concern is that the authors are planning to run many uncorrected tests, often using the same variables across these unprotected tests. They plan a Tukey post-hoc comparison only after computing three ANOVAs and four uncorrected linear contrasts. For the purposes of a replication, I wonder whether it is best to be as strict as possible about Type 1 error inflation. If the authors do not want to correct for some of these additional tests as part of their confirmatory analyses, then so be it, but I would strongly suggest that they at least include footnotes that specify where the results diverge if a more stringent correction is applied to all but a single ANOVA (e.g., a Bonferroni correction for every test other than the ANOVA for H1 – the primary replication).
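For example, such a footnote check could be based on something as simple as the following sketch (the p-values are placeholders, and only the secondary, non-H1 tests would be adjusted):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from the secondary (non-H1) ANOVAs and contrasts.
secondary_pvals = [0.012, 0.034, 0.047, 0.008, 0.21, 0.003]

# Bonferroni-adjust everything except the primary H1 ANOVA and report where
# conclusions would change relative to the uncorrected analysis.
reject, p_adjusted, _, _ = multipletests(secondary_pvals, alpha=0.05, method="bonferroni")
for p_raw, p_adj, sig in zip(secondary_pvals, p_adjusted, reject):
    print(f"raw p = {p_raw:.3f}, Bonferroni p = {p_adj:.3f}, significant: {sig}")
```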
Signed,
Ethan A. Meyers.