This Stage 1 Registered Report (RR) aims to test three hypotheses about how free participants feel to contribute to online conversations containing toxic comments, and whether participants feel that a specific toxic comment or situation has been addressed and resolved by a given response to that comment.
I applaud the authors for fleshing out predictions for multiple possibilities of the outcomes - it is such a great way to a priori consider how you will interpret whichever result ends up being supported and to make these alternatives part of the whole research program (rather than just discussing a favorite prediction, which might not be supported).
The RR is well developed and carefully thought out. Please see my comments below (minor and major mixed together, following the page numbers of the RR) for ways in which I think it could be clearer and for a couple of (surmountable) issues.
Abstract - state what the n=126 and n=800 refer to: the number of Reddit conversations? Comment-reply pairs? People?
Page 3, par 1, sentence 1: perhaps start with a broader sentence to introduce the idea for your article and why this topic matters. Starting with the big problem you aim to solve could be a good angle; it would then make sense why you jump into Google’s codebook and definitions, and why it matters how people respond to toxic posts. Also explain what API stands for.
Page 3 “While Kolhatkar and Taboada (2017) have argued that comment toxicity is unrelated to its ability to promote civil” - clarify what “its” refers to. Reddit? And clarify whether you think that responses to news articles will be different from interpersonal interactions. As a reader, I don’t know how to interpret this sentence as it relates to your research - does this study have an impact on the interpretation of your results? Or are news articles a different context and you think the responses there won’t be relevant to your context?
Page 4 - “one-on-one conversation can persuade the original commenter to change their views” - in what context? Change views about beliefs, or change views about participating in an online conversation? It seems like the former, because I assume the one-on-one conversation happens in person? If that is the case, it would be good to make an argument about whether in-person interactions apply to online interactions, to predict whether this finding would apply to your research question’s online context.
Page 5 - “Are there any differences among them in how free participants feel to participate?” - differences among what? The three strategies you outlined in the previous sentence?
Page 5 - “Perhaps benevolent correction of the toxicity is the best strategy” - the best strategy for what and in what context? I can imagine that the best strategy could differ depending on the goals/motivations of the forum/commenter/observer and whether repeated interactions with these individuals would be required in the future.
Page 6, Hypothesis 1a - how are “benevolent replies” different from 1b “benevolent corrections”? It seems like the latter would be a subcategory of the former, but it depends on how you categorized each term and whether the data used to evaluate each hypothesis overlap (i.e., whether all of the data from 1b are included in the 1a analysis). This becomes clear later in the RR, but I think it would be good to mention here near the beginning for clarity.
Page 7 - “had more respect for the second person if they condemned vs. empathized with the target”. I’m not clear on which condition elicited more respect for the second person: when the second person condemned the target, or when they empathized with the target? Could you provide more detail?
Study design table > Interpretation given different outcomes: how will you determine whether or not there is a difference between the means?
Study design table > Q2 > rightmost column: replace “I” with “it” in “If H2a is supported, I…”
Study design table > Manipulation check - correcting > Hypothesis - should retaliatory be added to this cell? It looks like it because the retaliatory condition is in the ANOVA and in the interpretation.
Study design table > Manipulation check - toxicity > Hypothesis - “Ensure the first impression of each toxic commenter is similar across conditions.” The first impression of the participant as they participate in the experiment? Or the first impression of the experimenters who are categorizing the comments as toxic, benevolent, etc? Again, this becomes clear later on, but good to mention early in the RR to help readers follow.
Page 11 - for the interrater results, please state what test was used.
Page 11 - “The research assistants also re-rated the toxicity of each initial comment” - will you clarify how the toxic comments were classified as you did for the benevolent comments? Was a comment classified as toxic if it received a 1 or less on the benevolence scale? Or did toxic comments have their own scale? A few more details would be helpful here.
Figure 1 legend - please explain the x and y axes here, the sample sizes for each panel, what each dot represents, and what the violin shape represents. Also, a summary of the take home message would be useful. Do you need to cite the data you used here or is the data unpublished?
Page 12 - “A pdf of our Qualtrics survey and deidentified pilot data can be found…” Please indicate the file name so readers know where to find this data. I didn’t see a pdf of the Qualtrics survey at the OSF project.
Page 14, top par - how were the researcher-selected replies rated on the benevolence/retaliatory scales? If they weren’t rated, then why were these treated differently and how were they categorized?
Page 14, 2nd par - just to clarify, a “conversation pair” is a comment-reply pair? It would be good to either make sure this is clear throughout or change the term to something more intuitive.
Pilot study - throughout this section there are alphas reported, however it is not clear what they refer to - interrater reliability of a particular interpretation of, for example, the toxicity of the initial comment? Please clarify throughout and include the name of the test and a description of what the statistic represents.
Page 16 - “Social media use was included to describe our sample” How does social media use describe your sample?
Page 16 - should you list your IRB protocol number? I’m not sure how it works with studies on humans, but studies on non-humans have to list this in all articles.
Page 16 - please clarify that pair 1-12 means 4 comment-reply pairs multiplied by 3 conditions.
Page 16, last par - please show the data from the other benevolent condition as well so readers can evaluate what a “marginal” difference is.
Page 16, last par - “The effect of condition was not significant, however, given that the difference between the retaliatory and benevolent correction conditions was marginal (planned comparison t(114) = -1.89, p = .061), we decided to control for the first impression in all multilevel analyses”
It looks like the “marginal” difference was determined based on p=0.061? If that is the case, what was your preplanned cut-off for determining whether there was a difference between conditions/means/etc.? If the cut-off was p=0.05, then there is no “marginal” - a result is either on one side of the threshold or not (see references below for further discussion on this topic). I realize this was for your pilot study and not your proposed study; however, your decision to include first impression as a fixed effect in the analyses for the proposed study is likely based on this finding. If this is the case, then because of your non-significant finding in the pilot study, the first impression should be removed from the proposed analyses.
Figures 3 & 4 legends - please clarify what the circles refer to - the means?
Pages 18-20 - “without covariates” is mentioned a few times, but I’m not sure what this means when the analyses were run with the covariates.
Page 19 - “Comfort with offensive language was not related to toxicity addressed, p = .23” Please add the rest of the test statistics here as in the other sentences.
Legends for Figures 4 & 5 and those thereafter as well - please add how to interpret the y axis. Do negative numbers mean participants felt like the toxicity was made worse, 0 = toxicity not addressed, and positive = toxicity addressed?
Given that the pilot study found no significant correlations for 2 of the 3 hypotheses, it might be a good idea to add to the study design table in the Interpretation column how you will interpret when there is no correlation and what theory this would contradict.
Also, was the pilot conducted according to the hypotheses in this RR? It would be good to note what the pilot study hypotheses were at the beginning of its section.
Page 21, pars 1 & 2 - please omit the sentences that mention “weak evidence” and “marginal” - these were not statistically significant, which is the measure you chose to determine whether there were differences or not (see my comment above and references below).
Page 24 and throughout - “This suggests that the manipulation of how benevolent and how correcting the Reddit conversations were was/was not successful” According to how I understand the experiment, you categorized the responses rather than manipulating the responses or the participants. When I think of a manipulation, I think of designing the experiment such that the behavior of the participants changes across the study because of the experiment. If you agree, I would replace the term manipulation with categorization or something similar.
Page 25 - explain what ICC is and how to interpret it on first mention.
Page 26, last sentence - there is only a p value place holder; please add the rest of the statistics as in the rest of the paragraph.
Page 32 - did all co-authors approve the submitted version for publication? It looks like only one author did, however all authors need to approve of articles submitted on their behalf.
- the terms “benevolent going-along” and “benevolent endorsement” are used in different places. I would choose one term and stick with it to avoid confusion.
- the axis labels look like they are the raw variable names and would be clearer if they were relabeled to assist readers with interpretation.
>Assessing the RR according to PCI RR’s Stage 1 criteria:
>1A. The scientific validity of the research question(s).
The research questions are scientifically valid.
>1B. The logic, rationale, and plausibility of the proposed hypotheses, as applicable.
The proposed hypotheses are logical, rational, and plausible, and I suggested adding interpretations for the possibility that there are no correlations (see above).
>1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis or alternative sampling plans where applicable).
The methodology and analyses are feasible and I suggested a change to improve the soundness (see the comment on marginal significance above).
>1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses.
The methodological detail is clear and replicable. I had a suggestion regarding the analysis pipeline to further reduce analytical flexibility (see above regarding marginal significance).
>1E. Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls; other quality checks) for ensuring that the obtained results are able to test the stated hypotheses or answer the stated research question(s).
The authors conduct categorization validation checks to ensure the three types of responses were perceived as belonging to their assigned categories.
I wish you the best of luck in conducting your study!
All my best,
Max Planck Institute for Evolutionary Anthropology
Lybrand et al. 2021. Investigating the Misrepresentation of Statistical Significance in Empirical Articles. https://dc.etsu.edu/honors/646/
Nuzzo. 2014. Scientific method: statistical errors. https://www.nature.com/articles/506150a
Otte et al. 2021. Almost significant: trends and P values in the use of phrases describing marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020. https://doi.org/10.1101/2021.03.01.21252701
Pritschet et al. 2016. Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades. https://statmodeling.stat.columbia.edu/wp-content/uploads/2016/06/Pvalues.pdf