Understanding the validity of standardised language in research evaluation
Finding the right words to evaluate research: An empirical appraisal of eLife’s assessment vocabulary
Recommendation: posted 04 September 2023, validated 11 September 2023
Field, S. and Chambers, C. (2023) Understanding the validity of standardised language in research evaluation. Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=488
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
- Advances in Methods and Practices in Psychological Science
- Peer Community Journal
- Royal Society Open Science
1. Hardwicke, T. E., Schiavone, S., Clarke, B. & Vazire, S. (2023). Finding the right words to evaluate research: An empirical appraisal of eLife’s assessment vocabulary. In principle acceptance of Version 2 by Peer Community in Registered Reports. https://osf.io/mkbtp
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #1
DOI or URL of the report: https://osf.io/e5pkz?view_only=3f65527bb5dc428382f4b9154bfc58e0
Version of the report: 1
Author's Reply, 25 Aug 2023
Decision by Sarahanne Miranda Field and Chris Chambers, posted 10 Jul 2023, validated 12 Jul 2023
This Stage 1 Registered Report proposal details a study which aims to address an interesting metascientific issue: linguistic ambiguity in the journal eLife’s assessments of manuscripts under its new model, which eliminates accept/reject decisions after peer review. The authors of the proposed study point out that the model’s success will partly depend on how clearly the assessments communicate with prospective readers. They argue that, at present, some of the wording contained in eLife’s manuscript assessments is counterintuitive and ambiguous. The authors have designed a study to explore whether the language used in the eLife assessments will be interpreted ‘as intended’ by readers.
I received four thorough and constructive reviews of this proposal and have used those to supplement my own thoughts and assessment. In my opinion, the proposed study has the potential to make a useful contribution to the metascience field, as well as being a valuable source of information for other journals potentially interested in following the novel path forged by eLife. That said, I have some concerns about validity and generalizability, as well as about the assumptions of the study.
First, I think it might benefit the manuscript if the authors motivated the study a little more. To be clear, I think it’s important that the wording of these assessments is questioned, but I also think that issues with interpretation will be present any time we use a qualitative descriptor. Is it at all possible to use words that do not carry some variance in interpretation, especially across a population with varying proficiency and understanding of English? I’m not convinced that alternative wording would necessarily bring less variation in interpretation with it. As one reviewer commented, do we in fact know that others (i.e., outside of the author group) find the wording ambiguous to the degree that it would undermine the assessment in question? I suggest that the authors argue more explicitly for why the wording is problematic (including indications that it indeed is, if there are any), and motivate why their choice of alternatives would be better. I realize such arguments are already present in the proposal, but I have to admit I don’t find them as compelling as I think they could be.
Second, and this is something that reviewers also pointed out: have eLife been asked about their intentions regarding the language they used? To properly assess whether wording is being perceived as intended, we need to know what that ‘intended’ actually entails. As one reviewer mentioned, eLife might be intentionally using ambiguous language; without asking, we do not know. I recommend that the authors confer with eLife to clarify this rather than relying on what appear to be assumptions about what eLife intended or did not intend to convey with their wording.
Third, I am, along with one reviewer, concerned about the sample. The authors plan to use this study to assess how language is perceived, which means the conclusions will depend strongly on those perceptions, and the perceptions will in turn depend strongly on the sample. In the proposal, the description of the sample boils down to convenience (as the authors themselves make clear), which is problematic in my opinion. If the findings of the proposed study are to hold any real validity beyond a small segment of largely young, white, middle-class English speakers (i.e., university undergraduates, likely mostly from Australia), the sample must be actively constructed to include some demographic variety. This is especially relevant when one considers the readership of eLife, which represents a much broader population than the sample the authors plan to draw their data from. The authors suggest that this readership is likely non-expert, but ‘non-expert’ status is only one element of the sample to be considered. One reviewer suggests stratifying to include a wide variety of people, and I think that’s a good idea. If the authors decline to take this into account in their design (for feasibility reasons, which are certainly understandable), I don’t think the resulting findings will be nearly as helpful as they could be. The authors do note this in the limitations section, but, if I’m being frank, I think they understate it.
Finally, one reviewer mentions a confound that I think the authors should attempt to address in some fashion. He points out that while most readers of eLife assessments will only ever see one or two assessment reports, the study’s participants will see multiple, so the experiment might not sufficiently simulate what happens in real life. I agree with this. Do the authors have any ideas as to how to get around this, or a substantive argument as to why it shouldn’t undermine the usefulness of the eventual findings?
Other comments by the reviewers should also be taken into consideration, either by making changes to the protocol or by providing arguments for why they may be set aside.
I hope the authors find these points reasonable and can either implement them or assuage the concerns the reviewers and I have raised. I look forward to reading a revised protocol!