Understanding the key ingredients of the Bayesian Truth Serum
Taking A Closer Look At The Bayesian Truth Serum: A Registered Report
Recommendation: posted 23 April 2022, validated 16 September 2022
Ostojic, L. (2022) Understanding the key ingredients of the Bayesian Truth Serum. Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=149
Related stage 2 preprints:
- Experimental Psychology
- Peer Community Journal
- Royal Society Open Science
- Swiss Psychology Open
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #2
DOI or URL of the report: https://osf.io/xw6hn
Version of the report: https://osf.io/xw6hn
Author's Reply, 23 Apr 2022
Decision by Ljerka Ostojic, posted 21 Apr 2022
Many thanks for the thorough and thoughtful revision of the stage 1 report entitled 'Taking a closer look at the Bayesian Truth Serum: a Registered Report', as well as for providing very detailed replies to the reviewers' and my own comments.
I am happy to provide in-principle acceptance (IPA) for this stage 1 report. However, before doing so there are some minor comments regarding the text of the stage 1 report that I would like you to consider in a final revision.
Page 1: In the abstract, it says ii) testing whether the effect is best explained by parts of the mechanism like the increase in expected earnings or the addition of a prediction task. Here, I was wondering whether using the wording that you have later in the report would be more appropriate - maybe something like you have on page 7 would work well?
Page 2: Thank you for adding the footnotes, I found the information you added to greatly improve the clarity and strength of the arguments made. In this footnote here (footnote 3), I was wondering whether there are data on the prevalence of unsuccessful attention checks in studies that are not conducted online as a comparison?
Page 4: In the abstract, you write 'incentive-compatible' using a hyphen, and I think this increases readability, but later on, for example on this page, the same is written as 'incentive compatible'. For readability, it would be great if you could be consistent throughout the text, regardless of the option you prefer.
Page 11: '...ii) analyse any effect of the Bayesian Truth Serum is distinct from an increase in expected earnings that.... ' - should this be analyse whether any effect of the Bayesian Truth Serum is distinct from....?
Page 14: 'As the main analyses will be compromised of seven individual...' - should this be 'As the main analyses will be comprised...'?
Once you have issued a minor revision attending to these points, I will issue IPA without further review.
Evaluation round #1
DOI or URL of the report: https://osf.io/xw6hn
Author's Reply, 23 Feb 2022
Decision by Ljerka Ostojic, posted 17 Jan 2022
Dear Dr. Schoenegger,
The stage 1 report entitled “Taking a closer look at the Bayesian Truth Serum: a Registered Report” has now been assessed by two reviewers.
Both reviewers highlight the merit of this stage 1 report and the proposed study. Both reviewers also raise some important questions and issues that I would like you to address in a revision.
In addition, I had some concerns about the way that you planned to do your analyses (this concerns the specific comparisons between different groups and the selection of items for the analyses) as well as inferences that you wish to draw based on the results, particularly negative results. Zoltan Dienes has provided an additional review of these issues, and I am pasting his comments here - please take care to address these issues in the revised report.
1. Regarding the inferences based on results showing a non-significant difference between groups:
“The problem with the analysis plan is that it takes non-significance in itself as evidence against there being a difference. What is needed is an inferential procedure that justifies a claim of no effect so that results could actually count against various predictions. See https://psyarxiv.com/yc7s5/ for the typical alternatives and how to approach them: power; equivalence tests of various sorts; Bayes factors.
To justify a conclusion of no effect being there, there needs to be a scientifically motivated indication of what size effect there could be, if there were one. For power and equivalence tests this should be a minimally relevant effect. Thus, if the plan is to continue to use chi square then power is calculated with respect to Cramér's V or w. A problem for the author to address is justifying the minimally relevant effect size. Then power can be calculated; and thus, according to Neyman-Pearson, a non-significant result taken as grounds for asserting no difference. No special power is required for PCI RR, but the power of each test should be known, if power is the tool used; but the author may wish to bear in mind that different PCI RR friendly journals may have requirements (albeit not RSOS). In any case, whatever the journal requirements, "no effect" cannot be concluded without justification.
One possibility, which the author may reject, is to say the BTS is useful in so far as it shifts the mean towards the truthful answer; thus a test of mean differences could be used. Power, equivalence testing and Bayes factors are all easier conceptually. One asks what shift of mean is of minimal interest (for power or equivalence tests); or what shift in mean, or range of shifts is scientifically plausible (for Bayes factors), which may be the easier question to answer."
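As an aside for the authors, the power calculation the reviewer describes for a chi-square test is straightforward to implement from the noncentral chi-square distribution. The sketch below is illustrative only: the effect size w = 0.3 and the 80% power target are placeholder values, not a recommendation for the minimally relevant effect the authors must still justify.

```python
# Power of a chi-square test via the noncentral chi-square distribution,
# as in the Neyman-Pearson approach the reviewer outlines.
from scipy.stats import chi2, ncx2

def chisq_power(n, w, df, alpha=0.05):
    """Power of a chi-square test with `df` degrees of freedom, total
    sample size `n`, and population effect size `w` (Cohen's w)."""
    crit = chi2.ppf(1 - alpha, df)  # critical value under H0
    nc = n * w ** 2                 # noncentrality parameter under H1
    return 1 - ncx2.cdf(crit, df, nc)

# Illustrative: smallest n giving at least 80% power to detect w = 0.3
# on a 1-df test at alpha = .05 (both numbers are placeholders).
n = 2
while chisq_power(n, w=0.3, df=1) < 0.80:
    n += 1
```

For a contingency table, w converts to Cramér's V via w = V * sqrt(min(r, c) - 1), so the same routine covers either effect-size metric once the minimally relevant value is justified.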
2. Regarding the planned comparisons between different groups and selection of items for these analyses:
"This [refers to the planned comparisons] is problematic because of a selection effect: By selecting extreme scores in the first comparison [concerns the comparison between the BTS group and the main control group], they will naturally tend to get differences in the second [concerns the comparisons between the BTS and additional control groups]. So I would not do the pre-selection. The authors should just look at the overall evidence, for each comparison without preselection."
Here, I had an additional question: given that with the two new control groups, the idea is to test to what extent alternative explanations can explain the base effect (which statistically manifests itself as a difference in the response distribution between the BTS group and the control group in a specific direction), would it not be useful to compare that difference to the difference between a group in which an alternative manipulation was used and the control group? I think an analysis as the one proposed by Reviewer 2 may be in line with this.
Below I am pasting some minor suggestions concerning the text of the stage 1 report. Most of these are very much suggestions and changes are not necessarily required but may help in ensuring that readers get the most out of the report.
In the Abstract, at the end, it says that under ii) you want to test 'whether the effect may be explainable by an increase in expected earnings or the addition of a prediction task'. This very much sets the expectation that it is these two explanations that are being primarily considered as possible explanations of the effect. However, implicit in your text is that the primary explanation is that it is the truth-incentivising interaction between the instruction and the related monetary incentive that elicits the effect, and as such these two would be alternative explanations. This is made more explicit elsewhere in the report, where you use the term 'alternative explanations' and especially 'the worry that ...'. The reader would be greatly aided in understanding the report if this was all standardised and made clearer throughout the text.
The last sentence of the abstract is not very clear – i.e., it is not clear how this relates to what you are testing here.
'While there have been significant methodological advances in psychology and cognate disciplines recently,...' This statement should be supported by references, but also perhaps explained further, as it is currently not sufficiently clear how it is relevant to the study at hand. In many ways, it may not be necessary or useful to state this here at all.
When you state that 'many papers do not report the compensation fee that was offered to research participants and the fact that these fees vary widely among the papers that do disclose them (e.g., Keith et al., 2017; Rea et al., 2020)', more information would be useful: Are there actual numbers indicating prevalence available? Is the compensation here meant for the same task/time invested by participants?
‘Perhaps this is due to the null findings reported by the majority of studies that investigated the influence of financial incentives on data quality (e.g., Buhrmester et al., 2011; Crump et al., 2013; Mason & Watts, 2010; Rouse, 2015). There are, however, noteworthy exceptions indicating that increasing financial compensation can improve data quality (Ho et al., 2015; Litman et al., 2015).’ More information here may be useful to the reader.
In the next part of the introduction (but also in a later part, where you write about incentive compatible and incentive incompatible designs), the reader receives information about participants, especially in online studies, clicking through items in surveys rather than engaging with the item content in order to maximise payoff. This is contrasted with the BTS manipulation to incentivise honest answers. What is missing, or what I think could be confusing to readers, is the step that connects these two issues, because honest answers are not necessarily the only answers that participants may give even when they engage with the items and their content.
‘When participant payments are primarily dependent on completion of the online survey, participants are likely to complete studies as quickly as possible and to complete as many of them as feasible in the time they have available in order to maximise payoffs.' Are there any data supporting this argument? Given the footnote, do we know how many participants fail the attention checks? Or data on differences between online and in-person studies?
'The Bayesian Truth Serum works by informing participants that the survey they are about to complete makes use of an algorithm for truth-telling that has been developed by researchers at MIT and has been published in the journal Science. This algorithm will be used to assign survey answers an information score, indicating how truthful and informative the answers are. The respondents with the top-ranking information scores will receive a bonus in addition to the base pay for participation. Participants then go on to answer study items as they normally would,' For readers who are not so familiar with the BTS manipulation, it would be useful to state more clearly whether the part beginning 'This algorithm will be used...' is part of the instructions given to the participants. On a side note, the actual wording in Schoenegger (2021) differs, so it would be good to state clearly what wording you will use in the proposed study, and also alert the reader to any differences from the Schoenegger (2021) study, given that it is these results that you are aiming to replicate.
'However, participants are only told that they can earn a bonus for answering truthfully and are not informed about the specific mechanisms of the compensation scheme.' This requires more explanation, especially in light of what instructions the participants actually receive (see a previous comment).
‘to ensure that the results found there generalise to a new sample and effects of the Bayesian Truth Serum are as such also likely to replicate in other people’s implementations.’ Maybe researchers’ instead of people’s?
Will participants who took part in Schoenegger (2021) be able to take part in this study?