This is part of a larger project assessing the replicability of popular findings in decision-making research, which also involves a formative component. I applaud the scientific effort and its great formative value. I agree with the authors on the relevance of a replication of Peters et al.'s (2006) study, and I list below a number of comments and suggestions for improvement (a few minor, a few more substantial).
Some of the original hypotheses have been reframed (e.g. numeracy as a continuous variable), but for a replication it may be better to state the original formulation first and list the reframed hypothesis as an extension.
The authors are planning to recruit participants online via Amazon Mechanical Turk. This aspect is problematic, as it cannot be verified whether participants use online shortcuts or a calculator to answer the numerical questions.
It is mentioned that d will be transformed to r in order to compare with the original effect size. However, this applies only to the extension (the replication results should be directly comparable and should drive the conclusions on the replication outcome).
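For concreteness, a minimal sketch of the d-to-r conversion in question (the standard equal-n formula, with an optional unequal-n correction; the function name and group sizes are illustrative):

```python
import math

def d_to_r(d, n1=None, n2=None):
    """Convert Cohen's d to a Pearson r.

    With equal (or unspecified) group sizes, r = d / sqrt(d^2 + 4);
    with unequal groups, the 4 is replaced by (n1 + n2)^2 / (n1 * n2).
    """
    a = 4.0 if n1 is None or n2 is None else (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)

print(round(d_to_r(0.5), 3))  # equal-n case: 0.243
```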
STUDY DESIGN TABLE
From this table it becomes clear that the original analyses will be performed too. The replication outcome should be decided on the basis of those analyses.
Does the power analysis take into account the extensions or will these be regarded (and discussed) as a secondary/exploratory component of the study?
Relationship between numeracy as a continuous variable and decision-making outcome/confidence: this is investigated via linear regression, but the relationship might not be linear. Please clarify that the confidence judgments refer to the decision-making tasks. Conclusions about the continuous-variable analysis should not override the conclusions about the replication of Peters et al.'s study.
The section on “affect and numeracy” could be better aligned with the manipulation of Peters et al.'s Experiment 4, where the focus is on the disadvantage of being highly numerate under certain conditions.
Section “choice of study for replication”: overall statements should be backed up more robustly.
Low power: please clarify what the target effect of interest could have been and why the specific number of participants was insufficient for each study, either here or later under the study-specific sections. (Please note that some additional or supplementary analyses reported in the original paper may be exploratory or of secondary importance; this should be taken into account when evaluating power in the original study.)
Methods: Peters et al. stated that dichotomization was introduced to address data skewness in Study 1 and was maintained for the other experiments as well; had the sample been larger for Study 2 and Study 3, it looks like the authors would have considered other splits (such as a split based on quartiles). Please clarify if/why this procedure is not suitable and/or ideal. It would be useful to mention here whether you will calibrate your replication so that it achieves sufficient power to test for a relation between a continuous numeracy variable and decision-making biases, and whether you will also replicate the original analysis for a direct comparison with Peters et al.'s study (i.e. with numeracy dichotomized).
One of the major methodological differences between the proposed replication and the original study (online vs. in-person testing) should be mentioned and discussed here. Another important methodological difference (all participants will take part in all the tasks) should also be mentioned, along with its possible advantages and disadvantages.
The extension about study-specific self-efficacy is very interesting. However, the introduction of a summary confidence judgment would only be certain not to interfere with the replication effort if a between-participants design were adopted and participants were not informed about having to provide a confidence judgment at the end. Could its introduction affect decision-making processes on the following numerical tasks in a within-participant design such as the proposed one? (e.g. Boldt, Schiffer, Waszak, & Yeung, 2019, “Confidence predictions affect performance confidence and neural preparation in perceptual decision making”).
p.21 “We then conducted a power analysis using G*Power (Faul, Erdfelder, Lang, & Buchner, 2007) for the statistical tests in each of the decision-making risk paradigms separately (i.e. framing effect, frequency-percentage effect, ratio bias and bets effect).”
The four studies are not conducted with separate groups of participants. What is the familywise alpha across the four main tests? Please state whether you have included any corrections in your power calculations.
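To illustrate the concern (a sketch, assuming four tests each run at a nominal alpha of .05 with no correction, and a simple Bonferroni adjustment as one option):

```python
n_tests = 4
alpha = 0.05

# Familywise error rate if each of the four tests is run at .05
familywise = 1 - (1 - alpha) ** n_tests

# A simple Bonferroni-adjusted per-test threshold
per_test_bonferroni = alpha / n_tests

print(round(familywise, 3), per_test_bonferroni)  # 0.185 0.0125
```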
p. 21: please state which effect (within the two-way between-subjects ANOVA) was the one requiring the largest sample size.
p. 22: please list all the recruitment criteria and data-quality checks beforehand (at the moment a list is provided, but it ends with “etc.”). Please provide any threshold you will use to exclude participants after data collection, e.g. if based on Qualtrics diagnostic scores.
Table 3: The medium of Peters et al.'s study is paper and pencil, most likely in person, as the questionnaires were described as being “administered” rather than self-administered (p. 408) (perhaps this could be confirmed by Peters?). The year in which Peters et al.'s participants were tested is 2005 (if not earlier; p. 413); please amend.
DESIGN: REPLICATION AND EXTENSION
In addition to the comment under the introduction section: the confidence rating is presented on the same page as the other questions, and all the answers can be changed after seeing/replying to the confidence question. For a closer replication of Peters et al., the confidence rating could be required on a separate page, only after completion of a scenario/task.
Tables 4-7 (study design): these should be checked and possibly reorganised to improve clarity and consistency. For example, you could use the words “scenario” and “conditions” that appear in the main text, to distinguish between variable names and variable levels. Specifically:
- Study 1: IV2 could be Framing scenario or Frame (as in Peters et al.), with IV2 conditions Positive and Negative. Also, Numeracy is indicated as a between-subjects IV like the Framing scenario, though it is not manipulated but measured, and no level can be indicated. Perhaps add a header with “manipulated variables” for the first column and “measures collected” for the second. After reading the introduction, it is unclear whether the numeracy measure indicated in these Tables is the same as the original measure, a novel one, or both.
- Study 2: IV2 should probably be indicated as Risk scenario (or Format, as in Peters et al.) rather than as “frequency-percentage effect” (condition 1: Frequency; condition 2: Percentage).
- Study 3: this study does not seem to have a manipulated IV.
- Study 4: the manipulated IV could be Bet scenario (or Bet type) rather than “bet effect”.
- The numeracy measure is always presented after the decision tasks in Peters et al. (2006), but this aspect is not preserved in the current replication. Why are the authors planning to present the numeracy scales first?
- The instructions ask participants to answer via their gut feelings/intuitions (this is partly functional to preventing them from checking their answers online). But was this prime present in the original instructions? Would it not be more ecologically valid to let low- and high-numeracy individuals choose their preferred approach?
Table 8 should be updated with the deviations from protocol outlined above.
Page 32: please provide reasons why you think that dichotomization is the main weakness (over and above sample size?).
Please specify whether the linear regressions will be tested with both numeracy scale scores. Also, what is plan B if the assumptions of linear regression are not met by your data?
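As one possible plan B (an illustration only, not a prescription): a rank-based association measure such as Spearman's rho requires neither linearity nor normality. A minimal stdlib sketch (ignoring tied scores, which a real analysis would need to handle):

```python
def spearman_rho(x, y):
    """Spearman rank correlation (no tie handling), a
    distribution-free alternative to a linear fit."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```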
The dichotomization was originally performed on the basis of the data distribution (a median split). Why do the authors plan to apply Peters et al.'s thresholds, rather than a median split on their own distribution of scores (as in Peters et al.'s study)? The latter might improve their chance of having even groups to compare, which is important given the use of parametric statistics. It could well be that their median split will overlap with that of Peters et al., but it could also be that their sample is more (or less) numerate on the whole (information about demographics in Peters et al. is scarce). Related to this point, a weakness in Peters et al.'s paper is the (likely) uneven distribution of experimental conditions between numeracy groups (the division into groups was performed after assigning conditions to participants); the numbers in each group are not reported in the original paper.
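For illustration, the sample-based median split in question is simply (a sketch; the "high"/"low" labels are mine):

```python
import statistics

def median_split(scores):
    """Dichotomize scores at the sample's own median
    (above the median -> 'high', otherwise 'low')."""
    med = statistics.median(scores)
    return ["high" if s > med else "low" for s in scores]

print(median_split([5, 7, 9, 11]))  # median is 8
```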
Please clarify which numeracy scale you are using for the regressions.
“The results of the Rasch-based numeracy scale generated by simulated data were hard to analyze. 958 out of 1000 participants achieved zero marks and the rest of them all achieved one […]”
Re: the Rasch-based numeracy scale, could a list of 1000 random numbers ranging between the minimum and maximum score be generated in Excel instead?
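Equivalently to the Excel suggestion, such a list can be generated in a few lines (a sketch; the 0-8 score range is an assumption to be replaced with the scale's actual minimum and maximum):

```python
import random

random.seed(1)  # for reproducibility of the illustration only
min_score, max_score = 0, 8  # assumed bounds; substitute the scale's actual range
simulated = [random.randint(min_score, max_score) for _ in range(1000)]
print(len(simulated), min(simulated), max(simulated))
```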
p. 4, “Study 1”: please eliminate “mixed within-subject” (this ANOVA should have two between-subjects factors; this also applies to Table 1 on p. 5, even though in Peters et al. it is indicated as a “repeated measures” ANOVA); please eliminate or replace “with five students”; correct the typo “numerach”.
p. 17: please specify the topic of the “funnelling” questions.
p. 32: please clarify whether participants will be able to answer from a mobile phone, and whether you plan to include participants who do so.
Please clarify whether, after the 8 minutes have passed, participants will still be able to complete the task or will be logged out automatically.
“The expected completion time was set at 5 minutes in advance”: does this mean you allowed 5 extra minutes beyond the expected completion time? What age were the 30 pilot participants? How realistic is the expected completion time for younger/older participants (age range: 0-100)?