DOI or URL of the report: https://osf.io/dqvm5?view_only=12191f02a5db4689b00b42bab7dbd522
Version of the report: 1
I have now received reviews from all four experts from the first round of reviews. They all appreciate your thoughtful responses and have indicated that the study, as proposed in the current version, would be an interesting and important contribution to the literature. A few remaining concerns have been identified, two of which may be relatively easy to address: (1) a more nuanced characterization of relevant theories (anonymous reviewer 1) and (2) clarification of details of the pilot data (Julia Englert). In addition, there are questions about the reliability and validity of self-reported effort due to boredom vs. effort due to task difficulty, and about the design of the Stroop task. It is not mandatory that you make any changes to the current protocol in response to these two aspects. However, I encourage you to carefully consider these points as potential limitations of the study. Accordingly, if you decide to make no changes to the study protocol, these potential limitations and their impact on the study results should be discussed in your eventual Stage 2 report. I very much look forward to your point-by-point response.
The authors have gone to great lengths to improve the theoretical parts of their manuscript and have also collected additional data. They have greatly expanded the theoretical reasoning and discussion, and now take a much more comprehensive look at the existing evidence and potential critical points.
My only remaining substantive concerns are still about the design of the Stroop task, though the current design in no way invalidates the proposed research, and I believe it is up to the authors whether they wish to incorporate any suggestions. The design will be informative about the main aim either way. Your design allows comparing the two tasks globally – there is always a virtually infinite number of possible controls one could add, and they can be left for later studies. As a stand-in for an “easy, boring” versus a “cognitively effortful” task, the manipulation works.
(However, there was also a related point of confusion about the pilot data they collected online.)
In any case, I am looking forward to the finalized version of the experiment and the results.
Regarding the Stroop Design:
As mentioned, proceeding as planned should not stand in the way of acceptance.
1. But I still think that sticking with a 100% (in)congruent Stroop task and a total of two conditions limits the interpretation of results. One could even argue that the “easy” condition does not really constitute a Stroop task at all – a Stroop task being defined by the occurrence of response conflict on at least some trials – but is “just” a colour or number classification task (with redundant information on the correct response). It also cannot be assumed to activate the same task set as other such tasks, potentially introducing confounds beyond effort and boredom (e.g. “flow” experience, or response times).
2. I am also somewhat skeptical about the choice to make the hard task 100% incongruent (p7 of your response). This essentially turns your experiment into a blocked design, whereas the hard task was previously mixed. That’s not a problem per se, since blocked designs are not uncommon for the Stroop paradigm, and the fact that comparisons between congruent and incongruent blocks still yield substantial differences shows that it’s a valid way of manipulating difficulty (see, e.g.: https://www.researchgate.net/publication/242663541_Effects_of_Type_of_Design_Blocked_vs_Randomized_on_Stroop_and_Emotional_Stroop_Tasks). But there are some reasons why this change may be more problematic for your question than your previous mixed version. For one, I see a risk of making the tasks less easily comparable: the perfect negative contingency between written word and print colour incentivizes task recoding – participants now have an additional reason to ignore the irrelevant feature dimension – which might also have the unintended side effect of reducing difficulty. Second, if the goal is to push the two task conditions further apart in terms of difficulty and boredom, this might backfire: a mixture of congruent and incongruent trials would add variety, after all, and the 100% incongruent task might not even be perceived as more difficult because the task demands stay the same for the entire session. Even if you do not vary congruency proportion systematically and keep a 100% congruent task for the easy condition, consider whether your previous mixed difficult condition (or a similar one) might not serve you better.
3. Regarding congruency proportion in the different tasks and your online data: I am not sure whether your pilot data (Appendix A) include a hard color Stroop task with 100% or fewer incongruent trials – this seems to be stated explicitly only for the numerical Stroop. If the pilot data for the color task come from a mixed version with 50% or at least <100% incongruent trials, I wonder why you are diverging from a task for which you already have empirical data. If your hard color task was in fact 100% incongruent, then you cannot interpret the interaction the way you do, as the numerical task had a different congruency proportion, which may account for differences in results: “This interaction suggests that the impact of difficulty level (easy, hard) on perceived task difficulty depends on the task variant (color, numerical).”, p33.
4. If you decide to stick with the blocked version (100% congruent vs. 100% incongruent), I have a suggestion for briefing your participants: There is a possibility that participants who perform the difficult condition first are more vigilant in the beginning, expecting incongruent trials, while those starting with the easy condition might not be. You might want to clear this up in the instructions – i.e., prepare participants for the occurrence or non-occurrence of incongruent trials. If the main goal is to maximize the differences between the easy and difficult tasks, then it is not critical for them to be as comparable as possible, or for the instructions to be as similar as possible.
I much appreciate the thoughtful responses to my suggestions. I have no further comments.
My apologies for the delayed review. I believe the authors have done a good job addressing the points I raised in my previous review. I am still somewhat unsure to what extent I buy into the self-reports of 'effort due to difficulty' and 'effort due to boredom' but I believe the authors have thought about this sufficiently that we will learn something from this study and I expect the authors to discuss their eventual findings in light of some uncertainty surrounding these self-reports.
Jonas Dora, UW
Review of 'How effortful is boredom? Studying Self-control Demands Through Pupillometry' Ver. 2
I remain enthusiastic about this manuscript. I think the authors have largely addressed my concerns in the revised version of the manuscript. I have two remaining points for the authors’ consideration.
1) The authors have claimed that the cost of effort is that it is associated with an aversive experience. This is a mischaracterization of the theories cited and, I think, would require a bit more elaboration: the aversive experience of effort is generally thought to index the cost of continuing with the task at hand (for varying reasons); see Kurzban (2016) for a discussion of this point. I suggest the authors provide a bit more nuance here.
2) The question of whether participants can reliably and validly report effort due to boredom vs. effort due to task difficulty is centrally important. I think it needs further consideration at both a conceptual and a psychometric level. I don’t think asking participants at the end of the study whether they felt able to assess the difference is sufficient. Moreover, I am not persuaded that the plan to examine the BIC of different models will somehow address this concern. In my mind, multiple items for effort due to boredom and multiple items for effort due to task difficulty are required, as this would support an SEM measurement model or some sort of factor-analytic approach. The authors’ responses to the reviews suggest they are underappreciating the significance of this problem. Having said that, given the novelty and importance of the work, I would still be interested in the results even with this significant limitation.
DOI or URL of the report: https://osf.io/dqvm5?view_only=12191f02a5db4689b00b42bab7dbd522
Version of the report: 1
I have now received the detailed and constructive evaluations of four experts. They all welcome the proposed study as a relevant and timely contribution of high theoretical and practical interest. I share their overall positive evaluation and believe that this submission is a promising candidate for eventual Stage 1 in-principle acceptance. However, all reviewers have also raised several critical concerns. I will not attempt to reiterate all of them, especially as the reviewers clearly elaborate specific ways in which these concerns can be addressed. I would only like to highlight a few recurrent and particularly important issues.
As pointed out by all reviewers, the theoretical framing of the study and the definitions of its core constructs can be made more precise. Most importantly, the term “effort” appears to be used interchangeably for “investment of effort” and “perception of effort”, and its operationalizations (pupillometry vs. subjective report) also seem to fit onto such a distinction. Similar concerns have been raised concerning the definition of the second key construct, boredom. I believe that the reviewers offer helpful and very specific suggestions as to how the theoretical framing can be improved.
Other issues that seem particularly important relate to (a) the validity of the self-report methods for the assessment of the key constructs, (b) the design and labelling of the LCT and HCT conditions, (c) concerns that the 10-min break is insufficient to minimize carry-over effects of control, and (d) questions about the data analytic approach (e.g., more detailed analysis of task performance, use of linear mixed models).
Finally, there appears to be some agreement among the reviewers that the empirical literature on fatigue is relevant and can inform this study (including “ego-depletion” studies, amongst others). I am unsure how much elaboration on fatigue is needed or helpful for this RR, but I would encourage the authors to carefully consider and comment on this issue in their response (if not in the report itself). Among the reasons why this may be important is that boredom and fatigue have been argued to overlap in phenomenology and function. Moreover, the authors include a measure labelled “energy” (“How high is your energy level right now?”), which may be construed as an inverse of fatigue (at least in subjective terms, since energy depletion may not be more than a metaphor, e.g. see https://doi.org/10.1017/CBO9781139015394.002). If fatigue is indeed what the authors intend to measure, then it needs to be defined clearly.
I encourage you to address the concerns that have been raised in a revised RR and very much look forward to your reply.
This is a strong proposal on a worthwhile topic. It is of clear theoretical interest because it aims to disentangle two meaningfully different concepts, namely challenge and boredom in relation to effort, and it is practically relevant, since those two concepts reflect challenges that almost everyone is facing on a daily basis. The research paradigms are suitable to the question, and the combination of dependent measures – real-time reports on subjective experience, task performance, and pupil dilation as a physiological marker – seems like it should synergize well in capturing those concepts. The theoretical reasoning in the introduction is also sound but could be made more precise in a small number of places.
However, I have a few larger, interrelated concerns about the design and the proposed computation of dependent measures. On the one hand, I believe the Stroop and flanker paradigms are well suited to the question of disentangling effort and boredom, and the same can be said of the task-switching paradigm, which here would result in a single-task (LCT) vs. dual-task (HCT) version of the Stroop test (wherein participants are cued as to whether to categorize the word or the colour). On the other hand, I believe that, in their current form, the design and analyses would not make full use of the information these tasks and measures have to offer, so I strongly encourage you to refine both to accomplish this.
Major notes on the protocol:
1. In my opinion, an “overall” difficulty level expressed in mean RTs and error rates does not tell us as much as it should and creates unnecessary confounds. Both cognitive challenge and boredom should have specific effects on Stroop interference, switch costs, sequence effects, and the speed-accuracy tradeoff. Comparing only a single-task, 100% congruent Stroop – i.e., a Stroop task without Stroop interference – to a dual-task, 50% congruent task means that switch costs cannot be separated from interference, and any interference or sequence effects can only be examined within the dual-task version, because in the single task there is no variation in trial type. Comparing only overall mean RTs and error rates between the two task versions means that all of these differences get lumped together. In the absence of compelling reasons to the contrary, please consider looking at switch costs and Stroop interference specifically, rather than combining all trials in a block. These “difficult” trial conditions are why the tasks are challenging in the first place, while the “easy” trials in the HCT might even provide a brief respite from it (and thus introduce noise).
2. The single (LCT) and dual (HCT) task versions of the Stroop task should be varied independently of the congruency proportion. In addition, I recommend using more than just two levels of congruency proportion (0% and 50% incongruent trials), because that way its relationship to the two constructs can be investigated more systematically. Consider performing both the dual and single task versions under several different levels of congruency proportion – for instance, you could add an intermediate level of 25% incongruent trials, and perhaps even a taxing level (~75% incongruent). If this is not feasible, e.g. because it would prolong the sessions too much, consider replacing the 0% condition with one that has at least *some* incongruent trials – especially if you take up the suggestion to vary congruency proportion independently of task switching. (Since 100% agreement between word and colour would mean that the task-switching cues can safely be ignored.)
3. Another reason for removing the confound between single vs. dual task Stroop and % congruent is that it allows controlling for the proportion congruency effect which might otherwise muddy effects of difficulty: Stroop interference is stronger when more trials are congruent. While this effect may or may not itself be related to increased cognitive control and therefore “effort”, it might undermine overall effects of difficulty because interference gets smaller as the task gets more difficult (i.e. contains more incongruent trials; see for instance, Rothermund et al., 2022; https://doi.org/10.5334/joc.232).
4. Speed-accuracy tradeoff: Effort and boredom should differentially affect the relationship between response speed and error rates. That is, if participants are bored, they might get sloppy. We can then expect a greater proportion of early answers, which are also more often wrong (i.e., a tradeoff) – dividing response times into bins may be one way to capture this (see the rough sketch after point 5 below). (That kind of inattention should also lead to smaller interference effects.) On the other hand, being challenged and exerting effort should cause both an increase in response times and errors (or at least an increase in one and no change in the other) – a general cost of difficulty on both variables.
5. Phasic pupil response: As mentioned, I strongly advise using separate averages for your different trial conditions (switch vs. repetition, congruent vs. incongruent). Since you already link recordings to stimulus onset, please also compute separate averages for phasic pupil dilation, as there should be larger effort-based responses for switch trials compared to repetition trials, and for incongruent compared to congruent trials. This way, you should get a much clearer moment-to-moment picture. (I also see a chance that there is no difference in pupil dilation, or even smaller pupil dilation in the HCT, when only “easy” trials are compared.)
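To make points 4 and 5 a bit more concrete, here is a rough illustration of the kind of trial-level breakdown I have in mind (Python/pandas; the file name and column names – ‘subject’, ‘task’, ‘congruent’, ‘switch’, ‘rt’, ‘error’, ‘pupil_phasic’ – are placeholders for however you structure your data, not a prescription):

```python
import pandas as pd

trials = pd.read_csv("trial_data.csv")  # hypothetical trial-level data file

# Separate cell means per trial type instead of one overall mean per block
cell_means = (trials
              .groupby(["subject", "task", "congruent", "switch"])
              .agg(mean_rt=("rt", "mean"),
                   error_rate=("error", "mean"),
                   mean_pupil=("pupil_phasic", "mean"))
              .reset_index())

# Stroop interference per subject and task version (incongruent minus congruent RT),
# assuming 'congruent' is coded as a boolean column
rt_by_congruency = (trials.groupby(["subject", "task", "congruent"])["rt"]
                          .mean().unstack("congruent"))
rt_by_congruency["interference"] = rt_by_congruency[False] - rt_by_congruency[True]

# Speed-accuracy tradeoff: error rate within RT quintile bins; disproportionately
# many errors among the fastest responses would point to "sloppy", bored responding
trials["rt_bin"] = (trials.groupby(["subject", "task"])["rt"]
                          .transform(lambda x: pd.qcut(x, 5, labels=False,
                                                       duplicates="drop")))
sat = (trials.groupby(["subject", "task", "rt_bin"])["error"]
             .mean().reset_index(name="error_rate"))
```

Switch costs could be computed analogously from the ‘switch’ column; none of this should add much to the analysis burden.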
Minor notes on the introduction and theory:
1. I know that the term (and phenomenon) are controversial, but I found it surprising that the term “ego depletion” was not mentioned at all, given that it’s probably one of the first associations that come to mind in this context. Other readers might also be curious about this, even if there are good reasons why it isn’t applicable or why you prefer other terms.
2. The conceptualisation of self-control and effort, and its juxtaposition with impulsiveness, might also be worth expanding on a bit; see, e.g., Hofmann et al., 2009 (https://doi.org/10.1111/j.1745-6924.2009.01116.x) or Wennerhold & Friese, 2022 (https://doi.org/10.1111/spc3.12726).
a. (Note: Searching for “Hofmann” in the draft, I saw that you cited a 2014 publication by Hofmann et al. that isn’t in the reference list).
Minor suggestions for the methods:
1. For the start of each Stroop trial: One of your task cues is a plus sign, which looks a lot like the fixation cross – you should replace at least one of these with something more distinct, or
you will get repetition priming and potential task preparation benefits on the colour trials!
2. Consider adding a probe regarding the perceived overall difficulty of a given session to the thought probes for carry-over effects.
3. Supplementary measures and sample characteristics: If in accordance with your ethics approval, it might also be worth inquiring about potential sleep deprivation, ADHD status, and current stimulant medication. (These might also be worth considering as potential exclusion criteria.)
4. Please add some information on the lighting conditions, since these have a large impact on pupil dilation. Is the laboratory cut off from external light sources, and can you ensure a constant level of brightness? This information will also be needed to replicate your experiment, as lighting conditions could affect pupil sensitivity.
Minor questions/suggestions for the analysis and interpretation:
1. Power analysis: Aiming for a test power of 95% is laudable, and I hope it will not prove too ambitious, especially in an elaborate laboratory setting. If the authors are confident they can reach this target, all the better. If you are not sure this is feasible, I would find it perfectly fair to specify a sample range starting at a slightly lower power (e.g., the required n for 80% and 95% power; a brief sketch of such a calculation follows these points).
2. Carry-over effects: How will boredom be included in subsequent analyses if you uncover significant differences? Will the alpha level be the same as for the hypothesis tests?
3. In the HCT, are you analyzing only the colour trials, or all trials? Since there is a well-documented asymmetry in reading vs colour categorisation, including all trials would make the HCT and LCT (which has no word trials) less comparable.
4. Please also distinguish between congruent and incongruent trials in the flanker task (see comments about methods)
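Regarding point 1 above, a minimal sketch of what I mean by reporting a sample range (Python/statsmodels; this assumes the key test can be approximated by a paired t-test on within-subject difference scores, and the effect size dz = 0.3 is only a placeholder – substitute whatever values your power analysis is actually based on):

```python
from statsmodels.stats.power import TTestPower

# Required N for a paired/one-sample t-test at two different power targets
analysis = TTestPower()
for target_power in (0.80, 0.95):
    n = analysis.solve_power(effect_size=0.3, power=target_power,
                             alpha=0.05, alternative="two-sided")
    print(f"power = {target_power:.2f}: required N ≈ {n:.0f}")
```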
In any case, I hope these notes have been helpful to the authors and am looking forward to seeing this promising research progress.
I read the Stage 1 report "How effortful is boredom?" by Radke and colleagues. The authors propose that boredom can act as a "self-control demand", and thus can increase effort on tasks. They aim to distinguish the proposed effect of boredom on effort from the (well-established) effect of task difficulty on effort. Their study proposes to use both subjective measures (of experienced difficulty, boredom, and a few others) and pupillometry.
I think the rationale of the paper is clear and convincing, and the general topic of this paper is timely. In general, I think this is a worthwhile study, and I would be interested in seeing its outcome. My comments are intended constructively, and I would still be interested in this paper even if the authors do not follow all of my recommendations:
1) Relevant empirical papers
The idea of studying the association between (high-frequency) subjective measures and physiological and/or behavioral measures in lab tasks has, to my knowledge, been pursued a few times before.
- On the topic of effort, I did two experiments where I studied the association between pupillometry and subjective feelings of effort (Bijleveld, 2018, Consciousness and Cognition). Though the designs were within-subjects, I found that effort – both subjective effort and phasic pupil dilation – increases with time. I think Figures 3A-B and 4A-B are very similar to what the authors have in mind. To be sure: (1) This previous paper does not challenge the novelty of the authors' plan; I understand their study will add important insights on boredom; and (2) This comment should not be seen as a request for citation or anything; I just think the paper is quite relevant to the authors' plan.
- On the topic of boredom as a source of effort, a recent study showed that a boring task (Mackworth Clock task) made people very fatigued, to a similar degree as an extremely demanding task (TLoadDBack task; Pickering et al., 2023, Canadian Journal of Experimental Psychology). This finding seems well in line with the authors' main argument.
- If the authors haven't seen these papers, I think it is worth looking at Hopstaken et al. (e.g., 2015, Biological Psychology), Müller et al. (2019, Nature Communications), and Dora et al. (2022, JEP:G). All these studies examine how feelings of effort and/or fatigue dynamically fluctuate during task performance, and they may support part of the authors' arguments.
2) Relevant theory
I was somewhat surprised the authors did not draw from Motivational Intensity Theory, which posits that effort emerges from the (nonlinear) interaction between (a) subjective task demands and (b) the importance of success on the task. Though the authors argue that boredom increases demands, I think one could also argue that boredom may reduce the importance of success. A good introduction is Richter et al. (2016, Advances in Motivation Science). I could also see this theory helping to make sense of findings post hoc, i.e., in the discussion.
3) Procedure
A methodological challenge with pupillometry is that (phasic) pupil dilation is not a super fast measure; the pupil response has a delay of several hundred milliseconds (Hoeks & Levelt, 1993; Hershman & Henik, 2020, Memory & Cognition). Another methodological challenge is that pupil dilation depends strongly on the brightness of the screen. In my experience, the effect of stimulus type (e.g., a letter string vs. a fixation cross) is a lot stronger than the effect of task difficulty, which may make it hard to interpret data if brightness is not kept constant and/or if stimuli have varying durations (e.g., if stimulus duration depends on RT, as is the case in the proposal; I would definitely try to avoid that). It is currently not clear if the authors are aware of these challenges. One way to deal with this issue would be to search previous papers that used pupillometry and Stroop, and closely model the procedure after these previous papers (such as Hershman & Henik, 2020).
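One way to reduce the impact of RT-dependent stimulus durations when quantifying the phasic response would be to baseline-correct each trial and average over a fixed post-onset window. A rough sketch of what I mean (Python/pandas; the file name, column names, and window boundaries are placeholders, not recommendations):

```python
import pandas as pd

samples = pd.read_csv("pupil_samples.csv")  # hypothetical long-format sample data

def phasic_response(trial_samples, baseline=(-0.2, 0.0), window=(0.5, 2.0)):
    """Mean baseline-corrected pupil size in a fixed window after stimulus onset."""
    t = trial_samples["time_from_onset"]  # seconds relative to stimulus onset
    base = trial_samples.loc[t.between(*baseline), "pupil"].mean()
    return trial_samples.loc[t.between(*window), "pupil"].mean() - base

# One phasic value per trial, independent of how long the stimulus stayed on screen
phasic = samples.groupby("trial").apply(phasic_response).rename("pupil_phasic")
```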
4) Procedure
The authors plan to do both sessions on the same day, with a 10-minute break in between. I understand that this is desirable from a practical point of view, especially since the authors plan to recruit 95 participants (which is great: this is a lot more than commonly done with this type of design). In other studies with similar designs, however, sessions are often conducted on different days (e.g., Lin et al., 2020, Psych Science), to prevent carry-over effects. As it stands, the duration of the break is justified solely by what seems to be a rather weak finding (an ego-depletion paper from 2008, with N = 10 per condition and a rather odd manipulation). Yet this is a crucial design choice for the present study: although a 10-minute break may be enough, I recommend justifying it better.
5) Analysis
p. 17-18 'Thought probes': The authors seem to treat time as a categorical variable in their ANOVA. If so, this seems somewhat odd, given that the authors make linear predictions (e.g., "we expect an increase ... over time", p. 18). I recommend doing this in a linear mixed-effects modeling framework, in which the effect of time is included as a fixed effect. This seems more appropriate and consistent with what the authors plan to do under "Effort and pupil size".
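For illustration, something along these lines (Python/statsmodels; the data frame and column names – 'subject', 'condition', 'time', 'boredom' – are placeholders for the long-format probe data), with time entered as a continuous predictor and a random intercept per participant:

```python
import pandas as pd
import statsmodels.formula.api as smf

probes = pd.read_csv("probe_data.csv")  # hypothetical long-format probe ratings

# Time (probe index or minutes on task) as a continuous fixed effect,
# random intercept per participant
m_intercept = smf.mixedlm("boredom ~ condition * time", data=probes,
                          groups=probes["subject"]).fit()
print(m_intercept.summary())
```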
6) Analysis
"Effort and pupil size": The authors explain that they will use "random effects among participants allowing for random intercepts". Could they clarify this? I can see the need for a random intercept: this tells the model people may differ, overall, e.g., in their phasic pupil dilation. However, I can also see the need for random slopes: these would tell the model that people may differ in, e.g., how strongly phasic pupil dilation depends on time. In the literature, there is some discussion as to how to navigate random slopes in experimental designs. It may be helpful to look at the (very influential) guidelines by Barr et al. (2013, J. Memory and Language) and the (very reasonable) response by Bates et al. (https://arxiv.org/abs/1506.04967). In any case, I think it would be good to decide a priori what the random-effects structure of their models will look like.
1A. The scientific validity of the research questions.
The proposed research questions are scientifically justifiable.
However, I have some concerns about the link between ‘theory/conceptualization’ and ‘operationalization / measurement’ of key variables.
1) The concept of ‘effort’ is not (a) precisely and explicitly defined, (b) sufficiently related to existing scholarship / definitions of ‘effort’, or (c) measured in a manner that has unequivocal validity. Moreover, at times the authors use the term ‘cognitive effort’ and at other times the term ‘effort’ (I suspect the authors do not intend to refer to distinct concepts with these two terms, and thus I would recommend using a single term). I note the authors provide a helpful definition of ‘self-control’ and suggest doing the same for ‘effort’. When it comes to defining effort, it is my understanding that this term has been used in a variety of different ways by different authors, and I think it would be useful to link the authors’ definition to existing scholarship to show how their work fits with the work of others. I will not articulate different definitions here beyond pointing out the work of Mulder (1986), because I think Mulder’s distinction between ‘processing complexity’ and ‘compensatory control’ is particularly apt in the present context (of course the authors are free to disagree with me about the applicability of Mulder’s distinctions). I am enthusiastic about the conceptual distinctions between (a) task difficulty, (b) boredom, (c) boredom-related effort, and (d) task-difficulty-related effort. However, it would be helpful for the authors to be clearer on how these concepts are being defined and, importantly, it would be useful to show that these concepts can be validly and distinctly reported by participants (and therefore measured by researchers). For example, can participants distinguish between the effort they have to exert because of boredom vs. task difficulty? Do participants consider the ‘boringness’ of a task when they rate task difficulty? Or, to put it differently, will participants interpret these questions in the way the researchers intend them to be interpreted? I wonder if there is some creative way to examine this with single-item probes. Perhaps showing autoregressive effects over time within concepts vs. lagged correlations across concepts would convince a reader of their psychometric distinctness? Or perhaps there is a better way of addressing these issues... At the end of the day, I think the proposed work would be publishable without an explicit demonstration of the validity of the probe items, so a note on this point in the limitations section would likely be sufficient. I offer this point in the spirit of trying to make the findings stronger. Finally, regarding the concept of ‘effort’, I suggest the authors be explicit about using pupil size as the gold standard for operationalizing effort and provide some background on the existing work correlating pupil size and self-reported effort.
2) I am concerned about the coherence of the concept ‘level of (self-)control’. First, sometimes the term ‘control’ is used and at other times the term ‘self-control’; again, I encourage the authors to be precise and use one term for one concept, or articulate the conceptual difference between control and self-control. Second, throughout, the authors refer to two levels of the Stroop task as ‘High Control’ and ‘Low Control’ and operationalize these two conditions in terms of Stroop task parameters (i.e. a difficult and an easy version of the Stroop; difficult and easy are appropriately further specified). However, the introduction, and the logic of their hypothesis, suggest that self-control is necessary to persist with easy (= boring) tasks. Thus, I recommend that the authors more clearly and coherently define or label their two Stroop conditions (using the more concrete ‘easy’ vs. ‘difficult’ seems like a promising way to go). Again, as I mentioned above, a more sophisticated discussion and definition of ‘effort’ along the lines of Mulder’s work might help when considering the best way to conceptualize the two versions of the Stroop. This issue of operationalizing level of (self-)control in terms of Stroop task parameters is particularly concerning when it comes to the proposed analyses for the second task. That is, if the probe results of task 1 demonstrate that boredom is effortful, then the proposed analyses for the second task may need to be altered.
1B. The logic, rationale, and plausibility of the proposed hypotheses.
The proposed logic and rationale are strong. The proposed hypotheses are plausible.
1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis).
The proposed methodology and analysis pipeline is sound and feasible.
1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses.
Sufficient and clear methodological detail is presented, which would allow for a close replication of the proposed work.
1E. Whether the authors have considered sufficient outcome-neutral conditions for ensuring that the obtained results are able to test the stated hypotheses.
Yes, the authors have sufficiently considered outcome-neutral conditions – carry-over and manipulation checks, for example. However, I don't recall seeing an explicit statement about whether or not the probe data would be analyzed if the hypothesized ‘task difficulty’, ‘boredom’, and ‘performance’ hypotheses are not confirmed (p. 17). I don't have an opinion on what is best to do in this regard... so I leave it to the authors.