Does auditory stream segregation reduce the irrelevant speech effect?

Chris Chambers based on reviews by Massimo Grassi and 2 anonymous reviewers

A recommendation of:

STAGE 1

The role of spatial location in irrelevant speech revisited: A pre-registered replication

Florian Kattner, Mitra Hassanzadeh, & Wolfgang Ellermeier https://osf.io/tx2wr version 3

Submission: posted 26 April 2023
Recommendation: posted 30 October 2023, validated 30 October 2023

Cite this recommendation as:
Chambers, C. (2023) Does auditory stream segregation reduce the irrelevant speech effect? Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=455

Recommendation

The irrelevant-speech effect (ISE) is a laboratory phenomenon in which performance at memory recall is impaired by the presence of irrelevant auditory stimuli during the initial encoding phase. In a typical ISE experiment, participants are asked to remember a sequence of letters presented visually (e.g. F, K, L, M, Q, R, Y in a shuffled random order between trials) while irrelevant speech is played over headphones. The typical finding is that recall performance is impaired by the presentation of speech compared with silence. The ISE has been influential in cognitive psychology, prompting the advancement of two broad classes of competing explanations: one in which the irrelevant sounds gain automatic access to memory processes without any specified role for attentional selection, and another in which the ISE is explained by irrelevant speech drawing attention away from the relevant items to be recalled.

In the current study, Kattner et al. (2023) propose a replication of a seminal study by Jones and Macken (1995) that provided a foundation for the automatic access (or ‘interference-by-process’) class of theories. In their original set of experiments, Jones and Macken reported that segregating individual components of the irrelevant speech (the spoken letters V, J, and X) into different lateralized locations reduced the magnitude of the ISE by converting a single ‘changing-state’ stream into three separate ‘steady-state’ streams. Here, Kattner et al. ask firstly whether this classic finding can be successfully replicated in a well-powered sample, and secondly whether the streaming-by-location effect in Jones and Macken reduces the ISE to the same level as observed during a steady-state baseline condition in which a single letter is repeated from each location. If the answer to either question is No then doubts will have been raised about interference-by-process theories, opening the door (even more) to alternative theoretical explanations of the ISE.

The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).

URL to the preregistered Stage 1 protocol: https://osf.io/2tb8e

Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.

List of eligible PCI RR-friendly journals:

References

1. Jones, D. M. & Macken, W. J. (1995). Organizational factors in the effect of irrelevant speech: The role of spatial location and timing. Memory & Cognition, 23, 192–200. https://doi.org/10.3758/BF03197221

2. Kattner, F., Hassanzadeh, M. & Ellermeier, W. (2023). The role of spatial location in irrelevant speech revisited: A registered replication of Jones and Macken (1995). In principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/2tb8e

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #2

DOI or URL of the report: https://osf.io/z6w3h

Version of the report: 1

Author's Reply, 25 Oct 2023

Dear Prof. Chambers,

We thank you and the three reviewers for the additional comments on our revised manuscript. We have now made the remaining minor changes (p. 10 and p. 14) and updated the manuscript in the OSF repository.

We hope that the Stage-1 manuscript now meets the requirements for an in-principle acceptance and we look forward to your response.

Best regards,

Florian Kattner, Mitra Hassanzadeh, and Wolfgang Ellermeier

Decision by Chris Chambers, posted 20 Oct 2023, validated 20 Oct 2023

The three reviewers from round #1 returned to look at your revised manuscript, and the good news is that all are broadly very happy with your revised proposal. I was also very impressed with the quality of your revision and response, not least because you managed to satisfy two reviewers who (as you will see below) differ significantly in their own views. Fortunately, any such disagreement does not present any barriers for progressing. There are a couple of minor issues remaining to clarify. Once these are addressed in a final revision, we should be ready to move forward with Stage 1 IPA.

Reviewed by Massimo Grassi, 22 Sep 2023

Green light for me. Just one thing about this comment I made in my previous review.

In my previous review I wrote:

- page 13. "statistically indistinguishable". Please operationalize this criterion.

"We replaced “statistically indistinguishable” by “not significantly different”."

Note that if (for example) the study was superpowered, authors would observe many statistically significant results, even this one. In my opinion it would be better to declare an effect size so that, below that effect size value, any difference is simply "not interesting" (of course given the specific power, and alpha etc.).
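The distinction drawn here, between a difference being statistically significant and a difference being large enough to matter, can be sketched in a few lines of Python. This is only an illustration and not the authors' analysis: it uses a large-sample normal approximation rather than a t-test, and the function name, the summary statistics, and the smallest effect size of interest (`sesoi`) are all hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def difference_and_equivalence(mean_diff, sd_diff, n, sesoi):
    """Large-sample (normal approximation) tests on a paired mean difference.

    Returns (p_difference, p_equivalence):
      p_difference  - two-sided p-value for H0: true difference == 0
      p_equivalence - two one-sided tests (TOST) p-value for
                      H0: |true difference| >= sesoi
    """
    z = NormalDist()
    se = sd_diff / sqrt(n)  # standard error of the mean difference

    # Conventional significance test against zero.
    p_difference = 2 * (1 - z.cdf(abs(mean_diff) / se))

    # Equivalence: the difference must be above -sesoi AND below +sesoi.
    p_lower = 1 - z.cdf((mean_diff + sesoi) / se)
    p_upper = z.cdf((mean_diff - sesoi) / se)
    return p_difference, max(p_lower, p_upper)


# Hypothetical numbers in standardized units (not data from any study).
huge = difference_and_equivalence(mean_diff=0.02, sd_diff=1.0, n=100_000, sesoi=0.10)
small = difference_and_equivalence(mean_diff=0.02, sd_diff=1.0, n=20, sesoi=0.10)
print("n = 100000:", huge)
print("n = 20:    ", small)
```

With the very large hypothetical sample, the tiny difference is both significantly different from zero and significantly inside the equivalence bounds, which is the situation the comment warns about. With 20 participants neither test is significant, so the outcome is simply inconclusive.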

Good luck with the data collection!

Reviewed by anonymous reviewer 2, 18 Oct 2023

I was pleased with the responsiveness of the authors to my comments and recommend progress to Stage 2. There are a few remaining minor issues to do with clarity/wording that I could note but correcting these is not critical for the acceptability of the study for Stage 2 and so I will do that when I see the next version. The only sentence that still puzzles me in terms of its content is the second sentence in the following:

"First, it is the first study manipulating the spatial arrangement of distractor sources in the irrelevant speech effect. Though it does so with only minimal, highly artificial means to spatialize the sound, i.e. by using monotic (left, right) vs. diotic (center) headphone presentation, resulting in a lateralized or central sound image in the listener’s head".

I am not sure what point is being made here (and it's not a grammatically complete sentence either): What does the fact that it used an artificial method have to do with the theoretical implications of the study, which is the main focus of this replication attempt, as reflected in the fact that the present study will use the same artificial method?

Reviewed by anonymous reviewer 1, 01 Oct 2023

I am pleased with the authors’ revision of their Stage-I proposal. All of my suggestions have been satisfactorily addressed.

To weigh in on the issue of what a direct replication is, I think that the importance of the differences between the replication study and the original study should be evaluated against the background of the theoretical framework that is tested. I think that it is quite clear that the minor deviations from the original procedure proposed by the authors should not matter as it has been demonstrated time and again that the changing-state effect can be obtained regardless of whether the responses have to be written down or whether participants click through the digits. Furthermore, both responses should rely on order processing and thus should be subject to a changing-state effect. It is extremely well established for the serial-recall paradigm that the primary determinant of the changing-state effect is the presence or absence of acoustic changes in the auditory stream while the meaning of the to-be-ignored material is comparatively irrelevant. For example, speech disrupts serial recall to the same degree regardless of whether it is played forward or backward (Jones et al., 1990; Röer et al., 2014, 2017). What is more, speech disrupts performance to about the same degree regardless of whether it is in the participants’ native language or a foreign language (Colle & Welsh, 1976; Ellermeier et al., 2015; Ueda et al., 2019). This is true in the case of continuous speech and it certainly applies to the stimulus material that is used in the present study that is essentially devoid of meaning anyway. So the relevant issue is whether the acoustical properties of the to-be-ignored distractors match those of the original study. Therefore, I think that the authors’ decision to use the original distractor sounds with the English pronunciation of the letters V, J, and X is the right choice.
Let me note that a “successful” replication of the original pattern of results is not guaranteed because the original study does contain several features (for example, the long retention interval) that are not ideal for observing a strong changing-state effect. If the replication fails, I think that these characteristics of the original study are much more plausible candidates for producing undesirable outcomes than the changes to the procedure that were made by the authors of the Stage-1 proposal. Nevertheless, I of course see the advantages of faithfully reproducing these characteristics of the original study (e.g., using the same retention interval) when performing the replication. Therefore, I think the replication study can be performed as suggested by the authors.

I should also add that I completely disagree with Reviewer 1’s evaluation of the attentional account. Reviewer 1 paints a completely distorted picture of the available evidence on the attentional account, primarily by selectively citing evidence of Jones and Macken that is mostly based on extremely small sample sizes while ignoring the extensive literature supporting the attentional account (e.g., Bell et al., 2019a, 2019b, 2021a, 2021b, 2022; Körner et al., 2017, 2019; Röer et al., 2014a, 2014b, 2015). Given that Reviewer 1 ignores most of the available evidence and recent theoretical developments (Bell et al., 2021b), it is quite unsurprising that Reviewer 1 ends up with a completely biased evaluation of the attentional account. Reviewer 1 clearly didn’t bother to research how the attentional account would incorporate the streaming-by-location effect or misinterprets the available evidence. My impression is that Reviewer 1 tries to create their own attention-based straw-man account that is not helpful in advancing our understanding of auditory distraction. Therefore, I think it is important for me to chime in to support the authors’ original standpoint that is actually well-balanced and backed up by relevant evidence. The Stage-I-proposal was stimulated by a study of Jones and Macken (1995) that tested an important implication of their order-interference account. Therefore, it is obvious that the order-interference account has to be credited for allowing Jones and Macken (1995) to arrive at an interesting novel prediction. In fact, the finding that the changing-state effect is modulated by streaming has turned out to be a cornerstone for theorizing even though, as the authors of the Stage-I proposal correctly note, the finding has never been independently replicated. 
While the order-interference account has to be credited for being the first account from which a prediction about streaming-by-location has been derived, this of course does not necessarily imply that the same finding cannot be incorporated into other accounts. There is in fact good evidence that electrophysiological correlates of attention switching such as the mismatch negativity are prone to streaming effects not too dissimilar to what Jones and Macken (1995) have found (for reviews, see Näätänen et al., 2007; Sussman et al., 2014). Furthermore, as the authors of the Stage-I proposal correctly notice, the conditions realized by Jones and Macken (1995) confound streaming by location with predictability. Therefore, a predictability-based account could explain the same finding. This is an extremely important observation of the authors of the Stage-I proposal. I think that it is an essential part of the Stage-I proposal that the authors reflect on how important modern theories such as the attentional account would incorporate the results and not to rely only on the order-interference account. This is especially important given that it does not seem to be a necessity to confound streaming by location with predictability, it is just a choice that was made by Jones and Macken (1995). By reflecting on how the attentional account would incorporate the results of Jones and Macken (1995), the authors of the Stage-I proposal lay the groundwork for future research in which, in case of a successful replication of Jones and Macken’s (1995) finding, the contributions of predictability and streaming can be teased apart. This is how science should proceed.

Evaluation round #1

DOI or URL of the report: https://osf.io/72hq5

Version of the report: 1

Author's Reply, 17 Sep 2023

Decision by Chris Chambers, posted 09 Jul 2023, validated 09 Jul 2023

I have now received three very helpful and detailed evaluations of your submission. The reviews are overall positive and I believe your submission is a promising candidate for eventual Stage 1 in-principle acceptance. The reviewers offer a range of constructive comments and suggestions, which I won't attempt to exhaustively summarise, but I will highlight a few points that struck me as particularly salient based on my own reading of your submission.

First, it seems important in this case to ensure that the replication is as similar as possible to the original study in critical areas of the design that could explain any potential differences in outcomes. One example highlighted in the reviews is the choice of stimuli. This is an understandable (and justifiable) deviation considering the target sample but also raises issues to consider. One possible solution would be to run two replications in parallel, one using original English stimuli and the other using German stimuli, perhaps validated for suitability as Reviewer 1 proposes. I will leave you to consider this possibility -- it is not a concrete requirement but a point that warrants careful consideration. I would also suggest including an additional table in the manuscript that highlights key similarities and differences in the methods of the replication and the original study.

Other points of note include clarifying (and further justifying) the rationale for adjudicating between the different theoretical accounts, clarifying the justification of the smallest effect size of interest in the sampling plan (Reviewer 2), and making sure the methods are as reproducible as possible (Reviewer 3, Massimo Grassi). I realise not all researchers use R (and it is not mandatory to do so), but whatever approach is used, the goal expressed by the reviewer should be met to ensure that the sampling plans and analyses are scripted as much as they can be, fully explained, and computationally reproducible.
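One way to make a sampling plan scripted and re-runnable, whatever software is used, is to compute it directly from its stated inputs. The sketch below is an illustration in Python under stated assumptions, not the authors' actual plan: it relies on a normal approximation for a two-sided paired comparison (which slightly understates the sample size an exact t-based calculation would give), and the function names and the effect sizes passed to them are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_power(dz, n, alpha=0.05):
    """Approximate power of a two-sided paired test (normal approximation).

    dz is the standardized mean difference of the paired scores.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = dz * sqrt(n)
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def approx_sample_size(dz, power=0.90, alpha=0.05):
    """Smallest n reaching the target power under the same approximation."""
    z = NormalDist()
    return ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / dz) ** 2)


# Hypothetical inputs, chosen only to show how the plan is recomputed.
print("power with n = 20, dz = 0.5:", round(approx_power(0.5, 20), 3))
print("n for 90% power at dz = 0.3:", approx_sample_size(0.3))
```

Keeping the inputs (effect size, alpha, target power) as explicit arguments means a reader can rerun the calculation and see how the planned sample size changes if the smallest effect size of interest is revised.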

Reviewed by anonymous reviewer 2, 30 Jun 2023

This paper proposes to replicate the study of Jones and Macken (1995) which showed that the disruptive effect of changing-state sound on serial recall is attenuated when the changing-state stimuli are presented in the spatial domain in such a way as to try to induce the perception of a number of ‘steady-state’ streams. The Jones and Macken study was quite ingenious and in general I welcome the present authors’ proposal to attempt to replicate their Exp 1c (which had the most comprehensive set of conditions). However, I have a number of concerns about the current proposal.

Main points:

1. I welcome places like PCI RR as (partly) an outlet for publishing replications (or replication attempts) even if that is the main or sole purpose of the study in question. However, I believe the methodology of such studies should adhere as closely as is reasonably possible to the original study. And indeed, the present authors state early in the paper that that is their intention here: “Jones and Macken’s (1995a <<should be ‘b’ by the way) Experiment 1c, which contains all relevant control conditions including and beyond those used in Experiments 1a and 1b (and Jones et al., 1999), is to be replicated as closely as possible”. However, when I read on, I found that this wouldn’t in fact be the case if the authors were to proceed with the current proposal. There are three methodological changes from the original experiment, none of which I think is necessary:

i) The authors propose to use (computer-generated) German-pronounced versions of “V”, “J”, and “X” instead of the English-pronounced versions of the same letters used by Jones and Macken (1995). This is understandable in one sense because the authors will draw on a pool of participants in Germany and want the speech to be in the participants’ first/dominant language just as it was in Jones and Macken’s (1995) study (conducted in the UK). However, on balance, I think that it would be ‘safer’ to use the same English pronunciations as used by Jones and Macken (1995) (my understanding is that these three same letter-names are pronounced quite differently in the two languages). Of course, I don’t mean the exact same sound recordings but rather that the authors use a computer-generated (native English) voice to produce the speech tokens with (broadly) the same pronunciation as those used by Jones and Macken. The danger with using different auditory tokens is that one set of auditory tokens, or combination thereof, may not stream apart as well as another set of auditory tokens. There may, for example, be higher acoustic or post-categorical transitional probabilities between certain speech sounds serving to inhibit the fission of successive tokens into separate streams (Bregman, 1990; Warren, 1999). Moreover, we know that the changing-state effect is not modulated by linguistic factors: the effect, as well as, importantly, its modulation by streaming cues is also found with pure tones (Jones, Alford, et al., 1999, JEPLMC). Thus, on balance, I think it would be better to match the Jones and Macken experiment in terms of the letter-names used rather than match it in terms of the language of the speech being the same as the dominant language of the participants.

An alternative solution (though not my preference) would be to conduct some sort of independent empirical check on whether and how strongly and how consistently the (German-spoken) tokens stream apart, ideally involving a comparison with English-pronounced “V”, “J”, and “X”. I noticed that Jones and Macken did such a check in relation to their mono vs stereo CS conditions and so something similar could be done here re: the German tokens vs the English tokens.

ii) The authors propose to use an order reconstruction response mode where participants are shown an array of the to-be-remembered letters and the participant must click them in the order in which they were just presented whereas Jones and Macken (1995) had participants write down their responses. Whilst transcribing written responses is more time-consuming, a written response mode should nevertheless be used in my view.

iii) The present authors propose to provide feedback on the participant’s performance after each trial whereas this was not the case in Jones and Macken. Again, please remove this unnecessary addition to the original experiment.

Whilst I admit that it is not obvious why any of these methodological features--either alone or in some combination--should make a difference to the outcome, if a different outcome is indeed observed, readers may well wonder whether one or more of them did indeed have an effect, which would not be satisfactory when the sole purpose of a study is to replicate a previous one.

2. P. 2 “Jones and Macken’s study, which has never been replicated by a different laboratory, to our knowledge,…”

Whilst it’s true (to the best of my knowledge too) that Jones and Macken’s (1995) streaming-by-location study of the changing-state effect hasn’t been replicated by a different laboratory, there are some highly relevant demonstrations of effects of streaming on auditory distraction during serial recall (one of them from a separate lab) that currently go unmentioned in the manuscript:

i) Jones, Alford, Bridges, Tremblay, & Macken (1999, JEPLMC, Vol. 25, No. 2, 464-473) showed that having a particularly large difference in pitch between successive tones—again the idea being to induce their perceptual segregation into two steady-state streams—attenuates the changing-state effect.

ii) Macken et al. (2003, JEPHPP, Vol. 29, No. 1, 43–51) showed the attenuating effect on the changing-state effect of streaming induced by increasing the rate of presentation (whilst keeping the content, including the pitch, of the tokens constant).

iii) Although not quite as relevant to the changing-state effect per se, Hughes and Marsh (2017. JEPHPP, Vol. 43, No. 4, 537–551) showed that the order incongruence effect—the particularly pronounced disruption to serial recall caused by sound tokens post-categorically identical to the to-be-remembered items but only when they are presented in an incongruent order with the to-be-remembered items (Hughes & Jones, 2005)—disappears when the successive sounds in an objectively order-incongruent sequence are presented from two different ‘locations’ (presented to each ear in an alternating fashion) and in two different voices so as to render it, perceptually, no longer order-incongruent.

One general point that arises from the fact that these relevant studies appear to have been missed by the current authors is that rather too much is made of the notion that Jones and Macken (1995; and Jones et al., 1999) is the only demonstration of the key phenomenon of interest. Whilst it is true that these other studies didn’t focus on spatial location cues (though Hughes and Marsh, 2017, did include spatial location cues), I don’t think the fact that Jones and Macken (1995) used location cues specifically is as important as the authors make out – the key question is whether streaming cues (whatever their nature) modulate auditory distraction effects in serial recall. The findings of some of these other studies are also relevant to some of the other concerns raised below.

3. P. 7. “In our view, these steady-state control conditions from Experiment 1c are crucial as they constitute the true “steady-state” reference and allow to tease apart the irrelevant speech effect (comparison with silence) into a changing-state effect (changing condition vs. steady condition) and a steady-state effect (steady condition vs. silence). Note that this kind of reference is needed to be able to assess whether the release from interference caused by spatial streaming (in the “V-J-X” stereo condition) reduces to a perceptual steady-state condition (or three steady-state streams, according to Jones and Macken’s reasoning), or whether some sort of changing-state effect remains (i.e., changing-state stereo being slightly more disruptive than steady-state stereo).”

And, relatedly, on p. 10 “…the results are not as clear-cut as Jones and Macken (1995a)’s interpretation suggests: While the critical “stereo” condition affording spatial streaming into three steady-state sources should completely abolish the changing-state effect (due the eliminated interference with serial-order processing)…”

I don’t agree with the authors here. Whilst it was/is ideal to include the SS conditions (for one thing, their inclusion showed the effect wasn’t driven merely by the spatial changes in the CS-stereo vs CS-mono conditions, as argued by Jones and Macken), they were/are not crucial for the theoretical conclusions of Jones and Macken (1995). As the present authors say in their previous paragraph, “the crucial prediction made by Jones and Macken (1995b) is that the changing-state mono condition should be significantly more disruptive…than the stereo condition..”. Indeed, Jones and Macken made no prediction regarding a possible (‘residual’) difference between CS-stereo and SS-stereo, no doubt because they were well aware that streaming is an unstable and not an all-or-none phenomenon. The strength of streaming could thus easily vary between trials or/and individuals. Indeed, Jones, Alford et al. (1999), using pitch as a streaming cue, did find a ‘residual’ effect in a streamed CS condition vs a ‘true’ SS condition and stated that: “This may be because the effect of having two steady-state streams is more disruptive than having one steady-state stream. *More likely, as studies of the phenomenology of streaming suggest, the process of stream separation is relatively unstable, so that the two channels are not rendered consistently*. In any case, whatever the precise level of performance in relation to the repeated-token conditions, analytically, the significant improvement in going from 5 to 10 semitones seems to suggest quite strongly that streaming processes were at work and that these modulated the disruptive effect of irrelevant sound.” (p. 471, asterisks added).

Thus, I don’t believe that a residual CS effect would speak to the veracity of the interference-by-process account. It is certainly not the case in my view, therefore, that “If the spatial separation of sound sources (via stereo) does not reduce distraction to the level of “steady-state” speech, the strict interpretation from the original study is refuted, thus challenging the interpretation given by Jones and Macken (1995).” (from Table on p. 21).

I also don’t really see why being able to decompose the disruption into a ‘changing-state effect’ and ‘steady-state effect’ is important, at least in the present context. It has already been demonstrated that a steady-state effect can be detected with enough statistical power (Bell et al., 2019); examining whether that effect is observed again here seems tangential to the main point of the proposed experiment.

In sum, I agree that the two SS conditions should be included in the proposed experiment (as they were in Jones and Macken’s Exp 1c) but I’m not at all convinced by the current authors’ justifications for why they are necessary; they are worthwhile additions but not “crucial” for the theoretical conclusions of Jones and Macken (1995) or those drawn in the related papers by Jones, Alford et al. (1999), Jones et al. (1999), and Macken et al. (2003).

4. The authors seek to pit the interference-by-process account against the attentional account (Bell, Röer et al.). However, I don’t think the attempt works. There are, for example, some obvious difficulties with the attentional account of the changing-state effect, both in general and in relation to the effects of streaming in particular, that go unmentioned.

For example, the authors reiterate the attentional account’s proposition that “the changing-state effect may arise because a sequence of changing sounds is less predictable than a steady-state sequence” (p. 4) and yet the degree to which a changing-state sequence disrupts performance is not a function of its predictability (e.g., Hughes & Marsh, 2020; Tremblay & Jones, 1998; Jones et al., 1992). Furthermore, an (unpredictable) changing-state sequence does not disrupt non-seriation based processing (e.g., Hughes & Jones, 2020).

In relation to streaming studies in particular, it is difficult to see how the attentional account can explain the fact that having more tokens per unit of time can *reduce* the degree of disruption (Macken et al., 2003): having more tokens per unit of time should surely require more attentional resources for monitoring and hence increase the amount of disruption?

In relation to the proposed experiment, the authors state that, on the one hand, the attentional account predicts greater disruption in the CS-stereo condition than the CS-mono condition because in the former condition there are changes occurring in terms of both spectral content and spatial location (see p. 11). That is, it predicts the opposite result to that found by Jones and Macken (1995). So far, so good: the accounts make opposite predictions (though it doesn’t bode well for the attentional account given Jones and Macken’s, 1995, results). However, in the next sentence, the authors state that “Only if some kind of auditory stream segregation - or, more akin to the model, divided attention - is postulated, will Jones and Macken’s (1995b) “stereo” condition boil down to three steady-state streams.”. Thus, the authors seem to be suggesting here that the attentional account can also explain the result obtained by Jones and Macken (1995b). This is a problem in itself in that the attentional account is clearly too ill-specified and hence too flexible to serve as a theoretically useful counterpoint to the interference-by-process account, at least in the present context. This is especially the case given that we are told that the account would have to appeal (and in an ad-hoc way it seems) to the same mechanism (auditory stream segregation) as the interference-by-process account. I didn’t understand what the authors meant by “or, more akin to the model, divided attention”: Given that the irrelevant sound effect itself is explained on the attentional account in terms of attentional resources being divided between the focal task and the sound, wouldn’t having to divide attention further ‘within’ the irrelevant sound again be expected to increase the level of disruption still further, not decrease it?

The authors go on to state that “The stereophonic control condition of Jones and Macken’s Experiment 1c, where one and the same letter alternates between three locations, might be thought to require slightly more attentional resources than the monophonic steady-state condition (repetitions of the same letter at a single location), since the changing spatial position of that letter (driven by the interaural level differences) constitutes a changing sound feature.”

If this is the case, then surely it’s the first prediction above that must hold in relation to the CS-stereo vs CS-mono conditions (i.e., greater disruption in the CS-stereo condition than the CS-mono condition because in the former condition there are changes occurring in terms of both spectral content and spatial location) and not the second prediction?

Perhaps it is because clear predictions cannot be derived from the attentional account that the implications for this account of various possible outcomes are not included in the table on p. 21?

One general observation arising from main points 2-4 above, together with minor point ? below is that none of the ‘added’ justifications offered by the current authors for the replication attempt are convincing. But then such additional justifications may not be needed; a simple (faithful) replication may well be of sufficient value in and of itself.

Minor points

1. There is some mixing up of the citations to Jones and Macken (1995a) and Jones and Macken (1995b) at various points through the manuscript and there are also occasions on which it is ambiguous (i.e., “Jones and Macken, 1995”, especially as there was a third 1995 paper by Jones and Macken!)

2. P. 3. “While the earliest theoretical explanation as to why irrelevant speech effects occur, focused on interference-by-content between articulatory rehearsal of the to-be-remembered verbal material and the automatically encoded phonological elements of irrelevant speech in working memory (Baddeley & Hitch, 1974),…”

First, I think Salamé and Baddeley (1982) should be cited here rather than (or as well as) Baddeley and Hitch (1974) as the latter paper predates the discovery of the irrelevant speech effect. Second, the description of the theory isn’t quite right (at least as of Baddeley, 1986, onwards): An important discriminating feature of the phonological loop-based account of the ISE is that the irrelevant speech disrupts the passive phonological store and not articulatory rehearsal (e.g., Baddeley, 2007; Jones et al., 2004, JEPLMC).

3. p. 8 “This result is (1) significant in the web of empirical results determining what factors modulate memory disruption by irrelevant speech, particularly, since, among the many acoustical factors modulating the irrelevant speech effect (Ellermeier & Zimmer, 2014), it is the first to demonstrate an effect of the spatial layout of the irrelevant sound background in affecting the amount of auditory distraction”.

It’s not entirely clear what these 'many acoustical factors' are. The key factor is changing-state/acoustic variability, which is what Jones and Macken (1995) were examining.

4. p. 9 “Though it does so with only minimal, highly artificial means to “spatialize” the sound image, i.e. by using monotic (left, right) vs. diotic (center) headphone presentation, resulting in a lateralized or central stream in the listener’s head; no other study to date has replicated this classic study with more sophisticated means of generating spatially localizable sound sources, such as loudspeaker arrangements or headphone-based binaural techniques (Hammershøi & Moller, 2005), thus generating realistic externalized sound images rather than producing an “in-the-head” localization.”

When I read this, I feared that the present authors’ proposed experiment was going to deviate from the original (“highly artificial”) method. Happily, I saw from the Method section that it does not, but this paragraph may mislead other readers too.

5. p. 10 “Second, Jones and Macken’s (1995b) study is statistically underpowered: Even though they obtain a spatial-streaming effect (statistically significant difference between their “mono” and “stereo” conditions) in three separate experiments (1a-c), each of them is based on data from twenty participants only”

I didn’t find this argument very convincing – the fact that three separate experiments obtained the same reliable effect makes it very likely that a reliable ‘overall’ effect would have been found in a cross-experiment analysis (i.e., with an n = 60, which is similar to the n of 54 proposed for the current experiment). Moreover, the effect was replicated again (as the authors note) in Jones et al. (1999).

6. p. 10 “That might suggest that this is not just a steady-state effect (which typically is hardly measurable with so few participants, see Bell et al. 2019) but a residual changing-state effect, potentially due to the spatial switching between locations”

First, as noted in main point 3, any ‘residual changing-state effect’ could be due to imperfect/unstable streaming. Second, Jones and Macken’s (1995b) Exp 1c was specifically designed to examine whether spatial switching per se could produce some disruption and they found no support for that.

7. I don’t think the authors mention the intensity level at which the sounds will be presented but presumably this will be 65 dB(A) as in Jones and Macken’s study?

Very minor points:

1. Abstract “…that provided evidence for the spatial separation of sound sources to reduce the detrimental effects of irrelevant speech on short-term memory”

The wording is awkward here (and it doesn’t quite work). I’d suggest: “…that showed that the separation of successive sound stimuli to distinct spatial sources reduced the detrimental effect of irrelevant speech on short-term memory”

2. The authors use the term ‘unitary attentional account’ but the term 'unitary' won't make sense to an uninitiated reader unless the contrasting 'duplex' account is also introduced. I think 'attentional account' is sufficient in any case (or ‘attentional-capture account’ would be even better as the interference-by-process mechanism can also arguably be couched in ‘attentional’ terms, e.g., Hughes & Marsh, 2017).

3. In a number of places, the authors refer to Jones and Macken’s ‘classical’ experiment. I believe the authors mean ‘classic’ (‘classical’ means traditional) as indeed they do use ‘classic’ at other points.

4. P. 5 “was that both a “mono” version of the irrelevant letter sequence was contrasted with a “stereo” version.”

>>delete “both” (one cannot contrast both of two things; each is contrasted with the other).

5. p. 7 “more disruptive to the serial recall of information stored in short-term memory”

The wording is a bit unfortunate here because the Jones and Macken (1995) study formed part of what was (or what certainly became) their view that there is no such thing as 'short-term memory' for things to be 'in'! I would suggest just saying '..more disruptive of serial recall than the stereo condition'

6. P. 13 “If, in fact, spatial streaming reduces the changing-state effect by producing all but a steady-state effect…”

I think the authors meant to say ‘nothing but a…’ not ‘all but a…’ here.

Reviewed by anonymous reviewer 1, 20 Jun 2023

Let me first emphasize that I think that it is a great idea to perform a direct replication of the study of Jones and Macken (1995) with increased statistical power. As the authors write, the study has had a huge impact on theories of auditory distraction. The finding is cited to this day as a key finding that theories of auditory distraction have to explain. Considering the strong impact on the field, it is indeed noticeable that independent replications of this finding are still missing. As noted by the authors of the Stage-1 manuscript, there are replications by the same research group (Jones et al., 1999) but no independent replications by other research groups. I fully agree with the authors that a replication is needed based on the fact that the original study as well as the replication study had only small sample sizes and were thus seriously underpowered, combined with the fact that there are already failed replications in the literature of studies with similar characteristics (e.g., Kvetnaya, 2018). The study is a straightforward replication that will be highly useful for providing a better empirical basis for theories of auditory distraction regardless of the outcome of the statistical hypothesis test.

The Stage-1 manuscript is very well written, the literature review is up to date, and the theoretical reasoning is convincing. Overall, I think that this is a great proposal.

I fully agree with the authors that the original study by Jones and Macken (1995) as well as the replication study by Jones et al. (1999) were seriously underpowered and I think that it is great that the authors plan to repeat Jones and Macken’s (1995) study with more than double the sample size. Nevertheless, I wondered whether the power calculations are still too optimistic. How did the authors arrive at an effect size estimate of f = .25? Did the authors use the default options of G*Power? Is the correlation among the levels of the repeated-measures factor included in the effect size measure? Furthermore, it seems to me that the sample-size planning did not consider that the authors intend to use Holm corrections of the alpha level for multiple comparisons, which should decrease the sensitivity of the statistical test. As a side note, I found it somewhat confusing that the term “changing-state effect” (at least to my understanding) is used to refer to the comparison between changing-state speech and steady-state speech as well as to the comparison of steady-state speech between mono- and stereophonic presentation modes. I think that the term “changing-state” should be reserved for the comparison between changing-state speech and steady-state speech. As a possible way to move forward, the authors may consider highlighting one specific statistical test as the decisive test of the hypothesis instead of correcting for multiple comparisons. The other tests can then be performed as supplementary exploratory tests.
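The two concerns raised above (whether the correlation among repeated measures enters the effect size, and how a Holm correction affects sensitivity) can be illustrated with a minimal sketch. It is not the authors' calculation and does not reproduce G*Power. It assumes, purely for illustration, that f = .25 is read as a two-condition effect (d = 2f), that the correlation between conditions is ρ = .5 (a hypothetical value), that three comparisons are Holm-corrected, and that a normal approximation to the paired test is adequate. The function names are invented for this example.

```python
from statistics import NormalDist


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared
        # with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


def paired_dz(d, rho):
    """Standardized paired difference d_z = d / sqrt(2 * (1 - rho))."""
    return d / (2 * (1 - rho)) ** 0.5


def paired_power_normal_approx(d_z, n, alpha):
    """Approximate two-sided power of a paired test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d_z * n ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative assumptions only: f = .25 read as d = 2f, rho = .5, n = 54.
d_z = paired_dz(d=2 * 0.25, rho=0.5)
power_nominal = paired_power_normal_approx(d_z, n=54, alpha=0.05)
# Worst case under Holm with three tests: the smallest p-value faces alpha / 3.
power_strictest = paired_power_normal_approx(d_z, n=54, alpha=0.05 / 3)
```

Under these assumptions `power_strictest` is lower than `power_nominal`, which is the reviewer's point: planning at an uncorrected alpha overstates sensitivity for the most stringent Holm step. Changing `rho` changes `d_z`, so the assumed correlation needs to be reported alongside f.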

Minor issues

I think that Jones and Macken (1995) had a 7-second presentation phase and a 10-second retention interval. It is true that they wrote that they had created a sound recording that lasted approximately 20 seconds, but the fact that the sound recording lasted 20 seconds does not imply that it was played for 20 seconds. Jones and Macken (1995) are clear in the description of the procedure that the presentation phase was 7 seconds and the retention phase was 10 seconds, resulting in a total exposure of 17 seconds. I don’t think that this is a critical issue because, of course, the authors of the Stage-1 manuscript can argue that exposure is maximized by presenting the sounds for the full 20 seconds, thereby increasing the chances of finding effects if they exist, but I wanted to mention it because it is a deviation from the original study that does not seem to be strictly necessary.

According to APA style, quotation marks should not be used to highlight key phrases. I therefore suggest removing the quotation marks highlighting “stereo”, “mono”, “irrelevant speech effect”, “changing-state effect”, “unitary attentional account”, “spatial planning”, “changing-state hypothesis”, “streams”, “steady state”, “changing state”, “stream out”, “interference-by-process”, “spatialize”, “in the head”, “babble speech”, “standard babble”, “streamed babble”, and “babble”.

Page 10: “While the critical stereo condition affording spatial streaming into three steady-state sources should completely abolish the changing-state effect (due the eliminated interference with serial-order processing), the memory disruption observed in all three experiments is still substantial and – showing mean performance between the silent control and the most disruptive monophonic changing-state condition (see Figure 2).” – I found this sentence hard to follow – please revise.

Page 11: “differences differences” —> differences?

Page 12: Please note in the figure caption what the error bars stand for and please add an explanation as to why the error bars are not displayed for the steady-state conditions.

Page 12: “low statistical power and unequivocal results” —> “low statistical power and ambiguous results”?

Page 16: “test feedback” —> feedback?

Reviewed by Massimo Grassi, 20 Jun 2023

I read the stage 1 RR by Hassanzadeh et al. In my opinion the work is good, but there is some work to be done on two sides:
1. make more explicit the direct comparison between original study and current replication
2. improve the computational replicability of the protocol submitted by the authors

1. In the paper, the authors place a lot of emphasis on the theory and the importance of the original study. This is certainly correct. But when it comes to replication, they should place more emphasis on the similarities and (more importantly) the differences between the original and the replication. I guess the authors are familiar with the distinction between direct and indirect replication (Zwaan et al., 2018). This looks like a direct replication to me. When it comes to a direct replication (a hypothetical ideal in psychology), the authors should state the differences between the original experiment and the replication explicitly and ask themselves whether the differences are theoretically relevant and whether they may affect the results (see references). This matters because, if the results of the replication differ from the original, we have two possible scenarios: if the replication is a direct replication, we can criticize the original result; if it is not, there is not much that can be done. If differences are theoretically relevant, the authors should report them. In particular, I would note the differences at the relevant points in the description of the method and analysis. We tried to do something similar in our paper replicating the attentional blink (see references). Note that in many cases we do not know whether differences may be theoretically relevant, but if the authors acknowledge them, at least they will be evident to future readers. Differences I noticed: different recorded voice (is it relevant?).
[By the way, I didn't understand why the authors criticize the stereophonic simplicity of the original experiment but then replicate exactly the same type of stereophonic stimuli (page 9). Note that I do prefer the direct replication of the original stimuli!].

2. I would make the materials (software) reproducible. May I ask the authors to translate the power calculation into R? G*Power is rather opaque, and in many cases it is not clear which power calculation is performed. The same applies to the analysis scripts that will be used for the statistical analysis. Note also that it was unclear exactly how the power analysis was calculated.

3. Suggestion. The authors claim they want to conduct an extremely powerful replication (in statistical terms). If this is true and the outcome of the experiment is a very clean (highly replicable) result, an independent replication of the replication, perhaps a less powerful one (see our paper), should return the same result. This would strengthen the result of the study.
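As an illustration of what a scripted, inspectable power calculation (point 2) might look like, here is a minimal simulation sketch. It is written in Python rather than R, uses only the standard library, and is not the authors' analysis. The effect size `d`, the correlation `rho` and the function name are hypothetical; the critical value of about 2.006 is the two-sided 5% critical t for 53 degrees of freedom (n = 54) and would need to be looked up again for any other sample size.

```python
import random
from statistics import mean, stdev


def simulate_paired_power(n, d, rho, t_crit, n_sims=2000, seed=1):
    """Monte Carlo power of a two-sided paired t-test.

    n: number of participants
    d: mean difference between two conditions, in within-condition SD units
    rho: assumed correlation between the two conditions (0 <= rho < 1)
    t_crit: two-sided critical t value for n - 1 degrees of freedom
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n):
            shared = rng.gauss(0, 1)  # participant-level component
            a = rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            b = rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1) + d
            diffs.append(b - a)
        t = mean(diffs) / (stdev(diffs) / n ** 0.5)
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sims  # proportion of simulated studies that reject H0


T_CRIT_DF53 = 2.006  # approximate two-sided 5% critical value, df = 53

# Hypothetical inputs: d = 0.5 with two different assumed correlations.
power_rho_50 = simulate_paired_power(n=54, d=0.5, rho=0.5, t_crit=T_CRIT_DF53)
power_rho_00 = simulate_paired_power(n=54, d=0.5, rho=0.0, t_crit=T_CRIT_DF53)
```

Because every assumption is an explicit argument, a reader can see directly how the estimated power depends on the assumed correlation, and the fixed seed makes the result reproducible. Setting `d=0` gives a quick check that the false-positive rate is near the nominal 5%.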

Minors
- I do not understand exactly what counts as an "error". If I type one letter in the incorrect position it is certainly an error (I guess). What about the question mark? Is this still an error? Note that typing something sets a floor for correct guesses.
- Is there any filter for outlier performances? I would use the same one as in the original experiment.
- Page 13: "statistically indistinguishable". Please operationalize this criterion.
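One common way to operationalize "statistically indistinguishable" is an equivalence test (two one-sided tests, TOST) against a pre-specified margin. The sketch below is only an illustration of that idea, not the criterion the authors adopted. The equivalence margin is a hypothetical value that would have to be justified in advance, the function name is invented, and a normal approximation stands in for the t distribution to keep the example within the Python standard library.

```python
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """Two one-sided tests on paired differences (normal approximation).

    diffs: per-participant differences between two conditions
    margin: smallest difference considered meaningful (same units as diffs)
    Returns (p_value, equivalent), where equivalent means both one-sided
    null hypotheses (difference <= -margin, difference >= +margin) are rejected.
    """
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / n ** 0.5
    z = NormalDist()
    p_lower = 1 - z.cdf((m + margin) / se)  # H0: true difference <= -margin
    p_upper = z.cdf((m - margin) / se)      # H0: true difference >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha
```

For example, `tost_paired(differences, margin=0.05)` would declare two conditions equivalent only if the mean difference in error proportion is reliably within ±0.05; a non-significant difference test alone would not meet this criterion.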

Kind regards,
Massimo Grassi

Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., ... & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719-748.
Nosek, B. A., & Errington, T. M. (2020). What is replication? PLoS Biology, 18(3), e3000691.
Grassi, M., Crotti, C., Giofrè, D., Boedker, I., & Toffalini, E. (2021). Two replications of Raymond, Shapiro, and Arnell (1992), the attentional blink. Behavior Research Methods, 53, 656-668.
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, e120.