This paper proposes to replicate the study of Jones and Macken (1995) which showed that the disruptive effect of changing-state sound on serial recall is attenuated when the changing-state stimuli are presented in the spatial domain in such a way as to try to induce the perception of a number of ‘steady-state’ streams. The Jones and Macken study was quite ingenious and in general I welcome the present authors’ proposal to attempt to replicate their Exp 1c (which had the most comprehensive set of conditions). However, I have a number of concerns about the current proposal.
1. I welcome places like PCI RR as (partly) an outlet for publishing replications (or replication attempts) even if that is the main or sole purpose of the study in question. However, I believe the methodology of such studies should adhere as closely as is reasonably possible to the original study. And indeed, the present authors state early in the paper that that is their intention here: “Jones and Macken’s (1995a [should be ‘b’, by the way]) Experiment 1c, which contains all relevant control conditions including and beyond those used in Experiments 1a and 1b (and Jones et al., 1999), is to be replicated as closely as possible”. However, when I read on, I found that this wouldn’t in fact be the case if the authors were to proceed with the current proposal. There are three methodological changes from the original experiment, none of which I think is necessary:
i) The authors propose to use (computer-generated) German-pronounced versions of “V”, “J”, and “X” instead of the English-pronounced versions of the same letters used by Jones and Macken (1995). This is understandable in one sense because the authors will draw on a pool of participants in Germany and want the speech to be in the participants’ first/dominant language just as it was in Jones and Macken’s (1995) study (conducted in the UK). However, on balance, I think that it would be ‘safer’ to use the same English pronunciations as used by Jones and Macken (1995) (my understanding is that these same three letter-names are pronounced quite differently in the two languages). Of course, I don’t mean the exact same sound recordings but rather that the authors use a computer-generated (native English) voice to produce the speech tokens with (broadly) the same pronunciation as those used by Jones and Macken. The danger with using different auditory tokens is that one set of auditory tokens, or combination thereof, may not stream apart as well as another set of auditory tokens. There may, for example, be higher acoustic or post-categorical transitional probabilities between certain speech sounds serving to inhibit the fission of successive tokens into separate streams (Bregman, 1990; Warren, 1999). Moreover, we know that the changing-state effect is not modulated by linguistic factors: the effect, as well as, importantly, its modulation by streaming cues, is also found with pure tones (Jones, Alford, et al., 1999, JEPLMC). Thus, on balance, I think it would be better to match the Jones and Macken experiment in terms of the letter-names used rather than match it in terms of the language of the speech being the same as the dominant language of the participants.
An alternative solution (though not my preference) would be to conduct some sort of independent empirical check on whether and how strongly and how consistently the (German-spoken) tokens stream apart, ideally involving a comparison with English-pronounced “V”, “J”, and “X”. I noticed that Jones and Macken did such a check in relation to their mono vs stereo CS conditions and so something similar could be done here re: the German tokens vs the English tokens.
ii) The authors propose to use an order reconstruction response mode in which participants are shown an array of the to-be-remembered letters and must click them in the order in which they were just presented, whereas Jones and Macken (1995) had participants write down their responses. Whilst transcribing written responses is more time-consuming, a written response mode should nevertheless be used in my view.
iii) The present authors propose to provide feedback on the participant’s performance after each trial whereas this was not the case in Jones and Macken. Again, please remove this unnecessary addition to the original experiment.
Whilst I admit that it is not obvious why any of these methodological features--either alone or in some combination--should make a difference to the outcome, if a different outcome is indeed observed, readers may well wonder whether one or more of them did indeed have an effect, which would not be satisfactory when the sole purpose of a study is to replicate a previous one.
2. P. 2 “Jones and Macken’s study, which has never been replicated by a different laboratory, to our knowledge,…”
Whilst it’s true (to the best of my knowledge too) that Jones and Macken’s (1995) streaming-by-location study of the changing-state effect hasn’t been replicated by a different laboratory, there are some highly relevant demonstrations of effects of streaming on auditory distraction during serial recall (one of them from a separate lab) that currently go unmentioned in the manuscript:
i) Jones, Alford, Bridges, Tremblay, & Macken (1999, JEPLMC, Vol. 25, No. 2, 464-473) showed that having a particularly large difference in pitch between successive tones—again the idea being to induce their perceptual segregation into two steady-state streams—attenuates the changing-state effect.
ii) Macken et al. (2003, JEPHPP, Vol. 29, No. 1, 43–51) showed the attenuating effect on the changing-state effect of streaming induced by increasing the rate of presentation (whilst keeping the content, including the pitch, of the tokens constant).
iii) Although not quite as relevant to the changing-state effect per se, Hughes and Marsh (2017, JEPHPP, Vol. 43, No. 4, 537–551) showed that the order incongruence effect—the particularly pronounced disruption to serial recall caused by sound tokens post-categorically identical to the to-be-remembered items but only when they are presented in an incongruent order with the to-be-remembered items (Hughes & Jones, 2005)—disappears when the successive sounds in an objectively order-incongruent sequence are presented from two different ‘locations’ (presented to each ear in an alternating fashion) and in two different voices so as to render it, perceptually, no longer order-incongruent.
One general point that arises from the fact that these relevant studies appear to have been missed by the current authors is that rather too much is made of the notion that Jones and Macken (1995; and Jones et al., 1999) is the only demonstration of the key phenomenon of interest. Whilst it is true that these other studies didn’t focus on spatial location cues (though Hughes and Marsh, 2017, did include spatial location cues), I don’t think the fact that Jones and Macken (1995) used location cues specifically is as important as the authors make out – the key question is whether streaming cues (whatever their nature) modulate auditory distraction effects in serial recall. The findings of some of these other studies are also relevant to some of the other concerns raised below.
3. P. 7. “In our view, these steady-state control conditions from Experiment 1c are crucial as they constitute the true “steady-state” reference and allow to tease apart the irrelevant speech effect (comparison with silence) into a changing-state effect (changing condition vs. steady condition) and a steady-state effect (steady condition vs. silence). Note that this kind of reference is needed to be able to assess whether the release from interference caused by spatial streaming (in the “V-J-X” stereo condition) reduces to a perceptual steady-state condition (or three steady-state streams, according to Jones and Macken’s reasoning), or whether some sort of changing-state effect remains (i.e., changing-state stereo being slightly more disruptive than steady-state stereo).”
And, relatedly, on p. 10 “…the results are not as clear-cut as Jones and Macken (1995a)’s interpretation suggests: While the critical “stereo” condition affording spatial streaming into three steady-state sources should completely abolish the changing-state effect (due the eliminated interference with serial-order processing)…”
I don’t agree with the authors here. Whilst it was/is ideal to include the SS conditions (for one thing, their inclusion showed the effect wasn’t driven merely by the spatial changes in the CS-stereo vs CS-mono conditions, as argued by Jones and Macken), they were/are not crucial for the theoretical conclusions of Jones and Macken (1995). As the present authors say in their previous paragraph, “the crucial prediction made by Jones and Macken (1995b) is that the changing-state mono condition should be significantly more disruptive…than the stereo condition..”. Indeed, Jones and Macken made no prediction regarding a possible (‘residual’) difference between CS-stereo and SS-stereo, no doubt because they were well aware that streaming is an unstable and not an all-or-none phenomenon. The strength of streaming could thus easily vary between trials and/or individuals. Indeed, Jones, Alford et al. (1999), using pitch as a streaming cue, did find a ‘residual’ effect in a streamed CS condition vs a ‘true’ SS condition and stated that: “This may be because the effect of having two steady-state streams is more disruptive than having one steady-state stream. *More likely, as studies of the phenomenology of streaming suggest, the process of stream separation is relatively unstable, so that the two channels are not rendered consistently*. In any case, whatever the precise level of performance in relation to the repeated-token conditions, analytically, the significant improvement in going from 5 to 10 semitones seems to suggest quite strongly that streaming processes were at work and that these modulated the disruptive effect of irrelevant sound.” (p. 471, asterisks added).
Thus, I don’t believe that a residual CS effect would speak to the veracity of the interference-by-process account. It is certainly not the case in my view, therefore, that “If the spatial separation of sound sources (via stereo) does not reduce distraction to the level of “steady-state” speech, the strict interpretation from the original study is refuted, thus challenging the interpretation given by Jones and Macken (1995).” (from Table on p. 21).
I also don’t really see why being able to decompose the disruption into a ‘changing-state effect’ and ‘steady-state effect’ is important, at least in the present context. It has already been demonstrated that a steady-state effect can be detected with enough statistical power (Bell et al., 2019); examining whether that effect is observed again here seems tangential to the main point of the proposed experiment.
In sum, I agree that the two SS conditions should be included in the proposed experiment (as they were in Jones and Macken’s Exp 1c) but I’m not at all convinced by the current authors’ justifications for why they are necessary; they are worthwhile additions but not “crucial” for the theoretical conclusions of Jones and Macken (1995) or those drawn in the related papers by Jones, Alford et al. (1999), Jones et al. (1999), and Macken et al. (2003).
4. The authors seek to pit the interference-by-process account against the attentional account (Bell, Roer et al.). However, I don’t think the attempt works. There are, for example, some obvious difficulties with the attentional account of the changing-state effect—both in general and in relation to the effects of streaming in particular—that go unmentioned.
For example, the authors reiterate the attentional account’s proposition that “the changing-state effect may arise because a sequence of changing sounds is less predictable than a steady-state sequence” (p. 4) and yet the degree to which a changing-state sequence disrupts performance is not a function of its predictability (e.g., Hughes & Marsh, 2020; Tremblay & Jones, 1998; Jones et al., 1992). Furthermore, an (unpredictable) changing-state sequence does not disrupt non-seriation based processing (e.g., Hughes & Jones, 2020).
In relation to streaming studies in particular, it is difficult to see how the attentional account can explain the fact that having more tokens per unit of time can *reduce* the degree of disruption (Macken et al., 2003): having more tokens per unit of time should surely require more attentional resources for monitoring and hence increase the amount of disruption?
In relation to the proposed experiment, the authors state that, on the one hand, the attentional account predicts greater disruption in the CS-stereo condition than the CS-mono condition because in the former condition there are changes occurring in terms of both spectral content and spatial location (see p. 11). That is, it predicts the opposite result to that found by Jones and Macken (1995). So far, so good: the accounts make opposite predictions (though it doesn’t bode well for the attentional account given Jones and Macken’s, 1995, results). However, in the next sentence, the authors state that “Only if some kind of auditory stream segregation - or, more akin to the model, divided attention - is postulated, will Jones and Macken’s (1995b) “stereo” condition boil down to three steady-state streams.” Thus, the authors seem to be suggesting here that the attentional account can also explain the result obtained by Jones and Macken (1995b). This is a problem in itself in that the attentional account is clearly too ill-specified and hence too flexible to serve as a theoretically useful counterpoint to the interference-by-process account, at least in the present context. This is especially the case given that we are told that the account would have to appeal (and in an ad-hoc way it seems) to the same mechanism (auditory stream segregation) as the interference-by-process account. I didn’t understand what the authors meant by “or, more akin to the model, divided attention”: given that the irrelevant sound effect itself is explained on the attentional account in terms of attentional resources being divided between the focal task and the sound, wouldn’t having to divide attention further ‘within’ the irrelevant sound be expected to increase the level of disruption still further, not decrease it?
The authors go on to state that “The stereophonic control condition of Jones and Macken’s Experiment 1c, where one and the same letter alternates between three locations, might be thought to require slightly more attentional resources than the monophonic steady-state condition (repetitions of the same letter at a single location), since the changing spatial position of that letter (driven by the interaural level differences) constitutes a changing sound feature.”
If this is the case, then surely it’s the first prediction above that must hold in relation to the CS-stereo vs CS-mono conditions (i.e., greater disruption in the CS-stereo condition than the CS-mono condition because in the former condition there are changes occurring in terms of both spectral content and spatial location) and not the second prediction?
Perhaps it is because clear predictions cannot be derived from the attentional account that the implications for this account of various possible outcomes are not included in the table on p. 21?
One general observation arising from main points 2-4 above, together with minor point ? below is that none of the ‘added’ justifications offered by the current authors for the replication attempt are convincing. But then such additional justifications may not be needed; a simple (faithful) replication may well be of sufficient value in and of itself.
Minor points:
1. There is some mixing up of the citations to Jones and Macken (1995a) and Jones and Macken (1995b) at various points throughout the manuscript, and there are also occasions on which the citation is ambiguous (i.e., “Jones and Macken, 1995”, especially as there was a third 1995 paper by Jones and Macken!).
2. P. 3. “While the earliest theoretical explanation as to why irrelevant speech effects occur, focused on interference-by-content between articulatory rehearsal of the to-be-remembered verbal material and the automatically encoded phonological elements of irrelevant speech in working memory (Baddeley & Hitch, 1974),…”
First, I think Salamé and Baddeley (1982) should be cited here rather than (or as well as) Baddeley and Hitch (1974) as the latter paper predates the discovery of the irrelevant speech effect. Second, the description of the theory isn’t quite right (at least as of Baddeley, 1986, onwards): An important discriminating feature of the phonological loop-based account of the ISE is that the irrelevant speech disrupts the passive phonological store and not articulatory rehearsal (e.g., Baddeley, 2007; Jones et al., 2004, JEPLMC).
3. p. 8 “This result is (1) significant in the web of empirical results determining what factors modulate memory disruption by irrelevant speech, particularly, since, among the many acoustical factors modulating the irrelevant speech effect (Ellermeier & Zimmer, 2014), it is the first to demonstrate an effect of the spatial layout of the irrelevant sound background in affecting the amount of auditory distraction”.
It’s not entirely clear what these ‘many acoustical factors’ are. The key factor is changing-state/acoustic variability, which is what Jones and Macken (1995) were examining.
4. p. 9 “Though it does so with only minimal, highly artificial means to “spatialize” the sound image, i.e. by using monotic (left, right) vs. diotic (center) headphone presentation, resulting in a lateralized or central stream in the listener’s head; no other study to date has replicated this classic study with more sophisticated means of generating spatially localizable sound sources, such as loudspeaker arrangements or headphone-based binaural techniques (Hammershøi & Moller, 2005), thus generating realistic externalized sound images rather than producing an “in-the-head” localization.”
When I read this, I feared that the present authors’ proposed experiment was going to deviate from the original (“highly artificial”) method. Happily, I saw from the Method section that it will not, but this paragraph may mislead other readers too.
5. p. 10 “Second, Jones and Macken’s (1995b) study is statistically underpowered: Even though they obtain a spatial-streaming effect (statistically significant difference between their “mono” and “stereo” conditions) in three separate experiments (1a-c), each of them is based on data from twenty participants only”
I didn’t find this argument very convincing – the fact that three separate experiments obtained the same reliable effect makes it very likely that a reliable ‘overall’ effect would have been found in a cross-experiment analysis (i.e., with an n = 60, which is similar to the n of 54 proposed for the current experiment). Moreover, the effect was replicated again (as the authors note) in Jones et al. (1999).
6. p. 10 “That might suggest that this is not just a steady-state effect (which typically is hardly measurable with so few participants, see Bell et al. 2019) but a residual changing-state effect, potentially due to the spatial switching between locations”
First, as noted in main point 3, any ‘residual changing-state effect’ could be due to imperfect/unstable streaming. Second, Jones and Macken’s (1995b) Exp 1c was specifically designed to examine whether spatial switching per se could produce some disruption and they found no support for that.
7. I don’t think the authors mention the intensity level at which the sounds will be presented, but presumably this will be 65 dB(A) as in Jones and Macken’s study?
Very minor points:
1. Abstract “…that provided evidence for the spatial separation of sound sources to reduce the detrimental effects of irrelevant speech on short-term memory”
The wording is awkward here (and it doesn’t quite work). I’d suggest: “…that showed that the separation of successive sound stimuli to distinct spatial sources reduced the detrimental effect of irrelevant speech on short-term memory”
2. The authors use the term ‘unitary attentional account’ but the term 'unitary' won't make sense to an uninitiated reader unless the contrasting 'duplex' account is also introduced. I think 'attentional account' is sufficient in any case (or ‘attentional-capture account’ would be even better as the interference-by-process mechanism can also arguably be couched in ‘attentional’ terms, e.g., Hughes & Marsh, 2017).
3. In a number of places, the authors refer to Jones and Macken’s ‘classical’ experiment. I believe the authors mean ‘classic’ (‘classical’ means traditional) as indeed they do use ‘classic’ at other points.
4. P. 5 “was that both a “mono” version of the irrelevant letter sequence was contrasted with a “stereo” version.”
Delete “both” (one cannot contrast both of two things; each is contrasted with the other).
5. p. 7 “more disruptive to the serial recall of information stored in short-term memory”
The wording is a bit unfortunate here because the Jones and Macken (1995) study formed part of what was (or what certainly became) their view that there is no such thing as ‘short-term memory’ for things to be ‘in’! I would suggest just saying ‘…more disruptive of serial recall than the stereo condition’.
6. P. 13 “If, in fact, spatial streaming reduces the changing-state effect by producing all but a steady-state effect…”
I think the authors meant to say ‘nothing but a…’ not ‘all but a…’ here.