**Comparing statistical word learning in bilinguals and monolinguals**

**Elizabeth Wonnacott** based on reviews by 2 anonymous reviewers

### The Influence of Bilingualism on Statistical Word Learning: A Registered Report

*Submission: posted 28 June 2023*

*Recommendation: posted 01 July 2024, validated 01 July 2024*

**Cite this recommendation as:**

Wonnacott, E. (2024) Comparing statistical word learning in bilinguals and monolinguals. *Peer Community in Registered Reports*. https://rr.peercommunityin.org/articles/rec?id=494

#### Recommendation

**URL to the preregistered Stage 1 protocol:** https://osf.io/8n5gh

**Level of bias control achieved:** Level 6.

*No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.*

**List of eligible PCI RR-friendly journals:**

- Advances in Cognitive Psychology
- Biolinguistics
- Collabra: Psychology
- Cortex
- Experimental Psychology (pending editorial consideration of disciplinary fit)
- Journal of Cognition
- Peer Community Journal
- PeerJ
- Psychology of Consciousness: Theory, Research, and Practice
- Royal Society Open Science
- Studia Psychologica
- Swiss Psychology Open

**Conflict of interest:**

The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

*Evaluation round* **#4**

DOI or URL of the report: **https://osf.io/t3wkd**

Version of the report: 4

#### Author's Reply, 30 Jun 2024

Dear Dr. Wonnacott,

We are pleased to resubmit our revised manuscript, “The Influence of Bilingualism on Statistical Word Learning: A Registered Report.”

As you asked, we added the values we will use, such as x, in the manuscript and the table. We removed Appendix C, and instead, we created Table 2 in the analyses part with the values in the log odds scale (beta coefficient). We also added them in the final table (Table 3), like:

We will use the effect size of the main effect of language group (log odds = 0.37226) and the interaction between language group and mapping type (log odds = 0.718339) from Poepsel and Weiss's (2016) data as x (SD for the normal distribution of H1).
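For readers more used to odds ratios, the two log-odds values quoted above can be converted by exponentiating them. The short Python sketch below does only that; the dictionary labels and function name are illustrative and do not come from the manuscript.

```python
import math

# Log-odds values quoted in the reply above (derived from Poepsel and Weiss, 2016).
# The labels are illustrative names, not terms taken from the manuscript.
LOG_ODDS = {
    "language group (main effect)": 0.37226,
    "language group x mapping type (interaction)": 0.718339,
}


def log_odds_to_odds_ratio(log_odds: float) -> float:
    """Exponentiate a logistic-regression coefficient to obtain an odds ratio."""
    return math.exp(log_odds)


ODDS_RATIOS = {name: log_odds_to_odds_ratio(b) for name, b in LOG_ODDS.items()}

for name, odds_ratio in ODDS_RATIOS.items():
    print(f"{name}: log odds = {LOG_ODDS[name]}, odds ratio = {odds_ratio:.2f}")
```

On this conversion the two values correspond to odds ratios of roughly 1.45 and 2.05.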

We want to thank you for your time and effort in reviewing our manuscript. The revisions we made will hopefully allow us to proceed to the data collection phase of the Registered Report.

Sincerely,

Matilde Simonetti, Megan Lorenz, Iring Koch, and Tanja Roembke

#### Decision by **Elizabeth Wonnacott**, *posted 30 Jun 2024, validated 30 Jun 2024*

Thank you for making these changes; it’s now clear. Just one small change: where you say you will get the values to inform the model of H1 from your pilot/previous papers, can you state the actual values you will use? They should be in log odds, to match the data summary (though you could also give the odds/odds ratios as these are a more familiar measure of effect size).

If you could add these into the text and table then I will be happy to approve this.

===

**Note from Managing Board:** To accommodate reviewer and recommender holiday schedules, PCI RR will be closed to submissions from 1st July to 1st September. During this time, reviewers will be able to submit reviews and recommenders will issue decisions, but no new or revised submissions can be made by authors. If you wish your revised manuscript to be evaluated before 1st September, **please be sure to resubmit no later than 11.00am CEST on Monday 1st July.**

*Evaluation round* **#3**

DOI or URL of the report: **https://osf.io/kxvsr**

Version of the report: 3

#### Author's Reply, 28 Jun 2024

Dear Dr. Wonnacott,

We are pleased to resubmit our revised manuscript, “The Influence of Bilingualism on Statistical Word Learning: A Registered Report.” We have thoroughly reviewed all your comments and suggestions, leading to the revision of the manuscript. We will reply to your comments point by point below.

(1) Your “interpretation” column in the big table contains some errors. For example, for Hypothesis 1:

a. you are right that a BF greater than 3 would mean that you have evidence that participants are more accurate in 1:1 than 1:2, assuming (i) that you are testing a one-tailed hypothesis, which is the case if you use the half normal, as stated in footnote 5, and (ii) that 1:1 is coded as 0.5 in the model and 1:2 as -0.5 (so that a positive effect of mapping type is in the direction of higher accuracy in the 1:1 condition)

b. but a BF lower than 1/3 does NOT indicate that participants are more accurate with 1:2 than 1:1. Assuming you conducted the test as I laid out in (a), a BF < 1/3 indicates you have evidence against the H1 that it is easier to learn 1:1 than 1:2 mappings. But note that this is consistent with either both conditions being the same OR with a reverse effect.

c. Also, a BF between 1/3 and 3 for this test does NOT indicate that they are equally accurate with the two types. Instead, it indicates that the data collected do not provide much evidence to distinguish your H1 from the null (the data are insensitive). So, you can’t say you have evidence that 1:1 mapping is learned better than 1:2 mapping, but you also can't say that you have evidence that this isn’t the case.

(I suggest looking at Zoltan’s excellent webpage https://users.sussex.ac.uk/~dienes/inference/Bayes.htm and following the links to papers and short explainer videos on that page. Note that if there are places where you want to test a hypothesis in two directions you can do that either by testing a two-tailed hypothesis (in which case evidence for the null will mean evidence that the groups don’t differ in either direction, i.e. they are the same) or by testing two one-tailed hypotheses.)

We apologise for our misunderstanding of how the Bayes Factor (BF) ranges were interpreted in the table. We changed the curve of H1 for the Bayes Factor. Instead of having a half-normal curve (as we had before), we chose to change it to a normal one. This will allow us to examine the direction of the results, and we think this is particularly important for the contrast between monolinguals and bilinguals. We adjusted the values in the table consistently with the normal curve and your explanations. In addition, we checked the table carefully for any other possible errors.
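The three-way reading of a Bayes factor discussed in comment (1) can be written down as a small decision rule. The Python sketch below assumes the conventional thresholds of 3 and 1/3 mentioned in the comment; the function name and the wording of the returned labels are illustrative only.

```python
def interpret_bf(bf: float, threshold: float = 3.0) -> str:
    """Classify a Bayes factor (H1 over H0) into one of three outcomes.

    With a one-tailed H1, "evidence for H0 over H1" means evidence against
    the directional prediction. It does not mean evidence for the reverse effect.
    """
    if bf <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf > threshold:
        return "evidence for H1 over H0"
    if bf < 1 / threshold:
        return "evidence for H0 over H1"
    return "insensitive data: no conclusion either way"


for example_bf in (4.2, 1.1, 0.2):
    print(example_bf, "->", interpret_bf(example_bf))
```

The middle category is the one that is easy to misread: a Bayes factor between 1/3 and 3 says the data do not discriminate between the hypotheses, not that the conditions are equal.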

(2) Where you explain analyses in the main paper on p 23, what is in footnote (5) should be incorporated into the text and you need to be clearer that though you run a full model, your Bayes Factor to test each hypothesis is computed by extracting a beta and SE for one specific coefficient. You will then use the calculator to test if these two statistics (which constitute the data summary) are more likely under the H1 than H0, where H0 is a point null and H1 is represented as a half normal (i.e. a one-tailed test, because you are testing a prediction in a particular direction) with a mean of 0 and sd set to a rough estimate of effect size x. You need to explicitly lay out what x is for each hypothesis you are going to test (i.e. give the predicted beta value for the fixed effect) and explain where that value came from (though NB this information will also need to be in the table at the end – see next point – so here in the text you could refer the reader to that table).

As you suggested, we incorporated the footnote explaining the Bf function in the text and added more information regarding how we will calculate BF. We created Appendix C, in which we state the value for x for every hypothesis (and from which data they are derived).

(3) In the table, the analysis plan column should not only say what model you will run but also what fixed effect coefficient you will be extracting for this specific analysis to test this specific hypothesis. So, for Hypothesis 1, that is mapping type – the other fixed effects aren’t tested (though having them in the model is appropriate to account for the variance appropriately, and it’s right that you explain that that is the structure of the model). Again, it needs to be clear that you will be extracting the SE and estimate for that coefficient, and then using these two values as the model of the data that you feed into the Bayes Factor calculator. As noted above, the other number that is critical to include in this column is the value of x you will be using as the value of the sd of the half normal. (You can also say where this value came from.)

We added information regarding the fixed effect coefficients that we will extract in the Analyses plan column, following your suggestions. For clarity, the values of x are included in Appendix C.

(4) Within the text, there is still at least one place where you use the term “significant” (bottom of page 23). Also there are places where you talk about “not finding” an effect (e.g. a difference between monolinguals and bilinguals) but if you are using Bayes Factors you need to be more precise and differentiate between the case where you end up with insensitive data (where you don’t find evidence for H1, but also not for H0- i.e. BF between 1/3 and 3) and data where there is evidence for the null (BF<1/3)

We removed all the references to “significance” and now are careful to distinguish between not having enough evidence for either hypothesis or evidence for the alternative hypothesis.

(5) Power analysis: You say in the letter you are going to do optional stopping but are doing a power analysis to see what the minimum N you need is. I don’t fully follow this. If you are doing optional stopping, I would think what is key is to establish the maximum at which point you will stop even if you haven’t found BF>6 or BF<1/6? Also, if I have understood, the frequentist analysis you plan to do will tell you the sample size you need to find a z>2 in (90%?) of runs for each of the fixed effects. This is likely (but not guaranteed) to correspond reasonably well to the sample necessary to find BF>3 for the same fixed effects if H1 is true. However, for Bayes Factors we are also interested in establishing the sample necessary to find evidence for H0 if H0 is true (i.e. if the beta for the fixed effect is truly 0). There may be packages for this, but I only know how to do it with simulation. I am happy to share simulation scripts with you if that is helpful – feel free to email. I also think what is in the supplementary materials could be a bit more fulsome – maybe include some of the output plots as well as just the script (unless I missed them on OSF, in which case maybe give clearer pointers).

We modified the participants’ paragraph: Following your suggestions, we removed the calculation of a minimum number of participants and wrote that we will stop before 200 if we reach a BF below 1/6 or above 6 in every hypothesis. In addition, we uploaded all our scripts and plots from our pilot analyses on OSF.
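The stopping rule described in this reply can be stated as a short check. The Python sketch below is one reading of it, assuming that data collection ends either when the maximum of 200 participants is reached or when the Bayes factor for every hypothesis is above 6 or below 1/6. The function name and the hypothesis labels in the example are hypothetical.

```python
from typing import Mapping


def should_stop(
    bayes_factors: Mapping[str, float],
    n_participants: int,
    max_n: int = 200,
    threshold: float = 6.0,
) -> bool:
    """Return True if data collection should end under the optional-stopping rule."""
    if not bayes_factors:
        raise ValueError("At least one Bayes factor is needed.")
    if n_participants >= max_n:
        return True  # the maximum sample size has been reached
    # Stop early only if every hypothesis has reached one of the two thresholds.
    return all(bf > threshold or bf < 1 / threshold for bf in bayes_factors.values())


# Hypothetical interim results: one hypothesis is still insensitive, so continue.
interim = {"H1": 12.4, "H2": 0.09, "H3": 2.1}
print(should_stop(interim, n_participants=120))
```

Requiring all hypotheses to reach a threshold, rather than any one of them, follows the phrase "in every hypothesis" in the reply.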

We want to thank you for the time and effort you put into reviewing our manuscript. The revisions we made will hopefully allow us to proceed to the data collection phase of the Registered Report.

Sincerely,

Matilde Simonetti, Megan Lorenz, Iring Koch, and Tanja Roembke

#### Decision by **Elizabeth Wonnacott**, *posted 14 Jun 2024, validated 14 Jun 2024*

Thank you for resubmitting your stage 1 report.

I am happy that you have addressed reviewer comments and I didn't see the need to send it back out for further peer review. However, there are some issues with the analysis plan and interpretation of statistics. I outline these below and they will need to be addressed in a revision. I plan to assess it myself, but note that if necessary I might seek additional statistical support.

Comments on analyses plan:

(1) Your “interpretation” column in the big table contains some errors. For example, for Hypothesis 1:

a. you are right that a BF greater than 3 would mean that you have evidence that participants are more accurate in 1:1 than 1:2, assuming (i) that you are testing a one-tailed hypothesis, which is the case if you use the half normal, as stated in footnote 5, and (ii) that 1:1 is coded as 0.5 in the model and 1:2 as -0.5 (so that a positive effect of mapping type is in the direction of higher accuracy in the 1:1 condition)

b. but a BF lower than 1/3 does NOT indicate that participants are more accurate with 1:2 than 1:1. Assuming you conducted the test as I laid out in (a), a BF < 1/3 indicates you have evidence against the H1 that it is easier to learn 1:1 than 1:2 mappings. But note that this is consistent with either both conditions being the same OR with a reverse effect.

c. Also, a BF between 1/3 and 3 for this test does NOT indicate that they are equally accurate with the two types. Instead, it indicates that the data collected do not provide much evidence to distinguish your H1 from the null (the data are insensitive). So, you can’t say you have evidence that 1:1 mapping is learned better than 1:2 mapping, but you also can’t say that you have evidence that this isn’t the case.

(I suggest looking at Zoltan’s excellent webpage https://users.sussex.ac.uk/~dienes/inference/Bayes.htm and following the links to papers and short explainer videos on that page. Note that if there are places where you want to test a hypothesis in two directions you can do that either by testing a two-tailed hypothesis (in which case evidence for the null will mean evidence that the groups don’t differ in either direction, i.e. they are the same) or by testing two one-tailed hypotheses.)

(2) Where you explain analyses in the main paper on p 23, what is in footnote (5) should be incorporated into the text and you need to be clearer that though you run a full model, your Bayes factor to test each hypothesis is computed by extracting a beta and SE for one specific coefficient. You will then use the calculator to test if these two statistics (which constitute the data summary) are more likely under the H1 than H0, where H0 is a point null and H1 is represented as a half normal (i.e. a one-tailed test, because you are testing a prediction in a particular direction) with a mean of 0 and sd set to a rough estimate of effect size *x*. You need to explicitly lay out what *x* is for each hypothesis you are going to test (i.e. give the predicted beta value for the fixed effect) and explain where that value came from (though NB this information will also need to be in the table at the end – see next point – so here in the text you could refer the reader to that table).

(3) In the table, the analysis plan column should not only say what model you will run but also what fixed effect coefficient you will be extracting for this specific analysis to test this specific hypothesis. So, for Hypothesis 1, that is *mapping type* – the other fixed effects aren’t tested (though having them in the model is appropriate to account for the variance appropriately, and it’s right that you explain that that is the structure of the model). Again, it needs to be clear that you will be extracting the SE and estimate for that coefficient, and then using these two values as the model of the data that you feed into the Bayes factor calculator. As noted above, the other number that is critical to include in this column is the value of *x* you will be using as the value of the sd of the half normal. (You can also say where this value came from.)

(4) Within the text, there is still at least one place where you use the term “significant” (bottom of page 23). Also there are places where you talk about “not finding” an effect (e.g. a difference between monolinguals and bilinguals) but if you are using Bayes factors you need to be more precise and differentiate between the case where you end up with insensitive data (where you don’t find evidence for H1, but also not for H0, i.e. BF between 1/3 and 3) and data where there is evidence for the null (BF<1/3).

(5) Power analysis: You say in the letter you are going to do optional stopping but are doing a power analysis to see what the minimum N you need is. I don’t fully follow this. If you are doing optional stopping, I would think what is key is to establish the maximum at which point you will stop even if you haven’t found BF>6 or BF<1/6? Also, if I have understood, the frequentist analysis you plan to do will tell you the sample size you need to find a z>2 in (90%?) of runs for each of the fixed effects. This is likely (but not guaranteed) to correspond reasonably well to the sample necessary to find BF>3 for the same fixed effects if H1 is true. However, for Bayes Factors we are also interested in establishing the sample necessary to find evidence for H0 if H0 is true (i.e. if the beta for the fixed effect is truly 0). There may be packages for this, but I only know how to do it with simulation. I am happy to share simulation scripts with you if that is helpful – feel free to email. I also think what is in the supplementary materials could be a bit more fulsome – maybe include some of the output plots as well as just the script (unless I missed them on OSF, in which case maybe give clearer pointers).
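The calculation described in comments (2) and (3) can be sketched in Python. This is a minimal sketch and not the calculator the authors will use. It assumes a normal likelihood for the extracted coefficient (its estimate and SE), a point null, and an H1 modelled as a normal or half-normal centred on 0 with SD *x*, for which the Bayes factor has a closed form. The function names and the example estimate, SE and *x* are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def bayes_factor(estimate: float, se: float, h1_sd: float, tails: int = 1) -> float:
    """Bayes factor for H1 over a point null, given a coefficient and its SE.

    tails=1: H1 is a half-normal (positive effects only) with SD h1_sd.
    tails=2: H1 is a full normal centred on 0 with SD h1_sd.
    """
    if se <= 0 or h1_sd <= 0:
        raise ValueError("se and h1_sd must be positive.")
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2.")
    # Averaging a normal likelihood over a normal prior gives a normal
    # with the two variances added.
    marginal_sd = math.sqrt(se**2 + h1_sd**2)
    likelihood_h1 = normal_pdf(estimate, 0.0, marginal_sd)
    if tails == 1:
        # Restricting the prior to positive effects doubles its density there
        # and keeps only the part of the posterior mass above zero.
        likelihood_h1 *= 2 * normal_cdf(estimate * h1_sd / (se * marginal_sd))
    likelihood_h0 = normal_pdf(estimate, 0.0, se)
    return likelihood_h1 / likelihood_h0


# Hypothetical data summary: coefficient estimate and SE from the mixed model.
print(bayes_factor(estimate=0.5, se=0.2, h1_sd=0.4, tails=1))
print(bayes_factor(estimate=0.5, se=0.2, h1_sd=0.4, tails=2))
```

With a positive estimate the one-tailed version returns a larger Bayes factor than the two-tailed version, and with a negative estimate a smaller one, which is why the coding direction of the predictor matters for the one-tailed test.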

*Evaluation round* **#2**

DOI or URL of the report: **https://osf.io/8j7fe**

Version of the report: 2

#### Author's Reply, 24 May 2024

#### Decision by **Elizabeth Wonnacott**, *posted 11 Jan 2024, validated 11 Jan 2024*

Thank you for resubmitting this Stage 1 RR and for the considerable work you have put into it. It has been reviewed by the same two expert reviewers as before. One of the reviewers is happy with the document, barring some minor points. The other reviewer continues to have some concerns. One concerns the disconnect between the fact that questions about bilingual balance are now exploratory, and the framing in the introduction. I agree with this. For example, the Current Study section (p15) talks about looking at effects of extent of balancedness in the list of “critical components” of the experiment. It’s OK to mention it, but it should be made clear that this is an exploratory element, mentioned after the factors related to the key hypotheses. I also got quite confused where you talk about the “Hypotheses 1-4” (p 7) as these don’t seem to match hypotheses later in the paper.

My other concern regards the Bayes Factors. First, can you clarify that you are moving to use Bayes Factors throughout as your statistic for inference, rather than significance? There are places where the word significance is used (e.g. p 28) and these will need to be changed. If you are planning to mix analyses that needs to be very explicit and well justified (it generally isn’t recommended; see Dienes, Z. (2023, August 14). Use one system for all results to avoid contradiction: Advice for using significance tests, equivalence tests, and Bayes factors. https://doi.org/10.31234/osf.io/2dtwv). Second, while I applaud the move to these analyses, they are currently underspecified. The author information for PCI RR says:

*“For studies involving analyses with Bayes factors, the predictions of the theory must be specified so that a Bayes factor can be calculated. Authors should indicate what distribution will be used to represent the predictions of the theory and how its parameters will be specified. For example, will you use a uniform distribution up to some specified maximum, or a normal/half-normal distribution to represent a likely effect size, or a JZS/Cauchy distribution with a specified scaling constant?”*

You say you will follow a similar approach to Silvey et al. In that paper, we represented H1 as a half normal distribution with a mean of zero and SD set to *x*, where *x* is a rough estimate of predicted effect size. That may be reasonable here too (for one-sided predictions assuming small effects are more likely) but you need to explicitly say what distribution you are using and why. Most critically, for each hypothesis you need to specify the parameter *x*, i.e. what size of effect you predict. As for where to get these values: I think for this study you have values from pilot data / a previous study. If not, Silvey et al. suggest some possible ways to estimate main effects and interactions which may (or may not) be appropriate and there are more suggestions in these papers by Dienes:

Dienes, Z. (2021). How to use and report Bayesian hypothesis tests. Psychology of Consciousness: Theory, Research, and Practice, 8(1), 9. https://osf.io/preprints/psyarxiv/bua5n

Dienes, Z. (2021). Obtaining evidence for no effect. Collabra: Psychology, 7(1), 28202. https://osf.io/preprints/psyarxiv/yc7s5

You also need to explain more clearly how you will use the information from the mixed models to get the data summary which is fed to the calculator etc. I know that information is in Silvey et al, but you can’t assume the reader will look at that and it isn't a "standard approach".

A related point is that you talk about doing “power analysis”, but that implies that what you are going to be interpreting is the p value. If instead you are going to compute Bayes factors you need to do analyses that BOTH (1) show that if H1 is true, you have a reasonable chance of finding your threshold BF (e.g. BF>3) AND (2) show that if H0 is true you have a reasonable chance of finding BF<1/3 (which btw is generally harder). I think that you used simulation to do your power analyses, in which case you should be able to adapt these to do the Bayesian version. (For looking at (1), you can run the same simulation but see how often you get BF>3 (or whatever threshold you are choosing) rather than how often you get p<.05; for looking at (2) you will have to run versions of the model where the relevant coefficient in the mixed model is set to 0, and then you can see how often you find a BF below your threshold of 1/3.) I would be happy to share some scripts with you where I did something similar if that is helpful (feel free to email me). I would also request that you share your own scripts for power analyses on OSF as it will make it easier for me, and future readers, to evaluate the approach you are taking.
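The two checks described here can be illustrated with a deliberately simplified simulation in Python. This sketch does not simulate trial-level data or refit a mixed model, which a real analysis would need to do. Instead it assumes the coefficient estimate is drawn from a normal sampling distribution with a fixed standard error, and it models H1 as a half-normal with SD *x*. All numerical values (the SE, *x*, the number of simulations) are hypothetical placeholders.

```python
import math
import random


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def normal_cdf(z: float) -> float:
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def half_normal_bf(estimate: float, se: float, h1_sd: float) -> float:
    """Bayes factor for a half-normal H1 (SD h1_sd) against a point null."""
    marginal_sd = math.sqrt(se**2 + h1_sd**2)
    likelihood_h1 = 2 * normal_pdf(estimate, 0.0, marginal_sd) * normal_cdf(
        estimate * h1_sd / (se * marginal_sd)
    )
    return likelihood_h1 / normal_pdf(estimate, 0.0, se)


def simulate_bf_rates(true_effect, se, h1_sd, n_sims=2000, seed=1):
    """Proportion of simulated studies with BF > 3, BF < 1/3, or in between."""
    rng = random.Random(seed)
    counts = {"bf_gt_3": 0, "bf_lt_1_3": 0, "insensitive": 0}
    for _ in range(n_sims):
        # Simplification: draw the coefficient estimate directly rather than
        # simulating data and fitting the model.
        estimate = rng.gauss(true_effect, se)
        bf = half_normal_bf(estimate, se, h1_sd)
        if bf > 3:
            counts["bf_gt_3"] += 1
        elif bf < 1 / 3:
            counts["bf_lt_1_3"] += 1
        else:
            counts["insensitive"] += 1
    return {key: value / n_sims for key, value in counts.items()}


H1_SD = 0.4        # hypothetical predicted effect size x (log odds)
ASSUMED_SE = 0.12  # hypothetical SE of the coefficient at the planned N

rates_if_h1_true = simulate_bf_rates(true_effect=H1_SD, se=ASSUMED_SE, h1_sd=H1_SD)
rates_if_h0_true = simulate_bf_rates(true_effect=0.0, se=ASSUMED_SE, h1_sd=H1_SD)
print("If H1 is true:", rates_if_h1_true)
print("If H0 is true:", rates_if_h0_true)
```

Repeating the run with the SE expected at different sample sizes gives a rough picture of how often each threshold would be reached under each hypothesis.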

In a revision, please make sure to address these and the reviewers’ points (or justify why you choose not to do so).

#### Reviewed by anonymous reviewer 1, 14 Dec 2023

See attached, I believe I uploaded correctly but please let me know if it did not work (I might have done it twice).

#### Reviewed by anonymous reviewer 2, 09 Jan 2024

The revised ms. addresses the issues raised well and thoroughly, and the implemented changes further strengthen the design of the study and the overall ms.

I only have a very minor suggestion that the BFs should be included in the final section when considering conclusions/implications.

Good luck with data collection!

typos:

p. 10 ‘In another recent study (Aguasvivas et al.)...’ – verb missing, and Basque misspelled

p. 13 ..Singaporean balanced bilinguals adults - balanced bilingual adults

p. 16 first line: with ‘A word and three objects’

p. 18 ‘The bilingual group will be proficient’ (remove ‘speak’)

*Evaluation round* **#1**

DOI or URL of the report: **https://doi.org/10.31219/osf.io/drszy**

Version of the report: 1

#### Author's Reply, 30 Nov 2023

#### Decision by **Elizabeth Wonnacott**, *posted 11 Sep 2023, validated 11 Sep 2023*

First, I would like to apologize for the unusual length of time in getting this decision to you. I had difficulty finding reviewers, and then once reviewers were found there were some further unavoidable delays.

The reviewers are generally positive about this paper, both in terms of the topic and the design of the study. I agree- this research has a lot of potential.

One point made by both reviewers is that you need to more clearly motivate your hypotheses. I agree that you can do more to explicitly spell out the links to theory, particularly for the predicted differences between different type of bilinguals.

The reviewers make several other good points, and I would like to invite you to address these in a revision. I will pick up on a few, mostly concerning analyses.

**Sample plan**

In the text you tell us that you planned your sample “based on the effect size reported by Poepsel and Weiss (2016) and Escudero et al., (2016)”. As the reviewers point out, this is underspecified. Is this an effect size for the main effect of language group, or for the language group by trial type interaction? What actually is this effect size? What analysis did you do with this value?

In addition, you currently do not have separate justifications of the sample required for each of your hypotheses, but this is a requirement for being accepted as a Stage 1 RR. That is: every row in the big table at the end has to have its own sample size justification. (If there are places where you really aren’t able to do that, you can still do the analysis as an exploratory analysis, but they won’t be part of your pre-registered analysis plan and this means they can’t feature in the abstract and if you talk about them in the discussion they can’t be the main focus.)

I think the only way you are going to be able to get all of these values is through simulation. If I have understood correctly, you currently only use simulation-based power analysis when looking at the effect of trial type. The reason was that you were using your pilot data, and you couldn’t use this method for the hypotheses where the relevant main effect/interaction was not significant. I see the difficulty here; however, it should be possible for you to simulate mixed models using a combination of some parameters extracted from models run over your pilot data (e.g. for the random effects and for the effects of trial type and block) but replacing the estimates for the main effect of language group and for the language group by trial type interaction with relevant values from Poepsel and Weiss (2016), Escudero et al. (2016) etc. (You could either use beta estimates directly from equivalent glmer models, or if those aren’t reported you could compute odds ratios, and from these log odds, based on the relevant differences between means. If there are places where there really is no previous value in the literature you can use for a fixed effect, as a last resort you could consider estimating using rule-of-thumb estimates for odds ratios as indices of effect sizes (see e.g. https://easystats.github.io/effectsize/articles/interpret.html#odds-ratio-or).) Similarly, for the analyses looking at trial-by-trial data, you can either use relevant estimates from the pilot or values extracted from models from previous datasets (or some combination).
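The step from reported group means to a log-odds value can be sketched in Python as follows. This is an illustration only: it treats each mean accuracy as a single proportion, which ignores the random effects a glmer model would estimate, and the two accuracies in the example are hypothetical, not values from any of the cited studies.

```python
import math


def odds(proportion: float) -> float:
    """Convert a proportion correct into odds."""
    if not 0 < proportion < 1:
        raise ValueError("The proportion must lie strictly between 0 and 1.")
    return proportion / (1 - proportion)


def log_odds_ratio(p_group_a: float, p_group_b: float) -> float:
    """Log of the odds ratio for group A relative to group B."""
    return math.log(odds(p_group_a) / odds(p_group_b))


# Hypothetical mean accuracies for two groups.
accuracy_a = 0.75
accuracy_b = 0.60
print(f"odds ratio = {odds(accuracy_a) / odds(accuracy_b):.2f}")
print(f"log odds ratio = {log_odds_ratio(accuracy_a, accuracy_b):.3f}")
```

A value obtained this way could then stand in for the corresponding fixed-effect estimate when simulating data, on the understanding that it is a rough approximation.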

(NB- I am not sure whether you did the power simulations using a package or if you wrote your own code. If the former and you would like help with code for simulating data for mixed effect models (i.e. to make it a bit more flexible) I would be happy to share some scripts with you. (They might not do exactly what you need but might save you a bit of time in getting started). Feel free to reach out by email if that would be helpful.)

**Assessing evidence for H0**

One of the reviewers asks you to consider including a measure to assess the strength of evidence for H0/H1. I agree with this and would strongly encourage you to consider using either Bayes factors or equivalence tests (Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving inferences about null effects with Bayes factors and equivalence tests. The Journals of Gerontology: Series B, 75(1), 45-57.)

This isn’t a requirement for stage 1 acceptance, however if you don’t have any method to assess evidence for the null, then you cannot make statements such as the following:

“Thus, if we do not observe a performance difference between language groups (either as an interaction with mapping type or as a main effect), we will conclude that bilinguals are not better at learning words statistically than monolinguals.”

A p value greater than .05 doesn’t provide any evidence against an effect, and so you can’t draw this conclusion (Dienes has written many papers on this; see his website http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/Bayes.htm for references).

If you did decide to use Bayes Factors and want to combine this with logistic mixed effect models, you might also find this preprint https://psyarxiv.com/m4hju (on which I am an author) useful for some practical demonstrations of how you can combine linear/logistic mixed effect models run using the lme4 package (lmer/glmer) with Bayes Factors. An alternative (and more commonly used) approach is to use the brms package and use model comparison (this is also briefly discussed at the end of our preprint). Both approaches require you to have an estimate of your predicted effect sizes to inform the model of H1. However, you could use the same estimates you use for the fixed effects in power simulations. If you instead compute equivalence tests you will need to come up with ways of estimating the smallest effect sizes of interest for each hypothesis.

**Other smaller points**

- Random structures: is the mapping of labels and words randomized across participants? If not, I think you might need a nested random effects structure for these? (If it is randomized and I missed this, maybe remind the reader of this when you talk about the random effects.)

- Target count (H4) as a fixed effect: you say this will be log transformed; will it also be centered around 0 as you do for your other fixed effects?

- “we will exclude any participants with a linear slope of the accuracy of −1 SD deviation from average (slope will be calculated across bins of 36 trials; c.f., Roembke et al., 2018).” This needs justification beyond a reference to another paper.

- I agree with the reviewer that it is odd to use terminology of “balanced” and “unbalanced” bilinguals where your criteria seem to be more about onset of acquisition than about proficiency.

#### Reviewed by anonymous reviewer 1, 21 Jul 2023

This registered report promises to provide more data on a currently understudied question that to date has conflicting patterns of findings: whether bilingualism influences cross-situational statistical learning and whether it does so specifically for learning multiple word-to-world mappings. The study is well designed, though a few deviations from more traditional CSSL tasks could be further explained, and the analyses mostly make sense. I have some comments below intended to strengthen the theoretical motivation (particularly for Hypothesis 3) and clarifying the analyses.

My main comment concerns the question of whether balanced bilinguals would differ from unbalanced bilinguals, which seems to be a main question of interest for this RR, but is not at all motivated in the manuscript. What is the reason they would differ? Further, the definition of a balanced bilingual is not what I would expect – I expected a balanced bilingual to be one who is reasonably equally proficient in both languages. The authors seem to define balance based on when exposure to both languages occurred, but use a very early cutoff for that. For example, the authors state “balanced bilinguals learned both languages simultaneously, whilst unbalanced bilinguals learned English as their mother tongue and German later in life”. Is this a criterion that will be imposed on participant groups (e.g. will participants who do not meet these criteria be excluded before/after data collection), or are the authors expecting this to be true? I can imagine many circumstances where simultaneously learning two languages could result in unbalanced bilingualism and vice versa. Further, the definition for unbalanced says “they will only be included if their LexTALE score exceeds 70% for English and 50% for German” but this means that they could both be equal and exceed 70%, so then they would be balanced in proficiency but be sequential bilinguals. I would encourage the authors to motivate this question further, and then to clarify the definition of each group – if it’s about balance in proficiency then is it fair to impose an Age of Acquisition cutoff? Maybe renaming the groups would also be helpful.

Comments related to methods and analyses:

The power analysis doesn’t really seem like a power analysis. What does it mean that “based on the effect size reported by Poepsel and Weiss (2016) and Escudero et al., (2016), we concluded that a sample size of 50 should be sufficient to detect the effect of the language group if it exists”? What are the effect sizes reported in those studies, and what power does a sample size of 50 give you to detect an effect of that size?
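To illustrate the kind of information being requested, here is a minimal sketch of how power could be reported for a simple two-group comparison. It assumes a two-sided test, a normal approximation, and 50 participants per group; the standardized effect sizes in the loop are hypothetical placeholders, not the values reported by Poepsel and Weiss (2016) or Escudero et al. (2016). The planned analysis uses binary accuracy data, so a simulation-based power analysis of that model would likely be more appropriate than this approximation.

```python
from statistics import NormalDist


def two_group_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d            -- assumed standardized mean difference (Cohen's d)
    n_per_group  -- participants in each group (equal group sizes assumed)
    alpha        -- two-sided significance level
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical effect sizes, chosen only to show the shape of the report.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: approximate power = {two_group_power(d, 50):.2f}")
```

A statement of the form "with n participants per group, the design has power P to detect an effect of size d, taken from study X" would address the comment directly.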

Why is only one word presented on each trial? I appreciate using orthographic instances so as to not cue a specific language, but in the Escudero et al., 2023 study I believe they presented the same number of words as pictures on the screen. The proposed methods are a deviation from more traditional statistical learning paradigms, and change the statistics that can be computed on any given trial. This seems like it could have meaningful differences in outcome. I can see how showing three pictures and providing words on different consecutive trials would sequentially narrow the choices participants could make (e.g. if they saw objects A, B, and C, and heard nonword1 and clicked on A, then the next time they hear nonword2 they would be most likely to say it mapped to object B or C, etc), but could all three words be displayed at the bottom and dragged to the image they think it labels? If not, this is another limitation of the comparison of this study to previous ones if there end up being differences in findings.

Main analyses – it doesn’t seem like the main analysis accounts for repeated measures on each item, right? Yes, there is a block main effect, but if I understood the methods correctly each word will be the target in multiple trials within a single block. So yes, accuracy should increase across blocks, but accuracy on the first trial is based entirely on chance, whereas accuracy on later trials can reflect statistical learning. I understand that the trial-by-trial analyses will account for some of this, but it seems like maybe the main analysis needs to account for trial as well? Or calculate accuracy on the last instance in the block for each word? Or an average across the block?
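The two alternative summaries mentioned above (accuracy on the last instance of each word in a block, and average accuracy across the block) can be sketched as follows. The record layout, participant and word labels, and the `summarise` helper are all hypothetical and are not taken from the manuscript; the point is only to show that the two summaries can diverge for the same trials.

```python
from collections import defaultdict

# Hypothetical trial-level records:
# (participant, block, word, trial_within_block, correct)
trials = [
    ("p1", 1, "word_a", 1, 0),
    ("p1", 1, "word_a", 2, 1),
    ("p1", 1, "word_a", 3, 1),
    ("p1", 1, "word_b", 1, 0),
    ("p1", 1, "word_b", 2, 0),
    ("p1", 1, "word_b", 3, 1),
]


def summarise(records):
    """Return last-instance and mean accuracy per (participant, block, word)."""
    grouped = defaultdict(list)
    for participant, block, word, trial, correct in records:
        grouped[(participant, block, word)].append((trial, correct))

    summary = {}
    for key, observations in grouped.items():
        observations.sort()  # order by trial position within the block
        outcomes = [correct for _, correct in observations]
        summary[key] = {
            "last": outcomes[-1],                  # final instance only
            "mean": sum(outcomes) / len(outcomes),  # average over the block
        }
    return summary


summary = summarise(trials)
for key, values in summary.items():
    print(key, values)
```

Here both words end the block with a correct response, so the last-instance summary treats them identically, while the block average separates them. Which summary is appropriate depends on whether early, chance-level trials should count toward the outcome.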

Minor comments:

- Pg 4 – “Fourth, the bilingual cognitive control hypothesis proposes that bilinguals’ advantage on monolinguals is due to bilinguals’ better cognitive control on executive function tasks. Another possible explanation is that bilinguals have a higher executive control than monolinguals…” Aren’t the 4th and the “another possible explanation” essentially the same?

- Pg 7 – description of the second/testing phase in a lot of these tasks – “participants’ memory can be tested through different methods”. Is memory the key outcome here, or learning? In order to be tested on anything, even at an immediate time point, there is some memory involved, but that’s different from testing memory separate from learning.

- Pg 11 – “whilst the interaction between languages and mappings was not significant” – maybe edit “languages” to “language background”.

- Pg 12 – “opposite effects” are referred to a lot, but this is hard to keep track of because what the opposite is referring to keeps changing. Maybe “mixed results” would be clearer?

  - Also, the Crespo 2023 study finding is a little more complicated than is alluded to here – maybe they are less ‘hurt’ by overlapping cues? It’s not really that they found a bilingual advantage in 1:1 mappings overall, just under this one specific condition.

- What are typical ranges of scores for LexTALE? Is 70% not too low of a cutoff for native monolingual English speakers?

#### Reviewed by anonymous reviewer 2, 01 Sep 2023

The ms. proposes an experimental study to examine possible differences in word learning between bilingual and monolingual participants using a cross-situational statistical learning (CSSL) paradigm. The study is well situated in the existing literature on bilingual word learning, the main research questions are articulated clearly, and they are followed by specific and detailed hypotheses. The experimental protocol provides sufficient detail to enable replication, and the proposed statistical analyses are mostly well justified. The authors are encouraged to consider including a measure to assess the strength of evidence for H0/H1.
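As one illustration of such a measure, a Bayes factor can be approximated from the BIC values of two nested models (one without and one with the effect of interest). This is a minimal sketch of that general approximation, not the authors' planned method; the function name and the BIC values are hypothetical.

```python
import math


def bf01_from_bic(bic_null, bic_alt):
    """BIC approximation to the Bayes factor in favour of H0.

    BF01 is approximately exp((BIC_alt - BIC_null) / 2): values above 1
    favour the null model, values below 1 favour the alternative.
    """
    return math.exp((bic_alt - bic_null) / 2)


# Hypothetical BIC values for a model without (null) and with (alt)
# the language-group effect.
bf01 = bf01_from_bic(bic_null=1000.0, bic_alt=1004.0)
print(f"BF01 = {bf01:.2f}, BF10 = {1 / bf01:.2f}")
```

Other approaches, such as Bayes factors computed with an explicitly specified prior on the effect, would serve the same purpose and may suit the design better.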

One area where the ms. can be strengthened is in being more explicit about the links between the hypotheses and theory. For instance, why will the differences between 1:1 and 1:2 mappings increase over the course of training/exposure, or why would the differences between 1:1 and 1:2 mappings be more pronounced in balanced than unbalanced bilinguals? It would be helpful to be more explicit about the possible processes that may be underpinning these differences. Additionally, it would be helpful for the authors to elaborate on the links between the specific measures used and the specific cognitive processes they are thought to index (e.g., last trial accuracy).

Additional suggestions to consider are provided below.

p. 3 bottom:

It'd be helpful to provide a brief operational definition of balanced vs. unbalanced bilinguals.

p. 6 towards the bottom

For hypothesis 4, please clarify the distinction between 'better cognitive control' and 'higher executive control'.

p. 7

It'd be helpful to state somewhere here explicitly which of the hypotheses listed on p. 6 the current study focuses on (i.e. the current study focuses on mutual exclusivity, and it won't be able to distinguish between the other hypotheses).

p. 9 middle

It'd be helpful to provide a little bit more information here on the cognitive mechanisms that these measures are thought to index.

p. 9 bottom

It'd be helpful to provide broader context for the paradigm here -- why would CSSL be a useful paradigm to shed light on bilingual word learning? It would also be useful to consider the extent to which the specific paradigm, which includes an explicit selection of the target object in every trial, might influence the cognitive processes that underpin the learning (e.g., enhance the contribution of explicit processes, including specific metacognitive strategies), and if so the implications for models of word learning.

p. 15

There is an inconsistency between Table 1 and the text about L1/L2 for unbalanced bilinguals.

p. 16 bottom

Please clarify whether the choice of sample size (50 ppts per group) was based on a power analysis using the effect size reported by Poepsel & Weiss (2016).

pp. 17-18

* It'd be useful to provide the English/German-like ratings for the stimuli somewhere (Appendix, OSF).

* Please provide a rationale for the choice of objects, and clarify whether they were all artifacts, and how they were sourced.

* Larger images/figures would be really helpful.

p. 21 bottom

It'd be useful to be more explicit about what type of strategies will be taken as evidence of 'cheating'.

p. 24

It'd be helpful to state explicitly the possible reasons for the overall advantage of the 1:1 mapping -- this could be due to the inherent difficulty of learning 1:2 vs. 1:1 mappings, but it could also be due to the frequency of exposure (if I understood the design correctly, the 1:1 mappings will be presented twice as often as the 1:2 mappings).

p. 29

For H2, the authors may want to consider the possibility of no significant two-way interaction between mapping type and language group, but a significant three-way interaction between mapping type, language group, and block that might indicate that the bilingual advantage only emerges with increased levels of training.

p. 37

For H6, it would be useful to consider whether the interaction may only be expected to emerge after some learning has occurred, i.e. in later blocks.