Can discriminative learning theory explain productive generalisation in language?
Go above and beyond: Does input variability affect children’s ability to learn spatial adpositions in a novel language?
Recommendation: posted 29 June 2022, validated 29 June 2022
URL to the preregistered Stage 1 protocol: https://osf.io/37dxr
List of eligible PCI RR-friendly journals:
- Advances in Cognitive Psychology
- Cambridge Educational Research e-Journal
- Experimental Psychology
- Infant and Child Development
- Journal of Cognition
- Peer Community Journal
- Royal Society Open Science
- Swiss Psychology Open
Chris Chambers (2022) Can discriminative learning theory explain productive generalisation in language? Peer Community in Registered Reports. https://rr.peercommunityin.org/PCIRegisteredReports/articles/rec?id=147
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
Evaluation round #3
DOI or URL of the report: https://osf.io/tde6b/
Version of the report: v3
Author's Reply, 22 Jun 2022
Amsterdam, 22nd June 2022
Dear Prof. Chambers,
As promised, please find attached our new version of the Stage 1 manuscript, which now includes data from 10 children (8 after applying the exclusion criteria) in the Pilot 2 section (pages 72-77). We have also updated the OSF repository with the new data and R analysis scripts.
Admittedly, this took longer than we had hoped. This is partly due to me (Eva) starting a new job as a Research Software Engineer in Amsterdam, but also due to the challenge of recruiting families online: even though we thank families for their participation with book vouchers and Amazon gift cards, this does not seem to be enough to collect data quickly. This has led us to set up an official website for the experiment (https://languagelearningchildren.web.ox.ac.uk/home) to recruit families of 7–8-year-olds online, and to seek the involvement of head teachers at public schools in the surroundings of Oxford, so that we can collect data from multiple children at once. Note that even where children are tested in groups, they will continue to do the training and testing individually, on a computer wearing headphones, so our core experimental procedure is unchanged.
We have just had news that we can collect data from 15 children from one of the schools we have contacted, starting on June 28th. For this reason, we are sending you the data from the 10 children recruited so far straight away, even though only 8 remain after applying the exclusion criteria, and we hope that you will be able to give us the green light to start data collection for the main study.
While the data are of only 8 children, you will see that they are strikingly clear for the purpose of Pilot 2: the paradigm that the reviewers suggested is promoting learning i.e., children are learning, as attested by the higher than chance performance at the group level in training, and numerically they improve from sessions 1 to 2, and this performance is maintained in testing, whereby children are exposed to sentences that they see for the first time (figure 17, page 75).
We were also able to report (i) that performance in the noun practice tasks was at ceiling (as in Pilot 1), dispelling doubts about children’s difficulties in recognising English cognates pronounced by a Japanese speaker (figure 16, page 73); (ii) that performance in testing is directly related to performance in training, as attested by the positive trend between the by-subject performances in training and in testing reported in fig. 18 (page 77) (as expected in this high variability condition given the Pilot 1 results); and (iii) that moves of the legal types (A, B, C and D), where children move the correct pictures to appropriate places on the grid, make up the majority of the moves, as attested by the small proportion of moves of any other type (less than 10% in training and less than 3% in testing). In sum, the data look very similar to those of the equivalent condition in Pilot 1.
As Pilot 2 was a proof of concept, we hope that you will agree with us that 8 children trained in the most difficult learning condition are sufficient to establish that the paradigm is sound, and that the results we showed you support our conclusion that we can continue with data collection in schools.
As a side note, we wanted to ask your advice as to whether we can include the children collected in pilot 2 within the final sample. We acknowledge that including pilot data in a main study isn’t typical. However, we are making no changes to the program and, critically, the analyses we have done on this data don’t pertain to our hypotheses about difference between conditions (as only one condition is included). This therefore doesn’t feel like “double dipping”. However, we are happy to take your advice on this.
Eva (on behalf of all the authors)
Decision by Chris Chambers, posted 06 Apr 2022
Thank you for submitting your revised manuscript and response. I've now had a chance to review the changes and I'm happy to approve the next stage of proceeding with Pilot 2. Once that is complete, please submit a final Stage 1 manuscript. If all goes smoothly with the pilot, IPA should then be forthcoming without requiring further in-depth Stage 1 review.
Best of luck and we look forward to hearing from you soon!
Evaluation round #2
DOI or URL of the report: https://osf.io/359kt/
Version of the report: v2
Author's Reply, 06 Apr 2022
Decision by Chris Chambers, posted 30 Mar 2022
All three reviewers kindly returned to evaluate the revised manuscript. The assessments continue to be very positive and the reviewers are broadly very satisfied with your revised design. There are however some remaining queries to address concerning specifics of the design and analysis.
Please address these remaining points in a further revision and response. Provided the revision is comprehensive, I will assess this at desk (without further in-depth review), and if, as seems likely, your response is sufficient then instead of issuing in-principle acceptance (IPA) I will follow the approach we discussed and issue a further Stage 1 interim decision to proceed with Pilot 2. Once Pilot 2 is complete, you would then resubmit the Stage 1 manuscript with the pilot results included and if all goes smoothly we should be in a position to then issue IPA without further in-depth Stage 1 review.
If you have any questions as you progress don't hesitate to contact me (at firstname.lastname@example.org or email@example.com)
Reviewed by Natalia Kartushina, 18 Mar 2022
I would like to thank the authors for providing such a thorough and clear rebuttal letter. They have satisfactorily addressed all my comments and questions and, I believe, those of my colleagues. I was pleased to see a schematic illustration of the stimuli configuration and to have an opportunity to try out a new version of the game. I would like to mention that it is an honor and a privilege to be able to contribute to this absolutely high-standard piece of work! I have only one remaining concern that I wanted to bring to the authors’ attention for reflection, without necessarily requiring any changes.
I appreciated that the authors explained why they added the distractor objects in the design; yet, it seems to me that children in the LV group would quickly realize the lack of variability in stimuli (and what the target objects are) and would ignore the distractors, which might not (fully) serve the purpose of having them, i.e., mitigating against associating images with specific moves without listening to the sentences for the LV group. For the HV group, on the other hand, having the distractor objects can make the task more difficult and decrease their chance level, as they might consider the distractor objects to a larger extent than the LV group. This brings us back to the original concern with respect to the chance level and the trial removal strategy (see below). This does not have major implications for the analyses, but it may need to be thought through thoroughly before data collection, as it can affect the chance level, which would then be lower than 25%. Maybe the authors could comment on this?
I am afraid I have doubts about the exclusion of the distractor-related moves from the analyses. The authors indicate that these moves mean that children haven’t identified the two nouns involved in the sentence; however, given that the pilot data demonstrate that children learnt the words in the noun practice task, and that the authors will exclude participants who score below 80% in the noun practice task, we would expect only a few such moves to occur (although see the above concern), so why not keep these moves in the analyses (as they also contribute to decreasing the chance level)? Relatedly, the authors indicate that they will measure verbal short-term memory and will include this in the analyses “as a covariate which is expected to explain variance in learning over the training sessions”; wouldn’t it then benefit the analyses not to exclude from the model trials where children made errors (i.e., with the distractors), as they are likely related to differences in verbal short-term memory (among others)?
The inclusion of the analysis “Recognition of nouns in continuous speech—Do children correctly identify the nouns in the sentence?” is a very nice way to address (potential) differences in the proportion of distractor-related errors between conditions. I wonder whether verbal short-term memory should be included in the model as a covariate. In this analysis, the authors set the chance level to 1/18; yet, I wonder whether this should be changed (increased) given that four (A, B, C, D) out of 18 moves are counted as accurate.
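The chance-level question in the preceding paragraph comes down to counting. As a minimal sketch, assuming that a child responding at random picks uniformly among the 18 possible moves (the uniformity assumption and the variable names are illustrative and are not taken from the manuscript):

```python
from fractions import Fraction

# Assumption: a random responder picks uniformly among all possible moves.
TOTAL_MOVES = 18      # number of possible moves, as stated in the review
ACCURATE_MOVES = 4    # moves A, B, C and D are all counted as accurate

# Chance level if exactly one move counted as accurate.
chance_single_move = Fraction(1, TOTAL_MOVES)
# Chance level if any of the four accurate moves counts.
chance_any_accurate = Fraction(ACCURATE_MOVES, TOTAL_MOVES)

print("one accurate move:  ", chance_single_move, float(chance_single_move))
print("four accurate moves:", chance_any_accurate, float(chance_any_accurate))
```

If children do not choose uniformly among moves, neither value need apply, which is part of what the question to the authors is probing.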
Line 17: remove “an” (abstract relationships)
Line 996: set movable => set of movable
Figure 6 caption: moving the banana to square 6 => moving the banana to square 4
Why was the memory span task run twice?
Line 1275: the moves B, C, D are repeated twice
Line 1844: “the majority of children (i.e.,>5)”, do you mean that more than 5 children would be a majority?
Best of luck with Pilot 2! and again congratulations on such an exemplary registered report!
Reviewed by Caroline Rowland, 24 Mar 2022
I want to thank the authors for paying such close attention to my comments; this makes the process of reviewing feel worthwhile! The authors have responded well and I have no hesitation in recommending acceptance. I agree that it will be interesting to include the working memory task. I also thank them for providing a more detailed explanation of how moves will be removed from analysis and for the interesting new analyses on word order and on differences earlier in learning. I am looking forward to seeing the results.
The comments below are merely for consideration when writing the stage 2 report, after data are collected and analysed.
1. Regarding noise and the choice of priors, the authors have provided sound justification for their decision. If they do find null results, however, I would urge them to discuss the choice of prior as a possible reason for this in the discussion and, if possible, dismiss it (by for example, providing the same type of comparison of SDs that they provide in their reply)
2. Footnote page 10: “From a theoretical perspective, although transfer across related constructions does happen in some cases (Abbot-Smith & Behrens 2001) we do not believe there is any good reason to expect strong transfer to new adpositions in this paradigm”. I still didn't quite follow why - from a theoretical perspective - the authors do not expect strong transfer, as a clear explanation wasn't given. But this could be something to address further in the discussion - under what circumstances the authors would, and would not, expect transfer.
Reviewed by Julien Mayor, 10 Mar 2022
The authors have been extremely responsive in addressing comments raised in the first round of reviews.
I only have two minor comments:
- The new design is very nice in addressing the issue of word order. I just wonder whether it will be clear to participants that they should take an object from the bottom of the screen and move it into the grid (apologies if I have missed it). As the object to be moved is also present in the grid itself, my first attempt was to move that object instead. I guess that, if participants make an erroneous move, they will be shown the correct move, and hence they will ultimately understand what they are required to do - but that would mean that during the first move (or couple of moves) participants may still be learning the rules of the game, rather than cracking the sentences' meanings (and if that's so, a few practice trials would be great).
- On line 939 "with respect to which the first noun should be moved" should probably read "with respect to which the first object should be moved"
Evaluation round #1
DOI or URL of the report: https://osf.io/hu68w/
Author's Reply, 03 Mar 2022
Decision by Chris Chambers, posted 27 Jan 2022
Three expert reviewers have now evaluated the Stage 1 manuscript and I'm happy to say that the reviewers are all enthusiastic, noting in particular the value of the research question, high level of methodological rigour, and overall clarity of the presentation. As a non-specialist, I fully agree -- from a Registered Reports perspective, the manuscript is strong and already comes close to meeting the Stage 1 criteria. The reviews do raise a number of significant issues that warrant careful consideration, including justification of various design decisions such as statistical priors, clarification (and potential revision) of the data exclusion criteria, and consideration of potential confounding variables such as word order.
Overall, the manuscript is a promising contender for IPA and I'm pleased to invite a revision. As part of the revision process, please be sure to include a comprehensive, point-by-point response to the reviews and a tracked-changes version of the manuscript.
Reviewed by Julien Mayor, 21 Jan 2022
Reviewed by anonymous reviewer, 27 Jan 2022
The manuscript addresses the role of variability in learning spatial adpositions in 7-year-old children. Spatial adpositions describe the location of one object in relation to another (e.g., “above” and “below”). Using a gaming paradigm, the authors plan to train children to learn two novel (Japanese) adpositions, above and below, and to manipulate the amount of sentence variability: high variability with 56 unique sentences, low variability with 4 unique sentences repeated 14 times, and skew variability with 4 unique sentences repeated 8 times and 24 unique sentences. The authors plan to assess children’s training performance and generalization of learning to novel nouns.
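The three training conditions summarised above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in that summary; the dictionary layout and function names are illustrative choices, not the authors' code.

```python
# Each condition is a list of (number of unique sentences, repetitions of each),
# using the counts given in the summary above.
conditions = {
    "high variability": [(56, 1)],
    "low variability": [(4, 14)],
    "skewed": [(4, 8), (24, 1)],
}

def total_sentences(sets):
    """Total number of training sentences a child hears in a condition."""
    return sum(n_unique * reps for n_unique, reps in sets)

def unique_sentences(sets):
    """Number of distinct sentences in a condition."""
    return sum(n_unique for n_unique, _ in sets)

for name, sets in conditions.items():
    print(f"{name}: {total_sentences(sets)} sentences, "
          f"{unique_sentences(sets)} unique")
```

On these counts every condition contains the same total number of training sentences, so the conditions differ in how many distinct sentences make up that total rather than in overall exposure.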
It has been a great pleasure to read the manuscript: it is very clear, exceptionally well structured and thorough. The authors performed computational simulations, implementing an error-driven, discriminative learning process, and two pilots to ensure the validity of the design. The research questions make sense in light of the presented theories and they are clearly defined. The protocol is sufficiently detailed to enable replication (see some minor suggestions though). The hypotheses are clear and the authors have explained precisely which outcomes will confirm their predictions. The sample size is justified. The authors need to provide data quality checks, though. The authors clearly distinguished work that has already been done (e.g. Pilot 1 and 2) from work yet to be done. Although I am overall very impressed by the quality of the manuscript, I would like the authors to address the points below before I can recommend in-principle acceptance.
I thank the authors for providing a link to the experiment. It has been a lot of fun to play it (the whole family enjoyed it). However, after having completed the experiment myself, it seemed to me that, in the high variability condition, learners are in a more disadvantaged position compared to the other two conditions: in addition to learning the adpositions, the high-variability group has a concurrent task to perform - to recognize and learn Japanese words in running speech (to adapt to them and connect them with their English equivalents) across different sentential positions and noun combinations, which might hinder learning of the adpositions per se to a larger extent than in the low-variability group (cf. below). Accent adaptation can be challenging and cognitively demanding for children, using up the resources necessary for learning the target adpositions. On the other hand, this can provide an advantage when exposed to novel speakers and novel items (as tested in the current study), as the high-variability group would have learnt to adapt to Japanese accents more efficiently, which can facilitate processing of adpositions and enable them to perform better in the generalization task when they are exposed to novel Japanese words/sentences. I wonder whether the authors considered (benefits of variability in) accent adaptation as a confound in their design.
I see that the authors provided noun practice with isolated Japanese nouns; how many trials did it contain? Did all participants display accurate word recognition?
Related to the first point, and more concerning, is the use of unique sentences in the low-variability group. Given that there were only 2 unique sentences per spatial word, I wonder whether children performed the task as expected (i.e., extracted the adpositions from the speech stream and learnt them) or whether they simply memorized the whole sentence and performed the task as a function of the first word they heard in a sentence. That is, if they hear the word shampoo at the onset of a sentence and it has been associated with the moving object being placed above, they can simply learn an association shampoo onset – object above, without relying on the target adpositions per se. In addition, when applying this strategy, the low-variability group might not need to recognize the Japanese cognates in running speech per se (which distinguishes them from the high-variability group); this can penalize them in the generalization task, as they would not have learnt to recognize/adapt to novel Japanese words in running speech.
The game provides an option to replay the sentence (multiple times) by clicking on a speech bubble. Related to the first point, is it possible that participants in the high-variability group used the replay option more than children in the other groups, which would lead to an increase in their overall exposure to the stimuli? Did the authors examine the number of replays in each group and whether it was related to their performance?
I find it a bit problematic, given the focus of the manuscript, to exclude participants who display consistent floor (average performance at chance level, i.e., 50%) or ceiling effects (average performance above 90%) during training in both sessions. If one of the conditions is intrinsically more difficult than the other, then applying this criterion might artificially increase the mean for a difficult task and decrease the mean for an easy task. I understand the authors’ concern that parents might interfere, but this can also happen when children perform “as expected”. When we look at Pilot 1 data, run by an experimenter, we can see that many children remained below the chance level even after the second session, in particular in the high-variability group. Did the pilot data reveal statistically significant differences between the conditions in the number of participants displaying floor and ceiling effects? How much data was affected?
Likewise, I find it problematic that trials in training in which children either pick up the wrong moveable object or position it in one of the distractor cells will be removed. Doesn’t this reveal that children were not able to accurately segment the sentence and/or recognize the Japanese words and/or learn the postposition? These errors inform us about how difficult a given condition is and about its learnability (8% of trials, as revealed by the pilots, seems considerable to me); in addition, it is not clear to me how removing these erroneous trials makes the task comparable to a two-alternative forced-choice task. Maybe the authors could comment on this?
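The concern about floor and ceiling exclusions can be illustrated with a small simulation. This is only a sketch under invented assumptions: the per-child accuracies are drawn from normal distributions whose means and standard deviations are hypothetical, each child is reduced to a single average score rather than two sessions, and the thresholds are the 50% and 90% values mentioned above.

```python
import random
from statistics import mean

random.seed(1)

FLOOR, CEILING = 0.50, 0.90  # exclusion thresholds mentioned in the review

def simulate_scores(true_mean, sd, n=2000):
    """Draw n hypothetical per-child accuracies, clipped to [0, 1]."""
    return [min(1.0, max(0.0, random.gauss(true_mean, sd))) for _ in range(n)]

def apply_exclusion(scores):
    """Keep children who are above floor and not above ceiling."""
    return [s for s in scores if FLOOR < s <= CEILING]

# Hypothetical conditions: these means and SDs are invented for illustration.
hard = simulate_scores(true_mean=0.55, sd=0.10)
easy = simulate_scores(true_mean=0.85, sd=0.08)

for label, scores in (("hard", hard), ("easy", easy)):
    kept = apply_exclusion(scores)
    print(f"{label}: mean before = {mean(scores):.3f}, "
          f"mean after = {mean(kept):.3f}, "
          f"excluded = {len(scores) - len(kept)}")
```

Under these assumptions the exclusion rule mostly removes low scorers from the hard condition and high scorers from the easy condition, so the two condition means move towards each other even though nothing about the underlying learning has changed.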
The authors do not include nouns as a random factor in the model. Is it possible that some items, e.g., “ice-cream” perform better than others?
For the third hypothesis, is it possible that the word for above in Japanese is acoustically more salient (with the three vowels following each other) as compared to the word for below?
van Heugten, M., & Johnson, E. K. (2014). Learning to contend with accents in infancy: Benefits of brief speaker exposure. Journal of Experimental Psychology: General, 143(1), 340–350. https://doi.org/10.1037/a0032192
L 17: an abstract => remove “an”
L 267: a=> at
L 285: countering => encountering
L 497: that predicts=> predict
L 603: If my understanding is accurate, the “set of 10 unique sentences with frequency of 8 and a set of four sentences with frequency 1” result in 84 sentences for the skewed condition, whereas there are 56 sentences in other conditions but also on figure 2 for the skewed distribution. Please check for consistency.
L 757: remove one “without”
Do hypotheses 8 to 10 refer to the training data or to novel noun trials?
L 1113: in the in novel nouns test => remove second in
L 1235: to corresponds => to correspond
L 1236: referred to
Reviewed by Caroline Rowland, 22 Jan 2022
This is a very comprehensive stage 1 proposal that combines computational modelling and pilot data to motivate a study of the relationship between input structure and generalisation in an online experimental paradigm with 7-8 year old English speaking children. The authors will test the hypothesis (generated from their theory of discriminative, error-based learning) that children will learn the meaning and use of Japanese spatial adpositions more effectively when there is more variability in the use of the nouns within spatial sentences. They also propose to test a number of secondary hypotheses; most notably, an empirically generated hypothesis that skewed distributions might be as good (or even better) for learning generalisations than highly variable input. The report satisfies all the necessary criteria in my opinion; the authors have done an excellent job. Below I summarise my comments under the 5 headings/areas suggested in the Guidelines for Reviewers, before finishing with some more general comments.
1A. The scientific validity of the research question(s).
The research question is scientifically valid and is detailed, with sufficient precision as to be answerable through quantitative research. The authors motivate their theoretical perspective with a literature review, and with a computational model. The study proposed falls within established ethical norms for working with children of this age.
1B. The logic, rationale, and plausibility of the proposed hypotheses, as applicable.
The proposed hypotheses are coherent and credible and are very robustly motivated. The authors motivate their primary hypotheses with a detailed, evaluative literature review, and a computational model. Secondary hypotheses are motivated empirically (on the basis of previous studies and/or pilot data). Both types of hypothesis are stated precisely and are sufficiently conceivable to be worthy of investigation. They follow either directly from the research question, or indirectly via empirical evidence (in the latter case, the analyses will yield important additional information that might lead to modifications of the theory). The pilot studies (pilot 1 and 2) are also well designed and well explained (though I have one point regarding the chance level of 25%, which I address under 1C below).
However, I would like the authors to address one point here regarding the statement (page 31) that they also plan to include measures of individual differences (e.g. attention, vocabulary) for exploratory analysis. I recognise that these are exploratory analyses, but the authors should motivate them in some way - what factors will be assessed here, what relevant information might the tasks yield, why are individual differences of interest here etc. In addition, these tasks are not mentioned at all in the methods section (see point 1C below).
1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis or alternative sampling plans where applicable).
The study procedures and analyses are incredibly well described and are valid. Critical design features such as randomisation, rules for exclusion etc, are present and fully explained.
Please note that I have not conducted Bayes analyses myself, so my knowledge of what needs to be considered is purely theoretical. Bearing in mind that caveat, the proposed sampling plan is rigorously described, and the thresholds for evidence at different levels (strong, moderate etc) for H1 and H0 are clearly described.
However, I have three points for the authors to consider:
a. As mentioned above, the authors state on page 31 that they also plan to include measures of individual differences (e.g. attention, vocabulary) for exploratory analysis. There is no mention of these tasks at all in the methods section. If these tasks are to be included, please add the usual methodological details (e.g what the tasks will be and how the data will be collected).
b. Motivation for the choice of statistical priors. Their priors are defined on the basis of effect sizes taken from the pilot study, which was conducted in person in the children's schools. I think there is now strong evidence that data collected online tends to be noisier - and effect sizes smaller - than data collected in person. This is particularly the case for studies with children, and even more so when in-person data were collected in a structured environment such as a school, where there are minimal distractions. If their online study yields substantially smaller effect sizes than their pilot data, will their study still be adequately powered to find strong/moderate evidence for H1 and/or H0 for all their hypotheses?
c. Chance level. On page 35, the authors state that they will remove trials in which children make 'illegal' moves (e.g. placing the object on a distractor cell), so that chance level is 50%. However, I'm not sure this is right. Even on trials in which children make legal moves, they still have the possibility of making an illegal move (placing an object on a distractor cell, or choosing a distractor object). So even on legal moves (which are included in the analysis), chance is less than 50%. This doesn't (I don't think) have any profound implications for the analysis because none of the analyses compare performance with chance (though the authors should check this). But either way, if I'm right, the authors need to:
· calculate the actual chance levels across trials and state this in the paper.
· Or remove distractor objects and cells
· Or (and this is my preferred option) make it impossible to place objects on distractor cells (i.e. make distractor objects non-movable and make objects ping back to their original position if you try to place them on a distractor cell).
Note that this issue also applies to the analysis for pilot 1, where chance is set to 25% for the same reason (page 58). Again, I don't think this is right; i.e. removing trials from the analysis where children make illegal moves doesn't make any difference to the chance level on legal trials. But please tell me if I'm wrong about this - I may have misunderstood what the authors did here.
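The two quantities at issue in point (c) can be separated with a small simulation. This is only a sketch: the uniform random responder and the total of 18 possible moves are assumptions made for illustration, while the four legal moves correspond to the 25% chance level reported for pilot 1.

```python
import random

random.seed(0)

# Hypothetical set-up, for illustration only.
N_MOVES = 18               # all moves available on a trial (assumed)
LEGAL_MOVES = {0, 1, 2, 3}  # four legal moves, one of which is correct
CORRECT_MOVE = 0
N_TRIALS = 100_000

# A responder who ignores the sentence and picks any move uniformly at random.
moves = [random.randrange(N_MOVES) for _ in range(N_TRIALS)]

# Accuracy over all trials, with illegal moves counted as errors.
accuracy_all_trials = sum(m == CORRECT_MOVE for m in moves) / N_TRIALS

# Accuracy after removing trials with illegal moves from the analysis.
legal_trials = [m for m in moves if m in LEGAL_MOVES]
accuracy_legal_only = sum(m == CORRECT_MOVE for m in legal_trials) / len(legal_trials)

print(f"all trials:        {accuracy_all_trials:.3f}")
print(f"legal trials only: {accuracy_legal_only:.3f}")
print(f"trials removed:    {N_TRIALS - len(legal_trials)}")
```

Under these assumptions the random responder is correct on roughly 1 in 18 of all trials but on roughly 1 in 4 of the trials that survive the removal of illegal moves. Which figure is the relevant chance level depends on whether a guessing child is assumed to choose among all moves or only among legal ones, which is the point the authors are being asked to clarify.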
1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses.
The protocol certainly contains sufficient detail to be reproducible and ensure protection against research bias, and specifies precise and exhaustive links between the research question(s), hypotheses methods and results. The design summary table is very useful.
Please note that some reviewers might state that they find the introduction section overly long. It is, indeed, very comprehensive. However, I appreciate this. It lays out, very clearly, the authors' theoretical perspective and the learning mechanism they are proposing, and neatly evaluates all the relevant previous literature. As many of us are now arguing, there aren't nearly enough papers in the child development literature that really get to grips with potential mechanisms of development; i.e. we have too few papers that accurately explain, in detail, how learning processes might work. Thus, I find it admirable that the authors have prepared such a careful, detailed review. There is only one sub-section I might shorten, which is the one describing the study by Hsu & Bishop (2014). However, even here the detail can be justified given how closely that study relates to this one.
1E. Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls; other quality checks) for ensuring that the obtained results are able to test the stated hypotheses or answer the stated research question(s).
The proposal contains all the necessary data quality checks (though please note my worry about the statistical priors detailed in 1C above is also relevant here). Proposed statistical tests are appropriate and outcome neutral. The pilot data suggest that there are unlikely to be floor or ceiling effects. Positive controls are appropriate (high, low, skewed variability input).
Page 10, footnote: "From a theoretical perspective, we do not believe there is any good reason to expect transfer to new constructions..." I don't quite agree with this - there is good evidence for construction-general transfer in some circumstances (e.g. Abbot-Smith and Behrens' wonderful "construction conspiracies" paper (2006)).
Page 12: Both paragraphs on skewed distributions. I found it really hard to follow these two paragraphs; Paragraph 1 seems to be saying there is a skewed distribution in natural language, and paragraph 2 contradicts that. I think I know what the authors are saying but it's a bit confusing. Can they rephrase? It would also be useful to give a short definition of a 'geometric distribution' in the text, so readers don't have to read the footnote to understand what it is (footnotes should probably just include additional information, not information essential to understanding the main text).
Page 18: "word order is not captured by the model'. I wondered what consequences this had for learning, and for the comparison with real children, since children certainly do have access to word order cues. If word order *was* captured by the model, how would this change the pattern of results (if at all)? Could the authors speculate here?
Page 27: difference between simulation results (no benefit of skew) and empirical results (benefit of skew). I did wonder what the simulation results looked like earlier in the learning cycle. One possibility is that the simulations have just learned the generalisation much better than children by the end of the learning cycle. So if we want to replicate empirical results (especially from children with DLD) we might want to look at the data earlier in the learning cycle of the simulation. If we administer the test session earlier in the model's learning cycle, is there any evidence of a skew advantage?
Page 60; Table 7. Please add a description of the three "response types" to the label of table 7. I was initially confused until I realised these referred back to the "four possible moves" described on page 58.
Throughout, but especially in the sections describing the pilot, two different labels are used for the skewed input: 1) skew-bimodal/skew and 2) exponential/geometric. This can make the paper a bit difficult to parse (especially on pages 24 and 25 where the labels on figure 4 refer to skew-bimodal and exponential conditions, but the description of the figure in the text uses skew and geometric labels).
There are a number of typos throughout so a good proofread would be useful. I have not listed them here because of the time that would take. If the authors would like to see these, please can they send me an editable version of the manuscript (e.g. googledoc or overleaf) and I can use track changes to point them out. (NB noueni and noshitani are sometimes italicised and sometimes not).