Exploring the enjoyment of voices

Based on reviews by Patrick Savage, Christina Vanden Bosch der Nederlanden, Christina Krumpholz and 1 anonymous reviewer
A recommendation of:

Voice preferences across contrasting singing and speaking styles

Submission: posted 30 November 2022
Recommendation: posted 15 February 2024, validated 15 February 2024
Cite this recommendation as:
Chambers, C. (2024) Exploring the enjoyment of voices. Peer Community in Registered Reports, .


Beyond the semantics communicated by speech, human vocalisations can convey a wealth of non-verbal information, including the speaker’s identity, body size, shape, health, age, intentions, emotional state, and personality characteristics. While much has been studied about the neurocognitive basis of voice processing and perception, the richness of vocalisations leaves open fundamental questions about the aesthetics of (and across) song and speech, including which factors determine our preference (liking) for different vocal styles.
In the current study, Bruder et al. (2024) examine the characteristics that determine the enjoyment of voices in different contexts and the extent to which these preferences are shared across different types of vocalisation. Sixty participants will report their degree of liking across a validated stimulus set of naturalistic and controlled vocal performances by female singers performing different melody excerpts as a lullaby, as a pop song and as an opera aria, as well as reading the corresponding lyrics aloud as if speaking to an adult audience or to an infant. The authors will then ask two main questions: first, whether there is a difference in the amount of shared taste (interrater agreement) across contrasting vocal styles, and second, as suggested by sexual selection accounts of voice attractiveness, whether the same performers are preferred across styles.
The Stage 1 manuscript was evaluated over three rounds of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA. 
List of eligible PCI RR-friendly journals:
1. Bruder, C., Frieler, K. & Larrouy-Maestri, P. (2024). Voice preferences across contrasting singing and speaking styles. In principle acceptance of Version 5 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #4

DOI or URL of the report:

Version of the report: 1 (inside folder Revision_February_2024 at osf)

Author's Reply, 12 Feb 2024

(please see two documents attached)

Decision posted 10 Feb 2024, validated 10 Feb 2024

The submission is now on the verge of IPA. Please attend to the remaining point noted by the reviewer and Stage 1 acceptance will then be issued without delay.

Reviewed on 08 Feb 2024

Thanks to the authors for their many revisions, which I think have greatly strengthened the proposal. I think the current version satisfactorily addresses my main previous concerns. I have only one very small suggestion remaining: I find it confusing that Table 1 refers interchangeably to an SESOI of 0.1 and 0.5. I believe 0.5 refers to Cohen's d and 0.1 refers to the equivalent size of MM1, but this is not clear from the Table. Perhaps add "d" and "MM1" or otherwise clarify this?

I look forward to seeing the results of the study. 

Evaluation round #3

DOI or URL of the report:

Version of the report: 1 (inside folder REVISION_November2023 at osf)

Author's Reply, 06 Feb 2024

(please see two attached files)

Decision posted 15 Jan 2024, validated 15 Jan 2024

The three reviewers from the previous round kindly returned to evaluate your revised manuscript. As you will see, the submission is now approaching Stage 1 IPA, pending clarification of some specific points concerning the clarity and precision of the hypotheses. I look forward to receiving a (hopefully final) revision addressing these issues.

Reviewed by Patrick Savage, 14 Dec 2023

I thank the authors for making an in-depth effort to address the issues raised by myself and the other two reviewers, which has resulted in a largely new manuscript.

Not all the choices they made are the ones I would have made, but that is fine; I'm just glad they've had the opportunity to consider our suggestions before beginning data collection/analysis. I don't want to prolong the process unnecessarily, so I will refrain from nitpicky comments, but I do have a couple of concerns that affect the core Registered Report design planner shown in Table 1 and thus deserve to be resolved before In Principle Acceptance (IPA):

1) Why are there four analyses specified in section 2.4 (2.4.1-2.4.4) when there are only two hypothesis tests listed in the table? Are 2.4.3 and 2.4.4 "Supporting analyses"? If so, I think these would be better to be described as "Exploratory Analyses" and clearly separated in a different section (I don't think you need to delete them entirely).

2) Several things feel not quite right about the proposed testing of H1 and related power analysis:

a) Why does hypothesis test 1 contain so many different comparisons?

("three pairwise comparisons (paired t-tests, one tailed). Note that we will also compare all styles to each other with a repeated measures ANOVA, potentially followed by 10 pairwise comparisons (paired t-tests, two-tailed))"

What would the interpretation be if only one or two comparisons were supported but not the other(s)? Would this be better broken into, say, three individual predictions? Or modeled in a different way?

(NB: This prediction structure looks similar to one our lab did in our 1st PCI-RR submission for the Hadavi et al. paper I mentioned previously. However, we later updated this after reviewer feedback - perhaps this later update may help give ideas?)

b) Do you really need all those 13 different comparisons (3 pairwise + 1 ANOVA + 10 pairwise)? How is this reflected in the power analysis, which appears to assume 10 comparisons?

(“To compare the amount of shared taste across the five vocalization styles, we calibrate our power analysis to have enough power for all 10 pairwise comparisons between styles (and not only the omnibus test). This comes at the cost of conservatively correcting our alpha for ten comparisons (α = .005 with Bonferroni correction for multiple comparisons), which, considering our SESOI of d = 0.5, necessitates a sample size of 71 participants to achieve power of .9 (paired, two-sided t-test, calculated with the pwr.t.test function from the pwr R package - Champely, 2020).”)

(NB: If you focus on fewer comparisons for confirmatory analysis, you might not need as large a sample. Likewise, what is the justification for power of .9? I believe many journals either request minimum power of .95 or .8, so depending on your plans you might also not need so many participants if you are willing to have a lower power (and this could give more flexibility in other analyses).)

c) Both H1 and H2 are basically about degree of agreement, right? If so, what is the rationale for using different agreement metrics (MM1 for H1 vs. Krippendorff's alpha for H2)? Wouldn't it be simpler to use a single metric for both tests? Sorry if this is a stupid question, but others will probably have it, so it would be better to explain clearly.

d) The new simulation figures I found at​ are great, but would be better to be included in the main manuscript file (perhaps just merge the 2 pdfs into one?). Figs. S2 and S3 help me a lot to understand what is happening to test H2, but Fig. S1 doesn't really shed much light on what is happening to test H1 for me, I'm sorry to say. Could that testing process be made more clear by visualizing simulated data?

e) Why are the two speaking conditions included at all if they are not the focus of the main tests of H1?

I hope these can help tighten up the manuscript, particularly Table 1, whether that means changing the table or better explaining elsewhere why those decisions were made.
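
As a rough cross-check of the sample size quoted in point 2b, the sketch below recomputes it in Python. It is not the authors' code: they report using `pwr.t.test` from the R `pwr` package, whereas this sketch uses a normal approximation with a z²/2 small-sample correction, which is an assumption made here for illustration. The function name `paired_t_sample_size` is hypothetical; only the inputs (d = 0.5, ten Bonferroni-corrected comparisons, power = .9) come from the quoted text.

```python
import math
from statistics import NormalDist


def paired_t_sample_size(d, alpha, power):
    """Approximate n for a two-sided paired t-test.

    Uses the normal approximation plus a z^2/2 correction term.
    This approximates, but does not replace, the exact
    noncentral-t calculation that dedicated power software performs.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)


# Inputs taken from the quoted power analysis.
sesoi_d = 0.5
target_power = 0.9

# Bonferroni correction for all 10 pairwise comparisons.
n_ten = paired_t_sample_size(sesoi_d, 0.05 / 10, target_power)

# Hypothetical alternative: correcting only for 3 confirmatory comparisons.
n_three = paired_t_sample_size(sesoi_d, 0.05 / 3, target_power)

print(f"n for 10 comparisons: {n_ten}")
print(f"n for 3 comparisons:  {n_three}")
```

Under this approximation the ten-comparison case comes out at 71 participants, in line with the figure quoted above, and restricting the confirmatory tests to three comparisons gives a smaller required sample.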

Reviewed by Christina Vanden Bosch der Nederlanden, 14 Jan 2024

The authors have done a terrific job of addressing all of the items I and other reviewers have suggested. The restructure of the introduction makes the document very readable and the focus on the main 2 hypotheses, dropping the raft of (interesting!) exploratory analyses also really helps to streamline the proposal. Yet, I still feel one of my comments, related to the comparison of speech to song, was not adequately addressed: specifically, the lack of speech in hypothesis 1. The predictions seem to only mention song, which dichotomizes the speech-to-song continuum. I understand the authors describe that they do not want to directly compare speech to song because they are considering the whole musi-language continuum, but by not addressing speech or making any predictions (even if there is little data using your metrics with speech, that indicates to me that this would be a novel and useful contribution to the literature!), the authors are treating it as a separate category. I would like to see speech added to the analysis for prediction 1 and predictions included for hypotheses 1 and 2. Alternatively, the hypotheses should be split by speech and song, which is logical given that the categories chosen are not equally spaced along a musi-language continuum (that is, none are ambiguous).

Thank you for the thorough and thoughtful responses to all the reviewers' comments. I look forward to seeing the data!


Christina Vanden Bosch der Nederlanden


Reviewed by Christina Krumpholz, 22 Dec 2023

Dear editors, dear authors,

The planned study and the manuscript have improved a lot and all of my comments have been addressed appropriately. I believe that the planned study benefits a lot from the reduced format and the adjusted theoretical framework. 

I also had a glance at the code you provided, which seems to include suitable analyses for your planned study. 

Therefore, I have no further comments and can in good conscience recommend that the study be carried out. 

Best regards and merry Christmas
Christina Krumpholz

Evaluation round #2

DOI or URL of the report:

Version of the report:

Author's Reply, 30 Nov 2023

(Please see attached letter of response as PDF file, and tracked changes word document. These are also in the project folder at osf) 

Decision posted 29 May 2023, validated 29 May 2023

Your submission has now been evaluated by three expert reviewers. As you will see, the reviews are critical but constructive, and rich in helpful recommendations.

There are three main areas to address in revision.

1. Clearly distinguishing confirmatory and exploratory analyses. Concerning the inclusion of exploratory analyses at Stage 1 -- where they are crucial for answering the research question they need to be properly powered and specified in the same way as confirmatory tests. Where they are truly exploratory/secondary, they can be included at Stage 1 in order for readers to understand the motivation for including various methodological characteristics, but these analyses must be clearly distinguished from the primary questions and analyses.

2. Clarifying and strengthening the theoretical basis of the study, and in doing so better focusing the motivation for the specific research questions in the introduction (please also add the missing column to the study design template).

3. Increasing the degree of methodological detail to justify specific design and analysis decisions and maximise computational reproducibility.

In addition to these points, the reviewers make numerous additional specific comments, so please be sure to address them all comprehensively in a revision and response.

Reviewed by Patrick Savage, 04 May 2023

I applaud the authors for taking on the challenge of using the Peer Community In Registered Reports (PCI-RR) framework to undertake interesting, largely exploratory, analyses. I find the proposed topic of preferences for speaking and singing voices interesting and valid in principle, though it needs some refinement in terms of theoretical and methodological framing. So in principle I support eventual Recommendation of an improved version via PCI-RR.

That said, I think the current manuscript needs substantial revisions before it meets the standards of a PCI-RR Stage 1 Protocol. In particular, one of the primary goals of RRs is to clearly separate confirmatory and exploratory analysis. Table 1 does this to an extent, but in the main text confirmatory and exploratory analyses are often mixed within the same section or paragraph such that they are hard to distinguish (e.g., Hypothesis 1.1.3 is described as “not included in Table 1 because it is exploratory”, and I think the same might apply to hypotheses predicting liking ratings from acoustic features?). I think all exploratory analyses, variables, etc. should be moved to a separate section and clearly labeled (e.g., "Section 3: Exploratory analyses"). I also have some concerns about the generalizability of the stimuli and experimental design, which I suggest the authors consider carefully before deciding whether to continue with the current design or revise.

[NB: In a recent submission from my lab (Hadavi et al., Under review), the PCI-RR recommender actually requested during pre-screening that we completely remove all exploratory analyses for this reason. Personally I think this may sometimes be too drastic and overly limit the ability of authors to use PCI-RR for exploratory research, so I have not recommended it for this case.]

I understand this may be a challenge for the current research, which the authors admit is largely exploratory. I would encourage the PCI-RR Recommender to discuss this issue with the Editorial Board, as the question of how best to use RRs for exploratory research is I believe still an ongoing one without a fixed answer. I strongly support making RRs as flexible as possible to allow for more exploratory work, so my review here is intended to provide constructive suggestions for how to achieve this. I will add that my lab has submitted three manuscripts to PCI-RR on related topics of song/speech/music cognition that are in different stages of the review process, and I encourage the authors to refer to these for ideas/templates for how to reorganize their manuscript/experimental design to make better use of the format (Chiba et al., In Press; Ozaki et al., Accepted In Principle; Hadavi et al., Under Review). 

My main suggestions are as follows:

1) Move all details relating to exploratory analyses to a separate, dedicated section (see above)

2) Add figures visualizing the main confirmatory analyses. In my experience, it is worth collecting a small amount of pilot data (this can even be just from your three coauthors and/or lab members, colleagues) to show proof of principle. This could potentially be combined with the simulations recommended by Lisa DeBruine, although perhaps just the simulations alone might be enough. Even if you only use simulations, I still recommend plotting them in the manuscript.

3) Add a "Data/code/stimuli availability statement". Regarding the simulated data, I note that the authors say they have added R scripts simulating data, but I cannot find those scripts. I recommend uploading them to a repository (e.g., GitHub) and adding a "Data/code/stimuli availability statement" before the reference section including this link. I also recommend uploading the full stimulus set here (the manuscript links to a partial stimulus set at, but elsewhere says that “[the stimulus dataset, along with details about the validation experiment and acoustic analyses of the stimuli, will be, at the time of publication of this paper, available open access - currently work in progress]”). I think RR format best practice would be for the stimulus set to be fixed and open access before receiving In Principle Acceptance (IPA; cf. Chiba et al. In Press for an example).

4) Explicitly state how you are correcting for multiple comparisons. In your response to Lisa DeBruine, you say that the scripts do this, but Table 1 still just shows p<.05 without mentioning any correction.

5) More clearly connect the introduction, hypotheses, stimulus selection, and participant recruitment. At minimum, I believe PCI-RR requests that submissions use this template including an additional column: “Theory that could be shown wrong by the outcomes”.

[NB: I think the version used by the authors may have come from a different website (maybe linked from - I recall having a similar discrepancy in the past and would encourage the PCI-RR Editorial Board to try to standardize this to avoid future confusion.] 

While I find the general title topic of speaking/singing voice preferences of great interest, the current discussion of previous literature focuses on hypotheses about the evolution of infant-directed vocalization, but does not make it clear how the proposed analyses would be interpreted with respect to these hypotheses. For example, it is very difficult to propose predictions that can uniquely falsify theories of lullaby evolution via credible signaling but not social bonding, and vice versa (cf. Savage et al., 2021). The authors propose 5 key conditions, mostly corresponding to infant-directed song, infant-directed speech, adult-directed song, and adult-directed speech, with adult-directed song further divided into operatic and popular. However, the theoretical rationale for dividing into operatic and popular is not clear to me, and even the division between infant-directed and adult-directed is not clear, especially since it seems that both the singer/speaker and the participants listening are probably all adults? Given this, it would seem more natural to focus the experimental design on the singing/speaking contrast, and not worry about infant-directed vocalization. Thus I'd recommend citing and discussing Ozaki et al. and many of the references cited therein (e.g., Albouy et al., 2023; Livingstone & Russo, 2018; Ding et al., 2017; Sharma et al., 2021) regarding singing/speaking contrasts, without so much discussion of infant-directed vocalization. This may also help to refine the confirmatory hypotheses/analyses a bit.

On a related note: who are the 22 females who provided the recordings, and why were they chosen? Are they all trained opera singers (as I would guess from listening to some stimuli)? What languages do they speak? Are they intended to be representative of some broader general population? Should there be some control group(s) to show what effects sex, musical training, language, etc might have? Again, cf. Chiba et al. In Press for examples of selecting stimuli of different types to test hypotheses (in Chiba et al.'s case, low vs. high variance in performer quality; Western classical vs. Japanese folk instruments).

The same goes for participant recruitment: I see the authors have added a statement about this in response to Lisa DeBruine's point, but I didn't see many more details about them in this statement beyond "Participants will be recruited from the participant database of the Max Planck Institute for Empirical Aesthetics’s, in Frankfurt, Germany, which consists mostly of lay listeners, with a preponderance of students and retired subjects". Does this database only include adults? (If so, see above regarding whether this is appropriate for testing theories of infant-directed vocalization). Are they all German speakers? Do we need to think about gender? (I suspect male ratings of preference for female vocalization may be quite different from female!) 

On the issue of language, I was very surprised to hear that all vocalizations only included lexically meaningless "lu" vocables. While this may avoid issues of language confounds, it also doesn't seem to me to be an appropriate proxy for "speech", which by definition uses lexically meaningful words. It sounds like there are also recordings with words as well - my suggestion would be to use those if you have to choose. You could also try running pilot experiments both ways (real lyrics and vocables) to get a sense of feasibility - and perhaps even include both if needed (though I imagine this may be logistically challenging for a long experiment). Cf. Ozaki et al. (Accepted In Principle) for ideas for comparing singing, the same lyrics recited as speech, and conversational speech (which has slightly different acoustic profiles from recited lyrics).

On a less core but also important note - it sounds like all stimuli are restricted to "only use one of the melody excerpts, the first phrase from “Chove Chuva” (by Brazilian artist Jorge Ben Jor)", which I believe is shown in Fig. 1. This seems like it will pretty dramatically limit the generalizability of the results to other songs and genres. And why choose a Brazilian piece (in Portuguese?) to test German participants? Consider expanding/changing the stimulus sample.

These are all pretty core issues that all may affect the ability to reach meaningful conclusions after collecting and analyzing data. The great thing about RRs is that it is not too late to change this design before you do this! I strongly encourage you to consider my comments here and revise some or all of your experimental design and hypotheses appropriately. (Not saying at all you need to implement all my suggestions, but I do recommend considering them carefully.)


Minor points:

I also have a few more minor points I'd recommend considering:

-Some statements need references (e.g., “Voice attractiveness has been shown to co-vary with sexually dimorphic traits.”)

-Paragraph beginning: “A first step in this direction was taken in a previous study (Bruder et al., 2021a, 2021b/in preparation)…”: Is it appropriate to rely so heavily on unpublished/in prep. conference presentations when readers don’t have access to them to confirm? I’d suggest not doing this unless there is a written preprint available that people can consult for details if needed.

-“pop singing is defined here as singing without any specific type of technique” - I think pop singers may be offended by this, as most pop music does use a variety of genre-specific techniques - could this be phrased differently?

-Please explicitly specify independent variable(s) (vocalization type?) and dependent variable(s) (liking rating?)

-“performed one fifth higher as pop and lullaby stimuli” - what does this mean and why was it done?


I hope these suggestions are constructive, and wish you good luck in trying to appropriately revise the project!


Patrick Savage

PS For transparency, I wish to disclose that two of these three authors are coauthors with me on a mega-collaboration with over 70 coauthors on the topic of speech and song that has received In Principle Acceptance from PCI-RR (Ozaki et al. Accepted In Principle; names bolded below). I have not otherwise collaborated with or otherwise have conflicts of interest with any of the authors. I confirmed with PCI-RR before accepting the review that such mega-collaboration coauthorship does not disqualify me from serving as a peer reviewer.

PPS: The linked PDF did not appear to incorporate the changes from the previous revision - fortunately I downloaded the tracked change file which did! But in future please try to ensure the revised version is correctly uploaded.


Chiba, G., Ozaki, Y., Fujii, S., & Savage, P. E. (In Press). Sight vs. sound judgments of music performance depend on relative performer quality: Cross-cultural evidence from classical piano and Tsugaru shamisen competitions [Stage 2 Registered Report]. Collabra: Psychology. Preprint: (Peer Community In Registered Reports editorial recommendation and peer review: 

Hadavi, S., Kuroda, J., Shimozono, T., & Savage, P. E. (Under Review). Cross-cultural relationships between music, emotion, and visual imagery: A comparative study of Iran, Canada, and Japan [Stage 1 Registered Report]. PsyArXiv preprint: 

Ozaki, Y., Tierney, A., Pfordresher, P. Q., McBride, J., Benetos, E., Proutskouva, P., Chiba, G., Liu, F., Jacoby, N., Purdy, S. C., Opondo, P., Fitch, W. T., Rocamora, M., Thorne, R., Nweke, F., Sadaphal, D., Sadaphal, P., Hadavi, S., Fujii, S., Choo, S., Naruse, M., Ehara, U., Sy, L., Parselelo, M. L., Anglada-Tort, M., Hansen, N. Chr., Haiduk, F., Færøvik, U., Magalhães, V., Krzyżanowski, W., Shcherbakova, O., Hereld, D., Barbosa, B. S., Varella M. A. C., van Tongeren, M., Dessiatnitchenko, P., Zar Zar, S., El Kahla, I., Farwaneh, S., Muslu, O., Troy, J., Lomsadze, T., Kurdova, D., Tsope, C., Fredriksson, D., Arabadjiev, A., Sarbah, J. P., Arhine, A., Ó Meachair, T., Silva-Zurita, J., Soto-Silva, I., Maripil, J., Millalonco, N. E. M., Ambrazevičius, R., Loui, P., Ravignani, A., Jadoul, Y., Larrouy-Maestri, P., Bruder, C., Aranariutheri, M. I., Teyxokawa, T. P., Kuikuro, K., Matis, T. U. P., Natsitsabui, R., Irurtzun, A., Sagarzazu, N. B., Raviv, L., Zeng, M., Varnosfaderani, S. D., Gómez-Cañón, J. S., Kolff, K., Vanden Bosch der Nederlanden, C., Chhatwal, M., David R. M., I Putu Gede Setiawan, Lekakul, G., Borsan, V. N., Nguqu, N., & Savage, P. E. (Accepted In Principle). Globally, songs are slower, higher, and use more stable pitches than speech [Stage 2 Registered Report]. Peer Community In Registered Reports. Preprint: ((Peer Community In Registered Reports editorial recommendation and peer review:    

Savage, P. E., Loui, P., Tarr, B., Schachner, A., Glowacki, L., Mithen, S., & Fitch, W. T. (2021). Authors’ response: Toward inclusive theories of the evolution of musicality. Behavioral and Brain Sciences, 44(e121), 132–140.

Reviewed by anonymous reviewer 1, 29 May 2023

Reviewed on 16 May 2023

Dear authors, dear editors, 

In the following, I'm reviewing the registered report "Voice preferences across contrasting singing and speaking styles" by Bruder, Frieler & Larrouy-Maestri. Generally, I think this is a well-planned study which can extend previous knowledge and lead to interesting findings! However, I have some suggestions that should be taken into consideration before conducting the study. 

    1. Major: In general, I'm missing a proper theoretical framework for this study. As this is generally a problem in psychology, I'm pointing it out here hoping that you will put more emphasis on this and also consider it in your discussion when interpreting results again!
    2. Major: The introduction would profit a lot from a different structure. Currently, starting with "infant speaking" and then just briefly touching "general speaking" is misleading -  I would expect a study that is much more focussed on infant vs. adult talking, which is not the focus of the study. Maybe you can think of a more general introduction that mentions infant talking as one subpoint? This would also better lead to your paragraph describing the present study and make the research interest more pronounced. 
    3. Major: On page 2, research questions are defined: "But why do we like some voices more than others? How does our enjoyment of voices vary across diverse types of vocalizations?" - I think it could be formulated more precisely, so that it's clear that you're looking at both shared and individual taste here.
    4. Major: I agree with what Lisa DeBruine has mentioned before, a data simulation would be advantageous - however, I don't see any code? Would it be possible to send me your planned analysis code?
      • About the sample sizes. For 1.2.1. and 1.2.2.: Why are you running your sample size calculations with an expected small to moderate effect size? What is your rationale behind that? SESOI or previous findings? For 1.2.3.: Sample size calculation is missing completely - for Linear Mixed Models you need specific sample sizes in order to make valuable statements, especially when interpreting random effects. Please complete.
      • In general, I think 1.2.3. can benefit from more detail: How are you going to decide which effects will remain in the final model? Step-wise LMM? Random slopes or intercepts? And so on... 
    5. Major: About the MM1: As you are testing participants in two sessions anyways, I think the Beholder Index (as described in Hönekopp, 2006, and in Specker et al., 2020) is the more appropriate analysis method to account for the measurement error! From Specker et al. (2020): "The bi method estimates variance components that can be interpreted as the observed variance attributed either to the participant or the stimulus (see [20], p.2 for a comprehensive explanation of the estimation of the variance components). In order to do this, participants need to rate each stimulus twice. The repeated measure allows for estimation not just of how much participants agree with each other on a rating (shared evaluation) but also how much participants agree with themselves on the repeated rating (private evaluation)." 
    6. Minor: Is singing to infants only used cross-culturally or are its features also comparable across cultures? This is for me not clear from the intro. 
    7. Minor: When you discuss voice attractiveness, you could also mention studies which don't find effects of perceptual/acoustic features on voice attractiveness (or e.g., Mook & Mitchel, 2019, 10.1037/ebs0000128, who did find that averageness lowered voice attractiveness)
    8. Minor: Could you give some more information on which singing styles were employed in the Valentova et al. (2019) study?
    9. Minor: Maybe focus more on individual vs. private taste in the introduction; you could give examples from face research or other aesthetic research (artworks etc.). Also, when you describe your previous studies, give more detail on how much was explained by individual differences. In the last paragraph before study aims and hypotheses (p. 4/5) make clear how these results apply to the different song & speech styles you are using. 
    10. Minor: Although you plan to do exploratory analyses in 1.1.3., I think you could be more precise. Which perceptual features are taken into account and why? What could you expect for e.g. the speaking voices based on previous results?
    11. Minor: Can you maybe test hearing impairments instead of using self-report?
    12. Minor: What biographical data do you collect and why?
    13. Minor: The three questionnaires could be explained in more detail: why they are included and what they measure precisely (e.g. the Music Sophistication subscale) - maybe you could even mention something about them in the introduction!
    14. Minor: Could you add some comparison within style (especially pop, which has low accuracy?) instead of just between styles?
    15. Minor: Why are the questionnaires presented half-way through the experiment? Maybe you could present them at both sessions to replicate?
    16. Minor: How is the randomization conducted? 
    17. Minor: I feel like it could be advantageous to always let participants rate liking first before rating all other dimensions to get a more spontaneous rating there?


    1. There are a few grammar issues (especially regarding prepositions)
    2. Report is not APA-conform; the reference list definitely needs a review! Also: Headings, Table descriptions, Figure descriptions etc.

Evaluation round #1

DOI or URL of the report:

Version of the report: 1

Author's Reply, 05 Mar 2023

Decision posted 12 Dec 2022, validated 12 Dec 2022

I've carefully read your submission and have decided to request some revisions before sending it out to review (these revisions should save time in the long run). I'm enthusiastic about this project, and I think these revisions will make it much easier for the reviewers to assess this registered report.

Rating norms

I'd like to recommend you have a look at this paper and see if it might apply to your study:

Taylor, J. E., Rousselet, G. A., Scheepers, C., & Sereno, S. C. (2021, August 3). Rating Norms Should be Calculated from Cumulative Link Mixed Effects Models.

Analysis scripts

I'd like you to include R scripts with your proposed analyses, run on simulated data. You can simulate entirely null effects (just simulate 45 observations for each perceptual rating for each stimulus with realistic distributions on the Likert scale) or simulate data with a mixed effects structure. I've included some example code below to get you started, based on a faux tutorial.

Please use the simulated data to write explicit analysis code for your planned analyses. It is my experience that describing analyses with prose is too ambiguous for a registered report and leaves all parties open to misunderstanding. For example, I'm not very familiar with MM1 measures and the extent to which they suffer from pseudoreplication; explicit code would be helpful for comparisons with other methods. Please also be explicit about any corrections for multiple comparisons (e.g., include your critical alpha for concluding significance for each test in the code).
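As an illustration of what stating the critical alpha in code could look like, here is a minimal base R sketch. The p-values, the test names, the familywise alpha of .05, and the choice of Bonferroni and Holm corrections are all assumptions made for the example; they are not taken from the manuscript.

```r
# Hypothetical p-values for a family of planned tests (illustrative values only)
p_raw <- c(test_a = 0.004, test_b = 0.020, test_c = 0.049)

# Assumed familywise alpha
alpha_family <- 0.05

# Bonferroni: an explicit critical alpha for each individual test
alpha_per_test <- alpha_family / length(p_raw)
sig_bonferroni <- p_raw < alpha_per_test

# Holm: adjust the p-values instead and compare them to the familywise alpha
p_holm <- p.adjust(p_raw, method = "holm")
sig_holm <- p_holm < alpha_family

print(data.frame(p_raw, sig_bonferroni, p_holm, sig_holm))
```

Writing the threshold as a named object such as `alpha_per_test` makes the decision rule for each test unambiguous before any data are collected.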

The simulation method can also help you to do power calculations for analyses that don't have an analytic solution (e.g., in G*Power). I have a tutorial guide to this method using mixed models, but the technique is applicable to any analysis (just write a function that simulates the data, runs the analysis, and returns the p-value of interest; then run this function ~1000 times and report the proportion of p-values less than your critical alpha).
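The simulate, analyse, and repeat recipe described above can be sketched in base R as follows. This is only a sketch under stated assumptions: the paired t-test stands in for whatever analysis is actually planned, and the function name `sim_p`, the effect of 0.5 scale points, the standard deviations, and the alpha of .05 are hypothetical values chosen for illustration.

```r
set.seed(1)  # for reproducibility of the illustration

# Simulate one data set, run the analysis, and return the p-value of interest.
# Assumption: each of n_raters gives one mean rating in each of two conditions.
sim_p <- function(n_raters = 45, effect = 0.5, sd = 1) {
  rater_offset <- rnorm(n_raters, mean = 0, sd = 1)  # rater-specific baseline
  cond_a <- 4 + rater_offset + rnorm(n_raters, 0, sd)
  cond_b <- 4 + effect + rater_offset + rnorm(n_raters, 0, sd)
  t.test(cond_b, cond_a, paired = TRUE)$p.value
}

alpha <- 0.05  # assumed critical alpha

# Run the simulation 1000 times and take the proportion of significant results
p_values <- replicate(1000, sim_p())
power <- mean(p_values < alpha)
power

# Sanity check: with no true effect, the rejection rate should be close to alpha
false_positive_rate <- mean(replicate(1000, sim_p(effect = 0)) < alpha)
false_positive_rate
```

Swapping the body of `sim_p` for the real data simulation and model keeps the rest of the procedure unchanged.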

If possible, include explicit code and decision rules for determining if raters should be excluded for inattentiveness (although I acknowledge that sometimes they do things that are obvious only in retrospect, and would absolutely support modification of the criteria if warranted).

I acknowledge that this is a big ask, but it's all code that will have to be written for the study eventually, so if you write it now, it will only take minutes to run the analyses after you collect the data, and there will be much less chance of misunderstandings with the reviewers.

Rater details

I'd also like a more detailed justification of the raters. It seems probable that ratings of things like timbre or resonance would be much more consistent for raters with musical experience (especially choir experience), and some of your results may be entirely dependent on the type of raters you recruit. You state that the subject pool is "mostly lay listeners". It would be good to be more explicit about whether and how raters' experience is expected to affect the ratings and how this might affect the generalisability of your results.



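# simulate multilevel (crossed) data
# requires the faux, dplyr and tidyr packages (and R >= 4.1 for the native pipe)
library(faux)
library(dplyr)
library(tidyr)

# 22 voices crossed with 45 raters
ratings <- add_random(voice = 22) |>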
  add_random(rater = 45) |>
  add_within(style = c("pop", "opera", "lullaby", "adult", "child"),
             rating = c("liking", "attack", "breathiness", "loudness", "resonance", "tempo", "timbre"),
             time = c("T1","T2")) |>
  # add random effects structure (induces correlations)
  add_ranef(.by = "voice", b0_v = 1) |>
  add_ranef(.by = "rater", b0_r = 1) |>
  add_ranef(err = 5) |>
  mutate(score = 4 + b0_v + b0_r + err,
         # 1-7 likert with proportional probabilities
         score = norm2likert(score, prob = c(1, 2, 4, 6, 4, 2, 1))) |>
  select(-b0_v, -b0_r, -err)

# calculate average scores
avg <- ratings |>
  group_by(voice, style, rating, time) |>
  summarise(score = mean(score), .groups = "drop")

# correlation between T1 and T2
avg |>
  pivot_wider(names_from = time, values_from = score) |>
  group_by(style, rating) |>
  summarise(value = cor(T1, T2), .groups = "drop") |>
  pivot_wider(names_from = style)
