DOI or URL of the report: https://osf.io/9d58j?view_only=506d243a6e7a4d3680c81e696ca81025
Version of the report: 1 (inside folder Revision_February_2024 on OSF)
The submission is now on the verge of IPA. Please attend to the remaining point noted by the reviewer, and Stage 1 acceptance will then be issued without delay.
Thanks to the authors for their many revisions, which I think have greatly strengthened the proposal. I think the current version satisfactorily addresses my main previous concerns. I have only one very small suggestion remaining: I find it confusing that Table 1 refers interchangeably to an SESOI of 0.1 and 0.5. I believe 0.5 refers to Cohen's d and 0.1 to the equivalent size in MM1, but this is not clear from the Table. Perhaps add "d" and "MM1", or otherwise clarify this?
I look forward to seeing the results of the study.
DOI or URL of the report: https://osf.io/gnt8k?view_only=506d243a6e7a4d3680c81e696ca81025
Version of the report: 1 (inside folder REVISION_November2023 on OSF)
The three reviewers from the previous round kindly returned to evaluate your revised manuscript. As you will see, the submission is now approaching Stage 1 IPA, pending clarification of some specific points concerning the clarity and precision of the hypotheses. I look forward to receiving a (hopefully final) revision addressing these issues.
I thank the authors for making an in-depth effort to address the issues raised by myself and the other two reviewers, which has resulted in what is largely a new manuscript.
Not all the choices they made are the ones I would have made, but that is fine; I'm just glad they've had the opportunity to consider our suggestions before beginning data collection/analysis. I don't want to prolong the process unnecessarily, so I will refrain from nitpicky comments, but I do have a couple of concerns that affect the core Registered Report design planner shown in Table 1 and thus deserve to be resolved before In Principle Acceptance (IPA):
1) Why are there four analyses specified in section 2.4 (2.4.1-2.4.4) when there are only two hypothesis tests listed in the table? Are 2.4.3 and 2.4.4 "Supporting analyses"? If so, I think these would be better described as "Exploratory Analyses" and clearly separated into a different section (I don't think you need to delete them entirely).
2) Several things about the proposed testing of H1 and the related power analysis do not feel quite right:
a) Why does hypothesis test 1 contain so many different comparisons?
("three pairwise comparisons (paired t-tests, one tailed). Note that we will also compare all styles to each other with a repeated measures ANOVA, potentially followed by 10 pairwise comparisons (paired t-tests, two-tailed))"
What would the interpretation be if only one or two comparisons were supported but not the other(s)? Would this be better broken into, say, three individual predictions? Or modeled in a different way?
(NB: This prediction structure looks similar to one our lab used in our first PCI-RR submission for the Hadavi et al. paper I mentioned previously. However, we later updated this after reviewer feedback; perhaps that later update may help give ideas? https://doi.org/10.31234/osf.io/26yg5)
b) Do you really need all of those comparisons (3 pairwise t-tests + 1 ANOVA + 10 pairwise t-tests)? How is this reflected in the power analysis, which appears to assume 10 comparisons?
(“To compare the amount of shared taste across the five vocalization styles, we calibrate our power analysis to have enough power for all 10 pairwise comparisons between styles (and not only the omnibus test). This comes at the cost of conservatively correcting our alpha for ten comparisons (α = .005 with Bonferroni correction for multiple comparisons), which, considering our SESOI of d = 0.5, necessitates a sample size of 71 participants to achieve power of .9 (paired, two-sided t-test, calculated with the pwr.t.test function from the pwr R package - Champely, 2020).”)
(NB: If you focus on fewer comparisons for the confirmatory analysis, you might not need as large a sample. Likewise, what is the justification for power of .9? I believe many journals request a minimum power of either .95 or .8, so depending on your plans you might not need so many participants if you are willing to accept lower power (and this could give more flexibility in other analyses). The sketch below illustrates both options.)
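As a minimal illustration (assuming only the pwr package already cited in the quoted passage), the stated n can be reproduced, and the effect of a lower power target or fewer corrected comparisons explored:

library(pwr)  # Champely (2020), as cited in the quoted power analysis

# reproduce the manuscript's calculation: d = 0.5, alpha = .05/10, power = .9
pwr.t.test(d = 0.5, sig.level = .05 / 10, power = 0.9,
           type = "paired", alternative = "two.sided")  # n ~ 71, as stated

# same SESOI and correction, but power = .8
pwr.t.test(d = 0.5, sig.level = .05 / 10, power = 0.8,
           type = "paired", alternative = "two.sided")

# correcting for only the 3 planned pairwise comparisons instead of 10
pwr.t.test(d = 0.5, sig.level = .05 / 3, power = 0.9,
           type = "paired", alternative = "two.sided")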
c) Both H1 and H2 are basically about degree of agreement, right? If so, what is the rationale for using different agreement metrics (MM1 for H1 vs. Krippendorff's alpha for H2)? Wouldn't it be simpler to use a single metric for both tests? Sorry if this is a stupid question, but others will probably have it too, so it would be better to explain this clearly.
d) The new simulation figures I found at https://osf.io/q2sgk?view_only=506d243a6e7a4d3680c81e696ca81025 are great, but they would be better included in the main manuscript file (perhaps just merge the 2 PDFs into one?). Figs. S2 and S3 help me a lot to understand what is happening in the test of H2, but I'm sorry to say Fig. S1 doesn't really shed much light on the test of H1 for me. Could that testing process be made clearer by visualizing simulated data?
e) Why are the two speaking conditions included at all if they are not the focus of the main tests of H1?
I hope these can help tighten up the manuscript, particularly Table 1, whether that means changing the table or better explaining elsewhere why those decisions were made.
The authors have done a terrific job of addressing all of the items I and the other reviewers have suggested. The restructuring of the introduction makes the document very readable, and the focus on the main 2 hypotheses, dropping the raft of (interesting!) exploratory analyses, also really helps to streamline the proposal.

Yet I still feel one of my comments was not adequately addressed, related to the comparison of speech to song; specifically, the lack of including speech in hypothesis 1. The predictions seem to only mention song, which dichotomizes the speech-to-song continuum. I understand the authors describe that they do not want to directly compare speech to song because they are considering the whole musi-language continuum, but by not addressing speech or making any predictions about it (even if there is little data using your metrics with speech, that indicates to me that this would be a novel and useful contribution to the literature!), the authors are treating it as a separate category. I would like to see speech added to the analysis for prediction 1 and predictions included for hypotheses 1 and 2. Alternatively, the hypotheses should be split by speech and song, which is logical given that the categories chosen are not equally spaced along a musi-language continuum (that is, none are ambiguous).
Thank you for the thorough and thoughtful responses to all the reviewers' comments. I look forward to seeing the data!
Signed,
Christina Vanden Bosch der Nederlanden
Dear editors, dear authors,
The planned study and the manuscript have improved a lot, and all of my comments have been addressed appropriately. I believe that the planned study benefits greatly from the reduced format and the adjusted theoretical framework.
I also had a glance at the code you provided, which seems to include suitable analyses for your planned study.
Therefore, I have no further comments and can in good conscience recommend the execution of the study.
Best regards and Merry Christmas,
Christina Krumpholz
DOI or URL of the report: https://osf.io/826sj?view_only=506d243a6e7a4d3680c81e696ca81025
Version of the report: https://osf.io/tqwpg?view_only=506d243a6e7a4d3680c81e696ca81025
(Please see the attached letter of response as a PDF file and the tracked-changes Word document. These are also in the project folder on OSF.)
Your submission has now been evaluated by three expert reviewers. As you will see, the reviews are critical but constructive, and rich in helpful recommendations.
There are three main areas to address in revision.
1. Clearly distinguishing confirmatory and exploratory analyses. Concerning the inclusion of exploratory analyses at Stage 1: where they are crucial for answering the research question, they need to be properly powered and specified in the same way as confirmatory tests. Where they are truly exploratory/secondary, they can be included at Stage 1 so that readers understand the motivation for including various methodological characteristics, but these analyses must be clearly distinguished from the primary questions and analyses.
2. Clarifying and strengthening the theoretical basis of the study, and in doing so better focusing the motivation for the specific research questions in the introduction (please also add the missing column to the study design template)
3. Increasing the degree of methodological detail to justify specific design and analysis decisions and maximise computational reproducibility.
In addition to these points, the reviewers make numerous additional specific comments, so please be sure to address them all comprehensively in a revision and response.
I applaud the authors for taking on the challenge of using the Peer Community In Registered Reports (PCI-RR) framework to undertake interesting, largely exploratory, analyses. I find the proposed topic of preferences for speaking and singing voices interesting and valid in principle, though it needs some refinement in terms of theoretical and methodological framing. So in principle I support eventual Recommendation of an improved version via PCI-RR.
That said, I think the current manuscript needs substantial revisions before it meets the standards of a PCI-RR Stage 1 protocol. In particular, one of the primary goals of RRs is to clearly separate confirmatory and exploratory analysis. Table 1 does this to an extent, but in the main text confirmatory and exploratory analyses are often mixed within the same section or paragraph such that they are hard to distinguish (e.g., Hypothesis 1.1.3 is described as "not included in Table 1 because it is exploratory", and I think the same might apply to the hypotheses predicting liking ratings from acoustic features?). I think all exploratory analyses, variables, etc. should be moved to a separate section and clearly labeled (e.g., "Section 3: Exploratory analyses"). I also have some concerns about the generalizability of the stimuli and experimental design, which I suggest the authors consider carefully before deciding whether to continue with the current design or revise.
[NB: In a recent submission from my lab (Hadavi et al., Under Review), the PCI-RR recommender actually requested during pre-screening that we completely remove all exploratory analyses for this reason. Personally, I think this may sometimes be too drastic and overly limit the ability of authors to use PCI-RR for exploratory research, so I have not recommended it in this case.]
I understand this may be a challenge for the current research, which the authors admit is largely exploratory. I would encourage the PCI-RR Recommender to discuss this issue with the Editorial Board, as the question of how best to use RRs for exploratory research is, I believe, still an ongoing one without a fixed answer. I strongly support making RRs as flexible as possible to allow for more exploratory work, so my review here is intended to provide constructive suggestions for how to achieve this. I will add that my lab has submitted three manuscripts to PCI-RR on related topics of song/speech/music cognition that are at different stages of the review process, and I encourage the authors to refer to these for ideas/templates for how to reorganize their manuscript/experimental design to make better use of the format (Chiba et al., In Press; Ozaki et al., Accepted In Principle; Hadavi et al., Under Review).
My main suggestions are as follows:
1) Move all details relating to exploratory analyses to a separate, dedicated section (see above)
2) Add figures visualizing the main confirmatory analyses. In my experience, it is worth collecting a small amount of pilot data (this can even be just from your three coauthors and/or lab members, colleagues) to show proof of principle. This could potentially be combined with the simulations recommended by Lisa De Bruine, although perhaps just the simulations alone might be enough. Even if you only use simulations, I still recommend plotting them in the manuscript.
3) Add a "Data/code/stimuli availability statement". Regarding the simulated data, I note that the authors say they have added R scripts simulating data, but I cannot find those scripts. I recommend uploading them to a repository (e.g., GitHub) and adding a "Data/code/stimuli availability statement" before the reference section incluing this link. I also recommend uploading the full stimulus set here (the manuscript links to a partial stimulus set at https://owncloud.gwdg.de/index.php/s/6IWIvTc828vB77R, but elsewhere says that “[the stimulus dataset, along with details about the validation experiment and acoustic analyses of the stimuli, will be, at the time of publication of this paper, available open access - currently work in progress]”. I think RR format best practice would be for the stimulus set to be fixed and open access before receiving In Principle Acceptance (IPA; cf. Chiba et al. In Press for an example).
4) Explicitly state how you are correcting for multiple comparisons. In your response to Lisa De Bruine, you say that the scripts do this, but Table 1 still just shows p<.05 without mentioning any correction.
5) More clearly connect the introduction, hypotheses, stimulus selection, and participant recruitment. At minimum, I believe PCI-RR requests that submissions use this template, which includes an additional column, "Theory that could be shown wrong by the outcomes" (https://osf.io/sbmx9).
[NB: I think the version used by the authors may have come from a different website (maybe linked from https://www.cos.io/initiatives/registered-reports?) - I recall having a similar discrepancy in the past and would encourage the PCI-RR Editorial Board to try to standardize this to avoid future confusion.]
While I find the general title topic of speaking/singing voice preferences of great interest, the current discussion of previous literature focuses on hypotheses about the evolution of infant-directed vocalization, but does not make it clear how the proposed analyses would be interpreted with respect to these hypotheses. For example, it is very difficult to propose predictions that can uniquely falsify theories of lullaby evolution via credible signaling but not social bonding, and vice versa (cf. Savage et al., 2021). The authors propose 5 key conditions, mostly corresponding to infant-directed song, infant-directed speech, adult-directed song, and adult-directed speech, with adult-directed song further divided into operatic and popular. However, the theoretical rationale for dividing into operatic and popular is not clear to me, and even the division between infant-directed and adult-directed is not clear, especially since it seems that both the singer/speaker and the participants listening are probably all adults. Given this, it would seem more natural to focus the experimental design on the singing/speaking contrast, and not worry about infant-directed vocalization. Thus I'd recommend citing and discussing Ozaki et al. and many of the references cited therein (e.g., Albouy et al., 2023; Livingstone & Russo, 2018; Ding et al., 2017; Sharma et al., 2021) regarding singing/speaking contrasts, without so much discussion of infant-directed vocalization. This may also help to refine the confirmatory hypotheses/analyses a bit.
On a related note: who are the 22 females who provided the recordings, and why were they chosen? Are they all trained opera singers (as I would guess from listening to some stimuli)? What languages do they speak? Are they intended to be representative of some broader general population? Should there be some control group(s) to show what effects sex, musical training, language, etc might have? Again, cf. Chiba et al. In Press for examples of selecting stimuli of different types to test hypotheses (in Chiba et al.'s case, low vs. high variance in performer quality; Western classical vs. Japanese folk instruments).
The same goes for participant recruitment: I see the authors have added a statement about this in response to Lisa De Bruine's point, but I didn't see many more details about the participants in this statement beyond "Participants will be recruited from the participant database of the Max Planck Institute for Empirical Aesthetics, in Frankfurt, Germany, which consists mostly of lay listeners, with a preponderance of students and retired subjects". Does this database only include adults? (If so, see above regarding whether this is appropriate for testing theories of infant-directed vocalization.) Are they all German speakers? Do we need to think about gender? (I suspect male ratings of preference for female vocalization may be quite different from female ratings!)
On the issue of language, I was very surprised to hear that all vocalizations included only lexically meaningless "lu" vocables. While this may avoid issues of language confounds, it also doesn't seem to me to be an appropriate proxy for "speech", which by definition uses lexically meaningful words. It sounds like there are also recordings with words - my suggestion would be to use those if you have to choose. You could also try running pilot experiments both ways (real lyrics and vocables) to get a sense of feasibility - and perhaps even include both if needed (though I imagine this may be logistically challenging for a long experiment). Cf. Ozaki et al. (Accepted In Principle) for ideas for comparing singing, the same lyrics recited as speech, and conversational speech (which has slightly different acoustic profiles from recited lyrics).
On a less core but also important note: it sounds like all stimuli are restricted to "only use one of the melody excerpts, the first phrase from "Chove Chuva" (by Brazilian artist Jorge Ben Jor)", which I believe is shown in Fig. 1. This seems like it will pretty dramatically limit the generalizability of the results to other songs and genres. And why choose a Brazilian piece (in Portuguese?) to test German participants? Consider expanding/changing the stimulus sample.
These are all pretty core issues that all may affect the ability to reach meaningful conclusions after collecting and analyzing data. The great thing about RRs is that it is not too late to change this design before you do this! I strongly encourage you to consider my comments here and revise some or all of your experimental design and hypotheses appropriately. (Not saying at all you need to implement all my suggestions, but I do recommend considering them carefully.)
Minor points:
I also have a few points I'd recommend considering:
-Some statements need references (e.g., “Voice attractiveness has been shown to co-vary with sexually dimorphic traits.”)
-Paragraph beginning "A first step in this direction was taken in a previous study (Bruder et al., 2021a, 2021b/in preparation)…": Is it appropriate to rely so heavily on unpublished/in-preparation conference presentations when readers don't have access to them to confirm? I'd suggest not doing this unless there is a written preprint available that people can consult for details if needed.
-"pop singing is defined here as singing without any specific type of technique": I think pop singers may be offended by this, as most pop music does use a variety of genre-specific techniques - could this be phrased differently?
-Please explicitly specify independent variable(s) (vocalization type?) and dependent variable(s) (liking rating?)
-“performed one fifth higher as pop and lullaby stimuli” - what does this mean and why was it done?
I hope these suggestions are constructive, and wish you good luck in trying to appropriately revise the project!
Signed,
Patrick Savage
PS: For transparency, I wish to disclose that two of the three authors are coauthors with me on a mega-collaboration with over 70 coauthors on the topic of speech and song that has received In Principle Acceptance from PCI-RR (Ozaki et al., Accepted In Principle; names bolded below). I have not otherwise collaborated with, nor do I have conflicts of interest with, any of the authors. I confirmed with PCI-RR before accepting the review that such mega-collaboration coauthorship does not disqualify me from serving as a peer reviewer.
PPS: The linked PDF did not appear to incorporate the changes from the previous revision - fortunately, I downloaded the tracked-changes file, which did! In future, please try to ensure the revised version is correctly uploaded.
References:
Chiba, G., Ozaki, Y., Fujii, S., & Savage, P. E. (In Press). Sight vs. sound judgments of music performance depend on relative performer quality: Cross-cultural evidence from classical piano and Tsugaru shamisen competitions [Stage 2 Registered Report]. Collabra: Psychology. Preprint: https://doi.org/10.31234/osf.io/xky4j (Peer Community In Registered Reports editorial recommendation and peer review: https://doi.org/10.24072/pci.rr.100351)
Hadavi, S., Kuroda, J., Shimozono, T., & Savage, P. E. (Under Review). Cross-cultural relationships between music, emotion, and visual imagery: A comparative study of Iran, Canada, and Japan [Stage 1 Registered Report]. PsyArXiv preprint: https://doi.org/10.31234/osf.io/26yg5
Ozaki, Y., Tierney, A., Pfordresher, P. Q., McBride, J., Benetos, E., Proutskouva, P., Chiba, G., Liu, F., Jacoby, N., Purdy, S. C., Opondo, P., Fitch, W. T., Rocamora, M., Thorne, R., Nweke, F., Sadaphal, D., Sadaphal, P., Hadavi, S., Fujii, S., Choo, S., Naruse, M., Ehara, U., Sy, L., Parselelo, M. L., Anglada-Tort, M., Hansen, N. Chr., Haiduk, F., Færøvik, U., Magalhães, V., Krzyżanowski, W., Shcherbakova, O., Hereld, D., Barbosa, B. S., Varella, M. A. C., van Tongeren, M., Dessiatnitchenko, P., Zar Zar, S., El Kahla, I., Farwaneh, S., Muslu, O., Troy, J., Lomsadze, T., Kurdova, D., Tsope, C., Fredriksson, D., Arabadjiev, A., Sarbah, J. P., Arhine, A., Ó Meachair, T., Silva-Zurita, J., Soto-Silva, I., Maripil, J., Millalonco, N. E. M., Ambrazevičius, R., Loui, P., Ravignani, A., Jadoul, Y., Larrouy-Maestri, P., Bruder, C., Aranariutheri, M. I., Teyxokawa, T. P., Kuikuro, K., Matis, T. U. P., Natsitsabui, R., Irurtzun, A., Sagarzazu, N. B., Raviv, L., Zeng, M., Varnosfaderani, S. D., Gómez-Cañón, J. S., Kolff, K., Vanden Bosch der Nederlanden, C., Chhatwal, M., David R. M., I Putu Gede Setiawan, Lekakul, G., Borsan, V. N., Nguqu, N., & Savage, P. E. (Accepted In Principle). Globally, songs are slower, higher, and use more stable pitches than speech [Stage 2 Registered Report]. Peer Community In Registered Reports. Preprint: https://doi.org/10.31234/osf.io/jr9x7 (Peer Community In Registered Reports editorial recommendation and peer review: https://rr.peercommunityin.org/articles/rec?id=316)
Savage, P. E., Loui, P., Tarr, B., Schachner, A., Glowacki, L., Mithen, S., & Fitch, W. T. (2021). Authors’ response: Toward inclusive theories of the evolution of musicality. Behavioral and Brain Sciences, 44(e121), 132–140. https://doi.org/10.1017/S0140525X21000042
Dear authors, dear editors,
In the following, I'm reviewing the registered report "Voice preferences across contrasting singing and speaking styles" by Bruder, Frieler & Larrouy-Maestri. Generally, I think this is a well-planned study which can extend previous knowledge and lead to interesting findings! However, I have some suggestions that should be taken into consideration before conducting the study.
Formalities:
DOI or URL of the report: https://osf.io/826sj?view_only=506d243a6e7a4d3680c81e696ca81025
Version of the report: 1
I've carefully read your submission and have decided to request some revisions before sending it out to review (these revisions should save time in the long run). I'm enthusiastic about this project, and I think these revisions will make it much easier for the reviewers to assess this registered report.
I'd like to recommend you have a look at this paper and see if it might apply to your study:
Taylor, J. E., Rousselet, G. A., Scheepers, C., & Sereno, S. C. (2022). Rating norms should be calculated from cumulative link mixed effects models. Behavior Research Methods. https://doi.org/10.3758/s13428-022-01814-7
I'd like you to include R scripts with your proposed analyses, run on simulated data. You can simulate entirely null effects (just simulate 45 observations for each perceptual rating for each stimulus, with realistic distributions on the Likert scale) or simulate data with a mixed-effects structure. I've included some example code below to get you started, based on a faux tutorial.
Please use the simulated data to write explicit analysis code for your planned analyses. It is my experience that describing analyses with prose is too ambiguous for a registered report and leaves all parties open to misunderstanding. For example, I'm not very familiar with MM1 measures and the extent to which they suffer from pseudoreplication; explicit code would be helpful for comparisons with other methods. Please also be explicit about any corrections for multiple comparisons (e.g., include your critical alpha for concluding significance for each test in the code).
The simulation method can also help you do power calculations for analyses that don't have an analytic solution (e.g., in G*Power). I have a tutorial guide to this method using mixed models, but the technique is applicable to any analysis (just write a function that simulates the data, runs the analysis, and returns the p-value of interest; then run this function ~1000 times and report the proportion of p-values less than your critical alpha).
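For example, a toy version of that recipe (using a simple paired design with a hypothetical effect size, not your actual model) might look like this:

# simulate one dataset, run the analysis, return the p-value of interest
sim_p <- function(n = 71, effect = 0.5) {
  diffs <- rnorm(n, mean = effect, sd = 1)  # paired differences, d = 0.5
  t.test(diffs)$p.value
}

# run ~1000 times; the proportion of p-values below the critical alpha
# estimates the power of the analysis
p_values <- replicate(1000, sim_p())
mean(p_values < .005)  # .005 = Bonferroni-corrected alpha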
If possible, include explicit code and decision rules for determining whether raters should be excluded for inattentiveness (although I acknowledge that sometimes they do things that are obvious only in retrospect, and I would absolutely support modification of the criteria if warranted).
I acknowledge that this is a big ask, but it's all code that will have to be written for the study eventually; if you write it now, it will take only minutes to run the analyses after you collect the data, and there will be much less chance of misunderstandings with the reviewers.
I'd also like a more detailed justification of the raters. It seems probable that ratings of things like timbre or resonance would be much more consistent for raters with musical experience (especially choir experience), and some of your results may be entirely dependent on the type of raters you recruit. You state that the subject pool is "mostly lay listeners". It would be good to be more explicit about whether and how raters' experience is expected to affect the ratings and how this might affect the generalisability of your results.
library(faux)
library(dplyr)
library(tidyr)

# simulate multilevel (crossed) data
ratings <- add_random(voice = 22) |>
  add_random(rater = 45) |>
  add_within(style = c("pop", "opera", "lullaby", "adult", "child"),
             rating = c("liking", "attack", "breathiness", "loudness", "resonance", "tempo", "timbre"),
             time = c("T1", "T2")) |>
  # add random effects structure (induces correlations)
  add_ranef(.by = "voice", b0_v = 1) |>
  add_ranef(.by = "rater", b0_r = 1) |>
  add_ranef(err = 5) |>
  mutate(score = 4 + b0_v + b0_r + err,
         # 1-7 Likert with proportional probabilities
         score = norm2likert(score, prob = c(1, 2, 4, 6, 4, 2, 1))) |>
  select(-b0_v, -b0_r, -err)

# calculate average scores
avg <- ratings |>
  group_by(voice, style, rating, time) |>
  summarise(score = mean(score), .groups = "drop")

# correlation between T1 and T2
avg |>
  pivot_wider(names_from = time, values_from = score) |>
  group_by(style, rating) |>
  summarise(value = cor(T1, T2), .groups = "drop") |>
  pivot_wider(names_from = style, values_from = value)
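If the cumulative link mixed-effects approach from the Taylor et al. paper above were adopted, the same simulated ratings could be analysed along these lines (a minimal sketch assuming the ordinal package, not a prescribed analysis):

library(ordinal)

# cumulative link mixed model for one rating scale (e.g., liking),
# with crossed random intercepts for voice and rater
liking <- ratings |>
  filter(rating == "liking") |>
  mutate(score = factor(score, ordered = TRUE))

mod <- clmm(score ~ style + (1 | voice) + (1 | rater), data = liking)
summary(mod)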