How well do "non-WEIRD" participants in multi-lab studies represent their local population?

based on reviews by Zoltan Dienes, Patrick Forscher and Kai Hiraishi
A recommendation of:

The WEIRD problem in a “non-WEIRD” context: A meta-research on the representativeness of human subjects in Chinese psychological research


Submission: posted 07 September 2021
Recommendation: posted 03 April 2023, validated 07 April 2023
Cite this recommendation as:
Yamada, Y. (2023) How well do "non-WEIRD" participants in multi-lab studies represent their local population? Peer Community in Registered Reports.


In this protocol, Yue et al. (2023) aim to clarify whether the samples from non-WEIRD countries included in multi-lab studies are actually representative of those countries and cultures. Focusing on China, this study will compare Chinese samples in several multi-lab studies with participants in studies published in leading national Chinese journals on various aspects, including demographic and geographic information. This work will provide useful information on the extent to which multi-lab studies are able to deal with generalizability, especially as they intend to address the generalizability problem.
The Stage 1 manuscript was reviewed by three experts, including two with an interest in the WEIRD problem and a wealth of experience in open science and multi-lab research, plus an expert in Bayesian statistics, which this manuscript uses. Following multiple rounds of peer review, and based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 4. At least some of the data/evidence that will be used to answer the research question already exists AND is accessible in principle to the authors (e.g. residing in a public database or with a colleague) BUT the authors certify that they have not yet accessed any part of that data/evidence.
List of eligible PCI RR-friendly journals: 
Yue, L., Zuo, X.-N., & Hu, C.-P. (2023) The WEIRD problem in a “non-WEIRD” context: A meta-research on the representativeness of human subjects in Chinese psychological research, in-principle acceptance of Version 7 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviewed by , 03 Apr 2023

I am happy with the authors' reply. I can't seem to access the paper; I am not sure if it is a problem at my end. But so long as the information in the reply appears in the paper, then my point has been addressed.

Evaluation round #5

DOI or URL of the report:

Version of the report: MD5:6e54b5c53f06135960169a7b3a117a12

Author's Reply, 29 Mar 2023

Decision by Yuki Yamada, posted 23 Mar 2023, validated 23 Mar 2023

Thank you very much for your effort to further improve the manuscript.
The remaining issues focus on the age analysis. The reviewer was, very thankfully, extremely prompt in providing comments on the manuscript. I regret having to ask you to revise the manuscript several times, but please revise it once more so that the proper analysis can be conducted.

Yuki Yamada, Recommender

Reviewed by , 23 Mar 2023

The authors show they actually can pick up reasonable proportion differences using counts (which is what should be used for a multinomial). But for the age case, they have misunderstood me: I in no way meant they should use a t-test. I meant they should indicate what sort of effects their procedure can pick up, using the same logic as they have, e.g., used here: "an article reported 30 participants, with age = 23.3 ± 3.5, we estimate the approximate number of participants under 20 is 5 (r code: `round((pnorm(20, mean = 23.3, sd = 3.5) * 30))`), the number of participants aged between 21 ~ 30 is 24 (r code: `round((pnorm(30, mean = 23.3, sd = 3.5) * 30)) - 5`), and participant aged between 30 ~ 40 will be 1." Sure, a multinomial analysis can pick up more than just mean differences in age. But one still needs to show the model of H1 as a uniform has sensible properties, given such a model does not reflect a relevant theory. So one way of showing the sensitivity of the analysis is to imagine a difference only in means of a certain amount, translate that into the age bins, run the multinomial analysis, and see what it can pick up. At the moment all we know - from what is provided in the actual paper - is that it can get evidence for H1 when the sample data are generated from a uniform; but that is such an unlikely position for the real data to be in that it does not tell us much. Sorry to go on, but my point has not been addressed.
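The binning logic quoted above can be sketched in Python rather than R. This is a rough illustration only: the bin edges and the hypothetical 2-year mean shift are assumptions for demonstration, not values taken from the manuscript.

```python
import math

def norm_cdf(x, mean, sd):
    """Normal CDF via the error function (equivalent to R's pnorm)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def age_bin_counts(n, mean, sd, edges=(20, 30, 40)):
    """Approximate counts in age bins (-inf,20], (20,30], (30,40], (40,inf)."""
    cdfs = [0.0] + [norm_cdf(e, mean, sd) for e in edges] + [1.0]
    return [round(n * (hi - lo)) for lo, hi in zip(cdfs, cdfs[1:])]

# The worked example from the quote: 30 participants, age 23.3 +/- 3.5
reference = age_bin_counts(30, mean=23.3, sd=3.5)   # [5, 24, 1, 0]
# A hypothetical 2-year shift in mean age, translated into the same bins
shifted = age_bin_counts(30, mean=25.3, sd=3.5)     # [2, 25, 3, 0]
```

Running the planned multinomial analysis on pairs of count vectors like these, for a range of mean shifts, would show the smallest shift the analysis can detect, which is the sensitivity check the reviewer asks for.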

Evaluation round #4

DOI or URL of the report:

Version of the report: MD5:17e9d57ee757b4d34f0d74f00e6700b7

Author's Reply, 23 Mar 2023

Decision by Yuki Yamada, posted 20 Feb 2023, validated 20 Feb 2023

Again, the reviewer commented. I think the discussion is getting quite mature, but there are still some insufficiencies in the in-text explanation and rationale for the justification of the priors. In a normal peer review, I would prefer not to go back and forth too many times, because it consumes both parties' resources, but I think it is important that key points are agreed upon between the authors and the reviewer, and I would like to continue for another round. I hope you will find those points of agreement, taking into account the reviewer's past comments.


Yuki Yamada, Recommender

Reviewed by , 18 Feb 2023

The authors have made a major step in addressing my point. (Incidentally, I didn't find the supplementary materials, so I don't comment on them.) I think, however, the discussion of this point should be in the main text when they introduce the prior (model of H1) that they use, and they need to elaborate more to justify the scientific relevance of this model of H1 given these results. In effect, if their reference sample had 50% males, their comparison would need about 67% or more males (or 33% or less) to be detected as different - is this reasonable in this context? (The model of H1 presumes any difference in proportion is as probable as any other - and one consequence of that vague assumption is that a large difference is needed to detect any difference.) I am not sure I find that reasonable. (That is why I always use scientifically informed priors.) Can the authors argue for a reasonable prior or argue that this prior is reasonable *given their particular scientific/empirical context*?
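The reviewer's intuition about how demanding a uniform prior is can be illustrated with a small sketch (my own, not from the manuscript). For a single proportion, H1 with a uniform Beta(1,1) prior has marginal likelihood 1/(n+1) regardless of the observed count, so the Bayes factor against a point null has a closed form. The sample size n = 100 below is an illustrative assumption.

```python
import math

def bf10_proportion(k, n, p0=0.5):
    """BF10 for k successes in n trials: H1 = uniform Beta(1,1) prior on the
    proportion, H0 = point null at p0."""
    # Under a uniform prior the marginal likelihood is 1/(n+1) for any k;
    # under H0 it is the binomial probability of observing k at p0.
    log_m1 = -math.log(n + 1)
    log_m0 = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(p0) + (n - k) * math.log(1.0 - p0))
    return math.exp(log_m1 - log_m0)

# With n = 100, a 60/40 split yields only ambiguous evidence (BF10 near 1),
# while a roughly 2:1 split gives clearly substantial evidence for H1
for k in (50, 55, 60, 67):
    print(k, bf10_proportion(k, 100))
```

Whether a prior this vague is appropriate is exactly the question the reviewer raises; an informed prior concentrated on plausible deviations would detect smaller differences.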

For the age result, the average reader will not be able to interpret what this means in terms of what age differences could be detected. That is why I suggested modeling age as a normal and showing what difference in mean ages could be detected. Find a way to present the sensitivity of the method to age differences that is intuitively graspable.

Evaluation round #3

DOI or URL of the report:

Version of the report: MD5:4fa349f4ca2ede8882c7b09e38cbe4f2

Author's Reply, 11 Feb 2023

Decision by Yuki Yamada, posted 21 Jan 2023, validated 21 Jan 2023

Thank you for your thorough revisions. As you can see, many issues have been resolved and several of the reviewers are satisfied.
Nevertheless, there remain issues related to the analysis that require further consideration. That is, the setting of the prior distribution and the inferences that can be drawn from it. Please read the reviewer's comment carefully for more details. I agree that this issue should be resolved before granting IPA. I look forward to seeing the revised manuscript again.

Yuki Yamada

Reviewed by , 20 Jan 2023

The planned analyses are now more clearly presented. But a main issue I raised has not been dealt with. The authors are using uniform priors. Such default priors do not typically reflect what a plausible theory would predict, which means no plausible theory is typically being tested. The way to answer the concern is to show that in this situation the default uniform is reasonable. How can one show that? One way is to indicate how the BF performs on imaginary data showing plausible ways H1 may be true. For example, give a sex ratio that deviates from the H0 by a relatively small amount - what is the smallest amount for which the BF still gives good evidence for H1? Age is more complex as it is multi-df; but one may proceed as the authors have by assuming age is normally distributed and find the smallest difference in means for which one just gets evidence for H1. (If one assumes normality one could argue one should use Bayesian t-tests for age. But one could also treat normality as just one possibility for checking how the test behaves.) This approach would be the simplest. More thorough would be showing what population effect would lead to, say, an 80% chance of being detected. Conversely, one should show that if there is no effect, there is sufficient N to obtain evidence for H0. This is of course unlikely to be a problem with the planned sample size, but one should always show this in a registered report (the logic of planning for a severe test is given here:

The authors responded to my point by saying they will check robustness with a different prior. This suggestion in itself leaves open inferential flexibility: What conclusions will be reached if the different priors lead to different conclusions? So the authors need to be clear on what basis they will draw particular conclusions. Simplest would be to justify one prior as most suitable (e.g. because of the way it behaves as described above), indicate all conclusions will be wrt this prior, and the other is for background information only.

Reviewed by , 13 Jan 2023

Reviewed by , 04 Jan 2023

I have read the revised manuscript and the authors' response to reviewers. I was largely satisfied with the previous draft of the manuscript and I'm also satisfied with this one. I'd like to see this manuscript accepted so that I can read the authors' results. :)

This is a great project -- one I'll certainly be keeping an eye on!

Patrick S. Forscher

Associate Director

Busara Center for Behavioral Economics

Evaluation round #2

DOI or URL of the report:

Version of the report: MD5:248b05d98f61aaf0c70a236c20e6df8c

Author's Reply, 03 Jan 2023

Decision by Yuki Yamada, posted 06 Oct 2022

Both of the previous two reviewers responded favorably to the revisions made by the authors, and I agree with them. Thank you very much for making substantial revisions and resolving many of the issues. I would then again like to invite the authors to submit a revised manuscript.

We have added a third reviewer in this round, who will focus exclusively on the Bayesian statistics. Then, considering all the peer review comments, the authors should provide a more specific explanation and justification for the hypotheses they are setting here.

Also, the individual peer review comments make some points about the target population. Since this is also involved in the hypothesis setting, I still think further justification is needed.

As the third reviewer pointed out, there is still some ambiguity in the design table. Please make it more specific so that the hypothesis, analysis, and interpretation correspond in a straight line.

This study has the potential to be interesting and important, and I would like to encourage the revision of it.

Yuki Yamada, Recommender

Reviewed by , 02 Oct 2022

Reviewed by , 26 Sep 2022

I think the authors have done a thorough and admirable job addressing my comments. I only have two remaining (potential) concerns.

First, I still wonder whether we should expect research samples to exactly represent the general population from which they were drawn. To take an extreme example, let's imagine that a research community goes through a period of doing lots of research on anxiety. One would expect that field's samples to include lots of highly anxious participants -- more so than one would expect in the general population. But this lack of representativeness is intended and, I think, justified because it is necessary to achieve the researchers' goals. If this research community stays fixated on anxiety for an extended period of time, one could critique that community for being too focused on one topic at the expense of other valuable topics that are relevant to non-anxious people, but I think periods of focus on one topic can be justified.

To be fair to the authors, they have included some coding of the researchers' intended generalizations -- but I think the findings will need to be interpreted carefully with the relationship between samples, populations, and research goals in mind. So, I don't see a strong need for revision -- this is just something to keep in mind for the discussion section (with, perhaps, a few tweaks to the framing of the paper's goals).

Second, I do have some lingering concerns about the analysis plan. One part of this concern is linked to my comments about whether one expects exact representativeness in research samples, as this expectation will be encoded in the prior. I just wonder whether the Bayes factors are comparing the right models. Maybe they are, as long as the discussion section contextualizes the results appropriately (i.e., it makes clear that sampling decisions are or should be a function of research goals) -- so perhaps no action is needed on this point. My other concern about the analysis plan is that it may need to be critiqued by someone with more Bayesian expertise than either I or the other reviewer can provide. I think this is an issue for the editor to decide.

At any rate, I think this is an interesting and valuable project and I'm looking forward to seeing where the authors go with it.

I sign all my reviews,

Patrick S. Forscher

Research Lead, Busara Center for Behavioral Economics

Reviewed by , 14 Sep 2022

I will comment just on the choice of analysis.

p 10: "and the Ha is not H0" and also footnote 2 "for others, the Bayesian hypothesis testing can be done without specifying the alternative hypothesis"

In fact, a Bayes factor always requires a specification of H1, because one has to calculate the probability of the data given H1, and this can only be done if H1 is some particular distribution. Where it seems not to be done, e.g. in the Hoijtink reference, it is done implicitly; and in the current case of a default, there is an explicit distribution, it is just that it is chosen without reference to the specific scientific problem. In this case, the authors themselves claim there is a distribution for H1, so the statements cited above should be deleted. However, I think the model of H1 used as a default by the authors is not exactly equal fixed probabilities in each cell, as might be read from their description. Rather, it is that the distribution of probabilities in each cell is the same. What the authors need to do is say what this distribution is, and briefly justify its relevance.

Incidentally, to see that there is a distribution involved, try the JASP "Bayesian multinomial test" (which I think is what the authors are using): if one specifies the same expected counts as the counts for the prior (model of H1), the Bayes factor is not 1. That is because the prior/model of H1 uses a distribution of probabilities in each cell.
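The reviewer's JASP observation can be reproduced with a short sketch of a Bayesian multinomial test. Here H1 places a uniform Dirichlet(1, ..., 1) prior over the cell probabilities, which I believe matches JASP's default, though that is an assumption on my part; the counts are invented for illustration.

```python
import math

def log_multi_beta(alpha):
    """Log of the multivariate beta function (the Dirichlet normalizer)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def bf10_multinomial(counts, p0, alpha=None):
    """BF10: H1 = Dirichlet(alpha) prior over cell probabilities,
    H0 = fixed cell probabilities p0. The multinomial coefficient is
    common to both marginal likelihoods, so it cancels in the ratio."""
    if alpha is None:
        alpha = [1.0] * len(counts)  # uniform Dirichlet prior under H1
    log_m0 = sum(c * math.log(p) for c, p in zip(counts, p0))
    log_m1 = (log_multi_beta([a + c for a, c in zip(alpha, counts)])
              - log_multi_beta(alpha))
    return math.exp(log_m1 - log_m0)

# Counts exactly matching the expected proportions give BF10 well below 1
# (evidence for H0), not BF10 = 1 -- the point made above:
print(bf10_multinomial([25, 25, 25, 25], [0.25] * 4))
# A large deviation from the expected proportions gives a very large BF10:
print(bf10_multinomial([70, 10, 10, 10], [0.25] * 4))
```

Scanning deviations of increasing size through a function like this would give the kind of severity check requested: the smallest deviation for which BF10 crosses a chosen evidence threshold, given the planned Ns.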

In terms of justifying their model, the authors can show what Bayes factors are obtained for different deviations from expected proportions.  This will indicate what size deviations their analysis is sensitive to, given their Ns. They should do this in order to show the severity of their tests: Is it likely that the tests will find evidence against their hypotheses, given reasonable assumptions about what size deviations there might be?

The Design Table needs to be more specific. List each hypothesis that will be tested, giving the exact test, and stating under what conditions the hypothesis will be deemed supported or refuted (e.g. what BF threshold).

Evaluation round #1

DOI or URL of the report:

Author's Reply, 12 Sep 2022

Decision by Yuki Yamada, posted 28 Jul 2022

Thank you very much for granting PCI-RR the opportunity to peer-review your paper.
At the same time, I sincerely apologize for the long delay in responding to you.

This manuscript was reviewed by two very experienced researchers who are interested in the WEIRD issue. Frankly speaking, the purpose of this study is viewed favorably by the reviewers as well as by myself, and it is desirable that this study be properly carried out.

This will require careful elaboration through a major revision, especially with respect to sample representativeness, coding, and sampling methods, as the reviewers have pointed out.

One reviewer also raised concerns about the placement of hypotheses and the setting of prior distributions in the Bayes factor analysis. It would be good to have this point clarified, but if necessary, please let us know and we can ask an expert in Bayesian statistics to check this point as an additional reviewer. In that case, we will ask them to focus the scope of their review on this point, so we do not expect it to take as long as it has so far.

I am looking forward to your revised manuscript.

Yuki Yamada

Reviewed by , 04 May 2022

Reviewed by , 27 Jul 2022

The authors propose to assess the representativeness of participants in Chinese psychology research. To this end, they propose to compare samples from five different sources:

  1. Samples from five mainstream Chinese journals
  2. Chinese samples from large-scale international collaborations
  3. Non-Chinese samples from large-scale international collaborations
  4. The National Bureau of Statistics of China
  5. The Chinese Family Panel Study

They pursue their goal of assessing the representativeness of participants in research in China with five activities, which use the samples illustrated in 1-5: 

  1. Compare samples from Chinese journals (1) to Chinese samples in large-scale collaborations (2);
  2. Compare samples from Chinese journals (1) to census data (4 and 5);
  3. Compare Chinese samples from international collaborations (2) to non-Chinese samples in international collaborations (3) 

I love the concept of this project. Psychology has long paid too little attention to sampling. Most of the time this problem gets described under the umbrella of the “WEIRD problem”, but it can be construed more broadly as a “generalizability problem”. The problem can even be construed more deeply as an issue with who defines the samples and topics that are interesting to study and how we draw conclusions about those samples and topics. I think this project could advance our understanding of these kinds of problems.

I don’t have many specific problems with the proposed study – instead, I have some suggestions for the authors to consider. Some of these suggestions may broaden the scope of the research. I think it would be fine if the authors declined some of these – so consider these suggestions as possibilities that the editor and authors can think about together as the protocol is revised.

My comments are divided into four sections:

  1. Broad aims
  2. Coded characteristics
  3. Data sources
  4. A note on the analysis plan

Broad aims

Although I love the topic of the project, I could imagine a skeptic wondering whether anyone would expect Chinese samples to perfectly represent the Chinese population. Researchers are supposed to choose the sampling methods that allow them to accomplish their research goals. Sometimes this involves random sampling to accomplish representativeness, but sometimes it doesn’t – as is the case when researchers simply want to show that a psychological phenomenon exists in any population at all. This defense is actually the very one offered by Mook (1983) in response to early claims that psychology has a generalizability problem (

However, there are some powerful responses to Mook’s argument:

  • Although some of psychology research involves existence proofs, many research topics require going beyond existence proofs. This is especially the case with research that has applied aspirations – it doesn’t really matter if you can get something to work in the lab if it doesn’t work in the real world
  • If researchers focus too much on the experiences and concerns of a narrow sub-population, they will miss phenomena that are experienced outside of that sub-population (see I like to think of these missed phenomena as “unknown unknowns” – research psychologists can’t even know that they are missing them because their measures and datasets don’t include the necessary information to know this 
  • Researchers choose research priorities based on their own experiences. If researchers are also drawn from a narrow subpopulation, they will choose research priorities that are important to that subpopulation, creating a distorted view of human psychology (see 

I think the authors should consider Mook’s arguments and the responses to it. Doing so might inform the aims and design of this study, as well as the information that is coded from each data source (see my next point, below).

Coded characteristics

The characteristics that are coded should be selected to accomplish the project’s broad aims. Because I think these broad aims might need a bit of adjustment, and because the specific adjustments should be decided by the authors, I won’t be too prescriptive with my suggestions about what to code. However, I do have a few thoughts that will, hopefully, help the authors think through what sorts of characteristics to select.

If the authors want to show that research in China is too focused on a specific research aim, such as the existence proof, they might consider coding some characteristics that capture the match between aim and sampling method. This might include, for example, the type of sampling the authors implemented (convenience, online panel, probability, etc), the type of research (exploratory or confirmatory), and/or the setting (lab, field, online, etc). They might find some ideas of what to code in this article on Arabic social psychology by Saab and colleagues (2020; 

If the authors want to capture the types of topics the authors select (and maybe, compare the topics in Chinese language journals to those in big team science initiatives), it might be worth coding something about the broad topic of study. Ideally, this would use a pre-existing coding system (such as the article keywords) to lower burden on the coders. This was a focus in a commentary I co-wrote on African psychology (; we didn’t do a systematic coding of topics, but instead tried to give a holistic sense of how African priorities might differ from Western priorities.

If the authors want to assess who’s setting the research priorities, they might want to code where the lead authors of each article are from and/or what their background is. This is an approach taken in Thalmeyer and colleagues (2021; – a more recent update to Arnett (2008) that the authors might find useful to scan.

Another similar possibility is to code the abstracts of the papers the authors sample for whether the source of the sample is mentioned, which could tell the authors who researchers take as the implicit “default participant”. This is an approach taken by Kahalon and colleagues (2021; 

Data sources

I have two brief notes on the data sources the authors have chosen.

  • Chinese journals. I must admit to ignorance as to the landscape of Chinese-language psychology journals, so I can’t really evaluate whether the five Chinese-language journals are a good representation of this landscape. For the benefit of readers like me, can the authors provide some description of how these journals were chosen – and maybe, of the landscape of Chinese journals generally?
  • Big team science initiatives. The landscape of ManyLabs-style initiatives (or, as I like to call them, “big team science” initiatives; see has grown a lot since the first ManyLabs studies. If you need a list of possible data sources for this style of study, you might want to consult this spreadsheet (, which is compiled and maintained by Dwayne Lieck and Daniel Lakens

A note on the analysis plan

The proposed analysis is very detailed and uses Bayesian methods that I don’t feel qualified to review in detail. However, I felt generally that the specific analyses may be too focused on evaluating whether Chinese samples are “exactly representative” of the Chinese population. I would advise more thought on the broad goals of the research and the characteristics that need to be coded to achieve those broad goals, then revising the analysis plan. 


I love the topic of this proposal and want to see the finished product. I don’t have strong views on the direction authors ought to take the protocol – though I do think they might benefit from reflecting a bit on the project’s broad aims. This would give them the opportunity to sharpen the specific goals and research activities so that their project is as impactful as possible.

I sign all my reviews,

Patrick S. Forscher

Research Lead, Busara Center for Behavioral Economics

(PS: I noticed a few minor English usage mistakes. These didn’t factor into my evaluation at all, which is why I am writing this note at the bottom. However, if the authors want someone to do some quick copy edits whenever they’re looking to submit this to a journal, I’d be willing to help them with this)

User comments

No user comments yet