Does self-regulation by gaming companies for the use of loot boxes work?

Recommended by Zoltan Dienes based on reviews by Chris Chambers, Lukas J. Gunschera and Andy Przybylski
A recommendation of:

Assessing compliance with UK loot box industry self-regulation on the Apple App Store: a 6-month longitudinal study on the implementation process


Submission: posted 27 August 2023
Recommendation: posted 25 March 2024, validated 25 March 2024
Cite this recommendation as:
Dienes, Z. (2024) Does self-regulation by gaming companies for the use of loot boxes work? Peer Community in Registered Reports.


Video games may provide the option of spending real money in exchange for probabilistically receiving game-relevant rewards; in effect, encouraging potentially young teenagers to gamble. The industry has subscribed to a set of regulatory principles to cover the use of such "loot boxes", including 1) that they will prevent loot box purchasing by under-18s unless parental consent is given; 2) that they will make it clear upfront that the game contains loot boxes; and 3) that they will clearly disclose the probabilities of receiving different rewards.
Can the industry effectively self-regulate? Xiao (2024) will evaluate this important question by investigating the 100 top-selling games on the Apple App Store and estimating the percentage compliance with these three regulatory principles at two time points 6 months apart.
The Stage 1 manuscript was evaluated over one round of in-depth review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 2. At least some data/evidence that will be used to answer the research question has been accessed and partially observed by the authors, but the authors certify that they have not yet observed the key variables within the data that will be used to answer the research question.
List of eligible PCI RR-friendly journals:
1. Xiao, L. (2024). Assessing compliance with UK loot box industry self-regulation on the Apple App Store: a 6-month longitudinal study on the implementation process. In principle acceptance of Version 3 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #2

DOI or URL of the report:

Version of the report: 1

Author's Reply, 29 Jan 2024


Please find (i) my response to the recommender comments and (ii) the manuscript file with all changes tracked separately attached below. Thank you!

All files (including a clean version of the manuscript with all changes confirmed) are available as one document via (version 1).

Decision by Malte Elson, posted 28 Jan 2024, validated 29 Jan 2024

Dear Dr. Xiao,


Thank you for your submission to PCI-RR. I have now had the opportunity to read your revised manuscript and your response to the reviewers. Let me first apologise for the considerable delay in my response. I understand that this may have affected the timeline of your project, with the original date of the first sampling of games being January 18. Unfortunate private and professional circumstances on my side prevented me from providing a timely response. Again, my sincerest apologies.


I have decided not to send your revised manuscript out for re-review. The points raised by the reviewers in round 1 were quite clear, and it seemed an undue use of their time to ask them to check your responses to them.


Overall, I am quite happy with your points, counterpoints, and changes to the manuscript. Referring to the major points in my previous letter:


1. I no longer see any issues with using Ukie and non-Ukie games, given their claims that you cite, but also their normative influence on the gaming industry in the UK. I also appreciate the changes you made to the title. Further, I find it acceptable that you do not scale your sample against the number of players per game, essentially giving each game the same “weight” in your assessment. However, I expect that this important point will receive extensive attention in the discussion of your findings, whatever they may be.


2. You offer a useful definition of what you consider a loot box “event” in your study. This is now much clearer, as is how you will proceed in playing the games and encountering loot boxes, if they exist.


3. To be frank, I still have some concerns regarding your use of hypothesis tests and cutoffs. In your response letter, you argue that the cutoffs are arbitrary, but that you will use them regardless to prevent yourself from changing your interpretation of the final percentage, as different people might interpret the same rate quite differently (an industry representative vs. an advocacy group). Further, you argue that the observed rate should not be generalised beyond the top 100 grossing games, and that this set is not a sample but a population.

I can follow your reasoning on the first point, though I would have expected this elaboration in the manuscript itself rather than merely in the response letter. Certainly, one could argue that rules are rules, and that any deviation from 100% is therefore a failure of self-regulation. But then, I am sure that the same industry, or other industries, also do not perfectly self-regulate on other issues, although one might generally say that, on average, the system works. It would probably be best to offer your line of reasoning to the reader.

Regarding the second point, I will admit I remain unconvinced. I can only imagine that it will be excruciatingly difficult for you to write the results section of your manuscript in a way that does not generalise beyond the two populations examined (the January and the July top 100 grossing games). Considering just the title of the current manuscript, “Assessing compliance with UK loot box industry self-regulation on the Apple App Store”, readers might not get the impression that this is merely a study of two specific lists of games at two arbitrary points in time. Similarly, the abstract says “[c]onclusions will be drawn as to whether the measures have been complied with by companies to an adequate degree”, which certainly suggests you think your findings can be used to infer something about companies in this industry generally (which I would agree with). Finally, although this may be more of a linguistic habit, you do refer to these lists of games as “samples” throughout the manuscript (the word appears 19 times, whereas “population” is not mentioned once).

I do not think this point is critical for the manuscript or the empirical work. Looking at the top 100 grossing games makes sense, as these are certainly the games where compliance with the regulation matters most. However, I do not see what is special about the top 100 grossing games that would justify excluding games not on this list from your interpretation. At the very least, you have not yet provided a compelling argument for this.


4. Finally, I appreciate that you have shifted from a programmatic RR to a standard one.


With kind regards

Malte Elson

Evaluation round #1

DOI or URL of the report:

Version of the report: 1

Author's Reply, 24 Nov 2023


Please find (i) my response to the recommender and reviewer comments and (ii) the manuscript file with all changes tracked separately attached below. Thank you!

All files (including a clean version of the manuscript with all changes confirmed) are available as one document via (version 1).

Decision by Malte Elson, posted 26 Oct 2023, validated 26 Oct 2023

Dear Dr. Xiao,


Thank you for your submission to PCI-RR. I have now had the opportunity to read the paper in depth, along with the excellent reviews provided to evaluate the merit of your research proposal.


All three reviewers – Dr. Przybylski, Dr. Chambers, and Dr. Gunschera – found your proposed study timely and the research question worthy of the in-depth investigation you suggest. I share this view: loot box regulation, and industry compliance with it, is a topic of increasing attention within the gaming community and the public sphere. As such, a study like the one proposed could easily become a material piece of evidence in the evaluation of policy effectiveness, and perhaps even affect compliance with regulation itself.


However, all three reviewers raised important concerns about the proposed design of the study. I fully concur with them and will add a few of my own observations below. Some of these points you might disagree with, and I invite you to provide counterarguments in a response letter. Others might be addressed by providing more detail and improving the clarity of the manuscript. And yet others, I believe, will require changes to your study protocol. The reviewers have offered guidance on how the study design and the writing of the manuscript might be improved – please consider these points as you prepare a revision of your research protocol.



Dr. Przybylski has remarked on the choice to include games not represented by Ukie, and that this weakens the severity of your test. I agree: if this study is designed to test compliance with self-regulation principles set by an industry trade body, then it does not seem ideal to include games that do not fall under this self-regulation and whose developers are not represented by Ukie. Whether there is a difference in regulation compliance between Ukie and non-Ukie games may itself be an interesting empirical question. I will leave it up to you whether to pursue this, but if you do, you need to account for it in your sample size and sampling strategy. For example, if only 10% of the top 100 games are actually represented by Ukie (or vice versa), a serious empirical estimate of the difference would probably not be within reach. If resources are an issue, as you state, then it may be advisable to include only those games that are represented by Ukie, at the price of narrowing the generalisability of your findings.

On this point, I also agree with the reviewer that the focus on the UK market should be reflected in the title and conclusions of the paper. Going further, I believe it would also be appropriate to highlight the focus on mobile games, as the sample is restricted to games on the Apple App Store.



Dr. Chambers and Dr. Gunschera both raised points regarding the definition of loot boxes in your study. Whereas Dr. Chambers asks whether one hour is enough to “encounter” a loot box, Dr. Gunschera raises concerns about the focus on loot boxes that can be bought with real currency rather than in-game currency. Both points are important, and I believe they share a common question: what is a loot box, empirically, in your study? Surely it is not the virtual representation as a box, nor can it be any in-game purchase, nor any chance-based event. As such, I invite you to provide further details on how you define and identify loot boxes in games, and by which means. You mention each game will be played for an hour. Does that mean “typical” game actions will be performed (as if you were a regular player), or will you just have the app open for this time? I am asking because it is conceivable that certain in-game actions are linked to loot box drops. Overall, the manuscript lacks procedural and methodological details that readers of the paper would surely appreciate.



There is another important point by Dr. Przybylski regarding how the sampling framework affects the conclusions from your observations: are you studying games or gamers? That is, if only games with a small following are noncompliant, then surely we would have to conclude that the problem is smaller than if the top games (by number of “encounters” with loot boxes) were noncompliant. I think this is a conceptual problem that deserves further attention, and one that may not be easily “fixed”, given that reliable numbers on the games’ market share might be difficult to obtain.



Dr. Przybylski and Dr. Gunschera have both remarked on the somewhat arbitrary choice of cutoffs to determine the compliance level. I, too, was confused about where they came from, and, to be honest, I wondered about the utility of defining cutoffs for the purpose of making a dichotomous decision in a hypothesis-testing framework when the empirical rate itself is of great interest (though I am happy to be convinced otherwise; maybe this just needs some justification). Exacerbating the problem further, the point estimates you propose using will suffer from substantial uncertainty. For example, an incidence rate of 95 in a sample of 100 games has a 95% confidence interval of 76.861 to 116.133, the lower bound being below your cutoff for “inadequate compliance”. Of course, I understand this is not a random sample of an unknown population of games: the top 100 are the top 100. Then again, I am sure you would prefer to generalise your findings to games not included in the sample.
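[Editorial note: the interval quoted above is consistent with an exact (Garwood) Poisson confidence interval for an observed count of 95. Assuming that construction, it can be reproduced in a few lines with `scipy`:]

```python
from scipy.stats import chi2

def poisson_exact_ci(count, conf=0.95):
    """Exact (Garwood) confidence interval for a Poisson count,
    built from chi-squared quantiles."""
    alpha = 1 - conf
    # Lower bound is 0 when the count is 0; otherwise chi2-based
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * count) if count > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (count + 1))
    return lower, upper

lo, hi = poisson_exact_ci(95)
print(f"{lo:.3f} to {hi:.3f}")
```

[That the upper bound exceeds 100 reflects the Poisson construction, which ignores the fixed denominator of 100 games; a binomial interval such as Clopper-Pearson would respect it. Either way, the recommender's point stands: the estimate carries substantial uncertainty around the cutoff.]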



Dr. Chambers raises a concern regarding your proposal to register this study as a programmatic RR. To be honest, I had overlooked this point until I read his review, but I tentatively agree: I currently do not see the value or necessity of two separate publications rather than one comprehensive paper encompassing all research questions and data. Of course, I cannot stop you from writing two papers rather than one, but if you do insist on submitting this as a programmatic RR rather than a single RR, please consider the guidance offered by the reviewer, highlight the different contributions of each paper, and explain why it is important or sensible to treat them separately.


With kind regards

Malte Elson

Reviewed by Andy Przybylski, 26 Oct 2023

Question 1A. The scientific validity of the research question(s)
Reply 1A. The question of whether companies comply with statutory or suggested regulatory initiatives is an interesting one to me. I approach this believing that there is very low compliance; the report suggests I should expect 1 in 3 games to comply if the UK is like the US. I am not quite sure that the research questions are research questions in the classic academic sense. To my reading, this is some form of policy or programme evaluation. I will defer to the editor on this point, but note that the UK focus should be consistent from title to interpretation.
Question 1B. The logic, rationale, and plausibility of the proposed hypotheses (where a submission proposes hypotheses)
Reply 1B. Given the UK-specific focus that justifies the research questions (and, by extension, the hypotheses), I am concerned by the framing of the research questions and how they’re translated into testable hypotheses. If this is indeed a study of industry practices in the UK, premised on principles articulated by UKIE, shouldn’t these hypotheses be focused on paid loot boxes in games that are represented by UKIE?
I do not believe it is a fair test of the principles to include companies that are not represented by UKIE. Like social media, and the online safety conversation more generally, this is a thorny problem. How might we regulate global tech industries (e.g., porn, social media, games) when these firms, and the decisions they take, are based in Beijing or Palo Alto? I think the VGRF and these principles are very good ideas, but I don’t think it is a fair test of their local effectiveness to examine top-grossing games in the UK if their creators are based in the USA (ESA), the EU (VGE), or other jurisdictions. I believe the UK has many smaller developers, but I am not sure whether they are represented in the top 100, or whether they are more likely to be found on mobile or on console/PC platforms. Is this the case?
Similarly, I’m not sure that the top 100 grossing games make sense as a sampling frame, given that I doubt these are equally profitable or popular games in the UK. For example, it might be the case that the top 4 or 5 games account for 80% of the play volume and spending, and the remaining 95 games of the top 100 are just 20% of the market. If these 5 games were 100% compliant with the principles, would you count this as 80% compliance or 5%? I think this materially affects all of the research questions, including the incidence/prevalence of probability disclosures.
Finally, without knowing the base rate of “Ask to Buy” use, I find it difficult to assess how well justified disregarding this feature is. I know I use this feature with our under-18s, and I would not allow our children to use the App Store at all without it. I think this would introduce an unknown source of error or uncertainty into any of the point estimates reported in the work.
Question 1C. The soundness and feasibility of the methodology and analysis pipeline (including statistical power analysis or alternative sampling plans where applicable)
I think this is a feasible approach if the base rate and geography issues above are tackled. I am not really sure where the 95%, 80%, and below-80% levels come from, though. Reading earlier in the report, I might expect the rate to be 35%. I could envision this being the baseline, with movement from this level (to, I would hope, something much higher) being the standard. My sense is that the author is interested in improvement, so how much improvement would be needed to know that progress is being made in the UK?
The author might also consider starting with a prior belief that there is a 50/50 chance a UK game creator is getting things right, and seeing whether this holds at the start of data collection and whether it has improved at the 6-month mark.
I do not understand how (or who at) the DCMS or UKIE would preregister their hypotheses (lines 473 and 474), or what the value of this would be. I don’t think most video game researchers would be able to do this.
Question 1D. Whether the clarity and degree of methodological detail is sufficient to closely replicate the proposed study procedures and analysis pipeline and to prevent undisclosed flexibility in the procedures and analyses.
I do not believe so. I think a detailed protocol with its own figure would be helpful. 
Question 1E. Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls; other quality checks) for ensuring that the obtained results are able to test the stated hypotheses or answer the stated research question(s).
I do not think so, but I am not sure that is a problem. As the study is framed currently, I do not see a situation where the hypotheses won’t be confirmed.

Reviewed by Chris Chambers, 20 Oct 2023

I enjoyed reviewing this Stage 1 RR – it tackles a timely and important research question and clearly spells out the rationale, hypotheses and proposed methodology. I am not a researcher in this area and will defer to experts for specialist assessments. Instead I focus my evaluation on issues that are generally relevant across most Stage 1 RRs. I hope my comments are helpful.
1. On the issue of sampling bias, I think you make a good point that we cannot know whether compliance is driven by the current changes or by prior external intervention (pp. 8-9), and that consequently this makes it difficult to generalise the eventual results to compliance rates more broadly. To address this point specifically, could it be useful to include an exploratory analysis at Stage 2 within the subset of top-100 games for which no previous intervention was known? Could a comparison be useful (even descriptively) between games subject to prior intervention vs. no prior intervention?
2. You have allocated 1 hour per game to detect loot boxes. How confident are you that this is long enough to detect loot boxes where they exist? I would recommend including some justification for this specific period. Ideally, the sensitivity of this test could be confirmed through evidence rather than intuition: e.g. the strongest case would be previous data confirming that in cases where loot boxes are known to exist, 1 hour is sufficient time to always detect them (and if the detection rate is less than 100%, then what consequence will this have on the sensitivity of the current design to test the hypotheses).
3. p15: “Stakeholders (specifically, the DCMS and Ukie) will be invited to preregister how they will interpret different potential results that may be found by the present study.” If possible, I would suggest inviting them to do this now and then including this pre-specification in the revised Stage 1 RR – that way they will be as bound by their prospective interpretation of the findings as you are.
4. On the issue of delisting resulting in loss of apps: To ensure an adequate sample size, I suggest anticipating the likely delisting rate and overrecruiting in Jan 2024 by that amount to maximise the probability that the July 2024 sample still includes the top 100 at that time (e.g. if a 5% delisting rate were to be expected then take top 105 games in Jan 2024).
5. Precision of hypotheses. The hypotheses are generally clear but I would recommend two changes. First, they should make explicit mention of the two time periods and whether the same predictions are made at each point. Second, even though there is no inferential statistical analysis, this is still quantitative hypothesis testing so the manuscript should include a study design template.
6. The Jul 2024 period is very reasonably at the conclusion of the implementation period. I am wondering however if there would be any value in pushing this back to Aug 2024 to capture any possible delays in compliance? I don’t know enough about this area or the regulatory frameworks that operate, but is there any possibility that a company could intend to comply but just be a few weeks late? By allowing a post-implementation “grace period” of e.g. 1 month (from Jul to Aug), would the demonstration of low compliance rates be a more powerful signal to stakeholders and provide less wriggle room for non-compliant companies to plead minor delays? I will defer to the author’s judgment on this point and note it for consideration only.
7. My final comment is about the programmatic nature of the submission. I can certainly see the value of separately evaluating compliance during and following the implementation period. However, it also seems to me that the final results will be more coherent as a single encapsulated Stage 2 RR rather than two RRs. I am also not sure that the pre vs post implementation components are sufficiently substantive to justify 2 x Stage 2 outputs under Stage 1 criterion 1C (though I concede I am viewing this through a non-specialist lens and do not intend to devalue the amount of labour involved). Also: A programmatic Stage 1 RR typically includes separate sections to explain which specific parts of the proposal will be presented in the different outputs, sometimes going as far as to indicate different font colours to show which text will go in which manuscripts, and these details are always specified in advance (e.g. see here and here for examples). So, in the event that the submission ends up being programmatic, some similar structural work will be needed here.
Lines 152-155: I struggled to parse this sentence.

Reviewed by Lukas J. Gunschera, 16 Oct 2023

The manuscript at hand addresses an important issue: the compliance of the mobile game industry with UK self-regulatory loot box measures. This work is timely and will make a great contribution to the literature and to policy concerning gaming consumer protection. That being said, I have found that the manuscript may be improved in the following areas.


1) The scope of the present manuscript concerns loot boxes purchased with real currencies as opposed to currencies obtained in-game. Although this distinction is common in the literature, I believe it warrants elaboration, and I think the proposed work would benefit from recording data on all possible avenues of purchasing loot boxes (i.e., whether players have the option to purchase the loot box with in-game currencies in addition to real currencies). I believe this is informative because the gambling-like characteristics of loot boxes persist irrespective of the currency used to obtain them. The value of any currency, whether real or virtual, is learned. Therefore, beyond the concerns for parents’ wallets, the psychological effects of loot box purchasing may span the currencies used to purchase them.

Furthermore, the psychological effects of loot boxes may even be strengthened for purchases with currencies obtained in-game, as opposed to money. Players who have invested many hours into obtaining said in-game currency may perceive this to be a much larger investment than money, especially when the money comes from their parents’ wallet. While I understand that the distinction between real-world and in-game currencies is common, I believe it would be worthwhile to collect information on the currencies that can be used to obtain loot boxes (money, in-game, both) for each of the 100 mobile games (ll. 87-93, 364-368).


2) Despite resource constraints and stakeholders’ heightened interest in the highest-grossing mobile games, the sample size rationale is insufficient. A power analysis or simulation would help determine which effects the study would be sensitive to, especially given that precise decision cut-offs are specified for all hypotheses (ll. 245-254).
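[Editorial note: a sensitivity simulation of the kind the reviewer suggests could be sketched as follows. The 95/100 cutoff and the true compliance rates below are illustrative assumptions, not values from the manuscript.]

```python
import random

def prob_clearing_cutoff(true_rate, n_games=100, cutoff=95,
                         sims=10_000, seed=1):
    """Estimate the probability that the observed number of compliant
    games (out of n_games) meets or exceeds the cutoff, for a
    hypothetical true compliance rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # Each game is independently compliant with probability true_rate
        compliant = sum(rng.random() < true_rate for _ in range(n_games))
        if compliant >= cutoff:
            hits += 1
    return hits / sims

for p in (0.90, 0.95, 0.99):
    print(f"true rate {p:.2f}: P(observed >= 95) ~ {prob_clearing_cutoff(p):.3f}")
```

[In this sketch, even a true compliance rate of 95% clears the 95/100 cutoff only about 60% of the time, illustrating why a point estimate near a cutoff is a fragile basis for a dichotomous verdict.]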


3) For Hypothesis 4, the decision criterion differs from that of the preceding hypotheses. Please add a brief explanation for this change (ll. 227-230).


4) Overall, the manuscript would benefit from some copy-editing. This includes breaking up long and convoluted sentences; using accessible language as opposed to unnecessarily complex words; and using precise and objective wording. Some examples below:

ll. 152-155 Convoluted sentence structure

ll. 171-174 Complicated wording

ll. 196-198 Subjective/moral wording

