URL to the preregistered Stage 1 protocol: https://osf.io/5mxp6
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
DOI or URL of the report: https://osf.io/f3yab/?view_only=33e5875516d144ed98509c7871242b31
Version of the report: v4
Dear Leon Xiao,
Thank you for submitting the revised Stage 1 of the manuscript. You’ve done comprehensive work and carefully responded to all requests. All the reviewers are satisfied with the revisions, and I agree that the manuscript is now closer to the IPA. However, I must request a few more revisions before we meet the PCI RR standards.
1. Regarding the philosophy of hypothesis testing in general, there should always be a good reason for testing a hypothesis (PCI RR criterion 1B). Although a lot of relevant information is included in the manuscript, we are still lacking explicit justifications, i.e., *why* the study expects each hypothesis to be true. With H1 and H2, for instance, I would agree that because loot boxes are ‘banned’ in the country, there is a good reason to expect virtually no loot boxes at all. But you should explicitly provide such a rationale, e.g. “Based on loot boxes being banned in Belgium, there will be no loot boxes in Belgium”. The justification can be inside or before/after the hypothesis, but there must always be a clear justification.
2. As an example of the above issue: on page 8, after stating the hypotheses, a conflicting interpretation of H1 and H2 seems to occur: “Hypotheses 1 and 2 mean that a Belgian loot box prevalence rate of *less* than or equal to 2% will be found”, while the actual H1 and H2 state a *higher* prevalence. Lower prevalence, again, is suggested on page 13, but higher is suggested in the table at the end of the MS. Please clarify and justify which result is expected. Also, please note that, as currently H6 expects that at least 3 games will include loot boxes, this would conflict with expecting less than 3% loot box prevalence (H1 and H2). Clearly stating the rationale in each case will solve this.
3. Justification is also needed in Type 1 error control. It currently reads: “The rate of 2% was chosen instead of 0% to provide type 1 error control.” But one needs to justify why 0.02 was chosen as a control. I provided one such example in my previous letter (a previous study found 1 false positive, which was doubled to be safe). As I have now read your new commentary paper, which demonstrates a disagreement rate of no less than 22.9% (!) between two studies, it might be justifiable to use even more control than 2% (one of the examples in the commentary reminds me of the problems of assessing the level of skill when evaluating whether a mechanic should be considered gambling/a loot box, see:
Lipton, M. D., Lazarus, M. C., & Weber, K. J. (2005). Games of skill and chance in Canada. Gaming Law Review, 9(1), 10-18.)
(I mention this paper just in case it might be useful later)
To be clear, it is up to you to decide what error control to use, and you can definitely keep 2%, but whatever number you use, please justify it (at least briefly explain why it was chosen). For these hypotheses, perhaps also address the possibility that some loot boxes operate with a license; if such evidence is found, those will not count towards H1/H2. In general, it could be useful to refer to “unlicensed loot boxes” maybe?
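For illustration of why the size of the error control matters, one can compute how likely coding errors alone are to push the observed count past a given threshold. The sketch below (hypothetical per-game false-positive rates; the 2-positive threshold corresponds to the 2% control discussed above) is only an assumption-laden illustration, not a recommendation of any particular rate:

```python
from scipy.stats import binom

n_games = 100    # sample size of the Top 100 list
threshold = 2    # H1/H2 corroborated only if at most 2 positives are found

# Hypothetical per-game false-positive coding rates (assumptions for illustration)
for fp_rate in (0.01, 0.02, 0.05):
    # P(X > 2) for X ~ Binomial(100, fp_rate): chance that false positives
    # alone exceed the threshold even if the true prevalence is zero
    p_exceed = binom.sf(threshold, n_games, fp_rate)
    print(f"fp rate {fp_rate:.0%}: P(>2 false positives) = {p_exceed:.3f}")
```

Such a calculation could serve as the brief justification requested above, once a defensible false-positive rate is chosen.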
4. Regarding H4 and H5, there seems to be no justification and, as you say, the numbers are based on mere intuition (theoretically, you could craft 101 unique hypotheses for the prevalence and one of them would be true!). Therefore, I must ask you to remove these hypotheses (unless a good justification is found). When a research field is not yet at a stage where hypotheses can be crafted based on good existing knowledge, more work is needed before hypothesis testing can be started. See e.g.,
Scheel, A. M., Tiokhin, L., Isager, P. M., & Lakens, D. (2021). Why hypothesis testers should spend less time testing hypotheses. Perspectives on Psychological Science, 16(4), 744-755.
That said, you can surely provide an estimation of how many loot boxes there might be, but that wouldn’t be hypothesis testing (there would be no hypothesis to test, but RRs can also be used for transparent estimation in order to provide unbiased estimates).
Although this goes beyond the present study, I could imagine that reviewing the global literature and data on gambling regulation in general could yield a reasonable prior concerning the effect of successful regulation strategies. For instance, we do know that illegal gambling exists around the world, but perhaps less so in successfully regulated countries. Whether loot box regulation is useful or successful could be, in the future, based on this previous knowledge of regional (online) gambling regulation. But again, I highlight that using this kind of (or similar) approach in the present study could lead to new issues and the need for new reviews, for which removing H4 and H5 is likely the better option (again: you may still discuss the level of prevalence when the results are known, but you cannot make related *confirmatory* claims).
5. Binomial testing in H3 should be ok now and the justification for 0.65 is reasonable. However, I am still concerned about the power analysis. 0.15 has been chosen as the effect size, but there is still no justification for why that effect is suitable. So again, we need an explicit justification (PCI RR criterion 1C). What would be the smallest effect that is meaningful for this study and why? The paper I referred to previously also provides a good overview regarding different ways for defining a meaningful effect.
Dienes, Z. (2021). Obtaining evidence for no effect. Collabra: Psychology, 7(1), 28202.
In other words, how few (= how much less than 0.65) loot boxes should there be in Belgium for that number to be relevant at all? This question needs to be answered before statistical power can be calculated.
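Once a smallest effect of interest is justified, the power of the exact binomial test follows directly. As a sketch only (assuming, purely for illustration, the 0.65 comparison rate and the currently stated 0.15 effect, i.e., a true Belgian prevalence of 0.50):

```python
from scipy.stats import binom

n, p0, alpha = 100, 0.65, 0.05   # sample size, comparison rate, significance level
effect = 0.15                    # assumed smallest effect of interest (illustrative)
p_true = p0 - effect             # true prevalence under the alternative

# Critical value: largest count k with P(X <= k | p0) <= alpha (one-sided, "less")
k_crit = max(k for k in range(n + 1) if binom.cdf(k, n, p0) <= alpha)

# Power: probability of landing in the rejection region if p_true holds
power = binom.cdf(k_crit, n, p_true)
print(f"reject if count <= {k_crit}; power = {power:.3f}")
```

The same calculation, re-run with a properly justified smallest effect, would replace the current power analysis.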
6. Still related to H3, both 2-sided and 1-sided tests seem to be carried out, with no reason given for the duplication. Please delete the 1-sided tests, which are duplicates.
7. Regarding RQ1 and RQ2, I would suggest combining them along the following lines: “Has the Belgian ban succeeded in eliminating paid loot boxes from mobile games?” You can answer this RQ via both H1 and H2.
8. Regarding RQ3, I would suggest simplifying it along the lines “Has the Belgian ban on paid loot boxes been effective?” This is what you test with H3.
9. If you remove H4 and H5, you can also remove RQ4. Of course, the results will still include the prevalence rate, so scholars can speculate about the exact effectiveness of the ban post hoc. RQ5 is good.
10. … going briefly back to the hypotheses, with error control now in H1 and H2, you have rephrased them as “More than two…” Please note that hypotheses are not statistical statements, but they apply to the world in general. Essentially, we’re expecting the absence of loot boxes, and error control only reflects our awareness that testing and methods aren’t perfect. So, the hypotheses can well be “The highest-grossing iPhone games in Belgium do not contain paid loot boxes” -- and this will be accepted even if 1 or 2 do contain loot boxes, but only because we acknowledge the possibility of error in analysis/methods (alternatively you can include the justification directly inside the hypothesis, as exemplified in #1)
11. The above applies to H3 as well. I would suggest following Macey’s suggested wording with the necessary modification: “Of the highest-grossing iPhone games, fewer will contain paid loot boxes in Belgium than in countries that have not banned loot boxes.”
12. If you wish, H6 could also be clarified more (but this is optional, as it’s not making explicit statistical statements): “Games known to contain paid loot boxes will continue to offer them for sale even when the phone is within geographical and jurisdictional Belgium.”
13. Following the above, it is also not clear what outcome will corroborate H6 (or null). E.g., if one game continues to offer loot boxes but 2 games do not, what would be the conclusion? The criteria for interpreting the results for a hypothesis must always be clear.
A few smaller notes/suggestions, which may be considered.
- On page 2 it reads: “there are two types of loot boxes” --> perhaps rephrase into “loot boxes can be divided into two types” (because there are dozens of different types of loot boxes)
- On page 3 it reads: “and therefore does not possess real-world monetary value” --> consider specifying e.g., “direct real-world monetary value” (because accounts can still be sold onward, right?)
- It reads that “The following hypotheses will be preregistered at <[OSF registry link]>” but since this is an RR, you don’t need to separately register hypotheses.
- On page 11 it reads: “A ‘paid loot box’ will be defined as being either an Embedded-Isolated random reward mechanism or an Embedded-Embedded random reward mechanism” --> please explain to the reader what these concepts mean
- On page 11 it reads: “95.4% of games were coded through gameplay and only 4.6% of games had to be coded through internet browsing.” Does this refer to games in general or to games with loot boxes? Since the percentage of games that must be coded via the internet depends on the prevalence rate (e.g., with 0.5 prevalence, 50% of the data would need to be coded via the internet), the percentage among games with loot boxes would be more informative.
- On page 15, I would still suggest removing the following sentence: “… and conclude that the Belgian measure was likely ineffective.” Already the anecdotal evidence cited in the manuscript shows that the ban has had some effect (= some companies adjusted their design), so it feels wrong to conclude that the measure was likely ineffective, unless direct evidence is found (and justified what effectiveness would be). The section reads very well otherwise, so I suggest just dropping this sentence.
I hope this feedback is helpful and I believe we are very close to IPA after the above revisions have been implemented. Please let me know if any of the comments are unclear -- I will be happy to clarify.
- Veli-Matti Karhulahti
DOI or URL of the report: https://osf.io/8fvt2/
Dear Leon Xiao,
Thank you for submitting your Stage 1 manuscript to PCI RR. To my knowledge, this is the first RR in the domain of law, and as such a highly interesting manuscript to handle. I have now received all three reviews, collectively representing expertise in gaming and law as well as the related methods. The reviews are very positive, but they also highlight issues that need revision. In general, the reviewers are consistent and do not express conflicting views, so I will merely follow up on some of their points and add a few comments of my own. I start by moving chronologically through the MS with minor issues, and in the end, I discuss some bigger methodological issues.
1. On page 2, I would remove the part “rather than, e.g., wealthy players” because wealthy players can also be at-risk players. Later on the same page, a parenthesis is not closed (“and therefore be…”).
2. On page 5, you introduce the Netherlands for the first time, while previously addressing only Manx and UK law. It would improve readability to briefly note earlier that the Netherlands is also a candidate (yet its role differs from the other two).
3. On page 6, you introduce (i) and (ii), but as one reviewer points out, they are not stated as RQs. Please reformulate them into explicit RQs. When doing that, carefully ensure that your hypotheses/methods match with and are able to answer the RQs. If you have a reason to do otherwise, please explain in the response letter.
4. On page 7, you note how the Belgian Gaming Commission will be contacted and their response discussed. I expect the response will take time and it is possible that you will not have it by the time of Stage 2 review; thus, I suggest obtaining permission to share their response publicly and storing it in the OSF when it arrives (if after Stage 2). Alternatively, if no permission for sharing is gained, you could add a summary to the OSF so that future readers will find the information via DOI.
5. On the same page, a small typo (just before “because”)
6. Methods: as the reviewers point out, the time of data collection is very critical. If possible, I would suggest finding out the Top 100 list in Belgium for the compared time (June 2021). While post hoc analysis is not possible, it would at least allow the reader to assess the fluctuation of the titles on the list (and perhaps you to address that briefly in the discussion, if relevant).
7. On page 10, you say that “game will be assumed by the coder to contain paid loot boxes without the need for the coder to identify and screenshot such a mechanic.” I might have missed something, but I don’t see how third-party involvement would automatically ensure that loot boxes are present. E.g., if there is a known avenue for generating paid loot boxes in sandbox games that cannot be interfered with by companies, please cite that.
8. On page 13, you have the ethics statement “No ethics approval will be required because the present study examines and records publicly available information.” Please elaborate according to which university/country this applies. Different universities and countries have different ethics policies (e.g., according to the ethics policy of IT University of Copenhagen and the Danish Code of Conduct for Research Integrity, the study did not require ethics assessment).
a) I notice that the prevalence of 0.77 comes from a preprint that has not been reviewed yet. I am flagging it because if the paper remains un-reviewed at Stage 2, or if its peer review ends up affecting the result 0.77, there is a chance that this RR would have to be rejected based on criterion 2B (changes in hypothesis). As it is your own co-authored paper, we can proceed without changes; however, you should be aware that at Stage 2, if the results of the cited paper are still pending, we possibly cannot provide IPA.
b) Related to the above, I can see that in a previous study Zendle et al. (2020) found 0.59 in the UK, and this number is from 2019. Although I understand that you prefer to use a more recent prevalence rate, we do not have evidence that the change is due to time alone, i.e., there can be variation in samples for other reasons, too. Considering that your to-be RQs are interested in whether Belgium has a lower rate vs other countries, and previous studies have found UK 0.59/0.77, Australia 0.62, and China 0.91, a bit more justification is needed why 0.77 has been selected, and as one reviewer points out, why would 0.4 be a remarkably low prevalence.
c) There is an issue with the method for testing H3. You are planning to use the binomial test, which assumes that the Belgian Top 100 list provides a random variable, but the compared UK Top 100 list is fixed. Yet since both lists produce outcomes as similar random variables, what you seem to need for testing is a 2x2 contingency table.
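To make the suggested design concrete, a 2x2 table treats both lists as samples and tests the difference in proportions. A minimal sketch (the Belgian counts are entirely hypothetical; only the 0.77 UK rate comes from the cited preprint):

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

# Rows: Belgium Top 100, UK Top 100; columns: loot box, no loot box.
# Belgian counts are hypothetical placeholders for illustration.
table = np.array([[40, 60],
                  [77, 23]])

# One-sided Fisher exact test: is the Belgian rate lower than the UK rate?
odds_ratio, p_fisher = fisher_exact(table, alternative="less")

# Chi-square test of independence on the same table (two-sided)
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Fisher exact (one-sided) p = {p_fisher:.4g}")
print(f"Chi-square p = {p_chi2:.4g}")
```

Either test would replace the one-sample binomial test while keeping the rest of the design intact; the one-sided Fisher test maps most directly onto the directional H3.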
d) I also encourage thinking about the comment from one reviewer regarding the effect, i.e. what effect would be societally beneficial for a regulation like this to be useful in practice. Depending on how you proceed, you will then also need to recalculate power, depending on what your final RQ + hypotheses + method is. Please note that PCI RR does not demand any particular power as long as it is justified; however, some journals do, so you should double check that if you have a specific journal in mind (see the next point).
e) I would also like to highlight some parts of the conclusions. On page 12, it says: “if no significant difference is found, then the present study will conclude that the Belgian ban did not appear to affect paid loot box prevalence in Belgium, thus disconfirming Hypothesis 3. The present study will then conclude that the measure is likely ineffective and should not be adopted by other countries.” However, not being able to find an effect is not the same thing as finding evidence for no effect. You cannot conclude ineffectiveness based on non-significance alone. If you want to obtain evidence for no effect, see e.g., Dienes (2021)
Dienes, Z. (2021). Obtaining evidence for no effect. Collabra: Psychology, 7(1), 28202.
f) This is a small issue, but in H1 and H2, one reviewer notes how an absolute null might not be optimal, and I agree some type 1 error control would be appropriate here. I would suggest keeping it simple, e.g., considering that Zendle et al. (2020) found 1 false positive, you could just double that to be safe and corroborate H1/H2 if no more than 2 cases occur (no more than 2% prevalence). Although confirming the positives should be rather easy due to the sample size, some control seems reasonable because many things can affect the obtained sample. If you wish, you may add that in case H1 or H2 is not corroborated but 1 or 2 instances are found, these games will be investigated in-depth as an exploratory analysis. I also encourage you to consider setting alternative/competing hypotheses, as suggested by one reviewer (but you can choose not to, if you so prefer).
Finally, I must ask you to revise the table at the end of the MS. Please include all tested hypotheses, and carefully think in each case what can and cannot be deduced from their outcomes. In addition to all the above, please see and respond to the reviewers’ respective feedback. Needless to say, if you disagree with some of the requested revisions, you are free to justify alternative choices. Do not hesitate to contact me if something is unclear. I look forward to reading the next version, based on which I will see if another external review round is needed.