Evaluation of an immersive virtual reality wayfinding task

based on reviews by Conor Thornberry, Gavin Buckingham and 1 anonymous reviewer
A recommendation of:

Evaluation of spatial learning and wayfinding in a complex maze using immersive virtual reality. A registered report

Submission: posted 31 March 2023
Recommendation: posted 04 September 2023, validated 08 September 2023
Cite this recommendation as:
McIntosh, R. (2023) Evaluation of an immersive virtual reality wayfinding task. Peer Community in Registered Reports, .


The Virtual Maze Task (VMT) is a digital desktop 2D spatial learning task that has been used for research into the effect of sleep and dreaming on memory consolidation (e.g. Wamsley et al, 2010). One limitation of this task has been low rates of reported dream incorporation. Eudave and colleagues (2023) have created an immersive virtual reality (iVR) version of the VMT, which they believe might be more likely to be incorporated into dreams. As an initial step in validating this task for research, they propose a within-subjects study to compare three measures of spatial learning between the 2D desktop and iVR versions. Based on a review of relevant literature, the prediction is that performance will be similar between the two task versions. The planned sample size (n = 62) is sufficient for a .9 power test of equivalence within effect size bounds of d = -.47 to .47. Additional independent variables (gender, perspective-taking ability) and dependent measures (self-reported cybersickness and sense of presence) will be recorded for exploratory analyses.
The study plan was refined across four rounds of review, with input from two external reviewers and the recommender, after which it was judged to satisfy the Stage 1 criteria for in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA.
List of eligible PCI RR-friendly journals:
Eudave, L., Martínez, M., Valencia, M., & Roth, D. (2023). Evaluation of spatial learning and wayfinding in a complex maze using immersive virtual reality. A registered report. In principle acceptance of Version 5 by Peer Community in Registered Reports.
Wamsley, E. J., Tucker, M., Payne, J. D., Benavides, J. A., & Stickgold, R. (2010). Dreaming of a learning task is associated with enhanced sleep-dependent memory consolidation. Current Biology, 20, 850–855.
† There is one minor change that the authors should make to the Methods section, which is sufficiently small that it can be incorporated at Stage 2: "if both tests reject the null hypothesis (observed data is less/greater than the lower/upper equivalence bounds), conditions are considered statistically equivalent" >> suggest changing "less/greater" to "greater/lesser" for correct correspondence with "lower/upper".
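The decision rule quoted in the note above can be sketched in Python. This is a minimal illustration, not the authors' analysis code: it assumes paired differences (for example, iVR minus desktop) and equivalence bounds expressed in raw units, and it uses a normal approximation in place of the t distribution so that it runs with the standard library alone. The function name, the default alpha of 0.02 and the example data are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, low, high, alpha=0.02):
    """Two one-sided tests on paired differences (normal approximation).

    Returns (p_lower, p_upper, equivalent).
    """
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist()
    # Test 1: H0 is "mean <= low". It is rejected when the observed
    # mean is reliably GREATER than the lower bound.
    p_lower = 1 - z.cdf((m - low) / se)
    # Test 2: H0 is "mean >= high". It is rejected when the observed
    # mean is reliably LESS than the upper bound.
    p_upper = z.cdf((m - high) / se)
    # Equivalence is declared only if BOTH one-sided tests reject.
    return p_lower, p_upper, max(p_lower, p_upper) < alpha


# Hypothetical paired differences, for illustration only.
diffs = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.05]
p_lower, p_upper, equivalent = tost_paired(diffs, low=-0.5, high=0.5)
print(p_lower, p_upper, equivalent)
```

The comments on the two tests show the correspondence the note asks for: "greater" goes with the lower bound and "lesser" with the upper bound.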
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #4

DOI or URL of the report:

Version of the report: 4

Author's Reply, 02 Sep 2023

Dear Editor,

Thanks for your comments and suggestions (not annoying at all). We understand our message and intention must be clear to everyone, so we've tried to make modifications based on your comments, and hopefully the rationale is easier to understand now. If further changes need to be made, please let us know (thanks for the patience, as well).

Luis (and team)

Decision by , posted 30 Aug 2023, validated 31 Aug 2023

Thanks for your further revision of this paper. We are getting closer to an acceptable version, but each time that I try to formulate my recommendation (which involves a close re-reading of the paper), I find that I remain confused or dissatisfied with parts of the manuscript/plan.

In the current version:

1) You have now relegated the statistical tests relating to cybersickness, sense of presence and perspective-taking to exploratory status, and removed these analyses from the design table. However, you still have a paragraph in Methods (p19) that describes the planned analyses, including the alpha level, approach to multiple comparisons etc. (You also refer here to "within-group comparisons", which I am not sure I understand.) This is an uneasy half-way house, because you are designating these analyses as exploratory, and yet you are effectively pre-registering them, but only in a relatively imprecise way (and without a priori consideration of their sensitivity). It would be preferable to preserve a clean distinction between registered and exploratory components by not describing these analyses a priori.

I understand that in order for the reader to appreciate why the extra measures are included, you may wish to refer briefly to exploratory analyses of perspective taking ability, self-reported cyber-sickness and sense of presence, but you could leave the precise form of these analyses open until Stage 2.

2) It seems obvious why you would wish to compare cybersickness and sense of presence between task versions, but it is much less clear why you would compare perspective-taking, which seems more like a (presumably) stable measure of spatial ability. I can't find any clear rationale for the inclusion of this as a dependent variable (although you do also mention its possible inclusion as a predictor for a regression analysis, which makes some more sense). Nor do I follow why you introduce the PTT as a test of 'spatial learning skill' - it does not seem to include any measure of learning.

3) Your power analysis for the equivalence test seems (as far as I can tell) to be appropriate, but the description is not coherent or detailed enough. You state that you determined the SESOI as "the mean critical effect size (maximum effect size that would not be statistically significant) product of of previously mentioned comparisons, resulting in a SESOI of d = 0.47." Aside from the typo ("of of"), the "mean critical effect size ... product of ... previously mentioned comparisons" is too hard to follow, and needs better explanation.

Similarly, when you describe the critical power analysis, you state "... we conducted a series of sensitivity power analyses based on the two one-sided tests procedure for equivalence testing (TOST, Lakens, 2017) for dependent samples. With a statistical power of 90% and an alpha set at 0.02, the estimated sample size is 62 pairs/participants." This description does not seem precise. For instance: What is "a series of sensitivity power analyses"? What do you mean by "sensitivity power" (is it a typo)? What do your analyses actually consist of? What is it that you have power to detect? Is it the power to detect equivalence within your equivalence bounds? If so, then what is your power to detect differences (since your tests will also include these)? How does your conclusion about equivalence depend upon the set of outcomes across your different dependent measures? Do they all need to be equivalent in order to conclude equivalence overall? etc...

These points are not well explained in the appendix, which merely refers the reader to a downloadable spreadsheet from Lakens that will allow them to recalculate power for themselves (once they can work out for themselves how to use the spreadsheet). In any case, it would not be acceptable to offload the explanation of the TOST procedure (and power for it) to an appendix, because this is a critical part of the study design.
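As one possible reading of the sample-size statement quoted above, a normal-approximation formula for a paired TOST, assuming a true effect of zero and symmetric bounds, gives a figure in line with the 62 participants reported. This is a sketch under those assumptions, not necessarily the calculation that the authors or the Lakens spreadsheet perform. The function name is hypothetical, and the bound of 0.47 is treated here as a standardised paired difference.

```python
from math import ceil
from statistics import NormalDist


def tost_paired_n(alpha, power, bound):
    """Approximate number of pairs needed to declare equivalence.

    Assumes the true difference is zero, symmetric bounds (-bound, +bound)
    in standardised units, and a normal approximation:
        n = (z_{1-alpha} + z_{1-beta/2})^2 / bound^2
    where beta = 1 - power.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - (1 - power) / 2)
    return ceil((z_alpha + z_beta) ** 2 / bound ** 2)


# Values taken from the text: alpha = .02, power = .90, SESOI d = 0.47.
print(tost_paired_n(alpha=0.02, power=0.90, bound=0.47))
```

Note that this addresses only the power to detect equivalence within the bounds. The power to detect a difference, and the way conclusions combine across several dependent measures, are separate questions that the formula does not answer.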

4) When I try to formulate my recommendation text for your study, I find that I am somewhat at a loss to pinpoint exactly what the point of your study is. I understand that you are testing for equivalent spatial learning (within certain bounds) between desktop and iVR versions of the VMT, on the basis that if they are equivalent then the iVR version could be considered as a substitute for the desktop version in some experimental contexts. But what if the iVR version shows greater evidence of spatial learning than the desktop version? Or lesser evidence of spatial learning? Would these outcomes mean that it could not be used as a valid version of the VMT?

Overall, it is clear that your main purpose is to compare spatial learning between task versions, but it is less evident why, in that it is not clear what you will conclude about the appropriateness of the iVR version given each of the different possible outcomes (including non-equivalence).

I hope that these concerns are clear and I hope that they do not seem too annoying. I am afraid that I cannot write a recommendation for this study plan until I am sure that I understand it, and confident that another reader would likewise be able to. To this end, it seems that there is still further work required.

Best wishes,

Rob McIntosh


Evaluation round #3

DOI or URL of the report:

Version of the report: 3

Author's Reply, 30 Aug 2023

Dear Editor and Reviewers, 

Attached you will find the reply and tracked manuscript files for Round 3. We'd like to thank you once more for your work and thoughtful suggestions, which have definitely improved this project.

Luis (and team)

Decision by , posted 24 Aug 2023, validated 24 Aug 2023

Thank you for revising your Stage 1 RR to address previously identified issues of power. The manuscript has been seen again by both reviewers, who are happy with the changes made (GB reiterates the point that your chosen level of 0.8 power with alpha .05 will limit the set of possible publication venues, but I assume you have already considered this point).

However, we are still unable to provide IPA for the manuscript, for the reasons sketched below:

1) The power analysis relates to performance of the VMT, but this covers only one of the hypotheses. The second (set of) hypotheses relating to cyber-sickness and sense of presence ask quite different questions, but no consideration is given to the effect sizes of interest for these comparisons, or the adequacy of the sample size to these questions.

2) When considering this issue, you need to bear in mind the following further complications: (i) The tests are predicated on being able to detect a significant difference (in either direction) OR equivalence (by TOST), and your sensitivity for each of these possible outcomes may differ. (ii) Your conclusions will depend upon the combination of outcomes across multiple tests, specifically you state that the tests will be conducted on the 3 subscales of SSQ and the 4 subscales of ITC-SOPI. You need to make it explicit how your conclusions will be informed by the combination of outcomes across the subscales, and to correct the required alpha for multiple comparisons if appropriate. (iii) It is not clear where the PTT fits in to your research questions/hypotheses tests.

(If these complications cause too many problems, then you could consider relegating these further questions to a secondary exploratory status, and removing from the Stage 1 plan.)

3) The within-subjects design improves the power of your study, in principle, but it does create other issues. Specifically, is there a possibility of transfer effects between days; that is, might learning the task on one day (in one format) be expected to influence baseline performance (and thus opportunity for learning) on the second day? Unless I missed it, you do not even specify whether the same or different maze will be presented on each day, but these details seem potentially very important.

4) As a very minor issue, you state that participants will be randomly allocated to one of two task orders. I assume that allocation is not truly random if you intend to ensure that there are equal numbers for each order. Therefore, the allocation schedule may need to be stated more precisely.
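One way to state such an allocation schedule precisely is as a shuffled list containing equal numbers of each task order. The sketch below is illustrative only: the order labels, the function name and the use of a fixed seed are assumptions, not details taken from the study plan.

```python
import random


def balanced_allocation(n_participants,
                        orders=("desktop-first", "iVR-first"),
                        seed=None):
    """Return a task-order schedule with equal numbers per order.

    Allocation is random in sequence but constrained in totals, so it is
    not simple (independent) randomisation of each participant.
    """
    if n_participants % len(orders) != 0:
        raise ValueError("n_participants must be divisible by the number of orders")
    schedule = list(orders) * (n_participants // len(orders))
    # A dedicated Random instance keeps the schedule reproducible from the seed.
    random.Random(seed).shuffle(schedule)
    return schedule


schedule = balanced_allocation(62, seed=2023)
print(schedule.count("desktop-first"), schedule.count("iVR-first"))
```

Participants would then be assigned orders in sequence from the schedule as they enrol.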

Reviewed by , 08 Aug 2023

Thank you for the changes you have made to the protocol - I think the within subjects design seems like a more sensible approach in this case. My one further suggestion would be merely an advisory one (perhaps the editor can weigh in) about the power calculation (and thus sample size) - the eventual outlet may well have more stringent requirements than the 'bare minimum' alpha = .05 and power = 0.80 calculated in the revision (e.g., Cortex uses alpha = .02 and power > 0.90). If this is a possible concern, then I'd recommend re-running the power calculation with 0.9 power, which seems appropriate given the hypotheses.

Reviewed by , 23 Aug 2023

The authors have accurately addressed my concerns from before. 

The comments addressed by the authors from Round 2 have also in a way answered some of my other concerns.

I believe this will make for a useful and important manuscript for the VR/Spatial Cognition community. I have some final notes:

  • I now understand why the authors have used three trials. Considering this is based on previous research, it makes sense for this manuscript. However, I would strongly encourage any conclusions drawn about spatial learning to address this limitation. 

  • The introduction addressed my concerns clearly and also has a nice flow.

  • Thank you for providing OSF and GitHub links within the manuscript.

  • Gender differences are an important one, and I appreciate you including it. However, I also appreciate that this was not part of the proposed hypotheses. It would be interesting to see if they are having an impact as some Desktop software can eliminate the classic water maze gender effect.

  • I would like to praise the authors for the inclusion and construction of a Spanish version of the PTT. This is great, and I hope it will provide you with some interesting perspectives on variable spatial ability within different virtual environments (Does better PTT ability facilitate better desktop or iVR performance?). I understand this meant changing how the experiment would be run, so thank you for this.

I look forward to reading the completed manuscript.

Evaluation round #2

DOI or URL of the report:

Version of the report: 2

Author's Reply, 20 Jul 2023

Dear Editor and Reviewers, 

This version (v3) includes a solution to the issue regarding the estimated SESOI (and equivalence bounds) and the sample size calculation. We hope that these changes, along with the modifications from the previous round of reviews, will make this study a more suitable candidate for a Registered Report.  

Decision by , posted 05 Jul 2023, validated 05 Jul 2023

Thank you for sending this revised Stage 1 manuscript, with replies to reviewers. Having looked at your replies, I see one potentially major issue, and I think it most sensible to return the manuscript directly to you for further consideration before asking for reviewers to devote more time to this.

At the first round, Reviewer#2 raised the critical issue that it seems relatively unimportant to test for equivalence, where this is defined by an effect size for the difference smaller than d = .77, given that this actually describes a very large effect size. That is, your test would be prepared to declare equivalence between tests even if a statistically large difference between the tests existed - it is hard to see this as a useful form of equivalence.

In your response, your main line of reasoning is that d=0.77 is appropriate for your purposes because VMT performance scores are "wildly variable within and between participants... so only when the difference is large enough we should expect to find significant effect". This comment seems to reflect a misapprehension of what your measure of effect size (d) represents. Cohen's d is a standardised measure of effect size that is expressed in units of SD (it is the mean difference between groups divided by the pooled standard deviation). Therefore, d of .77 remains a very large effect size, regardless of how variable the performance is between participants. (If the performance is more variable between-subjects, this just means that the mean difference that d of .77 represents is proportionally larger.)
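The point about standardisation can be illustrated with a short Python sketch. The data are hypothetical and the function names are chosen for illustration; the computation follows the definition given above (mean difference divided by the pooled standard deviation).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Mean difference between two groups divided by the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def raw_difference(d, pooled_sd):
    """Raw mean difference implied by a given d and pooled SD."""
    return d * pooled_sd


# Hypothetical scores: the second pair is the first pair made ten times
# more variable, yet d is unchanged because it is expressed in SD units.
a, b = [1, 2, 3, 4, 5], [2, 3, 4, 5, 6]
print(cohens_d(a, b), cohens_d([10 * x for x in a], [10 * x for x in b]))

# With more variable performance, d = .77 corresponds to a larger raw gap.
print(raw_difference(0.77, pooled_sd=10.0), raw_difference(0.77, pooled_sd=50.0))
```

Greater between-subject variability therefore changes the raw difference that d = .77 represents, not the size of the standardised effect.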

Given this fact, as far as I can see, your response does not address the problem at all, and you remain in a position of having an equivalence test that could rule out only very large (seriously non-trivial) differences between tasks. You suggest that you may be at the edges of practicality of sample sizes required for testing smaller effect sizes than this, but that might simply indicate that you are not in a position to run a meaningfully useful study of the sort that you would like. (You might also want to think about whether the VMT task itself is worth trying to adapt to iVR if performance is, as you say, so wildly variable within and between participants.)

Alternatively, you may be able to improve statistical sensitivity to smaller effects by designing a within-subjects study? And/or perhaps you could consider running your study with lower power (e.g. .8), although this would reduce the strength of conclusions you could draw (and also affect the range of possible destination journals).

In any case I am sending this back to you for reconsideration of this key point. Perhaps the following article from Zoltan Dienes could be useful in helping you think about how to define realistic effect sizes that might be worth ruling out (the article takes a Bayesian approach, but the guidance for defining meaningful effect sizes of interest applies equally for a frequentist approach):

Evaluation round #1

DOI or URL of the report:

Version of the report: 1

Author's Reply, 30 Jun 2023

Dear Editor and Reviewers,

We thank you for your input, questions and suggestions, which, from our point of view, have increased the quality (and clarity) of our study. Please find our responses in the attached files.

Changes related to your questions and suggestions have been highlighted in the tracked document. Minor changes, such as corrections to typos, grammar or vocabulary, were not individually marked. A clean version of the manuscript can be found at


Decision by , posted 06 Jun 2023, validated 06 Jun 2023

Thank you for submitting your Stage 1 RR to PCI-RR. Your preprint has now been evaluated by two reviewers with relevant expertise, and I have read it myself.

Both reviewers clearly feel that the plan has promise as a potentially useful contribution to the literature, and that the hypotheses of your study are well stated and clear, although there are some queries over the details of your methods. Reviewer#1 has a number of constructive suggestions to make, including the suggestion that gender be not only balanced but analysed as an additional variable of interest (or controlled as a covariate). I think that your response to this point should depend critically upon precisely what hypothesis you want to test and whether gender is a relevant consideration for that hypothesis.

Reviewer#2 has a number of stylistic comments to make, and I agree very much with this reviewer's impression that the multiple-framing of the Introduction (in terms of dream literature, and in terms of validation for new method) was quite unclear and potentially confusing. One might think that, if your purpose is to create a task more likely to be incorporated into dreaming, then a critical part of its validation would include an assessment of the rates at which it is incorporated into dreaming; but the methods make it clear that this is not part of your purpose. Perhaps try to be clearer about your aims, and more linear in your introductory narrative to establish these aims. This reviewer's point 5 is also critically important from an RR point of view. Do we really believe that a meaningful 'equivalence' could be established by ruling out only effects larger than the very large target level? Would we really consider any differences between tasks that are below this level of effect size to be irrelevant? This seems somewhat unlikely. Perhaps rather than motivating your smallest effect size of interest from expectations based on prior literature, it would be more relevant to consider from first principles what size of difference you think would be of no practical consequence to know about if it exists. (Related to this is Reviewer#2's point 3, which asks why the tests are configured as tests of equivalence, when there would seem to be a priori reasons to expect that the iVR version might be superior.)

In any case, you should consider all of the reviewer comments carefully, and address all of them in any revised submission.

Yours sincerely,

Rob McIntosh (PCI-recommender)

In passing, I noted a few linguistic oddities in the Abstract, which you may wish to amend (there may be more similar oddities in the main paper):

"commonground" >> "commonplace"

"understudied" >> "under-studied" or "little studied"

"One of such mazes" >> "One such maze"

"stimulant and engaging simulation" >> "stimulating and engaging experience" ?


Reviewed by anonymous reviewer 1, 15 May 2023

The authors present a pre-registration report to examine the learning ability and the usability of the Virtual Maze Task used originally by Wamsley et al. (2010). They will compare the Desktop version of the task to a more immersive VR version (using a HMD). The authors present an important research question that is often overlooked when using virtual tasks, do people actually learn better with greater immersion?

The authors outline their research questions and hypothesis well, demonstrating two clear and concise hypotheses that can be easily tested following the reading of the methodology. The protocol is well described, though I have some minor comments about this (see below for section breakdown). The sample size calculation is efficient but, again, there are some minor comments, not about its calculation but about its demographic. I think the clear presentation of method and hypotheses would provide a reader with confidence that no additional analysis will be explored, as only what the authors propose (assessment of learning in VMT and questionnaire use) will actually answer their research question. Both are commonplace across the literature as a methodology for assessment.

Statistical analysis is well outlined and in my opinion, valid for the experiment proposed. It is straightforward and easily replicable from the description. Though perhaps a different approach (repeated-measures examining individual trials and how they vary) may be important here, and may actually reveal more about the task - as spatial "learning" may not actually occur until familiarity with the environment etc. increases. This is particularly important when there is no recall or retest (probe) trial being used. This is common in the human and animal literature. A single score of "improvement" may not be enough data to warrant a true behavioural measure of participants actually "learning" the space and goal location, particularly not after three trials. Though perhaps I have misunderstood the procedure.

Specific comments:


  • Authors mention the lack of ecological validity of desktop tasks, but do not evaluate the greater ecological validity that may stem from iVR tasks.
  • There are a lot of iVR studies out there, particularly some that also use iVR and real locomotion (e.g. Delaux et al., 2021). I think some mention of these and the above point would help your research question.


  • Authors mention they will achieve an equal number of male and female participants. However, I think gender should be recorded and analysed as an additional variable or control variable/co-variate. This is a well known effect in the human literature, even in virtual mazes (Mueller et al., 2008; Woolley et al., 2010). It has also been shown that it is specifically spatial learning that gender can have an impact on and not memory (Piber et al., 2018). I think this is an important aspect to examine or control for in both hypotheses.
  • Is three trials really sufficient to demonstrate "learning" without a memory trial? Many more trials are always used in the literature. I can also see the view that learning is clearly occurring - but I think this needs to be better justified (again, unless I misunderstood).
  • The authors do a really good job at controlling for and assessing cybersickness.
  • For replication purposes, are both of the tasks used available Open Access? If not, authors should recommend alternatives.
  • Groups are controlled for cybersickness and well-assessed for this DV. Why are groups not controlled or assessed for spatial learning ability (which is a cognitive skill we know differs amongst individuals: Coutrot et al., 2018)? Perhaps the introduction of a spatial task (Perspective Taking Task for Adults (Frick et al., 2014)) or even just a cognitive assessment (Trail Making Test etc). Just something that I think is important.
  • I would be wary about the upper age-limit of your participants, over 18 may include older adults who may not be as familiar with the task. Either including them and controlling across groups, or setting a limit would be advised, as age has an impact on usability (Commins et al., 2020) and also spatial learning performance (see Figure 2E in Coutrot et al., 2018 for global data).

Nevertheless, this is a good and clean design to assess a very important question using a repeatedly used but rarely validated virtual spatial task from the literature. It is also an important, unanswered question which could eventually facilitate a framework for other researchers testing the reliability of their virtual tasks. It may also perhaps, save research teams time by implementing whatever version of the task fits into their setup and budget, if they have no real impact on learning. I would recommend this go forward with perhaps some additional thought in the areas mentioned above. 

Reviewed by , 30 May 2023

This interesting and timely article seeks to resolve the uncertainty in the literature about whether a VR version of a virtual maze task (VMT) compared favourably with a classic desktop version of the task. It is well-written throughout, and I appreciate the open materials narrative which is far too often overlooked in registered reports. As a disclaimer, I know little about the VMT literature, but have some expertise in VR and cognitive psychology in general. Some specific comments below.


1. The manuscript as it stands is written in a slightly awkward mixture of past and future tense – I presume this is to make the edits to the final version after data collection a bit easier, but it didn’t feel like there was much consistency in how these tenses were used throughout. I don’t have strong feelings about whether this requires changing, but the editor may.


2. I found the framing of the introduction confusing – the narrative around the VMT is compelling, but I found the link to dreams as the main motivator for using VR rather tenuous. Indeed, the framing of the manuscript sits awkwardly between a paper seeking to resolve a dispute in the literature and a validation of a new method. I’m not sure it succeeds with either narrative, but do not know the extant literature well enough to say which is a more appropriate goal of the paper.


3. RQ1 and H1 are obvious enough (although it is surprising to me that the authors would be predicting equivalent performance – I’d have assumed that iVR would outperform desktop due to immersion).


4. I was unable to open the build linked in the manuscript – this probably reflects my difficulties with Unity Hub rather than the application, but it would be worth uploading ‘fixed’ .exe builds for desktop and Quest that will be used in the manuscript to streamline this process for more casual users.


5. My biggest issue with this article is with the power calculation and associated sample size. The authors base the sample size calculation on the Wamsley papers, but the analysis from which the d=1.1 estimate is drawn is correlational and bears no resemblance to the methods of the current study. I understand the challenges of predicting an effect size, but would recommend the authors use an effect size derived from other studies comparing VR to desktop environments OR the smallest relevant example of how VMT performance can vary from one condition to another in a between-group design. As it stands, the current sample size seems far too low to provide anything like a resolution of the issues in the literature or a validation of VR in this context. The TOST would miss any difference that was just below a ‘classic’ large effect size of 0.8, which feels like very shaky grounds to declare equivalence. This is a pretty simple paradigm and I see no reason not to be more conservative in this regard.


