
Assessing the Promise of Affordable, Mobile Eye-Tracking Devices: Evaluation of the Pupil Neon 

Recommended by Rima-Maria Rahal based on reviews by Lisa Spitzer and Benedikt Ehinger
A recommendation of:

Independent Comparative Evaluation of the Pupil Neon - A New Mobile Eye-tracker

Submission: posted 29 May 2024
Recommendation: posted 23 September 2024, validated 24 September 2024
Cite this recommendation as:
Rahal, R.-M. (2024) Assessing the Promise of Affordable, Mobile Eye-Tracking Devices: Evaluation of the Pupil Neon. Peer Community in Registered Reports. https://rr.peercommunityin.org/articles/rec?id=796

Recommendation

Studying eye-gaze has long been employed as a central method for understanding attentional dynamics and cognitive processes in a variety of domains. The development of affordable, mobile eye-tracking devices, such as the Pupil Neon, promises new opportunities to extend this research beyond the contexts in which traditional eye-trackers have been available. But how good are such novel devices at detecting variables relevant for the study of eye movements and pupil dilations?
 
Foucher, Krug and Sauter (2024) propose an independent evaluation of the Pupil Neon eye-tracker using the Ehinger et al. (2019) test battery, comparing its performance with a traditional EyeLink 1000 Plus device. In an empirical study, participants will be asked to perform a wide variety of tasks while eye movements and pupil dilations are tracked using both devices. Results on the strengths and weaknesses, as well as potential use cases of the Pupil Neon will be informative for subsequent eye-tracking research. 

The Stage 1 manuscript was evaluated over two rounds of in-depth review. Based on detailed responses to the recommender and reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
 
URL to the preregistered Stage 1 protocol: https://osf.io/3kc5t
 
Level of bias control achieved: Level 6. No part of the data or evidence that will be used to answer the research question yet exists and no part will be generated until after IPA. 
 
List of eligible PCI RR-friendly journals:
 

References:

1. Ehinger, B. V., Groß, K., Ibs, I., and König, P. (2019). A New Comprehensive Eye-Tracking Test Battery Concurrently Evaluating the Pupil Labs Glasses and the Eyelink 1000. PeerJ, 7, e7086. https://doi.org/10.7717/peerj.7086

2. Foucher, V., Krug, A., and Sauter, M. (2024). Independent Comparative Evaluation of the Pupil Neon - A New Mobile Eye-tracker. In principle acceptance of Version 3 by Peer Community in Registered Reports. https://osf.io/3kc5t
 
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Reviews

Evaluation round #2

DOI or URL of the report: https://osf.io/pgv39?view_only=2627595019be4f92abf0bf338c83ee68

Version of the report: 2

Author's Reply, 16 Sep 2024

Decision by Rima-Maria Rahal, posted 09 Sep 2024, validated 09 Sep 2024

Dear Dr. Foucher, 

 

Both reviewers are now mostly satisfied with your revision, and only minor updates have been requested. Please address these issues in a revised version of your manuscript.

 

Warmest,

Rima-Maria Rahal 

Reviewed by Lisa Spitzer, 22 Aug 2024

I would like to express my appreciation to the authors, as I feel that all the points I have raised have been adequately addressed (especially sample size, description of analysis plans, description of open science practices). The only point that the authors should perhaps look at again is in the Study Design Table - here the interpretation given different outcomes is described with “Determine if neon fixation and saccade detection can be comparable to Eyelink 1000.” Since frequentist methods are used, it is not possible to test for similarity, but only for difference. I would recommend clarifying this in the text. Otherwise, I find the manuscript to be clear, well structured and now sufficiently transparent. I am looking forward to the results of this study and recommend awarding an IPA.

Best,
Lisa Spitzer

Reviewed by Benedikt Ehinger, 09 Sep 2024

Thank you for the updated Manuscript. I only have very few minor things below, and I trust the authors & editor to address them without me needing to see the manuscript again.

 

- L1 Abstract: "testtestest" - test was successful ;)

- Figure 1: I still find it unintuitive that the icons are different from the small-grid ones. Also, in our study we did not use a fixation cross, but a fixation symbol that (supposedly) reduces microsaccades.

- For the PL manual gaze correction, I think you should specify in the paper how it works (the Pupil Labs docs will be vastly different in five years' time). Is there no other way than using your finger on a mobile phone to adjust the circle? E.g. a QR marker or something that Pupil Labs can easily detect (I assume the answer is no, but that is a bit ridiculous of Pupil Labs ;))?

- I would recommend including the filter settings for REMoDNaV, or at least mentioning how you will determine what good filter settings are.

- Does PL return regularly sampled data? If not, will you resample it?

- L483: there is a / missing between the nested subject and block terms in the LMM formula "| subject block)".
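In lme4-style notation, the nesting shorthand the formula presumably intends expands mechanically into two random-intercept terms, which is why the missing "/" changes the model. A minimal Python sketch of that expansion (the `expand_nesting` helper is purely illustrative, not from the manuscript):

```python
def expand_nesting(term: str) -> str:
    """Expand an lme4-style nested random-effects term.

    '(1 | subject/block)' is shorthand for a random intercept per
    subject plus a random intercept per block *within* subject:
    '(1 | subject) + (1 | subject:block)'.
    """
    inner = term.strip().lstrip("(").rstrip(")")
    lhs, rhs = (part.strip() for part in inner.split("|"))
    outer, *rest = (g.strip() for g in rhs.split("/"))
    terms = [f"({lhs} | {outer})"]
    prefix = outer
    for grp in rest:
        prefix = f"{prefix}:{grp}"
        terms.append(f"({lhs} | {prefix})")
    return " + ".join(terms)

print(expand_nesting("(1 | subject/block)"))
# (1 | subject) + (1 | subject:block)
```

Without the "/", "(1 | subject block)" is not a valid grouping term at all, so the fix is more than cosmetic.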

Evaluation round #1

DOI or URL of the report: https://osf.io/aqvhb?view_only=2627595019be4f92abf0bf338c83ee68

Version of the report: 1

Author's Reply, 13 Aug 2024

Decision by Rima-Maria Rahal, posted 30 Jul 2024, validated 30 Jul 2024

Dear Dr. Foucher, 

thank you for your submission "Independent Comparative Evaluation of the Pupil Neon - A New Mobile Eye-tracker" to PCI RR, for which I have now received two independent reviews by experts in the field. Based on these reviews and my own reading of your manuscript, I would like to invite you to revise the proposal. There is much to like about the manuscript, but I will highlight the most salient opportunities for further improvement below: 

  • Clarify the analysis pipeline (review criterion 1C)
  • Consider increasing the sample size / time planned for the study (review criterion 1C)
  • Clarify the calibration / validation procedure for Pupil Labs Neon (review criterion 1C)

These issues fall within the normal scope of a Stage 1 evaluation and, in addition to responding to the reviewers' thoughtful further comments, can be addressed in a round of revisions. 


Warmest,

Rima-Maria Rahal

Reviewed by Lisa Spitzer, 29 Jul 2024

I want to congratulate the authors for this interesting Stage 1 RR, which I enjoyed reading and reviewing very much.

Summary: The authors aim to provide detailed benchmark information for the new mobile eye-tracker Pupil Neon, using the EyeLink 1000 Plus as a reference. To this end, they plan to use the extensive test battery provided by Ehinger et al. (2019), taking into account not only accuracy and precision but a broad range of different eye-tracking parameters. Participants will complete multiple blocks of this test battery while their eye movements are measured simultaneously with both eye-trackers.

I have summarized my comments on the study below.

Major points

  • The authors are planning to use only a very small sample size. Given that sample sizes are typically smaller in eye-tracking studies and no specific hypotheses are tested, this might be enough; however, since data quality can vary greatly between participants, I recommend targeting a slightly larger sample / increasing the time for data collection.
  • I strongly encourage the authors to be more precise in describing the models used for their data analyses. For example, they should report all (random/fixed) effects used for the rLMM for task 1/7/10; which methods will be used to compare the number of fixations, fixation durations, and saccadic amplitudes between eye-trackers for task 3; how exactly the number of blinks and blink durations will be evaluated for task 5; how the normalized pupil areas will be compared for task 6; how the accuracy will be compared in the movement tasks; etc. In addition, the analysis software used, including packages and version numbers, should be reported.

Minor points

  • L84: an eye-tracker’s (spelling)
  • L170: Will participants with (hard/soft) contact lenses be excluded?
  • L172ff: Please provide an explanation why the study was deemed exempt from ethical approval.
  • L176: How can participants be excluded based on calibration accuracy for the Pupil Labs Neon glasses, if there is no calibration for this model?
  • L193-194: Why is the participant-monitor distance not yet determined?
  • L198: Why use monocular recording instead of binocular for the EyeLink?
  • In L193, the authors describe that a head-chinrest is used, but it is not described whether/how this is removed for the head movement tasks.
  • L253: its (spelling)
  • L324/327: Please add unit (degree).
  • L361: For the Stage 2 RR, I would like to know how much data had to be excluded due to data loss or corrupt data. Therefore, the authors might add a gap text or similar in which these results are reported or keep in mind to describe this in the results part of the Stage 2 RR.
  • L372: “The EyeLink 1000 reports blinks when the pupil is missing for several samples” – please be more specific.
  • L400ff: Please refer to the subsection “Task 2: Smooth pursuits” in L449 for details (at first, I thought that the explanation provided in L400-402 was the only description for the smooth pursuit task)
  • L422: I would make a distinction here between the “actual gaze point” and the fixation target - the actual gaze point may differ from the target position (e.g. due to misalignment of the fovea despite the subjective direction of gaze towards the target), but the target can be used (and is used here) as a proxy for the actual point-of-gaze
  • (How) was performance aggregated across blocks/participants – also incorporating winsorized means?
  • L442: When interpreting the results for task 10, please be clear that the accuracy will not only deteriorate due to time, but also because the movement tasks were performed before task 10, which likely changed the head positions, affecting the recording.
  • L442: It might be worth to also consider target distance from monitor center (as it was shown that performance might be worse in monitor corners, e.g., Spitzer & Mueller, 2022)
  • L476: Did you only analyze the fixation location itself, or also the accuracy? If accuracy was inspected, please report so.
  • Of course, the comparison is limited due to this study incorporating other participants, a different lab etc., but I would still be interested in a comparison of the results found here with the results of the Pupil Labs model measured by Ehinger et al. (2019), which the authors could possibly include in the discussion section of the Stage 2 RR
  • From personal experience (we performed a similar study, also using an adapted version of the Ehinger test battery), the experiment might take longer than 60 minutes. The authors might consider doing a test run, and planning longer time slots per participant.
  • I have not found a description of how/where the data and analysis scripts will be shared. Please add.
  • In addition to the study design template, please include a statement that you are not testing hypotheses in your manuscript text.
  • I understand that it makes sense to leave some of the study design template cells empty, but still I think that there is merit in providing some explanation in the column “interpretation given different outcomes” - e.g., what will be your conclusion given significant effects in the rLMM computed for task 1/7/10?

Overall, I think the study is addressing a very important topic, i.e., the comparability and reproducibility of eye-tracking research. It is well thought out and shows qualitative rigor, with the authors relying heavily on the study by Ehinger et al. (2019). In my opinion, the main points that should be addressed in a revision concern the sample size and the description of the analyses, which should be more detailed. I hope that my comments will help the authors improve their manuscript.

All the best,
Lisa Spitzer

 

Reviewed by Benedikt Ehinger, 26 Jul 2024

In this registered report, Foucher et al. will investigate the performance of the Pupil Neon eye-tracking glasses against the current "gold standard", the EyeLink 1000. For this, they closely follow our previously published eye-tracking benchmark.

The paper, and the choice of eye-tracker, is well motivated and a very valuable thing for the community to investigate. The paper is furthermore well written and reasoned, and I have only some smaller comments.

I'm now very excited to see the outcome of this comparison.

Best, Benedikt Ehinger

etcomp2.0

We are currently re-using our benchmark and are in the analysis phase of comparing the ViewPixx Trackpix3 against an EyeLink. Because of this, we updated and upgraded our analysis pipeline to a new Python version and packages (we also made the stimulation code compatible with Octave; if this is interesting to the authors, they can contact me for the code, as I don't think it is in the public repo). We identified one major breaking change (besides the typical renaming and adding documentation to the analysis functions):

The Engbert-Mergenthaler implementation, which we took from the Donner lab, has a bug in the code that slightly miscalculates the velocity threshold. For this reason, and some other more conceptual ones, we switched in our new pipeline to REMoDNaV, which is a successor in spirit to the Engbert-Mergenthaler algorithm. An argument could be made that yet another class of event classifier should be used (Drew & Dierkes 2024), but I think it would be ill-placed given the lab and head-fixed setup.

The improved analysis code can be found here: https://github.com/behinger/etcomp/tree/etcomp2/. Note that we are still analyzing the data, and not all tasks have been fully analyzed and ported to new versions of Python. There are some drawbacks: we removed quite a bit of Pupil Labs-specific code, due to some internal miscommunication with the lead author of that study. The authors should contact us in case they want to switch to the new pipeline so we can provide the requirements/yml files we are using; we are just not there yet to have a proper project :)

Further, we introduced a reading task, given a collaboration with psycholinguists on this project; you can decide whether this is relevant/interesting for your study or not. We included it mostly because it allows some ecological-validity tests for reading studies, but whether it tells you more than the large-grid task, I cannot say.

Minor comments

L176: You write that there are calibration accuracy limits for Pupil Labs Neon, but you nowhere describe a method to identify them (see question below).

Figure 1: Nice improvements! I only found the large-grid illustration confusing. Why does it not look the same as the small-grid ones? It seems one point is dropping off the screen.

L267: Calibration/validation of the Pupil Labs Neon. Are there no settings whatsoever that you decide per subject that could influence accuracy? And is there no recommended validation behavior?

L366: As stated above, I would probably move to REMoDNaV due to the bug in the Mergenthaler algorithm (or fix the bug).

L414: You argue for converting the Pupil Labs pupil measurement to an area, similar to the EyeLink. But I would argue that the Pupil Labs 'mm' output is the more interesting and relevant output. So maybe calibrating the pupil for the EyeLink should be the goal, rather than "deconverting" the Pupil Labs output back to ellipses/areas?
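For reference, the diameter-to-area conversion at stake is plain circle geometry. A minimal sketch, assuming a circular pupil (the function name is illustrative, not from the manuscript); whether to compare in mm or in area units remains the open question above:

```python
import math

def diameter_mm_to_area(d_mm: float) -> float:
    """Pupil area in mm^2 from a diameter in mm, assuming a circular
    pupil. One way to put a mm-diameter signal (Pupil Labs) and an
    area signal (EyeLink, arbitrary units) on a common scale."""
    return math.pi * (d_mm / 2) ** 2

print(diameter_mm_to_area(4.0))  # a 4 mm pupil has an area of ~12.57 mm^2
```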

L420: There is a mistake in the formula in our paper (2*atan2 should be atan2); we have had a correction request pending since January. Sorry for the inconvenience. The code is correct, though.
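A minimal sketch of the corrected pixel-to-visual-angle conversion, with the single atan2 as noted above. The pixel pitch and viewing distance below are illustrative assumptions, not values from the manuscript:

```python
import math

def offset_to_deg(offset_px: float, px_per_cm: float, distance_cm: float) -> float:
    """Convert an on-screen offset (e.g. gaze-to-target distance in
    pixels) into degrees of visual angle. Note the single atan2, not
    2*atan2, per the correction above."""
    offset_cm = offset_px / px_per_cm
    return math.degrees(math.atan2(offset_cm, distance_cm))

# Illustrative numbers: 50 px offset, 38 px/cm, 60 cm viewing distance.
print(round(offset_to_deg(50, 38, 60), 2))  # -> 1.26
```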

 

Open question for the analysis plan: do you take the winsorized average of the accuracy values per block, then the winsorized mean over blocks, then the (bootstrapped) winsorized mean over subjects? As far as I know, this is how we did it. You could also disregard blocks and immediately aggregate over subjects. Maybe I missed it in the manuscript.
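The hierarchical aggregation described in this question can be sketched as follows. The 20% winsorizing limit and the helper names are illustrative assumptions, and the bootstrap over subjects is omitted for brevity:

```python
def winsorized_mean(values, limit=0.2):
    """Mean after clamping the lowest/highest `limit` fraction of
    values to the nearest remaining value (symmetric winsorizing)."""
    xs = sorted(values)
    k = int(len(xs) * limit)
    if k:
        xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return sum(xs) / len(xs)

def aggregate(acc):
    """acc[subject][block] -> list of per-trial accuracy values.
    Winsorized mean per block, then over blocks within subject,
    then over subjects (bootstrap step not shown)."""
    per_subject = [
        winsorized_mean([winsorized_mean(trials) for trials in blocks.values()])
        for blocks in acc.values()
    ]
    return winsorized_mean(per_subject)

print(winsorized_mean([1, 2, 3, 4, 100]))  # -> 3.0 (the 100 is clamped to 4)
```

The alternative mentioned above (disregarding blocks) would instead pool all trials per subject before the final winsorized mean, which weights blocks by their trial counts.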