Abstract
Listening in the real world involves both verbal and non-verbal inputs. However, second language (L2) listening activities in the classroom often lack non-verbal inputs and are removed from the situational and cultural contexts where they would naturally occur. Virtual reality (VR) technology offers the potential to create more authentic and engaging L2 listening experiences. This study examines the impact of immersive and interactive VR on L2 listening experiences (operationalized as flow) and comprehension among Chinese university-level English-as-a-foreign-language (EFL) learners. Drawing on a randomized experimental design and semi-structured interviews, the study found that while VR did not directly improve L2 listening comprehension, it contributed indirectly to L2 listening comprehension by enhancing learners' listening experiences. Furthermore, although VR enhanced listening experiences in both cognitive and affective terms, only the affective enhancement had a noticeable positive medium-sized effect on L2 listening comprehension. Cognitive benefits of VR, such as sustained concentration and heightened absorption, did not translate into better L2 listening comprehension. The observed relationships can be explained by the misalignment between VR's interactive elements and the cognitive demands of the listening task. The findings highlight the pedagogical value of VR in enhancing affective engagement in learning, underscore the need for instructional design to mitigate cognitive overload, and emphasize the importance of careful VR design to ensure that immersive features support, rather than distract from, cognitive engagement.
ERCT Criteria Breakdown
-
Level 1 Criteria
-
C
Class-level RCT
- Participants were randomized individually to VR vs. audio rather than by intact classes (and the intervention is not tutoring).
- Using computer-generated random numbers, they were randomly assigned to either a VR group (n = 42) or an audio group (n = 42).
Relevant Quotes:
1) "Using computer-generated random numbers, they were randomly assigned to either a VR group (n = 42) or an audio group (n = 42)." (p. 4)
Detailed Analysis:
Criterion C requires randomization at the class level (or stronger) to reduce cross-group contamination, unless the intervention is explicitly one-to-one tutoring/personal teaching.
The quoted methods text states that random assignment was done at the individual participant level into a VR group versus an audio group. The intervention is a listening task delivered via two modes; it is not described as tutoring or other one-to-one personal teaching that would trigger the tutoring exception.
Therefore, the unit of randomization does not meet the class-level (or school-level) requirement.
Criterion C is not met because randomization was at the individual participant level rather than by intact classes (and no tutoring exception applies).
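The distinction Criterion C draws can be sketched in code. The following is a minimal illustration, not the authors' procedure: the paper only states that computer-generated random numbers were used, so the seed, the participant IDs, the four hypothetical classes, and both function names are assumptions made for the example.

```python
import random

def assign_individuals(participants, seed=0):
    """Individual-level assignment: shuffle people, then split in half."""
    rng = random.Random(seed)  # seed is an assumption, for repeatability only
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"VR": shuffled[:half], "audio": shuffled[half:]}

def assign_classes(classes, seed=0):
    """Class-level assignment: shuffle intact classes, then assign whole classes."""
    rng = random.Random(seed)
    names = sorted(classes)
    rng.shuffle(names)
    half = len(names) // 2
    arms = {"VR": [], "audio": []}
    for i, name in enumerate(names):
        arm = "VR" if i < half else "audio"
        arms[arm].extend(classes[name])  # every member follows their class
    return arms

# 84 hypothetical participant IDs, matching the reported total (42 + 42).
participants = [f"P{i:02d}" for i in range(1, 85)]
groups = assign_individuals(participants)
print(len(groups["VR"]), len(groups["audio"]))  # prints "42 42"

# Four hypothetical intact classes of 21 students each (not from the paper).
classes = {f"Class{c}": [f"{c}-{i}" for i in range(21)] for c in "ABCD"}
by_class = assign_classes(classes)
```

In the first function, classmates can land in different arms, which is the contamination risk Criterion C targets. In the second, each class moves as a unit.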
-
E
Exam-based Assessment
- The main listening comprehension outcome uses a study-developed (tailor-made) test rather than a standardized exam.
- A listening comprehension test was developed based on Anne Frank House VR (see Appendix II for a sample of the test questions).
Relevant Quotes:
1) "Participants' listening proficiency was assessed before the experiment using the listening subtest of the Cambridge B2 First for Schools (Cambridge University Press & Assessment, 2025a)." (p. 5)
2) "A listening comprehension test was developed based on Anne Frank House VR (see Appendix II for a sample of the test questions)." (p. 5)
3) "This tailor-made test was validated in a pilot study involving 55 participants drawn from the same student population as the main study." (p. 5)
Detailed Analysis:
Criterion E requires that the outcome assessment be based on standardized, widely recognized exams (i.e., not created for the specific study).
The paper does use a standardized exam component (Cambridge B2 First for Schools listening subtest) to assess pre-experiment listening proficiency. However, the study's listening comprehension outcome for evaluating the intervention is measured using a test the authors explicitly describe as "developed" for the Anne Frank House VR content and as "tailor-made" (even though it was piloted and validated).
Under ERCT, a validated custom measure is still not a standardized exam-based assessment.
Criterion E is not met because the primary listening comprehension outcome is a study-developed (tailor-made) test rather than a standardized exam.
-
T
Term Duration
- Outcomes were measured immediately after a single-session task, not at least one academic term after the intervention began.
- Immediately after the listening task, both groups completed the EduFlow-2 scale again to measure their state flow based on the listening experience that they had just had.
Relevant Quotes:
1) "Immediately after the listening task, both groups completed the EduFlow-2 scale again to measure their state flow based on the listening experience that they had just had." (p. 7)
2) "First, the investigation was conducted as a single-session experiment, meaning that the effects observed may be short-lived and may not reflect outcomes over an extended period." (p. 13)
Detailed Analysis:
Criterion T requires outcome measurement at least one academic term (roughly 3-4 months) after the intervention begins, i.e., term-long tracking/follow-up from the start.
The paper explicitly describes measurement occurring "Immediately after the listening task" and later characterizes the study as a "single-session experiment." This is far shorter than a term and provides no term-long follow-up window for outcomes.
Criterion T is not met because the study is single-session with immediate post-task measurement rather than term-long follow-up.
-
D
Documented Control Group
- The control (audio) condition is described, group sizes are given, and baseline comparability on key measures is reported.
- The two groups did not differ in their pre-experiment listening proficiency measured by the listening subtest of the Cambridge B2 First for Schools (t[82] = 0.88, p = .384) or their trait flow scores (t[82] = 0.04, p = .970), suggesting comparable listening ability and general propensity to experience flow before the experiment.
Relevant Quotes:
1) "Using computer-generated random numbers, they were randomly assigned to either a VR group (n = 42) or an audio group (n = 42)." (p. 4)
2) "The two groups did not differ in their pre-experiment listening proficiency measured by the listening subtest of the Cambridge B2 First for Schools (t[82] = 0.88, p = .384) or their trait flow scores (t[82] = 0.04, p = .970), suggesting comparable listening ability and general propensity to experience flow before the experiment." (p. 4)
3) "The audio group, on the other hand, listened to the audio via a classroom loudspeaker system connected to a PC, which was pre-calibrated to ensure clear and consistent sound quality." (p. 6)
Detailed Analysis:
Criterion D requires that the control group be clearly documented, including who is in it, what they receive, and evidence supporting comparability.
The paper specifies the control as an "audio group," provides the control sample size (n = 42), and describes how the audio mode was delivered. It also reports baseline comparability between groups on pre-experiment listening proficiency and trait flow, which are relevant pre-intervention characteristics for interpreting group differences.
Criterion D is met because the control group is described with sample size, condition procedures, and baseline comparability evidence.
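The reported degrees of freedom (82) are what an independent-samples t-test with pooled variance gives for two groups of 42, since 42 + 42 - 2 = 82. The sketch below illustrates that check under this assumption; the paper's exact test variant is not quoted here, the function name pooled_t is hypothetical, and the scores are synthetic values generated for the example, not study data.

```python
import math
import random
import statistics

def pooled_t(a, b):
    """Student's t statistic with pooled variance, plus degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, df

# Synthetic baseline scores (assumption): both groups drawn from one distribution.
rng = random.Random(42)
vr = [rng.gauss(20, 4) for _ in range(42)]
audio = [rng.gauss(20, 4) for _ in range(42)]

t, df = pooled_t(vr, audio)
print(f"t[{df}] = {t:.2f}")  # df is 82 for two groups of 42
```

A small absolute t value relative to the critical value at 82 degrees of freedom is what supports the "did not differ" reading in the quoted passage.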
-
Level 2 Criteria
-
S
School-level RCT
- Assignment was at the individual participant level, not by schools (or equivalent sites/centers).
- Using computer-generated random numbers, they were randomly assigned to either a VR group (n = 42) or an audio group (n = 42).
Relevant Quotes:
1) "Using computer-generated random numbers, they were randomly assigned to either a VR group (n = 42) or an audio group (n = 42)." (p. 4)
Detailed Analysis:
Criterion S requires randomization at the school level (schools, sites, centers, or comparable implementation units), not merely at the level of individual learners.
The paper reports that individual participants were randomly assigned into VR versus audio conditions and does not describe any school/site-level randomization.
Criterion S is not met because the trial is not randomized at the school (or site/center) level.
-
I
Independent Conduct
- The VR intervention content was taken from a third-party VR application, and the study reports no funding or role of the application developers in the evaluation.
- The listening materials used in the experiment were taken from the free educational VR application Anne Frank House VR (Vertigo Games & Knucklehead, 2024).
Relevant Quotes:
1) "The listening materials used in the experiment were taken from the free educational VR application Anne Frank House VR (Vertigo Games & Knucklehead, 2024)." (p. 5)
2) "Funding" (p. 14)
3) "None." (p. 14)
Detailed Analysis:
Criterion I asks whether the study is conducted independently from the intervention designers/providers, reducing bias.
The paper indicates that the intervention content/materials come from a pre-existing, third-party VR application (Anne Frank House VR) attributed to Vertigo Games & Knucklehead, which are not the paper's authors. The paper also reports "Funding" as "None," and it does not describe any involvement by the VR application's creators in data collection, analysis, or reporting.
While the paper does not include an explicit statement such as "the app developers had no role in the study," the documented sourcing of the intervention from an external provider and the lack of any described provider role support independence.
Criterion I is met because the intervention provider is external to the author team and no provider involvement in evaluation is reported.
-
Y
Year Duration
- The study is a single-session experiment rather than tracking outcomes for at least 75% of an academic year.
- First, the investigation was conducted as a single-session experiment, meaning that the effects observed may be short-lived and may not reflect outcomes over an extended period.
Relevant Quotes:
1) "First, the investigation was conducted as a single-session experiment, meaning that the effects observed may be short-lived and may not reflect outcomes over an extended period." (p. 13)
2) "Immediately after the listening task, both groups completed the EduFlow-2 scale again to measure their state flow based on the listening experience that they had just had." (p. 7)
Detailed Analysis:
Criterion Y requires outcome measurement at least 75% of an academic year after the intervention begins.
The paper explicitly describes the study as a "single-session experiment" with outcome measurement occurring immediately after the listening task. This is not year-long (or near-year-long) tracking and does not meet the duration requirement.
Criterion Y is not met because the study does not track outcomes across an academic year.
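The two duration thresholds used in this breakdown (Criterion T and Criterion Y) can be expressed as a simple check. This is only a sketch: the 90-day reading of "roughly 3-4 months" and the 280-day academic year are assumptions chosen for illustration, and the function names are hypothetical rather than part of any ERCT tooling.

```python
TERM_MIN_DAYS = 90        # assumption: lower bound of "roughly 3-4 months"
ACADEMIC_YEAR_DAYS = 280  # assumption: a hypothetical 40-week academic year

def meets_term_duration(days_tracked):
    """Criterion T: outcomes measured at least one term after the start."""
    return days_tracked >= TERM_MIN_DAYS

def meets_year_duration(days_tracked, year_days=ACADEMIC_YEAR_DAYS):
    """Criterion Y: outcomes measured at least 75% of an academic year after the start."""
    return days_tracked >= 0.75 * year_days

# A single-session study measures outcomes on the same day the task is done.
single_session_days = 0
print(meets_term_duration(single_session_days))  # False
print(meets_year_duration(single_session_days))  # False
```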
-
B
Balanced Control Group
- The VR group received extra technology/orientation time, but these resources are integral to the treatment contrast (VR vs. audio), and the control provides a clear alternative listening mode.
- To mitigate VR's novelty effect and minimize technical issues during the experiment, participants in the VR group received a 30-min individual VR orientation.
Relevant Quotes:
1) "To mitigate VR's novelty effect and minimize technical issues during the experiment, participants in the VR group received a 30-min individual VR orientation." (p. 6)
2) "In the VR condition, participants used a VR head-mounted device (Oculus Quest 2) and two hand trackers to navigate the rooms and activate audio segments by clicking sequentially highlighted icons (see Fig. 3)." (p. 6)
3) "The audio group, on the other hand, listened to the audio via a classroom loudspeaker system connected to a PC, which was pre-calibrated to ensure clear and consistent sound quality." (p. 6)
4) "This introduction ensured that both groups received equivalent and minimal contextual grounding." (p. 7)
Detailed Analysis:
Criterion B compares the nature, quantity, and quality of resources (time, budget, materials, adult support) provided across conditions, and asks whether the control offers a comparable substitute for the intervention inputs unless the extra resources are the treatment variable being tested.
Extra resources are clearly present in the VR condition (VR device, hand trackers, and a 30-minute VR orientation). The audio group does not receive comparable VR equipment or orientation.
However, the explicit experimental manipulation is the listening mode itself ("VR vs. audio"). The VR equipment and orientation are integral to delivering the VR listening mode, i.e., the extra resources are part of the treatment definition rather than an optional add-on that could be balanced without changing the treatment being tested.
The paper also documents at least one deliberate balancing step unrelated to the VR hardware: both groups received the same brief PowerPoint introduction intended to provide "equivalent and minimal contextual grounding."
Criterion B is met because the additional resources (VR hardware and orientation) are integral to the VR treatment contrast being tested, and the control condition is a clearly documented, reasonable alternative mode (audio-only) rather than an undefined or under-resourced comparator.
-
Level 3 Criteria
-
R
Reproduced
- No independent replication of this specific study was found in the paper or via internet searching as of 2026-04-13.
- Researchers could also conduct cross-cultural replications and extend VR applications to other language skills (e.g., reading and speaking).
Relevant Quotes:
1) "Researchers could also conduct cross-cultural replications and extend VR applications to other language skills (e.g., reading and speaking)." (p. 13)
Detailed Analysis:
Criterion R requires an independent replication of the study by a different research team, in a different context, published in a peer-reviewed outlet.
The paper itself only discusses replications as a future direction, not as an existing replication ("could also conduct cross-cultural replications").
Internet searching (by DOI, title, and author names) on 2026-04-13 did not identify any published, peer-reviewed independent replication of this exact VR-vs-audio randomized experiment.
Criterion R is not met because independent replication evidence was not found.
-
A
All-subject Exams
- Criterion E is not met (no standardized outcome exam), so A is automatically not met; additionally, outcomes focus on L2 listening rather than all core subjects.
- A listening comprehension test was developed based on Anne Frank House VR (see Appendix II for a sample of the test questions).
Relevant Quotes:
1) "A listening comprehension test was developed based on Anne Frank House VR (see Appendix II for a sample of the test questions)." (p. 5)
2) "This tailor-made test was validated in a pilot study involving 55 participants drawn from the same student population as the main study." (p. 5)
Detailed Analysis:
Criterion A requires standardized, exam-based assessment across all main subjects, and ERCT rules specify that if criterion E is not met, then criterion A is not met.
Here, the main outcome measure is a custom listening comprehension test described as "developed" for the study and "tailor-made." Therefore, criterion E is not met. In addition, the study measures L2 listening (and related constructs) rather than assessing broad achievement across core academic subjects.
Criterion A is not met because the study lacks standardized outcome exams (failing E) and does not assess outcomes across all core subjects.
-
G
Graduation Tracking
- Criterion Y is not met (single-session, not year-long), so G is automatically not met; no graduation tracking or follow-up papers were found.
- First, the investigation was conducted as a single-session experiment, meaning that the effects observed may be short-lived and may not reflect outcomes over an extended period.
Relevant Quotes:
1) "First, the investigation was conducted as a single-session experiment, meaning that the effects observed may be short-lived and may not reflect outcomes over an extended period." (p. 13)
Detailed Analysis:
Criterion G requires tracking participants to graduation from the relevant educational stage, and ERCT rules specify that if criterion Y is not met, then criterion G is not met.
This study is explicitly "single-session" and does not track participants across an academic year, much less until graduation. Additional internet searching for subsequent follow-up publications by the same authors tracking this cohort to graduation did not find any such graduation-tracking reports as of 2026-04-13.
Criterion G is not met because the study does not track outcomes to graduation and it fails the year-duration prerequisite (Y).
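The prerequisite rules cited for criteria A and G (A cannot be met if E is not met; G cannot be met if Y is not met) can be sketched as a small function. The function name and the dictionary layout are hypothetical; the verdicts are the ones stated in this breakdown.

```python
def apply_prerequisites(verdicts):
    """Return a copy of the verdicts with the E -> A and Y -> G rules enforced."""
    adjusted = dict(verdicts)
    if not adjusted.get("E", False):
        adjusted["A"] = False  # A requires standardized exams (E)
    if not adjusted.get("Y", False):
        adjusted["G"] = False  # G requires year-long tracking (Y)
    return adjusted

# Verdicts as reported in this breakdown (True = met).
reported = {
    "C": False, "E": False, "T": False, "D": True,   # Level 1
    "S": False, "I": True, "Y": False, "B": True,    # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

final = apply_prerequisites(reported)
met = [criterion for criterion, ok in final.items() if ok]
print(met)  # ['D', 'I', 'B']
```

Because E and Y both fail here, the rules leave the reported verdicts unchanged: A and G would be not met even if their own evidence had been stronger.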
-
P
Pre-Registered
- The paper reports IRB approval but provides no preregistration registry/ID/date, and no external preregistration record was found.
Relevant Quotes:
1) "This study was approved by the Institutional Review Board of The Hong Kong Polytechnic University (approval no.: HSEARS20240126007)" (p. 14)
Detailed Analysis:
Criterion P requires an explicit preregistration record (registry and identifier/link) and evidence that registration occurred before data collection began.
The paper documents ethics review/IRB approval, but it does not mention preregistration, does not provide a registry name (e.g., OSF), and does not provide a preregistration identifier or date. Searching the web using the study title/DOI and the IRB approval number did not reveal a public preregistration record as of 2026-04-13.
Criterion P is not met because no preregistered protocol record is provided or discoverable from the available information.