Abstract
This randomized controlled trial tested the effects of immersive Virtual Reality (VR) enhanced with artificial intelligence on English language development, operationalized as performance on an author-developed, CEFR-aligned language proficiency battery emphasizing grammatical and lexical performance, in undergraduate Chinese EFL learners (N = 477). Participants were assigned to NLP-enhanced VR, ML-enhanced VR, SA-enhanced VR, or a traditional instruction control condition. Posttest scores on an author-developed, CEFR-aligned proficiency measure were analyzed using mixed-effects ANCOVA to account for recurring laboratory sections. The NLP-enhanced VR condition yielded substantially greater grammatical and lexical gains than all other conditions (F(3,473) = 1139.45, p < .001, η2 = .88), with post hoc tests confirming its superiority. Communication competence and intercultural competence were measured only within the three VR arms. No reliable between-arm differences were detected for communication competence (F(2,354) = 0.02, p = .982) or intercultural competence (F(2,354) = 1.06, p = .349), so no causal claims are made versus the control group for these outcomes. Findings indicate that context-sensitive, NLP-driven conversational support in immersive VR can causally enhance foundational linguistic subsystems—vocabulary, grammar, and sentence-level syntax—as measured by the CEFR-aligned assessment, while the durability and communicative transfer of these gains require verification through delayed and independent measures.
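The reported effect size and sample sizes can be cross-checked against the F statistics in the abstract. The following is a minimal sketch, assuming that the reported η2 is a partial eta squared and that the error degrees of freedom equal the analyzed sample size minus the number of arms; neither assumption is stated in the abstract, and the function name is illustrative.

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# Omnibus test for language proficiency, F(3,473) = 1139.45, as reported.
eta_sq = partial_eta_squared(1139.45, 3, 473)
print(round(eta_sq, 2))  # 0.88

# Sample sizes implied by the error df, ASSUMING df_error = n - number_of_arms.
n_all_arms = 473 + 4                 # four-arm analysis
n_vr_arms = 354 + 3                  # three-arm (VR only) analyses, F(2,354)
n_control = n_all_arms - n_vr_arms   # participants outside the VR arms
print(n_all_arms, n_vr_arms, n_control)  # 477 357 120
```

Under these assumptions the figures agree with the reported η2 = .88, the stated N = 477, and the control group size of 120 listed in Table 1.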
ERCT Criteria Breakdown
-
Level 1 Criteria
-
C
Class-level RCT
- Randomization occurred at the individual student level rather than by intact classes (and the intervention is not one-to-one tutoring), so the class-level RCT requirement is not satisfied.
- Participants were randomized at the individual level to one of four arms using computer-generated permuted blocks (1:1:1:1).
Relevant Quotes:
1) "Participants were randomized at the individual level to one of four arms using computer-generated permuted blocks (1:1:1:1)." (p. 7)
2) "Instruction and data collection occurred in recurring sections/lab blocks (J = [insert]), which can induce within-section correlation. All confirmatory analyses therefore modeled section as a random intercept." (p. 7)
3) "Scheduling and clusters. Participants attended recurring sections/lab blocks (J = 16; mean cluster size m̄ = 29.8, SD = 2.1, range = 26–33)." (p. 7)
Detailed Analysis:
Criterion C requires random assignment at the class level (or stronger) to reduce contamination between treatment and control. The paper explicitly states individual-level assignment using permuted blocks, which indicates that students in the same recurring lab/section structure could be assigned to different arms.
The ERCT exception allowing student-level randomization applies to personal tutoring/one-to-one teaching interventions. Here the intervention is delivered in recurring lab blocks with shared settings and instructors, and is not described as one-to-one tutoring.
Final sentence: Criterion C is not met because the unit of randomization is individual students rather than intact classes (and no tutoring exception applies).
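The cluster figures in quote 3 can be checked for internal consistency with the stated sample size. This is a small arithmetic sketch, not part of the paper; the variable names are illustrative, and the rounding tolerance assumes the mean cluster size was rounded to one decimal place.

```python
# Values as quoted from the paper (p. 7) and the abstract.
J = 16                       # number of recurring sections/lab blocks
mean_size = 29.8             # reported mean cluster size
size_min, size_max = 26, 33  # reported range of cluster sizes
N = 477                      # total sample size from the abstract

implied_total = J * mean_size
print(round(implied_total, 1))  # 476.8

# ASSUMPTION: the mean is rounded to one decimal, so the total can be off by up to J * 0.05.
consistent_with_mean = abs(implied_total - N) <= J * 0.05
# The total must also be reachable with every cluster inside the reported range.
consistent_with_range = J * size_min <= N <= J * size_max
print(consistent_with_mean, consistent_with_range)  # True True
```

The check only confirms that 16 sections of roughly 30 students match N = 477. It says nothing about how the four arms were distributed within each section, which is the contamination concern behind Criterion C.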
-
E
Exam-based Assessment
- The primary outcome is an author-developed, CEFR-aligned proficiency battery rather than a widely recognized standardized exam, so the exam-based assessment requirement is not satisfied.
- This randomized controlled trial tested the effects of immersive Virtual Reality (VR) enhanced with artificial intelligence on English language development, operationalized as performance on an author-developed, CEFR-aligned language proficiency battery emphasizing grammatical and lexical performance...
Relevant Quotes:
1) "This randomized controlled trial tested the effects of immersive Virtual Reality (VR) enhanced with artificial intelligence on English language development, operationalized as performance on an author-developed, CEFR-aligned language proficiency battery emphasizing grammatical and lexical performance..." (p. 1)
2) "Language Proficiency (Author-Developed, CEFR-Aligned; Pretest-Posttest). This author-developed test was based on the Common European Framework of Reference for Languages (CEFR)..." (p. 12)
3) "Items were adapted from CEFR-aligned standardized language assessments (e.g., Cambridge English Qualifications) to target VR-enhanced learning contexts." (p. 12)
4) "Because the author-developed CEFR-aligned battery is intentionally focused on grammatical accuracy and lexical range..." (p. 12)
Detailed Analysis:
Criterion E requires outcome measurement using standardized, widely recognized exams rather than researcher-built instruments. Although the measure is CEFR-aligned and draws on items adapted from standardized assessments, the paper repeatedly describes the primary outcome as "author-developed."
CEFR alignment and item adaptation can improve construct alignment, but they do not turn an author-assembled battery into a widely recognized standardized exam administered under an external testing program.
Final sentence: Criterion E is not met because the primary outcome is an author-developed, CEFR-aligned battery rather than a standardized external exam.
-
T
Term Duration
- The paper reports the dosage in sessions (15 × 90 minutes) but does not clearly document calendar start and outcome measurement dates showing at least one full academic term elapsed from start to measurement.
- Each experimental group underwent 15 sessions of 90 minutes each, focusing on specific pedagogical interventions tailored to address different facets of language teaching and learning.
Relevant Quotes:
1) "Each experimental group underwent 15 sessions of 90 minutes each, focusing on specific pedagogical interventions tailored to address different facets of language teaching and learning." (p. 6)
2) "The control group engaged in standard language learning activities, serving as a baseline for comparative analysis." (p. 6)
3) "The duration of the intervention (15 × 90 minute intervention sessions) allowed participants sufficient time to engage with the personalized exercises and assimilate the feedback." (p. 10)
Detailed Analysis:
Criterion T requires outcome measurement at least one full academic term after the intervention begins, which typically requires clear calendar dates (or an explicit term/semester framing) for the start of the intervention and the posttest (or other primary outcome measurement).
The paper specifies instructional dosage (15 sessions of 90 minutes) but does not provide, in the quoted text, explicit calendar start and end dates (e.g., month-to-month) or an explicit statement that the 15 sessions span a full semester/term.
Final sentence: Criterion T is not met because the paper does not clearly document a term-length calendar interval from intervention start to outcome measurement.
-
D
Documented Control Group
- The control condition is clearly described (traditional instruction), with sample size and baseline/posttest descriptives reported in tables, satisfying the documented control group requirement.
- Participants were assigned to NLP-enhanced VR, ML-enhanced VR, SA-enhanced VR, or a traditional instruction control condition.
Relevant Quotes:
1) "Participants were assigned to NLP-enhanced VR, ML-enhanced VR, SA-enhanced VR, or a traditional instruction control condition." (p. 1)
2) "The study was anchored in a randomized experimental framework, incorporating three experimental groups and a control group." (p. 6)
3) "The control group engaged in standard language learning activities, serving as a baseline for comparative analysis." (p. 6)
4) "Table 1. Experimental Group Characteristics" with "Control Traditional instruction Grammar/Vocabulary Classroom lectures 15 × 90-min sessions No technology 120 (58/62)" (p. 9)
5) "Table 2. Descriptive statistics by group" includes language proficiency "Pretest" and "Posttest" values for "Control" with n, M, SD, and 95% CI. (p. 13)
Detailed Analysis:
Criterion D requires that the control group be sufficiently documented so readers can understand what the control condition received and assess baseline comparability.
The paper clearly identifies the control condition as traditional instruction / standard language learning activities. It reports the control group sample size and demographics (Table 1) and provides baseline and posttest descriptive statistics for the primary outcome (Table 2).
Final sentence: Criterion D is met because the control condition, sample size, and baseline/posttest descriptives are explicitly documented.
-
Level 2 Criteria
-
S
School-level RCT
- Randomization was not conducted at the school (or site) level; participants were randomized individually, so the school-level RCT requirement is not satisfied.
- Participants were randomized at the individual level to one of four arms using computer-generated permuted blocks (1:1:1:1).
Relevant Quotes:
1) "Participants were randomized at the individual level to one of four arms using computer-generated permuted blocks (1:1:1:1)." (p. 7)
2) "Participants were enlisted from a pool of undergraduates majoring in Teaching English as a Foreign Language." (p. 7)
Detailed Analysis:
Criterion S requires randomization at the level of the implementing institution/site (e.g., schools, centers, campuses, or other comparable delivery sites). The paper describes a single higher-education participant pool with individual-level randomization, not random assignment across multiple sites.
The mention of recurring lab blocks indicates clustering for analysis, but it does not indicate that lab blocks or sites were randomized as the unit of assignment.
Final sentence: Criterion S is not met because randomization was conducted at the individual participant level rather than at the institutional site level.
-
I
Independent Conduct
- The intervention platform was custom-developed for the study and there is no clear statement that an independent external evaluator conducted the trial, so independent conduct is not established.
- The fully immersive VR environment (360° headset-based) was custom-developed in Unity 3D specifically for this study and deployed on Oculus Quest 2 headsets (256 GB model).
Relevant Quotes:
1) "The fully immersive VR environment (360° headset-based) was custom-developed in Unity 3D specifically for this study and deployed on Oculus Quest 2 headsets (256 GB model)." (p. 8)
2) "Trained observers diligently documented participants’ communicative behaviors during interactions within the immersive VR environment." (p. 15)
3) "Theme refinement through peer debriefing with two independent researchers" (p. 17)
Detailed Analysis:
Criterion I requires clear evidence that the evaluation was conducted independently of the intervention designers/providers. The paper describes an intervention environment custom-developed specifically for this study, which strongly suggests the research team (or close collaborators) were involved in intervention development.
While the paper mentions trained observers and "two independent researchers" for qualitative peer debriefing, these statements do not establish that the overall evaluation (implementation, data collection, and/or analysis) was led by a third-party external evaluation team independent of the intervention development.
Final sentence: Criterion I is not met because the paper does not clearly document independent external conduct of the evaluation separate from the intervention development.
-
Y
Year Duration
- The paper does not provide start and measurement dates demonstrating outcome measurement at least 75% of an academic year after the intervention began, and criterion T is also not met.
- Each experimental group underwent 15 sessions of 90 minutes each, focusing on specific pedagogical interventions tailored to address different facets of language teaching and learning.
Relevant Quotes:
1) "Each experimental group underwent 15 sessions of 90 minutes each, focusing on specific pedagogical interventions tailored to address different facets of language teaching and learning." (p. 6)
2) "The duration of the intervention (15 × 90 minute intervention sessions) allowed participants sufficient time to engage with the personalized exercises and assimilate the feedback." (p. 10)
Detailed Analysis:
Criterion Y requires outcome measurement at least 75% of an academic year after the intervention begins, which requires clear calendar start and outcome measurement dates (or an explicit academic-year span). The quoted text provides session dosage but does not provide a calendar interval.
Additionally, per the ERCT dependency rule, if criterion T is not met then criterion Y is not met.
Final sentence: Criterion Y is not met because no dates documenting a year-scale interval are reported and because criterion T is not met.
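The timing logic for Criterion Y can be sketched as a date-interval check. This is an illustration only: the function name, the 270-day academic year, and the example dates are hypothetical assumptions, since the review quotes no calendar dates from the paper and does not define the length of an academic year in days.

```python
from datetime import date
from typing import Optional


def meets_year_duration(
    start: Optional[date],
    measured: Optional[date],
    academic_year_days: int = 270,  # ASSUMPTION: roughly a nine-month academic year
    fraction: float = 0.75,         # the 75% threshold named in Criterion Y
) -> Optional[bool]:
    """Return True/False when both dates are documented, None when timing is undocumented."""
    if start is None or measured is None:
        return None
    elapsed_days = (measured - start).days
    return elapsed_days >= fraction * academic_year_days


# Undocumented timing, as in the quoted text of this paper.
print(meets_year_duration(None, None))  # None

# HYPOTHETICAL dates, for illustration only; they are not taken from the study.
print(meets_year_duration(date(2024, 9, 2), date(2025, 5, 30)))   # True
print(meets_year_duration(date(2024, 9, 2), date(2024, 12, 15)))  # False
```

In this review the first case applies: without documented dates the check cannot return True, so the criterion is treated as not met.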
-
B
Balanced Control Group
- Instructional time appears matched across arms (15 × 90-minute sessions) and the added technology resources are integral to the intervention being tested versus business-as-usual, so the balanced control requirement is satisfied.
- Table 1 shows the control condition had "15 × 90-min sessions" and the VR conditions also had "15 × 90-min sessions."
Relevant Quotes:
1) "Each experimental group underwent 15 sessions of 90 minutes each..." (p. 6)
2) "The control group engaged in standard language learning activities, serving as a baseline for comparative analysis." (p. 6)
3) "All experimental groups used identical core VR environments and scenario-based activities, differing solely in their specified technological augmentations to ensure fair comparison." (p. 6)
4) "Table 1. Experimental Group Characteristics" shows the control as "15 × 90-min sessions" and the VR groups as "15 × 90-min sessions" (p. 9)
5) "The fully immersive VR environment (360° headset-based) was custom-developed in Unity 3D specifically for this study and deployed on Oculus Quest 2 headsets (256 GB model)." (p. 8)
Detailed Analysis:
Criterion B evaluates whether differences in time/budget/material resources between intervention and control could confound the causal contrast, unless those additional resources are explicitly the treatment being tested.
Here, the VR arms clearly involve substantial extra material and infrastructure resources (VR headsets, custom software, cloud and AI components). However, these resources are not incidental; they define the intervention itself (AI-enhanced immersive VR) in contrast to traditional instruction.
Importantly, the paper indicates comparable instructional time exposure across arms via the common "15 × 90-min sessions" dosage reported for both the VR groups and the control group, which reduces the most common imbalance (extra time-on-task).
Final sentence: Criterion B is met because session time is reported as matched across arms and the added technology inputs are integral to the treatment being tested against business-as-usual instruction.
-
Level 3 Criteria
-
R
Reproduced
- No independent replication by a different research team in a different context could be identified for this 2026 study.
Relevant Quotes:
1) (No statement in the paper excerpt indicates that this study has been independently replicated by another research team.)
Detailed Analysis:
Criterion R requires an independent replication of this study (or a clearly identified reproduction of its central experimental claim using the same intervention approach) by a different research team in a different context, published in a peer-reviewed outlet.
The provided paper excerpt does not report any prior replication. An internet search using the DOI and full title did not identify any clearly independent replication studies of this exact trial.
Final sentence: Criterion R is not met because no independent replication of this study could be found in the paper or via an internet search.
-
A
All-subject Exams
- Because the study does not meet criterion E (it uses an author-developed assessment rather than standardized exams), it cannot meet the all-subject standardized exams requirement.
- Language Proficiency (Author-Developed, CEFR-Aligned; Pretest-Posttest). This author-developed test was based on the Common European Framework of Reference for Languages (CEFR)...
Relevant Quotes:
1) "Language Proficiency (Author-Developed, CEFR-Aligned; Pretest-Posttest). This author-developed test was based on the Common European Framework of Reference for Languages (CEFR)..." (p. 12)
2) "This randomized controlled trial tested the effects ... operationalized as performance on an author-developed, CEFR-aligned language proficiency battery emphasizing grammatical and lexical performance..." (p. 1)
Detailed Analysis:
Criterion A requires standardized exam-based assessment across all main subjects and explicitly depends on criterion E being met. Here, the primary outcome is an author-developed battery, so criterion E is not met, which automatically prevents meeting criterion A.
Additionally, the study focuses on L2 language outcomes rather than assessing across all core subjects for the educational program.
Final sentence: Criterion A is not met because criterion E is not met and the study does not assess all subjects using standardized exams.
-
G
Graduation Tracking
- The study does not report tracking participants through to graduation, and because criterion Y is not met, graduation tracking is also not satisfied.
Relevant Quotes:
1) (No statement in the paper excerpt describes following participants until graduation from their program or educational stage.)
Detailed Analysis:
Criterion G requires follow-up tracking until graduation. The provided paper excerpt focuses on pretest-posttest outcomes and does not describe longer-term follow-up through participants’ graduation.
Per the ERCT dependency rule, if criterion Y is not met then criterion G is not met.
An internet search for follow-up publications by the same author reporting graduation outcomes for this cohort did not identify any such follow-up paper.
Final sentence: Criterion G is not met because graduation tracking is not reported, no follow-up paper with graduation outcomes was found, and criterion Y is not met.
-
P
Pre-Registered
- No explicit pre-registration statement or registry identifier is provided showing the protocol was registered before data collection began.
Relevant Quotes:
1) (No pre-registration link, registry name/ID, or registration date is stated in the paper excerpt.)
Detailed Analysis:
Criterion P requires a clearly identified, time-stamped pre-registration in a registry (e.g., OSF Registrations, ClinicalTrials.gov, ISRCTN), with registration occurring before data collection began.
The provided paper excerpt contains detailed methods (including randomization and analysis plans) but does not include a pre-registration statement, registry name, registration ID, or a registration date.
An internet search using the DOI and title did not reveal a clearly linked public preregistration record for this study.
Final sentence: Criterion P is not met because no verifiable pre-registration record is cited in the paper and none was found via internet search.
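The dependency rules cited in this breakdown (Y requires T, G requires Y, A requires E) and the per-level verdicts can be summarized in a short sketch. The data structures and function name are illustrative assumptions rather than an official ERCT implementation, and the sketch only tallies criteria per level without deriving an overall ERCT level.

```python
# Verdicts as stated in this breakdown (True = met).
verdicts = {
    "C": False, "E": False, "T": False, "D": True,   # Level 1
    "S": False, "I": False, "Y": False, "B": True,   # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

# Dependency rules mentioned in the analyses above: dependent -> prerequisite.
DEPENDS_ON = {"Y": "T", "A": "E", "G": "Y"}
LEVELS = {1: "CETD", 2: "SIYB", 3: "RAGP"}


def apply_dependencies(raw: dict) -> dict:
    """Force a criterion to 'not met' when its prerequisite is not met."""
    resolved = dict(raw)
    # Y is resolved before G so that the chain T -> Y -> G propagates.
    for criterion in ("Y", "A", "G"):
        if not resolved[DEPENDS_ON[criterion]]:
            resolved[criterion] = False
    return resolved


resolved = apply_dependencies(verdicts)
met_by_level = {
    level: [c for c in criteria if resolved[c]]
    for level, criteria in LEVELS.items()
}
print(met_by_level)  # {1: ['D'], 2: ['B'], 3: []}
```

Applying the rules leaves the verdicts unchanged here, because every dependent criterion (Y, A, G) was already judged not met on its own evidence.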