Abstract
The emergence of Large Language Models (LLMs) has opened new possibilities for language learning through conversational interaction with chatbots. Yet, little empirical evidence exists on how students experience such interactions and how corrective feedback should be provided. Research suggests that immediate corrective feedback is generally more effective than delayed feedback. Nevertheless, learners’ perception of this effectiveness and their preferences for feedback timing, particularly in the domain of Computer-Assisted Language Learning (CALL), remain underexplored. This study investigates the feasibility of providing immediate feedback and examines the impact of feedback timing on user experience and grammar learning gains in English. An in-the-wild experiment was conducted with 66 L2 English learners, who integrated chatbot sessions into their English course as an extracurricular activity. Participants were randomly assigned to two groups receiving feedback either during or after the conversation. Findings reveal no significant difference in learning gains, but immediate feedback enhanced user experience, leading to overall positive perceptions of the chatbot. Additionally, we explore users’ perceptions of the chatbot’s social role and personality, offering a roadmap for future enhancements. These results provide valuable insights into the potential of LLMs and chatbots for language learning.
ERCT Criteria Breakdown
-
Level 1 Criteria
-
C
Class-level RCT
- Individual-level assignment is used, but the intervention is explicitly framed as one-to-one tutor-like chatbot practice, fitting the ERCT tutoring exception.
- We designed our chatbot to simulate written conversation practices between an English instructor (chatbot) and an English as a Foreign Language (EFL) learner (user),
Relevant Quotes:
1) "Participants were from four classes (A, B, C, D) and pseudo-randomly assigned to either of the two conditions." (p. 11)
2) "We designed our chatbot to simulate written conversation practices between an English instructor (chatbot) and an English as a Foreign Language (EFL) learner (user)," (p. 8)
3) "The chatbot was introduced to participants as “Alex” without specifying gender or age, and positioned as a native English-speaking tutor." (p. 11)
Detailed Analysis:
Criterion C requires class-level (or stronger) randomization to reduce contamination, unless the intervention is personal teaching (e.g., tutoring), in which case student-level assignment is acceptable.
The paper documents a between-subject design with participants pseudo-randomly assigned, and the unit of assignment is individual participants (not whole classes or schools).
However, the intervention is explicitly described as a simulated instructor-learner interaction and the chatbot is positioned as a "tutor". This matches the ERCT exception for tutoring/personal teaching, where student-level assignment can still satisfy Criterion C.
Final summary sentence: Criterion C is met because the study is a tutoring-style, one-to-one chatbot intervention for which student-level assignment is acceptable under the ERCT exception.
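To make the assignment logic concrete, the following is a minimal illustrative sketch of individual-level (pseudo-)random assignment to the two feedback conditions. This is not the authors' procedure (the paper does not publish its assignment code); the function name, seed, and even split are assumptions for illustration only.

```python
import random

def assign_conditions(participants, seed=42):
    """Illustrative sketch: split participants evenly between the two
    feedback conditions via a seeded shuffle. The paper reports only
    pseudo-random individual-level assignment; this exact procedure,
    including the seed and even split, is assumed for illustration."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "ICF": shuffled[:half],   # Immediate Corrective Feedback
        "DCF": shuffled[half:],   # Delayed Corrective Feedback
    }

groups = assign_conditions([f"P{i:02d}" for i in range(1, 67)])
print(len(groups["ICF"]), len(groups["DCF"]))  # 33 33
```

The key point for Criterion C is the unit of assignment: each participant (not each class) is allocated independently, which is acceptable here only because the intervention is tutoring-style.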
-
E
Exam-based Assessment
- The study uses a custom, study-specific grammar task based on participants’ own sentences rather than a standardized exam.
- The participants were given 25 sentences extracted from their utterances and were instructed to identify and correct any mistakes they found.
Relevant Quotes:
1) "a comprehensive language test was not administered before and after the experiment." (p. 11)
2) "The participants were given 25 sentences extracted from their utterances and were instructed to identify and correct any mistakes they found." (p. 11)
Detailed Analysis:
Criterion E requires outcomes to be measured using standardized, widely-recognized exam-based assessments, rather than researcher-created or study-specific instruments.
The paper explicitly states that a comprehensive language test was not administered. Instead, it describes a custom assessment constructed from each participant’s own utterances (25 sentences), scored by how many errors were identified and corrected.
This is not described as a recognized standardized exam; it is a bespoke measure created for this study.
Final summary sentence: Criterion E is not met because outcomes are assessed using a custom task derived from participants’ utterances rather than a standardized exam.
-
T
Term Duration
- The study measures outcomes after about one month (4 weeks) from intervention start, which is shorter than a full academic term.
- Participants completed 12 sessions in total, conducting three per week for 4 weeks
Relevant Quotes:
1) "66 students interacted with an LLM-powered chatbot over 12 sessions distributed over a one-month period." (p. 3)
2) "Participants completed 12 sessions in total, conducting three per week for 4 weeks" (p. 12)
3) "After finishing their final chat session, participants were directed to the language assessment page." (p. 12)
4) "An in-the-wild experiment was conducted with 66 L2 English learners, who integrated chatbot sessions into their English course as an extracurricular activity over one semester." (p. 1)
Detailed Analysis:
Criterion T requires that outcomes be measured at least one full academic term (typically ~3–4 months) after the intervention begins.
The paper contains a higher-level statement in the abstract that the chatbot sessions were integrated "over one semester." However, the concrete schedule in the methodology is explicit: 12 sessions over a "one-month period" and "for 4 weeks", with assessment immediately after the final session.
Using the methodology’s explicit timeline, the elapsed time from the start of intervention exposure to primary outcome measurement is about one month, which is shorter than a term-length follow-up.
Final summary sentence: Criterion T is not met because outcome measurement occurs after roughly one month (4 weeks), not after at least one academic term.
-
D
Documented Control Group
- The paper clearly defines the two conditions (ICF vs. DCF) and reports group sizes and participant characteristics, documenting the comparison group adequately.
- Among participants who completed the experiment, 24 were in the Immediate condition and 30 in the Delayed condition.
Relevant Quotes:
1) "Among participants who completed the experiment, 24 were in the Immediate condition and 30 in the Delayed condition." (p. 11)
2) "Immediate Corrective Feedback (ICF): In this condition, participants were corrected immediately after committing mistakes during the conversation." (p. 8)
3) "Delayed Corrective Feedback (DCF): Participants were not corrected during the conversation in this condition. Instead, they received a summary of their errors and corrections immediately following the conversation." (p. 8)
4) "The analyzed sample comprised 43 males and 11 females, with ages ranging from 18 to 23 years" (p. 10)
Detailed Analysis:
Criterion D requires the control/comparison group to be documented (who they were and what they received), so that comparisons are interpretable.
This study compares two clearly defined conditions (ICF vs. DCF). The paper provides operational definitions of both conditions, reports the sample sizes per condition among completers, and provides baseline sample description (e.g., age and gender composition).
While the comparison group is not a "no-treatment" or business-as-usual control, it is still a documented comparison condition, and the paper provides the necessary descriptive information to understand what each group experienced.
Final summary sentence: Criterion D is met because the paper explicitly defines both conditions and reports group sizes and participant characteristics.
-
Level 2 Criteria
-
S
School-level RCT
- Assignment is at the participant level (across classes), not at the school/site level.
- Participants were from four classes (A, B, C, D) and pseudo-randomly assigned to either of the two conditions.
Relevant Quotes:
1) "Participants were from four classes (A, B, C, D) and pseudo-randomly assigned to either of the two conditions." (p. 11)
Detailed Analysis:
Criterion S requires randomization at the school (or institution/site) level.
The quoted description shows allocation of participants (from four classes) into conditions, i.e., participant-level assignment rather than randomization of separate schools/sites.
Final summary sentence: Criterion S is not met because randomization is not conducted at the school/site level.
-
I
Independent Conduct
- The authors built the chatbot system and also conducted and analyzed the study, with no documented independent evaluator leading the evaluation.
- We developed a web-application chatbot utilizing the OpenAI GPT-3 (Brown et al., 2020) API,
Relevant Quotes:
1) "We developed a web-application chatbot utilizing the OpenAI GPT-3 (Brown et al., 2020) API," (p. 8)
2) "AM: Conceptualization, Data curation, Formal analysis, Investigation," (p. 18)
3) "BT: Data curation, Formal analysis, Methodology, Software," (p. 18)
Detailed Analysis:
Criterion I requires the evaluation to be conducted independently from the intervention designers/providers to reduce bias in implementation, measurement, analysis, and reporting.
The paper explicitly states that the authors developed the chatbot. The author contributions indicate that the same author team performed the investigation and formal analysis (and software development).
No statement in the paper indicates that an external, independent evaluation team conducted the evaluation or led the analysis.
Final summary sentence: Criterion I is not met because the intervention developers also conducted and analyzed the study without documented independent evaluation.
-
Y
Year Duration
- The study lasts about one month (4 weeks), far short of 75% of an academic year; additionally, since T is not met, Y is not met under the ERCT dependency rule.
- Participants completed 12 sessions in total, conducting three per week for 4 weeks
Relevant Quotes:
1) "66 students interacted with an LLM-powered chatbot over 12 sessions distributed over a one-month period." (p. 3)
2) "Participants completed 12 sessions in total, conducting three per week for 4 weeks" (p. 12)
Detailed Analysis:
Criterion Y requires outcomes to be measured at least 75% of one academic year after intervention begins, and ERCT further specifies that if Criterion T is not met, Criterion Y is not met.
The quoted duration is approximately one month (4 weeks), which is far shorter than an academic year and also shorter than a term. Therefore it fails both the direct duration requirement and the ERCT dependency on T.
Final summary sentence: Criterion Y is not met because the intervention and follow-up span only about one month and T is not met.
-
B
Balanced Control Group
- The two conditions are explicitly described as having the same exposure and feedback generation, differing only in the timing of feedback.
- the amount of feedback generation was the same between the two conditions,
Relevant Quotes:
1) "the amount of feedback generation was the same between the two conditions, with the only difference being the time when the feedback was given to the participant" (p. 8)
2) "Each chat session was considered complete once the user generated 1,000 characters to ensure consistent chat exposure across participants." (p. 12)
Detailed Analysis:
Criterion B compares the nature, quantity, and quality of resources provided to each condition, asking whether one group receives any added time, budget, or material advantage (unless that advantage is itself the intended treatment).
This study compares two versions of the same chatbot system (ICF vs. DCF). The paper explicitly states the "amount of feedback generation" was the same between conditions and that the only difference was feedback timing.
It also standardizes exposure by defining a session completion threshold (1,000 characters) to ensure consistent exposure across participants.
These statements support that there is no systematic imbalance in time, materials, or educational resources between groups beyond the intended timing manipulation.
Final summary sentence: Criterion B is met because the paper explicitly documents equivalent exposure and feedback generation across conditions, with timing as the only difference.
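The exposure-standardization rule quoted above (a session counts as complete once the user has produced 1,000 characters) is simple enough to sketch directly. This is an illustrative implementation, not the authors' code; the constant and function names are assumptions.

```python
# Per the paper's rule: a chat session is complete once the user has
# generated 1,000 characters, keeping chat exposure consistent across
# participants. Names below are hypothetical.
SESSION_CHAR_THRESHOLD = 1_000

def session_complete(user_messages):
    """Return True once the user's cumulative message length
    reaches the 1,000-character completion threshold."""
    return sum(len(m) for m in user_messages) >= SESSION_CHAR_THRESHOLD

print(session_complete(["a" * 600, "b" * 400]))  # True
print(session_complete(["a" * 999]))             # False
```

For Criterion B, the relevant design choice is that the threshold is applied identically in both conditions, so neither group accrues extra practice time or material.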
-
Level 3 Criteria
-
R
Reproduced
- No independent replication of this specific experiment by a different research team was found during the ERCT check.
Relevant Quotes:
1) (No relevant statement found in the paper indicating an independent replication of this study.) (n/a)
Detailed Analysis:
Criterion R requires an independent replication of the study by a different team, in a different context, published in a peer-reviewed outlet.
The paper itself does not claim to be a replication study, and it does not report that another team has replicated this specific experiment.
As part of this ERCT check (dated 2026-04-14), web searching for replication studies that explicitly replicate this exact intervention and design did not identify any independent replications.
Final summary sentence: Criterion R is not met because no independent replication evidence was found.
-
A
All-subject Exams
- The study does not use standardized exams (E is not met), so all-subject standardized exam assessment (A) is automatically not met.
- a comprehensive language test was not administered before and after the experiment.
Relevant Quotes:
1) "a comprehensive language test was not administered before and after the experiment." (p. 11)
Detailed Analysis:
Criterion A requires standardized exam-based assessment across all main subjects, and ERCT specifies that if Criterion E is not met, then Criterion A is not met.
Since the study explicitly did not administer a comprehensive language test (and instead used a custom assessment), Criterion E is not met and therefore Criterion A cannot be met.
Final summary sentence: Criterion A is not met because E is not met and no standardized all-subject exam-based outcomes are reported.
-
G
Graduation Tracking
- The study ends after the post-session assessment/forms and does not track participants to graduation; additionally, since Y is not met, G is not met under the ERCT dependency rule.
- The experiment concluded once both forms were completed.
Relevant Quotes:
1) "The experiment concluded once both forms were completed." (p. 12)
2) "Participants completed 12 sessions in total, conducting three per week for 4 weeks" (p. 12)
Detailed Analysis:
Criterion G requires tracking participants until graduation, and ERCT specifies that if Criterion Y is not met, Criterion G is not met.
The study describes a short intervention with an immediate post-session assessment and concludes when the final forms are completed. There is no description of long-term follow-up, degree completion, or graduation outcomes.
Additionally, because the study duration is far shorter than a year, Criterion Y is not met, which also forces Criterion G to be not met under the ERCT dependency rule.
As part of this ERCT check (dated 2026-04-14), web searching did not identify any subsequent peer-reviewed follow-up papers by the same author team reporting graduation tracking for this cohort.
Final summary sentence: Criterion G is not met because the study ends after immediate post-intervention measures and provides no graduation tracking (and Y is not met).
-
P
Pre-Registered
- The paper links to OSF/GitHub for data and code but provides no preregistration identifier/date, and no preregistration record could be verified during the ERCT check.
- The data and analysis can be found at: OSF (https://osf.io/m9qgf/) and the code of the web-app of the project can be accessed at: GitHub (https://ali.mk/ChatBot2023).
Relevant Quotes:
1) "The data and analysis can be found at: OSF (https://osf.io/m9qgf/) and the code of the web-app of the project can be accessed at: GitHub (https://ali.mk/ChatBot2023)." (p. 18)
2) "Since our data collection in early 2022," (p. 7)
Detailed Analysis:
Criterion P requires a publicly verifiable preregistered protocol created before the study began (including a registry identifier and registration date that predates data collection).
The paper provides an OSF link and a GitHub link for data/code sharing, which supports openness, but it does not state that the study protocol was preregistered, and it does not provide any registration identifier or registration date.
The paper indicates data collection occurred in early 2022, so a valid preregistration would need to be time-stamped before that period. During this ERCT check (dated 2026-04-14), no preregistration record for this specific study could be verified from the information provided in the article.
Final summary sentence: Criterion P is not met because no verifiable preregistration ID/date is provided (and preregistration could not be confirmed via the provided OSF link).