Abstract
Although considerable research has explored the role of feedback in second language writing, few studies have directly compared the effects of AI-generated feedback and teacher-written feedback on both academic performance and emotional experience, particularly in EFL contexts. This mixed-methods brief report investigated the impact of AI-mediated feedback on Iranian EFL learners’ writing achievement and perceptions. Twenty-two upper-intermediate institute students were randomly assigned to either an AI feedback group (n = 11) or a teacher feedback group (n = 11). Both groups completed two parallel opinion paragraph writing tasks and received feedback accordingly. Quantitative analysis showed that the AI feedback group achieved significantly greater improvement in post-test scores than the teacher feedback group, highlighting the role of automated formative feedback in enhancing writing performance. Qualitative thematic analysis revealed that AI feedback promoted clarity, autonomy, and motivation, whereas teacher feedback provided emotional reassurance and interpersonal support. These findings suggest that AI systems, when designed and implemented thoughtfully, can complement human feedback by fostering both linguistic progress and learner autonomy in EFL writing instruction.
ERCT Criteria Breakdown
-
Level 1 Criteria
-
C
Class-level RCT
- Participants were randomized at the individual student level within one course, not by class (and no tutoring exception is explicitly stated).
- "They were randomly assigned to two groups: an experimental group (n = 11) receiving AI-generated feedback and a control group (n = 11) receiving teacher feedback." (p. 4)
Relevant Quotes:
1) "Twenty-two upper-intermediate institute students were randomly assigned to either an AI feedback group (n=11) or a teacher feedback group (n=11)." (Abstract, p. 1)
2) "The participants were 22 Iranian EFL learners (11 males and 11 females) aged 18–22, enrolled in an upper-intermediate writing course at a private language institute in Mashhad, Iran." (p. 4)
3) "They were randomly assigned to two groups: an experimental group (n = 11) receiving AI-generated feedback and a control group (n = 11) receiving teacher feedback." (p. 4)
Detailed Analysis:
Criterion C requires randomization at the class level (or stronger) to reduce contamination, unless the intervention is clearly one-to-one personal tutoring where student-level randomization is acceptable.
The paper describes one writing course at a private language institute and states that the 22 learners "were randomly assigned to two groups". This indicates individual-level randomization within a single instructional setting, not randomization of intact classes (or schools/sites).
The intervention is feedback modality (AI-generated vs teacher-written feedback) delivered to students in the same course context, and the paper does not invoke the one-to-one tutoring exception under which student-level randomization would be acceptable within the ERCT framework.
Criterion C is not met because randomization was at the individual student level rather than by class (and no tutoring exception is explicitly documented).
-
E
Exam-based Assessment
- Outcomes were measured with researcher-created writing prompts and rubric-based ratings (adapted from IELTS descriptors), not a standardized exam.
- "Essays were assessed by two independent EFL instructors using an analytic rubric adapted from the IELTS writing descriptors." (p. 4)
Relevant Quotes:
1) "Writing achievement was measured using two parallel writing tasks designed for the pre- and post-test phases." (p. 4)
2) "Prompt A (Pre-test): “Some schools are replacing traditional textbooks with online learning platforms. Do you agree or disagree? Give two reasons and examples to support your view.”" (p. 4)
3) "Prompt B (Post-test): “Many people say school uniforms help students concentrate on learning. Do you agree or disagree? Give two reasons and examples to support your view.”" (p. 4)
4) "Essays were assessed by two independent EFL instructors using an analytic rubric adapted from the IELTS writing descriptors." (p. 4)
Detailed Analysis:
Criterion E requires standardized exam-based assessment (i.e., a widely recognized standardized test administered as an exam), not an instrument assembled for the study.
Here, the writing outcome is based on two study-selected prompts (Prompt A and Prompt B) and ratings using an analytic rubric. While the rubric is "adapted from the IELTS writing descriptors," the paper does not describe administering IELTS (or another standardized exam) under standardized exam conditions; instead, it uses researcher-provided prompts scored by raters.
Criterion E is not met because the paper uses study-specific writing tasks and rubric scoring rather than a standardized exam.
-
T
Term Duration
- The outcome measurement occurred about two weeks after the intervention started, far shorter than an academic term.
- "Two weeks later (Week 3), both groups completed the post-test writing task under the same conditions." (p. 5)
Relevant Quotes:
1) "In Week 1, all participants completed the pre-test writing task under identical classroom conditions." (p. 5)
2) "Two weeks later (Week 3), both groups completed the post-test writing task under the same conditions." (p. 5)
Detailed Analysis:
Criterion T requires that outcomes be measured at least one full academic term after the intervention begins (typically about 3–4 months), even if the intervention itself is short.
The procedure explicitly places pre-testing in "Week 1" and post-testing in "Week 3," which is approximately a two-week interval. This is far shorter than a term-long follow-up.
Criterion T is not met because the pre-to-post interval is about two weeks, not one academic term.
-
D
Documented Control Group
- The control group is clearly defined (teacher feedback), with group sizes and baseline performance reported.
- "They were randomly assigned to two groups: an experimental group (n = 11) receiving AI-generated feedback and a control group (n = 11) receiving teacher feedback." (p. 4)
Relevant Quotes:
1) "They were randomly assigned to two groups: an experimental group (n = 11) receiving AI-generated feedback and a control group (n = 11) receiving teacher feedback." (p. 4)
2) "Both groups started at nearly the same mean level (AI group M=10.23, SD=1.18; Teacher group M=10.41, SD=1.11)." (p. 6)
3) "Table 2 Pre- and Post-test writing scores by feedback group" (p. 7)
Detailed Analysis:
Criterion D requires a well-documented control group, including a clear description of what the control condition received and baseline outcome information.
The paper explicitly defines the control condition as the group "receiving teacher feedback" and reports its size (n = 11). It also reports baseline (pre-test) means and standard deviations for both groups and provides a table summarizing pre- and post-test outcomes by group.
Criterion D is met because the control condition is explicitly described and baseline performance data are reported.
-
Level 2 Criteria
-
S
School-level RCT
- Randomization was not at the school (or site) level; the study took place within a single institute setting with student-level assignment.
- "The participants were 22 Iranian EFL learners (11 males and 11 females) aged 18–22, enrolled in an upper-intermediate writing course at a private language institute in Mashhad, Iran." (p. 4)
Relevant Quotes:
1) "The participants were 22 Iranian EFL learners (11 males and 11 females) aged 18–22, enrolled in an upper-intermediate writing course at a private language institute in Mashhad, Iran." (p. 4)
2) "They were randomly assigned to two groups: an experimental group (n = 11) receiving AI-generated feedback and a control group (n = 11) receiving teacher feedback." (p. 4)
Detailed Analysis:
Criterion S requires randomization among schools (or comparable sites), not just among individuals or classes within a single site.
The paper describes one private language institute setting and random assignment of learners to conditions. It does not describe multiple institutes/schools randomized to conditions.
Criterion S is not met because the unit of randomization is individual learners within one institute setting, not schools/sites.
-
I
Independent Conduct
- While raters were independent and blinded, the study does not document an independent external evaluation team conducting the trial overall.
- "A.M was responsible for writing and supervising this paper." (p. 10)
Relevant Quotes:
1) "Essays were assessed by two independent EFL instructors using an analytic rubric adapted from the IELTS writing descriptors." (p. 4)
2) "To ensure scoring reliability, both raters independently evaluated all scripts without access to group assignment." (p. 4)
3) "A.M was responsible for writing and supervising this paper." (p. 10)
Detailed Analysis:
Criterion I requires that the study be conducted independently from the authors who designed/implemented the intervention, to reduce bias in implementation, measurement, analysis, and reporting.
The paper documents good practice in outcome scoring: it uses "two independent EFL instructors" and states they scored "without access to group assignment." However, independence of outcome raters does not by itself establish that the overall trial conduct (design, assignment, implementation, data management, and analysis) was performed by an independent external evaluation team.
The paper provides no statement that an external evaluator ran the study, and it states the (single) author was responsible for writing and supervising the paper, which does not support independent conduct.
Criterion I is not met because the study does not document independent external conduct of the overall evaluation (despite blinded independent raters).
-
Y
Year Duration
- The study’s pre-to-post tracking spans only about two weeks, which is far less than 75% of an academic year.
- "Two weeks later (Week 3), both groups completed the post-test writing task under the same conditions." (p. 5)
Relevant Quotes:
1) "In Week 1, all participants completed the pre-test writing task under identical classroom conditions." (p. 5)
2) "Two weeks later (Week 3), both groups completed the post-test writing task under the same conditions." (p. 5)
Detailed Analysis:
Criterion Y requires outcomes be measured at least 75% of one academic year after the intervention begins.
The described tracking window is from Week 1 to Week 3, i.e., about two weeks. This is far below the required year-scale duration.
Criterion Y is not met because the study tracks outcomes for only about two weeks rather than most of an academic year.
-
B
Balanced Control Group
- The study explicitly matched feedback structure and approximate length across groups, indicating comparable resources and dosage.
- "The control group received teacher-written feedback from the same instructor using a pre-planned template mirroring the structure and approximate length of the AI feedback (one positive remark, three improvement points, two suggestions)." (p. 5)
Relevant Quotes:
1) "Feedback was copied directly from the ChatGPT interface without modification and printed for each student." (p. 5)
2) "The control group received teacher-written feedback from the same instructor using a pre-planned template mirroring the structure and approximate length of the AI feedback (one positive remark, three improvement points, two suggestions)." (p. 5)
3) "This parallel structure was used to control for feedback quantity and organization, allowing the comparison to focus on feedback source rather than format." (p. 5)
Detailed Analysis:
Criterion B examines the nature, quantity, and quality of resources (time, materials, adult support, etc.) provided to the intervention and control conditions, asking whether the control condition received a comparable substitute for the intervention’s inputs (unless additional resources are explicitly the treatment variable).
In this study, both groups receive written formative feedback after the writing tasks. The paper explicitly states that the teacher-feedback condition used a template that mirrors the "structure and approximate length" of the AI feedback, and that this was done to "control for feedback quantity and organization." This is direct evidence that the authors attempted to balance the key resource input (feedback dosage and format) across groups.
The treatment contrast is the source of feedback (AI vs teacher), not an uncontrolled difference in time or materials.
Criterion B is met because the study documents a deliberate attempt to match feedback dosage/structure between groups, keeping resources balanced while testing feedback source.
-
Level 3 Criteria
-
R
Reproduced
- No independent peer-reviewed replication of this specific study was identified, and the paper itself does not report replication.
Relevant Quotes:
1) (No statements in the paper indicate that this study is a replication of a prior RCT or that it has already been replicated.) (n/a)
Detailed Analysis:
Criterion R requires an independent replication by a different research team in a different context, published in a peer-reviewed outlet.
The article does not present itself as a replication and does not report results from an additional independent trial.
On 2026-04-14, targeted web searches using the exact DOI (10.1007/s44163-026-00935-8) and the full article title did not identify a peer-reviewed study explicitly replicating this specific Maleki (2026) experiment (same comparison, same core design).
Criterion R is not met because independent peer-reviewed replication of this specific study was not found and is not reported in the paper.
-
A
All-subject Exams
- Criterion E is not met, so criterion A is automatically not met; additionally, outcomes focus on writing only rather than all core subjects.
- "Writing achievement was measured using two parallel writing tasks designed for the pre- and post-test phases." (p. 4)
Relevant Quotes:
1) "Writing achievement was measured using two parallel writing tasks designed for the pre- and post-test phases." (p. 4)
2) "Essays were assessed by two independent EFL instructors using an analytic rubric adapted from the IELTS writing descriptors." (p. 4)
Detailed Analysis:
Criterion A requires standardized exam-based assessment across all main subjects. The ERCT dependency rule also specifies that if criterion E is not met, then criterion A is not met.
This study measures only EFL writing performance using study-specific prompts scored with a rubric, not standardized exam scores across subjects.
Criterion A is not met because criterion E is not met and the study assesses writing only using non-standardized tasks.
-
G
Graduation Tracking
- Criterion Y is not met, so criterion G is automatically not met; the paper also reports only short-term outcomes and does not track learners to any graduation milestone.
- "This study, however, is not without limitations. The small sample size and short intervention period limit generalizability." (p. 9)
Relevant Quotes:
1) "In Week 1, all participants completed the pre-test writing task under identical classroom conditions." (p. 5)
2) "Two weeks later (Week 3), both groups completed the post-test writing task under the same conditions." (p. 5)
3) "This study, however, is not without limitations. The small sample size and short intervention period limit generalizability." (p. 9)
Detailed Analysis:
Criterion G requires tracking participants until graduation. The ERCT dependency rule also states that if criterion Y is not met, then criterion G is not met.
The paper documents only a short pre-test to post-test window (Week 1 to Week 3) and does not describe tracking participants to any graduation milestone (e.g., course completion, program completion, credential completion, or school graduation). Instead, it explicitly frames the intervention period as short.
On 2026-04-14, targeted web searches for follow-up publications by the same author that tracked this specific cohort beyond the reported short window did not identify any peer-reviewed graduation-tracking follow-up study for this experiment.
Criterion G is not met because the study does not include graduation tracking and criterion Y is not met.
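The dependency rules invoked for criteria A and G above (if E is not met, A cannot be met; if Y is not met, G cannot be met) can be sketched as a short script. This is a hypothetical illustration of the logic only: the `apply_dependencies` helper and the criteria dictionary are not part of any official ERCT tooling.

```python
def apply_dependencies(criteria):
    """Force dependent criteria to 'not met' when their prerequisite is not met.

    criteria: dict mapping criterion letter -> bool (met / not met).
    Returns a new dict with the ERCT dependency rules applied.
    """
    dependencies = {"A": "E", "G": "Y"}  # dependent -> prerequisite
    resolved = dict(criteria)
    for dependent, prereq in dependencies.items():
        if not resolved.get(prereq, False):
            resolved[dependent] = False
    return resolved


# This study's Level 1 and Level 2 assessments as given above; A and G are
# entered as True here only to show that the dependency rule overrides them.
results = apply_dependencies({
    "C": False, "E": False, "T": False, "D": True,
    "S": False, "I": False, "Y": False, "B": True,
    "A": True,  # overridden: E is not met
    "G": True,  # overridden: Y is not met
})
```

Applying the helper confirms that A and G resolve to "not met" regardless of how they are assessed in isolation, which matches the conclusions reached for this study.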
-
P
Pre-Registered
- The paper provides no pre-registration identifier or registry link, and it explicitly lists the clinical trial number as not applicable.
- "Clinical trial number" / "Not applicable." (p. 10)
Relevant Quotes:
1) "Clinical trial number" (p. 10)
2) "Not applicable." (p. 10)
3) "In Week 1, all participants completed the pre-test writing task under identical classroom conditions." (p. 5)
Detailed Analysis:
Criterion P requires a publicly pre-registered protocol (registry name and identifier) with registration occurring before data collection began.
The paper does not provide any pre-registration statement (e.g., OSF, AEA RCT registry, ClinicalTrials.gov) and explicitly indicates the clinical trial number is "Not applicable." The procedure indicates data collection began in Week 1 with the pre-test.
On 2026-04-14, targeted web searches for a public registration associated with this study (using the full title, DOI, and author name, including OSF and ClinicalTrials.gov queries) did not identify a pre-registered protocol record.
Criterion P is not met because no public pre-registration (registry and timing before data collection) is provided or found.