AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting

Greg Kestin, Kelly Miller, Anna Klales, Timothy Milbourne & Gregorio Ponti

Published:
ERCT Check Date:
DOI: 10.1038/s41598-025-97652-6
  • science
  • higher education
  • US
  • EdTech platform
  • online homework
  • C

    The study randomized students in small peer-instruction groups for a tutoring-style intervention, which satisfies the ERCT tutoring exception.

    "Through a design that involves targeted, content-rich prompt engineering, we developed an online tutor that uses GAI and best practices from pedagogy and educational psychology to promote learning in undergraduate science education." (p. 2)

  • E

    The study used custom lesson-specific pre- and post-tests rather than widely recognized standardized exams.

    "To prevent the specific test questions from influencing the teaching or AI tutor design, the tests were constructed by a separate team member from those involved in designing the AI or teaching the lessons." (p. 8)

  • T

    The study ran across two consecutive weeks with immediate post-tests, so it did not measure outcomes at least one academic term after the intervention began.

    "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2)

  • D

    The in-class active learning control condition and group characteristics are documented, including comparable demographics and baseline physics background knowledge.

    "The demographics of the two groups were comparable (see table S2A), as were previous measures of their physics background knowledge (see Table S2B)." (p. 7)

  • S

    Randomization occurred within one university course, not at the school level.

    "We conducted a randomized controlled experiment in a large undergraduate physics course (N=194) at Harvard University, with a student population broadly representative of those found across a range of institutions, to measure the difference between 1) how much students learn and 2) students’ perceptions of the learning experience when identical material is presented through an AI tutor compared with an active learning classroom." (p. 2)

  • I

    The AI tutor was designed by the authors and the paper does not describe independent third-party conduct of the study.

    "A subset of the best practices (i-iii) were incorporated into the AI pedagogy by careful engineering of the AI tutor’s system prompt." (p. 4)

  • Y

    Outcomes were measured within a two-week crossover design, not after a full academic year.

    "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2)

  • B

    The AI tutor condition substituted for the in-class lesson and did not add unmatched time or resources for the intervention group.

    "70% of students in the AI group spent less than 60 minutes on task, while 30% spent more than 60 minutes on task. The median time on task for students in the AI group was 49 minutes." (p. 2)

  • R

    No independent peer-reviewed replication of this specific study by another research team was found.

    "Nonetheless, studies that explicitly replicate known in-class active learning results27,28 would be valuable for confirming and refining the details of this transferability." (p. 6)

  • A

    Outcomes were limited to physics topics and the study did not use standardized exams across all main subjects (and criterion E is not met).

    "To establish baseline knowledge, students from both groups completed a pre-test prior to each lesson, focusing on surface tension in the first week and fluid flow in the second." (p. 2)

  • G

    The study reports only immediate post-lesson outcomes and does not track students until graduation (and criterion Y is not met).

    "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2)

  • P

    The paper reports IRB approval but provides no evidence of a pre-registered protocol, and no matching public registry entry was found.

    "The present study was approved by the Harvard University IRB (study no. IRB23-0797) and followed a crossover design." (p. 7)

Abstract

Here we report a randomized, controlled trial measuring college students' learning and their perceptions when content is presented through an AI-powered tutor compared with an active learning class. The novel design of the custom AI tutor is informed by the same pedagogical best practices as employed in the in-class lessons. We find that students learn significantly more in less time when using the AI tutor, compared with the in-class active learning. They also feel more engaged and more motivated. These findings offer empirical evidence for the efficacy of a widely accessible AI-powered pedagogy in significantly enhancing learning outcomes, presenting a compelling case for its broad adoption in learning environments.


ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • The study randomized students in small peer-instruction groups for a tutoring-style intervention, which satisfies the ERCT tutoring exception.
      • "Through a design that involves targeted, content-rich prompt engineering, we developed an online tutor that uses GAI and best practices from pedagogy and educational psychology to promote learning in undergraduate science education." (p. 2)
      • Relevant Quotes: 1) "What if an AI tutor could mimic the learning experience one would get from an expert (human) tutor?" (p. 2) 2) "Through a design that involves targeted, content-rich prompt engineering, we developed an online tutor that uses GAI and best practices from pedagogy and educational psychology to promote learning in undergraduate science education." (p. 2) 3) "Students were randomly assigned to two groups, respecting the constraint that students who regularly worked together in class during peer instruction were placed in the same group in order to maximize the effectiveness of their in-class learning." (p. 7) 4) "As mentioned above, keeping students with their peer-instruction groups meant that subjects were randomized at the level of these groups (2-3 students) rather than as individuals." (p. 8) Detailed Analysis: The ERCT C criterion requires randomization at the class level unless the intervention is personal tutoring (an explicit exception in the ERCT specification). This paper evaluates an AI tutor intended to mimic an expert human tutor and designed for individualized, on-demand feedback. The study randomized students at the level of small peer-instruction groups (2-3 students), not whole classes. Because the intervention is a tutoring system rather than a classroom-wide teaching method, the ERCT exception applies and student-level (or small-group) randomization is acceptable. Final sentence explaining if criterion C is met/not met because the intervention is a tutoring system and the study uses a valid tutoring exception to the class-level randomization requirement.
    • E

      Exam-based Assessment

      • The study used custom lesson-specific pre- and post-tests rather than widely recognized standardized exams.
      • "To prevent the specific test questions from influencing the teaching or AI tutor design, the tests were constructed by a separate team member from those involved in designing the AI or teaching the lessons." (p. 8)
      • Relevant Quotes: 1) "To establish baseline knowledge, students from both groups completed a pre-test prior to each lesson, focusing on surface tension in the first week and fluid flow in the second. Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2) 2) "To prevent the specific test questions from influencing the teaching or AI tutor design, the tests were constructed by a separate team member from those involved in designing the AI or teaching the lessons." (p. 8) 3) "To prevent details of the lessons or AI prompts from influencing the test of learning, the tests were written based on the learning goals for the lesson and not the specific lesson content." (p. 8) Detailed Analysis: Criterion E requires outcome measures to be standardized, widely recognized exams rather than custom assessments designed for the study. The paper explicitly uses lesson-specific pre-tests and post-tests and describes how these tests were constructed for this study, including by a separate team member and aligned to lesson learning goals. This indicates the outcome assessments were custom-made rather than an external standardized exam. Final sentence explaining if criterion E is met/not met because the study's outcomes were measured with custom pre- and post-tests rather than standardized exams.
    • T

      Term Duration

      • The study ran across two consecutive weeks with immediate post-tests, so it did not measure outcomes at least one academic term after the intervention began.
      • "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2)
      • Relevant Quotes: 1) "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2) 2) "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2) 3) "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7) Detailed Analysis: Criterion T requires that outcomes are measured at least one full academic term after the intervention begins (roughly 3-4 months). The study uses a crossover design spanning two consecutive weeks and measures learning via post-tests administered immediately after each lesson. This is far shorter than one academic term from start to outcome measurement. Final sentence explaining if criterion T is met/not met because outcomes were measured within two consecutive weeks, not after at least one academic term.
    • D

      Documented Control Group

      • The in-class active learning control condition and group characteristics are documented, including comparable demographics and baseline physics background knowledge.
      • "The demographics of the two groups were comparable (see table S2A), as were previous measures of their physics background knowledge (see Table S2B)." (p. 7)
      • Relevant Quotes: 1) "During, the first week, group 1 engaged with an AI-supported lesson at home while group 2 participated in an active learning lesson in class." (p. 2) 2) "The demographics of the two groups were comparable (see table S2A), as were previous measures of their physics background knowledge (see Table S2B)." (p. 7) 3) "To make sure that the study design did not impact the effectiveness of in-person instruction during the experiment, students in class learned from the same instructors, with the same student:staff ratio, and in the same peer-instruction groups as they had throughout the course." (p. 8) Detailed Analysis: Criterion D requires a clearly documented control condition and enough detail to compare groups. The paper specifies what the control condition is (in-class active learning), when it occurred in the crossover, and that the two randomized groups had comparable demographics and baseline physics background knowledge (with referenced tables). It also describes how the in-class control was kept consistent with normal course operation, including instructors, staffing ratio, and peer-instruction groups. Final sentence explaining if criterion D is met/not met because the control condition and baseline comparability are described in sufficient detail.
  • Level 2 Criteria

    • S

      School-level RCT

      • Randomization occurred within one university course, not at the school level.
      • "We conducted a randomized controlled experiment in a large undergraduate physics course (N=194) at Harvard University, with a student population broadly representative of those found across a range of institutions, to measure the difference between 1) how much students learn and 2) students’ perceptions of the learning experience when identical material is presented through an AI tutor compared with an active learning classroom." (p. 2)
      • Relevant Quotes: 1) "We conducted a randomized controlled experiment in a large undergraduate physics course (N=194) at Harvard University, with a student population broadly representative of those found across a range of institutions, to measure the difference between 1) how much students learn and 2) students’ perceptions of the learning experience when identical material is presented through an AI tutor compared with an active learning classroom." (p. 2) 2) "The present study took place during the Fall 2023 semester in Physical Sciences 2 (PS2), which is an introductory physics class for the life sciences and is Harvard’s largest physics class (N=233)." (p. 7) Detailed Analysis: Criterion S requires randomization at the school level (different schools or sites assigned to conditions). This study is a within-course experiment in a single undergraduate class at Harvard University, with students assigned to conditions within that course. There is no school-level randomization. Final sentence explaining if criterion S is met/not met because randomization occurred within a single course rather than across schools.
    • I

      Independent Conduct

      • The AI tutor was designed by the authors and the paper does not describe independent third-party conduct of the study.
      • "A subset of the best practices (i-iii) were incorporated into the AI pedagogy by careful engineering of the AI tutor’s system prompt." (p. 4)
      • Relevant Quotes: 1) "We have built an AI-based tutor, engineered with appropriate prompts and scaffolding, that helps students learn significantly more in less time and feel more engaged and motivated compared with in-class active learning." (p. 7) 2) "A subset of the best practices (i-iii) were incorporated into the AI pedagogy by careful engineering of the AI tutor’s system prompt." (p. 4) 3) "While the time commitment for preparation of a single AI-supported lesson was very manageable, there was significant overhead." (p. 8) 4) "The most significant time commitment involved in preparing the AI-supported lessons was the development of an AI tutor platform software that took pedagogical best practices into consideration (e.g., structured around individual questions embedded in individual assignments), which took several months." (p. 8) Detailed Analysis: Criterion I requires that the study be conducted independently from the designers of the intervention. The paper describes the authors designing and building the AI tutor (including engineering the system prompt and building the AI tutor platform software). The paper does not identify an external, independent evaluation team for implementation or analysis. While the tests were constructed by a separate team member, this does not make the study independent from the intervention designers. Final sentence explaining if criterion I is met/not met because the authors designed the intervention and no independent third-party conduct is described.
    • Y

      Year Duration

      • Outcomes were measured within a two-week crossover design, not after a full academic year.
      • "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2)
      • Relevant Quotes: 1) "In this study, students were divided into two groups, each experiencing two lessons, each with distinct teaching methodologies, in consecutive weeks." (p. 2) 2) "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2) Detailed Analysis: Criterion Y requires outcome measurement at least one full academic year after the intervention begins. This study measures outcomes immediately after lessons in a two-week crossover design, so it does not satisfy a year-long tracking requirement. Final sentence explaining if criterion Y is met/not met because the study's tracking window is two weeks rather than a full academic year.
    • B

      Balanced Control Group

      • The AI tutor condition substituted for the in-class lesson and did not add unmatched time or resources for the intervention group.
      • "70% of students in the AI group spent less than 60 minutes on task, while 30% spent more than 60 minutes on task. The median time on task for students in the AI group was 49 minutes." (p. 2)
      • Relevant Quotes: 1) "During a 75-minute period, the in-class students spent 15 minutes taking the pre- and post-tests; we assume 60 minutes spent on learning." (p. 2) 2) "70% of students in the AI group spent less than 60 minutes on task, while 30% spent more than 60 minutes on task. The median time on task for students in the AI group was 49 minutes." (p. 2) 3) "The content and worksheet for the control and experimental conditions were identical." (p. 7) 4) "Given the crossover design, all students experienced both conditions once during the study." (p. 8) Detailed Analysis: Criterion B asks whether the intervention condition received additional time or resources that were not matched in the control condition, unless those additional resources are explicitly the treatment variable. Here, the AI lesson substitutes for the in-class lesson within a fixed course schedule. Time on task for the AI condition is comparable to, and typically lower than, the in-class lesson duration (median 49 minutes vs an assumed 60 minutes of learning in class). Both conditions use identical content and worksheets, and the crossover design ensures all students experience both conditions. These points support that there is no systematic extra time or budget given only to the intervention group that would confound the effect estimate. Final sentence explaining if criterion B is met/not met because the AI tutor condition does not add unmatched instructional time or resources relative to the control condition.
  • Level 3 Criteria

    • R

      Reproduced

      • No independent peer-reviewed replication of this specific study by another research team was found.
      • "Nonetheless, studies that explicitly replicate known in-class active learning results27,28 would be valuable for confirming and refining the details of this transferability." (p. 6)
      • Relevant Quotes: 1) "Nonetheless, studies that explicitly replicate known in-class active learning results27,28 would be valuable for confirming and refining the details of this transferability." (p. 6) Detailed Analysis: Criterion R requires an independent replication of this study by other authors in a peer-reviewed venue. The paper itself frames replication as a future direction rather than reporting an existing independent reproduction. Using web search across scholarly and publisher sources, I did not find a peer-reviewed paper by a different research team that explicitly reports a replication of this specific PS2 Pal crossover experiment and compares its results to the original. Final sentence explaining if criterion R is met/not met because no independent peer-reviewed replication of this specific study was found.
    • A

      All-subject Exams

      • Outcomes were limited to physics topics and the study did not use standardized exams across all main subjects (and criterion E is not met).
      • "To establish baseline knowledge, students from both groups completed a pre-test prior to each lesson, focusing on surface tension in the first week and fluid flow in the second." (p. 2)
      • Relevant Quotes: 1) "To establish baseline knowledge, students from both groups completed a pre-test prior to each lesson, focusing on surface tension in the first week and fluid flow in the second." (p. 2) 2) "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2) Detailed Analysis: Criterion A requires standardized exam-based assessment across all main subjects. This study measures learning only on the two physics lesson topics (surface tension and fluid flow) using pre- and post-tests constructed for the study. Additionally, ERCT requires that if criterion E is not met (no standardized exams), criterion A is not met either. Therefore, A is not met. Final sentence explaining if criterion A is met/not met because outcomes are limited to physics topics and were not measured via standardized exams across all main subjects.
    • G

      Graduation Tracking

      • The study reports only immediate post-lesson outcomes and does not track students until graduation (and criterion Y is not met).
      • "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2)
      • Relevant Quotes: 1) "Following each lesson, students completed post-tests to measure content mastery and answered four questions aimed at gauging their learning experience, including engagement, enjoyment, motivation, and growth mindset." (p. 2) Detailed Analysis: Criterion G requires tracking participants until graduation. Per the ERCT rules, if criterion Y (year duration) is not met, criterion G cannot be met. This study's outcomes are measured immediately after lessons within a two-week crossover design and do not include long-term follow-up. I also searched for follow-up publications by the same authors that track this cohort to graduation and did not find any such papers describing graduation outcomes. Final sentence explaining if criterion G is met/not met because the study does not track students to graduation and does not meet the prerequisite year duration requirement.
    • P

      Pre-Registered

      • The paper reports IRB approval but provides no evidence of a pre-registered protocol, and no matching public registry entry was found.
      • "The present study was approved by the Harvard University IRB (study no. IRB23-0797) and followed a crossover design." (p. 7)
      • Relevant Quotes: 1) "The present study was approved by the Harvard University IRB (study no. IRB23-0797) and followed a crossover design." (p. 7) Detailed Analysis: Criterion P requires a publicly accessible, time-stamped pre-registration of the study protocol before data collection begins. The paper reports IRB approval but does not provide a registry name, registration identifier, or link to a pre-registered protocol. I searched for a corresponding entry on major public registries commonly used for pre-registration (for example OSF and AsPredicted) using the title, authors, and study identifiers, and did not find a protocol that can be confidently matched to this study and dated prior to data collection. Final sentence explaining if criterion P is met/not met because no pre-registration record is cited in the paper and none was found in common public registries.
