AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting

Greg Kestin, Kelly Miller, Anna Klales, Timothy Milbourne, Gregorio Ponti

Published:
ERCT Check Date:
DOI: 10.1038/s41598-025-97652-6
  • science
  • higher education
  • US
  • EdTech platform
0
  • C

    The study uses a randomized cross-over design at the peer-group level (student level), which is acceptable under the ERCT exception for personal tutoring interventions.

    "Students were randomly assigned to two groups, respecting the constraint that students who regularly worked together in class during peer instruction were placed in the same group..." (p. 7)

  • E

    Outcomes were measured using custom pre- and post-tests designed for the specific lessons, not standardized exam-based assessments.

    "Following each lesson, students completed post-tests to measure content mastery..." (p. 2)

  • T

    Outcomes were measured immediately following two single-lesson interventions, falling far short of the one-term duration requirement.

    "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7)

  • D

    The control group (in-class active learning) is well-documented, including pedagogy, student demographics, and baseline knowledge.

    "All in-class lessons employed research-based best practices for in-class active learning." (p. 7)

  • S

    The study was conducted within a single university course, not randomized across multiple schools or institutions.

    "The present study took place during the Fall 2023 semester in Physical Sciences 2 (PS2)... is Harvard's largest physics class" (p. 7)

  • I

    The study was designed, conducted, and analyzed by the authors, including the course instructors, without independent third-party conduct.

    "Author contributions... Methodology: G.K., K.M., A.K., G.P. ... Validation: G.K., K.M. Formal Analysis: G.K., K.M." (p. 9)

  • Y

    Outcomes were measured over a two-week period, not tracked for a full academic year.

    "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7)

  • B

    The intervention replaced the control activity without adding extra time; in fact, the intervention group spent less time on task than the control group.

    "The median time on task for students in the AI group was 49 minutes." (p. 2)

  • R

    No independent peer-reviewed replication of this specific AI tutoring intervention was found.

  • A

    The study only assessed physics content knowledge, not all main subjects.

    "focusing on surface tension in the first week and fluid flow in the second." (p. 2)

  • G

    The study tracks learning only for the duration of the lessons and does not follow students to graduation.

  • P

    The study mentions IRB approval but does not provide evidence of a pre-registered protocol on a public registry.

    "The present study was approved by the Harvard University IRB (study no. IRB23-0797)..." (p. 7)

Abstract

This study reports a randomized, controlled trial measuring college students' learning and their perceptions when content is presented through an AI-powered tutor compared with an active learning class. We find that students learn significantly more in less time when using the AI tutor, compared with the in-class active learning. They also feel more engaged and more motivated. These findings offer empirical evidence for the efficacy of a widely accessible AI-powered pedagogy in significantly enhancing learning outcomes.

ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • The study uses a randomized cross-over design at the peer-group level (student level), which is acceptable under the ERCT exception for personal tutoring interventions.
      • "Students were randomly assigned to two groups, respecting the constraint that students who regularly worked together in class during peer instruction were placed in the same group..." (p. 7)
      • Relevant Quotes: 1) "Students were randomly assigned to two groups, respecting the constraint that students who regularly worked together in class during peer instruction were placed in the same group in order to maximize the effectiveness of their in-class learning." (p. 7) 2) "The structure of the experimental condition differed from the control condition in that all interactions and feedback were with an AI tutor, rather than with peer-instruction followed by instructor feedback." (p. 8) 3) "Working with an expert personal tutor is generally regarded as the most efficient form of education... What if an AI tutor could mimic the learning experience one would get from an expert (human) tutor?" (p. 2) Detailed Analysis: The study randomizes students (grouped by their small peer instruction clusters) to either the AI condition or the in-class condition. While this is technically not a "Class-level" randomization (as it occurs within a single course), the ERCT standard provides an exception for interventions designed for personal teaching or tutoring. The paper explicitly frames the intervention as "AI tutoring" intended to mimic "one-on-one tutoring." Therefore, the student-level (or small group-level) randomization is acceptable under the exception. Final sentence: The criterion is met because the intervention is a personal tutoring tool, allowing for the student-level randomization exception.
    • E

      Exam-based Assessment

      • Outcomes were measured using custom pre- and post-tests designed for the specific lessons, not standardized exam-based assessments.
      • "Following each lesson, students completed post-tests to measure content mastery..." (p. 2)
      • Relevant Quotes: 1) "To establish baseline knowledge, students from both groups completed a pre-test prior to each lesson... Following each lesson, students completed post-tests to measure content mastery" (p. 2) 2) "To prevent the specific test questions from influencing the teaching or AI tutor design, the tests were constructed by a separate team member... tests were written based on the learning goals for the lesson" (p. 8) Detailed Analysis: The study uses custom-created quizzes (pre- and post-tests) that are specific to the two lessons (surface tension and fluid flow). While the authors used the Force Concept Inventory (FCI) for baseline characterization, the FCI was not the outcome measure for the intervention. The standard requires widely recognized, standardized exams to measure the educational outcome. Custom lesson-aligned tests do not meet this requirement. Final sentence: The criterion is not met because the study relies on custom-designed post-tests rather than standardized exams to measure learning outcomes.
    • T

      Term Duration

      • Outcomes were measured immediately following two single-lesson interventions, falling far short of the one-term duration requirement.
      • "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7)
      • Relevant Quotes: 1) "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7) 2) "Following each lesson, students completed post-tests..." (p. 2) Detailed Analysis: The intervention consisted of two specific lessons occurring in consecutive weeks. The measurement (post-test) occurred immediately after each lesson. The ERCT standard requires that outcomes be measured at least one full academic term after the intervention begins. Here, the measurement was immediate, and the total duration of the study interaction was only two weeks. Final sentence: The criterion is not met because the interval between the start of the intervention and the measurement of outcomes was less than one academic term.
    • D

      Documented Control Group

      • The control group (in-class active learning) is well-documented, including pedagogy, student demographics, and baseline knowledge.
      • "All in-class lessons employed research-based best practices for in-class active learning." (p. 7)
      • Relevant Quotes: 1) "All in-class lessons employed research-based best practices for in-class active learning... First the instructor introduces an activity, then students work through the activity in self-selected groups..." (p. 7) 2) "The demographics of the two groups were comparable (see table S2A), as were previous measures of their physics background knowledge" (p. 7) Detailed Analysis: The paper provides a detailed description of the control condition, which is the standard "in-class active learning" format for the course. It describes the structure (intro, group work, feedback), the qualifications of the instructors, and the baseline characteristics of the students in that group (via FCI and CLASS scores). This satisfies the requirement for a documented control group. Final sentence: The criterion is met as the paper clearly documents the control group's composition, baseline characteristics, and instructional conditions.
  • Level 2 Criteria

    • S

      School-level RCT

      • The study was conducted within a single university course, not randomized across multiple schools or institutions.
      • "The present study took place during the Fall 2023 semester in Physical Sciences 2 (PS2)... is Harvard's largest physics class" (p. 7)
      • Relevant Quotes: 1) "The present study took place during the Fall 2023 semester in Physical Sciences 2 (PS2)... at Harvard University" (p. 7) 2) "Students were randomly assigned to two groups... within the course" (implied, p. 7) Detailed Analysis: The study was localized to one specific course at one university. To meet the School-level RCT criterion, randomization must occur among different schools or distinct educational sites. Randomizing peer groups within a single lecture course does not meet this threshold. Final sentence: The criterion is not met because randomization occurred within a single course rather than across multiple schools or sites.
    • I

      Independent Conduct

      • The study was designed, conducted, and analyzed by the authors, including the course instructors, without independent third-party conduct.
      • "Author contributions... Methodology: G.K., K.M., A.K., G.P. ... Validation: G.K., K.M. Formal Analysis: G.K., K.M." (p. 9)
      • Relevant Quotes: 1) "Author contributions: Conceptualization: G.K., K.M., A.K., T.W.M. Methodology: G.K., K.M., A.K., G.P. Software Conceptualization and Design: G.K." (p. 9) 2) "Videos were produced... and the instructor (GK) has a decade of experience..." (p. 6) Detailed Analysis: The primary authors (G.K. and K.M.) were responsible for the software design, the methodology, the validation, and the formal analysis. G.K. is also identified as an instructor. There is no evidence of an external evaluation agency or independent researchers conducting the data collection or analysis. Final sentence: The criterion is not met because the intervention designers and course instructors were responsible for conducting and analyzing the study.
    • Y

      Year Duration

      • Outcomes were measured over a two-week period, not tracked for a full academic year.
      • "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7)
      • Relevant Quotes: 1) "The study took place during one of the two meeting of the class during the ninth and tenth weeks of the course." (p. 7) 2) "We have found that when students interact with our AI tutor... they learn significantly more... than when they engage with the same content during an in-class active learning lesson" (p. 4) Detailed Analysis: The Year Duration criterion requires outcomes to be measured at least one full academic year after the intervention starts. This study focuses on immediate learning gains from two specific lessons. There is no longitudinal tracking mentioned that spans an academic year. Final sentence: The criterion is not met because the study duration and follow-up did not span a full academic year.
    • B

      Balanced Resources

      • The intervention replaced the control activity without adding extra time; in fact, the intervention group spent less time on task than the control group.
      • "The median time on task for students in the AI group was 49 minutes." (p. 2)
      • Relevant Quotes: 1) "During a 75-minute period, the in-class students spent 15 minutes taking the pre- and post-tests; we assume 60 minutes spent on learning." (p. 2) 2) "The median time on task for students in the AI group was 49 minutes." (p. 2) 3) "70% of students in the AI group spent less than 60 minutes on task" (p. 2) Detailed Analysis: The study compares two modes of instruction: AI tutoring vs. in-class active learning. The control group used human resources (instructors/TAs) and spent approximately 60 minutes on learning. The intervention group used software (AI) and spent a median of 49 minutes. The intervention did not add extra time or budget that would unbalance the comparison; if anything, it was more time-efficient. The "resources" (AI vs. humans) define the treatment variable itself. There is no evidence that the treatment group received "extra" help (such as double dosing) that the control group did not. Final sentence: The criterion is met because the intervention did not require additional time or non-integral resources compared to the control group; outcome differences cannot be attributed to extra time on task.
  • Level 3 Criteria

    • R

      Reproduced Results

      • No independent peer-reviewed replication of this specific AI tutoring intervention was found.
      • Relevant Quotes: (No quotes in the paper describe an independent replication). Detailed Analysis: A search for "Kestin AI tutor Harvard replication" and related terms yielded no independent studies published in peer-reviewed journals that replicate this specific experiment. As the paper was published in 2025 and describes a novel design, it appears to be the primary source. The ERCT standard requires an independent replication to meet this criterion. Final sentence: The criterion is not met because no independent replication of the study has been published.
    • A

      All Exams

      • The study only assessed physics content knowledge, not all main subjects.
      • "focusing on surface tension in the first week and fluid flow in the second." (p. 2)
      • Relevant Quotes: 1) "focusing on surface tension in the first week and fluid flow in the second." (p. 2) 2) "post-tests to measure content mastery" (p. 2) Detailed Analysis: The study focuses exclusively on Physics (specifically two topics within it). It does not assess Math, English, or other main subjects. Furthermore, since criterion E (Exam-based Assessment) was not met, criterion A is automatically not met as per the standard instructions. Final sentence: The criterion is not met because assessment was limited to specific physics topics and did not cover all main subjects using standardized exams.
    • G

      Graduation Tracking

      • The study tracks learning only for the duration of the lessons and does not follow students to graduation.
      • Relevant Quotes: (No quotes in the text mention tracking to graduation). Detailed Analysis: The paper reports on immediate post-test scores and student perceptions collected during the semester. There is no mention of tracking students until they graduate from the university. Given the recent publication date (2025), long-term graduation tracking is unlikely to have occurred or been reported yet. Final sentence: The criterion is not met because there is no evidence of tracking participants until graduation.
    • P

      Pre-Registered Protocol

      • The study mentions IRB approval but does not provide evidence of a pre-registered protocol on a public registry.
      • "The present study was approved by the Harvard University IRB (study no. IRB23-0797)..." (p. 7)
      • Relevant Quotes: 1) "The present study was approved by the Harvard University IRB (study no. IRB23-0797) and followed a crossover design." (p. 7) Detailed Analysis: While the study cites IRB approval, this is not the same as pre-registration of the analysis plan and hypotheses. A search of standard registries (OSF, ClinicalTrials.gov) for the specific authors and study description did not yield a pre-registered protocol stamped prior to data collection. The paper does not contain a statement linking to a pre-registration. Final sentence: The criterion is not met because there is no documented pre-registered protocol.
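
Two pieces of reasoning in the breakdown above can be checked mechanically: the rule that criterion A fails automatically when criterion E fails, and the time-on-task comparison under Balanced Resources (75-minute period, 15 minutes of testing, 49-minute AI median). The Python sketch below is illustrative only. The verdicts are copied from the breakdown; the function names are hypothetical, and the level rule in `erct_level` (a level counts only if all of its criteria and those of every lower level are met) is an assumption that this page does not spell out.

```python
# Verdicts as stated in the breakdown above (True = criterion met).
verdicts = {
    "C": True, "E": False, "T": False, "D": True,    # Level 1
    "S": False, "I": False, "Y": False, "B": True,   # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

# Criteria grouped by level, in the order used in the breakdown.
LEVELS = {1: "CETD", 2: "SIYB", 3: "RAGP"}


def apply_dependencies(v):
    """Return a copy in which A is forced to 'not met' whenever E is not met."""
    out = dict(v)
    if not out["E"]:
        out["A"] = False
    return out


def erct_level(v):
    """ASSUMPTION: level n is attained only if every criterion of levels 1..n is met."""
    level = 0
    for n in sorted(LEVELS):
        if all(v[c] for c in LEVELS[n]):
            level = n
        else:
            break
    return level


def time_on_task_gap(period_min, testing_min, ai_median_min):
    """Return (in-class learning minutes, minutes saved by the AI median, fraction saved)."""
    learning = period_min - testing_min
    gap = learning - ai_median_min
    return learning, gap, gap / learning


final = apply_dependencies(verdicts)
met = sorted(c for c, ok in final.items() if ok)
learning, gap, fraction = time_on_task_gap(75, 15, 49)

print("Criteria met:", met)
print("Level under the assumed rule:", erct_level(final))
print(f"In-class learning time: {learning} min; AI median is {gap} min shorter ({fraction:.1%})")
```

Under these assumptions the sketch reports three criteria met (B, C, D) and an 11-minute gap between the assumed 60 minutes of in-class learning and the 49-minute AI median. Note that the in-class figure is the paper's assumption while the AI figure is a median, so the gap is indicative rather than a like-for-like difference.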
