AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

LearnLM Team, Google & Eedi

DOI: 10.48550/arXiv.2512.23633
  • mathematics
  • K12
  • UK
  • blended learning
  • EdTech platform
  • digital assessment
  • formative assessment
ERCT Level: 0
  • C

    Randomization is at the student/session level, but the intervention is one-to-one tutoring, which satisfies the tutoring exception for criterion C.

    "During the trial period, we randomly assigned each student either to receive static pedagogical support (pre-written hints) or to enter an interactive one-to-one tutoring session (Figure 1; see also Appendix B)." (p. 2)

  • E

    Outcomes are measured using Eedi platform activity rather than widely recognized standardized exams.

    "We measured these outcomes using students’ standard, daily activities on the Eedi platform. This approach provided us with learning signals immediately, eliminating the need to develop and administer new trial-specific assessments, or to wait for the next round of standardized exams." (p. 6)

  • T

    Outcomes are measured within a seven-week study window, which is shorter than a full academic term.

    "Following the baseline period, we conducted the RCT over seven consecutive weeks, from May 13 to June 30, 2025 (the trial period)." (p. 14)

  • D

    The control condition (static hints) is clearly described and quantified, and the paper documents participant and school context.

    "First, we randomly assigned students to either the control condition (𝑁 = 91 students) or the tutoring condition (𝑁 = 74 students)." (p. 8)

  • S

    The study includes five schools, but randomization is at the student level rather than the school level.

    "During the trial period, we randomly assigned each student either to receive static pedagogical support (pre-written hints) or to enter an interactive one-to-one tutoring session (Figure 1; see also Appendix B)." (p. 2)

  • I

    The evaluation is conducted by the intervention-affiliated teams, and the paper does not document an independent third-party evaluation team.

    "This work represents a close collaboration between Google and Eedi." (p. 12)

  • Y

    The study lasts seven weeks, far less than 75% of an academic year, and Y also fails because T is not met.

    "Following the baseline period, we conducted the RCT over seven consecutive weeks, from May 13 to June 30, 2025 (the trial period)." (p. 14)

  • B

    The intervention conditions add substantial tutoring resources compared to static hints, but this resource increase is integral to the intervention contrasts being tested.

    "To support the tutoring condition, we scheduled a team of tutors to remain on-call in the Eedi platform during class hours on each day of the trial." (p. 8)

  • R

    No independent peer-reviewed replication of this specific RCT was found, and the paper itself only encourages future replication.

    "Teams seeking to replicate these findings or build similar experiences should now use Gemini 2.5 Pro." (p. 0)

  • A

    The study focuses on mathematics only and does not assess all core subjects using standardized exams; additionally, A cannot be met when E is not met.

    "Our study took place on the Eedi educational platform, an evidence-based learning ecosystem that provides students with both curriculum-aligned mathematics activities and one-to-one support from remote human tutors via online chat conversations." (p. 0)

  • G

    The study does not track students to graduation, and G also fails automatically because Y is not met.

    "Measuring substantive, longer-term effects on learning will require a different approach." (p. 6)

  • P

    The paper reports ethics review but provides no public pre-registration record, registry ID, or registration date.

Abstract

One-to-one tutoring is widely considered the gold standard for personalized education, yet it remains prohibitively expensive to scale. To evaluate whether generative AI might help expand access to this resource, we conducted an exploratory randomized controlled trial (RCT) with N = 165 students across five UK secondary schools. We integrated LearnLM—a generative AI model fine-tuned for pedagogy—into chat-based tutoring sessions on the Eedi mathematics platform. In the RCT, expert tutors directly supervised LearnLM, with the remit to revise each message it drafted until they would be satisfied sending it themselves. LearnLM proved to be a reliable source of pedagogical instruction, with supervising tutors approving 76.4% of its drafted messages with zero or minimal edits. This translated into effective tutoring support: students guided by LearnLM performed at least as well as students chatting with human tutors on each learning outcome we measured. In fact, students who received support from LearnLM were 5.5 percentage points more likely to solve novel problems on subsequent topics than those who received tutoring from human tutors alone.
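The two-stage design described in the abstract (students randomized to static hints versus tutoring, and each tutoring session then randomized to a human tutor versus supervised LearnLM) can be sketched in a few lines of Python. This is a hypothetical illustration of the assignment structure only, not the trial's actual randomization code; the function and condition names (`assign_conditions`, `static_hint`, `supervised_learnlm`) are invented for the example.

```python
import random

def assign_conditions(student_ids, seed=0):
    """Sketch of the trial's two-stage random assignment.

    Stage 1: each student is assigned to 'control' (static hints)
    or 'tutoring'.
    Stage 2: each tutoring *session* is connected either to a human
    tutor or to LearnLM supervised by a human tutor; students in the
    control arm always receive a static hint.
    """
    rng = random.Random(seed)
    # Stage 1: student-level randomization into the two arms.
    stage1 = {s: rng.choice(["control", "tutoring"]) for s in student_ids}

    def assign_session(student_id):
        # Stage 2: per-session randomization applies only within
        # the tutoring arm.
        if stage1[student_id] == "control":
            return "static_hint"
        return rng.choice(["human_tutor", "supervised_learnlm"])

    return stage1, assign_session

# N = 165 students, as reported in the abstract.
stage1, assign_session = assign_conditions(range(165))
```

Note that in the actual trial the session-level draw happened each time a student entered a tutoring session, so one student could meet both a human tutor and supervised LearnLM across different sessions; the sketch above mirrors that by re-drawing on every `assign_session` call.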


ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • Randomization is at the student/session level, but the intervention is one-to-one tutoring, which satisfies the tutoring exception for criterion C.
      • Relevant Quotes:
        1) "During the trial period, we randomly assigned each student either to receive static pedagogical support (pre-written hints) or to enter an interactive one-to-one tutoring session (Figure 1; see also Appendix B)." (p. 2)
        2) "Students in the tutoring condition experienced a further level of randomization: when a student entered a tutoring session, we randomly connected them either with an expert human tutor or with LearnLM (supervised by a human tutor)." (p. 2)
        3) "The platform provides students with curriculum-aligned study units and a spectrum of personalized support, including two forms of assistance central to this RCT: carefully designed hints for common misconceptions in each study unit, and one-to-one guidance from trained, expert tutors via online chat interactions." (p. 2)
      • Detailed Analysis: Criterion C requires a class-level (or stronger) unit of randomization to reduce contamination, unless the intervention is inherently personal teaching (e.g., one-to-one tutoring), in which case student-level randomization is acceptable. This study randomizes support at the student level, and (within the tutoring arm) randomizes at the tutoring-session level. The intervention is explicitly "interactive one-to-one tutoring" via online chat, which is personal teaching by design. Because the treatment is individualized, the ERCT tutoring exception applies and student/session randomization is appropriate for evaluating the intervention. Criterion C is met because the study is an RCT of one-to-one tutoring, which qualifies for the tutoring exception even though randomization is not at the class level.
    • E

      Exam-based Assessment

      • Outcomes are measured using Eedi platform activity rather than widely recognized standardized exams.
      • Relevant Quotes:
        1) "We measured these outcomes using students’ standard, daily activities on the Eedi platform. This approach provided us with learning signals immediately, eliminating the need to develop and administer new trial-specific assessments, or to wait for the next round of standardized exams." (p. 6)
        2) "In this RCT, we focused on student performance on its short study units, each designed to assess a specific mathematics topic and consisting of diagnostic multiple-choice questions with four response options (Figure 2)." (p. 7)
      • Detailed Analysis: Criterion E requires outcome measurement using standardized, widely recognized exam-based assessments (not custom measures or platform-internal performance signals). The paper explicitly states that outcomes were measured using "standard, daily activities" on the Eedi platform and emphasizes that this eliminates the need to wait for standardized exams. The described measures are based on short platform study units and diagnostic multiple-choice questions, which are not presented as national/state standardized examinations or an established, externally validated standardized test instrument. Criterion E is not met because the primary outcomes are derived from platform-embedded activities rather than standardized exams.
    • T

      Term Duration

      • Outcomes are measured within a seven-week study window, which is shorter than a full academic term.
      • Relevant Quotes:
        1) "Following the baseline period, we conducted the RCT over seven consecutive weeks, from May 13 to June 30, 2025 (the trial period)." (p. 14)
        2) "Procedure We conducted the exploratory RCT over seven consecutive weeks (May through June 2025)." (p. 8)
        3) "Immediately following this intervention, the platform prompts the student to retry the question that they originally missed." (p. 7)
      • Detailed Analysis: Criterion T requires that outcomes be measured at least one full academic term (roughly 3–4 months) after the intervention begins, i.e., the start-to-measurement window must be at least a term even if the intervention itself is short. The paper states the RCT ran for seven consecutive weeks from May 13 to June 30, 2025, which is substantially shorter than a term. The key outcomes are also operationalized as immediate and near-immediate platform behaviors (including retrying a question immediately after an intervention), which are far shorter than a term-long follow-up. Criterion T is not met because the study’s intervention-to-outcome window is seven weeks, not at least one academic term.
    • D

      Documented Control Group

      • The control condition (static hints) is clearly described and quantified, and the paper documents participant and school context.
      • Relevant Quotes:
        1) "First, we randomly assigned students to either the control condition (𝑁 = 91 students) or the tutoring condition (𝑁 = 74 students)." (p. 8)
        2) "Whenever a student in the control condition answered a question incorrectly, they received a pre-written message designed to prompt reflection on a specific misconception, based on which incorrect option they selected (a “static hint”). The platform then prompted them to retry the question." (p. 8)
        3) "The trial included 𝑁 = 165 students in Year 9 and 10 (ages 13–15) from five UK secondary schools." (p. 13)
        4) "The schools varied broadly in academic performance and socio-economic background." (p. 13)
      • Detailed Analysis: Criterion D requires that the control group be well documented, including a clear description of what the control group received and enough context to interpret comparisons. The paper explicitly defines the control condition as receiving a pre-written "static hint" after an incorrect answer, followed by a prompt to retry the question. It also reports the number of students in the control arm (N = 91). Additionally, the paper documents the participant cohort and school context, supporting interpretation of the control condition’s setting. Criterion D is met because the control condition is clearly described, quantified, and contextualized in the paper.
  • Level 2 Criteria

    • S

      School-level RCT

      • The study includes five schools, but randomization is at the student level rather than the school level.
      • Relevant Quotes:
        1) "We recruited 𝑁 = 165 students in Year 9 and 10 (ages 13–15) across five of these schools for the RCT (see Appendix A)." (p. 2)
        2) "During the trial period, we randomly assigned each student either to receive static pedagogical support (pre-written hints) or to enter an interactive one-to-one tutoring session (Figure 1; see also Appendix B)." (p. 2)
      • Detailed Analysis: Criterion S requires school-level randomization, meaning schools (as implementation sites) are randomized to treatment versus control. While the study recruits students across five UK secondary schools, the paper describes random assignment at the student level, not at the school level. There is no statement that entire schools were assigned to different conditions. Criterion S is not met because randomization is not conducted at the school level.
    • I

      Independent Conduct

      • The evaluation is conducted by the intervention-affiliated teams, and the paper does not document an independent third-party evaluation team.
      • Relevant Quotes:
        1) "This work represents a close collaboration between Google and Eedi." (p. 12)
        2) "We connected the Eedi platform to LearnLM via a custom API created specifically for this trial." (p. 7)
      • Detailed Analysis: Criterion I requires that study conduct and evaluation be independent from the intervention designers/providers to reduce bias in implementation, measurement, and analysis. The paper explicitly states the work is a close collaboration between Google and Eedi, and it describes bespoke technical integration created for the trial. It does not include a clear statement that an external, independent evaluation team led the trial’s conduct and analysis, nor does it describe independent governance beyond ethics review. Criterion I is not met because independence from the intervention teams is not documented.
    • Y

      Year Duration

      • The study lasts seven weeks, far less than 75% of an academic year, and Y also fails because T is not met.
      • Relevant Quotes:
        1) "Following the baseline period, we conducted the RCT over seven consecutive weeks, from May 13 to June 30, 2025 (the trial period)." (p. 14)
        2) "Procedure We conducted the exploratory RCT over seven consecutive weeks (May through June 2025)." (p. 8)
      • Detailed Analysis: Criterion Y requires outcome tracking for at least 75% of an academic year after the intervention begins. The trial duration is explicitly seven consecutive weeks, which is far below an academic year and below the 75% threshold. In addition, ERCT rules specify that if criterion T is not met, then criterion Y is not met. Since the study does not meet term duration, Y necessarily fails. Criterion Y is not met because the study duration is seven weeks and does not reach year-long tracking (and T is not met).
    • B

      Balanced Control Group

      • The intervention conditions add substantial tutoring resources compared to static hints, but this resource increase is integral to the intervention contrasts being tested.
      • Relevant Quotes:
        1) "The trial leveraged these two forms of Eedi support—hints and chat-based tutoring (“hybrid tutoring” [20])—as baselines to assess the pedagogical efficacy of LearnLM (see Figure B.1 in Appendix B)." (p. 2)
        2) "To support the tutoring condition, we scheduled a team of tutors to remain on-call in the Eedi platform during class hours on each day of the trial." (p. 8)
        3) "Whenever a student in the control condition answered a question incorrectly, they received a pre-written message designed to prompt reflection on a specific misconception, based on which incorrect option they selected (a “static hint”). The platform then prompted them to retry the question." (p. 8)
      • Detailed Analysis: Criterion B evaluates whether time/budget/resources are balanced between intervention and control, unless differences in resources are explicitly integral to what is being tested (i.e., the added resources are the treatment variable or core intervention package). Here, the tutoring condition clearly introduces additional human resources (a team of tutors on-call for interactive sessions), whereas the control condition receives only static pre-written hints. This is an intentional resource contrast: the study explicitly leverages "hints and chat-based tutoring" as baselines and compares interactive tutoring against static hints, and then compares human tutoring against supervised LearnLM tutoring. Because the additional tutoring time and expertise are integral to the defined interventions being tested (not an incidental, confounding add-on), the control group can remain business-as-usual for those resources. Criterion B is met because the resource differences are integral to the study’s intended treatment contrasts (static hints vs tutoring, and human tutoring vs supervised LearnLM tutoring).
  • Level 3 Criteria

    • R

      Reproduced

      • No independent peer-reviewed replication of this specific RCT was found, and the paper itself only encourages future replication.
      • Relevant Quotes:
        1) "Teams seeking to replicate these findings or build similar experiences should now use Gemini 2.5 Pro." (p. 0)
      • Detailed Analysis: Criterion R requires that the study be independently reproduced by a different research team and published in a peer-reviewed outlet. The paper includes an explicit invitation for others to replicate the findings, but it does not report a completed independent replication. An internet search conducted on 2026-04-12 did not identify a peer-reviewed independent replication study that explicitly reproduces this specific Eedi/LearnLM classroom RCT. Criterion R is not met because independent replication evidence for this specific RCT was not found.
    • A

      All-subject Exams

      • The study focuses on mathematics only and does not assess all core subjects using standardized exams; additionally, A cannot be met when E is not met.
      • Relevant Quotes:
        1) "Our study took place on the Eedi educational platform, an evidence-based learning ecosystem that provides students with both curriculum-aligned mathematics activities and one-to-one support from remote human tutors via online chat conversations." (p. 0)
        2) "We measured these outcomes using students’ standard, daily activities on the Eedi platform. This approach provided us with learning signals immediately, eliminating the need to develop and administer new trial-specific assessments, or to wait for the next round of standardized exams." (p. 6)
      • Detailed Analysis: Criterion A requires standardized exam-based assessments covering all main subjects (or a justified exception for specialized contexts). This study is explicitly situated in a mathematics platform context and reports outcomes tied to mathematics learning on Eedi, not across subjects. Additionally, ERCT rules specify that if criterion E is not met, criterion A is not met. Since the study explicitly avoids relying on standardized exams, A necessarily fails. Criterion A is not met because the study is mathematics-only and does not use standardized exams across core subjects (and E is not met).
    • G

      Graduation Tracking

      • The study does not track students to graduation, and G also fails automatically because Y is not met.
      • Relevant Quotes:
        1) "Measuring substantive, longer-term effects on learning will require a different approach." (p. 6)
        2) "Future research can overcome these limitations by assigning students to receive one consistent type of support for an entire study, ideally following their progress over several months and tracking their performance on external, standardized assessments." (p. 6)
      • Detailed Analysis: Criterion G requires tracking the participant cohort through to graduation (for the relevant educational stage). The paper describes a short RCT and explicitly frames longer-term follow-up as future work (tracking over "several months" and via external standardized assessments), with no mention of tracking to graduation outcomes. An internet search conducted on 2026-04-12 did not identify follow-up publications by the same author group that track this cohort through graduation. Additionally, ERCT rules specify that if criterion Y is not met, criterion G is not met. Since Y is not met, G necessarily fails. Criterion G is not met because the study does not include (and no follow-up evidence was found for) graduation tracking, and Y is not met.
    • P

      Pre-Registered

      • The paper reports ethics review but provides no public pre-registration record, registry ID, or registration date.
      • Relevant Quotes:
        1) "Our protocol underwent independent ethical review, with a favourable opinion from the Human Behavioural Research Ethics Committee at Google DeepMind (#25 003)." (p. 7)
      • Detailed Analysis: Criterion P requires a publicly accessible pre-registration record (registry/platform, identifier, and timing before data collection). The paper documents independent ethics review, but it does not provide a pre-registration link, registry name, registry ID, or registration date. An internet search conducted on 2026-04-12 did not identify a public pre-registration entry (e.g., OSF, AEA RCT Registry, ISRCTN) that is explicitly linked to this RCT and dated prior to the start of the trial. Criterion P is not met because no public pre-registration evidence (registry/ID/date) is documented or was found.
