Abstract
High-dosage tutoring has the potential to substantially raise adolescent academic achievement. However, at scale, schools may not have the financial ability to deliver small-group tutoring frequently. In this paper, I test the relative importance of group size (quality) versus tutoring frequency (quantity). I evaluate the impact of an in-school math tutoring program in a middle school in the Midwestern United States. Students are randomized to either 1) control, 2) receive tutoring twice a week in 2-student groups, or 3) receive tutoring three times a week in 3-student groups. Importantly, the total cost per student is the same in both treatment conditions. I find that the 2-student group tutoring led to a significant improvement in math skills (0.23 SD), whereas the equal-cost, more frequent tutoring in the 3-student groups did not lead to a significant improvement in math skills.
ERCT Criteria Breakdown
-
Level 1 Criteria
-
C
Class-level RCT
- The study randomizes within classrooms, but it evaluates tutoring (small-group personal instruction), which meets the tutoring exception under Criterion C.
- "Stratified by classroom, students within each classroom were randomly assigned to one of three treatment conditions: 1) Control, 2) Two-student group tutoring twice per week, and 3) Three-student group tutoring thrice per week." (p. 4)
Relevant Quotes:
1) "A total of 12 classrooms – four in each of 6th, 7th, and 8th grade – participated in the study." (p. 4)
2) "Stratified by classroom, students within each classroom were randomly assigned to one of three treatment conditions: 1) Control, 2) Two-student group tutoring twice per week, and 3) Three-student group tutoring thrice per week." (p. 4)
3) "Students were pulled from an elective class during the school day to receive the math tutoring." (p. 5)
Detailed Analysis:
Criterion C requires randomization at the class level (or stronger) to limit within-class contamination, but the ERCT standard explicitly allows an exception when the intervention is "personal teaching like tutoring."
The paper clearly reports that students were randomized within each classroom (rather than randomizing entire classes). For a typical classroom-wide instructional intervention, this would raise contamination concerns and would not satisfy Criterion C.
However, this study evaluates tutoring delivered in very small groups (2-student or 3-student groups), i.e., a form of personal/small-group instruction. Under the ERCT tutoring exception, student-level randomization is acceptable.
Final: Criterion C is met because the intervention is tutoring and therefore qualifies for the tutoring exception despite within-class randomization.
-
E
Exam-based Assessment
- The study measures outcomes using the NWEA MAP Math assessment, which the paper describes as a widely used standardized test.
- "The outcome variable of this study is the MAP Math assessment, which is created by the NWEA." (p. 6)
Relevant Quotes:
1) "The outcome variable of this study is the MAP Math assessment, which is created by the NWEA." (p. 6)
2) "NWEA assessments are used by over 50,000 schools and districts in 149 countries." (p. 6)
3) "The school tests the students 3 times per academic year (September, January, and May)." (p. 6)
Detailed Analysis:
Criterion E requires a standardized exam-based assessment rather than a researcher-created test aligned to the intervention.
The paper identifies the MAP Math assessment (by NWEA) as the primary outcome and describes it as broadly used across many schools and countries. This supports that the assessment is standardized and externally developed rather than bespoke to this study.
Final: Criterion E is met because the primary outcome is a standardized exam-based assessment (NWEA MAP Math).
-
T
Term Duration
- Outcomes are measured from January 2024 (baseline) to May 2024 (endline), which spans roughly one academic term.
- "The experimental evaluation of this tutoring took place between January 2024 and May 2024." (p. 4)
Relevant Quotes:
1) "The experimental evaluation of this tutoring took place between January 2024 and May 2024." (p. 4)
2) "The treatment took took place from January to May of 2024." (p. 5)
3) "For this study, I use the January 2024 test results as the baseline score, and the May 2024 test results as the endline variable." (p. 6)
Detailed Analysis:
Criterion T requires that outcomes be measured at least one academic term after the intervention begins (commonly about 3–4 months), either because the intervention lasts that long or because follow-up extends that long.
The paper states the evaluation ran between January 2024 and May 2024 and that baseline MAP scores are from January 2024 while endline MAP scores are from May 2024. This span is approximately a semester-length period and satisfies the minimum term-duration requirement.
Final: Criterion T is met because baseline-to-endline timing spans January 2024 to May 2024, which is about one term.
-
D
Documented Control Group
- The control group is clearly defined and its baseline characteristics and sample size are documented in Table 1.
- "Table 1 shows the descriptive statistics, showcasing that there is no significant difference overall between the control and treatment groups across any of the relevant variables." (p. 6)
Relevant Quotes:
1) "Students are randomized to either 1) control, 2) receive tutoring twice a week in 2-student groups, or 3) receive tutoring three times a week in 3-student groups." (p. 1)
2) "Table 1: Descriptive Statistics and Balance Tests" (p. 7)
3) "Control (N=194) (N=62) (N=87)" (p. 7)
4) "The available administrative data for each student included the student’s gender, race, age, and baseline performance (measured in January 2024) on the MAP math assessment." (p. 6)
5) "Table 1 shows the descriptive statistics, showcasing that there is no significant difference overall between the control and treatment groups across any of the relevant variables." (p. 6)
Detailed Analysis:
Criterion D requires that the control group be well documented, including who is in it, its size, and baseline characteristics sufficient to judge comparability.
The paper clearly defines a control condition in the randomization. It then reports a balance/descriptive table (Table 1) that includes a control group sample size (N=194) and baseline demographics and baseline MAP Math performance, enabling assessment of baseline comparability.
Final: Criterion D is met because the control condition is explicitly defined and documented with baseline characteristics and sample size in Table 1.
-
Level 2 Criteria
-
S
School-level RCT
- The study is conducted in one school and does not randomize treatment assignment at the school level.
- "This paper reports on an RCT done with one KIPP school in Indiana." (p. 4)
Relevant Quotes:
1) "This paper reports on an RCT done with one KIPP school in Indiana." (p. 4)
2) "Stratified by classroom, students within each classroom were randomly assigned to one of three treatment conditions: 1) Control, 2) Two-student group tutoring twice per week, and 3) Three-student group tutoring thrice per week." (p. 4)
Detailed Analysis:
Criterion S requires school-level randomization (i.e., entire schools/sites randomized to conditions).
The paper describes an RCT conducted within a single KIPP school in Indiana, and randomization occurs within classrooms (students within each classroom), not between multiple schools.
Final: Criterion S is not met because the study does not randomize at the school level.
-
I
Independent Conduct
- The paper does not explicitly state that the evaluation was conducted by an independent third-party team separate from the author and implementation partners.
Relevant Quotes:
1) "A total of 7 tutors were hired through a local tutoring company." (p. 5)
2) "The tutors were college students, and the school provided training before the tutoring began, in addition to the training the tutors received from the tutoring company." (p. 5)
3) "This study was funded by Accelerate, with match-making support from J-PAL." (p. 1)
Detailed Analysis:
Criterion I requires clear, quoted evidence that the study’s conduct (especially data collection and analysis) was independent from the intervention designers/providers.
The paper provides information about implementation partners (the school, a tutoring company providing tutors, and funders), but it does not contain a clear statement that outcome data collection and analysis were performed by an independent evaluator, nor does it explicitly describe a separation between intervention provision and evaluation activities.
Final: Criterion I is not met because independent third-party conduct of the evaluation is not explicitly documented.
-
Y
Year Duration
- The evaluation runs from January 2024 to May 2024, which is substantially less than 75% of an academic year.
- "The experimental evaluation of this tutoring took place between January 2024 and May 2024." (p. 4)
Relevant Quotes:
1) "The experimental evaluation of this tutoring took place between January 2024 and May 2024." (p. 4)
2) "The treatment took took place from January to May of 2024." (p. 5)
3) "For this study, I use the January 2024 test results as the baseline score, and the May 2024 test results as the endline variable." (p. 6)
Detailed Analysis:
Criterion Y requires outcomes to be measured at least 75% of one full academic year after the intervention begins.
The quoted timeline is January 2024 to May 2024 (with January baseline and May endline), which is approximately one semester and does not meet the ERCT year-duration threshold.
Final: Criterion Y is not met because the study’s measurement window is roughly one semester rather than most of a school year.
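The duration reasoning above can be sketched numerically. This is a minimal illustration, not part of the ERCT standard or the paper: the 9-month academic year (September through May, chosen to match the school's September, January, and May testing schedule) and the 3-month term minimum (the lower end of the "about 3–4 months" range cited under Criterion T) are assumptions, and the paper gives months only, not exact test dates.

```python
# Assumption: the academic year runs September through May (9 months).
ACADEMIC_YEAR_MONTHS = 9
# Threshold stated under Criterion Y: at least 75% of one academic year.
YEAR_THRESHOLD = 0.75
# Assumption: lower end of the "about 3-4 months" term length under Criterion T.
TERM_MIN_MONTHS = 3

# Baseline and endline months reported in the paper (January and May 2024).
BASELINE_MONTH = 1
ENDLINE_MONTH = 5

# Whole months elapsed between the baseline test and the endline test.
window_months = ENDLINE_MONTH - BASELINE_MONTH
year_fraction = window_months / ACADEMIC_YEAR_MONTHS

meets_term = window_months >= TERM_MIN_MONTHS
meets_year = year_fraction >= YEAR_THRESHOLD

print(f"window: {window_months} months ({year_fraction:.0%} of the assumed year)")
print(f"Criterion T (term): {'met' if meets_term else 'not met'}")
print(f"Criterion Y (year): {'met' if meets_year else 'not met'}")
```

Under these assumptions the January-to-May window covers under half of the academic year, which clears the term minimum but falls well short of the 75% threshold.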
-
B
Balanced Control Group
- The intervention provides additional tutoring time and tutor labor relative to business-as-usual, but these added resources are the treatment being tested, so an unbalanced business-as-usual control is acceptable under Criterion B’s intent check.
- "Students were pulled from an elective class during the school day to receive the math tutoring." (p. 5)
Relevant Quotes:
1) "Students were pulled from an elective class during the school day to receive the math tutoring." (p. 5)
2) "Tutors received $40 for each session, regardless of the number of students in that session." (p. 5)
3) "This makes the total cost per student $40 per week, regardless of whether they received tutoring twice per week in a 2-student group, or thrice per week in a 3-student group." (p. 5)
Detailed Analysis:
Criterion B compares the nature, quantity, and quality of resources (time, budget, staffing, materials) across treatment and control conditions, and asks whether the control offers a comparable substitute for the intervention’s inputs unless the extra resources are explicitly what is being tested.
This study’s treatment conditions add tutoring delivered during the school day (students are pulled from electives) and add paid tutor time (a costed resource). The control condition is business-as-usual and does not receive a comparable substitute tutoring resource.
However, the paper explicitly frames the intervention as tutoring and tests the trade-off between tutoring quality (smaller groups) and quantity (more frequent sessions) while equalizing cost per student across the two treatment arms. In other words, the additional tutoring time/labor is integral to the treatment definition rather than a separable confound that should be balanced by design.
Final: Criterion B is met because the extra time/labor is the core treatment being evaluated against business-as-usual (not a non-integral add-on requiring a matched control).
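The equal-cost claim quoted above follows from simple arithmetic, sketched below. The function name is hypothetical; the inputs ($40 per session, 2 sessions per week in 2-student groups, 3 sessions per week in 3-student groups) come from the quoted passages.

```python
# Tutor pay per session in dollars, regardless of group size (quoted from the paper).
PAY_PER_SESSION = 40


def weekly_cost_per_student(sessions_per_week: int, group_size: int) -> float:
    """Weekly tutor cost attributable to one student (hypothetical helper)."""
    # Each session costs the same, and that cost is shared by the group.
    return PAY_PER_SESSION * sessions_per_week / group_size


cost_two_student = weekly_cost_per_student(sessions_per_week=2, group_size=2)
cost_three_student = weekly_cost_per_student(sessions_per_week=3, group_size=3)

print(f"2-student groups, twice a week:  ${cost_two_student:.2f} per student")
print(f"3-student groups, thrice a week: ${cost_three_student:.2f} per student")
```

Both arms work out to $40 per student per week, so the two treatment conditions differ in group size and frequency but not in cost, while the control arm receives none of this spending.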
-
Level 3 Criteria
-
R
Reproduced
- No independent, peer-reviewed replication of this specific experiment was found during the ERCT check.
Relevant Quotes:
(No relevant quotes in the paper report an independent replication of this specific experiment.)
Detailed Analysis:
Criterion R requires evidence that the study (or its central experimental claim in a comparable design) has been replicated independently by other authors in a peer-reviewed outlet.
The paper does not describe a replication, and internet searching did not identify an independent peer-reviewed replication of this specific single-school tutoring RCT as of the ERCT check date.
Final: Criterion R is not met because no independent replication evidence was identified.
-
A
All-subject Exams
- The study uses a standardized assessment (MAP) but reports outcomes only for math rather than for all main subjects.
- "The outcome variable of this study is the MAP Math assessment, which is created by the NWEA." (p. 6)
Relevant Quotes:
1) "The outcome variable of this study is the MAP Math assessment, which is created by the NWEA." (p. 6)
2) "Overall, I find that the 2-student group tutoring led to a significant improvement on math skills (0.23 SD), whereas the equal-cost, more frequent tutoring in the 3-student groups did not lead to a significant improvement in math skills." (p. 2)
Detailed Analysis:
Criterion A requires standardized exam-based assessment across all main subjects (and requires Criterion E to be met, which it is here).
The paper reports the MAP Math assessment as the outcome and frames the findings in terms of math skills. It does not report standardized outcomes for reading, science, or other core subjects, nor does it provide a rationale for a specialized exception.
Final: Criterion A is not met because outcomes are measured only in mathematics rather than across all main subjects.
-
G
Graduation Tracking
- The study does not track students through graduation, and per the ERCT dependency rule, Criterion G cannot be met because Criterion Y is not met.
Relevant Quotes:
1) "The experimental evaluation of this tutoring took place between January 2024 and May 2024." (p. 4)
Detailed Analysis:
Criterion G requires follow-up tracking through graduation, and per the ERCT rule, if Criterion Y (year duration) is not met then Criterion G is not met.
The study’s measurement window is January 2024 to May 2024, which is far shorter than graduation tracking. Internet searching did not identify follow-up publications by the same author reporting graduation outcomes for this cohort as of the ERCT check date.
Final: Criterion G is not met because there is no graduation tracking and Criterion Y is not met.
-
P
Pre-Registered
- The paper states the study was pre-registered (AEARCTR-0012858), but the registry page could not be accessed to verify the exact registration date relative to study start.
- "This study was pre-registered on the AEA RCT Registry (AEARCTR-0012858) and overseen by the Social and Behavioral Sciences IRB from the University of Chicago (IRB24-0128)." (p. 1)
Relevant Quotes:
1) "This study was pre-registered on the AEA RCT Registry (AEARCTR-0012858) and overseen by the Social and Behavioral Sciences IRB from the University of Chicago (IRB24-0128)." (p. 1)
2) "The full set of randomization and analysis protocols is described in the Pre-Analysis Plan²." (p. 5)
3) "² https://www.socialscienceregistry.org/trials/12858" (p. 5)
Detailed Analysis:
Criterion P requires that the protocol be publicly registered before the study begins and that the registration can be verified (including timing relative to study start).
The paper explicitly states that the study "was pre-registered" and provides an AEA RCT Registry identifier (AEARCTR-0012858) as well as a trial URL. The paper also indicates the experimental evaluation occurred between January 2024 and May 2024, which provides a reference point for when the study was underway.
During this ERCT check, the AEA RCT Registry trial webpage could not be accessed (timeouts), so the exact registry posting date could not be independently confirmed from the registry itself.
Final: Criterion P is marked as met because the paper provides a specific registry ID and explicitly states pre-registration, although the registration date could not be independently verified during this check due to access issues.
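The twelve verdicts in this breakdown can be tallied by level, as in the sketch below. The data structure and variable names are illustrative assumptions rather than an official ERCT tool; the only rules encoded are the two dependencies stated in the text (Criterion A requires Criterion E, and Criterion G requires Criterion Y). No overall ERCT level is computed, since the breakdown does not state how one is derived.

```python
# Verdicts as stated in the "Final:" lines of this breakdown.
results = {
    1: {"C": True, "E": True, "T": True, "D": True},
    2: {"S": False, "I": False, "Y": False, "B": True},
    3: {"R": False, "A": False, "G": False, "P": True},
}

# Dependencies named in the text: A requires E, G requires Y.
depends_on = {"A": "E", "G": "Y"}

# Flatten to a single criterion -> verdict mapping for the dependency check.
flat = {crit: met for level in results.values() for crit, met in level.items()}

# A verdict is inconsistent if a criterion is met while its prerequisite is not.
violations = [
    (crit, prereq)
    for crit, prereq in depends_on.items()
    if flat[crit] and not flat[prereq]
]

# Count met criteria per level.
met_by_level = {level: sum(verdicts.values()) for level, verdicts in results.items()}

for level, verdicts in results.items():
    met = [c for c, ok in verdicts.items() if ok]
    print(f"Level {level}: {met_by_level[level]}/{len(verdicts)} met {met}")
print(f"dependency violations: {violations}")
```

The tally shows all four Level 1 criteria met, and one of four met at each of Levels 2 and 3, with no conflict against the stated dependency rules.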