How AI Impacts Skill Formation

Judy Hanwen Shen and Alex Tamkin

Published:
ERCT Check Date:
DOI: 10.48550/arXiv.2601.20245
  • adult education
  • EdTech platform
  • digital assessment
  • C

    The unit of randomization is individual participants (between-subjects), not intact classes or schools, and no tutoring exception applies.

    "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process."

  • E

    Outcomes are measured using a researcher-designed quiz rather than a widely recognized standardized exam.

    "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts."

  • T

    Outcomes are measured within a single short session (minutes to about an hour), not at least one academic term after the intervention begins.

    "The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks using Trio in the same coding platform."

  • D

    The control condition and key baseline/balance characteristics are documented, including a balance table and clear control vs treatment descriptions.

    "Table 1: Balance table of main study participants (n=52)."

  • S

    Randomization is conducted among individual participants rather than at the school (or equivalent site/institution) level.

    "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process."

  • I

    The study does not document independent third-party conduct of the evaluation; the authors are Anthropic-affiliated and describe internal review.

    "The protocol was reviewed and approved by internal reviewers at Anthropic."

  • Y

    The study duration is about an hour rather than at least 75% of an academic year, and because T is not met, Y is necessarily not met.

    "Together, these tasks take a maximum time of 1 hour and 15 minutes with an average duration of 58.5 minutes."

  • B

    The only clear resource difference is access to the AI assistant, which is the explicit treatment variable; otherwise tasks and time limits are comparable across groups.

    "In a randomized controlled trial, participants were assigned to the treatment condition (using an AI assistant, web search, and instructions) or the control condition (completing tasks with web search and instructions alone)."

  • R

    No independent replication by other research teams was found, and the paper frames itself as an initial study that motivates future work.

    "Our work is a first step to understanding the impact of AI assistance on humans in the human-AI collaboration process."

  • A

    Because E is not met (custom quiz), A is not met; additionally, the study assesses only Trio/library-specific skills rather than all core subjects.

    "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts."

  • G

    The study does not track participants to graduation, and because Y is not met, G is necessarily not met.

    "We measured skill formation for a specific Python library over a one-hour period."

  • P

    The paper links to an OSF pre-registration and states it was done before running the experiment, but the registry entry’s date could not be verified here to confirm it predates data collection.

    "Pre-registration: https://osf.io/w49e7"

Abstract

AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.


ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • The unit of randomization is individual participants (between-subjects), not intact classes or schools, and no tutoring exception applies.
      • "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process."
      • Relevant Quotes:
        1) "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process." (p. 6)
        2) "In our main study, 52 participants completed the task, 26 for each of the control and treatment groups." (p. 7)
      • Detailed Analysis: Criterion C requires random assignment at the class level (or stronger, e.g., school level) to reduce contamination across treatment and control students within the same instructional setting. The paper explicitly states the study is a "between-subjects randomized experiment" and describes a participant-level split (52 participants, evenly divided between control and treatment). This indicates individual randomization, not class- or school-level randomization. The tutoring exception does not apply here because the intervention is access to an AI assistant during an individual coding task, not a one-to-one tutoring program where student-level randomization is explicitly acceptable under the ERCT exception.
      • Final Summary: Criterion C is not met because randomization is at the individual participant level rather than by intact classes (or schools).
    • E

      Exam-based Assessment

      • Outcomes are measured using a researcher-designed quiz rather than a widely recognized standardized exam.
      • "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts."
      • Relevant Quotes:
        1) "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts." (p. 6)
        2) "The final evaluation we used contained 14 questions for a total of 27 points." (p. 6)
      • Detailed Analysis: Criterion E requires outcome measurement via standardized exam-based assessments that are widely recognized and not created specifically for the study. The paper describes an in-house evaluation instrument: the authors "designed a quiz" and specify its structure (14 questions, 27 points). This is not presented as an external standardized exam (e.g., a national test, industry certification exam, or other broadly standardized assessment with established external norms).
      • Final Summary: Criterion E is not met because the study uses a custom researcher-designed quiz rather than a widely recognized standardized exam.
    • T

      Term Duration

      • Outcomes are measured within a single short session (minutes to about an hour), not at least one academic term after the intervention begins.
      • "The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks using Trio in the same coding platform."
      • Relevant Quotes:
        1) "The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks using Trio in the same coding platform." (p. 7)
        2) "After completing the Trio task, participants completed the evaluation stage where they take the quiz we described in the previous section..." (p. 7)
        3) "Together, these tasks take a maximum time of 1 hour and 15 minutes with an average duration of 58.5 minutes." (p. 8)
      • Detailed Analysis: Criterion T requires that outcomes be measured at least one academic term (roughly 3–4 months) after the intervention begins, to assess persistence beyond immediate short-run effects. The paper describes a time-bounded task (35 minutes) and an immediate post-task evaluation (the quiz), with the entire study taking about an hour on average (and at most 1 hour and 15 minutes). This is far shorter than a term, and no term-later follow-up measurement is described.
      • Final Summary: Criterion T is not met because outcomes are measured immediately after a short session rather than at least one academic term after the start.
    • D

      Documented Control Group

      • The control condition and key baseline/balance characteristics are documented, including a balance table and clear control vs treatment descriptions.
      • "Table 1: Balance table of main study participants (n=52)."
      • Relevant Quotes:
        1) "During the main Trio task, participants in the treatment group could use AI assistance to answer questions or generate code. All participants were not allowed to use AI in the comprehension check." (p. 7)
        2) "Table 1: Balance table of main study participants (n=52)." (p. 7)
        3) "In our main study, 52 participants completed the task, 26 for each of the control and treatment groups." (p. 7)
      • Detailed Analysis: Criterion D requires that the control group be well documented, including who is in the control group, what they receive, and evidence that groups are comparable at baseline (e.g., demographics and/or baseline skill measures). The paper clearly distinguishes the treatment condition (AI assistance available during the Trio task) from the control condition (no AI during the Trio task), and states that neither group has AI access during the comprehension check/quiz. It also provides a balance table (Table 1) for the main study sample and reports the group sizes. While this is not a school/classroom study, the control condition is described with sufficient clarity for comparisons within this experiment.
      • Final Summary: Criterion D is met because the paper clearly documents the control condition and provides baseline/balance information for the sample.
  • Level 2 Criteria

    • S

      School-level RCT

      • Randomization is conducted among individual participants rather than at the school (or equivalent site/institution) level.
      • "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process."
      • Relevant Quotes:
        1) "We use a between-subjects randomized experiment to test for the effects of using AI in the coding skill formation process." (p. 6)
        2) "In our main study, 52 participants completed the task, 26 for each of the control and treatment groups." (p. 7)
      • Detailed Analysis: Criterion S requires randomization at the school level (or an equivalent institutional/site unit implementing the intervention). The paper describes a between-subjects experiment with individual participants split into treatment and control groups. No schools, sites, classes, or other institutions are described as the unit of randomization.
      • Final Summary: Criterion S is not met because the unit of randomization is the individual participant, not schools (or equivalent sites).
    • I

      Independent Conduct

      • The study does not document independent third-party conduct of the evaluation; the authors are Anthropic-affiliated and describe internal review.
      • "The protocol was reviewed and approved by internal reviewers at Anthropic."
      • Relevant Quotes:
        1) "Work done as a part of the Anthropic Fellows Program, judy@anthropic.com" (p. 1)
        2) "Anthropic, atamkin@anthropic.com" (p. 1)
        3) "The protocol was reviewed and approved by internal reviewers at Anthropic." (p. 24)
      • Detailed Analysis: Criterion I requires that the trial be conducted independently from the intervention designers/providers, with clear evidence of third-party independent evaluation, implementation, and analysis. The authors are affiliated with Anthropic (as shown on the title page). The paper also describes protocol review by "internal reviewers at Anthropic," which indicates internal oversight rather than an external, independent evaluation team. The paper does not state that an independent third party conducted the experiment, data collection, or analysis.
      • Final Summary: Criterion I is not met because the paper does not clearly document an independent third-party evaluation separate from the authors' organization.
    • Y

      Year Duration

      • The study duration is about an hour rather than at least 75% of an academic year, and because T is not met, Y is necessarily not met.
      • "Together, these tasks take a maximum time of 1 hour and 15 minutes with an average duration of 58.5 minutes."
      • Relevant Quotes:
        1) "Together, these tasks take a maximum time of 1 hour and 15 minutes with an average duration of 58.5 minutes." (p. 8)
        2) "The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks..." (p. 7)
      • Detailed Analysis: Criterion Y requires outcome measurement at least 75% of an academic year after the intervention begins. The quoted study duration is roughly one hour (with a 35-minute task stage and immediate quiz). This is far shorter than a school year. Additionally, ERCT rules state that if criterion T is not met, criterion Y is not met. Since T is not met, Y is automatically not met.
      • Final Summary: Criterion Y is not met because the study is very short (about an hour) and does not include year-long tracking (and T is not met).
    • B

      Balanced Control Group

      • The only clear resource difference is access to the AI assistant, which is the explicit treatment variable; otherwise tasks and time limits are comparable across groups.
      • "In a randomized controlled trial, participants were assigned to the treatment condition (using an AI assistant, web search, and instructions) or the control condition (completing tasks with web search and instructions alone)."
      • Relevant Quotes:
        1) "During this stage, participants in the AI assistance condition (treatment group) had access to coding help through a chat-based AI assistant..." (p. 7)
        2) "In a randomized controlled trial, participants were assigned to the treatment condition (using an AI assistant, web search, and instructions) or the control condition (completing tasks with web search and instructions alone)." (p. 18)
        3) "The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks..." (p. 7)
      • Detailed Analysis: Criterion B evaluates whether time and resources are balanced across conditions, unless the extra resource is explicitly the treatment variable being tested. Here, the central treatment is access to an AI assistant (with both groups otherwise having web search and instructions). The paper explicitly frames the treatment vs control conditions in exactly those terms. Both groups operate under the same task constraints (including a 35-minute Trio task stage), so the main resource difference is the AI assistant access itself. Under the ERCT exception logic, when the additional resource is the treatment variable (AI assistance), the control group can remain business-as-usual without that resource.
      • Final Summary: Criterion B is met because the added resource (AI assistant access) is the explicit treatment variable and other key inputs (tasks/time limits) are comparable across groups.
  • Level 3 Criteria

    • R

      Reproduced

      • No independent replication by other research teams was found, and the paper frames itself as an initial study that motivates future work.
      • "Our work is a first step to understanding the impact of AI assistance on humans in the human-AI collaboration process."
      • Relevant Quotes:
        1) "Our work is a first step to understanding the impact of AI assistance on humans in the human-AI collaboration process." (p. 19)
      • Detailed Analysis: Criterion R requires independent replication by a different research team in a different context, published in a peer-reviewed venue (replication evidence may appear in later papers by other authors). The paper explicitly frames itself as "a first step," which is consistent with the absence of established replication at the time of publication. Internet search (as of 2026-02-22) did not identify a peer-reviewed, independent replication study that explicitly attempts to reproduce this specific Trio-library RCT design and compares outcomes to the original findings. No suitable replication paper with verbatim evidence could be located.
      • Final Summary: Criterion R is not met because no independent replication study could be identified and the paper presents itself as an initial contribution.
    • A

      All-subject Exams

      • Because E is not met (custom quiz), A is not met; additionally, the study assesses only Trio/library-specific skills rather than all core subjects.
      • "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts."
      • Relevant Quotes:
        1) "We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts." (p. 6)
        2) "The two tasks in our study cover 7 core concepts from the Trio library." (p. 6)
      • Detailed Analysis: Criterion A requires standardized exam-based assessment across all main subjects (and ERCT rules state that if criterion E is not met, criterion A is not met). The study evaluates mastery of a specific programming library via a custom quiz focused on debugging, code reading, and conceptual questions about Trio concepts. This is not an all-subject standardized assessment, and E is not met.
      • Final Summary: Criterion A is not met because the outcomes are not standardized exams and the assessment is limited to a single domain rather than all core subjects (and E is not met).
    • G

      Graduation Tracking

      • The study does not track participants to graduation, and because Y is not met, G is necessarily not met.
      • "We measured skill formation for a specific Python library over a one-hour period."
      • Relevant Quotes:
        1) "We measured skill formation for a specific Python library over a one-hour period." (p. 19)
        2) "Future work should study real-world skill development through longitudinal measurement of the impacts of AI adoption." (p. 19)
      • Detailed Analysis: Criterion G requires tracking participants through graduation in the relevant educational stage to assess long-run outcomes. This paper describes a short, single-session experiment and explicitly characterizes longer-term (longitudinal) measurement as future work, which implies graduation-level follow-up is not present. Additionally, ERCT rules state that if criterion Y is not met, criterion G is not met. Since Y is not met, G cannot be met. Follow-up-paper search: Internet searching did not identify subsequent publications by the same authors that track this cohort to any graduation milestone; no such follow-up paper with verbatim evidence was found.
      • Final Summary: Criterion G is not met because there is no graduation tracking (the study covers a one-hour period) and Y is not met.
    • P

      Pre-Registered

      • The paper links to an OSF pre-registration and states it was done before running the experiment, but the registry entry’s date could not be verified here to confirm it predates data collection.
      • "2Pre-registration: https://osf.io/w49e7"
      • Relevant Quotes:
        1) "We submitted the grading rubric for the quiz in our study pre-registration before running the experiment." (p. 6)
        2) "Pre-registration: https://osf.io/w49e7" (p. 9)
      • Detailed Analysis: Criterion P requires that the study protocol be pre-registered and that the registration date is before data collection begins. The paper provides a pre-registration link and explicitly claims that at least the quiz grading rubric was submitted in the pre-registration "before running the experiment," which supports the authors' intent to pre-register. However, verifying Criterion P requires checking the OSF registry record itself for a time-stamped registration date and comparing it to when data collection began. In this review, the OSF record could not be accessed in a way that allowed confirming the registration date and its timing relative to the start of data collection. Therefore, the required timing verification step cannot be completed.
      • Final Summary: Criterion P is not met because, although a pre-registration is cited, the registration date could not be verified here to confirm it predates data collection.
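The gating rules invoked repeatedly above (Y requires T, A requires E, G requires Y) can be sketched as a small check. This is an illustrative sketch only, not an official ERCT implementation: the criterion letters and dependencies are taken from this review, while the function name and dictionary representation are assumptions for the example.

```python
# Illustrative sketch of the ERCT gating rules cited in this review:
# Y requires T, A requires E, and G requires Y. The verdicts below
# mirror this page's assessment of the study.

def apply_gates(met: dict) -> dict:
    """Force a dependent criterion to False when its prerequisite fails."""
    gated = dict(met)
    # Order matters: Y is resolved before G, since G depends on Y.
    for dependent, prerequisite in [("Y", "T"), ("A", "E"), ("G", "Y")]:
        if not gated.get(prerequisite, False):
            gated[dependent] = False
    return gated

verdicts = {
    "C": False, "E": False, "T": False, "D": True,   # Level 1
    "S": False, "I": False, "Y": False, "B": True,   # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

print(apply_gates(verdicts))
```

Because the gates only ever flip a criterion from met to not met, running them over this study's verdicts leaves the overall picture unchanged (only D and B are met), but the sketch makes explicit why, for example, the short session duration (T) rules out Y and, in turn, G.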
