Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

Nataliya Kosmyna, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, Pattie Maes

Published:
ERCT Check Date:
DOI: 10.48550/arXiv.2506.08872
  • language arts
  • higher education
  • US
  • EdTech website
0
  • C

    Randomization was at the individual participant level rather than at the class (or school) level.

    "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)

  • E

    Outcomes were scored via teacher ratings and an internal AI judge, not via a standardized externally administered exam.

    "We performed scoring with the help from the human teachers and an AI judge (a specially built AI agent)." (p. 2)

  • T

    Outcomes were measured within short sessions rather than at least one academic term after the intervention began.

    "Total duration of the study (Stages 1-5) was approximately 1h (60 minutes)." (p. 28)

  • D

    The control condition and sample are clearly documented, including what the control group could and could not do.

    "Brain-only Group (Group 3): Participants in this group were forbidden from using both LLM and any online websites for consultation." (p. 23)

  • S

    The study did not randomize at the school (site) level.

    "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)

  • I

    The study was proposed, designed, executed, and analyzed by the author team, not by an independent evaluator.

    "The study was proposed, designed, and executed by NK." (p. 143)

  • Y

    The study does not measure outcomes over a full academic year, and criterion T is not met.

    "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)

  • B

    Time-on-task is held constant across groups and the tested difference is tool access, not extra instructional time or budget.

    "The participants were instructed to pick a topic among the proposed prompts, and then to produce an essay based on the topic's assignment within a 20 minutes time limit." (p. 27)

  • R

    No independent replication of this specific study was identified at the time of this ERCT check.

  • A

    The study does not use standardized exams across all main subjects, and criterion E is not met.

    "This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay." (p. 2)

  • G

    The study does not track participants to graduation and criterion Y is not met.

    "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)

  • P

    The paper reports IRB approval but no public preregistration record was identified.

    "The protocol was approved by the IRB of MIT (ID 21070000428)." (p. 23)

Abstract

With today's wide adoption of LLM products like ChatGPT from OpenAI, humans and businesses engage and use LLMs on a daily basis. Like any other tool, it carries its own set of advantages and limitations. This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay. We assigned participants to three groups: LLM group, Search Engine group, Brain-only group, where each participant used a designated tool (or no tool in the latter) to write an essay. We conducted 3 sessions with the same group assignment for each participant. In the 4th session we asked LLM group participants to use no tools (we refer to them as LLM-to-Brain), and the Brain-only group participants were asked to use LLM (Brain-to-LLM). We recruited a total of 54 participants for Sessions 1, 2, 3, and 18 participants among them completed session 4. We used electroencephalography (EEG) to record participants' brain activity in order to assess their cognitive engagement and cognitive load, and to gain a deeper understanding of neural activations during the essay writing task.

ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • Randomization was at the individual participant level rather than at the class (or school) level.
      • "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)
      • Relevant Quotes:
        1) "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)
        2) "These 54 participants were between the ages of 18 to 39 years old (age M = 22.9, SD = 1.69) and all recruited from the following 5 universities in greater Boston area:" (p. 22)
      • Detailed Analysis: Criterion C requires that randomization be done at the class level (or stronger, school level), unless the intervention is explicitly 1-on-1 tutoring. Here, the paper clearly states that participants (individual adults) were randomly assigned to tool conditions. Because the unit of randomization is individuals rather than classes or schools, the design does not satisfy the ERCT class-level randomization requirement. Criterion C is not met because assignment was individual-level, not class-level (or school-level).
    • E

      Exam-based Assessment

      • Outcomes were scored via teacher ratings and an internal AI judge, not via a standardized externally administered exam.
      • "We performed scoring with the help from the human teachers and an AI judge (a specially built AI agent)." (p. 2)
      • Relevant Quotes:
        1) "We performed scoring with the help from the human teachers and an AI judge (a specially built AI agent)." (p. 2)
        2) "We asked two English teachers to evaluate essays using different metrics like: Uniqueness, Vocabulary, Grammar, Organization, Content, Length and ChatGPT (a metric which says if a teacher thinks that essay was written with the help of LLM)." (p. 62)
        3) "All the topics were taken from SAT tests." (p. 25)
      • Detailed Analysis: Criterion E requires standardized exam-based assessment (a recognized standardized test administered and scored in a standardized way). The study uses SAT prompts as topics, but the outcome scoring is done through teacher ratings and a custom "AI judge". This is a bespoke research scoring pipeline rather than a standardized exam outcome. Criterion E is not met because assessment is not a standardized exam administration and scoring process.
    • T

      Term Duration

      • Outcomes were measured within short sessions rather than at least one academic term after the intervention began.
      • "Total duration of the study (Stages 1-5) was approximately 1h (60 minutes)." (p. 28)
      • Relevant Quotes:
        1) "The participants were instructed to pick a topic among the proposed prompts, and then to produce an essay based on the topic's assignment within a 20 minutes time limit." (p. 27)
        2) "Total duration of the study (Stages 1-5) was approximately 1h (60 minutes)." (p. 28)
      • Detailed Analysis: Criterion T requires outcomes measured at least one academic term after the intervention begins (or term-long follow-up). The intervention and outcome measurement occur within short lab sessions: a 20-minute essay task within an approximately 1-hour protocol. The paper does not report term-delayed educational outcome measurement. Criterion T is not met because outcomes are measured immediately within short sessions, not after a term.
    • D

      Documented Control Group

      • The control condition and sample are clearly documented, including what the control group could and could not do.
      • "Brain-only Group (Group 3): Participants in this group were forbidden from using both LLM and any online websites for consultation." (p. 23)
      • Relevant Quotes:
        1) "Brain-only Group (Group 3): Participants in this group were forbidden from using both LLM and any online websites for consultation." (p. 23)
        2) "These 54 participants were between the ages of 18 to 39 years old (age M = 22.9, SD = 1.69) and all recruited from the following 5 universities in greater Boston area:" (p. 22)
      • Detailed Analysis: Criterion D requires clear documentation of the control condition and sufficient description of participants. The paper explicitly defines the Brain-only (no tools) condition and provides key participant demographics and recruitment context. Criterion D is met because the control condition and sample are described with operational detail.
  • Level 2 Criteria

    • S

      School-level RCT

      • The study did not randomize at the school (site) level.
      • "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)
      • Relevant Quotes:
        1) "Participants were randomly assigned across the three following groups, balanced with respect to age and gender:" (p. 23)
        2) "These 54 participants were between the ages of 18 to 39 years old (age M = 22.9, SD = 1.69) and all recruited from the following 5 universities in greater Boston area:" (p. 22)
      • Detailed Analysis: Criterion S requires school-level randomization (schools or sites as the unit of assignment). Although participants were recruited from multiple universities, the paper states that individuals were randomly assigned to conditions. There is no indication that universities (or other sites) were randomized. Criterion S is not met because randomization was not school-level.
    • I

      Independent Conduct

      • The study was proposed, designed, executed, and analyzed by the author team, not by an independent evaluator.
      • "The study was proposed, designed, and executed by NK." (p. 143)
      • Relevant Quotes:
        1) "The study was proposed, designed, and executed by NK." (p. 143)
        2) "NK and EH processed and analyzed both EEG and NLP data in this study." (p. 143)
      • Detailed Analysis: Criterion I requires independent conduct, meaning implementation and evaluation are carried out (or overseen) by an organization independent of the intervention's designers. The author contributions section states that the study was proposed, designed, executed, and analyzed by the authors. The paper does not describe independent third-party oversight for running the study and analyzing outcomes. Criterion I is not met because the study was conducted and evaluated by the authors rather than an independent evaluator.
    • Y

      Year Duration

      • The study does not measure outcomes over a full academic year, and criterion T is not met.
      • "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)
      • Relevant Quotes:
        1) "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)
      • Detailed Analysis: Criterion Y requires outcome measurement spanning at least one academic year from intervention start. The paper reports a multi-session study occurring over about four months, which is shorter than an academic year. Also, per ERCT rules, if criterion T is not met then criterion Y is not met. Criterion Y is not met because the study is not year-long (and T is not met).
    • B

      Balanced Control Group

      • Time-on-task is held constant across groups and the tested difference is tool access, not extra instructional time or budget.
      • "The participants were instructed to pick a topic among the proposed prompts, and then to produce an essay based on the topic's assignment within a 20 minutes time limit." (p. 27)
      • Relevant Quotes:
        1) "The participants were instructed to pick a topic among the proposed prompts, and then to produce an essay based on the topic's assignment within a 20 minutes time limit." (p. 27)
        2) "LLM Group (Group 1): Participants in this group were restricted to using OpenAI's GPT-4o as their sole resource of information for the essay writing task." (p. 23)
        3) "Search Engine Group (Group 2): Participants in this group could use any website to help them with their essay writing task, but ChatGPT or any other LLM was explicitly prohibited; all participants used Google as a browser of choice." (p. 23)
        4) "Brain-only Group (Group 3): Participants in this group were forbidden from using both LLM and any online websites for consultation." (p. 23)
      • Detailed Analysis: Criterion B compares resources (time, budget, instructional inputs) between intervention and control, unless the extra resource is the treatment variable being tested. The study holds writing time constant (20 minutes) and varies access to external tools (GPT-4o, web search, or no tools). There is no indication of additional instructional time, tutoring, or materials beyond tool access. Because tool access is the intended treatment and time-on-task is matched, the resource-balance requirement is satisfied. Criterion B is met because groups are balanced on time and the resource difference is the intervention itself.
  • Level 3 Criteria

    • R

      Reproduced

      • No independent replication of this specific study was identified at the time of this ERCT check.
      • Relevant Quotes: (No quotes found regarding independent replication of this study.)
      • Detailed Analysis: Criterion R requires an independent reproduction by a different research team, published in a peer-reviewed venue. A targeted search by paper title, DOI, and arXiv identifier did not identify an independent replication study reproducing this protocol. Criterion R is not met because no independent replication was found.
    • A

      All-subject Exams

      • The study does not use standardized exams across all main subjects, and criterion E is not met.
      • "This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay." (p. 2)
      • Relevant Quotes:
        1) "This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay." (p. 2)
      • Detailed Analysis: Criterion A requires standardized exam-based outcomes across all main subjects (or an equivalent broad standardized outcome set). This study focuses on a single domain task (essay writing), and per ERCT rules, if criterion E is not met then criterion A is not met. Criterion A is not met because outcomes are not standardized exams across all core subjects (and E is not met).
    • G

      Graduation Tracking

      • The study does not track participants to graduation and criterion Y is not met.
      • "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)
      • Relevant Quotes:
        1) "The study took place over a period of 4 months, due to the scheduling and availability of the participants." (p. 23)
      • Detailed Analysis: Criterion G requires tracking participants until graduation. The paper reports a short multi-session study with no graduation follow-up. Per ERCT rules, if criterion Y is not met then criterion G is not met. A targeted search for subsequent graduation-tracking papers by the same authors describing this cohort did not identify such follow-up publications. Criterion G is not met because there is no graduation tracking and Y is not met.
    • P

      Pre-Registered

      • The paper reports IRB approval but no public preregistration record was identified.
      • "The protocol was approved by the IRB of MIT (ID 21070000428)." (p. 23)
      • Relevant Quotes:
        1) "The protocol was approved by the IRB of MIT (ID 21070000428)." (p. 23)
      • Detailed Analysis: Criterion P requires a publicly preregistered protocol with a registration date before data collection began (for example OSF, AsPredicted, ClinicalTrials.gov, or another public registry). The paper provides IRB approval information but does not provide a preregistration link or registry identifier. A targeted search using the paper title and author names did not identify a matching public preregistration record. Criterion P is not met because no public preregistration record was found.
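
The dependency rules cited in the breakdown (Y cannot be met unless T is met, A unless E, G unless Y) can be sketched in Python as follows. This is a minimal illustration, not ERCT's own tooling. The names `VERDICTS`, `PREREQUISITES`, `LEVELS`, `apply_prerequisites`, and `erct_level` are hypothetical, and the level rule used here (a level is reached only if every criterion at that level and all lower levels is met) is an assumption that this page does not state.

```python
# Verdicts as recorded in the breakdown above (True = criterion met).
VERDICTS = {
    "C": False, "E": False, "T": False, "D": True,   # Level 1
    "S": False, "I": False, "Y": False, "B": True,   # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

# Dependency rules quoted in the analysis: criterion -> prerequisite.
PREREQUISITES = {"Y": "T", "A": "E", "G": "Y"}

# Criteria grouped by level, in the order the page lists them.
LEVELS = {1: "CETD", 2: "SIYB", 3: "RAGP"}


def apply_prerequisites(verdicts):
    """Return a copy in which a criterion is unmet if its prerequisite is unmet."""
    resolved = dict(verdicts)
    changed = True
    while changed:  # repeat so chains such as T -> Y -> G propagate fully
        changed = False
        for criterion, prerequisite in PREREQUISITES.items():
            if resolved[criterion] and not resolved[prerequisite]:
                resolved[criterion] = False
                changed = True
    return resolved


def erct_level(verdicts):
    """ASSUMPTION: level n needs every criterion at levels 1..n to be met."""
    level = 0
    for n in sorted(LEVELS):
        if all(verdicts[c] for c in LEVELS[n]):
            level = n
        else:
            break
    return level


if __name__ == "__main__":
    resolved = apply_prerequisites(VERDICTS)
    met = [c for c, ok in resolved.items() if ok]
    print("met:", met, "level:", erct_level(resolved))
```

With the verdicts recorded above, only D and B are met, so under the assumed level rule the study does not reach Level 1.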
