The Effect of Transparency on Students’ Perceptions of AI Graders

Joslyn Orgill, Andra Rice, Max Fowler, Seth Poulsen

Published:
ERCT Check Date:
DOI: 10.48550/arXiv.2601.00765
  • science
  • higher education
  • US
  • EdTech platform
  • digital assessment
  • formative assessment
0
  • C

    Randomization was at the individual student level (not class- or school-level), and no one-to-one tutoring exception is stated.

    "The students were randomly assigned to the experimental or control condition."

  • E

    Outcomes rely on manually graded EiPE responses and survey items rather than a widely recognized standardized exam.

    "The second author scored all the pre and post-test EiPE answers."

  • T

    The activity and measurement occur in a short time window (late in a semester) rather than at least one full term after the intervention begins.

    "Students were recruited during the second-to-last week of the semester, once they had learned most of the course content, to complete an additional learning activity for a small amount of extra credit."

  • D

    The paper describes what the control group received and gives group sizes, but does not report detailed control group demographics and baseline characteristics as required by ERCT.

    "Students in the control condition were graded by the EiPE autograder as usual and did not receive any additional information."

  • S

    Randomization was not conducted at the school (or site) level; it was conducted at the student level in a single university course.

    "Participants for this study were recruited from an introductory computer science course for non-major students at a large public university in the United States."

  • I

    Key measurement and analysis were conducted by the author team, with no clearly described independent third-party evaluation.

    "The second author scored all the pre and post-test EiPE answers."

  • Y

    The study is a short activity with immediate post measures and does not track outcomes for at least 75% of an academic year (and T is not met).

    "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences."

  • B

    The treatment adds transparency information and a quiz as the treatment variable being tested; the extra time/inputs are integral to the intervention rather than a confound.

    "Students in the experimental condition were asked to read a transparency statement and quizzed on the content of the statement."

  • R

    No independent replication of this specific transparency RCT was found in the paper or via external literature search.

  • A

    Because the study does not use standardized exam-based assessments (E not met), it cannot satisfy the all-subject standardized exams requirement.

  • G

    The study does not track participants through graduation; additionally, Y is not met, so G cannot be met under ERCT.

    "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences."

  • P

    No protocol registry link/ID or dated statement indicating pre-registration prior to data collection was found in the paper or via registry-focused search.

Abstract

The development of effective autograders is key for scaling assessment and feedback. While NLP-based autograding systems for open-ended response questions have been found to be beneficial for providing immediate feedback, autograders are not always liked, understood, or trusted by students. Our research tested the effect of transparency on students’ attitudes towards autograders. Transparent autograders increased students’ perceptions of autograder accuracy and willingness to discuss autograders in survey comments, but did not improve other related attitudes, such as willingness to be graded by them on a test, relative to the control without transparency. However, this lack of impact may be due to higher measured student trust towards autograders in this study than in prior work in the field. We briefly discuss possible reasons for this trend.


ERCT Criteria Breakdown

  • Level 1 Criteria

    • C

      Class-level RCT

      • Randomization was at the individual student level (not class- or school-level), and no one-to-one tutoring exception is stated.
      • "The students were randomly assigned to the experimental or control condition."
      • Relevant Quotes:
        1) "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences." (p. 4)
        2) "The students were randomly assigned to the experimental or control condition." (p. 5)
        3) "There were 81 students in the treatment group and 74 students in the control." (p. 5)
      • Detailed Analysis: Criterion C requires random assignment at the class level (or stronger), unless the intervention is explicitly one-to-one tutoring/personal teaching. Here, the paper describes a randomized controlled trial where individual "students" were randomly assigned to conditions within one course context. The intervention is transparency information about an AI grader (plus related materials), not one-to-one tutoring. The unit of randomization is not described as classes/sections (nor schools) being randomized.
      • Final sentence: Criterion C is not met because individual students (not classes/schools) were randomly assigned.
    • E

      Exam-based Assessment

      • Outcomes rely on manually graded EiPE responses and survey items rather than a widely recognized standardized exam.
      • "The second author scored all the pre and post-test EiPE answers."
      • Relevant Quotes:
        1) "2. Pre-test (manually graded EiPE questions posthoc)" (p. 6)
        2) "6. Post-test (isomorphs of the pre-test, manually graded EiPE questions posthoc)" (p. 6)
        3) "After the post-test, we surveyed the student’s attitudes on their experience with the autograder." (p. 7)
        4) "The second author scored all the pre and post-test EiPE answers." (p. 7)
        5) "Answers were scored on binary correctness, either a 0 (incorrect) or 1 (correct), using the same grading standards as found in previous EiPE autograder work." (p. 7)
      • Detailed Analysis: Criterion E requires a standardized, widely recognized exam-based assessment (e.g., a state/national standardized test, or a validated standardized instrument administered/scored under a standardized protocol). In this paper, learning/performance measurement is based on EiPE questions that are manually graded post hoc by an author using a binary 0/1 rubric, and attitudinal outcomes are measured via survey (Likert-type items). The paper does not describe using any widely recognized standardized exam instrument. Therefore, the assessments do not satisfy ERCT’s standardized exam requirement.
      • Final sentence: Criterion E is not met because the study uses researcher/course-context assessments and surveys rather than standardized exams.
    • T

      Term Duration

      • The activity and measurement occur in a short time window (late in a semester) rather than at least one full term after the intervention begins.
      • "Students were recruited during the second-to-last week of the semester, once they had learned most of the course content, to complete an additional learning activity for a small amount of extra credit."
      • Relevant Quotes:
        1) "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences." (p. 4)
        2) "Students were recruited during the second-to-last week of the semester, once they had learned most of the course content, to complete an additional learning activity for a small amount of extra credit." (p. 5)
        3) "As our primary interest for this study was the impact of transparency on students’ perceptions and attitudes, and not the impact this short task would have on student performance, we were not concerned with reconciling these grades over multiple graders." (p. 7)
      • Detailed Analysis: Criterion T requires that outcomes be measured at least one full academic term after the intervention begins (or, equivalently, that the study tracks outcomes for at least a term from start to measurement). The paper frames the intervention as a "short learning activity" and indicates students were recruited in the "second-to-last week of the semester," with survey responses collected afterwards. This indicates the elapsed time from intervention exposure to outcome measurement is short and not term-long.
      • Final sentence: Criterion T is not met because the study’s start-to-measurement interval is not at least one academic term.
    • D

      Documented Control Group

      • The paper describes what the control group received and gives group sizes, but does not report detailed control group demographics and baseline characteristics as required by ERCT.
      • "Students in the control condition were graded by the EiPE autograder as usual and did not receive any additional information."
      • Relevant Quotes:
        1) "Students in the control condition were graded by the EiPE autograder as usual and did not receive any additional information." (p. 4)
        2) "There were 81 students in the treatment group and 74 students in the control." (p. 5)
        3) "We graded the tests primarily to ensure that the two groups of students were roughly comparable." (p. 8)
      • Detailed Analysis: Criterion D requires a well-documented control group, typically including (a) a clear description of what the control group received, and (b) reported control-group characteristics such as demographics and baseline performance/characteristics. The paper clearly describes the control condition (no additional transparency information) and reports group sizes. It also states pre/post tests were graded to ensure groups were "roughly comparable." However, the paper does not present detailed control group demographics (e.g., gender, prior preparation) or a clear baseline-characteristics table by group that would satisfy ERCT’s documentation requirement.
      • Final sentence: Criterion D is not met because, although the control condition is described, control-group demographics and baseline characteristics are not reported in sufficient detail.
  • Level 2 Criteria

    • S

      School-level RCT

      • Randomization was not conducted at the school (or site) level; it was conducted at the student level in a single university course.
      • "Participants for this study were recruited from an introductory computer science course for non-major students at a large public university in the United States."
      • Relevant Quotes:
        1) "Participants for this study were recruited from an introductory computer science course for non-major students at a large public university in the United States." (p. 5)
        2) "The students were randomly assigned to the experimental or control condition." (p. 5)
      • Detailed Analysis: Criterion S requires school-level (or institution/site-level) randomization, meaning whole schools (or comparable educational units implementing the intervention) are randomized. This study occurs within one university course and randomizes individual students, not schools/sites/sections treated as sites.
      • Final sentence: Criterion S is not met because assignment is at the student level within one course rather than at a school/site level.
    • I

      Independent Conduct

      • Key measurement and analysis were conducted by the author team, with no clearly described independent third-party evaluation.
      • "The second author scored all the pre and post-test EiPE answers."
      • Relevant Quotes:
        1) "The second author scored all the pre and post-test EiPE answers." (p. 7)
        2) "Two authors conducted a thematic analysis of these responses." (p. 10)
      • Detailed Analysis: Criterion I requires that the evaluation be conducted independently from the intervention designers (e.g., an external evaluation team conducting scoring/data analysis). The paper reports that an author scored the pre/post answers and that authors conducted the thematic analysis of open-ended survey responses. The paper does not describe an independent evaluator conducting implementation, scoring, or analysis.
      • Final sentence: Criterion I is not met because evaluation/scoring and qualitative analysis were performed by the authors without a documented independent third party.
    • Y

      Year Duration

      • The study is a short activity with immediate post measures and does not track outcomes for at least 75% of an academic year (and T is not met).
      • "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences."
      • Relevant Quotes:
        1) "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences." (p. 4)
        2) "Students were recruited during the second-to-last week of the semester, once they had learned most of the course content, to complete an additional learning activity for a small amount of extra credit." (p. 5)
      • Detailed Analysis: Criterion Y requires outcomes measured at least 75% of one academic year after the intervention begins. The paper describes a "short learning activity" with survey measures collected afterwards, and recruitment late in the semester. Additionally, per ERCT rule: if Criterion T is not met, then Criterion Y is not met. Since this study does not meet T, it cannot meet Y.
      • Final sentence: Criterion Y is not met because the study is short-term and does not track outcomes for at least 75% of an academic year.
    • B

      Balanced Control Group

      • The treatment adds transparency information and a quiz as the treatment variable being tested; the extra time/inputs are integral to the intervention rather than a confound.
      • "Students in the experimental condition were asked to read a transparency statement and quizzed on the content of the statement."
      • Relevant Quotes:
        1) "Students in the control condition were graded by the EiPE autograder as usual and did not receive any additional information." (p. 4)
        2) "The students in the experimental condition received additional information about the autograding systems that graded their learning activities." (p. 4)
        3) "Students in the experimental condition were asked to read a transparency statement and quizzed on the content of the statement." (p. 6)
        4) "The control group was only told that the learning activity made use of an AI autograder. No transparency quiz was given and their questions did not contain extra information on the performance of each autograder." (p. 6)
      • Detailed Analysis: Criterion B compares the nature, quantity, and quality of resources (time, materials, supports) provided to intervention and control conditions, and asks whether the control condition offers a comparable substitute for those inputs, unless the additional inputs are explicitly the treatment variable being tested. The experimental condition receives added content and time-on-task (a transparency statement, a quiz, and extra per-question transparency information). The control condition does not receive these additions. However, the paper defines these added informational inputs as the intervention itself (i.e., "transparency" is the manipulated treatment). Under the ERCT Criterion B decision logic, this means the extra time/resources are integral to the treatment variable and not an accidental add-on that should have been balanced.
      • Final sentence: Criterion B is met because the extra informational inputs are the intended transparency treatment being tested against a no-transparency baseline.
  • Level 3 Criteria

    • R

      Reproduced

      • No independent replication of this specific transparency RCT was found in the paper or via external literature search.
      • Relevant Quotes:
        1) (No statement describing an independent replication of this trial was found in the paper.)
      • Detailed Analysis: Criterion R requires an independently replicated study by a different research team in a different context, published in a peer-reviewed venue (replication may occur after the original study). The paper does not claim to be a replication and does not report that other teams reproduced this specific transparency manipulation. External internet searching (by title, authors, and DOI/arXiv identifier) did not identify any peer-reviewed, independent replication of this specific RCT as of the ERCT check date.
      • Final sentence: Criterion R is not met because no independent replication of this specific study was found.
    • A

      All-subject Exams

      • Because the study does not use standardized exam-based assessments (E not met), it cannot satisfy the all-subject standardized exams requirement.
      • Relevant Quotes:
        1) "2. Pre-test (manually graded EiPE questions posthoc)" (p. 6)
        2) "After the post-test, we surveyed the student’s attitudes on their experience with the autograder." (p. 7)
      • Detailed Analysis: Criterion A requires impacts measured across all main subjects using standardized exams, and ERCT specifies that if Criterion E is not met, then Criterion A is not met. Here, assessments are manually graded EiPE questions plus survey items, not standardized exams, so E is not met and A cannot be met.
      • Final sentence: Criterion A is not met because Criterion E is not met and the study does not use standardized exams.
    • G

      Graduation Tracking

      • The study does not track participants through graduation; additionally, Y is not met, so G cannot be met under ERCT.
      • "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences."
      • Relevant Quotes:
        1) "To answer our research questions, we conducted a randomized controlled trial in which students completed a short learning activity, then afterwards responded to a survey about their experiences." (p. 4)
        2) "Students were recruited during the second-to-last week of the semester, once they had learned most of the course content, to complete an additional learning activity for a small amount of extra credit." (p. 5)
      • Detailed Analysis: Criterion G requires follow-up tracking until graduation from the relevant educational stage. This paper describes a short, late-semester activity within a university course with immediate post-activity measures; it contains no description of graduation tracking. Additionally, per ERCT rule: if Criterion Y is not met, then Criterion G is not met. Since Y is not met here, G cannot be met. Internet searching for follow-up publications by the same authors reporting graduation tracking for this cohort did not identify any such follow-up study as of the ERCT check date.
      • Final sentence: Criterion G is not met because there is no graduation-length follow-up, and Y is not met.
    • P

      Pre-Registered

      • No protocol registry link/ID or dated statement indicating pre-registration prior to data collection was found in the paper or via registry-focused search.
      • Relevant Quotes:
        1) (No statement of pre-registration, registry ID/link, or registration date was found in the paper.)
      • Detailed Analysis: Criterion P requires an explicit, citable pre-registration statement (platform + ID/link) and evidence that registration occurred before data collection began. The paper’s methods describe the RCT procedures and analyses but do not mention pre-registration (e.g., OSF Registries, AEA RCT Registry, ClinicalTrials.gov, or a comparable registry). Additional internet searching using the paper title, authors, and DOI/arXiv identifier did not locate a public pre-registration record tied to this study as of the ERCT check date.
      • Final sentence: Criterion P is not met because no pre-registration information was found in the paper or via external search.
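
The breakdown above cites three dependency rules between criteria: Y cannot be met if T is not met, A cannot be met if E is not met, and G cannot be met if Y is not met. The following is a minimal Python sketch of how those rules could be checked against the outcomes reported here. The names `reported`, `PREREQUISITES`, `apply_prerequisites`, and `inconsistencies` are hypothetical and chosen for illustration; this is not ERCT tooling, and it encodes only the three rules quoted in this evaluation.

```python
# Outcomes per criterion as reported in the breakdown above (True = met).
reported = {
    "C": False, "E": False, "T": False, "D": False,
    "S": False, "I": False, "Y": False, "B": True,
    "R": False, "A": False, "G": False, "P": False,
}

# Dependency rules quoted in the analysis (hypothetical encoding):
# the key cannot be met unless the value is met.
PREREQUISITES = {"Y": "T", "A": "E", "G": "Y"}


def apply_prerequisites(outcomes):
    """Return a copy in which any criterion with an unmet prerequisite is unmet."""
    resolved = dict(outcomes)
    changed = True
    # Repeat until stable so that chains such as T -> Y -> G propagate.
    while changed:
        changed = False
        for criterion, prerequisite in PREREQUISITES.items():
            if resolved[criterion] and not resolved[prerequisite]:
                resolved[criterion] = False
                changed = True
    return resolved


def inconsistencies(outcomes):
    """List criteria marked as met even though a prerequisite rule forbids it."""
    resolved = apply_prerequisites(outcomes)
    return sorted(c for c in outcomes if outcomes[c] != resolved[c])


if __name__ == "__main__":
    print(inconsistencies(reported))
    print(sorted(c for c, met in apply_prerequisites(reported).items() if met))
```

Run against the outcomes listed in this evaluation, the check finds no criterion marked as met in violation of a rule, and B remains the only criterion met. A hypothetical record that marked Y and G as met while T was not would be flagged for both Y and G, because the T rule invalidates Y and the Y rule then invalidates G.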
