Smartphone-Based Augmented Reality for Calligraphy Instruction: Randomized Evidence on Self-Efficacy, Flow, and Artifact Quality

Shuhan Guo, Zhian Chen, Chengxi Jiang, Zhe Song, and Guanghui Huang

Published: Dec 13, 2025

ERCT Check Date: Apr 14, 2026

DOI: 10.1145/3779232.3779469

arts
higher education
China
Asia
EdTech app
mobile learning

C

The study randomized individual participants rather than classes (and it is not a one-to-one tutoring exception).

"Participants were randomized 1:1 to an AR-supported calligraphy group or a conventional model-copying group (n=10 per arm) using a computer-generated sequence prepared by an independent research assistant, with allocation concealed in opaque sealed envelopes and revealed after baseline assessment (Figure 3)."
E

The outcomes are questionnaires and rubric/expert ratings, not a widely recognized standardized exam.

"Calligraphy self-efficacy was measured using a contextualized Chinese version of the MSLQ “Self-Efficacy for Learning and Performance” subscale [Pintrich and Groot 1990]."
T

The study tracks outcomes over four weeks with only a one-week follow-up, which is shorter than one academic term.

"This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control."
D

The control condition and baseline equivalence are documented, including what the control group did and a baseline table (Table 2).

"The control group watched blackboard or projector demonstrations and copied the examples without using any digital overlay techniques."
S

The trial is single-institution with participant-level randomization rather than school/site-level randomization.

"We recruited twenty adult students from a single institution."
I

The paper reports some safeguards (independent assistant for randomization and blinded raters), but it does not document that the study was conducted by an independent external evaluation team separate from the intervention developers.

"Block randomization (block size 4) was generated in advance by a research assistant independent of instruction, with allocation concealed in opaque, sealed envelopes and revealed after baseline assessment."
Y

Outcomes are measured over weeks rather than at least 75% of an academic year, and Y is also disqualified because T is not met.

"This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control."
B

Time and instructor attention are explicitly matched across arms, and the extra technology (smartphone AR overlay) is the integral treatment contrast rather than an unbalanced add-on resource.

"Session pacing, the number of characters practiced, and instructor time-on-task were matched across arms."
R

No independent replication study by other authors was found in the paper or via internet searching, and the authors instead call for larger, longer trials.

"They also warrant larger and longer trials with multimethod assessment."
A

The study does not use standardized exam-based assessments across subjects, and A is disqualified because E is not met.
G

The study includes only immediate post-test and a one-week follow-up, and no follow-up paper tracking participants to graduation was found; G is also disqualified because Y is not met.

"The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up..."
P

The paper states outcomes and analyses were prespecified but provides no public preregistration record (registry, ID, and preregistration date) showing registration before data collection began.

"Statistical analysis was prespecified as follows."

Abstract

We conducted a parallel-group randomized controlled trial in routine calligraphy classes, comparing smartphone-based augmented reality with conventional model copying in twenty adults. The instructor, copybook, venue, and contact time were held constant. The primary outcome—calligraphy self-efficacy—improved more in the augmented reality group at post-test and at the one-week follow-up. Secondary outcomes showed a consistent but non-significant advantage for momentary flow. Process efficiency also improved, with fewer revision cycles per character, shorter time to criterion, and a higher completion rate, while subjective workload was similar between groups. Blinded expert ratings favored augmented reality for stroke-order fidelity, stroke continuity, component proportion, and overall balance. These findings support the feasibility of aligning process visualization with the temporal and structural demands of calligraphy. They also warrant larger and longer trials with multimethod assessment.

ERCT Criteria Breakdown

Level 1 Criteria
- C
  Class-level RCT
  - The study randomized individual participants rather than classes (and it is not a one-to-one tutoring exception).
  - "Participants were randomized 1:1 to an AR-supported calligraphy group or a conventional model-copying group (n=10 per arm) using a computer-generated sequence prepared by an independent research assistant, with allocation concealed in opaque sealed envelopes and revealed after baseline assessment (Figure 3)."
  - Relevant Quotes:
    1) "We conducted a parallel-group randomized controlled trial in routine calligraphy classes, comparing smartphone-based augmented reality with conventional model copying in twenty adults." (Abstract)
    2) "We recruited twenty adult students from a single institution." (Section 3.2 Participants)
    3) "Participants were randomized 1:1 to an AR-supported calligraphy group or a conventional model-copying group (n=10 per arm) using a computer-generated sequence prepared by an independent research assistant, with allocation concealed in opaque sealed envelopes and revealed after baseline assessment (Figure 3)." (Section 3.2 Participants)
    4) "Both groups completed four sessions in the same calligraphy studio, once weekly for 30 minutes per session, taught by the same instructor using an identical running-script copybook and materials." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion C requires random assignment at the class level (or stronger, e.g., school/site level) to reduce contamination, unless the intervention is explicitly one-to-one tutoring/personal teaching (in which case student-level randomization is acceptable). The paper explicitly reports a parallel-group RCT with individual allocation ("Participants were randomized 1:1") of 20 adult students in a single institution, and both arms attended sessions in the same studio under the same instructor. The intervention is a classroom workflow ("routine calligraphy classes") and is not described as one-to-one tutoring. Therefore, the unit of randomization is the individual participant, not the class or school/site, and the tutoring exception does not apply. Criterion C is not met because randomization is at the individual level rather than the class (or school/site) level.
- E
  Exam-based Assessment
  - The outcomes are questionnaires and rubric/expert ratings, not a widely recognized standardized exam.
  - "Calligraphy self-efficacy was measured using a contextualized Chinese version of the MSLQ “Self-Efficacy for Learning and Performance” subscale [Pintrich and Groot 1990]."
  - Relevant Quotes:
    1) "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up—with items targeting four ability domains: stroke-order execution, control at key inflection points, component proportion, and whole-character balance." (Section 3.3 Procedures)
    2) "Flow was assessed using the short form of EduFlow-2 [Heutte et al. 2021]..." (Section 3.4 Measures)
    3) "Perceived learning was measured with Rovai’s CAP scale [Rovai et al. 2009]..." (Section 3.4 Measures)
    4) "Calligraphy self-efficacy was measured using a contextualized Chinese version of the MSLQ “Self-Efficacy for Learning and Performance” subscale [Pintrich and Groot 1990]." (Section 3.4 Measures)
    5) "Product quality and stroke-order execution are scored using a standardized rubric; practice completion and time-to-criterion are recorded; and questionnaires characterize learning experience and process." (Section 1 Introduction)
    6) "As an exploratory outcome, at least two trained, group-blinded raters independently scored final-session products; interrater reliability was quantified using consistency-type intraclass correlation coefficients (ICC)." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion E requires a standardized exam-based assessment that is widely recognized and externally standardized (for example, national/state standardized achievement tests), rather than study-specific measures or researcher-selected questionnaires aligned to the intervention. This study measures calligraphy self-efficacy (MSLQ-derived self-report), flow (EduFlow-2), perceived learning (CAP), process/efficiency metrics, and blinded expert ratings of produced artifacts using a rubric. While these are structured instruments and include blinding for product ratings, they are not described as a widely recognized, externally standardized exam-based assessment. Criterion E is not met because the study does not use a widely recognized standardized exam as the outcome measure.
- T
  Term Duration
  - The study tracks outcomes over four weeks with only a one-week follow-up, which is shorter than one academic term.
  - "This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control."
  - Relevant Quotes:
    1) "This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control." (Section 3.3 Procedures)
    2) "Both groups completed four sessions in the same calligraphy studio, once weekly for 30 minutes per session, taught by the same instructor using an identical running-script copybook and materials." (Section 3.3 Procedures)
    3) "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up..." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion T requires outcome measurement at least one full academic term (~3 to 4 months) after the intervention begins, to reduce the risk that results reflect only short-term or novelty effects. The paper explicitly states the trial is "four-week" with four weekly sessions, and the delayed measurement is only a "one-week follow-up" after post-test. This is substantially shorter than a full academic term. Criterion T is not met because the tracking period from the start of the intervention to the latest outcome measurement is only weeks, not a full academic term.
- D
  Documented Control Group
  - The control condition and baseline equivalence are documented, including what the control group did and a baseline table (Table 2).
  - "The control group watched blackboard or projector demonstrations and copied the examples without using any digital overlay techniques."
  - Relevant Quotes:
    1) "The control group watched blackboard or projector demonstrations and copied the examples without using any digital overlay techniques." (Section 3.3 Procedures)
    2) "Baseline characteristics were comparable in age, sex, device familiarity, and pretest self-efficacy (Table 2)." (Section 4 Results)
    3) "Table 2: Baseline sample characteristics and equivalence" (Section 4 Results)
  - Detailed Analysis: Criterion D requires that the control group be described in sufficient detail to understand what it received and to judge baseline comparability (e.g., demographics and baseline performance/measurements relevant to outcomes). The paper describes the control activities (watching demonstrations and copying without digital overlays) and provides a baseline table and an explicit statement that baseline characteristics were comparable (Table 2). This is adequate documentation for ERCT purposes. Criterion D is met because the paper clearly documents the control condition and baseline equivalence information.
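The Criterion T comparison is simple arithmetic, and a rough sketch of it follows. The weeks-per-month conversion and the use of the lower end of "~3 to 4 months" as the threshold are assumptions made for illustration; they are not taken from the paper or from the ERCT criteria text quoted here.

```python
# Average calendar month in weeks (assumption used only for this comparison).
WEEKS_PER_MONTH = 52 / 12

# Figures reported in the paper.
intervention_weeks = 4  # "four-week" trial with four weekly sessions
follow_up_weeks = 1     # "one-week follow-up" after post-test

# Generous upper bound on the time from first session to last measurement.
tracking_weeks = intervention_weeks + follow_up_weeks

# Lower end of "~3 to 4 months", expressed in weeks (assumed threshold).
term_min_weeks = 3 * WEEKS_PER_MONTH

meets_t = tracking_weeks >= term_min_weeks
print(f"tracked {tracking_weeks} weeks, term needs about {term_min_weeks:.0f}: met={meets_t}")
```

Even with the most generous reading of the timeline, the tracking period is well under half of the shortest plausible term length.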
Level 2 Criteria
- S
  School-level RCT
  - The trial is single-institution with participant-level randomization rather than school/site-level randomization.
  - "We recruited twenty adult students from a single institution."
  - Relevant Quotes:
    1) "We recruited twenty adult students from a single institution." (Section 3.2 Participants)
    2) "Participants were randomized 1:1 to an AR-supported calligraphy group or a conventional model-copying group (n=10 per arm)..." (Section 3.2 Participants)
  - Detailed Analysis: Criterion S requires randomization at the school/site level (i.e., multiple schools/sites allocated to intervention vs control). This study recruited participants from a single institution and randomized individuals within that single setting. Criterion S is not met because there is no school/site-level random assignment among multiple sites.
- I
  Independent Conduct
  - The paper reports some safeguards (independent assistant for randomization and blinded raters), but it does not document that the study was conducted by an independent external evaluation team separate from the intervention developers.
  - "Block randomization (block size 4) was generated in advance by a research assistant independent of instruction, with allocation concealed in opaque, sealed envelopes and revealed after baseline assessment."
  - Relevant Quotes:
    1) "Block randomization (block size 4) was generated in advance by a research assistant independent of instruction, with allocation concealed in opaque, sealed envelopes and revealed after baseline assessment." (Section 3.3 Procedures)
    2) "As an exploratory outcome, at least two trained, group-blinded raters independently scored final-session products..." (Section 3.3 Procedures)
    3) "Android was initially chosen as the development platform, and various augmented reality (AR) development platforms were compared (Table 1)." (Section 3.1 Development Process)
  - Detailed Analysis: Criterion I requires independent conduct: the evaluation should be carried out by people/organizations independent from the intervention designers/developers, to reduce bias risks in implementation, measurement, and analysis. The paper does include partial independence safeguards: randomization was prepared by an assistant independent of instruction, and product ratings were conducted by blinded raters. However, these statements do not establish that the overall trial (implementation, data collection governance, and analysis leadership) was conducted by an external evaluation team independent of the intervention developers. In addition, the methods include detailed intervention development decisions (e.g., selecting AR platforms), strongly indicating the author team developed the AR system being tested, with no clear statement of an independent evaluator leading the study. Criterion I is not met because full independent conduct of the evaluation is not clearly documented.
- Y
  Year Duration
  - Outcomes are measured over weeks rather than at least 75% of an academic year, and Y is also disqualified because T is not met.
  - "This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control."
  - Relevant Quotes:
    1) "This four-week, parallel-group randomized controlled trial enrolled adult participants to minimize variability in compliance and device use and to improve implementation control." (Section 3.3 Procedures)
    2) "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up..." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion Y requires outcome measurement at least 75% of an academic year after the intervention begins. This study is four weeks long with only a one-week follow-up, far below that threshold. Additionally, per the ERCT dependency rule, if criterion T is not met then criterion Y is not met. Criterion Y is not met because the study’s tracking period is weeks rather than (most of) an academic year.
- B
  Balanced Control Group
  - Time and instructor attention are explicitly matched across arms, and the extra technology (smartphone AR overlay) is the integral treatment contrast rather than an unbalanced add-on resource.
  - "Session pacing, the number of characters practiced, and instructor time-on-task were matched across arms."
  - Relevant Quotes:
    1) "The instructor, copybook, venue, and contact time were held constant." (Abstract)
    2) "Session pacing, the number of characters practiced, and instructor time-on-task were matched across arms." (Section 3.3 Procedures)
    3) "We employed a smartphone-based handheld system owing to its portability and cost-effectiveness, leveraging learners’ own devices to avoid additional hardware investment and to facilitate deployment across diverse classroom and practice settings [Huang et al. 2019]." (Section 1 Introduction)
  - Detailed Analysis: Criterion B requires comparing the nature, quantity, and quality of resources (time, budget, materials, adult support) across intervention and control, and determining whether any extra resources are either (a) balanced in the control condition, or (b) explicitly integral to the treatment being tested (so that business-as-usual control is acceptable by design). The paper states that contact time and core instructional conditions (instructor, copybook, venue) are held constant, and it explicitly states instructor time-on-task and pacing are "matched across arms." This strongly supports balanced time and teacher attention. The AR group uses a smartphone-based AR overlay. This is an additional tool, but it is the central treatment contrast ("comparing smartphone-based augmented reality with conventional model copying") rather than an unacknowledged extra resource that should have been matched independently of the treatment. The paper also indicates participants use their own devices, reducing the interpretation that the AR condition is simply a budget increase unrelated to the intervention definition. Criterion B is met because time and instructor resources are matched and the remaining difference (the AR tool) is the integral intervention being tested.
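The quotes under Criteria C and I describe a computer-generated 1:1 sequence built with block randomization (block size 4). The sketch below shows one common way such a sequence can be produced with permuted blocks. It is an illustration only, not the authors' actual procedure: the function name `block_randomize`, the arm labels, and the seed are hypothetical.

```python
import random


def block_randomize(n_participants, block_size=4, arms=("AR", "control"), seed=None):
    """Build an allocation sequence from permuted blocks with equal arms per block."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # local generator so the sequence is reproducible
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * per_arm  # e.g. two "AR" and two "control" slots
        rng.shuffle(block)            # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]


# Twenty participants, as in the trial; the seed value is arbitrary.
allocation = block_randomize(20, block_size=4, seed=2025)
print(allocation)
```

Because 20 is a multiple of the block size, every block is complete and the sequence yields exactly ten participants per arm, which matches the reported n=10 per arm.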
Level 3 Criteria
- R
  Reproduced
  - No independent replication study by other authors was found in the paper or via internet searching, and the authors instead call for larger, longer trials.
  - "They also warrant larger and longer trials with multimethod assessment."
  - Relevant Quotes:
    1) "They also warrant larger and longer trials with multimethod assessment." (Abstract)
    2) "Given the small sample and the limited sensitivity of momentary classroom measures, interpretation should emphasize effect sizes and their 95% confidence intervals, and larger, longer trials are warranted to confirm these patterns." (Section 5 Conclusion and Discussion)
  - Detailed Analysis: Criterion R requires that the study be independently replicated by a different research team in a different context, with evidence available in peer-reviewed publications (often after the original publication). The paper itself does not cite any replication of this specific RCT; instead, it explicitly calls for "larger, longer trials," consistent with being an early/initial evaluation. Internet searching (by title, DOI, and author names) did not identify a peer-reviewed independent replication of this specific trial as of the ERCT check date. Criterion R is not met because no independent replication evidence was found.
- A
  All-subject Exams
  - The study does not use standardized exam-based assessments across subjects, and A is disqualified because E is not met.
  - Relevant Quotes:
    1) "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up..." (Section 3.3 Procedures)
    2) "Flow was assessed using the short form of EduFlow-2 [Heutte et al. 2021]..." (Section 3.4 Measures)
    3) "Perceived learning was measured with Rovai’s CAP scale [Rovai et al. 2009]..." (Section 3.4 Measures)
  - Detailed Analysis: Criterion A requires standardized exam-based assessments across all main subjects taught at the relevant educational level, and the ERCT rule states that if criterion E is not met then criterion A is not met. This study focuses on calligraphy-related outcomes and uses self-report questionnaires, process metrics, and rubric-based expert ratings rather than standardized exams. It also does not assess a multi-subject curriculum. Criterion A is not met because standardized exam-based assessment is not used (and criterion E is not met).
- G
  Graduation Tracking
  - The study includes only immediate post-test and a one-week follow-up, and no follow-up paper tracking participants to graduation was found; G is also disqualified because Y is not met.
  - "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up..."
  - Relevant Quotes:
    1) "The primary outcome was calligraphy self-efficacy, assessed at three time points—pre–session 1 (baseline), post–session 4 (post-test), and one-week follow-up—with items targeting four ability domains: stroke-order execution, control at key inflection points, component proportion, and whole-character balance." (Section 3.3 Procedures)
    2) "This four-week, parallel-group randomized controlled trial enrolled adult participants..." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion G requires tracking participants until graduation of the relevant educational stage, and ERCT also specifies that if criterion Y is not met then criterion G is not met. The paper reports only baseline, post-test, and a one-week follow-up after a four-week intervention, with no long-term tracking and no graduation endpoint. A targeted internet search for follow-up publications by the same author team tracking this cohort to graduation did not find such a paper as of the ERCT check date. Criterion G is not met because participants are not tracked anywhere near graduation (and Y is not met).
- P
  Pre-Registered
  - The paper states outcomes and analyses were prespecified but provides no public preregistration record (registry, ID, and preregistration date) showing registration before data collection began.
  - "Statistical analysis was prespecified as follows."
  - Relevant Quotes:
    1) "Outcomes were prespecified: the primary outcome is calligraphy self-efficacy; secondary outcomes comprise momentary flow, perceived learning (CAP: cognitive, affective, psychomotor), and process/efficiency metrics; blinded expert ratings of finished work serve as an exploratory outcome to triangulate effects on product quality." (Section 1 Introduction)
    2) "Statistical analysis was prespecified as follows." (Section 3.3 Procedures)
  - Detailed Analysis: Criterion P requires a publicly accessible preregistration of the protocol before the study begins, including the registry/platform and an identifier (or link) plus timing evidence (registration date before data collection). While the paper states that outcomes and statistical analysis were "prespecified," it does not provide a public preregistration record (e.g., OSF/AsPredicted/AEA RCT Registry/ClinicalTrials.gov) with a registration ID and date. Internet searching by title/DOI did not identify a public preregistration entry for this study. Criterion P is not met because no public preregistration record with timing information was found.
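The breakdown applies three dependency rules: Y is not met if T is not met, A is not met if E is not met, and G is not met if Y is not met. As a minimal sketch of how those rules combine with the per-criterion findings reported here, the snippet below encodes each criterion's direct finding and then propagates the dependencies. The dictionary layout and the function name `apply_dependencies` are illustrative assumptions, not part of any ERCT tooling.

```python
# Direct findings as stated in this breakdown (True = met on its own evidence).
direct = {
    "C": False, "E": False, "T": False, "D": True,   # Level 1
    "S": False, "I": False, "Y": False, "B": True,   # Level 2
    "R": False, "A": False, "G": False, "P": False,  # Level 3
}

# Dependencies named in the text: Y requires T, A requires E, G requires Y.
REQUIRES = {"Y": "T", "A": "E", "G": "Y"}


def apply_dependencies(findings):
    """Return final results after disqualifying criteria whose prerequisite fails."""
    final = dict(findings)
    # Y is resolved before G so the chain T -> Y -> G propagates correctly.
    for criterion in ("Y", "A", "G"):
        prerequisite = REQUIRES[criterion]
        final[criterion] = final[criterion] and final[prerequisite]
    return final


final = apply_dependencies(direct)
met = sorted(name for name, ok in final.items() if ok)
print("Criteria met:", met)
```

For this study the dependencies do not change any outcome, since Y, A, and G each fail on their own evidence as well; only D and B are met.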
