ACL 2026 Findings

EDU-CIRCUIT-HW: Evaluating MLLMs on Real-World University-Level STEM Student Handwritten Solutions

A benchmark and analysis pipeline for studying how multimodal large language models recognize authentic circuit-analysis homework, where upstream visual recognition errors can silently propagate into downstream auto-grading.

Our benchmark data were collected from an undergraduate circuit analysis course at the Georgia Institute of Technology, Atlanta, during the Spring 2025 term. The collection was approved by the Institutional Review Board (IRB).

Authors: Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang

1,334

authentic handwritten solutions

29

students from a Spring 2025 circuit course

62

unique homework questions

3.3%

human-routing rate in the case study

Interactive Examples

Inspect Gemini-2.5-Pro recognition errors and expert corrections alongside the original student image.

These examples showcase the recognition behavior of Gemini-2.5-Pro, a visually strong multimodal model, on EDU-CIRCUIT-HW.

By contrasting the model-recognized transcript with the expert-rectified version, we expose the remaining recognition errors and show how experts revise them, highlighting how upstream visual mistakes can propagate unpredictably into downstream reasoning and grading.

Original handwritten image

Gemini-2.5-Pro recognized transcript

Red marks highlight model-recognized spans that differ from the expert rectification.
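This kind of highlighting can be approximated with a standard sequence diff between the two transcripts. The snippet below is a minimal Python sketch using difflib, not the site's actual highlighting code, and the two example strings (an R_2 misread as R_Z) are invented for illustration.

import difflib

def differing_spans(recognized: str, rectified: str):
    # Character spans in `recognized` that differ from the expert-rectified
    # transcript (replace and delete opcodes only).
    matcher = difflib.SequenceMatcher(a=recognized, b=rectified, autojunk=False)
    return [(i1, i2) for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag in ("replace", "delete")]

# Invented example: a plausible recognition slip (R_2 read as R_Z).
recognized = "V = I * R_Z = 3 A * 4 Ohm = 12 V"
rectified = "V = I * R_2 = 3 A * 4 Ohm = 12 V"
for start, end in differing_spans(recognized, rectified):
    print(f"mismatch at [{start}:{end}]: {recognized[start:end]!r}")

In a real renderer, the same opcodes would drive span-level markup rather than printed offsets.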


Dataset Overview

Authentic university STEM homework, not isolated textbook snippets.

EDU-CIRCUIT-HW contains handwritten student solutions from an undergraduate circuit analysis course at a large research university in the southeastern United States. Each sample pairs a real handwritten submission with expert-supported evaluation artifacts, enabling us to study recognition fidelity and auto-grading robustness together rather than in isolation.

The observation split includes expert-verified near-verbatim transcripts of student work, while the held-out test split preserves realistic deployment conditions with ground-truth grades but no expert rectifications. This makes the dataset useful both for diagnosing visual failures and for measuring their downstream impact.
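To make the split structure concrete, one plausible per-sample record layout is sketched below. The field names (image_path, transcript_rectified, rubric_items, and so on) are our own assumptions for illustration, not the dataset's published schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CircuitHWSample:
    # Hypothetical schema for one EDU-CIRCUIT-HW sample; all field names are assumptions.
    sample_id: str
    split: str                                   # "observation" or "test"
    student_id: str                              # anonymized student identifier
    question_id: str                             # one of the 62 unique homework questions
    image_path: str                              # scan or photo of the handwritten solution
    grade: float                                 # ground-truth grade, available in both splits
    transcript_rectified: Optional[str] = None   # expert-verified transcript, observation split only
    rubric_items: Optional[list] = None          # detailed per-criterion grades, observation split only

# Under this sketch, an observation-split record carries the expert artifacts,
# while a test-split record keeps only the image and its ground-truth grade.
obs = CircuitHWSample("obs-0001", "observation", "s03", "hw2-q4", "images/obs-0001.png",
                      grade=8.5, transcript_rectified="KCL at node A: i1 + i2 = i3 ...",
                      rubric_items=[{"criterion": "KCL setup", "points": 2.0}])
test = CircuitHWSample("test-0001", "test", "s14", "hw5-q1", "images/test-0001.png", grade=6.0)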

Observation set

513 solutions from 11 students, each with verified recognition and detailed grades.

Test set

821 solutions from 18 additional students with ground-truth grades for realistic evaluation.

Why it matters

The benchmark exposes latent recognition failures that can stay hidden when downstream grading criteria only inspect a subset of the student solution.

Observation and test set statistics as reported in the ACL 2026 Findings paper.

Research Findings

Recognition quality and grading quality can drift apart.

01

Current MLLMs can appear strong on downstream grading while still making substantial upstream visual recognition mistakes in equations, symbols, diagrams, and reasoning traces.

02

EDU-CIRCUIT-HW evaluates both recognition and grading, making it possible to study how recognition errors cascade into high-stakes educational decisions.
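Read concretely, this means tracking two metrics that need not move together: recognition fidelity against the expert transcript and agreement with ground-truth grades. The sketch below computes a simple character error rate and a tolerance-based grade agreement on toy values; it is illustrative only and does not reproduce the paper's evaluation protocol.

def char_error_rate(hypothesis: str, reference: str) -> float:
    # Levenshtein distance normalized by reference length (a simple CER).
    m, n = len(hypothesis), len(reference)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(n, 1)

def grade_agreement(predicted, ground_truth, tol=0.5):
    # Fraction of predicted grades within `tol` points of the ground truth.
    hits = sum(abs(p - g) <= tol for p, g in zip(predicted, ground_truth))
    return hits / max(len(ground_truth), 1)

# Toy numbers only: a graded answer can look correct even when the transcript is noisy.
cer = char_error_rate("V = I * R_Z = 12 V", "V = I * R_2 = 12 V")
acc = grade_agreement([8.5, 6.0, 9.0], [8.5, 5.0, 9.0])
print(f"CER = {cer:.2f}, grade agreement = {acc:.2f}")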

03

In the paper's case study, error-aware routing and correction improved robustness with only minimal human intervention, sending 3.3% of assignments to human graders while the rest were handled by GPT-5.1.
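The routing step can be pictured as a simple gate on a per-assignment recognition-error estimate. The threshold, the error score, and the function name below are illustrative assumptions rather than the paper's actual pipeline; only the 3.3% human-routing rate comes from the case study.

def route_assignment(recognition_error_score: float, threshold: float = 0.15) -> str:
    # Error-aware routing sketch: noisy recognitions go to a human grader,
    # the rest proceed to the automated grader. The threshold is an assumption.
    return "human_grader" if recognition_error_score > threshold else "auto_grader"

# Toy batch: with a suitable threshold, only a small fraction is escalated,
# mirroring the 3.3% human-routing rate reported in the case study.
scores = [0.02, 0.05, 0.31, 0.04, 0.01, 0.08]
routes = [route_assignment(s) for s in scores]
print(routes, f"human-routing rate = {routes.count('human_grader') / len(routes):.1%}")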