In A Nutshell
- A new study found that even the best AI models stumbled on roughly one in four structured coding tasks, raising real questions about how much developers should rely on them.
- Commercial AI systems like GPT-4o, GPT-4.1-mini, and o1-mini clustered near the top but were essentially tied, all scoring around 76%. Open-source models lagged about 10 points behind.
- Common formats like HTML, JSON, and CSV are reliably handled by most models. Less common formats like TOML, Mermaid diagrams, and TikZ graphics tripped up nearly every system tested.
- Researchers argue that AI benchmarks have focused too heavily on reasoning and conversation, while the structured output skills most critical to real software work have gone largely untested until now.
AI chatbots can write a sonnet, explain quantum physics, and pass a bar exam. But hand one a real software job, like producing a properly structured data file or building a page that renders correctly in a browser, and a very different picture emerges. A new study testing 12 of today’s leading AI models on practical coding and data-formatting tasks found that even the best systems stumbled on roughly one in four challenges. For anyone who has started treating AI as a reliable coding partner, that number is a problem worth taking seriously.
Researchers at the University of Waterloo and several partner institutions designed a testing framework called StructEval to measure how well large language models handle structured outputs, the formats that power much of modern software. JSON files move data between applications. HTML code builds web pages. React, a popular tool for building interactive web interfaces, brings buttons and menus to life. XML, a format widely used to organize and transfer data between systems, keeps countless apps talking to one another. When an AI model gets any of these wrong, the consequences are concrete: broken apps, failed data pipelines, and web pages that refuse to load. Structured outputs are the backbone of professional software development.
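To make the stakes concrete, here is a small illustrative sketch (not taken from the study) showing the same invented article metadata expressed in two of the formats the benchmark covers, JSON and XML. In either format, a single missing bracket or unclosed tag makes the file unreadable to the software consuming it.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical article metadata, invented purely for illustration.
metadata = {"title": "StructEval", "year": 2026, "tags": ["benchmark", "LLM"]}

# JSON: one misplaced comma or bracket and json.loads() raises an error.
json_text = json.dumps(metadata, indent=2)

# XML: the same data as nested elements; an unclosed tag breaks parsing.
root = ET.Element("article")
ET.SubElement(root, "title").text = metadata["title"]
ET.SubElement(root, "year").text = str(metadata["year"])
tags = ET.SubElement(root, "tags")
for t in metadata["tags"]:
    ET.SubElement(tags, "tag").text = t
xml_text = ET.tostring(root, encoding="unicode")

# Both serializations round-trip back to the same values.
assert json.loads(json_text)["year"] == 2026
assert ET.fromstring(xml_text).find("title").text == "StructEval"
```

The round-trip checks at the end hint at why small mistakes matter: downstream code parses these files mechanically and has no tolerance for malformed output.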
Results published in Transactions on Machine Learning Research showed that GPT-4o, the top-performing model in the study, managed an average score of just 76.02%. Put another way, the strongest system tested stumbled on about one in four tasks. The next two finishers, GPT-4.1-mini at 75.64% and o1-mini at 75.58%, were so close as to be essentially tied with GPT-4o at the top. Open-source models fared considerably worse, trailing the commercial group by roughly 10 percentage points on average. Several formatting challenges defeated every single model in the study, with all systems scoring below 50%.
How Researchers Put AI Coding Skills to the Test
StructEval ran 12 models through 2,035 individual test cases covering 44 task types across 18 structured output formats, split into two broad categories. One group covers text-only structures, meaning data files a developer can read directly, including JSON, YAML (used for software configuration), CSV (spreadsheet-style tables), and XML. A second group covers visual formats, code that must run through a browser or rendering engine to produce a visible result, including HTML, React, Matplotlib charts, and SVG graphics.
Each model faced two types of challenges. Generation tasks asked a model to produce a structured output from a plain-English description, such as building a JSON file that stores article metadata from scratch. Conversion tasks asked a model to translate from one format to another, such as turning a YAML configuration file into a JSON file.
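A conversion task can be sketched in miniature. The example below (invented data, standard-library Python only) performs a CSV-to-JSON conversion, one of the translation directions the study tested, where each spreadsheet row must become a correctly structured JSON record.

```python
import csv
import io
import json

# A tiny conversion task: turn a CSV table into a list of JSON records.
# The input data here is invented for illustration.
csv_text = "name,role\nAda,engineer\nGrace,admiral\n"

# DictReader pairs each data row with the header row's field names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=2)

# Each CSV row becomes one JSON object; dropping a column or scrambling
# a row is exactly the kind of structural error the benchmark penalizes.
assert json.loads(json_text)[1] == {"name": "Grace", "role": "admiral"}
```

A model doing this task in prose form must get the same mapping right: every field name preserved, every row intact, and the surrounding JSON syntax valid.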
Scoring went beyond surface appearances: researchers checked for correct formatting rules, confirmed that required fields were present, and for visual formats, ran the rendered output through a vision AI model to verify the layout matched what the prompt requested. No example answers were provided before any task, simulating real-world conditions where AI tools typically receive a request cold.
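For the text-only formats, that kind of check can be approximated in a few lines. The sketch below is a simplified stand-in for the study's pipeline (which is more elaborate, and which additionally renders visual formats and inspects them with a vision model): it tests whether a model's output is syntactically valid JSON and contains every required top-level field.

```python
import json

def score_json_output(text: str, required_fields: set[str]) -> bool:
    """Rough sketch of a text-format check: is the output valid JSON,
    and does it contain every required top-level field? The study's
    actual scoring pipeline is more elaborate than this."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False  # syntactically broken output fails outright
    return isinstance(parsed, dict) and required_fields <= parsed.keys()

# A well-formed answer with all required fields passes.
assert score_json_output('{"title": "x", "author": "y"}', {"title", "author"})
# A trailing comma is invalid JSON, so the output fails immediately.
assert not score_json_output('{"title": "x",}', {"title"})
```

Real validators for other formats follow the same pattern: parse first, then check that the required structure is actually present.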
Where AI Coding Succeeds, and Where It Falls Apart
Some formats posed no serious challenge. Generating basic HTML pages, JSON data files, CSV spreadsheets, and Markdown documents proved relatively reliable, with most models scoring above 90%. Conversions between closely related formats, such as transforming YAML into JSON or rendering React code as HTML, were similarly comfortable for stronger models.
Venture beyond those well-worn formats and performance drops off quickly. Generating TOML configuration files, a format commonly used to set up software projects, from a plain-text description proved so difficult that some models scored close to zero. Producing Mermaid diagrams, a format used to generate flowcharts and process maps from simple text commands, stumped nearly every system tested, with all models averaging well below 50%. Converting Matplotlib charts, a standard Python data visualization tool, into TikZ, a format used to render graphics in scientific documents, yielded near-universally poor scores across every model tested.
Among individual models, Qwen3-4B, a compact open-source model developed by Chinese tech firm Alibaba, stood out as the strongest free alternative at 67.04%, though it still trailed the commercial leaders by nearly 10 points. At the other end, Microsoft’s Phi-3-mini finished last at 40.79%. Size alone does not explain the gap: Phi-3-mini underperformed even similarly sized models, suggesting that how a model is trained matters more than how large it is. Researchers found it repeatedly produced garbled output markers in TOML-to-YAML tasks and failed to preserve data relationships in CSV-to-JSON conversions, the kinds of structural errors that cascade into larger failures inside a real software system.
Producing visual code consistently proved harder than producing text-only structured data. For text formats, getting the syntax and field names right is largely sufficient. For visual formats, a model must translate a verbal description into code that, when executed, produces a specific layout on screen, something that requires spatial reasoning most language models handle poorly. Commercial models held a clear edge on the hardest visual tasks. Open-source models held their own on simpler conversions, where a ready-made input structure gives the model a roadmap to follow.
A Case Against Trusting AI With the Code Stack
For developers and engineering teams leaning on AI tools to write or transform structured data and code, these results should be taken as a caution. Everyday formats are reliable. Less common or more visually demanding formats are not, at least not yet. A model scoring 76% on a critical task is a practical concern, not an abstract one, and errors in those categories rarely surface cleanly.
Researchers behind StructEval argue that structured output generation has been underexamined by the AI evaluation community, with most existing benchmarks focused on reasoning and question-answering rather than the nuts-and-bolts of software work. Closing these performance gaps, they contend, should be a clear priority as AI tools take on larger roles in real development workflows. Fluency in conversation and fluency in code structure are two very different skills, and at one stumble in every four tasks, AI may not yet justify the level of trust some developers place in it.
Disclaimer: This article is based on a peer-reviewed study. The findings reflect performance on a specific benchmark and may not represent the full range of AI capabilities across all real-world coding environments. Model capabilities are evolving rapidly, and results may differ across updated versions of the systems evaluated.
Paper Notes
Limitations
StructEval evaluates language models on static, single-page visual rendering formats and does not account for dynamic interface behaviors such as button interactions, page transitions, or animations, all features essential to many real-world applications. The dataset’s initial content was generated by large language models and then put through a two-round human expert review. Despite that process, residual biases from the model-generated content may persist, particularly in subtle cases that are difficult to catch manually. Broader multi-annotator audits or automated bias-detection methods could strengthen the dataset’s reliability in future iterations.
Funding and Disclosures
Support for this research was provided in part by the Google Cloud Research Credits Program. No additional funding disclosures or conflicts of interest were noted by the authors.
Publication Details
“StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs” was authored by Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, and Wenhu Chen. Authors are affiliated with the University of Waterloo, the University of Toronto, HKUST, Shanghai University, the University of British Columbia, the Vector Institute, and one independent contributor. Published in Transactions on Machine Learning Research in January 2026. Available on OpenReview at https://openreview.net/forum?id=buDwV7LUA7 and on arXiv as preprint 2505.20139.