{"id":35120,"date":"2026-03-17T17:10:00","date_gmt":"2026-03-17T17:10:00","guid":{"rendered":"https:\/\/www.tun.com\/home\/?p=35120"},"modified":"2026-03-17T21:10:23","modified_gmt":"2026-03-17T21:10:23","slug":"top-ai-coding-tools-still-misfire-on-1-in-4-tasks-study-finds","status":"publish","type":"post","link":"https:\/\/www.tun.com\/home\/top-ai-coding-tools-still-misfire-on-1-in-4-tasks-study-finds\/","title":{"rendered":"Top AI Coding Tools Still Misfire on 1 in 4 Tasks, Study Finds"},"content":{"rendered":"\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-uagb-blockquote uagb-block-e7eb3fc3 uagb-blockquote__skin-border uagb-blockquote__stack-img-none\"><blockquote class=\"uagb-blockquote\"><div class=\"uagb-blockquote__content\">A new University of Waterloo study finds that even the most advanced AI coding tools get structured tasks wrong about one in four times, especially for images, video and websites. 
The results highlight both the promise and current limits of AI as a reliable software development partner.<\/div><footer><div class=\"uagb-blockquote__author-wrap uagb-blockquote__author-at-left\"><\/div><\/footer><\/blockquote><\/div>\n\n\n\n<div class=\"wp-block-group is-content-justification-space-between is-nowrap is-layout-flex wp-container-core-group-is-layout-0dfbf163 wp-block-group-is-layout-flex\"><div style=\"font-size:16px;\" class=\"has-text-align-left wp-block-post-author\"><div class=\"wp-block-post-author__content\"><p class=\"wp-block-post-author__name\">The University Network<\/p><\/div><\/div>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>Artificial intelligence may be writing more code than ever, but a new study suggests it is still far from a flawless programming partner.<\/p>\n\n\n\n<p>Researchers at the University of Waterloo found that today\u2019s top AI coding tools make mistakes in roughly one out of every four structured tasks they are given, raising concerns about how much developers can safely rely on them without careful review.<\/p>\n\n\n\n<p>The team benchmarked 11 large language models, or LLMs, on 44 software-related tasks that required them to produce outputs in precise formats, such as JSON or XML, rather than in free-form text. These formats are widely used in software development because they can be read by both humans and machines and plugged directly into larger systems.<\/p>\n\n\n\n<p>The results show that even the most advanced commercial models reached only about 75% accuracy, while leading open-source models hovered closer to 65%. 
In other words, even the best systems failed to follow the rules or produce correct results about a quarter of the time.<\/p>\n\n\n\n<p>The study focused on what AI companies call \u201cstructured outputs,\u201d a feature introduced by firms including OpenAI, Google and Anthropic to make AI-generated responses more predictable and machine-friendly. Instead of returning a paragraph of natural language, an AI assistant might now return a neatly formatted JSON object or a block of valid Markdown that can be fed straight into a program or documentation pipeline.<\/p>\n\n\n\n<p>Co-first author Dongfu Jiang, a doctoral student in computer science at Waterloo, noted the team wanted to test both how well these systems obey the required formats and whether the content they produce is actually right.<\/p>\n\n\n\n<p>\u201cWith this kind of study, we want to measure not only the syntax of the code \u2013 that is, whether it\u2019s following the set rules \u2013 but also whether the outputs produced for various tasks were accurate,\u201d Jiang said in a news release.<\/p>\n\n\n\n<p>To do that, the researchers created a benchmark that spanned 18 different structured output formats and dozens of tasks. Some tasks involved relatively straightforward text processing, while others pushed models to generate more complex artifacts, such as layouts for websites or specifications for images and video.<\/p>\n\n\n\n<p>The models handled the simpler text-focused tasks reasonably well. But performance dropped sharply when the systems were asked to generate structures that described or coordinated multiple media types.<\/p>\n\n\n\n<p>\u201cWe found that while they do okay with text-related tasks, they really struggle on tasks involving image, video, or website generation,\u201d Jiang added.<\/p>\n\n\n\n<p>That gap matters because developers are increasingly using AI to scaffold entire applications, not just individual lines of code. 
A tool that can reliably output a JSON snippet for a configuration file but stumbles when asked to design a web page layout or a multimedia workflow could introduce subtle bugs or inconsistencies that are hard to catch.<\/p>\n\n\n\n<p>The project was a collaborative effort led by Jiang, undergraduate student Jialin Yang, and Wenhu Chen, an assistant professor of computer science at Waterloo. Seventeen additional researchers from Waterloo and other institutions contributed annotations, carefully checking whether each AI-generated output followed the required structure and solved the task correctly.<\/p>\n\n\n\n<p>The work reflects a broader push in the AI community to move beyond flashy demos and systematically measure what these systems can and cannot do. Benchmarks like this one help developers, companies and policymakers understand the real-world reliability of AI tools before they are deployed in critical workflows.<\/p>\n\n\n\n<p>The project also showcases how deeply students at Waterloo are involved in building and testing AI systems, not just using them, according to Chen.<\/p>\n\n\n\n<p>\u201cThere have been a lot of similar benchmarking projects happening in our labs recently,\u201d Chen said in the news release. \u201cAt Waterloo, students often begin as annotators, then organize projects and create their own benchmarking studies. They\u2019re not just using AI in their studies \u2013 they\u2019re building, researching and evaluating it.\u201d<\/p>\n\n\n\n<p>For now, the researchers caution that AI coding assistants are best treated as powerful but fallible collaborators. 
They can speed up routine tasks, suggest code snippets and help explore design options, but they are not ready to replace human judgment.<\/p>\n\n\n\n<p>\u201cDevelopers might have these agents working for them, but they still need significant human supervision,\u201d Jiang added.<\/p>\n\n\n\n<p>That message is especially important for students and early-career programmers, who may be tempted to lean heavily on AI tools. Educators and industry leaders have warned that overreliance on AI-generated code can erode fundamental skills and make it harder to spot subtle errors.<\/p>\n\n\n\n<p>At the same time, the Waterloo team\u2019s findings are not a reason to abandon AI in software development. Instead, they highlight where the technology is strongest today and where more research is needed. Improving structured output reliability \u2014 particularly for complex tasks involving images, video and web interfaces \u2014 could unlock safer, more automated development pipelines.<\/p>\n\n\n\n<p>The researchers plan to share their benchmark widely so that AI developers can use it to test new models and track progress over time. As models improve, rerunning them on the same set of tasks will make it easier to see whether accuracy is truly getting better, and in which areas.<\/p>\n\n\n\n<p>The <a href=\"https:\/\/arxiv.org\/abs\/2505.20139\" target=\"_blank\" rel=\"noopener\" title=\"\">study<\/a> appears in the journal Transactions on Machine Learning Research and is slated to be presented at the International Conference on Learning Representations (ICLR) in 2026. 
For now, its message is clear: AI coding tools are advancing quickly, but they still need human partners who understand both their power and their limits.<\/p>\n\n\n\n<div style=\"height:13px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p><strong>Source: <\/strong><a href=\"https:\/\/uwaterloo.ca\/news\/media\/top-ai-coding-tools-make-mistakes-one-four-times\" target=\"_blank\" rel=\"noopener\" title=\"\">University of Waterloo<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new University of Waterloo study finds that even the most advanced AI coding tools get structured tasks wrong about one in four times, especially for images, video and websites. The results highlight both the promise and current limits of AI as a reliable software development partner.<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-no-separators","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[8],"tags":[291],"class_list":["post-35120","post","type-post","status-publish","format-standard","hentry","category-ai","tag-university-of-waterloo"],"acf":[],"aioseo_notices":[],"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false},"uagb_author_info":{"display_name":"The University Network","author_link":"https:\/\/www.tun.com\/home\/author\/funky_junkie\/"},"uagb_comment_info":0,"uagb_excerpt":"A new University of Waterloo study finds that even the most advanced AI coding tools get structured tasks wrong about one in four times, especially for images, video and websites. 
The results highlight both the promise and current limits of AI as a reliable software development partner.","_links":{"self":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/35120","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/comments?post=35120"}],"version-history":[{"count":3,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/35120\/revisions"}],"predecessor-version":[{"id":35147,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/35120\/revisions\/35147"}],"wp:attachment":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/media?parent=35120"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/categories?post=35120"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/tags?post=35120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}