Did we normalize AI-generated code too soon? Here's where it falls short and how it's improving.
Large language models (LLMs) seemed to arrive in a flash. Monumental productivity gains were promised. Coding assistants flourished. Millions of multi-line code blocks were generated with a key press and merged. It worked like magic. But at the back of everyone's mind was a nagging thought: can I actually trust this code?
It feels laughable to question the merits of AI in software development in 2025, as it's already inextricably entrenched. Microsoft reports that 150 million developers use GitHub Copilot. Stack Overflow's 2024 survey found 61.8% of developers use AI within their development process. Google claims over a quarter of its new code is AI-generated.
In short, "AI-generated code is already the norm," says Chris Anley, chief scientist at NCC Group. But is it really up to the task?
The problems with AI-generated code
"Let's be real: LLMs are not software engineers," says Steve Wilson, chief product officer at Exabeam and author of O'Reilly's Playbook for Large Language Model Security. "LLMs are like interns with goldfish memory. They're great for quick tasks but terrible at keeping track of the big picture."
As reliance on AI increases, that "big picture" is being sidelined. Ironically, by some accounts, total developer workload is increasing: the 2025 State of Software Delivery report found that a majority of developers now spend more time debugging AI-generated code and resolving security vulnerabilities.
"AI output is usually pretty good, but it's still not quite reliable enough," says Bhavani Vangala, co-founder and vice president of engineering at Onymos. "It needs to be a lot more accurate and consistent. Developers still always need to review, debug, and adjust it."
To improve AI-generated code, we must address key concerns: distrust, code quality issues, context limitations, hallucinations, and security risks. AI shows incredible promise, but human oversight remains critical.
Bloat and context limits
AI code completion tools tend to generate new code from scratch rather than reuse or refactor existing code, leading to technical debt. Worse, they tend to duplicate code, missing opportunities for code reuse and increasing the volume of code that must be maintained. "Code bloat and maintainability issues arise when verbose or inefficient code adds to technical debt," notes Sreekanth Gopi, prompt engineer and senior principal consultant at Neuroheart.ai.
GitClear's 2025 AI Copilot Code Quality report analyzed 211 million lines of code changes and found that in 2024, the frequency of duplicated code blocks increased eightfold. "Since AI-authored code began its surge in mid-2022, there has been more evidence every year that code duplication keeps growing," says Bill Harding, CEO of Amplenote and GitClear. In addition to piling on unnecessary technical debt, cloned code blocks are linked to more defects, anywhere from 15% to 50% more, research suggests.
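To make duplication visible before it piles up, teams can run a simple clone check over the codebase. Below is a minimal Python sketch, not GitClear's methodology, that hashes fixed-size windows of normalized lines and reports any block appearing in more than one place; the window size and file pattern are illustrative assumptions.

```python
# Sketch: flag duplicated blocks of consecutive lines across source files.
# The window size, normalization, and file globbing are illustrative choices.
import hashlib
from collections import defaultdict
from pathlib import Path

WINDOW = 6  # minimum run of identical normalized lines to count as a clone

def normalized_lines(path: Path) -> list[str]:
    """Strip whitespace and drop blank lines so formatting doesn't hide clones."""
    return [ln.strip() for ln in path.read_text(errors="ignore").splitlines() if ln.strip()]

def find_duplicate_blocks(root: str = ".") -> dict[str, list[tuple[str, int]]]:
    """Map a hash of each WINDOW-line block to every (file, line) where it appears."""
    blocks: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        lines = normalized_lines(path)
        for i in range(len(lines) - WINDOW + 1):
            digest = hashlib.sha256("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
            blocks[digest].append((str(path), i + 1))
    return {h: locs for h, locs in blocks.items() if len(locs) > 1}

if __name__ == "__main__":
    for locations in find_duplicate_blocks().values():
        print("Duplicated block found at:", locations)
```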
These bloat and duplication issues stem from AI's limited context. "AI is better the more context it has, but there is a limit on how much information can be supplied to an AI model," says Rod Cope, chief technical officer at Perforce Software. GitHub reports that Copilot Chat has a 64K-128K token context window, equating to roughly 30 to 100 small files or five to 20 large ones. While context windows are growing, they are still too small to grasp a full software architecture or suggest proper refactoring.
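For a rough sense of how quickly a context window fills up, the sketch below estimates token counts with a crude characters-per-token heuristic and greedily packs files into a 64K-token budget. Both numbers are assumptions for illustration; real tokenizers and real assistants budget context differently.

```python
# Sketch: rough check of how much of a repository fits in a model's context window.
# The 4-characters-per-token ratio and the 64K budget are illustrative assumptions.
from pathlib import Path

CHARS_PER_TOKEN = 4        # crude heuristic for English-heavy source code
CONTEXT_BUDGET = 64_000    # tokens; matches the lower bound cited above

def estimate_tokens(path: Path) -> int:
    return len(path.read_text(errors="ignore")) // CHARS_PER_TOKEN

def files_that_fit(root: str = ".") -> list[str]:
    """Greedily pack files, smallest first, until the estimated budget is exhausted."""
    remaining = CONTEXT_BUDGET
    included = []
    for path in sorted(Path(root).rglob("*.py"), key=estimate_tokens):
        tokens = estimate_tokens(path)
        if tokens <= remaining:
            included.append(str(path))
            remaining -= tokens
    return included

if __name__ == "__main__":
    fitting = files_that_fit()
    print(f"{len(fitting)} files fit in an estimated {CONTEXT_BUDGET}-token window")
```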
No "big picture" thinking
While AI excels at pattern recognition, it doesn't see the "why" behind the code. This limits its ability to make trade-offs around business logic, user experience, or long-term maintainability. "AI lacks the full context and problem-solving abilities that senior engineers bring to the table," says Nick Durkin, chief technical officer at Harness.
Coding is inherently a creative and people-centric activity. "AI cannot build new things that previously did not exist," says Tobie Morgan Hitchcock, chief executive officer and co-founder of SurrealDB. "Developers use creativity and knowledge of human preference to build solutions that are specifically designed for the end user."
As a result, AI tools often "waste more time than they save" on tasks like generating entire programs, or wherever broader context is required, says NCC's Anley. "The quality of the code generated drops significantly when they're asked to write longer-form routines."
Hallucinations and security risks
Hallucinations remain a concern. "AI doesn't just make mistakes; it makes them confidently," says Exabeam's Wilson. "It will invent open-source packages that don't exist, introduce subtle security vulnerabilities, and do it all with a straight face."
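One practical guard against hallucinated dependencies is to verify that every package an assistant suggests actually exists in the registry before installing it. The sketch below checks names against PyPI's public JSON API; note that existence alone is not proof of safety, since typosquatted packages exist too, so flagged names still deserve human review.

```python
# Sketch: verify that dependencies an assistant suggests actually exist on PyPI
# before installing them, as a guard against hallucinated package names.
# Error handling is kept minimal for brevity.
import urllib.error
import urllib.request

def package_exists(name: str) -> bool:
    """Return True if PyPI knows about the package, False on a 404."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

suggested = ["requests", "totally-made-up-llm-package"]  # e.g. parsed from AI output
for pkg in suggested:
    status = "found" if package_exists(pkg) else "NOT FOUND - review before installing"
    print(f"{pkg}: {status}")
```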
Such hallucination errors often stem from a poor data corpus. As Durkin explains, AI trained on synthetic data risks creating an echo chamber, leading to model collapse.
Cory Hymel, vice president of research and innovation at Crowdbotics, likewise points to a lack of high-quality training data as the biggest hurdle. For instance, OpenAI Codex, the popular model that GitHub Copilot uses, was trained on publicly available code containing errors that affect quality.
Security vulnerabilities are another issue. "AI-generated code may contain exploitable flaws," says Neuroheart.ai's Gopi. While AI is good at fixing bugs, it struggles to find them. A research paper from OpenAI found that AI agents "fail to root cause, resulting in partial or flawed solutions." The paper notes:
Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions, often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive.
Other industry reports point to rising defect rates as well. For instance, Apiiro research found that personally identifiable information (PII) and payment data exposed in code repositories surged threefold since mid-2023, a rise it attributes to the adoption of AI-assisted development.
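Catching this kind of exposure early usually means scanning changes before they land. The following is a minimal pre-commit-style sketch with a few illustrative regex patterns; real secret and PII scanners use far broader rule sets, entropy checks, and allowlists.

```python
# Sketch: a minimal pre-commit style scan for obvious PII and payment data in
# the files passed on the command line. Patterns are deliberately simple examples.
import re
import sys
from pathlib import Path

PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "possible card number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan(paths: list[str]) -> int:
    """Print each suspected finding and return the total count."""
    findings = 0
    for name in paths:
        text = Path(name).read_text(errors="ignore")
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{name}: {label}: {match.group(0)[:12]}...")
                findings += 1
    return findings

if __name__ == "__main__":
    # Non-zero exit blocks the commit when anything suspicious is found.
    sys.exit(1 if scan(sys.argv[1:]) else 0)
```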
Legal gray areas could also stunt the use of AI code and introduce compliance issues: some AI tools claim ownership of the code they output, while others retain IP for model retraining purposes. "Many companies are concerned about protecting proprietary data and ensuring it is not inadvertently used to train external models," says Adam Kentosh, field chief technical officer of North America at Digital.ai.
Distrust and adoption barriers
"It all comes down to trust: do people trust what AI generates for building new applications?" asks Dan Fernandez, vice president of product management at Salesforce. Google's 2024 DORA report found that, on average, developers only "somewhat" trust AI-generated code.
"The biggest barrier to adoption is trust in AI's accuracy," says Durkin. Unlike a human developer, AI has no intrinsic conscience or accountability, he says, making compliance and reliability checks more crucial for AI outputs.
AI's opacity makes it difficult to trust in critical applications. "Trust is a big issue when it comes to any AI-provided code, but for legacy code in particular, which is where most software investment happens," says Jeff Gabriel, executive vice president of engineering at Contentful.
"The biggest hurdle is likely internal opposition to AI at many companies," says Joseph Thacker, solo founder of rez0corp and a bug bounty hunter, noting that high-level staff often bar sanctioned AI use.
How AI-generated code will improve
Although AI-generated code faces obstacles, solutions are emerging, many of which revisit fundamental coding best practices. "The challenges are multi-faceted, but we're already seeing these challenges addressed," says Shuyin Zhao, vice president of product for GitHub Copilot.
Validating AI outputs
Just as with human-generated code, rigorous testing must be applied to AI-generated code. "Developers should still carefully review, refine, and optimize AI-generated code to ensure it meets the highest standards for security, performance, and maintainability," says Kevin Cochrane, chief marketing officer at Vultr.
Automated testing of AI outputs will be key. Perforce's Cope recommends taking a page out of the devops playbook with automated testing, static code analysis, and masking sensitive data for training AI models. "Many of these tools are already engineered to support AI or, if not, will do so very soon."
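In practice, that can be as simple as refusing to merge AI-generated changes until they pass the same gates human-written code must pass. The sketch below assumes pytest and ruff are installed and runnable from the project root; the commands and fail-fast policy are illustrative, not a complete CI pipeline.

```python
# Sketch: gate AI-generated changes behind the same automated checks as any patch.
# Assumes the ruff linter and pytest are installed in the environment.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],   # static analysis / linting
    ["pytest", "-q"],         # unit test suite
]

def run_gate() -> int:
    """Run each check in order; stop at the first failure."""
    for command in CHECKS:
        print("Running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            print("Gate failed; do not merge the AI-generated change as-is.")
            return result.returncode
    print("All checks passed; proceed to human code review.")
    return 0

if __name__ == "__main__":
    sys.exit(run_gate())
```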
"Increased code throughput from AI puts pressure on downstream processes and systems, necessitating robust automation in QA testing to ensure continued reliability," adds Digital.ai's Kentosh.
AI can also play a role in policing itself: double-checking code quality, using predictive models to identify potential risks, and conducting security scans. "More widespread use of responsible AI (RAI) filters to screen for harmful content, security vulnerabilities, and notify users of public code matching are all important," says GitHub's Zhao.
Progressive rollouts can also help avoid drawbacks by gauging the effect of individual code changes. "Techniques like canary deployments, feature flagging, or feature management allow teams to validate code with limited exposure," says Durkin.
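A minimal version of that idea is a deterministic percentage rollout flag wrapping the AI-generated path, so only a small, stable slice of users exercises it while the existing path keeps serving everyone else. The flag name, rollout percentage, and pricing functions below are hypothetical stand-ins.

```python
# Sketch: gate a newly AI-generated code path behind a percentage rollout flag,
# so its behavior can be compared against the existing path before full release.
import hashlib

ROLLOUT_PERCENT = 5  # start the canary with a small slice of traffic

def in_canary(user_id: str, flag: str = "ai_rewritten_pricing") -> bool:
    """Deterministically bucket users so the same user always sees the same path."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT

def existing_pricing(amount: float) -> float:
    return round(amount * 1.20, 2)          # current, trusted path

def new_ai_generated_pricing(amount: float) -> float:
    return round(amount * 1.20, 2)          # placeholder for the AI-generated rewrite

def price_quote(user_id: str, amount: float) -> float:
    if in_canary(user_id):
        return new_ai_generated_pricing(amount)
    return existing_pricing(amount)

print(price_quote("user-42", 100.0))
```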
Better training data
It all comes down to the training data because, as the saying goes, "garbage in, garbage out." As such, Zhao believes we need "more sanitization and use of high-quality code samples as training data." Avoiding model collapse requires feeding AI models additive data rather than regurgitated outputs.
Feeding LLMs project-specific context, like custom libraries, style guides, software bills of materials, or security knowledge, can also improve accuracy. "Ensuring AI models are trained on trusted data and fine-tuned for specific applications will help improve the accuracy of AI-generated code and minimize hallucinations in outputs," says Salesforce's Fernandez.
Certain IDE-based solutions and technologies are also emerging to give developers more real-time context. Onymos's Vangala expects retrieval-augmented generation (RAG) to help models reference version-specific software libraries or code repositories.
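To make the idea concrete, here is a toy RAG flow: it ranks a handful of version-specific notes by keyword overlap with the task and prepends the best matches to the prompt. The library name "acmelib," the snippets, and the scoring are all illustrative; production systems would use embeddings, a vector store, and a real model call.

```python
# Sketch: a toy retrieval-augmented generation (RAG) flow that pulls
# version-specific library notes into the prompt. "acmelib" is a hypothetical
# library; scoring is simple keyword overlap rather than embeddings.
DOC_SNIPPETS = {
    "acmelib 2.x": "In acmelib 2.x, Client.fetch() returns a Response object, not a tuple.",
    "acmelib 1.x": "In acmelib 1.x, Client.fetch() returns a (status, body) tuple.",
    "style guide": "All public functions must carry type hints and a docstring.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank snippets by how many query words they share; return the top k."""
    words = set(query.lower().split())
    scored = sorted(
        DOC_SNIPPETS.values(),
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(task: str) -> str:
    """Prepend the retrieved, version-specific context to the task description."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(task))
    return f"Project context:\n{context}\n\nTask: {task}\n"

print(build_prompt("Write a function that calls Client.fetch() in acmelib 2.x"))
```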
Finely tuned models
Instead of relying on massive general models, companies are shifting toward smaller, specialized models for specific coding tasks. "The largest model isn't necessary for every use case in the developer life cycle," says Fernandez. "We're exploring a federated architecture of smaller models, where low-powered LLMs handle many tasks for developers."
Improved training and finely tuned models will likely yield higher accuracy, but the best results may come from models operating behind corporate firewalls. "2025 will see the rise of fine-tuned models trained on companies' existing code that run 'behind the wall' significantly outperforming publicly available models," says Crowdbotics's Hymel.
Enhanced prompt engineering
Another aspect is improved prompt engineering. "We'll also need to work on how we prompt, which includes the additional context and potential fine-tuning for system-specific scenarios," says Contentful's Gabriel.
"Prompt engineering is going to be a necessary part of a software engineer's job," says Vangala. To get there, the onus is on developers to upskill. "We need to teach our developers how to write better prompts to get the kind of AI output we want."
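A small step in that direction is standardizing the context every prompt carries. The sketch below assembles a prompt from a style guide, pinned dependency versions, and the task at hand; the field names and example content are assumptions, not a prescribed format.

```python
# Sketch: a reusable prompt template that packs project-specific context
# (style guide, pinned versions, task) into every request to an assistant.
def build_prompt(task: str, style_guide: str, pinned_versions: dict[str, str]) -> str:
    """Assemble a single prompt string from the project context and the task."""
    versions = "\n".join(f"- {lib} == {ver}" for lib, ver in pinned_versions.items())
    return (
        "You are assisting on an existing codebase. Follow these constraints exactly.\n\n"
        f"Style guide:\n{style_guide}\n\n"
        "Pinned dependency versions (do not suggest APIs from other versions):\n"
        f"{versions}\n\n"
        f"Task:\n{task}\n\n"
        "Return only code, with a short comment explaining any non-obvious choice.\n"
    )

prompt = build_prompt(
    task="Add retry logic with exponential backoff to the payments client.",
    style_guide="PEP 8, type hints on public functions, no global state.",
    pinned_versions={"requests": "2.31.0", "tenacity": "8.2.3"},
)
print(prompt)
```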
New AI-enabled solutions will also help. "The biggest impact will come from better models and better coding applications which provide more context," says rez0corp's Thacker, pointing to solutions like Cursor and the recent upgrades to GitHub Copilot.
New agentic AI tools
AI agents will be a continued focal point for improving software engineering overall, bringing self-checking capabilities. "New reasoning models can now iterate and verify their own work, reducing hallucinations," says Exabeam's Wilson.
For instance, GitHub has added Copilot Autofix, which can detect vulnerabilities and provide fix suggestions in real time, and a build and repair agent to Copilot Workspace. "Perhaps the biggest, most exciting thing we'll continue to see is the use of agents to improve code quality," says GitHub's Zhao.
"I expect that AI-generated code will be normalized over the next year," says Fernandez, pointing to the ongoing rise of AI-powered agents for software developers that extend beyond code generation to testing, documentation, and code reviews.
"Developers should also investigate the myriad of tools available to find those that work and consider how to fill the gaps with those that don't," says Gabriel. This will require both individual and organizational investment, he adds.
Looking to the future, many anticipate open source leading further AI democratization. "I expect we'll see a lot more open-source models emerge to address specific use cases," says David DeSanto, chief product officer at GitLab.
Governance around AI usage
Enhancing developers' confidence in AI-generated code will also rely on setting guardrails for responsible usage. "With the appropriate guardrails in place to ensure responsible and trusted AI outputs, businesses and developers will become more comfortable starting with AI-generated code," says Salesforce's Fernandez.
To get there, leadership must establish clear directions. "Ultimately, it's about setting clear boundaries for those with access to AI-generated code and putting it through stricter processes to build developer confidence," says Durkin.
"Ensuring transparency in model training data helps mitigate ethical and intellectual property risks," says Neuroheart.ai's Gopi. Transparency is crucial from an IP standpoint, too. "Having no hold on AI output is critical for advancing AI code generation as a whole," says GitLab's DeSanto, who references GitLab Duo's transparency commitment regarding its underlying models and usage of data.
For security-conscious organizations, on-premises AI may be the answer to data privacy concerns. Running self-hosted models in air-gapped, offline deployments allows AI to be used in regulated environments while maintaining data security, says DeSanto.
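One common pattern is to keep a familiar client interface but point it at a model served inside the network. The sketch below assumes a local server that exposes an OpenAI-compatible API (as vLLM and Ollama can); the URL, model name, and prompt are illustrative.

```python
# Sketch: pointing an OpenAI-compatible client at a self-hosted model so code and
# prompts never leave the network. Assumes a local server (e.g. vLLM or Ollama)
# exposing an OpenAI-compatible endpoint; URL and model name are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local, air-gapped endpoint
    api_key="not-needed-for-local",        # placeholder; no external key involved
)

response = client.chat.completions.create(
    model="codellama",  # whatever model is actually served locally
    messages=[
        {"role": "system", "content": "You are a code review assistant."},
        {"role": "user", "content": "Review this function for SQL injection risks: ..."},
    ],
)
print(response.choices[0].message.content)
```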
Striking a balance between humans and AI
All of the experts interviewed for this piece believe AI will assist developers rather than replace them wholesale. In fact, most view keeping developers in the loop as imperative for maintaining code quality. "For now, human oversight remains essential when using AI-generated code," says Digital.ai's Kentosh.
"Building applications will mostly remain in the hands of the creative professionals using AI to supplement their work," says SurrealDB's Hitchcock. "Human oversight is absolutely necessary and required in the use of AI coding assistants, and I don't see that changing," adds Zhao.
Why? Partly because of the ethical challenges. "Complete automation remains unattainable, as human oversight is critical for addressing complex architectures and ensuring ethical standards," says Gopi. That said, AI reasoning is expected to improve. According to Wilson, the next phase is AI "becoming a legitimate engineering assistant that doesn't just write code, but understands it."
Others are even more bullish. "I think that the most valuable AI-driven systems will be those that can be handed over to AI coding entirely," says Contentful's Gabriel, although he acknowledges this is not yet a consistent reality. For now, most outlooks still place AI and humans working side by side. "Developers will become more supervisors rather than writing every line of code," says Perforce's Cope.
The end goal is striking the right balance between the productivity gains from AI and the risk of over-reliance. "If developers rely too heavily on AI without a solid understanding of the underlying code, we risk losing creativity and technical depth, which are crucial for innovation," says Kentosh.
Wild ride ahead
Amazon recently claimed its AI helped upgrade its Java applications, saving $260 million. Others are under pressure to prove similar results. "Most companies have made an investment in some type of AI-assisted development service or copilot at this point and will need to see a return on their investment," says Kentosh.
Meanwhile, AI adoption continues to accelerate. "Most every developer I know is using AI in some capacity," adds Thacker. "For many of them, AI is writing the majority of the code they produce each day."
Yet, while AI eliminates repetitive tasks effectively, it still requires human intervention to carry the work the final mile. "The majority of code bases are boilerplate and repeatable," says Crowdbotics's Hymel. "We'll see AI being used to lay 51%+ of the 'groundwork' of an application that is then taken over by humans to complete."
The bottom line? "AI-generated code isn't great yet," says Wilson. "But if you're ignoring it, you're already behind. The next 12 months are going to be a wild ride."


