GTC 2025 highlighted significant advances in agentic AI, especially in coding. These gains are tracked by benchmarks such as SWE-bench, which measures how well AI models can resolve real software issues. Since the benchmark's debut, the best models have gone from resolving just 1.96% of problems to 55%, showing rapid improvement. Other benchmarks, such as GAIA and BIRD, track AI performance on additional tasks, including SQL generation. Key industry leaders anticipate that within a year AI could handle nearly all coding tasks, significantly boosting programmer productivity, although human oversight will remain crucial. This growing capability points to AI becoming an integral part of software development.
Last week’s GTC 2025 conference marked a significant moment for agentic AI, showcasing its potential across the tech industry. Behind the scenes, the technology has been improving rapidly, as recent coding benchmarks like SWE-bench and GAIA demonstrate. These improvements have raised expectations about what AI agents will be able to do in software development.
In the past, AI-generated code was often too buggy or insecure for practical use. That is changing fast: AI models now produce far more reliable code, making them genuinely useful to developers. SWE-bench, created by researchers at Princeton University, evaluates how effectively language models like Meta’s Llama and Anthropic’s Claude can resolve real software issues drawn from GitHub repositories. When the benchmark launched, even the best model resolved just under 2% of issues. Today, top models resolve over 55% on a revised version of the benchmark.
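To make the evaluation concrete, here is a minimal sketch of the kind of loop SWE-bench runs: a model proposes a patch for a buggy repository, and the issue counts as resolved only if the project's tests pass afterward. Everything below (the toy file names, the bug, the stand-in "model patch") is invented for illustration; the real harness checks out actual GitHub repositories and runs their own test suites.

```python
import pathlib
import subprocess
import sys
import tempfile

# Toy repository state: a one-line bug a model might be asked to fix.
buggy = "def add(a, b):\n    return a - b  # bug: wrong operator\n"
model_patch = buggy.replace("a - b", "a + b")  # stand-in for an LLM-generated fix

# Write the patched code and a test into a throwaway "repo" directory.
repo = pathlib.Path(tempfile.mkdtemp())
(repo / "calc.py").write_text(model_patch)
(repo / "test_calc.py").write_text(
    "from calc import add\n"
    "def test_add():\n"
    "    assert add(2, 3) == 5\n"
)

# SWE-bench would run the project's real test suite (e.g. pytest) in the
# project's own environment; a plain assertion script keeps this sketch
# dependency-free. Exit code 0 means the issue is "resolved".
result = subprocess.run(
    [sys.executable, "-c", "import test_calc; test_calc.test_add()"],
    cwd=repo,
)
print("resolved:", result.returncode == 0)  # prints "resolved: True"
```

The key design point is that scoring is behavioral, not textual: the patch is judged by whether previously failing tests pass, not by whether it matches a reference diff.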
Another benchmark, GAIA, measures general AI assistant capabilities, including reasoning and tool-use proficiency. A year ago, the top score on this test was around 14%. Today it stands at roughly 53%, indicating rapid progress.
Additionally, the BIRD benchmark assesses how well AI can convert natural language into SQL queries. The accuracy of top models has improved significantly, now reaching 77% compared to human performance at around 92%. These advancements in AI-driven coding have led experts, including top executives from Nvidia and Anthropic, to believe we may soon see AI generating 90% of code.
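To illustrate the task BIRD measures, here is a minimal text-to-SQL sketch. The schema, question, and both queries are invented for the example; the point is the scoring method, which (as in BIRD) is execution accuracy: the model's query is run against the database and its result set is compared with the reference query's result set, rather than comparing SQL strings.

```python
import sqlite3

# Tiny in-memory database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'Ada', 120.0), (2, 'Grace', 80.0), (3, 'Ada', 50.0);
""")

question = "What is the total amount spent by each customer?"

# A model-generated candidate query and the reference (gold) query.
# Note they differ textually but are semantically equivalent.
predicted_sql = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
gold_sql = ("SELECT customer, SUM(total) FROM orders "
            "GROUP BY customer ORDER BY customer")

# Execution accuracy: run both and compare result sets, ignoring row order.
pred_rows = set(conn.execute(predicted_sql).fetchall())
gold_rows = set(conn.execute(gold_sql).fetchall())
print("execution match:", pred_rows == gold_rows)  # prints "execution match: True"
```

Comparing executed results rather than SQL text is what lets a benchmark credit the many different queries that correctly answer the same question.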
Jensen Huang, CEO of Nvidia, envisions a future where computers create software based on human input, marking a shift from traditional coding practices. While some experts remain cautious, noting that the best AI tools still require human oversight, the potential for increased productivity is evident.
As coding assistants continue to develop, they are expected to make programming more efficient, letting humans focus on reviewing and refining AI-generated code rather than writing everything from scratch. Concerns about code quality, ambiguity in natural-language instructions, and AI’s tendency to produce plausible-looking but incorrect output persist, but they are becoming less of a hurdle as the technology matures.
In summary, the capabilities of AI in software development are expanding quickly. With benchmarks showing significant improvements, we are likely moving towards a future where AI plays an increasingly central role in coding tasks, significantly enhancing programmer productivity.
What are benchmarks for Agentic AI’s coding ability?
Benchmarks are tests or standards used to measure the performance of AI in coding tasks. They help show how well Agentic AI can write and understand code compared to human programmers.
How does Agentic AI perform compared to human coders?
Agentic AI shows impressive results, often completing coding tasks quickly and accurately. However, it may not always match the creativity and problem-solving skills of experienced human coders.
What types of coding tasks can Agentic AI handle?
Agentic AI can handle various coding tasks, like writing scripts, debugging code, and even creating full applications. It works well with popular programming languages like Python, Java, and JavaScript.
Can Agentic AI learn from its coding mistakes?
Yes, Agentic AI can improve over time by learning from its mistakes. This means it can adapt and become better at coding tasks based on feedback and new examples.
Is Agentic AI suitable for professional software development?
While Agentic AI is useful for many coding tasks, it works best as a tool to assist human programmers. Its strengths lie in efficiency, while human developers bring creativity and critical thinking to the table.