OpenAI Launches PaperBench, a New Benchmark for Evaluating the Research Replication Abilities of AI Agents
OpenAI has introduced PaperBench, a new benchmark designed to evaluate how well AI agents can replicate advanced AI research. The benchmark requires an agent to read and understand an academic paper, write the necessary code, and execute it to reproduce the paper’s results. PaperBench draws on 20 top papers from the ICML 2024 conference, broken down into 8,316 individually gradable tasks.
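To make thousands of tasks gradable, replications are scored against hierarchical rubrics: leaf nodes are pass/fail checks, and parent nodes aggregate their children's scores by weight. The sketch below is a minimal illustration of that idea under those assumptions, not OpenAI's actual implementation; the `RubricNode` class, its field names, and the example weights are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hypothetical hierarchical grading rubric.

    Leaf nodes represent individually gradable tasks (pass/fail);
    internal nodes aggregate their children's scores by weight.
    """
    name: str
    weight: float = 1.0
    passed: bool | None = None          # set on leaf nodes only
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        # Leaf: binary outcome of a single gradable task.
        if not self.children:
            return 1.0 if self.passed else 0.0
        # Internal node: weighted average of child scores.
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Example: a tiny rubric for replicating one result from a paper.
rubric = RubricNode("Reproduce Figure 3", children=[
    RubricNode("Code runs end to end", weight=1.0, passed=True),
    RubricNode("Metric within tolerance of reported value", weight=2.0, passed=False),
])
print(f"Replication score: {rubric.score():.2f}")  # -> 0.33
```

A weighted tree like this lets a partial replication earn partial credit, which is what makes fine-grained scoring across thousands of sub-tasks tractable.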