ChatGPT and Claude are ‘becoming capable of tackling real-world missions,’ say scientists

Researchers from Tsinghua University, Ohio State University, and UC Berkeley have developed AgentBench, a tool for measuring how well large language models (LLMs) perform as real-world agents. It tests models’ ability to carry out complex tasks in a range of environments, such as operating within an SQL database and shopping online. The study found that top-tier models like GPT-4 significantly outperformed open-source models, which the researchers see as a sign of those models’ potential for building a potent, continuously learning agent.
Read more at Cointelegraph…

