
AI systems are great at tests. But how do they perform in real life?

25.08.2025

Earlier this month, when OpenAI released its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and healthcare.

Benchmark tests like these have become the standard way we assess AI systems – but they don’t tell us much about the actual performance and effects of these systems in the real world.

What would be a better way to measure AI models? A group of AI researchers and metrologists – experts in the science of measurement – recently outlined a way forward.

Metrology is important here because we need not only ways of ensuring the reliability of the AI systems we may increasingly depend upon, but also some measure of their broader economic, cultural and societal impact.

We count on metrology to ensure the tools, products, services, and processes we use are reliable.

Take something close to my heart as a biomedical ethicist – health AI. In healthcare, AI promises to improve diagnoses and patient monitoring, make medicine more personalised and help prevent diseases, as well as handle some administrative tasks.

These promises will only be realised if we can be sure health AI is safe and effective, and that means finding reliable ways to measure it.

We already have well-established systems for measuring the safety and effectiveness of drugs and medical devices, for example. But this is........

© The Conversation