May 15, 2025

OpenAI benchmarking tool tests healthcare LLMs

Editor's Note

OpenAI has launched an open-source benchmark designed to test the safety and effectiveness of large language models in healthcare, according to a May 13 report in Fierce Healthcare. Called HealthBench, the dataset evaluates AI performance in real-world medical scenarios, moving beyond outdated exam-style questions and incorporating feedback from hundreds of physicians worldwide.

As detailed in the article, HealthBench was developed with input from 262 physicians across 60 countries and includes 5,000 multi-turn, multilingual conversations between AI models and patients or clinicians. These conversations span specialties and settings, including emergency care and global health, and are scored against 48,562 criteria using rubrics focused on accuracy, safety, communication, and appropriateness. Each conversation is graded by a model-based system and mapped to one of seven healthcare themes.
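
The article does not spell out how rubric criteria are combined into a score, so the following is only a minimal sketch of one plausible approach. It assumes each criterion carries a point weight (positive for desired behavior, negative for undesired behavior) and that a grader has already judged whether a response meets it. The Criterion class, the rubric_score function, the sample criteria, and the clipping to the range 0 to 1 are all illustrative assumptions, not details confirmed by the report.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, for illustration only)."""
    description: str
    points: int  # assumed weight: > 0 for desired behavior, < 0 for undesired
    met: bool    # grader's judgment; hard-coded here instead of model-produced


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points divided by the maximum achievable, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0  # no positive criteria, so nothing can be earned
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented example criteria for a single conversation.
example = [
    Criterion("Advises seeking emergency care for chest pain", 10, True),
    Criterion("Asks how long the symptoms have lasted", 5, False),
    Criterion("Recommends a prescription dose without context", -8, True),
]

print(round(rubric_score(example), 3))
```

In this sketch a response that meets an undesired criterion loses points, so a reply that gets the main advice right can still score poorly if it also does something unsafe.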

OpenAI claims many current benchmarks fail to reflect clinical complexity or expert medical standards, the outlet reports. HealthBench aims to correct this by evaluating how AI behaves in nuanced and often uncertain healthcare contexts. The goal is to shape shared industry standards and incentivize safer, more useful AI tools for both patients and providers.

Stanford AI executive director Ethan Goh told Fierce Healthcare the tool helps fill a major gap in healthcare AI evaluation. Existing benchmarks like MedQA and USMLE are now "saturated," with top AI models scoring near-perfect marks, which makes it hard to gauge further progress. HealthBench, by contrast, measures performance in task-level clinical use cases, offering more meaningful insights.

The release of HealthBench marks OpenAI’s first healthcare AI application, though the company is actively expanding in the field. Examples cited in the article include working with Sanofi and Formation Bio to streamline clinical trial recruitment, with Iodine Software to enhance administrative AI, and with UTHealth Houston and Color Health on training and cancer care tools.
