February 1, 2024

ChatGPT study prompts questions about clinical applications for large-language-model AI

Editor's Note

Although ChatGPT has shown human-level performance on several professional and academic benchmarks, a recent study of its potential for clinical applications raised questions among surgeon evaluators. Findings were reported in the journal Surgery on January 20.

Specifically, researchers tested OpenAI’s general-purpose large language model on questions from the Surgical Council on Resident Education (SCORE) question bank. They also fed the AI a second commonly used surgical knowledge assessment, referred to in the study as Data-B. Questions were entered in two formats: open-ended and multiple-choice. Surgeon evaluators assessed answers for accuracy, categorized reasons for model errors, and assessed the stability of performance on repeat queries.

The tool performed better on multiple-choice questions, correctly answering 71.3% of SCORE questions and 67.9% of Data-B questions, versus 47.9% and 66.1%, respectively, of the open-ended questions. Common reasons for incorrect responses included inaccurate information and accurate information with circumstantial discrepancy. Asked the same question again, ChatGPT’s answers varied for 36.4% of the questions it had answered incorrectly the first time.

Better performance on multiple-choice than open-ended questions, along with inconsistency across repeated queries, raises questions about the safety and reliability required for clinical application, the researchers conclude: “Despite near or above human-level performance on question banks and given these observations, it is unclear whether large language models such as ChatGPT are able to safely assist clinicians in providing care.”
