The pitfalls of multiple-choice questions in generative AI and medical education - Scientific Reports

nature.com

The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
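The masking study and the comparison against random chance can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not say which parts of the stem are masked, what mask token is used, or how many answer options each question has, so the trailing-word masking, the `[MASK]` token, and the `n_options` parameter below are all assumptions chosen for illustration.

```python
def mask_stem(stem: str, fraction: float, mask_token: str = "[MASK]") -> str:
    """Replace the trailing `fraction` of the words in a question stem.

    Assumption: masking proceeds from the end of the stem, word by word.
    The abstract only says parts of the stem are masked iteratively.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = stem.split()
    n_masked = round(len(words) * fraction)
    n_kept = len(words) - n_masked
    return " ".join(words[:n_kept] + [mask_token] * n_masked)


def excess_over_chance(accuracy: float, n_options: int) -> float:
    """Accuracy minus the uniform-guessing baseline of 1 / n_options.

    Assumption: every question has the same number of answer options;
    the abstract does not state that number.
    """
    if n_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return accuracy - 1.0 / n_options


if __name__ == "__main__":
    stem = "A patient presents with chest pain and dyspnea"
    for fraction in (0.0, 0.5, 1.0):
        print(f"{fraction:.0%}: {mask_stem(stem, fraction)}")
```

At 100% masking the model sees only the answer options, so any accuracy above the chance baseline has to come from the options themselves rather than from the clinical content of the question.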

1 page links to this URL
Learning with LLMs

AI is here, and its impacts on education cannot be overstated. Let’s put aside the issues of cheating; I assume that you want to learn, perhaps with the assistance of LLMs if they are actually helpful. But how do you know you’re not using AI as a crutch, versus using it to augment learning? The former setting outsources your thinking to AI, whereas the latter can help you reveal gaps in your understanding, bypass blockers that prevent learning, and/or tailor education to your style. In this post, I provide an analogy between learning and phase transitions in statistical mechanics, and describe recommendations and warnings on using LLMs in different learning scenarios.
