Fine-Tuned LLMs Surface Clinical Documentation Errors Before They Reach Patients

Challenge

Clinical documentation errors were buried in unstructured text, with manual review too slow and inconsistent to catch them reliably across departments.

Solution

Fine-tuned MedGemma 27B and GPT-OSS 120B on a hospital-specific error taxonomy, deploying a human-in-the-loop pilot with structured, traceable outputs for clinical review staff.

Impact

Demonstrated LLM-assisted error detection in a real hospital environment, with faster triage, evidence-backed flags, and a repeatable pipeline for continued model improvement.


Background

Introduction

By one widely cited estimate (a 2016 BMJ analysis), medical errors are the third leading cause of death in the United States, responsible for more than 250,000 deaths annually, ahead of respiratory disease, accidents, and stroke. A 2006 Institute of Medicine report estimated that at least 1.5 million preventable adverse drug events occur in the United States each year, with gaps in documentation review among the contributing factors. Clinical notes are dense, unstructured, and produced under pressure, making manual review both essential and unsustainable at scale. Rule-based systems catch known patterns but miss the nuanced, context-dependent errors that matter most. This engagement took a different approach: fine-tuning large language models on clinically grounded error categories and deploying them inside a hospital environment with the traceability and human oversight required for patient safety applications.

The problem

Key Challenges

A local hospital wanted to reduce preventable clinical documentation and medication-related errors while improving patient safety workflows. Potential errors were buried in unstructured text across clinical notes, discharge summaries, and handoffs. Manual review processes were time-consuming and inconsistent across departments. Existing rule-based checks had limited coverage and generated noisy alerts that fed alert fatigue. Any AI solution needed careful validation and a controlled deployment path to avoid disrupting clinical operations.

Errors Buried in Unstructured Text

Clinical notes, discharge summaries, and handoff documents contain high-risk information in free text — impossible to screen reliably at scale with manual review alone.

Inconsistent Manual Review

Review processes varied across departments, creating uneven coverage and making it difficult to standardize what constitutes a flaggable error.

Noisy Rule-Based Alerts

Existing rule-based checks had limited coverage and generated excessive false positives, adding to alert fatigue rather than reducing errors.

Clinical Deployment Constraints

Any AI system needed careful validation and a controlled rollout — unsafe or overconfident flags in a clinical setting can cause harm rather than prevent it.

What we built

Solution Components

Led an end-to-end project to fine-tune and evaluate MedGemma 27B and GPT-OSS 120B for medical error detection, delivering a pilot deployment inside the hospital environment. Defined a clinical error taxonomy and labeling protocol covering medication conflicts, contraindications, allergy mismatches, dose and route inconsistencies, documentation contradictions, and missing follow-ups. Applied supervised fine-tuning with curated clinical examples, structured outputs (error type, rationale, supporting text span, suggested next step), and a rigorous evaluation framework measuring precision and recall per error class. Designed a human-in-the-loop workflow with traceability to source text, and packaged the model as a secure, logged, access-controlled pilot service for clinical and quality staff.

Clinical Error Taxonomy

Defined target error categories — medication conflicts, contraindications, allergy mismatches, dose and route inconsistencies, documentation contradictions, missing follow-ups — aligned to the hospital's review priorities.
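As a rough illustration, the six categories could be pinned down as a closed set so that labels and model outputs cannot drift off-taxonomy. This is a minimal sketch: the class name, the identifier spellings, and the `parse_category` helper are assumptions made for this example, not the hospital's actual schema.

```python
from enum import Enum


class ErrorCategory(Enum):
    """The six target categories listed above (identifier spellings are illustrative)."""
    MEDICATION_CONFLICT = "medication_conflict"
    CONTRAINDICATION = "contraindication"
    ALLERGY_MISMATCH = "allergy_mismatch"
    DOSE_ROUTE_INCONSISTENCY = "dose_route_inconsistency"
    DOCUMENTATION_CONTRADICTION = "documentation_contradiction"
    MISSING_FOLLOW_UP = "missing_follow_up"


def parse_category(label: str) -> ErrorCategory:
    """Normalize a free-form label; raises ValueError for anything off-taxonomy."""
    normalized = label.strip().lower().replace(" ", "_").replace("-", "_")
    return ErrorCategory(normalized)
```

Rejecting unknown labels at this boundary keeps annotation files and model outputs comparable during evaluation.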

Fine-Tuned LLM Models

Applied supervised fine-tuning to MedGemma 27B and GPT-OSS 120B using curated clinical examples, producing structured outputs with error type, rationale, supporting text span, and suggested next step.

Safety-Focused Evaluation

Measured precision and recall tradeoffs per error class, calibrated detection thresholds, and assessed failure modes to minimize unsafe or overly speculative flags.
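Per-class precision and recall can be computed directly from gold and predicted label sets. The sketch below assumes one set of error-class labels per document; the function name and input shape are illustrative and do not describe the project's actual evaluation harness.

```python
from collections import Counter


def per_class_precision_recall(gold, predicted):
    """gold and predicted are parallel lists of label sets, one set per document."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_labels, pred_labels in zip(gold, predicted):
        for label in pred_labels & gold_labels:
            tp[label] += 1  # flagged and actually present
        for label in pred_labels - gold_labels:
            fp[label] += 1  # flagged but not present
        for label in gold_labels - pred_labels:
            fn[label] += 1  # present but missed
    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        flagged = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            # None marks an undefined ratio (class never flagged or never present).
            "precision": tp[label] / flagged if flagged else None,
            "recall": tp[label] / actual if actual else None,
        }
    return report
```

Reporting each class separately matters here because a single aggregate score can hide a class with poor recall, such as a rarely occurring contraindication.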

Structured Actionable Outputs

Engineered model outputs to return error type, clinical rationale, supporting text span, and suggested next step — giving reviewers everything needed to act immediately without re-reading the source document.
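One way to picture the four-field output is as a small typed record that is validated before it reaches a reviewer. This is a hedged sketch: the JSON field names, the `Flag` class, and the `parse_flag` helper are assumptions for illustration, and the source does not specify the actual wire format.

```python
import json
from dataclasses import dataclass

# Hypothetical identifiers for the six categories in the taxonomy.
ALLOWED_TYPES = {
    "medication_conflict", "contraindication", "allergy_mismatch",
    "dose_route_inconsistency", "documentation_contradiction", "missing_follow_up",
}
REQUIRED_FIELDS = ("error_type", "rationale", "supporting_span", "suggested_next_step")


@dataclass(frozen=True)
class Flag:
    error_type: str
    rationale: str
    supporting_span: str
    suggested_next_step: str


def parse_flag(raw: str) -> Flag:
    """Parse one model response; raises ValueError on malformed or off-taxonomy output."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS
               if not isinstance(data.get(f), str) or not data[f].strip()]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    if data["error_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown error_type: {data['error_type']!r}")
    return Flag(**{f: data[f].strip() for f in REQUIRED_FIELDS})
```

Failing loudly on an incomplete record means a reviewer never sees a flag that lacks its rationale or evidence.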

Human-in-the-Loop Workflow

Designed reviewable outputs with full traceability to source text and explicit uncertainty handling — keeping clinicians and quality staff in control of every flagged case.
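Traceability and uncertainty handling can be sketched as a routing step: check that the quoted evidence appears verbatim in the note, record its character offsets, and send low-confidence or untraceable flags to a separate queue. The route names, the `confidence` score, and the 0.5 default below are all hypothetical; the source does not describe how uncertainty was represented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReviewItem:
    error_type: str
    supporting_span: str
    confidence: float
    span_offsets: Optional[Tuple[int, int]]  # character offsets into the note
    route: str


def route_flag(note_text, error_type, supporting_span, confidence, min_confidence=0.5):
    """Attach source offsets to a flag and pick a review queue; nothing is auto-resolved."""
    start = note_text.find(supporting_span) if supporting_span else -1
    if start < 0:
        # The quoted evidence is not verbatim in the note, so a reviewer must locate it.
        return ReviewItem(error_type, supporting_span, confidence, None, "untraceable")
    offsets = (start, start + len(supporting_span))
    route = "standard_review" if confidence >= min_confidence else "low_confidence_review"
    return ReviewItem(error_type, supporting_span, confidence, offsets, route)
```

Every route ends with a human reviewer; the routing only changes how the flag is presented and prioritized.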

Controlled Pilot Deployment

Packaged the model as a secure, logged, access-controlled pilot service, enabling clinical and quality staff to test on realistic cases without disrupting existing workflows.
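The "secure, logged, access-controlled" properties can be illustrated with a thin wrapper around the detector. This is only a sketch under stated assumptions: the role names, the logger name, and the `detect` callable are hypothetical, and a real deployment would rely on the hospital's own identity and audit infrastructure.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("pilot.audit")  # hypothetical logger name

ALLOWED_ROLES = {"clinician", "quality_reviewer"}  # hypothetical role names


def review_note(user_id, role, note_text, detect):
    """Run `detect` on a note for an authorized user and write an audit record."""
    if role not in ALLOWED_ROLES:
        audit_log.warning(json.dumps({"event": "denied", "user": user_id, "role": role}))
        raise PermissionError(f"role {role!r} may not use the pilot service")
    flags = detect(note_text)
    audit_log.info(json.dumps({
        "event": "reviewed",
        "user": user_id,
        "role": role,
        # Log a digest rather than the note itself so the audit trail holds no clinical text.
        "note_sha256": hashlib.sha256(note_text.encode("utf-8")).hexdigest(),
        "flag_count": len(flags),
        "at": datetime.now(timezone.utc).isoformat(),
    }))
    return flags
```

Passing the detector in as a callable keeps the access and logging layer independent of whichever model sits behind it.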

Outcomes

Impact

Demonstrated feasibility of LLM-assisted medical error detection in a real hospital environment. Improved reviewer efficiency by surfacing likely issues earlier with evidence and rationale attached, enabling faster triage. Delivered a validated foundation suitable for expansion into additional departments and error classes, along with a repeatable pipeline — data curation, fine-tuning, evaluation, deployment — for continuous clinical model improvement.

2 LLMs fine-tuned for clinical error detection

6 error categories in clinical taxonomy

Live pilot deployed in a real hospital environment

4-phase pipeline: curation, fine-tuning, evaluation, deployment

How we delivered

Our Process

STEP 01

Taxonomy & Labeling

Defined clinical error categories and built a labeling protocol using curated examples aligned to the hospital's documentation review priorities.

STEP 02

Fine-Tuning

Applied supervised fine-tuning to MedGemma 27B and GPT-OSS 120B with structured output targets and prompt design for actionable, evidence-backed flags.
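A structured output target can be pictured as a prompt paired with the exact JSON the model should learn to emit. The sketch below is illustrative only: the instruction wording, the `prompt`/`completion` field names, and the JSONL layout are assumptions, since the expected format depends on the training framework and the source does not name one.

```python
import json

# Hypothetical instruction text; the project's actual prompt is not given in the source.
INSTRUCTION = (
    "Review the clinical note and list any documentation or medication errors. "
    "For each, return error_type, rationale, supporting_span, and suggested_next_step as JSON."
)


def build_sft_example(note_text, flags):
    """Return one prompt/completion pair whose completion is the JSON target."""
    prompt = f"{INSTRUCTION}\n\nNote:\n{note_text}\n\nFlags:"
    # sort_keys keeps targets byte-identical across runs for the same annotation.
    completion = json.dumps(flags, ensure_ascii=False, sort_keys=True)
    return {"prompt": prompt, "completion": completion}


def to_jsonl(examples):
    """Serialize examples one per line, the usual layout for fine-tuning datasets."""
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)
```

Notes with no errors should still appear in the training set with an empty list as the target, so the model also learns when not to flag.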

STEP 03

Evaluation

Ran safety and usefulness evaluation across error classes — precision/recall tradeoffs, threshold calibration, and failure mode analysis to minimize clinical risk.
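Threshold calibration can be sketched as choosing, per error class, the lowest score cutoff that still meets a precision target on held-out validation data. The function below assumes each candidate flag carries a numeric score and a reviewer-confirmed 0/1 label; both the scoring scheme and the function name are assumptions, not details from the project.

```python
def calibrate_threshold(scores, labels, target_precision):
    """Return the lowest threshold whose validation precision meets the target, else None.

    scores: model confidence per candidate flag; labels: 1 if a reviewer confirmed it, else 0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best = None
    # Walk from strict to permissive; keep the most permissive cutoff that still qualifies.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        precision = sum(flagged) / len(flagged)
        if precision >= target_precision:
            best = threshold
    return best
```

Choosing the lowest qualifying cutoff maximizes recall subject to the precision floor; a `None` result signals that the class cannot meet the target at any cutoff and may need more training data or a lower target.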

STEP 04

Hospital Deployment

Deployed the model in a secure, logged pilot environment with access controls, enabling realistic testing by clinical and quality review staff.

Built with

Tech Stack

MedGemma 27B

GPT-OSS 120B
