Fine-Tuned LLMs Surface Clinical Documentation Errors Before They Reach Patients
Introduction
Medical errors are the third leading cause of death in the United States, responsible for more than 250,000 deaths annually — ahead of respiratory disease, accidents, and stroke. The Institute of Medicine estimates 1.5 million Americans are injured by medication errors each year, most traceable to gaps in documentation review. Clinical notes are dense, unstructured, and produced under pressure, making manual review both essential and unsustainable at scale. Rule-based systems catch known patterns but miss the nuanced, context-dependent errors that matter most. This engagement took a different approach: fine-tuning large language models on clinically-grounded error categories and deploying them inside a hospital environment with the traceability and human oversight required for patient safety applications.
Key Challenges
A local hospital wanted to reduce preventable clinical documentation and medication-related errors while improving patient safety workflows. Potential errors were buried in unstructured text across clinical notes, discharge summaries, and handoffs. Manual review processes were time-consuming and inconsistent across departments. Existing rule-based checks had limited coverage and generated noisy, high-fatigue alerts. Any AI solution needed careful validation and a controlled deployment path to avoid disrupting clinical operations.
Errors Buried in Unstructured Text
Clinical notes, discharge summaries, and handoff documents contain high-risk information in free text — impossible to screen reliably at scale with manual review alone.
Inconsistent Manual Review
Review processes varied across departments, creating uneven coverage and making it difficult to standardize what constitutes a flaggable error.
Noisy Rule-Based Alerts
Existing rule-based checks had limited coverage and generated excessive false positives, contributing to alert fatigue rather than reducing it.
Clinical Deployment Constraints
Any AI system needed careful validation and a controlled rollout — unsafe or overconfident flags in a clinical setting can cause harm rather than prevent it.
Solution Components
Led an end-to-end project to fine-tune and evaluate MedGemma 27B and GPT-OSS 120B for medical error detection, delivering a pilot deployment inside the hospital environment. Defined a clinical error taxonomy and labeling protocol covering medication conflicts, contraindications, allergy mismatches, dose and route inconsistencies, documentation contradictions, and missing follow-ups. Applied supervised fine-tuning with curated clinical examples, structured outputs (error type, rationale, supporting text span, suggested next step), and a rigorous evaluation framework measuring precision and recall per error class. Designed a human-in-the-loop workflow with traceability to source text, and packaged the model as a secure, logged, access-controlled pilot service for clinical and quality staff.
Clinical Error Taxonomy
Defined target error categories — medication conflicts, contraindications, allergy mismatches, dose and route inconsistencies, documentation contradictions, missing follow-ups — aligned to the hospital's review priorities.
Fine-Tuned LLM Models
Applied supervised fine-tuning to MedGemma 27B and GPT-OSS 120B using curated clinical examples, producing structured outputs with error type, rationale, supporting text span, and suggested next step.
Safety-Focused Evaluation
Measured precision and recall tradeoffs per error class, calibrated detection thresholds, and assessed failure modes to minimize unsafe or overly speculative flags.
Structured Actionable Outputs
Engineered model outputs to return error type, clinical rationale, supporting text span, and suggested next step — giving reviewers everything needed to act immediately without re-reading the source document.
Human-in-the-Loop Workflow
Designed reviewable outputs with full traceability to source text and explicit uncertainty handling — keeping clinicians and quality staff in control of every flagged case.
Controlled Pilot Deployment
Packaged the model as a secure, logged, access-controlled pilot service, enabling clinical and quality staff to test on realistic cases without disrupting existing workflows.
Impact
Demonstrated feasibility of LLM-assisted medical error detection in a real hospital environment. Improved reviewer efficiency by surfacing likely issues earlier with evidence and rationale attached, enabling faster triage. Delivered a validated foundation suitable for expansion into additional departments and error classes, along with a repeatable pipeline — data curation, fine-tuning, evaluation, deployment — for continuous clinical model improvement.
Our Process
Taxonomy & Labeling
Defined clinical error categories and built a labeling protocol using curated examples aligned to the hospital's documentation review priorities.
Fine-Tuning
Applied supervised fine-tuning to MedGemma 27B and GPT-OSS 120B with structured output targets and prompt design for actionable, evidence-backed flags.
Evaluation
Ran safety and usefulness evaluation across error classes — precision/recall tradeoffs, threshold calibration, and failure mode analysis to minimize clinical risk.
Hospital Deployment
Deployed the model in a secure, logged pilot environment with access controls, enabling realistic testing by clinical and quality review staff.