
Continuous Model Optimization for Healthcare

Your workloads change. Pnyx keeps watching.

Pnyx capability: Continuous optimization — monitoring production workloads over time and evaluating them against the full model landscape, including models outside the customer's current stack. When workloads drift, models update, or pricing changes, Pnyx fires alerts. Every decision stays human.

Multi-Provider · Healthcare · Continuous Optimization · Governance · Quality Improvement · Cost Optimization
February 19, 2026

Model Selection Isn't a One-Time Decision

Most health systems pick AI models during a pilot, validate them once, and move on. Six months later the workload looks different — patient volumes shift, new clinical protocols change the complexity of documentation tasks, regulatory requirements evolve — but the model assignment hasn't moved.

Meanwhile, the model landscape changes just as fast. New models launch every quarter. Existing models get updated. Pricing shifts. A model from a provider the health system has never evaluated might now outperform their current assignment on half their production traffic.

No one inside the organization has time to continuously re-evaluate. The team that ran the original pilot has moved on. The models are running. The bills are getting paid. And the gap between what's deployed and what's optimal grows silently.

Pnyx closes that gap. As production prompts flow through the system, Pnyx continuously evaluates them against the full model landscape — including models outside the health system's current stack. When a better fit emerges, the system doesn't auto-switch. It fires an alert. The clinical governance team reviews the data and makes the call.

Why Healthcare

Healthcare AI is moving from pilot to infrastructure. U.S. healthcare AI adoption jumped from 3% to 22% in two years, and by 2026 ambient documentation, clinical decision support, prior authorization automation, and revenue cycle AI are becoming standard across health systems. Health system leaders describe 2026 as the year organizations must move from scattered AI pilots to governed deployment.

But governed deployment requires continuous oversight — and that's where most health systems have a gap. Three dynamics make healthcare uniquely suited to continuous model optimization:

Clinical workloads drift. A documentation workflow that processed mostly routine encounters during the pilot may now handle a different case mix. A prior authorization workflow built for a handful of procedures now covers dozens. The prompts flowing through production today aren't the prompts the model was validated against. Without continuous evaluation, the health system is running on stale assumptions.

The regulatory environment demands ongoing monitoring. The Joint Commission and the Coalition for Health AI released governance guidance requiring health systems to implement continuous monitoring of AI system performance — not just initial validation. States are layering on disclosure and oversight requirements. HHS issued a Request for Information in late 2025 seeking input on how to accelerate AI adoption while maintaining safety. The direction is clear: regulators expect health systems to know how their AI is performing at all times, not just at deployment.

Model updates change behavior without warning. When a provider updates a model, the behavior on existing tasks can shift. A model that handled clinical documentation well before an update might handle it differently after — and the change surfaces as clinician complaints or coding errors, not as a routing signal. Health systems need to detect performance drift before it reaches patients.

How It Works

A regional health system runs AI across eight clinical and operational workflows. Models were selected during pilots over the past 18 months. The system wants to know — on an ongoing basis — whether those choices still hold.

Pnyx connects to the health system's AI traffic and continuously evaluates production prompts against three dimensions:

Workload drift detection

Pnyx tracks the skill profile of each workflow over time. When the complexity, domain requirements, or safety sensitivity of a workflow's prompts shift meaningfully, the system flags it.

Example: A clinical documentation workflow was validated on general medicine encounters. Over six months, the health system expanded its oncology service line. The documentation prompts now include complex treatment narratives and multi-drug regimen summaries that demand significantly more domain knowledge and reasoning depth than the original case mix. The assigned model — selected for general medicine — is no longer the right fit.

Without continuous evaluation, this surfaces as clinician dissatisfaction with AI-generated notes. With Pnyx, it surfaces as a workload drift alert with data: the workflow's skill profile has shifted, here's how, and here's what the current model can and can't handle.
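To make the mechanism concrete, here is a minimal sketch of drift detection of this kind, assuming a hypothetical skill-profile representation where each workflow is scored per dimension on a 0-1 scale. Names like `SkillProfile` and `detect_drift` are illustrative, not Pnyx's actual API:

```python
from dataclasses import dataclass

# Hypothetical skill-profile representation: each workflow is scored on a few
# dimensions (complexity, domain depth, safety sensitivity), scaled 0-1.
SkillProfile = dict[str, float]

@dataclass
class DriftAlert:
    workflow: str
    dimension: str
    baseline: float
    current: float

def detect_drift(workflow: str,
                 baseline: SkillProfile,
                 current: SkillProfile,
                 threshold: float = 0.15) -> list[DriftAlert]:
    """Flag any skill dimension that has moved beyond the threshold
    since the workflow was last validated."""
    alerts = []
    for dim, base_score in baseline.items():
        cur_score = current.get(dim, base_score)
        if abs(cur_score - base_score) >= threshold:
            alerts.append(DriftAlert(workflow, dim, base_score, cur_score))
    return alerts

# Illustrative values: oncology expansion pushes domain depth and reasoning
# well past what the workflow was validated against.
baseline = {"complexity": 0.45, "domain_depth": 0.40, "safety": 0.70}
current  = {"complexity": 0.62, "domain_depth": 0.71, "safety": 0.72}
for alert in detect_drift("clinical_documentation", baseline, current):
    print(f"{alert.workflow}: {alert.dimension} drifted "
          f"{alert.baseline:.2f} -> {alert.current:.2f}")
```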

Cross-provider model scouting

Pnyx evaluates new and updated models against the health system's actual production workloads — not synthetic benchmarks. When a model from any provider outperforms the current assignment on real prompts, it gets flagged.

Example: The health system uses GPT-5 for prior authorization letter generation. Anthropic releases a new model with stronger instruction-following on structured clinical documents. Pnyx evaluates it against the health system's actual prior authorization prompts and finds it produces more complete, policy-compliant letters at lower cost. The health system's current provider contract doesn't include this model. Pnyx flags the opportunity — the governance team decides whether to evaluate further.

This isn't switching models automatically. It's surfacing options the health system would never discover on its own, because no one has time to benchmark every new release against every production workflow.
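A hedged sketch of what scouting like this could look like under simple assumptions: `score_fn` stands in for an evaluation harness that grades a model's output on a real production prompt, and the function only returns a flag payload, never a routing change. All names are illustrative:

```python
import statistics

def scout_model(prompts, incumbent, candidate, score_fn,
                min_gain: float = 0.15):
    """Evaluate a candidate model against the incumbent on real production
    prompts. Return a flag payload if the candidate clears the margin;
    never switch automatically."""
    inc_scores = [score_fn(incumbent, p) for p in prompts]
    cand_scores = [score_fn(candidate, p) for p in prompts]
    inc_mean = statistics.mean(inc_scores)
    cand_mean = statistics.mean(cand_scores)
    if cand_mean >= inc_mean * (1 + min_gain):
        return {
            "candidate": candidate,
            "incumbent": incumbent,
            "incumbent_score": round(inc_mean, 3),
            "candidate_score": round(cand_mean, 3),
            "action": "alert_governance_team",  # a human decides, not the router
        }
    return None
```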

Customer-defined rules and alerting

The health system sets its own thresholds. Pnyx monitors against them and fires alerts when conditions are met. The governance team — not the routing layer — makes every decision.

Rules health systems configure:

  • Performance threshold: Alert when any model outside our current stack scores 15%+ higher on a workflow's skill requirements than the assigned model.
  • Quality drift: Alert when a model update causes output quality on a workflow to drop below baseline — before patients or clinicians notice.
  • Cost opportunity: Alert when pricing changes make our current routing 20%+ more expensive than an available alternative at equal or better quality.
  • Regulatory readiness: Flag any workflow where the model assignment lacks the documentation required for Joint Commission or state-level AI governance review.
  • New model release: When a major provider launches a new model, automatically evaluate it against our top five workflows and report fit.

Every alert includes the data: which workflow, what changed, what the current model delivers, what the alternative delivers, and what the tradeoff is. No action is taken without human approval.
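As a rough illustration, a rule set like the one above could be expressed as declarative configuration. The sketch below mirrors the five rules; every key and threshold is illustrative, not Pnyx's actual schema:

```python
# Hypothetical rule configuration mirroring the list above. Each rule
# produces an alert only; no rule ever changes routing on its own.
ALERT_RULES = [
    {"rule": "performance_threshold",
     "condition": "external_model_score >= assigned_model_score * 1.15",
     "scope": "all_workflows"},
    {"rule": "quality_drift",
     "condition": "post_update_quality < validated_baseline",
     "scope": "all_workflows"},
    {"rule": "cost_opportunity",
     "condition": ("current_cost >= alternative_cost * 1.20 "
                   "and alternative_quality >= current_quality"),
     "scope": "all_workflows"},
    {"rule": "regulatory_readiness",
     "condition": "governance_documentation_missing",
     "scope": "all_workflows"},
    {"rule": "new_model_release",
     "trigger": "major_provider_launch",
     "action": "evaluate_against_top_workflows",
     "params": {"top_n": 5}},
]
```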

What This Looks Like Over Time

Month 1: Pnyx establishes baseline skill profiles for all eight workflows. Initial workload intelligence report delivered — same as a standalone analysis.

Month 3: A model provider releases a major update. Pnyx automatically benchmarks it against all eight workflows. Three workflows would benefit. Alert fired to governance team with evaluation data.

Month 6: Clinical documentation workflow's prompt complexity has shifted due to service line expansion. Workload drift alert shows the skill profile has moved beyond what the assigned model handles well. Governance team reviews and approves a model change.

Month 9: A competitor provider drops pricing on a model that matches four workflows. Cost opportunity alert shows 30% savings potential on high-volume workflows with no quality impact. Finance and clinical teams review together.

Month 12: The health system's AI cost structure has improved three times — not through a single optimization event, but through continuous monitoring that caught drift, surfaced alternatives, and gave the governance team the data to act.

Why This Requires a Neutral Layer

Model providers release benchmarks showing their latest model is better. They don't tell you when a competitor's model outperforms theirs on your specific workload. They don't alert you when their own update degrades performance on your clinical documentation. They don't monitor your prompt patterns to detect when your workload has drifted beyond what their model handles well.

Pnyx is provider-agnostic. It evaluates across the full landscape — OpenAI, Anthropic, Google, open-source — against the health system's own production prompts. The alerts are driven by what the work requires, not by what any single provider sells.

The Adoption Path

Start with workload analysis. Pnyx evaluates current production prompts and delivers a baseline report — skill profiles, model fit, cost mapping. Read-only. No routing changes.

Set rules. The governance team defines alert thresholds: performance gaps, cost opportunities, quality drift, new model evaluation criteria. Rules reflect the organization's risk tolerance and decision-making process.

Monitor continuously. Pnyx tracks workload patterns, evaluates new models as they release, detects drift, and fires alerts when thresholds are crossed. Every alert is data. Every decision is human.
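Pulling the earlier sketches together, the continuous loop might look something like this, with all objects and helpers (`detect_drift`, `scout_model`, the `wf` workflow handle, `alert_sink`) hypothetical:

```python
# Minimal sketch of the continuous monitoring cycle described above,
# reusing the illustrative helpers sketched earlier. Nothing here switches
# a model; every finding is routed to the governance team as an alert.
def monitoring_cycle(workflows, model_catalog, alert_sink):
    for wf in workflows:
        # 1. Workload drift: compare today's skill profile to the baseline.
        for alert in detect_drift(wf.name, wf.baseline_profile,
                                  wf.current_profile()):
            alert_sink.send(alert)
        # 2. Model scouting: benchmark new or updated models on real prompts.
        for candidate in model_catalog.new_or_updated():
            flag = scout_model(wf.sample_prompts(), wf.assigned_model,
                               candidate, wf.score_fn)
            if flag:
                alert_sink.send(flag)
```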

Act when the data supports it. Some alerts lead to model changes. Some lead to deeper evaluation. Some confirm the current assignment is still right. The point is that the health system always knows — instead of discovering six months later that the model landscape moved and they didn't.

See how Pnyx routes your workloads

Try the Prompt Analyzer or request early access to the routing gateway.
