Feature Flags for LLM Rollouts: Switch Models and Prompts Without Redeploying
Every LLM upgrade is a gamble until it isn't. You swap in the latest model, tweak the system prompt, push to production — and find out that "better on benchmarks" doesn't always mean better for your users.
The problem isn't the model. It's how you release it. Most teams treat an LLM config change the same way they treated a database schema change in 2015 — one big flip, everyone gets it at once, and you pray.
Why LLM Changes Are Riskier Than They Look
Staging environments lie. You can run a new model against a curated eval suite and get green across the board, then see subtle quality regressions on real user inputs you never anticipated. LLMs are non-deterministic — the edge cases live in production, not in your test harness.
The same goes for prompt changes. A small rewording of your system prompt can shift tone, change how the model handles edge cases, and affect refusal behaviour in ways that are genuinely hard to catch before real users hit it.
And cost is a variable too. A new model might be 3× more expensive per token, and at production traffic that difference compounds fast. You want to validate quality and latency before you commit your inference budget to it.
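A quick back-of-the-envelope shows why. All the numbers below are illustrative placeholders, not real rate cards:

// Rough daily-cost comparison; every number here is an assumption.
const requestsPerDay = 100_000;
const tokensPerRequest = 1_500; // prompt + completion combined

const dailyCostUsd = (pricePer1MTokens: number) =>
  (requestsPerDay * tokensPerRequest * pricePer1MTokens) / 1_000_000;

console.log(dailyCostUsd(2.0)); // current model: $300/day
console.log(dailyCostUsd(6.0)); // 3x candidate: $900/day, roughly $219k/year more

At that scale, "ramp to 5% and measure" isn't just a quality safeguard; it's a budgeting tool.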
Put Your Inference Config Behind a Flag
The fix is the same one that solved this problem for frontend releases: decouple the deploy from the release. Ship the new model configuration to production, but gate it behind a feature flag. Start with 5% of traffic, watch quality metrics and token costs, then ramp.
Here's a minimal TypeScript pattern using the Featureflow SDK. The flag evaluates on every request, but because evaluation is keyed on the user, each user sees the same variant consistently across a session:
import featureflow from 'featureflow-client';

const client = featureflow.init(process.env.FEATUREFLOW_API_KEY!, {
  user: {
    key: userId, // consistent bucketing per user
    attributes: {
      plan: user.plan, // target paid users first if needed
    },
  },
});

// Flag returns a model name as a string value
const model =
  (client.evaluate('llm-model-version').getValue() as string) ?? 'gpt-4o';

// Separate flag for prompt variant
const useNewPrompt = client.evaluate('prompt-template-v2').isOn();
const systemPrompt = useNewPrompt ? SYSTEM_PROMPT_V2 : SYSTEM_PROMPT_V1;

const response = await openai.chat.completions.create({
  model,
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userMessage },
  ],
});

Two flags, two independent control surfaces. You can A/B test the model independently of the prompt, which matters: when quality shifts, you want to know which variable moved.
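To make that A/B readable, record which variants served each request alongside your quality signals. A minimal sketch; `logInference` is a hypothetical helper standing in for your analytics client, and `requestId` is assumed to come from your request context:

// Attach the served variants to each request so quality metrics
// can later be segmented by model and prompt version.
function logInference(event: {
  requestId: string;
  model: string;
  promptVariant: 'v1' | 'v2';
  promptTokens: number;
  completionTokens: number;
}) {
  // Hypothetical sink: replace with your analytics or observability client.
  console.log(JSON.stringify({ type: 'llm_inference', ...event }));
}

logInference({
  requestId,
  model,
  promptVariant: useNewPrompt ? 'v2' : 'v1',
  promptTokens: response.usage?.prompt_tokens ?? 0,
  completionTokens: response.usage?.completion_tokens ?? 0,
});

Without that attribution, a thumbs-down spike tells you something regressed but not whether the model or the prompt caused it.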
Rollout Strategy That Actually Works
Start with internal users. Target employees or beta opt-ins first. Get qualitative feedback before you expose real customers to a model change.
Ramp slowly on a quality signal. Pick one metric you trust — user thumbs-down rate, task completion, session length — and define a threshold before moving to the next percentage increment.
Keep the kill switch ready. If the new model starts producing unexpected outputs, flip the flag off from the dashboard. No redeploy. No incident bridge. You're back to the last known good model in seconds.
Watch cost per request. Flag rollout percentages are a cost dial too. You can ramp to 20% for a week to measure token spend before deciding whether the quality gain is worth the invoice (the sketch after this list shows both the kill switch and the cost tracking in code).
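Here's a minimal sketch of those last two points together: a safe default that makes the kill switch instant, plus per-request cost tracking from actual token usage. The per-token prices are illustrative, and a single blended rate is used for simplicity (real rate cards price input and output tokens differently):

// "Flag off" always resolves to the last known good model, so
// flipping the flag in the dashboard reverts every new request
// in seconds, with no redeploy.
const LAST_KNOWN_GOOD = 'gpt-4o';
const model =
  (client.evaluate('llm-model-version').getValue() as string) ??
  LAST_KNOWN_GOOD;

const response = await openai.chat.completions.create({
  model,
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userMessage },
  ],
});

// Cost per request from actual usage; illustrative blended rates only.
const pricePer1MTokens: Record<string, number> = {
  'gpt-4o': 2.5,
  'gpt-4o-mini': 0.15,
};
const totalTokens = response.usage?.total_tokens ?? 0;
const costUsd = (totalTokens * (pricePer1MTokens[model] ?? 0)) / 1_000_000;
console.log({ model, totalTokens, costUsd });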
Beyond Model Version: Other Things Worth Flagging
Once you have the pattern in place, the same approach applies to any part of your inference config: temperature and top-p sampling parameters, context window size, retrieval strategies in RAG pipelines, and fallback provider routing (OpenAI → Anthropic when rate limits hit). Each one is a config change that benefits from gradual exposure and an instant rollback path.
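As one example, the fallback route itself can sit behind a flag, so cross-provider failover is enabled gradually and switched off instantly if it misbehaves. A hedged sketch: the Anthropic client, the `anthropic-fallback` flag name, and the model names are assumptions, not part of the setup above:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function completeWithFallback(userMessage: string): Promise<string> {
  try {
    const res = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: userMessage }],
    });
    return res.choices[0].message.content ?? '';
  } catch (err: any) {
    // Reroute only on rate limits, and only if the fallback flag is on.
    const rateLimited = err?.status === 429;
    if (!rateLimited || !client.evaluate('anthropic-fallback').isOn()) {
      throw err;
    }
    const res = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-latest', // example model name
      max_tokens: 1024,
      messages: [{ role: 'user', content: userMessage }],
    });
    const first = res.content[0];
    return first.type === 'text' ? first.text : '';
  }
}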
👉 Featureflow supports string-valued flags out of the box — no extra infrastructure needed. See how targeting and percentage rollouts work at featureflow.com or browse the docs.
#FeatureFlags #LLM #AIDevelopment #ContinuousDelivery #MLOps
Ship your next model upgrade with confidence
Start free with Featureflow — gradual rollouts and kill switches for your entire stack, including your AI layer.
Start Now (Free)

Related Articles
Why Feature Flags Are the Safety Net Every AI-Powered Dev Team Needs
Agentic AI ships code to production at sprint speed—but without guardrails, velocity becomes risk. Here's how feature flags keep humans in control.
Testing Code That Has Feature Flags: Strategies That Don't Explode Your Test Matrix
Add ten boolean flags and you have 1,024 versions of your app — in theory. Here's how to keep your test suite tractable: stub the SDK, pin variants per test, default to safe, and clean up alongside the flag.
Feature Flags and the Strangler Fig: Refactor Legacy Code Without the Big-Bang Rewrite
Big-bang rewrites kill teams. The strangler fig pattern with feature flags lets you replace legacy code one slice at a time — shadow-testing, ramping traffic, and keeping a kill switch the whole way.