Iterating in Production

Last Updated: March 15, 2026

Ashish Pratap Singh

Launching an AI system is not the end of development. In practice, AI applications improve continuously after they reach production. New data arrives, user behavior evolves, models become outdated, and better techniques emerge. To keep the system effective, teams must regularly refine models, update features, and improve system behavior based on real-world feedback.

Iterating in production means treating AI systems as living systems rather than static software. Engineers monitor performance, collect new data, retrain models, run experiments, and gradually deploy improvements while ensuring the system remains stable and reliable.

This process often involves techniques such as A/B testing, shadow deployments, canary releases, and continuous model retraining. Each change must be evaluated carefully to ensure it improves the system without introducing regressions.

In this chapter, you will learn how modern AI teams iterate on systems safely in production.

Collecting User Feedback

The most direct signal you can get is a user telling you whether the response was useful. The challenge is that most users will not do this unless you make it almost effortless. A thumbs up or thumbs down takes one click. A text box asking "what was wrong?" gets maybe a 2% response rate. Both are valuable for different reasons.

Think of feedback as a pyramid. At the base, you have implicit feedback: did the user copy the response, edit it, or immediately type a follow-up question? In the middle, you have low-friction explicit feedback: thumbs up/down, a star rating, a single-click correction. At the top, you have detailed explicit feedback: free text, correction submissions, detailed reports. You will collect a lot of the base, some of the middle, and a little of the top, and all three layers tell you something different.

The diagram captures the full feedback surface. Notice that both the explicit clicks and the implicit behaviors feed into the same store. Implicit signals are especially valuable because they require no user effort, and they scale to 100% of your traffic.

Here is a minimal feedback collection API that captures all three levels:

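A minimal sketch of what this might look like, using SQLite and only the standard library. The schema, table name, and feedback level names here are illustrative, not a fixed API:

```python
import sqlite3
import time

# Illustrative schema: every feedback record stores the full request
# context (system_prompt, user_input, model, prompt_version) so that
# ratings can later be grouped by prompt version, input type, or model.
SCHEMA = """
CREATE TABLE IF NOT EXISTS feedback (
    request_id     TEXT NOT NULL,
    level          TEXT NOT NULL,   -- 'implicit', 'rating', or 'detailed'
    signal         TEXT NOT NULL,   -- e.g. 'copied', 'thumbs_down', 'correction'
    comment        TEXT,            -- free text, detailed level only
    system_prompt  TEXT NOT NULL,
    user_input     TEXT NOT NULL,
    model          TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    created_at     REAL NOT NULL
)
"""

class FeedbackStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(SCHEMA)

    def record(self, request_id, level, signal, context, comment=None):
        """Store one feedback event alongside its full request context."""
        self.conn.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (
                request_id, level, signal, comment,
                context["system_prompt"], context["user_input"],
                context["model"], context["prompt_version"],
                time.time(),
            ),
        )
        self.conn.commit()
```

A thumbs-down click and an implicit copy event both go through the same `record` call, differing only in the `level` and `signal` values, which keeps all three layers of the pyramid in one queryable store.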

The key design decision here is storing the full request context alongside every feedback record. The system_prompt, user_input, model, and prompt_version fields turn raw ratings into something you can actually act on. Without this context, a thumbs down tells you something went wrong. With it, you can group failures by prompt version, by input type, or by model, and start seeing patterns.

Building the Feedback-to-Improvement Pipeline

Collecting feedback is only half the job. The other half is turning that data into action. A feedback database with thousands of thumbs-down records is a graveyard unless you have a process for reviewing it, categorizing it, and using it to change something.

The pipeline works in four stages. First, you aggregate and surface the worst failures, not one by one but as clusters. Second, you analyze what the failures have in common. Third, you update your prompt or model to address the pattern. Fourth, you measure whether the change actually helped.

Here is a simple analysis module that surfaces patterns in negative feedback:

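A sketch of one way to do this, assuming a feedback table with `signal`, `prompt_version`, `model`, and `user_input` columns as described earlier. The input-type heuristics are deliberately crude and purely illustrative:

```python
import sqlite3
from collections import Counter

def surface_failure_patterns(conn, top_n=5):
    """Group negative feedback into clusters rather than reviewing
    records one by one."""
    rows = conn.execute(
        "SELECT prompt_version, model, user_input FROM feedback "
        "WHERE signal = 'thumbs_down'"
    ).fetchall()

    by_version = Counter(r[0] for r in rows)
    by_model = Counter(r[1] for r in rows)

    # Crude input bucketing: each bucket is one simple heuristic.
    def bucket(text):
        if "```" in text or "def " in text:
            return "contains_code"
        if not text.isascii():
            return "non_english_or_unicode"
        return "plain_text"

    by_input_type = Counter(bucket(r[2]) for r in rows)

    return {
        "by_prompt_version": by_version.most_common(top_n),
        "by_model": by_model.most_common(top_n),
        "by_input_type": by_input_type.most_common(top_n),
    }
```

The point of the buckets is not precision but triage: even rough clusters like "contains code" or "non-English input" are enough to reveal which slice of traffic is driving the negative feedback.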


Once you have the negative examples in front of you, the goal is to look for patterns, not to fix individual cases. Maybe 40% of failures come from requests that include code snippets and your prompt never told the model how to handle code. Maybe users asking questions in Spanish consistently get worse answers. The pattern is the thing you fix, and the fix goes into the next prompt version.

A/B Testing Prompts and Models

Prompt engineering without measurement is just guessing with extra steps. You change the system prompt, deploy it, and tell yourself the responses feel better. But "feels better" is not a metric. A/B testing gives you a way to run two versions simultaneously and let user behavior tell you which one is actually better.

The mechanics are straightforward. For each incoming request, you assign the user to variant A or variant B. Both variants use the same user input but different prompts or models. You collect feedback for both and compare their satisfaction rates once you have enough data.

There is one important rule: assign at the session level, not the request level. If the same user sees both variants within the same session, the experience is inconsistent and the signals get noisy. Hash the session ID to determine which variant to use, and keep that assignment stable for the duration of the session.

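A minimal sketch of session-level assignment. The class and variant names are illustrative; the essential part is that the bucket is derived from a hash of the session ID, not stored anywhere:

```python
import hashlib

class ABTest:
    """Session-level variant assignment: the same session ID always
    hashes to the same bucket, so no assignment state is stored."""

    def __init__(self, variants=("A", "B"), split=0.5):
        self.variants = variants
        self.split = split

    def assign_variant(self, session_id: str) -> str:
        digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
        # Map the 128-bit hash to a bucket in [0, 1); stable per session.
        bucket = int(digest, 16) / 16**32
        return self.variants[0] if bucket < self.split else self.variants[1]
```

Because the mapping is deterministic, a user who returns mid-session keeps seeing the same variant, and you can reconstruct any session's assignment after the fact from the session ID alone.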

The assign_variant method uses an MD5 hash of the session ID to produce a stable bucket assignment. You do not need to store which variant each session is in. The same session ID will always hash to the same bucket, so the assignment is consistent across the lifetime of the session without any additional state.

When to call the test. The common mistake is running A/B tests for too short a time. A test with 50 samples per variant is almost useless. As a rough rule of thumb, run until you have at least 200 rated interactions per variant. More is better. For high-traffic applications, this might take hours. For lower-traffic internal tools, it might take weeks.

| Variant | Rated Interactions | Satisfaction Rate | Statistical Significance |
|---|---|---|---|
| Control (v1.2) | 412 | 71% | Baseline |
| Treatment (v1.3) | 98 | 74% | Too few samples |
| Treatment (v1.3) | 389 | 76% | Significant (p < 0.05) |

The table above illustrates why sample size matters. At 98 samples, a 3-point improvement looks promising but could easily be noise. At 389 samples, the same improvement becomes statistically reliable. When in doubt, wait longer before making a decision.
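One standard way to check whether a gap between two satisfaction rates is more than noise is a two-proportion z-test. A minimal sketch using only the standard library (|z| above roughly 1.96 corresponds to p < 0.05, two-sided):

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Z-statistic for the difference between two satisfaction rates.
    successes_* are thumbs-up counts, n_* are rated interactions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

Feeding in your own variant counts gives a single number to check against the 1.96 threshold, which is a more honest stopping rule than eyeballing the percentages.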

Measuring Business Impact

Satisfaction rates and thumbs-up percentages are proxy metrics. They tell you whether users liked the response, but they do not tell you whether the AI is actually helping the business. The metrics that matter depend on what the AI is supposed to do.

For a code assistant, you care about whether users accepted the generated code or deleted it. For a customer support bot, you care about whether the ticket was resolved without escalation to a human agent. For a document summarizer, you care about whether users needed to read the original document afterward. These are outcome metrics, and they are much harder to fake than satisfaction ratings.

Here is a lightweight business impact tracker that can measure a few common outcomes:

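A sketch of such a tracker, using only the standard library. The outcome names (such as `code_accepted`) are illustrative placeholders for whatever outcomes your product defines:

```python
import time
from collections import defaultdict

class OutcomeTracker:
    """Tracks outcome metrics per session, including time-to-outcome."""

    def __init__(self):
        self.sessions = {}              # session_id -> start timestamp
        self.outcomes = defaultdict(list)

    def start(self, session_id):
        self.sessions[session_id] = time.monotonic()

    def record_outcome(self, session_id, outcome: str):
        start = self.sessions.pop(session_id, None)
        # time_to_outcome_ms is None if the session start was never seen.
        elapsed_ms = None if start is None else (time.monotonic() - start) * 1000
        self.outcomes[outcome].append(
            {"session_id": session_id, "time_to_outcome_ms": elapsed_ms}
        )

    def rate(self, outcome, total_sessions):
        """Fraction of sessions that reached a given outcome."""
        return len(self.outcomes[outcome]) / total_sessions
```

The same structure works for any of the outcomes mentioned above: `record_outcome(sid, "ticket_escalated")` for a support bot, or `record_outcome(sid, "code_accepted")` for a code assistant.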

The metric that often gets overlooked is time. If the AI response cuts the time a user needs to complete a task from 5 minutes to 90 seconds, that is a concrete business outcome even if the satisfaction rate is only 65%. Tracking time_to_outcome_ms gives you a denominator for calculating time saved per session, which is the kind of number that justifies continued investment.

Closing the Loop: Turning Feedback into Improvements

The feedback loop is only closed when the data you collect actually changes something. The process is: pull negative examples, find the pattern, write a targeted fix into the prompt, deploy as a new version, measure whether the fix worked.

Here is a worked example. Suppose your analyzer shows that 38% of negative feedback comes from requests where the user asked a question in a language other than English, and your system prompt says nothing about language handling. The fix is small:

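A sketch of what the versioned fix might look like. The prompt text itself is hypothetical; the point is that only the language-handling paragraph changes between versions, and the version string travels with every request:

```python
# Hypothetical prompt versions; only the language-handling rule changes.
PROMPTS = {
    "v1.2": (
        "You are a helpful assistant. Answer clearly and concisely."
    ),
    "v1.3": (
        "You are a helpful assistant. Answer clearly and concisely.\n"
        "If the user writes in a language other than English, reply in "
        "the user's language unless they explicitly ask otherwise."
    ),
}

ACTIVE_PROMPT_VERSION = "v1.3"

def system_prompt() -> str:
    """Return the currently deployed prompt version."""
    return PROMPTS[ACTIVE_PROMPT_VERSION]
```

Keeping old versions in the map, rather than overwriting them, is what lets you replay the same failing inputs through v1.2 and v1.3 side by side.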

The negative feedback cases become your test suite. Before deploying v1.3, run your 20 most common multilingual failures through both prompts and compare the outputs manually. If v1.3 handles them better, you have evidence before you ship.

This principle scales. When you have enough corrections in your feedback database, those correction pairs become fine-tuning data. The user's original input is the training input, the user's corrected version is the target output, and the model's bad response is the negative example. Fine-tuning on real correction data tends to outperform fine-tuning on synthetic data because it is drawn from your actual distribution of user inputs.

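A sketch of such an export step, assuming a feedback table with `system_prompt`, `user_input`, and `correction` columns (the column names are illustrative). Each record becomes one chat-format training example:

```python
import json
import sqlite3

def export_fine_tuning_data(conn, out_path):
    """Export user corrections as chat-format JSONL for fine-tuning.
    Skips records with no correction or a trivially short one."""
    rows = conn.execute(
        "SELECT system_prompt, user_input, correction FROM feedback "
        "WHERE correction IS NOT NULL AND LENGTH(correction) > 20"
    ).fetchall()
    with open(out_path, "w", encoding="utf-8") as f:
        for system_prompt, user_input, correction in rows:
            # One training example per line: the user's corrected
            # version serves as the target assistant message.
            record = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": correction},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(rows)
```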

The export_fine_tuning_data function writes a JSONL file in the format expected by OpenAI's fine-tuning API and most other providers. The filter on LENGTH(correction) > 20 removes trivially short corrections that are unlikely to teach the model anything useful.

This diagram shows the full loop. Production traffic generates feedback, feedback gets analyzed for patterns, patterns drive prompt updates or fine-tuning data export, the new version gets deployed, and measurement closes the loop back into production. Each cycle through this loop should produce a measurable improvement. If your satisfaction rate is not trending upward over multiple cycles, the analysis step needs more attention.
