Last Updated: March 15, 2026
Most AI applications fail not because the models are weak, but because the surrounding architecture is poorly designed. A powerful model alone is not enough. To build reliable, scalable AI systems, you need the right architectural patterns that structure how models interact with data, tools, users, and other services.
In this chapter, you will learn four foundational architecture patterns for AI applications.
Traditional web applications are fast and predictable. A database query takes 5ms. A REST API call to an internal service takes 50ms. You can fit 20 of them into a single user request and still respond in under a second.
LLM calls break those assumptions. A single GPT-4o call can take 2-10 seconds. An embedding call for a large document batch might take 30 seconds. A RAG pipeline that retrieves, re-ranks, and generates could take 15 seconds end-to-end. An agent that makes multiple tool calls might run for minutes.
This means the architecture decisions you make (whether a call is synchronous or asynchronous, how components communicate, where you put queues and caches) have an outsized impact on user experience and system reliability.
There is another dimension that makes AI architecture tricky: cost scales with usage in ways that traditional systems do not. Every LLM call costs money. Every embedding costs money. If your architecture accidentally processes the same document twice, or makes redundant LLM calls, or does not cache effectively, you burn cash. Architecture is not just about latency and reliability here. It is also about economics.
Let's look at the four patterns that cover the vast majority of production AI systems.
This is the pattern we have been using since the beginning. A user sends a request, your system processes it with an LLM, and returns a response.
The request-response pattern maps directly to the HTTP request/response cycle that every web developer already knows. The only difference is that the "processing" step involves an LLM call instead of (or in addition to) a database query.
The flow is simple: the user sends a request, the server gathers any context it needs, calls the LLM, and returns the generated response, all within a single HTTP request/response cycle.

Request-response works well when:

- The user expects an answer within a few seconds
- The processing is a single, predictable sequence of steps (retrieve, generate, return)
- Each request is independent and needs no long-running background work
Here is a minimal request-response system that takes a user's question, retrieves relevant documents, and generates an answer:
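A minimal sketch in Python. Here `retrieve` and `generate_answer` are placeholder implementations standing in for a real vector store query and LLM call:

```python
def retrieve(question: str, top_k: int = 3) -> list[str]:
    # Placeholder: a real system would embed the question and query
    # a vector store. Here we score a tiny in-memory corpus by
    # naive keyword overlap.
    corpus = [
        "Refunds are processed within 5 business days.",
        "Support is available Monday through Friday.",
        "Passwords can be reset from the account settings page.",
    ]
    scored = [
        (sum(word in doc.lower() for word in question.lower().split()), doc)
        for doc in corpus
    ]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]


def generate_answer(question: str, context: list[str]) -> str:
    # Placeholder for the LLM call: a real system would send the
    # retrieved context plus the question to a chat model.
    return f"(answer to {question!r} using {len(context)} retrieved documents)"


def handle_request(question: str) -> str:
    # The whole flow runs synchronously in the request path:
    # the user waits for retrieval AND generation to finish.
    context = retrieve(question)
    return generate_answer(question, context)
```

Everything the user sees depends on both steps completing, which is exactly why latency dominates the design of this pattern.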
This is the architecture behind most chatbots, customer support tools, and AI search features. It is simple, predictable, and easy to debug. The biggest drawback is latency. The user waits for retrieval and generation to complete before seeing anything. Streaming helps, but the fundamental issue is that everything happens synchronously in the request path.
The most common enhancement to request-response is streaming. Instead of waiting for the full response, you stream tokens to the user as they are generated. This does not reduce total latency, but it dramatically improves perceived latency because the user sees text appearing within a second or two.
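The streaming shape can be sketched with a Python generator; the fixed answer string below stands in for tokens arriving incrementally from a model's streaming API:

```python
from typing import Iterator


def stream_answer(question: str) -> Iterator[str]:
    # Placeholder for a streaming LLM call: real SDKs yield tokens
    # as they arrive over the wire. We simulate by yielding one
    # word at a time from a canned answer.
    answer = "Refunds are processed within five business days."
    for token in answer.split():
        yield token + " "


# The caller renders tokens as they arrive instead of waiting
# for the complete response.
for token in stream_answer("How long do refunds take?"):
    print(token, end="", flush=True)
```

Total latency is unchanged, but the first token reaches the user almost immediately.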
The other enhancement is caching. If the same question gets asked frequently (and it will in customer support), you can cache the answer and skip the LLM call entirely.
The pipeline pattern is for workloads where data flows through multiple processing stages, each transforming or enriching it before passing it to the next. Think of it as a factory assembly line, but for data.
Imagine you are building a system that processes uploaded documents for a knowledge base. When a user uploads a PDF, you need to:

1. Extract the raw text (running OCR if the PDF is scanned)
2. Split the text into chunks
3. Generate an embedding for each chunk
4. Store the chunks and their embeddings in a vector database

Each step depends on the output of the previous one. If you tried to do all of this in a single request-response cycle, the user would wait minutes. And if embedding generation fails on chunk 47 of 200, you would lose all progress.
Pipelines solve this by breaking the work into discrete stages that can run independently, retry individually, and scale separately.
In diagrams of this pattern, solid arrows show data flow and dotted arrows show optional message queues between stages. In a simple pipeline, you might call functions directly. In a production pipeline, you put queues between stages so that each stage can run independently, scale horizontally, and retry failures without restarting the whole pipeline.
Without queues, if the embedding stage fails, the whole pipeline fails. With queues, the chunking stage pushes its output to a queue and moves on to the next document. The embedding stage pulls from the queue at its own pace. If it crashes, the messages stay in the queue and get reprocessed when the service restarts. No data is lost, no work is duplicated.
This is not unique to AI systems. Traditional data engineering uses the same pattern (ETL pipelines, stream processing). The AI twist is that some stages are expensive and slow (embedding generation, LLM summarization), so the ability to scale them independently matters even more.
Here is a simplified pipeline that processes a document through chunking, embedding, and storage. In production, you would add queues between stages, but the core logic is the same.
Not all pipelines are linear. Here are two common variations:
Fan-out: one stage sends data to multiple downstream stages that run in parallel. For example, after chunking a document, you might generate embeddings and a summary simultaneously. Both results feed into a final storage stage.

Conditional branching: the path through the pipeline depends on the data. For example, if the uploaded file is a PDF, you run OCR. If it is a text file, you skip that stage. If the document is longer than 10,000 tokens, you generate a summary. If it is short, you skip summarization.
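The conditional-branching logic can be sketched as a routing function that decides which stages a document passes through. The stage names and the 10,000-token threshold follow the example above; in a real pipeline each name would map to a stage function or queue:

```python
def plan_stages(filename: str, token_count: int) -> list[str]:
    # Decide the route through the pipeline based on the data itself.
    stages = []
    if filename.endswith(".pdf"):
        stages.append("ocr")          # scanned PDFs need OCR first
    stages.append("chunk")
    stages.append("embed")
    if token_count > 10_000:
        stages.append("summarize")    # only long documents get a summary
    stages.append("store")
    return stages
```

Keeping the routing decision in one place makes the branching testable without running any of the expensive stages.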
Pipelines work well when:

- The work naturally breaks into sequential stages, each consuming the previous stage's output
- Processing takes long enough that users should not wait for it synchronously
- Stages have different costs or failure modes, so they benefit from scaling and retrying independently
Pipelines are not a good fit for interactive, real-time use cases where the user expects an immediate response. For those, use request-response (possibly with streaming).
The agent pattern is fundamentally different from the first two. In request-response and pipeline, you (the developer) define the exact sequence of operations. The system follows a predetermined path. In the agent pattern, the LLM decides what to do next.
The agent loop follows a simple cycle: observe, think, act, repeat. The LLM receives the current state of the world (observations), decides what action to take (tool calls), executes the action, and feeds the result back as a new observation. This loop continues until the LLM decides the task is complete.
From an architecture standpoint, this loop has three properties that make it very different from the other patterns:
Unpredictable duration: a request-response call might take 3 seconds. An agent might make 2 tool calls or 20. You cannot predict the total execution time, which makes timeout management and user experience design much harder.

Unpredictable cost: every loop iteration involves at least one LLM call. An agent that takes 15 iterations to complete a task costs 15x more than a single LLM call. You need cost guardrails (maximum iterations, token budgets) or a single complex task could blow your monthly API budget.

Growing context: with each iteration, the conversation history grows. The LLM sees all previous observations and actions. This means later iterations are more expensive (more input tokens) and eventually you hit context window limits. Production agents need strategies for context management: summarizing previous steps, dropping irrelevant observations, or using a separate memory system.
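One simple context-management strategy, keeping the system prompt plus only the most recent messages that fit a budget, can be sketched like this (character counts stand in for real token counting, which would use the model's tokenizer):

```python
def trim_history(messages: list[dict], max_chars: int = 4000) -> list[dict]:
    # Always keep the system prompt; then walk backwards through the
    # rest, keeping the newest messages until the budget is exhausted.
    system, rest = messages[:1], messages[1:]
    kept: list[dict] = []
    total = 0
    for msg in reversed(rest):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```

Dropping the oldest observations is the crudest strategy; summarizing them before dropping preserves more signal at the cost of an extra LLM call.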
Here is a minimal agent loop that uses tools to gather information and synthesize an answer. The key architectural element is the loop itself and the exit conditions.
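One way such a loop can look in Python. The LLM is stubbed with a deterministic placeholder (`fake_llm`) and the tools are toy functions; a real implementation would call a chat API with tool definitions and parse its structured response:

```python
from typing import Callable

# Hypothetical tool registry: each tool maps a name to a function.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_docs": lambda query: f"3 documents found for '{query}'",
}


def fake_llm(history: list[str]) -> dict:
    # Stand-in for a real LLM call. This stub requests one tool call,
    # then declares the task complete once it sees an observation.
    if not any(line.startswith("observation:") for line in history):
        return {"action": "tool", "tool": "search_docs", "input": history[0]}
    return {"action": "final", "answer": "Answer synthesized from tool results."}


def run_agent(task: str, max_iterations: int = 5) -> str:
    history = [task]
    for _ in range(max_iterations):
        decision = fake_llm(history)                          # think
        if decision["action"] == "final":                     # exit 1: model is done
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])   # act
        history.append(f"observation: {result}")              # observe
    # Exit 2: iteration budget exhausted. Still return something useful.
    return "Stopped after reaching the iteration limit; last result: " + history[-1]
```

The loop body is trivially small; the architectural weight is in the two exit conditions and the iteration cap.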
The critical architectural elements in this code are the exit conditions: the loop ends either when the LLM returns a final answer without requesting another tool call, or when max_iterations is reached. Both paths need to return something useful to the user.

The agent pattern works well when:

- The task requires multiple steps whose exact sequence cannot be determined in advance
- The system needs to react to intermediate results and adjust its plan
- Users can tolerate variable latency while the agent iterates

The agent pattern is not a good fit when:

- The sequence of steps is known in advance (use request-response or a pipeline instead)
- Latency or cost must be tightly bounded, since every extra iteration adds both
- The task is simple enough that a single LLM call would do the job
The event-driven pattern flips the trigger. Instead of a user explicitly requesting something, AI processing is triggered by events: a new document is uploaded, a customer sends an email, a code commit is pushed, a sensor reading exceeds a threshold.
In an event-driven system, components communicate through events rather than direct calls. An event producer publishes an event to a message broker (like Kafka, RabbitMQ, or a cloud-native service like AWS SQS). One or more event consumers subscribe to those events and process them independently.
The AI component is just another consumer. It listens for specific events, runs its processing (which might involve LLM calls, embeddings, or agent loops), and either produces new events or writes results to a database.
Event-driven architecture gives you three things that are hard to get with the other patterns:
The system that uploads a document does not need to know that an AI summarizer exists. It just publishes a "document_uploaded" event. You can add, remove, or update AI consumers without touching the producer. This is huge when the AI system is built by a different team than the one building the core product.
Each consumer scales independently. If summarization is the bottleneck, you spin up more summarizer instances. The classifier can run on a single instance if its load is light. The message broker handles the load distribution.
If the AI summarizer crashes, events pile up in the queue. When the summarizer restarts, it picks up where it left off. No events are lost. No user sees an error. They just get their summary a few minutes later than usual.
Here is a simplified event-driven system using Python queues to simulate message broker behavior. In production, you would replace the queue with Kafka, RabbitMQ, or a cloud service.
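A minimal sketch with one producer and one consumer thread. The summarization step is a placeholder for a real LLM call, and `None` serves as a shutdown sentinel:

```python
import queue
import threading

events: queue.Queue = queue.Queue()   # stands in for Kafka/RabbitMQ/SQS
results: dict[str, str] = {}


def summarizer_consumer() -> None:
    # Pulls events at its own pace; if it crashed, unprocessed events
    # would remain in the queue for a restarted instance to pick up.
    while True:
        event = events.get()
        if event is None:             # shutdown sentinel
            break
        if event["type"] == "document_uploaded":
            # Placeholder for an LLM summarization call.
            results[event["doc_id"]] = f"summary of {event['doc_id']}"
        events.task_done()


worker = threading.Thread(target=summarizer_consumer)
worker.start()

# Producer: publishes events without knowing who (if anyone) consumes them.
events.put({"type": "document_uploaded", "doc_id": "report.pdf"})
events.put({"type": "document_uploaded", "doc_id": "notes.txt"})
events.put(None)
worker.join()
print(results)
```

Note that the producer never references the consumer: adding a second consumer (say, a classifier) would not require touching the upload code at all.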
Event-driven is often not a standalone pattern. It is a trigger mechanism that kicks off one of the other patterns. A "document_uploaded" event might trigger a pipeline (chunking, embedding, storage). A "customer_email_received" event might trigger request-response (classify the email and generate a draft reply). A "complex_task_created" event might trigger an agent loop.
This composability is what makes the event-driven pattern so useful. It lets you decouple the "when should AI run?" question from the "how should AI process this?" question.
Event-driven works well when:

- Processing is triggered by things that happen in the system (uploads, emails, commits) rather than by an explicit user request
- Producers and consumers are built by different teams and need to evolve independently
- You need resilience: events must survive consumer crashes and be processed eventually

Event-driven is not a good fit when:

- The user is waiting for the result in real time, since the extra hop through a broker adds latency
- The system is small and simple enough that direct calls are easier to build and debug
Now that you have seen all four patterns, how do you decide which one to use? Here is a decision framework based on three questions:

1. What triggers the processing: an explicit user request, or something that happens in the system?
2. Is the sequence of steps known in advance, or should the LLM decide what to do next?
3. Does the user need the result immediately, or can the work happen in the background?

Here is the same framework as a decision table:

| Trigger | Steps known in advance? | Immediate result needed? | Pattern |
|---|---|---|---|
| User request | Yes | Yes | Request-response |
| User request or event | Yes, multi-stage | No | Pipeline |
| User request or event | No, LLM decides | Usually no | Agent |
| System event | Either | No | Event-driven |
Most production AI systems are not purely one pattern. They combine patterns at different layers. Here are a few examples to make this concrete:
A customer support platform uses event-driven to trigger on incoming tickets, a pipeline to enrich the ticket with customer history and sentiment analysis, and request-response when the agent responds to the customer in real time.
A code review tool uses event-driven to trigger on pull request events, an agent loop to analyze the code changes and generate feedback, and request-response when the developer asks follow-up questions about specific feedback.
A document intelligence product uses event-driven to trigger on document upload, a pipeline for the ingestion process (OCR, chunking, embedding), and request-response for the search and Q&A interface that users interact with.
The point is not to pick one pattern and commit to it forever. The point is to recognize these patterns so you can compose them deliberately, rather than accidentally reinventing them in a tangled, hard-to-maintain way.