
Architecture Patterns for AI Applications

Last Updated: March 15, 2026

Ashish Pratap Singh

Most AI applications fail not because the models are weak, but because the surrounding architecture is poorly designed. A powerful model alone is not enough. To build reliable, scalable AI systems, you need the right architectural patterns that structure how models interact with data, tools, users, and other services.

In this chapter, you will learn four foundational architecture patterns for AI applications.

Why Architecture Matters More for AI Systems

Traditional web applications are fast and predictable. A database query takes 5ms. A REST API call to an internal service takes 50ms. You can fit 20 of them into a single user request and still respond in under a second.

LLM calls break those assumptions. A single GPT-4o call can take 2-10 seconds. An embedding call for a large document batch might take 30 seconds. A RAG pipeline that retrieves, re-ranks, and generates could take 15 seconds end-to-end. An agent that makes multiple tool calls might run for minutes.

This means the architecture decisions you make (whether a call is synchronous or asynchronous, how components communicate, where you put queues and caches) have an outsized impact on user experience and system reliability.

There is another dimension that makes AI architecture tricky: cost scales with usage in ways that traditional systems do not. Every LLM call costs money. Every embedding costs money. If your architecture accidentally processes the same document twice, or makes redundant LLM calls, or does not cache effectively, you burn cash. Architecture is not just about latency and reliability here. It is also about economics.

Let's look at the four patterns that cover the vast majority of production AI systems.

Pattern 1: Request-Response

This is the pattern we have been using since the beginning. A user sends a request, your system processes it with an LLM, and returns a response.

How It Works

The request-response pattern maps directly to the HTTP request/response cycle that every web developer already knows. The only difference is that the "processing" step involves an LLM call instead of (or in addition to) a database query.

The flow is simple:

  1. User sends a message or query
  2. API server receives the request and determines what processing is needed
  3. If context is required, a retrieval step (vector search, database lookup) runs first
  4. The LLM generates a response using the query and any retrieved context
  5. The response is sent back to the user

When This Pattern Fits

Request-response works well when:

  • The user expects a direct answer (chatbots, search, Q&A)
  • Processing time is tolerable (under 10-15 seconds)
  • Each request is independent and does not depend on previous processing
  • You need the simplest possible architecture to start with

Here is a minimal request-response system that takes a user's question, retrieves relevant documents, and generates an answer:

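A minimal sketch of that flow, with a stand-in `vector_search` and an OpenAI-style client call (both are illustrative; swap in your own retrieval layer and model):

```python
# Minimal request-response flow: retrieve context, then generate.
# `vector_search` is a stand-in for a real vector-database query, and the
# OpenAI-style client call is illustrative; adapt both to your stack.

def vector_search(query: str, top_k: int = 3) -> list[str]:
    # Stand-in retrieval. In production, embed the query and search a vector DB.
    corpus = {
        "refunds": "Refunds are issued within 5 business days.",
        "shipping": "Standard shipping takes 3-7 days.",
    }
    return [text for key, text in corpus.items() if key in query.lower()][:top_k]

def build_messages(query: str, context_docs: list[str]) -> list[dict]:
    # Assemble the prompt: retrieved context goes into the system message.
    context = "\n\n".join(context_docs) or "No relevant documents found."
    return [
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]

def answer_question(client, query: str) -> str:
    docs = vector_search(query)
    response = client.chat.completions.create(
        model="gpt-4o", messages=build_messages(query, docs)
    )
    return response.choices[0].message.content
```

Everything happens inside one request: retrieve, build the prompt, call the model, return the answer.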

This is the architecture behind most chatbots, customer support tools, and AI search features. It is simple, predictable, and easy to debug. The biggest drawback is latency. The user waits for retrieval and generation to complete before seeing anything. Streaming helps, but the fundamental issue is that everything happens synchronously in the request path.

Handling the Latency Problem

The most common enhancement to request-response is streaming. Instead of waiting for the full response, you stream tokens to the user as they are generated. This does not reduce total latency, but it dramatically improves perceived latency because the user sees text appearing within a second or two.


The other enhancement is caching. If the same question gets asked frequently (and it will in customer support), you can cache the answer and skip the LLM call entirely.
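The shape of that idea in a sketch, using an in-process dict keyed on a normalized question (a production system might use Redis with a TTL, or semantic caching on embeddings):

```python
# Simple answer cache keyed on a normalized question. The in-process dict
# is illustrative; production systems typically use Redis with a TTL.
import hashlib

_answer_cache: dict[str, str] = {}

def cache_key(question: str) -> str:
    # Normalize so trivial variations ("Refund policy?" vs "refund  policy?")
    # hit the same cache entry.
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(question: str, generate) -> str:
    key = cache_key(question)
    if key not in _answer_cache:
        _answer_cache[key] = generate(question)  # only call the LLM on a miss
    return _answer_cache[key]
```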

Pattern 2: Pipeline

The pipeline pattern is for workloads where data flows through multiple processing stages, each transforming or enriching it before passing it to the next. Think of it as a factory assembly line, but for data.

The Problem It Solves

Imagine you are building a system that processes uploaded documents for a knowledge base. When a user uploads a PDF, you need to:

  1. Extract text from the PDF
  2. Split the text into chunks
  3. Generate embeddings for each chunk
  4. Store the chunks and embeddings in a vector database
  5. Optionally, generate a summary of the document

Each step depends on the output of the previous one. If you tried to do all of this in a single request-response cycle, the user would wait minutes. And if embedding generation fails on chunk 47 of 200, you would lose all progress.

Pipelines solve this by breaking the work into discrete stages that can run independently, retry individually, and scale separately.

How It Works

Data flows from one stage to the next, optionally through message queues between stages. In a simple pipeline, stages call each other directly as functions. In a production pipeline, you put queues between stages so that each stage can run independently, scale horizontally, and retry failures without restarting the whole pipeline.

Why Queues Matter

Without queues, if the embedding stage fails, the whole pipeline fails. With queues, the chunking stage pushes its output to a queue and moves on to the next document. The embedding stage pulls from the queue at its own pace. If it crashes, the messages stay in the queue and get reprocessed when the service restarts. No data is lost, no work is duplicated.

This is not unique to AI systems. Traditional data engineering uses the same pattern (ETL pipelines, stream processing). The AI twist is that some stages are expensive and slow (embedding generation, LLM summarization), so the ability to scale them independently matters even more.

A Concrete Example: Document Ingestion Pipeline

Here is a simplified pipeline that processes a document through chunking, embedding, and storage. In production, you would add queues between stages, but the core logic is the same.

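A direct-call sketch of that pipeline; `embed_batch` and the storage layer are stand-ins for a real embedding API and vector database:

```python
# Simplified ingestion pipeline: chunk -> embed -> store, as direct function
# calls. `embed_batch` and the `store` dict are stand-ins; in production each
# stage would pull from a queue and push to the next.

def chunk_text(text: str, chunk_size: int = 200) -> list[str]:
    # Naive fixed-size chunking on characters; real systems split on
    # sentence or token boundaries, usually with overlap.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed_batch(chunks: list[str]) -> list[list[float]]:
    # Stand-in embedding: replace with a real embedding API call.
    return [[float(len(chunk))] for chunk in chunks]

def ingest_document(doc_id: str, text: str, store: dict) -> int:
    chunks = chunk_text(text)
    vectors = embed_batch(chunks)
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        # Each record is written independently, so a failure partway through
        # can be retried per chunk rather than per document.
        store[f"{doc_id}:{i}"] = {"text": chunk, "embedding": vector}
    return len(chunks)
```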

Pipeline Variants

Not all pipelines are linear. Here are two common variations:

Fan-out pipeline

One stage sends data to multiple downstream stages that run in parallel. For example, after chunking a document, you might generate embeddings and a summary simultaneously. Both results feed into a final storage stage.

Conditional pipeline

The path through the pipeline depends on the data. For example, if the uploaded file is a PDF, you run OCR. If it is a text file, you skip that stage. If the document is longer than 10,000 tokens, you generate a summary. If it is short, you skip summarization.
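A conditional pipeline can be sketched as a router that builds the stage list from the input; the file-type check and token threshold here are illustrative:

```python
# Conditional routing sketch: the path through the pipeline depends on the
# data. The file-type check and the 10,000-token threshold are illustrative.

def build_stages(filename: str, token_count: int) -> list[str]:
    stages = []
    if filename.lower().endswith(".pdf"):
        stages.append("ocr")        # PDFs need text extraction first
    stages += ["chunk", "embed", "store"]
    if token_count > 10_000:
        stages.append("summarize")  # only long documents get a summary
    return stages
```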

When This Pattern Fits

Pipelines work well when:

  • Work can be broken into sequential stages with clear inputs and outputs
  • The total processing time is too long for synchronous request-response
  • Individual stages need to scale independently (embedding generation is the bottleneck, not chunking)
  • You need retry logic at the stage level, not the pipeline level
  • Batch processing is involved (processing many documents, not just one)

Pipelines are not a good fit for interactive, real-time use cases where the user expects an immediate response. For those, use request-response (possibly with streaming).

Pattern 3: Agent Loop

The agent pattern is fundamentally different from the first two. In request-response and pipeline, you (the developer) define the exact sequence of operations. The system follows a predetermined path. In the agent pattern, the LLM decides what to do next.

How It Works

The agent loop follows a simple cycle: observe, think, act, repeat. The LLM receives the current state of the world (observations), decides what action to take (tool calls), executes the action, and feeds the result back as a new observation. This loop continues until the LLM decides the task is complete.

From an architecture standpoint, this loop has three properties that make it very different from the other patterns:

Unpredictable execution time

A request-response call might take 3 seconds. An agent might make 2 tool calls or 20. You cannot predict the total execution time, which makes timeout management and user experience design much harder.

Unpredictable cost

Every loop iteration involves at least one LLM call. An agent that takes 15 iterations to complete a task costs 15x more than a single LLM call. You need cost guardrails (maximum iterations, token budgets) or a single complex task could blow your monthly API budget.

Accumulated context

With each iteration, the conversation history grows. The LLM sees all previous observations and actions. This means later iterations are more expensive (more input tokens) and eventually you hit context window limits. Production agents need strategies for context management: summarizing previous steps, dropping irrelevant observations, or using a separate memory system.
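One of those strategies, a sliding window over the history, can be sketched in a few lines (the window size is an illustrative knob; real agents often summarize dropped turns instead of discarding them):

```python
# Context-management sketch: keep the system prompt plus the most recent
# turns, dropping older observations once the history gets long. The
# `keep_recent` window is an illustrative knob.

def trim_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    if len(messages) <= keep_recent + 1:
        return messages  # short enough, nothing to drop
    system = [m for m in messages[:1] if m["role"] == "system"]
    return system + messages[-keep_recent:]
```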

A Concrete Example: Research Agent

Here is a minimal agent loop that uses tools to gather information and synthesize an answer. The key architectural element is the loop itself and the exit conditions.


The critical architectural elements in this code are:

  1. The `max_iterations` guard. Without this, a confused agent could loop forever, burning tokens and money. In production, you would also add a token budget and a wall-clock timeout.
  2. The messages list as state. The entire agent state is the conversation history. This grows with each iteration, which means later iterations cost more.
  3. The exit condition. The loop ends when the LLM responds without tool calls, or when `max_iterations` is reached. Both paths need to return something useful to the user.

When This Pattern Fits

The agent pattern works well when:

  • The task requires multiple steps that cannot be predetermined
  • The system needs to react to intermediate results (search led to a dead end, try a different query)
  • The user delegates a complex, open-ended task ("research X and write a report")
  • You are willing to accept higher latency and cost for more capable behavior

The agent pattern is not a good fit when:

  • The processing steps are known in advance (use a pipeline instead)
  • Latency must be predictable and fast (use request-response)
  • Cost control is critical and per-request budgets are tight
  • Reliability requirements are very high (agents are inherently less predictable)

Pattern 4: Event-Driven

The event-driven pattern flips the trigger. Instead of a user explicitly requesting something, AI processing is triggered by events: a new document is uploaded, a customer sends an email, a code commit is pushed, a sensor reading exceeds a threshold.

How It Works

In an event-driven system, components communicate through events rather than direct calls. An event producer publishes an event to a message broker (like Kafka, RabbitMQ, or a cloud-native service like AWS SQS). One or more event consumers subscribe to those events and process them independently.

The AI component is just another consumer. It listens for specific events, runs its processing (which might involve LLM calls, embeddings, or agent loops), and either produces new events or writes results to a database.

Why This Pattern Is Powerful

Event-driven architecture gives you three things that are hard to get with the other patterns:

Decoupling

The system that uploads a document does not need to know that an AI summarizer exists. It just publishes a "document_uploaded" event. You can add, remove, or update AI consumers without touching the producer. This is huge for teams where the AI system is built by a different team than the core product.

Scalability

Each consumer scales independently. If summarization is the bottleneck, you spin up more summarizer instances. The classifier can run on a single instance if its load is light. The message broker handles the load distribution.

Resilience

If the AI summarizer crashes, events pile up in the queue. When the summarizer restarts, it picks up where it left off. No events are lost. No user sees an error. They just get their summary a few minutes later than usual.

A Concrete Example: Auto-Tagging New Documents

Here is a simplified event-driven system using Python queues to simulate message broker behavior. In production, you would replace the queue with Kafka, RabbitMQ, or a cloud service.

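A sketch of that system, using an in-process `queue.Queue` as the "broker" and a keyword lookup as a stand-in for the LLM classifier:

```python
# Event-driven sketch using an in-process queue.Queue as the message broker.
# The keyword-based tagger is a stand-in for an LLM classification call.
import queue

broker: queue.Queue = queue.Queue()

def publish(event_type: str, payload: dict) -> None:
    # Producers only know the event shape, not who consumes it.
    broker.put({"type": event_type, "payload": payload})

def tag_document(text: str) -> list[str]:
    # Stand-in classifier; in production this would be an LLM call.
    keywords = {"invoice": "finance", "contract": "legal", "resume": "hiring"}
    return sorted({tag for word, tag in keywords.items() if word in text.lower()})

def run_tagging_consumer(results: dict) -> None:
    # Drain the queue once; a real consumer would block on the broker forever
    # and acknowledge messages only after successful processing.
    while not broker.empty():
        event = broker.get()
        if event["type"] == "document_uploaded":
            doc = event["payload"]
            results[doc["id"]] = tag_document(doc["text"])
```

Note that the uploader only calls `publish`; it has no idea the tagging consumer exists, which is exactly the decoupling described above.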

Combining Event-Driven with Other Patterns

Event-driven is often not a standalone pattern. It is a trigger mechanism that kicks off one of the other patterns. A "document_uploaded" event might trigger a pipeline (chunking, embedding, storage). A "customer_email_received" event might trigger request-response (classify the email and generate a draft reply). A "complex_task_created" event might trigger an agent loop.

This composability is what makes the event-driven pattern so useful. It lets you decouple the "when should AI run?" question from the "how should AI process this?" question.

When This Pattern Fits

Event-driven works well when:

  • AI processing is triggered by system events, not user requests
  • Processing can happen asynchronously (user does not need an immediate response)
  • Multiple AI systems need to react to the same event
  • You need to scale AI processing independently of the main application
  • Reliability is critical (no events should be lost)

Event-driven is not a good fit when:

  • The user needs an immediate, synchronous response
  • The architecture complexity is not justified by the scale (for a single-user app, just call a function)
  • Event ordering matters and is hard to guarantee (some message brokers handle this, others do not)

Choosing the Right Pattern

Now that you have seen all four patterns, how do you decide which one to use? Here is a decision framework based on three questions.

Question 1: Who initiates the AI processing?

  • If the user initiates it and expects a response, start with request-response.
  • If a system event initiates it, start with event-driven.

Question 2: Are the processing steps known in advance?

  • If yes (extract, chunk, embed, store), use a pipeline.
  • If no (the LLM decides what to do next), use an agent loop.

Question 3: How important is latency?

  • Sub-second: request-response with caching
  • Seconds: request-response with streaming
  • Minutes: pipeline or agent loop
  • Doesn't matter: event-driven background processing

Here is the same framework as a decision table:

| Characteristic | Request-Response | Pipeline | Agent Loop | Event-Driven |
| --- | --- | --- | --- | --- |
| Trigger | User request | Scheduled or triggered | User task | System event |
| Latency | Low (seconds) | Medium-High (minutes) | High (variable) | Async (no user wait) |
| Steps | Fixed, few | Fixed, many | Dynamic | Depends on consumer |
| Cost predictability | High | High | Low | High per event |
| Complexity | Low | Medium | High | Medium |
| Best for | Chatbots, search, Q&A | Ingestion, ETL, batch | Research, complex tasks | Monitoring, auto-processing |

Real-World Systems Use Multiple Patterns

Most production AI systems are not purely one pattern. They combine patterns at different layers. Here are a few examples to make this concrete:

A customer support platform uses event-driven to trigger on incoming tickets, a pipeline to enrich the ticket with customer history and sentiment analysis, and request-response when the agent responds to the customer in real time.

A code review tool uses event-driven to trigger on pull request events, an agent loop to analyze the code changes and generate feedback, and request-response when the developer asks follow-up questions about specific feedback.

A document intelligence product uses event-driven to trigger on document upload, a pipeline for the ingestion process (OCR, chunking, embedding), and request-response for the search and Q&A interface that users interact with.

The point is not to pick one pattern and commit to it forever. The point is to recognize these patterns so you can compose them deliberately, rather than accidentally reinventing them in a tangled, hard-to-maintain way.

References