Last Updated: March 13, 2026
Many modern AI applications interact with external services such as APIs, databases, and model providers. These operations often involve waiting for network responses, which can slow down programs if handled sequentially.
Asynchronous Python allows your program to perform other work while waiting for these operations to complete. By using async and await, you can write code that handles many tasks concurrently without blocking execution.
In this chapter, you will learn the fundamentals of asynchronous programming in Python and how to use it to build responsive, efficient applications that scale to handle many tasks at once.
Before diving into code, let's build an intuition for what async actually means. The best analogy is a restaurant.
Imagine a waiter who takes one table's order, walks to the kitchen, stands there watching the chef cook, waits until the food is ready, brings it back to the table, and only then moves to the next table.
If there are 10 tables and each meal takes 15 minutes to prepare, the last table waits 150 minutes. This is absurd, but this is exactly how synchronous code handles I/O-bound tasks.
A real waiter takes Table 1's order, hands it to the kitchen, immediately walks to Table 2, takes their order, hands it to the kitchen, moves to Table 3, and so on. When any table's food is ready, the waiter picks it up and delivers it.
All 10 tables get served in roughly 15-20 minutes because the waiter never stands around waiting. The kitchen (the external API) is the bottleneck, not the waiter (your program).
Async programming works the same way. Your API calls do not get faster individually. But by overlapping the waiting time, the total wall-clock time drops dramatically.
In the synchronous case, three 5-second calls take 15 seconds total. In the async case, those same three calls overlap and finish in about 5 seconds total.
Python's async programming is built on the asyncio module. Let's break down the core concepts.
The event loop is the engine that makes async work. It maintains a queue of tasks, runs them until each one hits a waiting point (like an API call), then switches to another task that is ready to make progress.
You almost never interact with the event loop directly. You just write async functions and let the loop handle the scheduling.
Two keywords are all you need to get started.
- async def declares a coroutine function. Calling it does not execute the function immediately. It returns a coroutine object that the event loop can schedule.
- await pauses the current coroutine and hands control back to the event loop. When the awaited operation finishes, the coroutine resumes from where it left off.

Here is a simple example that simulates API calls using asyncio.sleep:
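A minimal sketch of such an example (the function names are illustrative; asyncio.sleep stands in for a real API call):

```python
import asyncio
import time

async def fake_api_call(name: str, delay: float) -> str:
    # Non-blocking pause: the event loop is free to run other tasks meanwhile.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list[str]:
    # Each await finishes before the next begins: sequential, not concurrent.
    results = []
    for name, delay in [("call-1", 0.2), ("call-2", 0.3), ("call-3", 0.1)]:
        results.append(await fake_api_call(name, delay))
    return results

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")  # roughly 0.6s total
```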
This code runs three simulated API calls sequentially. Even though we are using async syntax, the three await calls happen one after another, so the total time is 0.2 + 0.3 + 0.1 = 0.6 seconds. We are not getting any concurrency benefit yet. To overlap these calls, we need asyncio.gather(), which we will cover shortly.
The important thing to understand: await asyncio.sleep(delay) is non-blocking. It tells the event loop "I am going to be idle for this long, go do something else." If we had scheduled other tasks, the event loop would run them during this waiting period.
Compare this to time.sleep(delay), which freezes your entire program: nothing else can run until it returns. That is why you should never use time.sleep() in async code.
asyncio.run() is the entry point. It creates an event loop, runs your top-level coroutine, and cleans up when done. You call it once, from regular (non-async) code:
asyncio.run(main())
This is the standard way to launch async programs. One caveat: you cannot call asyncio.run() if an event loop is already running, as is the case inside a Jupyter notebook, which runs its own loop.
asyncio.sleep is great for learning, but real AI engineering means making real HTTP calls. The requests library does not support async. Instead, use httpx, which provides both sync and async clients with an almost identical API.
A few things to notice:
- httpx.AsyncClient is used as an async context manager (async with). This ensures the connection pool is properly opened and closed.
- await client.post(...) is the async version of requests.post(...). The syntax is nearly identical.
- Always set a timeout. LLM APIs can hang, and you do not want your program waiting forever.
- Reuse the AsyncClient across calls. It maintains a connection pool internally, which avoids the overhead of establishing a new TCP connection for every request.

This is where async starts paying off. asyncio.gather() takes multiple coroutines and runs them concurrently. It returns a list of results in the same order as the input coroutines.
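A sketch of gather in action, with asyncio.sleep standing in for three model calls of different latencies (the names and delays are illustrative):

```python
import asyncio
import time

async def call_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a real API call
    return f"{name}: ok"

async def main() -> list[str]:
    # gather starts all three coroutines at once and returns results
    # in the same order the coroutines were passed in.
    return await asyncio.gather(
        call_model("model-a", 0.3),
        call_model("model-b", 0.4),
        call_model("model-c", 0.2),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")  # about 0.4s: the slowest call, not the sum
```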
Without gather, calling three models sequentially takes 0.3 + 0.4 + 0.2 = 0.9 seconds. With gather, all three calls start at the same time, and the total time is determined by the slowest call: 0.4 seconds. That is a 2.25x speedup with just one line change.
By default, if any coroutine in gather raises an exception, the entire gather call fails and the exception propagates immediately. This is usually not what you want when calling multiple LLM APIs, because one model timing out should not kill the results you already got from the others.
The return_exceptions=True flag changes this behavior. Instead of raising, exceptions are returned as regular items in the results list:
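A sketch of this behavior, with a deliberately failing call in the middle (the simulated TimeoutError stands in for a model timing out):

```python
import asyncio

async def call_model(name: str, fail: bool = False) -> str:
    await asyncio.sleep(0.1)  # stands in for a real API call
    if fail:
        raise TimeoutError(f"{name} timed out")
    return f"{name}: ok"

async def main() -> list:
    # With return_exceptions=True, exceptions come back as items
    # in the results list instead of propagating and killing gather.
    return await asyncio.gather(
        call_model("model-a"),
        call_model("model-b", fail=True),
        call_model("model-c"),
        return_exceptions=True,
    )

results = asyncio.run(main())
successes = [r for r in results if not isinstance(r, Exception)]
print(successes)   # ['model-a: ok', 'model-c: ok']
print(results[1])  # the captured TimeoutError for model-b
```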
Two out of three succeeded, and the failure is captured without crashing the program. This is essential for production AI systems where you call multiple models or process large batches and need to handle partial failures gracefully.
Here is a problem you will hit immediately in production: API rate limits. OpenAI, Anthropic, and every other provider cap how many requests you can make per second or per minute.
If you fire off 100 concurrent requests with gather, you will get rate-limited (HTTP 429 errors) after the first handful. The solution is asyncio.Semaphore, which acts as a concurrency limiter.
A semaphore holds an internal counter. Each time a task acquires the semaphore, the counter decreases. When it reaches zero, any further tasks wait until a running task releases the semaphore.
With a semaphore of 3, only 3 tasks run at any given moment. As soon as one finishes, the next one in the queue starts. You get the concurrency benefit while staying within rate limits.
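A minimal sketch of this pattern: ten simulated calls, at most three in flight at once (the peak-concurrency counters exist only to demonstrate the limit):

```python
import asyncio

peak = 0     # highest number of tasks observed running at once
running = 0  # tasks currently inside the semaphore

async def limited_call(sem: asyncio.Semaphore, i: int) -> int:
    global peak, running
    async with sem:  # waits here if 3 tasks already hold the semaphore
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.1)  # stands in for an API call
        running -= 1
    return i

async def main() -> list[int]:
    sem = asyncio.Semaphore(3)
    return await asyncio.gather(*(limited_call(sem, i) for i in range(10)))

results = asyncio.run(main())
print(results, "peak concurrency:", peak)  # peak never exceeds 3
```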
Here is a more realistic version tailored to LLM API calls:
This RateLimitedClient wraps the semaphore and HTTP client together. You create it once, and every call through it automatically respects the concurrency limit. The max_concurrent parameter lets you tune it to whatever your API tier allows.
Tasks queue up behind the semaphore. Only 3 pass through at a time. When one completes and releases its semaphore slot, the next queued task enters.
asyncio.gather() waits for all tasks to finish and then returns all results at once. But sometimes you want to process results as they arrive, especially when tasks have very different latencies.
asyncio.as_completed() returns an iterator of futures that yield results in the order they finish, not the order they were submitted:
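A sketch of the completion-order behavior, with three simulated calls whose delays are deliberately staggered:

```python
import asyncio

async def call_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a real API call
    return name

async def main() -> list[str]:
    coros = [
        call_model("slow", 0.3),
        call_model("fast", 0.1),
        call_model("medium", 0.2),
    ]
    finished_order = []
    # as_completed yields awaitables in completion order,
    # not the order they were submitted.
    for future in asyncio.as_completed(coros):
        result = await future
        print("finished:", result)
        finished_order.append(result)
    return finished_order

order = asyncio.run(main())
print(order)  # ['fast', 'medium', 'slow']
```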
This is useful when you want to display partial results to a user, update a progress bar, or implement a "fastest model wins" pattern where you return the first successful response and cancel the rest.
Many LLM APIs support streaming, where the model sends tokens back incrementally as they are generated. In Python, async generators let you consume these token streams elegantly using async for.
An async generator is a function declared with async def that uses yield instead of return. The caller iterates over it with async for:
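A minimal sketch of the shape (the generator here just counts; the sleep stands in for waiting on a network chunk):

```python
import asyncio

async def count_up(n: int):
    # An async generator: declared with async def, produces values with yield.
    for i in range(n):
        await asyncio.sleep(0.01)  # stands in for awaiting the next chunk
        yield i

async def main() -> list[int]:
    received = []
    async for value in count_up(3):  # async for drives the generator
        received.append(value)
    return received

values = asyncio.run(main())
print(values)  # [0, 1, 2]
```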
You can also write your own async generators for custom streaming pipelines. For instance, here is a generator that simulates token streaming:
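A sketch of such a simulated token stream; splitting on whitespace is a crude stand-in for real tokenization:

```python
import asyncio

async def stream_tokens(text: str, delay: float = 0.02):
    # Simulates an LLM sending tokens back incrementally.
    for token in text.split():
        await asyncio.sleep(delay)  # stands in for per-chunk network latency
        yield token

async def main() -> str:
    pieces = []
    async for token in stream_tokens("hello from the fake model"):
        print(token, end=" ", flush=True)  # display each token as it arrives
        pieces.append(token)
    print()
    return " ".join(pieces)

reply = asyncio.run(main())
```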
Async generators are the foundation for building streaming UIs, real-time logging, and progressive response display in AI applications.