Last Updated: March 13, 2026
Many modern AI applications interact with external services such as APIs, databases, and model providers. These operations often involve waiting for network responses, which can slow down programs if handled sequentially.
Asynchronous Python allows your program to perform other work while waiting for these operations to complete. By using async and await, you can write code that handles many tasks concurrently without blocking execution.
In this chapter, you will learn the fundamentals of asynchronous programming in Python and how to use it to build responsive, efficient applications that scale to handle many tasks at once.
Before diving into code, let's build an intuition for what async actually means. The best analogy is a restaurant.
Imagine a waiter who takes one table's order, walks to the kitchen, stands there watching the chef cook, waits until the food is ready, brings it back to the table, and only then moves to the next table.
If there are 10 tables and each meal takes 15 minutes to prepare, the last table waits 150 minutes. This is absurd, but this is exactly how synchronous code handles I/O-bound tasks.
A real waiter takes Table 1's order, hands it to the kitchen, immediately walks to Table 2, takes their order, hands it to the kitchen, moves to Table 3, and so on. When any table's food is ready, the waiter picks it up and delivers it.
All 10 tables get served in roughly 15-20 minutes because the waiter never stands around waiting. The kitchen (the external API) is the bottleneck, not the waiter (your program).
Async programming works the same way. Your API calls do not get faster individually. But by overlapping the waiting time, the total wall-clock time drops dramatically.
In the synchronous case, three 5-second calls take 15 seconds total. In the async case, those same three calls overlap and finish in about 5 seconds total.
Python's async programming is built on the asyncio module. Let's break down the core concepts.
The event loop is the engine that makes async work. It maintains a queue of tasks, runs them until each one hits a waiting point (like an API call), then switches to another task that is ready to make progress.
You almost never interact with the event loop directly. You just write async functions and let the loop handle the scheduling.
Two keywords are all you need to get started.
- async def declares a coroutine function. Calling it does not execute the function immediately. It returns a coroutine object that the event loop can schedule.
- await pauses the current coroutine and hands control back to the event loop. When the awaited operation finishes, the coroutine resumes from where it left off.

Here is a simple example that simulates API calls using asyncio.sleep:
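A minimal sketch of such an example (the function names are illustrative; asyncio.sleep stands in for a real API call):

```python
import asyncio
import time

async def fake_api_call(name: str, delay: float) -> str:
    # Non-blocking pause: the event loop is free to run other tasks meanwhile.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list[str]:
    # Each await finishes before the next begins: sequential, not concurrent.
    results = []
    for name, delay in [("call-1", 0.2), ("call-2", 0.3), ("call-3", 0.1)]:
        results.append(await fake_api_call(name, delay))
    return results

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")  # roughly 0.6s total
```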
This code runs three simulated API calls sequentially. Even though we are using async syntax, the three await calls happen one after another, so the total time is 0.2 + 0.3 + 0.1 = 0.6 seconds. We are not getting any concurrency benefit yet. To overlap these calls, we need asyncio.gather(), which we will cover shortly.
The important thing to understand: await asyncio.sleep(delay) is non-blocking. It tells the event loop "I am going to be idle for this long, go do something else." If we had scheduled other tasks, the event loop would run them during this waiting period.
Compare this to time.sleep(delay), which freezes your entire program: nothing else can run until it returns. That is why you should never use time.sleep() in async code.
asyncio.run() is the entry point. It creates an event loop, runs your top-level coroutine, and cleans up when done. You call it once, from regular (non-async) code:
asyncio.run(main())
This is the standard way to launch async programs. One caveat: you cannot call asyncio.run() if an event loop is already running, as is the case inside a Jupyter notebook, which runs its own loop.
asyncio.sleep is great for learning, but real AI engineering means making real HTTP calls. The requests library does not support async. Instead, use httpx, which provides both sync and async clients with an almost identical API.
A few things to notice:
- httpx.AsyncClient is used as an async context manager (async with). This ensures the connection pool is properly opened and closed.
- await client.post(...) is the async version of requests.post(...). The syntax is nearly identical.
- Always set a timeout. LLM APIs can hang, and you do not want your program waiting forever.
- Reuse the AsyncClient across calls. It maintains a connection pool internally, which avoids the overhead of establishing a new TCP connection for every request.

This is where async starts paying off. asyncio.gather() takes multiple coroutines and runs them concurrently. It returns a list of results in the same order as the input coroutines.
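A sketch of gather in action, with asyncio.sleep standing in for three model calls of different latencies (the names and delays are illustrative):

```python
import asyncio
import time

async def call_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a real API call
    return f"{name}: ok"

async def main() -> list[str]:
    # gather starts all three coroutines at once and returns results
    # in the same order the coroutines were passed in.
    return await asyncio.gather(
        call_model("model-a", 0.3),
        call_model("model-b", 0.4),
        call_model("model-c", 0.2),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")  # about 0.4s: the slowest call, not the sum
```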
Without gather, calling three models sequentially takes 0.3 + 0.4 + 0.2 = 0.9 seconds. With gather, all three calls start at the same time, and the total time is determined by the slowest call: 0.4 seconds. That is a 2.25x speedup with just one line change.
By default, if any coroutine in gather raises an exception, the entire gather call fails and the exception propagates immediately. This is usually not what you want when calling multiple LLM APIs, because one model timing out should not kill the results you already got from the others.
The return_exceptions=True flag changes this behavior. Instead of raising, exceptions are returned as regular items in the results list:
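A sketch of this behavior, with a deliberately failing call in the middle (the simulated TimeoutError stands in for a model timing out):

```python
import asyncio

async def call_model(name: str, fail: bool = False) -> str:
    await asyncio.sleep(0.1)  # stands in for a real API call
    if fail:
        raise TimeoutError(f"{name} timed out")
    return f"{name}: ok"

async def main() -> list:
    # With return_exceptions=True, exceptions come back as items
    # in the results list instead of propagating and killing gather.
    return await asyncio.gather(
        call_model("model-a"),
        call_model("model-b", fail=True),
        call_model("model-c"),
        return_exceptions=True,
    )

results = asyncio.run(main())
successes = [r for r in results if not isinstance(r, Exception)]
print(successes)   # ['model-a: ok', 'model-c: ok']
print(results[1])  # the captured TimeoutError for model-b
```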
Two out of three succeeded, and the failure is captured without crashing the program. This is essential for production AI systems where you call multiple models or process large batches and need to handle partial failures gracefully.
Here is a problem you will hit immediately in production: API rate limits. OpenAI, Anthropic, and every other provider cap how many requests you can make per second or per minute.
If you fire off 100 concurrent requests with gather, you will get rate-limited (HTTP 429 errors) after the first handful. The solution is asyncio.Semaphore, which acts as a concurrency limiter.
A semaphore holds an internal counter. Each time a task acquires the semaphore, the counter decreases. When it reaches zero, any further tasks wait until a running task releases the semaphore.
With a semaphore of 3, only 3 tasks run at any given moment. As soon as one finishes, the next one in the queue starts. You get the concurrency benefit while staying within rate limits.
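A minimal sketch of this pattern: ten simulated calls, at most three in flight at once (the peak-concurrency counters exist only to demonstrate the limit):

```python
import asyncio

peak = 0     # highest number of tasks observed running at once
running = 0  # tasks currently inside the semaphore

async def limited_call(sem: asyncio.Semaphore, i: int) -> int:
    global peak, running
    async with sem:  # waits here if 3 tasks already hold the semaphore
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.1)  # stands in for an API call
        running -= 1
    return i

async def main() -> list[int]:
    sem = asyncio.Semaphore(3)
    return await asyncio.gather(*(limited_call(sem, i) for i in range(10)))

results = asyncio.run(main())
print(results, "peak concurrency:", peak)  # peak never exceeds 3
```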
Here is a more realistic version tailored to LLM API calls:
This RateLimitedClient wraps the semaphore and HTTP client together. You create it once, and every call through it automatically respects the concurrency limit. The max_concurrent parameter lets you tune it to whatever your API tier allows.
Tasks queue up behind the semaphore. Only 3 pass through at a time. When one completes and releases its semaphore slot, the next queued task enters.
asyncio.gather() waits for all tasks to finish and then returns all results at once. But sometimes you want to process results as they arrive, especially when tasks have very different latencies.
asyncio.as_completed() returns an iterator of futures that yield results in the order they finish, not the order they were submitted:
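A sketch of the completion-order behavior, with three simulated calls whose delays are deliberately staggered:

```python
import asyncio

async def call_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a real API call
    return name

async def main() -> list[str]:
    coros = [
        call_model("slow", 0.3),
        call_model("fast", 0.1),
        call_model("medium", 0.2),
    ]
    finished_order = []
    # as_completed yields awaitables in completion order,
    # not the order they were submitted.
    for future in asyncio.as_completed(coros):
        result = await future
        print("finished:", result)
        finished_order.append(result)
    return finished_order

order = asyncio.run(main())
print(order)  # ['fast', 'medium', 'slow']
```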
This is useful when you want to display partial results to a user, update a progress bar, or implement a "fastest model wins" pattern where you return the first successful response and cancel the rest.
Many LLM APIs support streaming, where the model sends tokens back incrementally as they are generated. In Python, async generators let you consume these token streams elegantly using async for.
An async generator is a function declared with async def that uses yield instead of return. The caller iterates over it with async for:
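A minimal sketch of the shape (the generator here just counts; the sleep stands in for waiting on a network chunk):

```python
import asyncio

async def count_up(n: int):
    # An async generator: declared with async def, produces values with yield.
    for i in range(n):
        await asyncio.sleep(0.01)  # stands in for awaiting the next chunk
        yield i

async def main() -> list[int]:
    received = []
    async for value in count_up(3):  # async for drives the generator
        received.append(value)
    return received

values = asyncio.run(main())
print(values)  # [0, 1, 2]
```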
You can also write your own async generators for custom streaming pipelines. For instance, here is a generator that simulates token streaming:
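A sketch of such a simulated token stream; splitting on whitespace is a crude stand-in for real tokenization:

```python
import asyncio

async def stream_tokens(text: str, delay: float = 0.02):
    # Simulates an LLM sending tokens back incrementally.
    for token in text.split():
        await asyncio.sleep(delay)  # stands in for per-chunk network latency
        yield token

async def main() -> str:
    pieces = []
    async for token in stream_tokens("hello from the fake model"):
        print(token, end=" ", flush=True)  # display each token as it arrives
        pieces.append(token)
    print()
    return " ".join(pieces)

reply = asyncio.run(main())
```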
Async generators are the foundation for building streaming UIs, real-time logging, and progressive response display in AI applications.