Last Updated: March 18, 2026
There are so many LLMs that it's hard to keep track of them all.
Proprietary providers like OpenAI, Anthropic, Google, Meta, Mistral, and Cohere compete with a flood of open-source models, each with its own hosting providers, SDKs, and pricing models. There is a new "best" model every week. How do you even start?
The good news is that calling an LLM API is basically the same as calling any other REST API: you send a request and get a response. The core ideas apply to every provider, and once you understand one, it doesn't take long to switch between models.
This chapter will teach you how to use a single unified gateway to make API calls to multiple LLMs, how to securely handle authentication, how to understand the structure of requests and responses, and how to deal with the errors that might happen in production.
Before we write any code, let’s build a clear mental model for how we will work with LLM providers throughout this course.
In this course, we will use OpenRouter, an AI gateway that gives you access to 200+ models, including models from OpenAI, Anthropic, Google, Meta, Mistral, and more, through a single API key using the OpenAI SDK.
Instead of managing multiple SDKs, multiple API keys, and different response formats for each provider, you integrate with OpenRouter once, and it routes your request to whichever model you want to use.
Your application talks to OpenRouter through one consistent interface. OpenRouter takes care of routing, authentication, and response normalization across the underlying providers.
The best part is that you only need to write the API integration once. After that, switching from GPT to Claude to Llama is as simple as changing the model name string.
You can find the available free models in OpenRouter here, and explore the full list of supported models here.
OpenRouter gives you access to most of the popular models. If you want to go through this course without spending any money, here are a few free models you can use:
In this course, I have included a Run button wherever feasible so you can execute the Python code directly in your browser. Some heavier and more time-consuming scripts have been omitted, but you can still run them on your local system.
Every LLM provider uses API keys to authenticate requests. Before we write any model-calling code, it is important to set this up correctly.
Get your OpenRouter key at openrouter.ai. Create a free account and generate a key. This single key gives you access to 200+ models from major providers such as OpenAI, Anthropic, Google, Meta, and Mistral.
If you plan to use only the free models, you do not need to add payment details. OpenRouter offers 25+ free models with limited daily usage (50 requests per day). Check out their pricing page for more details.
This may sound obvious, but it is one of the most common mistakes beginners make.
Never do this:
Instead, store your key in an environment variable.
There are two common ways to do that.
.env file (recommended for projects)

Create a .env file in the root of your project and set your API key:
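A minimal .env needs just one line (the value shown is a placeholder):

```
OPENROUTER_API_KEY=sk-or-v1-your-key-here
```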
Then load it in Python using the python-dotenv package:
And always add .env to your .gitignore:
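For example:

```
# keep secrets out of version control
.env
```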
Let's install the client libraries we will need:
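Something like the following, assuming pip (python-dotenv is optional but convenient for the .env approach above):

```shell
pip install openai python-dotenv
```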
The openai package is the only dependency you strictly need (python-dotenv is just a convenience for loading .env files). OpenRouter uses the same API format as OpenAI, so the OpenAI SDK works out of the box, no matter which underlying model you are calling.
Every LLM API call follows the same fundamental pattern, regardless of provider.
At the heart of every LLM request is a messages array. This is a list of messages that represent a conversation. Each message has two fields: a role and content.
There are three roles: system sets the model's behavior and instructions, user carries your input, and assistant holds the model's replies.
Here is what a typical messages array looks like:
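For example, a short conversation using all three roles (the content is illustrative):

```python
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What is an API key?"},
    {"role": "assistant", "content": "A secret token that identifies your app to a service."},
    {"role": "user", "content": "Where should I store mine?"},
]
```

Note that the conversation history is just data you send with each request; the API itself is stateless.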
Here is what happens when you make an LLM API call:
The flow is straightforward: your application sends an HTTPS request containing the messages array, OpenRouter authenticates it and routes it to the chosen provider, the model generates a response, and the result comes back to you in a normalized format.
Here is the complete code to make your first LLM API call. The code is minimal, and you will recognize the pattern immediately if you have ever used the OpenAI SDK before.
Let's break down what is happening here.
The base_url parameter is the key. By setting it to https://openrouter.ai/api/v1, you redirect the OpenAI SDK to send all requests to OpenRouter instead of OpenAI's servers. The request format, response format, and SDK methods remain identical; only the destination changes.
Model names follow the format provider/model-name. So openai/gpt-4o-mini tells OpenRouter "route this to OpenAI's gpt-4o-mini model".
The chat.completions.create() method is the main endpoint for generating text. You specify:

- model: the model to route to, in provider/model-name format.
- messages: the conversation history as a list of role/content messages.
- Optional generation parameters such as max_tokens.

The response object is the standard OpenAI format regardless of which model you call. choices[0].message.content holds the generated text, and usage gives you token counts for billing.
The response looks like this (simplified):
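The field values here are illustrative:

```json
{
  "id": "gen-abc123",
  "model": "openai/gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "An LLM gateway is ..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 12, "total_tokens": 37}
}
```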
A few fields worth understanding:
- finish_reason: "stop" means the model finished naturally; "length" means it hit the token limit, which usually means your output was cut off.

Before we move to error handling, let's talk about tokens, because they determine both the cost of your API calls and the limits you will hit.
A token is roughly 3/4 of a word in English. The word "hamburger" is two tokens ("ham" and "burger"). Common words like "the" or "is" are single tokens. Code tends to use more tokens per line than English prose because of special characters and variable names.
Here is a rough rule of thumb: one token is about four characters of English text, so 100 tokens comes out to roughly 75 words.
Why does this matter? Two reasons.
Cost: Every provider charges per token. Input tokens (your prompt) are cheaper than output tokens (the model's response). If input tokens cost about $2.50 per million and output tokens cost about $10 per million, a request with a 500-token prompt and a 200-token response costs roughly $0.003. This might look cheap for a single call, but it adds up at scale.
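The arithmetic above can be checked with a few lines; the default prices are the hypothetical ones from the example:

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_m: float = 2.50,
                 output_price_per_m: float = 10.00) -> float:
    """Cost in dollars for one request, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

# 500 * 2.50/M + 200 * 10.00/M = 0.00125 + 0.00200 = 0.00325 dollars
cost = request_cost(500, 200)
```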
Context window: Every model has a maximum number of tokens it can process in a single request (input + output combined). GPT-5.1 supports 400k tokens. Claude Opus 4.6 supports 1M tokens. If your messages array exceeds this limit, the API will return an error.
When calling models through OpenRouter, you can estimate token counts using the tiktoken library before sending a request. It runs offline and returns results instantly, no API call needed.
For non-OpenAI models (Claude, Gemini, Llama), tiktoken gives a reasonable estimate. The exact count will differ slightly because each model uses its own tokenizer vocabulary, but the estimate is close enough for planning purposes. The actual token usage always comes back in response.usage after the call completes.
LLM APIs fail in ways that are easy to predict, and the difference between a prototype and a production application is how well they handle these failures.
Here are the errors you will encounter most often: rate limits (HTTP 429), authentication failures (HTTP 401), transient server errors (HTTP 5xx), and network timeouts. Rate limits and server errors are temporary and worth retrying with backoff; an authentication failure will not fix itself, so retrying it is pointless.
Here is a complete implementation:
Usage is simple: just pass the same arguments you would pass to chat.completions.create().
By default, the SDK will wait a long time for a response. For production use, you should set an explicit timeout:
For requests that generate long outputs, you might need longer timeouts. A good starting point is 30 seconds for short responses and 60 seconds for longer ones.
Now let's combine everything into a single script that sends the same prompt to multiple models and compares the results. This is the power of building on OpenRouter: one client, one API key, any model.
When you run this script, you will see responses to the same prompt from three different model families, with latency and token counts for each. A few things you will likely notice:
The key point is that none of these differences required you to write separate API calls. You changed a string, and the rest just worked.