
Understanding AI Application Costs

Last Updated: March 15, 2026


Ashish Pratap Singh

Building AI applications is not just a technical challenge; it is also a cost management challenge. Unlike traditional software, AI systems introduce new cost drivers such as model inference, embeddings, vector storage, GPU usage, and external API calls. These costs can scale rapidly as usage grows.

In this chapter, we will break down the major cost components of AI applications and understand where money is actually spent.

How Token Pricing Works

If you have been using LLM APIs throughout this course, you know that you pay per token. But the pricing is more nuanced than "X dollars per token," and the differences between providers can significantly affect your bill.

The first thing to understand is that input tokens and output tokens are priced differently. Input tokens (your prompt, system message, retrieved context, conversation history) are cheaper. Output tokens (what the model generates) are more expensive, typically 4x to 5x more. This makes sense from the provider's perspective: generating tokens requires sequential computation, while processing input tokens can be parallelized.

The second thing is that model tiers vary wildly in price. A query that costs $0.01 on GPT-4o costs well under a tenth of a cent on GPT-4o-mini, roughly a 17x difference for a model that handles most straightforward tasks just fine. Choosing the right model tier for each task is one of the biggest cost levers you have (we will cover model routing in Chapter 11.3).

Here is what pricing looks like across major providers as of early 2025:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | o3-mini | $1.10 | $4.40 | 200K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| Meta | Llama 3.1 70B (hosted) | $0.35 | $0.40 | 128K |
| Meta | Llama 3.1 8B (hosted) | $0.05 | $0.08 | 128K |

Prices reflect typical rates from providers and hosting platforms. Check current pricing before making decisions, as these change frequently.

A few things jump out from this table. First, the gap between frontier models and small models is enormous. Gemini 1.5 Flash costs 40x less than Claude 3.5 Sonnet for input tokens ($0.075 vs $3.00 per million). Second, for the proprietary models, output tokens are consistently 4x to 5x more expensive than input tokens. Third, open-source models hosted on platforms like Together AI, Fireworks, or Groq are significantly cheaper than proprietary models, though quality trade-offs vary by task.

The diagram above shows a typical RAG query. Notice that most of the input tokens come from the retrieved context and chat history, not the user's actual question. This is a common pattern: the user types 20 words, but you send 3,600 tokens to the model. Understanding this breakdown is the first step to optimization.
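To make that concrete, here is a back-of-envelope calculation for a query of that shape at GPT-4o's rates from the table. The 3,600 input tokens come from the breakdown above; the 500 output tokens are an assumed figure for illustration:

```python
# One RAG query on GPT-4o: 3,600 input tokens, 500 output tokens (assumed)
INPUT_RATE = 2.50 / 1_000_000    # $ per input token (GPT-4o)
OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token (GPT-4o)

input_cost = 3_600 * INPUT_RATE    # $0.009
output_cost = 500 * OUTPUT_RATE    # $0.005
total = input_cost + output_cost   # $0.014 per query

print(f"${total:.3f} per query, "
      f"${total * 10_000 * 30:,.0f} per month at 10k queries/day")
```

Even at these small per-query numbers, 10,000 queries a day works out to thousands of dollars a month, which is why the optimizations in the rest of this module matter.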

The Hidden Costs

Token costs from your main LLM are the obvious expense. But in a production AI application, especially one using RAG or agents, the LLM generation call is just one of several cost centers. Let's walk through the costs that most teams do not notice until they add up.

Embedding Generation

Every time a user asks a question in a RAG system, you generate an embedding for the query. That is cheap: a fraction of a cent. But you also pay to embed all your documents during indexing. If you have 100,000 documents at 500 tokens each, that is 50 million tokens just for the initial index. And every time you update or add documents, you pay again.

| Embedding Model | Price (per 1M tokens) | Typical Use |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Cost-effective for most tasks |
| OpenAI text-embedding-3-large | $0.13 | Higher quality, larger dimensions |
| Cohere embed-v3 | $0.10 | Multilingual support |
| Voyage AI voyage-3 | $0.06 | Code and technical content |
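As a quick sketch of the indexing math for the 100,000-document corpus described above (one-time cost, but it recurs on every re-index):

```python
docs, tokens_per_doc = 100_000, 500
total_tokens = docs * tokens_per_doc   # 50 million tokens to embed

# Rates from the table above, in $ per 1M tokens
costs = {}
for model, rate_per_1m in [("text-embedding-3-small", 0.02),
                           ("text-embedding-3-large", 0.13)]:
    costs[model] = total_tokens / 1_000_000 * rate_per_1m
    print(f"{model}: ${costs[model]:.2f} for the full corpus")
```

The absolute numbers are small, which is exactly why teams forget them until they are re-embedding a large, frequently updated corpus every week.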

Re-ranking Calls

If you are using a re-ranker (as we discussed in Module 5), you are paying for a separate model call on every query. Re-rankers like Cohere Rerank process your top-k retrieved documents and reorder them. At $2.00 per 1,000 search queries, this adds up fast for high-traffic applications.

Evaluation and Testing

Running LLM-as-judge evaluations (Module 10) means you are making LLM calls to evaluate LLM outputs. If you evaluate every response in production, you are roughly doubling your LLM costs. Even running evaluations on a sample of 10% of queries adds a meaningful line item.

Retries and Fallbacks

API calls fail. Rate limits hit. Responses sometimes do not match your expected format, so you retry. Each retry is a full cost event. If your retry rate is 5%, your actual cost is 5% higher than your token-level math suggests. Some teams have retry rates of 15-20% without realizing it, because they retry silently on validation failures.

Vector Database Hosting

Your vector database is not free. Whether you are using Pinecone, Weaviate Cloud, or self-hosted Qdrant, there is a monthly cost for storage and query processing. Pinecone's standard tier starts around $70/month. For large indexes (millions of vectors), costs can reach $200-500/month.

The Full Picture

The vector DB cost is amortized: the monthly hosting cost divided by monthly query volume gives a per-query figure.

In this example, the hidden costs nearly double the per-query price. The LLM call itself is $0.014, but the total is $0.027 when you account for everything. At 10,000 queries per day, that hidden $0.013 per query adds up to $3,900 per month you were not tracking.
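The rollup itself is simple arithmetic. Here is a sketch with illustrative per-query numbers; these are assumptions for the sketch, not the exact figures behind the $0.027 example:

```python
queries_per_month = 10_000 * 30

# Illustrative per-query costs (assumed values)
llm_call  = 0.014                       # main generation call
rerank    = 2.00 / 1_000                # re-ranker at $2 per 1k queries
embedding = 0.0001                      # query embedding
evals     = 0.10 * llm_call             # LLM-as-judge on a 10% sample
retries   = 0.05 * llm_call             # 5% retry rate
vector_db = 150.0 / queries_per_month   # $150/month hosting, amortized

total = llm_call + rerank + embedding + evals + retries + vector_db
print(f"${total:.4f} per query vs ${llm_call:.4f} for the LLM call alone")
```

Whatever your actual numbers are, the shape of the calculation is the same: every per-query cost plus every fixed cost divided by query volume.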

Building Cost Tracking into Your Application

Now that you know where costs come from, let's build the instrumentation to track them. The goal is simple: every API call your application makes should log its token usage and cost, tagged with enough metadata to answer questions like "which feature is most expensive?" and "which users cost the most?"

Here is a cost tracker that wraps your LLM calls:

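A minimal sketch of such a tracker, assuming the OpenAI Python SDK's `chat.completions.create` interface; the pricing table, log file path, and record field names are illustrative choices:

```python
import json
import time
from pathlib import Path

# Illustrative rates in $ per 1M tokens; check current provider pricing.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

LOG_PATH = Path("cost_log.jsonl")

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call from its token counts."""
    rates = PRICING[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

def tracked_chat_completion(client, model, messages, *,
                            feature, user_id, is_retry=False):
    """Drop-in wrapper around client.chat.completions.create that logs cost."""
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    record = {
        "timestamp": time.time(),
        "model": model,
        "feature": feature,      # which part of the app made this call
        "user_id": user_id,      # lets you find expensive users
        "is_retry": is_retry,    # tracks what retries are costing you
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cost_usd": compute_cost(model, usage.prompt_tokens,
                                 usage.completion_tokens),
        "latency_s": round(time.time() - start, 3),
    }
    with LOG_PATH.open("a") as f:   # JSONL: one JSON object per line
        f.write(json.dumps(record) + "\n")
    return response
```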

The key design decisions here: we log to a JSONL file (one JSON object per line) so records are easy to query and aggregate later. Every record includes the feature tag, which lets you break down costs by what part of your application generated the call, and user_id, so you can identify expensive users. The is_retry flag helps you track how much money retries are costing you.

Here is how you use it:

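The call site simply swaps `client.chat.completions.create(...)` for `tracked_chat_completion(...)`. To keep this snippet self-contained and runnable without an API key, a trimmed stand-in tracker and a stub client are inlined below; in your application you would import your real tracker and pass an actual `OpenAI()` client:

```python
import json
from types import SimpleNamespace

def tracked_chat_completion(client, model, messages, *,
                            feature, user_id, is_retry=False):
    """Trimmed stand-in with the same signature; logs a minimal record."""
    response = client.chat.completions.create(model=model, messages=messages)
    record = {
        "model": model, "feature": feature,
        "user_id": user_id, "is_retry": is_retry,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }
    with open("cost_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response

# Stub client mimicking the OpenAI surface, so this runs without an API key.
stub_client = SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(
    create=lambda model, messages: SimpleNamespace(
        usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=300),
        choices=[SimpleNamespace(message=SimpleNamespace(content="stubbed answer"))],
    ))))

# The call site: same shape as a direct API call, plus cost metadata.
response = tracked_chat_completion(
    stub_client, "gpt-4o",
    [{"role": "user", "content": "What does the refund policy say?"}],
    feature="rag_qa", user_id="user_892",
)
print(response.choices[0].message.content)  # prints "stubbed answer"
```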

Every call is now automatically logged with token counts, costs, latency, and metadata. No changes are needed to your application logic; just replace direct API calls with tracked_chat_completion.

Building a Cost Dashboard

Logging costs is useless if you never look at the data. Let's build a simple dashboard that answers the three questions every team needs: how much are we spending per query, per feature, and per day?

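A sketch of that rollup over the JSONL log. A tiny demo log is inlined so the snippet runs standalone; the `cost_usd`, `feature`, and `timestamp` field names follow the record layout described above:

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def cost_report(log_path="cost_log.jsonl"):
    """Aggregate the cost log into per-query, per-feature, per-day totals."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f]

    by_feature = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    by_day = defaultdict(float)
    total = 0.0
    for r in records:
        total += r["cost_usd"]
        by_feature[r["feature"]]["calls"] += 1
        by_feature[r["feature"]]["cost"] += r["cost_usd"]
        day = datetime.fromtimestamp(r["timestamp"],
                                     tz=timezone.utc).date().isoformat()
        by_day[day] += r["cost_usd"]

    per_query = total / len(records) if records else 0.0
    print(f"Total ${total:.2f} over {len(records)} calls (${per_query:.4f}/query)")
    for feature, stats in sorted(by_feature.items(), key=lambda kv: -kv[1]["cost"]):
        share = 100 * stats["cost"] / total if total else 0
        print(f"  {feature}: ${stats['cost']:.4f} "
              f"({share:.0f}% of spend, {stats['calls']} calls)")
    for day in sorted(by_day):
        print(f"  {day}: ${by_day[day]:.4f}")
    return {"total": total, "per_query": per_query,
            "by_feature": dict(by_feature), "by_day": dict(by_day)}

# Demo: build a tiny two-record log and summarize it.
demo = [
    {"timestamp": 1_700_000_000, "feature": "rag_qa", "cost_usd": 0.014},
    {"timestamp": 1_700_000_000, "feature": "classify", "cost_usd": 0.001},
]
with open("demo_cost_log.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in demo)
report = cost_report("demo_cost_log.jsonl")
```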

Two things immediately stand out from a report like this. First, the rag_qa feature accounts for 59% of costs despite being only 42% of calls, because it uses the expensive GPT-4o model with large context windows. Second, user_892 costs almost twice as much per query as the average user, possibly because they ask long questions that retrieve more context.

This is the kind of visibility that turns "we're spending too much on AI" into "we should route classification tasks to GPT-4o-mini, which would save $1,200/month."

Setting Budgets and Alerts

Dashboards tell you what happened. Budgets and alerts prevent bad things from happening in the first place. Here is a budget system that enforces spending limits and sends notifications when you are approaching them.

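Here is a minimal in-memory sketch of that idea. A production version would persist the spend counters and reset them daily; the `BudgetGuard` name and the thresholds are illustrative:

```python
from collections import defaultdict

class BudgetGuard:
    """Pre-flight budget check for one day, keyed by feature or user."""

    def __init__(self, daily_limit_usd: float, alert_fraction: float = 0.8):
        self.daily_limit = daily_limit_usd
        self.alert_fraction = alert_fraction
        self.spend = defaultdict(float)   # key -> dollars spent today
        self.alerted = set()              # keys already warned about

    def check(self, key: str) -> str:
        """Call before each API request: 'ok', 'warn' (once), or 'blocked'."""
        spent = self.spend[key]
        if spent >= self.daily_limit:
            return "blocked"              # fall back, reroute, or queue
        if spent >= self.alert_fraction * self.daily_limit and key not in self.alerted:
            self.alerted.add(key)
            return "warn"                 # hook your alerting (Slack, pager) here
        return "ok"

    def record(self, key: str, cost_usd: float) -> None:
        """Call after each request completes, with its actual cost."""
        self.spend[key] += cost_usd

# Usage: a $10/day budget for the rag_qa feature
guard = BudgetGuard(daily_limit_usd=10.0)
guard.record("rag_qa", 8.50)
print(guard.check("rag_qa"))   # "warn" at 85% of budget
guard.record("rag_qa", 2.00)
print(guard.check("rag_qa"))   # "blocked" once the limit is exceeded
```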

The pattern here is a pre-flight check before every API call. If the feature or user has exceeded their budget, you have options: fall back to a cached response, route to a cheaper model, queue the request for later, or return a graceful error. The right choice depends on your application, but the important thing is that you have the choice instead of discovering the overspend on your next invoice.

A few practical tips for setting budgets:

  • Start with observation. Run your cost tracker for a week before setting any limits. You need a baseline to know what "normal" looks like.
  • Set alert thresholds before hard limits. An 80% warning gives you time to investigate before requests start getting rejected.
  • Budget at multiple levels. A global daily budget catches everything, but per-feature budgets help you identify which part of the system is misbehaving.
  • Per-user budgets prevent abuse. If you expose an AI feature to end users, one user sending 10,000 queries can blow your budget. Per-user limits protect you.
