
Understanding AI Application Costs

Last Updated: March 15, 2026


Ashish Pratap Singh

Building AI applications is not just a technical challenge; it is also a cost management challenge. Unlike traditional software, AI systems introduce new cost drivers such as model inference, embeddings, vector storage, GPU usage, and external API calls. These costs can scale rapidly as usage grows.

In this chapter, we will break down the major cost components of AI applications and understand where money is actually spent.

How Token Pricing Works

If you have been using LLM APIs throughout this course, you know that you pay per token. But the pricing is more nuanced than "X dollars per token," and the differences between providers can significantly affect your bill.

The first thing to understand is that input tokens and output tokens are priced differently. Input tokens (your prompt, system message, retrieved context, conversation history) are cheaper. Output tokens (what the model generates) are more expensive, typically 4x to 5x more. This makes sense from the provider's perspective: generating tokens requires sequential computation, while processing input tokens can be parallelized.

The second thing is that model tiers vary wildly in price. A query that costs $0.01 on GPT-4o costs well under a tenth of a cent on GPT-4o-mini, roughly a 17x difference for a model that handles most straightforward tasks just fine. Choosing the right model tier for each task is one of the biggest cost levers you have (we will cover model routing in Chapter 11.3).

Here is what pricing looks like across major providers as of early 2025:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | o3-mini | $1.10 | $4.40 | 200K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| Meta | Llama 3.1 70B (hosted) | $0.35 | $0.40 | 128K |
| Meta | Llama 3.1 8B (hosted) | $0.05 | $0.08 | 128K |

Prices reflect typical rates from providers and hosting platforms. Check current pricing before making decisions, as these change frequently.

A few things jump out from this table. First, the gap between frontier models and small models is enormous. Gemini 1.5 Flash costs 40x less than Claude 3.5 Sonnet for input tokens ($0.075 vs $3.00 per million). Second, for the proprietary models, output tokens are consistently 4x to 5x more expensive than input tokens. Third, open-source models hosted on platforms like Together AI, Fireworks, or Groq are significantly cheaper than proprietary models, though quality trade-offs vary by task.

The diagram above shows a typical RAG query. Notice that most of the input tokens come from the retrieved context and chat history, not the user's actual question. This is a common pattern: the user types 20 words, but you send 3,600 tokens to the model. Understanding this breakdown is the first step to optimization.
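To make that concrete, here is a back-of-envelope calculation for a query of that shape at GPT-4o's rates from the table. The 3,600 input tokens come from the breakdown above; the 500 output tokens are an assumed figure for illustration:

```python
# One RAG query on GPT-4o: 3,600 input tokens, 500 output tokens (assumed)
INPUT_RATE = 2.50 / 1_000_000    # $ per input token (GPT-4o)
OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token (GPT-4o)

input_cost = 3_600 * INPUT_RATE    # $0.009
output_cost = 500 * OUTPUT_RATE    # $0.005
total = input_cost + output_cost   # $0.014 per query

print(f"${total:.3f} per query, "
      f"${total * 10_000 * 30:,.0f} per month at 10k queries/day")
```

Even at these small per-query numbers, 10,000 queries a day works out to thousands of dollars a month, which is why the optimizations in the rest of this module matter.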

The Hidden Costs

Token costs from your main LLM are the obvious expense. But in a production AI application, especially one using RAG or agents, the LLM generation call is just one of several cost centers. Let's walk through the costs that most teams do not notice until they add up.

Embedding Generation

Every time a user asks a question in a RAG system, you generate an embedding for the query. That is cheap: a fraction of a cent. But you also pay to embed all your documents during indexing. If you have 100,000 documents at 500 tokens each, that is 50 million tokens just for the initial index. And every time you update or add documents, you pay again.

| Embedding Model | Price (per 1M tokens) | Typical Use |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Cost-effective for most tasks |
| OpenAI text-embedding-3-large | $0.13 | Higher quality, larger dimensions |
| Cohere embed-v3 | $0.10 | Multilingual support |
| Voyage AI voyage-3 | $0.06 | Code and technical content |
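As a quick sketch of the indexing math for the 100,000-document corpus described above (one-time cost, but it recurs on every re-index):

```python
docs, tokens_per_doc = 100_000, 500
total_tokens = docs * tokens_per_doc   # 50 million tokens to embed

# Rates from the table above, in $ per 1M tokens
costs = {}
for model, rate_per_1m in [("text-embedding-3-small", 0.02),
                           ("text-embedding-3-large", 0.13)]:
    costs[model] = total_tokens / 1_000_000 * rate_per_1m
    print(f"{model}: ${costs[model]:.2f} for the full corpus")
```

The absolute numbers are small, which is exactly why teams forget them until they are re-embedding a large, frequently updated corpus every week.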

Re-ranking Calls

If you are using a re-ranker (as we discussed in Module 5), you are paying for a separate model call on every query. Re-rankers like Cohere Rerank process your top-k retrieved documents and reorder them. At $2.00 per 1,000 search queries, this adds up fast for high-traffic applications.

Evaluation and Testing

Running LLM-as-judge evaluations (Module 10) means you are making LLM calls to evaluate LLM outputs. If you evaluate every response in production, you are roughly doubling your LLM costs. Even running evaluations on a sample of 10% of queries adds a meaningful line item.

Retries and Fallbacks

API calls fail. Rate limits hit. Responses sometimes do not match your expected format, so you retry. Each retry is a full cost event. If your retry rate is 5%, your actual cost is 5% higher than your token-level math suggests. Some teams have retry rates of 15-20% without realizing it, because they retry silently on validation failures.

Vector Database Hosting

Your vector database is not free. Whether you are using Pinecone, Weaviate Cloud, or self-hosted Qdrant, there is a monthly cost for storage and query processing. Pinecone's standard tier starts around $70/month. For large indexes (millions of vectors), costs can reach $200-500/month.

The Full Picture

The vector DB cost is amortized: the monthly hosting cost divided by monthly query volume gives a per-query figure.

In this example, the hidden costs nearly double the per-query price. The LLM call itself is $0.014, but the total is $0.027 when you account for everything. At 10,000 queries per day, that hidden $0.013 per query adds up to $3,900 per month you were not tracking.
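The rollup itself is simple arithmetic. Here is a sketch with illustrative per-query numbers; these are assumptions for the sketch, not the exact figures behind the $0.027 example:

```python
queries_per_month = 10_000 * 30

# Illustrative per-query costs (assumed values)
llm_call  = 0.014                       # main generation call
rerank    = 2.00 / 1_000                # re-ranker at $2 per 1k queries
embedding = 0.0001                      # query embedding
evals     = 0.10 * llm_call             # LLM-as-judge on a 10% sample
retries   = 0.05 * llm_call             # 5% retry rate
vector_db = 150.0 / queries_per_month   # $150/month hosting, amortized

total = llm_call + rerank + embedding + evals + retries + vector_db
print(f"${total:.4f} per query vs ${llm_call:.4f} for the LLM call alone")
```

Whatever your actual numbers are, the shape of the calculation is the same: every per-query cost plus every fixed cost divided by query volume.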

Building Cost Tracking into Your Application

Now that you know where costs come from, let's build the instrumentation to track them. The goal is simple: every API call your application makes should log its token usage and cost, tagged with enough metadata to answer questions like "which feature is most expensive?" and "which users cost the most?"

Here is a cost tracker that wraps your LLM calls:

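A minimal sketch of such a tracker, assuming the OpenAI Python SDK's `chat.completions.create` interface; the pricing table, log file path, and record field names are illustrative choices:

```python
import json
import time
from pathlib import Path

# Illustrative rates in $ per 1M tokens; check current provider pricing.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

LOG_PATH = Path("cost_log.jsonl")

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call from its token counts."""
    rates = PRICING[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

def tracked_chat_completion(client, model, messages, *,
                            feature, user_id, is_retry=False):
    """Drop-in wrapper around client.chat.completions.create that logs cost."""
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    record = {
        "timestamp": time.time(),
        "model": model,
        "feature": feature,      # which part of the app made this call
        "user_id": user_id,      # lets you find expensive users
        "is_retry": is_retry,    # tracks what retries are costing you
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cost_usd": compute_cost(model, usage.prompt_tokens,
                                 usage.completion_tokens),
        "latency_s": round(time.time() - start, 3),
    }
    with LOG_PATH.open("a") as f:   # JSONL: one JSON object per line
        f.write(json.dumps(record) + "\n")
    return response
```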

The key design decisions here: we log to a JSONL file (one JSON object per line) so records are easy to query and aggregate later. Every record includes the feature tag, which lets you break down costs by what part of your application generated the call, and user_id, so you can identify expensive users. The is_retry flag helps you track how much money retries are costing you.

Here is how you use it:

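The call site simply swaps `client.chat.completions.create(...)` for `tracked_chat_completion(...)`. To keep this snippet self-contained and runnable without an API key, a trimmed stand-in tracker and a stub client are inlined below; in your application you would import your real tracker and pass an actual `OpenAI()` client:

```python
import json
from types import SimpleNamespace

def tracked_chat_completion(client, model, messages, *,
                            feature, user_id, is_retry=False):
    """Trimmed stand-in with the same signature; logs a minimal record."""
    response = client.chat.completions.create(model=model, messages=messages)
    record = {
        "model": model, "feature": feature,
        "user_id": user_id, "is_retry": is_retry,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }
    with open("cost_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response

# Stub client mimicking the OpenAI surface, so this runs without an API key.
stub_client = SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(
    create=lambda model, messages: SimpleNamespace(
        usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=300),
        choices=[SimpleNamespace(message=SimpleNamespace(content="stubbed answer"))],
    ))))

# The call site: same shape as a direct API call, plus cost metadata.
response = tracked_chat_completion(
    stub_client, "gpt-4o",
    [{"role": "user", "content": "What does the refund policy say?"}],
    feature="rag_qa", user_id="user_892",
)
print(response.choices[0].message.content)  # prints "stubbed answer"
```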

Every call is now automatically logged with token counts, costs, latency, and metadata. No changes are needed to your application logic; just replace direct API calls with tracked_chat_completion.

Building a Cost Dashboard

Logging costs is useless if you never look at the data. Let's build a simple dashboard that answers the three questions every team needs: how much are we spending per query, per feature, and per day?

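A sketch of that rollup over the JSONL log. A tiny demo log is inlined so the snippet runs standalone; the `cost_usd`, `feature`, and `timestamp` field names follow the record layout described above:

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def cost_report(log_path="cost_log.jsonl"):
    """Aggregate the cost log into per-query, per-feature, per-day totals."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f]

    by_feature = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    by_day = defaultdict(float)
    total = 0.0
    for r in records:
        total += r["cost_usd"]
        by_feature[r["feature"]]["calls"] += 1
        by_feature[r["feature"]]["cost"] += r["cost_usd"]
        day = datetime.fromtimestamp(r["timestamp"],
                                     tz=timezone.utc).date().isoformat()
        by_day[day] += r["cost_usd"]

    per_query = total / len(records) if records else 0.0
    print(f"Total ${total:.2f} over {len(records)} calls (${per_query:.4f}/query)")
    for feature, stats in sorted(by_feature.items(), key=lambda kv: -kv[1]["cost"]):
        share = 100 * stats["cost"] / total if total else 0
        print(f"  {feature}: ${stats['cost']:.4f} "
              f"({share:.0f}% of spend, {stats['calls']} calls)")
    for day in sorted(by_day):
        print(f"  {day}: ${by_day[day]:.4f}")
    return {"total": total, "per_query": per_query,
            "by_feature": dict(by_feature), "by_day": dict(by_day)}

# Demo: build a tiny two-record log and summarize it.
demo = [
    {"timestamp": 1_700_000_000, "feature": "rag_qa", "cost_usd": 0.014},
    {"timestamp": 1_700_000_000, "feature": "classify", "cost_usd": 0.001},
]
with open("demo_cost_log.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in demo)
report = cost_report("demo_cost_log.jsonl")
```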

Two things immediately stand out from a report like this. First, the rag_qa feature accounts for 59% of costs despite being only 42% of calls, because it uses the expensive GPT-4o model with large context windows. Second, user_892 costs almost twice as much per query as the average user, possibly because they ask long questions that retrieve more context.

This is the kind of visibility that turns "we're spending too much on AI" into "we should route classification tasks to GPT-4o-mini, which would save $1,200/month."

Setting Budgets and Alerts

Dashboards tell you what happened. Budgets and alerts prevent bad things from happening in the first place. Here is a budget system that enforces spending limits and sends notifications when you are approaching them.

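Here is a minimal in-memory sketch of that idea. A production version would persist the spend counters and reset them daily; the `BudgetGuard` name and the thresholds are illustrative:

```python
from collections import defaultdict

class BudgetGuard:
    """Pre-flight budget check for one day, keyed by feature or user."""

    def __init__(self, daily_limit_usd: float, alert_fraction: float = 0.8):
        self.daily_limit = daily_limit_usd
        self.alert_fraction = alert_fraction
        self.spend = defaultdict(float)   # key -> dollars spent today
        self.alerted = set()              # keys already warned about

    def check(self, key: str) -> str:
        """Call before each API request: 'ok', 'warn' (once), or 'blocked'."""
        spent = self.spend[key]
        if spent >= self.daily_limit:
            return "blocked"              # fall back, reroute, or queue
        if spent >= self.alert_fraction * self.daily_limit and key not in self.alerted:
            self.alerted.add(key)
            return "warn"                 # hook your alerting (Slack, pager) here
        return "ok"

    def record(self, key: str, cost_usd: float) -> None:
        """Call after each request completes, with its actual cost."""
        self.spend[key] += cost_usd

# Usage: a $10/day budget for the rag_qa feature
guard = BudgetGuard(daily_limit_usd=10.0)
guard.record("rag_qa", 8.50)
print(guard.check("rag_qa"))   # "warn" at 85% of budget
guard.record("rag_qa", 2.00)
print(guard.check("rag_qa"))   # "blocked" once the limit is exceeded
```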

The pattern here is a pre-flight check before every API call. If the feature or user has exceeded their budget, you have options: fall back to a cached response, route to a cheaper model, queue the request for later, or return a graceful error. The right choice depends on your application, but the important thing is that you have the choice instead of discovering the overspend on your next invoice.

A few practical tips for setting budgets:

  • Start with observation. Run your cost tracker for a week before setting any limits. You need a baseline to know what "normal" looks like.
  • Set alert thresholds before hard limits. An 80% warning gives you time to investigate before requests start getting rejected.
  • Budget at multiple levels. A global daily budget catches everything, but per-feature budgets help you identify which part of the system is misbehaving.
  • Per-user budgets prevent abuse. If you expose an AI feature to end users, one user sending 10,000 queries can blow your budget. Per-user limits protect you.
