LLM10: Unbounded Consumption — Cost-Based DoS and Resource Exhaustion in LLM Apps
In 2023, security researchers and LLM application developers documented a class of attack with few direct analogs in traditional AppSec: “denial of wallet.” Unlike a classic HTTP flood (which requires the attacker to send many requests), a denial-of-wallet attack against an LLM application can be executed with a single, carefully crafted request — one that causes the model to generate thousands of tokens in response.

An LLM API call costs money proportional to the tokens processed (input + output). A request that forces the model into a long generation loop — asking it to repeat a phrase indefinitely, generate a comprehensive book-length response, or solve a recursively expanding problem — can generate millions of tokens across a small number of requests, producing API bills of hundreds or thousands of dollars before any rate limit triggers. For applications that bill API costs to a shared budget, or that run agentic loops without per-run cost caps, a single malicious user can exhaust the entire organization’s monthly spend in minutes.
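To make the economics concrete, here is a rough back-of-envelope sketch. The per-token prices are illustrative assumptions, not any provider's current list prices:

```python
# Back-of-envelope cost of a verbose-output attack.
# ASSUMPTION: illustrative prices only — check your provider's pricing page.
INPUT_PRICE_PER_1K = 0.0025  # assumed $/1K input tokens
OUTPUT_PRICE_PER_1K = 0.01   # assumed $/1K output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call at the assumed per-token prices."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (
        output_tokens / 1000
    ) * OUTPUT_PRICE_PER_1K

# One 50-token prompt that elicits a 16K-token response:
print(f"${request_cost(50, 16_000):.2f} per request")      # ~$0.16
# 1,000 such requests before anyone notices:
print(f"${request_cost(50, 16_000) * 1000:.2f} total")      # ~$160
```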
What is unbounded consumption?
OWASP LLM10 describes the risk of missing or insufficient limits on the resources an LLM application consumes: tokens per request, requests per user or time window, total spend per billing period, and execution time for agentic loops. The threat model has two actors:
External attackers craft requests that maximize resource consumption — prompts that cause verbose outputs, requests that trigger recursive tool call chains in agents, or high-rate API hammering. They may not need to authenticate; if your LLM endpoint is publicly accessible and processes requests before billing them to users, the cost is borne by you.
Internal abuse and runaway logic — even without a malicious actor, a bug in an agentic loop (failure to detect termination conditions), an unusually complex user query, or a prompt that triggers an unexpectedly verbose model response can cause cost spikes. A production LLM application without cost guardrails has a non-deterministic cost model that makes budget forecasting impossible and cost spikes unpredictable.
The attack surfaces are:
- No `max_tokens` on API calls — the model generates until its context window limit, which can be 4K–200K tokens depending on the model
- No per-user or per-session rate limits — a single user can make arbitrarily many requests
- No spend monitoring or circuit breakers — costs can accumulate for hours before anyone notices
- Agentic loops without iteration or time limits — a recursive agent can run indefinitely, calling tools and generating tokens in a tight loop
- Prompts that force long outputs — “list every country in the world with its GDP, population, head of state, and capital city in detail” forces a multi-thousand-token response
The exploit: verbose output trigger
```python
# VULNERABLE: no max_tokens limit — attacker triggers unlimited token generation
import openai
from flask import Flask, request, jsonify

app = Flask(__name__)
client = openai.OpenAI()

@app.route("/ask", methods=["POST"])
def ask():
    user_message = request.json.get("message", "")
    # VULNERABLE: no max_tokens — model generates until context limit
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        # max_tokens not set — defaults to model maximum (up to 16K+ tokens for gpt-4o)
    )
    return jsonify({"reply": response.choices[0].message.content})

# Attacker sends:
# {"message": "Write a complete encyclopedia article about every aspect of world history,
#  covering all civilizations from 10000 BCE to the present, with full detail."}
# → Model generates tens of thousands of tokens
# → At GPT-4o pricing, a single request costs $0.50–$2.00+
# → 1000 such requests per hour = $500–$2000 per hour in API costs
```

The exploit: unbounded agentic loop
```python
# VULNERABLE: agentic loop with no iteration limit or cost budget
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # VULNERABLE: no max_tokens

@tool
def web_search(query: str) -> str:
    """Search the web for information."""
    return search_api(query)

@tool
def analyze_results(text: str) -> str:
    """Analyze search results and determine next steps."""
    response = llm.invoke(f"Analyze: {text}")  # VULNERABLE: nested LLM call, no limits
    return response.content

agent = create_tool_calling_agent(llm, [web_search, analyze_results], prompt)
executor = AgentExecutor(
    agent=agent,
    tools=[web_search, analyze_results],
    # VULNERABLE: max_iterations not set — defaults to 15 (or higher in some versions)
    # VULNERABLE: max_execution_time not set — can run for minutes
    verbose=True,
)

# A prompt injection or complex query can cause the agent to loop:
# web_search → analyze_results (calls LLM) → web_search → analyze_results → ...
# Each iteration costs tokens. 100 iterations × 1000 tokens = $0.10–$5.00 per request
```

Mitigations
M1: Always set max_tokens on every API call
This is the single most impactful control. Set `max_tokens` to the minimum value that satisfies the use case:
```python
import openai
from flask import Flask, request, jsonify

app = Flask(__name__)
client = openai.OpenAI()

# SAFE: define per-use-case token budgets
TOKEN_LIMITS = {
    "chat": 512,        # typical conversational response
    "summary": 256,     # document summary
    "code": 1024,       # code generation
    "structured": 256,  # JSON extraction
}

@app.route("/ask", methods=["POST"])
def ask():
    user_message = request.json.get("message", "")
    use_case = request.json.get("use_case", "chat")

    max_tokens = TOKEN_LIMITS.get(use_case, 256)  # SAFE: bounded output

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message[:2000]},  # SAFE: input length limit too
        ],
        max_tokens=max_tokens,  # SAFE: hard token ceiling
        # Note: some newer OpenAI endpoints use max_completion_tokens instead of
        # max_tokens — set whichever one your model expects, not both.
    )
    return jsonify({"reply": response.choices[0].message.content})
```

M2: Per-user rate limiting with token-aware budgets
Standard request-count rate limiting is not enough — one request can be 100× more expensive than another. Track token usage per user:
```python
import time
from collections import defaultdict
from threading import Lock
from functools import wraps
from flask import request, jsonify

class TokenBudgetRateLimiter:
    """Per-user rate limiter that tracks token consumption, not just request count."""

    def __init__(
        self,
        max_tokens_per_window: int = 50_000,
        max_requests_per_window: int = 100,
        window_seconds: int = 3600,
    ):
        self.max_tokens = max_tokens_per_window
        self.max_requests = max_requests_per_window
        self.window_seconds = window_seconds
        self._usage: dict[str, dict] = defaultdict(
            lambda: {"tokens": 0, "requests": 0, "window_start": time.time()}
        )
        self._lock = Lock()

    def check_and_record(self, user_id: str, tokens_used: int) -> bool:
        with self._lock:
            now = time.time()
            usage = self._usage[user_id]
            # Reset window if expired
            if now - usage["window_start"] > self.window_seconds:
                usage["tokens"] = 0
                usage["requests"] = 0
                usage["window_start"] = now

            if (
                usage["tokens"] + tokens_used > self.max_tokens  # SAFE: token budget check
                or usage["requests"] + 1 > self.max_requests     # SAFE: request count check
            ):
                return False  # rate limited

            usage["tokens"] += tokens_used
            usage["requests"] += 1
            return True

limiter = TokenBudgetRateLimiter(
    max_tokens_per_window=50_000,  # SAFE: 50K tokens per user per hour
    max_requests_per_window=100,   # SAFE: 100 requests per user per hour
)

def rate_limited(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        user_id = request.headers.get("X-User-ID", "anonymous")
        # Pre-check with estimated tokens; the handler should record actual
        # usage via limiter.check_and_record(user_id, actual_tokens) afterwards
        if not limiter.check_and_record(user_id, tokens_used=0):
            return jsonify({"error": "Rate limit exceeded"}), 429
        return f(*args, **kwargs)
    return decorated
```
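Here is a sketch of how the decorator and the limiter combine in a route handler. `call_llm` is a hypothetical helper (not part of the code above) that returns the reply text together with the token count reported in `response.usage`:

```python
@app.route("/ask", methods=["POST"])
@rate_limited
def ask():
    user_id = request.headers.get("X-User-ID", "anonymous")
    # ASSUMPTION: call_llm is a hypothetical wrapper returning (reply, total_tokens)
    reply, tokens = call_llm(request.json.get("message", ""))
    # Record what the request actually cost so the next check sees real usage
    limiter.check_and_record(user_id, tokens_used=tokens)
    return jsonify({"reply": reply})
```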
M3: Cost circuit breaker with tenacity and spend monitoring

Implement a circuit breaker that halts requests when spend anomalies are detected:
```python
import logging
import threading
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import OpenAI, RateLimitError

logger = logging.getLogger(__name__)
client = OpenAI()

MONTHLY_TOKEN_BUDGET = 10_000_000  # 10M tokens per month
_tokens_used_this_month = 0
_budget_lock = threading.Lock()

def check_budget(estimated_tokens: int) -> None:
    """SAFE: circuit breaker — raises if monthly budget would be exceeded."""
    with _budget_lock:
        if _tokens_used_this_month + estimated_tokens > MONTHLY_TOKEN_BUDGET:
            logger.error(
                f"Monthly token budget exceeded: {_tokens_used_this_month}/{MONTHLY_TOKEN_BUDGET}"
            )
            raise RuntimeError("Monthly LLM budget exhausted. Requests halted.")

def record_usage(tokens: int) -> None:
    global _tokens_used_this_month
    with _budget_lock:
        _tokens_used_this_month += tokens
        logger.info(f"Tokens used this month: {_tokens_used_this_month}/{MONTHLY_TOKEN_BUDGET}")

@retry(  # SAFE: exponential backoff on provider rate limits
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RateLimitError),
)
def call_llm_with_budget(messages: list, max_tokens: int = 512) -> str:
    check_budget(estimated_tokens=max_tokens)  # SAFE: pre-call budget check

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=max_tokens,  # SAFE: hard token limit
    )

    actual_tokens = response.usage.total_tokens
    record_usage(actual_tokens)  # SAFE: track actual usage

    return response.choices[0].message.content
```
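At the request boundary, the breaker's `RuntimeError` should fail closed rather than crash the app. A minimal sketch, assuming the Flask `app` from M1:

```python
@app.route("/ask", methods=["POST"])
def ask():
    messages = [{"role": "user", "content": request.json.get("message", "")[:2000]}]
    try:
        reply = call_llm_with_budget(messages, max_tokens=512)
    except RuntimeError:
        # Budget circuit breaker tripped — stop serving rather than keep spending
        return jsonify({"error": "Service temporarily unavailable"}), 503
    return jsonify({"reply": reply})
```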
M4: Bound agentic loops with hard iteration and time limits

For any agent-based workflow, set explicit upper bounds on iterations and wall-clock time:
```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=512,  # SAFE: per-call output limit
)

executor = AgentExecutor(
    agent=agent,  # agent and tools defined as in the vulnerable example
    tools=tools,
    max_iterations=5,                 # SAFE: hard iteration ceiling
    max_execution_time=30.0,          # SAFE: 30-second wall-clock timeout
    return_intermediate_steps=True,   # SAFE: audit token consumption per step
    early_stopping_method="generate",
    verbose=True,
)

# For async agents, add a semaphore to limit concurrent runs:
import asyncio

_agent_semaphore = asyncio.Semaphore(10)  # SAFE: max 10 concurrent agent runs

async def run_agent_bounded(input_text: str) -> str:
    async with _agent_semaphore:  # SAFE: concurrency limit
        result = await executor.ainvoke({"input": input_text})
        return result["output"]
```
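`max_execution_time` bounds the loop from inside LangChain. As defense in depth, you can add an outer wall-clock deadline with `asyncio.wait_for`, which cancels the run even if a tool call hangs; a minimal sketch:

```python
async def run_agent_with_deadline(input_text: str, timeout_s: float = 60.0) -> str:
    try:
        # Outer deadline: cancels the coroutine even if the framework's own
        # max_execution_time check never fires (e.g. a hung tool call)
        return await asyncio.wait_for(run_agent_bounded(input_text), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "Agent run exceeded its time budget."
```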
Detecting LLM10 with LLMArmor

LLMArmor detects missing token limits and unbounded agent configurations in Python source code — one of its strongest detection categories.
```
pip install llmarmor
llmarmor scan ./src
```

Example findings:
```
LLM10 — Unbounded Consumption [HIGH] api.py:18
  client.chat.completions.create(model="gpt-4o", messages=messages)
  LLM API call with no max_tokens parameter — unbounded token generation.
  Fix: always set max_tokens to the minimum value required for the use case.
  Ref: https://owasp.org/www-project-top-10-for-large-language-model-applications/

LLM10 — Unbounded Consumption [MEDIUM] agent.py:14
  AgentExecutor(agent=agent, tools=tools, verbose=True)
  AgentExecutor initialized without max_iterations or max_execution_time.
  Fix: set max_iterations (typically 3–10) and max_execution_time (30–120 seconds).
```

Frequently asked questions
- What is a denial of wallet attack?
- A denial of wallet attack is a cost-based denial-of-service against LLM applications. The attacker crafts inputs that cause the model to generate very large outputs or trigger expensive agentic loops. Because LLM API pricing is proportional to token count, a small number of requests can produce a disproportionately large bill. Unlike traditional DDoS, which requires high-volume traffic, a denial of wallet attack can be executed with a single request that triggers tens of thousands of tokens of output.
- Why isn't request-count rate limiting enough to prevent denial of wallet?
- Request-count rate limiting treats all requests as equal cost. But a request that generates 100 tokens costs roughly 100× less than one that generates 10,000 tokens. An attacker who stays under the request-count limit but crafts each request to maximize output tokens can still exhaust your token budget. Token-aware rate limiting — tracking tokens consumed per user per time window rather than just request count — is necessary to address this.
- How do I set an appropriate max_tokens value?
- Measure the token distribution of legitimate responses in your application using `response.usage.completion_tokens`. Set `max_tokens` to the 99th percentile of that distribution with a reasonable buffer (see the measurement sketch after this FAQ). For conversational applications, 256–512 tokens is usually sufficient. For code generation, 1024–2048. For document summarization, 256–512. The key is to set the lowest value that satisfies the use case — not an arbitrary large number as a 'safe' upper bound.
- What is the difference between max_tokens and max_completion_tokens?
- `max_tokens` is the parameter used by the OpenAI API (and most compatible APIs) to limit the number of tokens the model generates in the response. `max_completion_tokens` is an alias used by some newer OpenAI endpoints and the Responses API. For Anthropic's Claude, the parameter is `max_tokens`. For Google's Gemini, it is `max_output_tokens`. Always check your provider's API documentation — the behavior is the same (hard ceiling on output tokens) but the parameter name varies.
- How do I monitor LLM API spend in production?
- Most LLM providers offer spend dashboards and webhook alerts for budget thresholds (OpenAI's usage limits, Anthropic's billing alerts). At the application layer, log `response.usage.total_tokens` for every API call and aggregate by user, session, and time window. Set alerting thresholds at 50%, 80%, and 100% of your monthly budget. Use a circuit breaker pattern (see M3 above) to halt requests automatically when thresholds are exceeded.
- Can prompt caching reduce unbounded consumption risk?
- Prompt caching (available on Claude, GPT-4o, and others) reduces the cost of repeated prompt prefixes — useful for long, static system prompts. It does not protect against denial-of-wallet attacks where the attacker varies the input to defeat caching, or attacks that target output token generation rather than input processing. Caching is a cost optimization, not a security control. Use it alongside `max_tokens` limits and rate limiting, not instead of them.
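The measurement sketch referenced in the `max_tokens` FAQ above. It assumes you have logged `response.usage.completion_tokens` values from legitimate traffic into a list:

```python
import math

def suggest_max_tokens(completion_token_counts: list[int], buffer: float = 1.2) -> int:
    """Suggest max_tokens as the 99th percentile of observed outputs plus a buffer.

    completion_token_counts: response.usage.completion_tokens values logged
    from legitimate production traffic (the more samples, the better).
    """
    counts = sorted(completion_token_counts)
    p99 = counts[min(len(counts) - 1, math.ceil(0.99 * len(counts)) - 1)]
    return int(p99 * buffer)

# e.g. outputs clustered between 100 and 400 tokens suggest a ceiling near 480
print(suggest_max_tokens([120, 250, 300, 380, 400] * 200))  # → 480
```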