LLM10: Unbounded Consumption — Cost-Based DoS and Resource Exhaustion in LLM Apps
In 2023, security researchers and LLM application developers documented a class of attack with few direct analogs in traditional AppSec: “denial of wallet.” Unlike a classic HTTP flood (which requires the attacker to send many requests), a denial-of-wallet attack against an LLM application can be executed with a single, carefully crafted request — one that causes the model to generate thousands of tokens in response.

An LLM API call costs money proportional to the tokens processed (input + output). A request that forces the model into a long generation loop — asking it to repeat a phrase indefinitely, generate a comprehensive book-length response, or solve a recursively expanding problem — can generate millions of tokens across a small number of requests, producing API bills of hundreds or thousands of dollars before any rate limit triggers. For applications that bill API costs to a shared budget, or that run agentic loops without per-run cost caps, a single malicious user can exhaust the entire organization’s monthly spend in minutes.
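To make the economics concrete, here is a rough back-of-envelope sketch. The per-token prices are illustrative assumptions, not any provider's current list prices:

```python
# Back-of-envelope cost of a verbose-output attack.
# ASSUMPTION: illustrative prices only — check your provider's pricing page.
INPUT_PRICE_PER_1K = 0.0025  # assumed $/1K input tokens
OUTPUT_PRICE_PER_1K = 0.01   # assumed $/1K output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call at the assumed per-token prices."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (
        output_tokens / 1000
    ) * OUTPUT_PRICE_PER_1K

# One 50-token prompt that elicits a 16K-token response:
print(f"${request_cost(50, 16_000):.2f} per request")      # ~$0.16
# 1,000 such requests before anyone notices:
print(f"${request_cost(50, 16_000) * 1000:.2f} total")      # ~$160
```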
What is unbounded consumption?
OWASP LLM10 describes the risk of missing or insufficient limits on the resources an LLM application consumes: tokens per request, requests per user or time window, total spend per billing period, and execution time for agentic loops. The threat model has two actors:
External attackers craft requests that maximize resource consumption — prompts that cause verbose outputs, requests that trigger recursive tool call chains in agents, or high-rate API hammering. They may not need to authenticate; if your LLM endpoint is publicly accessible and processes requests before billing them to users, the cost is borne by you.
Internal abuse and runaway logic — even without a malicious actor, a bug in an agentic loop (failure to detect termination conditions), an unusually complex user query, or a prompt that triggers an unexpectedly verbose model response can cause cost spikes. A production LLM application without cost guardrails has a non-deterministic cost model that makes budget forecasting impossible and cost spikes unpredictable.
The attack surfaces are:
- No `max_tokens` on API calls — the model generates until its context window limit, which can be 4K–200K tokens depending on the model
- No per-user or per-session rate limits — a single user can make arbitrarily many requests
- No spend monitoring or circuit breakers — costs can accumulate for hours before anyone notices
- Agentic loops without iteration or time limits — a recursive agent can run indefinitely, calling tools and generating tokens in a tight loop
- Prompts that force long outputs — “list every country in the world with its GDP, population, head of state, and capital city in detail” forces a multi-thousand-token response
The exploit: verbose output trigger
```python
# VULNERABLE: no max_tokens limit — attacker triggers unlimited token generation
import openai
from flask import Flask, request, jsonify

app = Flask(__name__)
client = openai.OpenAI()

@app.route("/ask", methods=["POST"])
def ask():
    user_message = request.json.get("message", "")
    # VULNERABLE: no max_tokens — model generates until context limit
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        # max_tokens not set — defaults to model maximum (up to 16K+ tokens for gpt-4o)
    )
    return jsonify({"reply": response.choices[0].message.content})

# Attacker sends:
# {"message": "Write a complete encyclopedia article about every aspect of world history,
#  covering all civilizations from 10000 BCE to the present, with full detail."}
# → Model generates tens of thousands of tokens
# → At GPT-4o pricing, a single request costs $0.50–$2.00+
# → 1000 such requests per hour = $500–$2000 per hour in API costs
```

The exploit: unbounded agentic loop
```python
# VULNERABLE: agentic loop with no iteration limit or cost budget
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # VULNERABLE: no max_tokens

@tool
def web_search(query: str) -> str:
    """Search the web for information."""
    return search_api(query)

@tool
def analyze_results(text: str) -> str:
    """Analyze search results and determine next steps."""
    response = llm.invoke(f"Analyze: {text}")  # VULNERABLE: nested LLM call, no limits
    return response.content

agent = create_tool_calling_agent(llm, [web_search, analyze_results], prompt)
executor = AgentExecutor(
    agent=agent,
    tools=[web_search, analyze_results],
    # VULNERABLE: max_iterations not set — defaults to 15 (or higher in some versions)
    # VULNERABLE: max_execution_time not set — can run for minutes
    verbose=True,
)

# A prompt injection or complex query can cause the agent to loop:
# web_search → analyze_results (calls LLM) → web_search → analyze_results → ...
# Each iteration costs tokens. 100 iterations × 1000 tokens = $0.10–$5.00 per request
```

Mitigations
M1: Always set max_tokens on every API call
This is the single most impactful control. Set `max_tokens` to the minimum value that satisfies the use case:
```python
import openai
from flask import Flask, request, jsonify

app = Flask(__name__)
client = openai.OpenAI()

# SAFE: define per-use-case token budgets
TOKEN_LIMITS = {
    "chat": 512,        # typical conversational response
    "summary": 256,     # document summary
    "code": 1024,       # code generation
    "structured": 256,  # JSON extraction
}

@app.route("/ask", methods=["POST"])
def ask():
    user_message = request.json.get("message", "")
    use_case = request.json.get("use_case", "chat")

    max_tokens = TOKEN_LIMITS.get(use_case, 256)  # SAFE: bounded output

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message[:2000]},  # SAFE: input length limit too
        ],
        max_tokens=max_tokens,  # SAFE: hard token ceiling
        # Note: some newer OpenAI endpoints use max_completion_tokens instead of
        # max_tokens — set whichever one your model expects, not both.
    )
    return jsonify({"reply": response.choices[0].message.content})
```

M2: Per-user rate limiting with token-aware budgets
Standard request-count rate limiting is not enough — one request can be 100× more expensive than another. Track token usage per user:
```python
import time
from collections import defaultdict
from threading import Lock
from functools import wraps
from flask import request, jsonify

class TokenBudgetRateLimiter:
    """Per-user rate limiter that tracks token consumption, not just request count."""

    def __init__(
        self,
        max_tokens_per_window: int = 50_000,
        max_requests_per_window: int = 100,
        window_seconds: int = 3600,
    ):
        self.max_tokens = max_tokens_per_window
        self.max_requests = max_requests_per_window
        self.window_seconds = window_seconds
        self._usage: dict[str, dict] = defaultdict(
            lambda: {"tokens": 0, "requests": 0, "window_start": time.time()}
        )
        self._lock = Lock()

    def check_and_record(self, user_id: str, tokens_used: int) -> bool:
        with self._lock:
            now = time.time()
            usage = self._usage[user_id]
            # Reset window if expired
            if now - usage["window_start"] > self.window_seconds:
                usage["tokens"] = 0
                usage["requests"] = 0
                usage["window_start"] = now

            if (
                usage["tokens"] + tokens_used > self.max_tokens  # SAFE: token budget check
                or usage["requests"] + 1 > self.max_requests     # SAFE: request count check
            ):
                return False  # rate limited

            usage["tokens"] += tokens_used
            usage["requests"] += 1
            return True

limiter = TokenBudgetRateLimiter(
    max_tokens_per_window=50_000,  # SAFE: 50K tokens per user per hour
    max_requests_per_window=100,   # SAFE: 100 requests per user per hour
)

def rate_limited(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        user_id = request.headers.get("X-User-ID", "anonymous")
        # Pre-check with estimated tokens; the handler should record actual
        # usage via limiter.check_and_record(user_id, actual_tokens) afterwards
        if not limiter.check_and_record(user_id, tokens_used=0):
            return jsonify({"error": "Rate limit exceeded"}), 429
        return f(*args, **kwargs)
    return decorated
```
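Here is a sketch of how the decorator and the limiter combine in a route handler. `call_llm` is a hypothetical helper (not part of the code above) that returns the reply text together with the token count reported in `response.usage`:

```python
@app.route("/ask", methods=["POST"])
@rate_limited
def ask():
    user_id = request.headers.get("X-User-ID", "anonymous")
    # ASSUMPTION: call_llm is a hypothetical wrapper returning (reply, total_tokens)
    reply, tokens = call_llm(request.json.get("message", ""))
    # Record what the request actually cost so the next check sees real usage
    limiter.check_and_record(user_id, tokens_used=tokens)
    return jsonify({"reply": reply})
```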
M3: Cost circuit breaker with tenacity and spend monitoring

Implement a circuit breaker that halts requests when spend anomalies are detected:
```python
import logging
import threading
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import OpenAI, RateLimitError

logger = logging.getLogger(__name__)
client = OpenAI()

MONTHLY_TOKEN_BUDGET = 10_000_000  # 10M tokens per month
_tokens_used_this_month = 0
_budget_lock = threading.Lock()

def check_budget(estimated_tokens: int) -> None:
    """SAFE: circuit breaker — raises if monthly budget would be exceeded."""
    with _budget_lock:
        if _tokens_used_this_month + estimated_tokens > MONTHLY_TOKEN_BUDGET:
            logger.error(
                f"Monthly token budget exceeded: {_tokens_used_this_month}/{MONTHLY_TOKEN_BUDGET}"
            )
            raise RuntimeError("Monthly LLM budget exhausted. Requests halted.")

def record_usage(tokens: int) -> None:
    global _tokens_used_this_month
    with _budget_lock:
        _tokens_used_this_month += tokens
        logger.info(f"Tokens used this month: {_tokens_used_this_month}/{MONTHLY_TOKEN_BUDGET}")

@retry(  # SAFE: exponential backoff on provider rate limits
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RateLimitError),
)
def call_llm_with_budget(messages: list, max_tokens: int = 512) -> str:
    check_budget(estimated_tokens=max_tokens)  # SAFE: pre-call budget check

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=max_tokens,  # SAFE: hard token limit
    )

    actual_tokens = response.usage.total_tokens
    record_usage(actual_tokens)  # SAFE: track actual usage

    return response.choices[0].message.content
```
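At the request boundary, the breaker's `RuntimeError` should fail closed rather than crash the app. A minimal sketch, assuming the Flask `app` from M1:

```python
@app.route("/ask", methods=["POST"])
def ask():
    messages = [{"role": "user", "content": request.json.get("message", "")[:2000]}]
    try:
        reply = call_llm_with_budget(messages, max_tokens=512)
    except RuntimeError:
        # Budget circuit breaker tripped — stop serving rather than keep spending
        return jsonify({"error": "Service temporarily unavailable"}), 503
    return jsonify({"reply": reply})
```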
M4: Bound agentic loops with hard iteration and time limits

For any agent-based workflow, set explicit upper bounds on iterations and wall-clock time:
```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=512,  # SAFE: per-call output limit
)

executor = AgentExecutor(
    agent=agent,  # agent and tools defined as in the vulnerable example
    tools=tools,
    max_iterations=5,                 # SAFE: hard iteration ceiling
    max_execution_time=30.0,          # SAFE: 30-second wall-clock timeout
    return_intermediate_steps=True,   # SAFE: audit token consumption per step
    early_stopping_method="generate",
    verbose=True,
)

# For async agents, add a semaphore to limit concurrent runs:
import asyncio

_agent_semaphore = asyncio.Semaphore(10)  # SAFE: max 10 concurrent agent runs

async def run_agent_bounded(input_text: str) -> str:
    async with _agent_semaphore:  # SAFE: concurrency limit
        result = await executor.ainvoke({"input": input_text})
        return result["output"]
```
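`max_execution_time` bounds the loop from inside LangChain. As defense in depth, you can add an outer wall-clock deadline with `asyncio.wait_for`, which cancels the run even if a tool call hangs; a minimal sketch:

```python
async def run_agent_with_deadline(input_text: str, timeout_s: float = 60.0) -> str:
    try:
        # Outer deadline: cancels the coroutine even if the framework's own
        # max_execution_time check never fires (e.g. a hung tool call)
        return await asyncio.wait_for(run_agent_bounded(input_text), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "Agent run exceeded its time budget."
```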
Detecting LLM10 with LLMArmor

LLMArmor detects missing token limits and unbounded agent configurations in Python source code — one of its strongest detection categories.
```
pip install llmarmor
llmarmor scan ./src
```

Example findings:
```
LLM10 — Unbounded Consumption [HIGH] api.py:18
  client.chat.completions.create(model="gpt-4o", messages=messages)
  LLM API call with no max_tokens parameter — unbounded token generation.
  Fix: always set max_tokens to the minimum value required for the use case.
  Ref: https://owasp.org/www-project-top-10-for-large-language-model-applications/

LLM10 — Unbounded Consumption [MEDIUM] agent.py:14
  AgentExecutor(agent=agent, tools=tools, verbose=True)
  AgentExecutor initialized without max_iterations or max_execution_time.
  Fix: set max_iterations (typically 3–10) and max_execution_time (30–120 seconds).
```

Frequently asked questions
- What is a denial of wallet attack?
- A denial of wallet attack is a cost-based denial-of-service against LLM applications. The attacker crafts inputs that cause the model to generate very large outputs or trigger expensive agentic loops. Because LLM API pricing is proportional to token count, a small number of requests can produce a disproportionately large bill. Unlike traditional DDoS, which requires high-volume traffic, a denial of wallet attack can be executed with a single request that triggers tens of thousands of tokens of output.
- Why isn't request-count rate limiting enough to prevent denial of wallet?
- Request-count rate limiting treats all requests as equal cost. But a request that generates 100 tokens costs roughly 100× less than one that generates 10,000 tokens. An attacker who stays under the request-count limit but crafts each request to maximize output tokens can still exhaust your token budget. Token-aware rate limiting — tracking tokens consumed per user per time window rather than just request count — is necessary to address this.
- How do I set an appropriate max_tokens value?
- Measure the token distribution of legitimate responses in your application using `response.usage.completion_tokens`. Set `max_tokens` to the 99th percentile of that distribution with a reasonable buffer (see the measurement sketch after this FAQ). For conversational applications, 256–512 tokens is usually sufficient. For code generation, 1024–2048. For document summarization, 256–512. The key is to set the lowest value that satisfies the use case — not an arbitrary large number as a 'safe' upper bound.
- What is the difference between max_tokens and max_completion_tokens?
- `max_tokens` is the parameter used by the OpenAI API (and most compatible APIs) to limit the number of tokens the model generates in the response. `max_completion_tokens` is an alias used by some newer OpenAI endpoints and the Responses API. For Anthropic's Claude, the parameter is `max_tokens`. For Google's Gemini, it is `max_output_tokens`. Always check your provider's API documentation — the behavior is the same (hard ceiling on output tokens) but the parameter name varies.
- How do I monitor LLM API spend in production?
- Most LLM providers offer spend dashboards and webhook alerts for budget thresholds (OpenAI's usage limits, Anthropic's billing alerts). At the application layer, log `response.usage.total_tokens` for every API call and aggregate by user, session, and time window. Set alerting thresholds at 50%, 80%, and 100% of your monthly budget. Use a circuit breaker pattern (see M3 above) to halt requests automatically when thresholds are exceeded.
- Can prompt caching reduce unbounded consumption risk?
- Prompt caching (available on Claude, GPT-4o, and others) reduces the cost of repeated prompt prefixes — useful for long, static system prompts. It does not protect against denial-of-wallet attacks where the attacker varies the input to defeat caching, or attacks that target output token generation rather than input processing. Caching is a cost optimization, not a security control. Use it alongside `max_tokens` limits and rate limiting, not instead of them.
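The measurement sketch referenced in the `max_tokens` FAQ above. It assumes you have logged `response.usage.completion_tokens` values from legitimate traffic into a list:

```python
import math

def suggest_max_tokens(completion_token_counts: list[int], buffer: float = 1.2) -> int:
    """Suggest max_tokens as the 99th percentile of observed outputs plus a buffer.

    completion_token_counts: response.usage.completion_tokens values logged
    from legitimate production traffic (the more samples, the better).
    """
    counts = sorted(completion_token_counts)
    p99 = counts[min(len(counts) - 1, math.ceil(0.99 * len(counts)) - 1)]
    return int(p99 * buffer)

# e.g. outputs clustered between 100 and 400 tokens suggest a ceiling near 480
print(suggest_max_tokens([120, 250, 300, 380, 400] * 200))  # → 480
```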