
Secure LLM Deployment Checklist for Production

In April 2023, Samsung employees used ChatGPT to assist with internal tasks — debugging semiconductor equipment source code, summarizing internal meeting notes, and refining a confidential presentation. The data was transmitted to OpenAI’s servers and, depending on account settings at the time, potentially used for model training. Samsung had no pre-deployment controls in place: no acceptable-use policy enforced at the infrastructure level, no data classification gates, no prompt monitoring. Three separate incidents were reported internally within 20 days before Samsung banned ChatGPT use on company devices. The incidents were not caused by an attack. They were caused by the absence of a deployment security posture.

LLM deployments introduce security risk at three distinct phases, each requiring different controls:

Pre-deployment is the development and code-review phase. Risks here include hardcoded API keys in source code, prompts stored inline in application code without version control, and model version strings that allow silent behavioral drift. These are static, detectable risks.

Runtime is the operational phase after deployment. Risks include users exceeding expected request volumes (cost and availability), LLM responses containing PII that gets logged, and behavioral anomalies (injection, jailbreaks) that are invisible without instrumentation.

Post-deployment is the ongoing monitoring phase. Risks include prompt leaks embedded in model responses, dependency vulnerabilities in LLM SDK versions, and the absence of a defined incident response process.

The checklist below addresses all three phases.


Hardcoded API keys in source code are the most common LLM deployment security mistake. They are committed to version control, copied into CI logs, and included in Docker images.

# VULNERABLE: API key hardcoded in source
import openai

client = openai.OpenAI(api_key="sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")  # VULNERABLE
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "Hello"}],
)

# SAFE: API key loaded from environment variable
import os
import openai
from dotenv import load_dotenv

load_dotenv()  # SAFE: loads from .env file, which is gitignored
client = openai.OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]  # SAFE: key never in source
)

Create a .env.example file (committed to version control) documenting required variables without values, and a .env file (gitignored) with real values:

# .env.example — commit this
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
PINECONE_API_KEY=
DATABASE_URL=

Add secret scanning to CI. GitHub’s built-in secret scanning detects pushed API keys automatically for supported providers. For local pre-commit scanning, detect-secrets or trufflehog can be added as a pre-commit hook:

.pre-commit-config.yaml
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]
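
The detect-secrets hook compares new findings against a committed baseline file. If you do not have one yet, generate it once and commit it alongside the config (standard detect-secrets usage):

detect-secrets scan > .secrets.baseline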

Prompt versioning: prompts belong in version control


System prompts embedded inline in application code are not reviewed in pull requests, cannot be rolled back independently of application code, and cannot be tested in isolation. Store them in a prompts/ directory and load them at startup.

prompts/
  customer_support.j2
  code_review.j2
  data_extraction.j2

# SAFE: prompts loaded from versioned files with Jinja2 templating
import jinja2
from jinja2 import Environment, FileSystemLoader, select_autoescape
from pathlib import Path

PROMPTS_DIR = Path(__file__).parent / "prompts"

jinja_env = Environment(
    loader=FileSystemLoader(str(PROMPTS_DIR)),
    autoescape=select_autoescape(enabled_extensions=[]),  # plaintext prompts, no HTML escaping
    undefined=jinja2.StrictUndefined,  # error on missing variables, not silent empty string
)

def load_prompt(template_name: str, **variables: str) -> str:
    """SAFE: load and render a versioned prompt template."""
    template = jinja_env.get_template(f"{template_name}.j2")
    return template.render(**variables)

# Usage: system prompt is now reviewable, versioned, and testable
system_prompt = load_prompt("customer_support", product_name="Acme Widget", tier="premium")
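
Because prompts now live in versioned files, they can also be unit-tested in isolation. A minimal pytest sketch, assuming the load_prompt helper above and a template that interpolates product_name (the test names and asserted substring are illustrative):

# Hypothetical tests for the load_prompt helper above
import jinja2
import pytest

def test_customer_support_prompt_mentions_product():
    rendered = load_prompt("customer_support", product_name="Acme Widget", tier="premium")
    assert "Acme Widget" in rendered  # assumes the template uses {{ product_name }}

def test_missing_variable_raises():
    # StrictUndefined turns a missing variable into an error instead of a silent empty string
    with pytest.raises(jinja2.exceptions.UndefinedError):
        load_prompt("customer_support", product_name="Acme Widget")  # tier omitted on purpose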

Model pinning: pin the version string, not the alias


Model aliases like gpt-4o or claude-3-5-sonnet-latest point to different underlying model checkpoints over time. When the alias is updated, your application silently begins using a different model with different behavior, different safety thresholds, and potentially different output structure.

# VULNERABLE: alias-based model reference — behavior can change without code change
response = client.chat.completions.create(
    model="gpt-4o",  # VULNERABLE: points to latest, changes silently
    messages=messages,
)

# SAFE: pinned model version
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # SAFE: pinned to specific checkpoint
    messages=messages,
    max_tokens=1024,  # SAFE: always specify max_tokens
)

Document the pinned version and the date it was pinned in a MODELS.md file or inline comment. When you intentionally update the model, do it as an explicit change in version control with a corresponding evaluation run.
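
A MODELS.md entry can be as simple as the following sketch; the fields shown are one possible convention, not a required format:

# MODELS.md
## Customer support chat
- Model: gpt-4o-2024-11-20
- Pinned on: 2025-01-15 (illustrative date)
- Evaluation run: link to the eval results that justified this version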


Every LLM API call should specify max_tokens. Without it, the model may generate arbitrarily long responses, causing cost overruns and response time instability. Per-user rate limiting prevents both abuse and accidental runaway loops.

# SAFE: per-user rate limiting with max_tokens enforcement
import time
from collections import defaultdict
from dataclasses import dataclass, field

import openai
from flask import Flask, request, jsonify, g

app = Flask(__name__)

@dataclass
class RateLimitState:
    requests: int = 0
    window_start: float = field(default_factory=time.time)

# In production, replace with Redis + the `limits` library for distributed enforcement
_rate_limits: dict[str, RateLimitState] = defaultdict(RateLimitState)
RATE_LIMIT_REQUESTS = 20
RATE_LIMIT_WINDOW = 60  # seconds

def check_rate_limit(user_id: str) -> bool:
    """SAFE: returns True if request is within limit, False if exceeded."""
    state = _rate_limits[user_id]
    now = time.time()
    if now - state.window_start > RATE_LIMIT_WINDOW:
        state.requests = 0
        state.window_start = now
    state.requests += 1
    return state.requests <= RATE_LIMIT_REQUESTS

@app.route("/chat", methods=["POST"])
def chat():
    user_id = g.user_id  # set by auth middleware
    if not check_rate_limit(user_id):
        return jsonify({"error": "Rate limit exceeded"}), 429
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": request.json["message"]}],
        max_tokens=512,  # SAFE: always set; prevents runaway generation
        temperature=0.2,
    )
    return jsonify({"reply": response.choices[0].message.content})

Observability: structured LLM call logging


Every LLM call in production should emit a structured log event. Log the user identifier, session, token counts, and latency. Do not log raw prompt text or response text — hash the prompt for correlation without exposing PII, and redact sensitive patterns from responses before logging.

# SAFE: structured observability for LLM calls using OpenTelemetry
import hashlib
import logging
import re
import time

import openai
from opentelemetry import trace

logger = logging.getLogger("llm.calls")
tracer = trace.get_tracer(__name__)

# Simple PII redaction pattern — extend for your data types
_PII_PATTERN = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # email
    r"|\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"  # phone
    r"|\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",  # credit card
    re.IGNORECASE,
)

def redact_pii(text: str) -> str:
    return _PII_PATTERN.sub("[REDACTED]", text)

def call_llm_with_observability(
    user_id: str,
    session_id: str,
    messages: list[dict],
    model: str = "gpt-4o-2024-11-20",
    max_tokens: int = 512,
) -> str:
    """SAFE: wraps LLM call with structured logging and OTel span."""
    client = openai.OpenAI()
    prompt_hash = hashlib.sha256(str(messages).encode()).hexdigest()[:16]
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("session_id", session_id)
        span.set_attribute("model", model)
        span.set_attribute("prompt_hash", prompt_hash)
        start = time.monotonic()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
        latency_ms = (time.monotonic() - start) * 1000
        usage = response.usage
        span.set_attribute("input_tokens", usage.prompt_tokens)
        span.set_attribute("output_tokens", usage.completion_tokens)
        span.set_attribute("latency_ms", round(latency_ms))
        content = response.choices[0].message.content
        logger.info(
            "llm_call_completed",
            extra={
                "user_id": user_id,
                "session_id": session_id,
                "model": model,
                "prompt_hash": prompt_hash,  # SAFE: hash, not raw prompt
                "input_tokens": usage.prompt_tokens,
                "output_tokens": usage.completion_tokens,
                "latency_ms": round(latency_ms),
                "response_preview_redacted": redact_pii(content[:200]),  # SAFE: redacted preview only
            },
        )
        return content

When a prompt injection, data leak, or policy violation is detected, you need to disable the affected LLM feature without redeploying the application. Implement a feature flag kill switch:

# SAFE: feature flag kill switch for LLM features
import os

def is_llm_enabled(feature: str) -> bool:
    """SAFE: check feature flag before any LLM call."""
    # In production, use LaunchDarkly, Flagsmith, or a simple Redis key
    # Operators can flip to "disabled" without a code deploy
    flag_key = f"LLM_FEATURE_{feature.upper()}_ENABLED"
    return os.environ.get(flag_key, "true").lower() == "true"

@app.route("/chat", methods=["POST"])  # `app` and `jsonify` come from the Flask setup shown earlier
def chat():
    if not is_llm_enabled("chat"):
        return jsonify({"error": "This feature is temporarily unavailable"}), 503
    # ... normal LLM call
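
An environment-variable flag only changes when the process environment changes, which in most deployment setups means a restart. A distributed variant can read the flag from Redis so operators can flip it at runtime across all instances; this is a sketch, with the key naming scheme and connection details as illustrative assumptions:

# Hypothetical Redis-backed kill switch; key name and connection settings are illustrative
import redis

_flags = redis.Redis(host="localhost", port=6379, decode_responses=True)

def is_llm_enabled_distributed(feature: str) -> bool:
    # Operators disable a feature with, e.g.: redis-cli SET llm_feature:chat:enabled false
    value = _flags.get(f"llm_feature:{feature}:enabled")
    return value is None or value.lower() == "true"  # default to enabled when the key is absent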

Document the kill switch procedure: which environment variable to set, how to propagate the change, and who has authorization to trigger it. Practice the procedure before you need it.

A prompt leak occurs when the model’s response contains text from the system prompt — either through direct disclosure (“You are a customer support agent…”) or through indirect leakage. Alert on responses that contain known system prompt markers:

# SAFE: alert on responses containing system prompt markers
import logging

logger = logging.getLogger("llm.security")

def check_prompt_leak(response_text: str, system_prompt_markers: list[str]) -> bool:
    """SAFE: returns True if response contains a known system prompt marker."""
    response_lower = response_text.lower()
    for marker in system_prompt_markers:
        if marker.lower() in response_lower:
            logger.warning(
                "Potential prompt leak detected",
                extra={"marker": marker, "response_length": len(response_text)},
            )
            return True
    return False

# Register markers from your system prompt at startup
SYSTEM_PROMPT_MARKERS = [
    "you are a customer support agent",
    "do not reveal these instructions",
    "internal use only",
]

For regulated environments, use private API endpoints to prevent data leaving your network perimeter:

  • AWS Bedrock: a managed service reachable through VPC interface endpoints (AWS PrivateLink), so traffic stays off the public internet. Configure with boto3 using the standard AWS credential chain — no hardcoded endpoint URLs.
  • Azure OpenAI Service: Supports private endpoints. Set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY as environment variables. Use AzureOpenAI client from the openai SDK.
  • OpenAI VPC: Available on Enterprise tier. Set the base URL via OPENAI_BASE_URL environment variable, not hardcoded in code.

Never hardcode endpoint URLs. Use environment-based configuration for all endpoint addresses, so switching between environments (development, staging, production VPC) requires only environment variable changes:

# SAFE: environment-based endpoint configuration
import os
import openai

client = openai.AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # SAFE: from env
    api_key=os.environ["AZURE_OPENAI_API_KEY"],  # SAFE: from env
    api_version="2024-08-01-preview",
)
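
For AWS Bedrock, the equivalent pattern relies on the standard AWS credential chain rather than an API key. A minimal sketch using the Bedrock Converse API, where the region and model ID are illustrative placeholders:

# SAFE-style sketch: Bedrock access via IAM credentials, no API key in source
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # illustrative region
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID (pin explicitly, as above)
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
    inferenceConfig={"maxTokens": 512},  # same max-token discipline as above
)
reply = response["output"]["message"]["content"][0]["text"]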

LLMArmor’s static analysis covers the pre-deployment checklist items that are detectable at the code level:

  • Hardcoded API keys: detected as high-severity findings
  • Missing max_tokens parameters: detected as medium-severity findings
  • Direct user-input interpolation into system prompts: detected as high-severity (LLM01)
  • Missing output filtering before returning LLM responses: detected as medium-severity

For dependency vulnerability scanning, run pip-audit against your requirements.txt in CI:

pip install pip-audit
pip-audit -r requirements.txt
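
If CI runs on GitHub Actions, this can be a dedicated step; the workflow file name and surrounding job layout here are illustrative:

# .github/workflows/security.yml (illustrative step inside an existing job)
- name: Audit Python dependencies
  run: |
    pip install pip-audit
    pip-audit -r requirements.txt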

What is the most critical item on the LLM deployment security checklist?
Preventing hardcoded API keys in source code is the highest-severity pre-deployment control. A leaked API key can result in significant cost exposure (attacker-driven API usage billed to your account), data exposure (queries sent under your account), and rate limit exhaustion for legitimate users. Use environment variables and add secret scanning to CI before any other control.

Why should I pin model versions instead of using aliases?
Model aliases like gpt-4o are pointers that change when the provider releases a new checkpoint. When the alias updates, your application silently begins using a different model with different behavior, different token consumption patterns, and potentially different safety thresholds. Pinned version strings (e.g., gpt-4o-2024-11-20) ensure behavioral stability between deployments and make model updates an explicit, reviewable code change.

What should I log for LLM calls in production?
Log: user identifier (not raw PII), session identifier, model name and pinned version, prompt hash (SHA-256 of the prompt, not the raw text), input and output token counts, response latency in milliseconds, and a PII-redacted preview of the response. Do not log raw system prompts or raw user messages in your standard log stream — store those in a separate encrypted audit store with access controls if needed for debugging.

How do I implement a kill switch for LLM features?
Use a feature flag that controls whether the LLM code path executes. In its simplest form, this is an environment variable read at request time (not cached at startup). For distributed systems, use a feature flag service (LaunchDarkly, Flagsmith, or a simple Redis key) so the flag can be flipped without a redeployment. Document the procedure for toggling the flag, including who has authorization and the expected response time.

How do I securely store prompts in production?
Store prompts as files in a prompts/ directory in your source repository — not inline in application code. Use a templating engine (Jinja2, Mustache) to handle variable substitution at load time. Version control them alongside your application code so prompt changes are reviewed, tracked, and reversible. For prompts containing sensitive instructions, treat them as configuration with the same access controls as environment variables.

What is max_tokens and why is it required?
max_tokens sets the maximum number of tokens the model will generate in a single response. Without it, the model may generate up to the API's maximum (often 4,096 tokens or more), which can cause unexpectedly large API bills, slow responses, and memory pressure if responses are buffered. Set max_tokens to the minimum value that allows your use case to function correctly — most conversational turns require fewer than 512 tokens.

How do I use OpenAI with a private endpoint (VPC)?
On the Azure OpenAI Service, create a private endpoint in your VNet and configure the AzureOpenAI client with azure_endpoint from an environment variable. On AWS Bedrock, requests can be routed through a VPC interface endpoint (PrivateLink) and made via standard boto3 calls — no external API key is needed. In both cases, disable outbound internet access from the service that makes LLM calls so traffic can only flow through the private endpoint.