Rate Limiting

The gateway applies two independent layers of rate limiting:

  1. IP-based limiting — applied to unauthenticated public endpoints to prevent abuse.
  2. Per-provider daily limits — caps daily request volume per upstream provider to protect quotas and control costs.

Unauthenticated endpoints (e.g. /health, /access/request-key) are protected by IP-based rate limiting using Cloudflare Durable Objects.

When the limit is exceeded the gateway returns 429 Too Many Requests with a Retry-After header indicating how many seconds to wait before retrying.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Too many requests from this IP. Please wait 30 seconds before retrying.",
    "type": "rate_limit_error",
    "code": null
  }
}
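For clients without a JSON parser available, the error message can be extracted from the 429 body with a sed one-liner. This is a sketch: it assumes the error object arrives on a single line in the shape shown above, and tolerates an optional space after the colon.

```shell
# Sample 429 body (compact form of the response above).
BODY='{"error":{"message":"Too many requests from this IP. Please wait 30 seconds before retrying.","type":"rate_limit_error","code":null}}'

# Pull out the "message" field without jq.
MSG=$(printf '%s' "$BODY" | sed -n 's/.*"message": *"\([^"]*\)".*/\1/p')
echo "$MSG"
```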

Each upstream provider can be assigned a maximum number of requests per calendar day (UTC). Limits are configured via the PROVIDER_LIMITS_JSON environment variable:

{
  "openai": 10000,
  "anthropic": 5000,
  "groq": 2000,
  "workers-ai": 50000
}
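For local development the variable can simply be exported as a single-line JSON string. The deployment mechanism itself depends on your setup and is not specified here; on Cloudflare Workers this would typically live in a `wrangler.toml` var or be set with `wrangler secret put PROVIDER_LIMITS_JSON`.

```shell
# Local sketch: export the limits as compact JSON.
# (In production, set this via your platform's config/secrets mechanism.)
export PROVIDER_LIMITS_JSON='{"openai":10000,"anthropic":5000,"groq":2000,"workers-ai":50000}'
```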

When a provider’s daily limit is reached, the gateway stops routing to that provider for the remainder of the day. If all configured providers for a model are exhausted, the request fails with 503 Service Unavailable. Counters reset at UTC midnight.
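Because counters reset at UTC midnight, a client that receives a 503 from exhausted providers can compute how long to wait before the limits clear. A minimal sketch:

```shell
# Seconds remaining until the daily counters reset at UTC midnight.
# The Unix epoch is UTC-based, so midnight is an exact multiple of 86400.
now=$(date -u +%s)
seconds_until_daily_reset=$(( 86400 - now % 86400 ))
echo "Daily limits reset in ${seconds_until_daily_reset}s"
```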

Responses from authenticated endpoints include the following headers:

Header                   Description
-----------------------  ----------------------------------------------------------------
X-RateLimit-Limit        Maximum number of requests allowed in the current window
X-RateLimit-Remaining    Number of requests remaining in the current window
X-RateLimit-Reset        Unix timestamp (seconds) when the current window resets
Retry-After              Seconds to wait before retrying (only present on 429 responses)

Example headers on a normal response:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 843
X-RateLimit-Reset: 1731715200
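Since X-RateLimit-Reset is an absolute Unix timestamp, turning it into a wait duration is a one-line subtraction. A small helper, clamped so a window that has already reset yields zero:

```shell
# Convert an X-RateLimit-Reset timestamp into a number of seconds to wait.
seconds_until() {
  reset=$1
  now=$(date +%s)
  delta=$(( reset - now ))
  # Clamp to zero if the window has already reset.
  if [ "$delta" -gt 0 ]; then echo "$delta"; else echo 0; fi
}
```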

When you receive a 429, wait for the number of seconds specified in Retry-After before retrying. Apply exponential backoff with jitter if Retry-After is not present.

MAX_RETRIES=4
ATTEMPT=0
while [ $ATTEMPT -lt $MAX_RETRIES ]; do
  # Capture the response body with the HTTP status code appended on its own final line.
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -H "Authorization: Bearer <your-token>" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}' \
    https://your-gateway.workers.dev/v1/chat/completions)
  HTTP_STATUS=$(echo "$RESPONSE" | tail -n1)
  if [ "$HTTP_STATUS" -eq 200 ]; then
    # Print everything except the status line (sed '$d' is more portable than `head -n -1`).
    echo "$RESPONSE" | sed '$d'
    break
  elif [ "$HTTP_STATUS" -eq 429 ]; then
    # Simple exponential backoff; prefer the Retry-After header value when available.
    WAIT=$(( 2 ** ATTEMPT ))
    echo "Rate limited. Retrying in ${WAIT}s..."
    sleep $WAIT
    ATTEMPT=$(( ATTEMPT + 1 ))
  else
    echo "Fatal error: $HTTP_STATUS"
    break
  fi
done
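The plain 2^attempt delay in the loop above can be extended with jitter so that many clients rate-limited at the same moment do not all retry in lockstep. A sketch of such a helper (the PID-based jitter is a portability shortcut for illustration; in bash, $RANDOM is the more usual source):

```shell
# Exponential backoff with jitter: 2^attempt seconds plus 0-2 seconds of noise.
backoff_delay() {
  attempt=$1
  # PID-derived jitter keeps this POSIX-portable; use $RANDOM or similar in production.
  jitter=$(( $$ % 3 ))
  echo $(( (1 << attempt) + jitter ))
}
```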
  • Check X-RateLimit-Remaining proactively. If it drops near zero, slow down your request rate before hitting a 429.
  • Respect Retry-After. Always honour the header value exactly — retrying sooner will extend your backoff window on some configurations.
  • Use the OpenAI SDK. The official openai npm package handles 429 retries automatically when pointed at the gateway’s base URL.
  • Distribute load across models. If one model is rate-limited, routing requests to an alternative model served by a different provider avoids daily-limit contention.
  • Monitor via /v1/analytics. Track per-provider request counts to anticipate when daily limits will be reached.
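As an illustration of the last point, with made-up numbers standing in for values fetched from /v1/analytics, an alerting check against a configured daily limit reduces to simple integer arithmetic:

```shell
# Hypothetical figures: requests used so far today vs. the configured daily limit.
used=7200
limit=10000

# Integer percentage of the daily quota consumed.
pct=$(( used * 100 / limit ))

# Warn once usage crosses an 80% threshold.
if [ "$pct" -ge 80 ]; then
  echo "openai: approaching daily limit (${pct}%)"
fi
```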