# Rate Limiting

## Overview

The gateway applies two independent layers of rate limiting:
- IP-based limiting — applied to unauthenticated public endpoints to prevent abuse.
- Per-provider daily limits — caps daily request volume per upstream provider to protect quotas and control costs.
## IP-Based Rate Limiting

Unauthenticated endpoints (e.g. `/health`, `/access/request-key`) are protected by IP-based rate limiting using Cloudflare Durable Objects.
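The per-IP accounting such a Durable Object performs can be sketched as a fixed-window counter. The helper below is illustrative only; the function and field names are assumptions, not the gateway's actual implementation:

```javascript
// Illustrative fixed-window accounting of the kind a Durable Object
// could keep per client IP. All names here are hypothetical.
function checkFixedWindow(record, nowMs, { limit, windowSeconds }) {
  const windowMs = windowSeconds * 1000;
  const windowStart = nowMs - (nowMs % windowMs);

  // Start a fresh window once the previous one has elapsed.
  if (!record || record.start !== windowStart) {
    record = { start: windowStart, count: 0 };
  }
  record.count += 1;

  // Seconds until the window resets, suitable for a Retry-After header.
  const retryAfter = Math.ceil((windowStart + windowMs - nowMs) / 1000);
  return { record, allowed: record.count <= limit, retryAfter };
}
```

Inside a Durable Object, `record` would live in the object's storage, so all requests from one IP are serialised through a single counter.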
When the limit is exceeded, the gateway returns `429 Too Many Requests` with a `Retry-After` header indicating how many seconds to wait before retrying.
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Too many requests from this IP. Please wait 30 seconds before retrying.",
    "type": "rate_limit_error",
    "code": null
  }
}
```

## Per-Provider Daily Limits

Each upstream provider can be assigned a maximum number of requests per calendar day (UTC). Limits are configured via the `PROVIDER_LIMITS_JSON` environment variable:

```json
{
  "openai": 10000,
  "anthropic": 5000,
  "groq": 2000,
  "workers-ai": 50000
}
```

When a provider's daily limit is reached, the gateway stops routing to that provider for the remainder of the day. If all configured providers for a model are exhausted, the request fails with `503 Service Unavailable`. Counters reset at UTC midnight.
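Because counters reset at UTC midnight, a client can compute how long to wait once a provider's daily quota is exhausted. A minimal sketch; the helper name is ours, not part of the gateway API:

```javascript
// How long until the next UTC midnight, when daily counters reset.
function msUntilUtcMidnight(now = new Date()) {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1 // Date.UTC normalises day overflow (e.g. Jan 32 -> Feb 1)
  );
  return nextMidnight - now.getTime();
}
```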
## Rate Limit Headers

Responses from authenticated endpoints include the following headers:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum number of requests allowed in the current window |
| `X-RateLimit-Remaining` | Number of requests remaining in the current window |
| `X-RateLimit-Reset` | Unix timestamp (seconds) when the current window resets |
| `Retry-After` | Seconds to wait before retrying (only present on 429 responses) |
Example headers on a normal response:
```http
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 843
X-RateLimit-Reset: 1731715200
```

## Handling Rate Limits

### Exponential Backoff

When you receive a 429, wait for the number of seconds specified in `Retry-After` before retrying. Apply exponential backoff with jitter if `Retry-After` is not present.
```bash
MAX_RETRIES=4
ATTEMPT=0

while [ $ATTEMPT -lt $MAX_RETRIES ]; do
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -H "Authorization: Bearer <your-token>" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}' \
    https://your-gateway.workers.dev/v1/chat/completions)

  HTTP_STATUS=$(echo "$RESPONSE" | tail -n1)

  if [ "$HTTP_STATUS" -eq 200 ]; then
    echo "$RESPONSE" | head -n -1
    break
  elif [ "$HTTP_STATUS" -eq 429 ]; then
    WAIT=$(( 2 ** ATTEMPT ))
    echo "Rate limited. Retrying in ${WAIT}s..."
    sleep $WAIT
    ATTEMPT=$(( ATTEMPT + 1 ))
  else
    echo "Fatal error: $HTTP_STATUS"
    break
  fi
done
```

```javascript
async function chatWithRetry(messages, { maxRetries = 4 } = {}) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(
      'https://your-gateway.workers.dev/v1/chat/completions',
      {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer <your-token>',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ model: 'gpt-4o', messages }),
      }
    );

    if (response.ok) return response.json();

    if (response.status === 429 && attempt < maxRetries) {
      const retryAfter = response.headers.get('Retry-After');
      const waitMs = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : Math.min(1000 * 2 ** attempt + Math.random() * 500, 30_000);

      console.warn(`Rate limited. Retrying in ${waitMs}ms (attempt ${attempt + 1}/${maxRetries})`);
      await new Promise(resolve => setTimeout(resolve, waitMs));
      continue;
    }

    const { error } = await response.json();
    throw new Error(`[${error.type}] ${error.message}`);
  }

  throw new Error('Max retries exceeded');
}
```

## Best Practices
- Check `X-RateLimit-Remaining` proactively. If it drops near zero, slow down your request rate before hitting a 429.
- Respect `Retry-After`. Always honour the header value exactly; retrying sooner will extend your backoff window on some configurations.
- Use the OpenAI SDK. The official `openai` npm package handles 429 retries automatically when pointed at the gateway's base URL.
- Distribute load across models. If one model is rate-limited, routing requests to an alternative model served by a different provider avoids daily-limit contention.
- Monitor via `/v1/analytics`. Track per-provider request counts to anticipate when daily limits will be reached.
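The first practice can be sketched as a small helper that derives a pause from the rate-limit headers. The threshold and the even-spacing strategy below are illustrative choices, not gateway behaviour:

```javascript
// Derive a pause (in ms) from the gateway's rate-limit headers so a client
// slows down before hitting a 429. minRemaining is an arbitrary threshold.
function pauseBeforeNextRequest(remaining, resetUnixSeconds, nowMs = Date.now(), minRemaining = 10) {
  if (!Number.isFinite(remaining) || remaining >= minRemaining) return 0;
  const windowMsLeft = Math.max(resetUnixSeconds * 1000 - nowMs, 0);
  // Spread the remaining request budget evenly over the rest of the window.
  return remaining > 0 ? windowMsLeft / remaining : windowMsLeft;
}

// Usage with a fetch Response:
// const remaining = Number(response.headers.get('X-RateLimit-Remaining'));
// const reset = Number(response.headers.get('X-RateLimit-Reset'));
// await new Promise(r => setTimeout(r, pauseBeforeNextRequest(remaining, reset)));
```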