Rate Limiting

The gateway applies two independent layers of rate limiting:

  1. IP-based limiting — applied to unauthenticated public endpoints to prevent abuse.
  2. Per-provider daily limits — caps daily request volume per upstream provider to protect quotas and control costs.

Unauthenticated endpoints (e.g. /health, /access/request-key) are protected by IP-based rate limiting using Cloudflare Durable Objects.

When the limit is exceeded the gateway returns 429 Too Many Requests with a Retry-After header indicating how many seconds to wait before retrying.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Too many requests from this IP. Please wait 30 seconds before retrying.",
    "type": "rate_limit_error",
    "code": null
  }
}
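For clients without a JSON parser available, the error message can be extracted from the 429 body with a sed one-liner. This is a sketch: it assumes the error object arrives on a single line in the shape shown above, and tolerates an optional space after the colon.

```shell
# Sample 429 body (compact form of the response above).
BODY='{"error":{"message":"Too many requests from this IP. Please wait 30 seconds before retrying.","type":"rate_limit_error","code":null}}'

# Pull out the "message" field without jq.
MSG=$(printf '%s' "$BODY" | sed -n 's/.*"message": *"\([^"]*\)".*/\1/p')
echo "$MSG"
```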

Each upstream provider can be assigned a maximum number of requests per calendar day (UTC). Limits are configured via the PROVIDER_LIMITS_JSON environment variable:

{
  "openai": 10000,
  "anthropic": 5000,
  "groq": 2000,
  "workers-ai": 50000
}
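For local development the variable can simply be exported as a single-line JSON string. The deployment mechanism itself depends on your setup and is not specified here; on Cloudflare Workers this would typically live in a `wrangler.toml` var or be set with `wrangler secret put PROVIDER_LIMITS_JSON`.

```shell
# Local sketch: export the limits as compact JSON.
# (In production, set this via your platform's config/secrets mechanism.)
export PROVIDER_LIMITS_JSON='{"openai":10000,"anthropic":5000,"groq":2000,"workers-ai":50000}'
```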

When a provider’s daily limit is reached, the gateway stops routing to that provider for the remainder of the day. If all configured providers for a model are exhausted, the request fails with 503 Service Unavailable. Counters reset at UTC midnight.
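Because counters reset at UTC midnight, a client that receives a 503 from exhausted providers can compute how long to wait before the limits clear. A minimal sketch:

```shell
# Seconds remaining until the daily counters reset at UTC midnight.
# The Unix epoch is UTC-based, so midnight is an exact multiple of 86400.
now=$(date -u +%s)
seconds_until_daily_reset=$(( 86400 - now % 86400 ))
echo "Daily limits reset in ${seconds_until_daily_reset}s"
```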

Responses from authenticated endpoints include the following headers:

Header                   Description
-----------------------  ----------------------------------------------------------------
X-RateLimit-Limit        Maximum number of requests allowed in the current window
X-RateLimit-Remaining    Number of requests remaining in the current window
X-RateLimit-Reset        Unix timestamp (seconds) when the current window resets
Retry-After              Seconds to wait before retrying (only present on 429 responses)

Example headers on a normal response:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 843
X-RateLimit-Reset: 1731715200
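Since X-RateLimit-Reset is an absolute Unix timestamp, turning it into a wait duration is a one-line subtraction. A small helper, clamped so a window that has already reset yields zero:

```shell
# Convert an X-RateLimit-Reset timestamp into a number of seconds to wait.
seconds_until() {
  reset=$1
  now=$(date +%s)
  delta=$(( reset - now ))
  # Clamp to zero if the window has already reset.
  if [ "$delta" -gt 0 ]; then echo "$delta"; else echo 0; fi
}
```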

When you receive a 429, wait for the number of seconds specified in Retry-After before retrying. Apply exponential backoff with jitter if Retry-After is not present.

MAX_RETRIES=4
ATTEMPT=0
while [ $ATTEMPT -lt $MAX_RETRIES ]; do
  # Capture the response body with the HTTP status code appended on its own final line.
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -H "Authorization: Bearer <your-token>" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}' \
    https://your-gateway.workers.dev/v1/chat/completions)
  HTTP_STATUS=$(echo "$RESPONSE" | tail -n1)
  if [ "$HTTP_STATUS" -eq 200 ]; then
    # Print everything except the status line (sed '$d' is more portable than `head -n -1`).
    echo "$RESPONSE" | sed '$d'
    break
  elif [ "$HTTP_STATUS" -eq 429 ]; then
    # Simple exponential backoff; prefer the Retry-After header value when available.
    WAIT=$(( 2 ** ATTEMPT ))
    echo "Rate limited. Retrying in ${WAIT}s..."
    sleep $WAIT
    ATTEMPT=$(( ATTEMPT + 1 ))
  else
    echo "Fatal error: $HTTP_STATUS"
    break
  fi
done
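The plain 2^attempt delay in the loop above can be extended with jitter so that many clients rate-limited at the same moment do not all retry in lockstep. A sketch of such a helper (the PID-based jitter is a portability shortcut for illustration; in bash, $RANDOM is the more usual source):

```shell
# Exponential backoff with jitter: 2^attempt seconds plus 0-2 seconds of noise.
backoff_delay() {
  attempt=$1
  # PID-derived jitter keeps this POSIX-portable; use $RANDOM or similar in production.
  jitter=$(( $$ % 3 ))
  echo $(( (1 << attempt) + jitter ))
}
```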
  • Check X-RateLimit-Remaining proactively. If it drops near zero, slow down your request rate before hitting a 429.
  • Respect Retry-After. Always honour the header value exactly — retrying sooner will extend your backoff window on some configurations.
  • Use the OpenAI SDK. The official openai npm package handles 429 retries automatically when pointed at the gateway’s base URL.
  • Distribute load across models. If one model is rate-limited, routing requests to an alternative model served by a different provider avoids daily-limit contention.
  • Monitor via /v1/analytics. Track per-provider request counts to anticipate when daily limits will be reached.
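As an illustration of the last point, with made-up numbers standing in for values fetched from /v1/analytics, an alerting check against a configured daily limit reduces to simple integer arithmetic:

```shell
# Hypothetical figures: requests used so far today vs. the configured daily limit.
used=7200
limit=10000

# Integer percentage of the daily quota consumed.
pct=$(( used * 100 / limit ))

# Warn once usage crosses an 80% threshold.
if [ "$pct" -ge 80 ]; then
  echo "openai: approaching daily limit (${pct}%)"
fi
```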