Rate limiting on APIs

Why do we even need to rate-limit an API?

Rate limiting is a technique that allows a service to control the consumption (throttle) of resources used by an instance of an application, an individual service or an entire service.

Why do applications rate limit?

Protection against DDOS - For public APIs, to enable fair usage, track by IP or track by API Keys, Client IDs

SLAs - Server stability and consistency - Not let single user exploit and limit other users activity

Cost Control - Pricng, more requests can cost more money, only allow certain APIs under a limit

How do applications rate limit?

Fixed window

Allows a given number of API calls for a fixed window of time.

5 requests per minute, so if you send 5 requests on the 1st second of the period, the rest of the minute you would not be able to send any more requests until the second mnute.

Also if the requests are sent just before and just after a fixed window, the client can use 2 times the rate limit.

Sliding windown counter

Sliding window is a improved version of the fixed window where the number of requests will be calculated as current window requests + previous window requests * window overlap with previous window.

Cloudflare uses this approach.

Sliding window log

This technique has a three step process. 1. Remove outdated timestamps 2. Record current timestamp 3. Calculate the last time frame number of requests, if less than rate limit allow the request, if not discard (still save the request timestamp).

Token bucket technique

A bucket has a number of tokens available, any request coming will consume a token. Bucket is filled with tokens at a consistent rate (10 tokens per second). If the client is consuming the API at a 1 per second rate, every second it will accumulate 9 requests per second. Assuming the bucket capacity is 100, the bucket will be filled up to 100 tokens which can be used as a burst for 100 requests. Then the client will have to wait for 1 second until the bucket refills with 10 tokens for it to use. This way the client can request in bulk in regular intervals. The parameters in play are the bucket size, and the refill rate.

Leaky bucket

Leaky bucket technique has two parameters in play, bucket size (queue size) and process rate (rate at which the queue is being consumed). Assuming a bucket size of 100 and a process rate of 1 request / second, in case of a burst, it will fill the queue upto 100 requests and excess requests will be discarded. Queued requests will be processed in FIFO manner at the process rate.

How many buckets needed ?

It is ideal to have different rates (tokens) for different API endpoints (Social media usecase, accounts endpoint 10/day, articles endpoint 30/day, comments endpoint 100/day).

Implementation

Non functional requirements:

Is high availability required? single rate limiter or a distributed node rate limiter?

Latency requirements, performance expectation

Memory efficiency, lesser resources save on cost

Where to build the rate limiter, client-side, middleware, server-side ?

If client-side, unless we build the client ourselves, there is always a risk of the client exploiting the API.

If server-side, if we have a microservice architecture, it would be hard to implement the rate limiter on all the microservices.

Middle-ware would be a ideal space to implement along with load balancing, authentication and security policies.

Detailed implementation

Total round-trip time = request time (time for the client to reach the server/load balancer/proxy) + rate limiting time + authentication time

Where should we keep the counters based on an identity?

If we use SQL databases it would take a considerable time to read the counters, on the other hand if we use something like a redis (in memory cache it would take much less time).

Server perspective

Find an "identity" for throttling the application, eg: IP address, client name, client API key

Determine the application traffic volume breakpoint, this gives an idea on how to setup rate limits, per user, per application or the entire service.

Use rate limiting libraries or build if need to customize.

    http {
        limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
    }

How to best communicate the throttling to the client? Using correct status codes 429, return the next available slot (x-retry-after, x-wait-before-retry headers). Standard parameters to add, ratelimit , remaining requests, reset at time stamp.

Async requests can use 202 Accepted as request is added to the queue.

How to handle race conditions, use resource locks to handle the counter resources.

In case if multinode distributed applications how to maintain the counters across nodes: 1. Use sticky sessions (has its pros and cons) 2. User centralized datastores (redis, cassandra)

Client perspective

Use retry policies,

Exponential backoff
jitter (add randomness for delays for retries, for a case of retrying in 2 seconds make some of them 2.18s, 1.96s)
Max retry limits (Leave at some point, don't retry forever)

Be fault tolerant

Sources:

https://www.youtube.com/watch?v=9CIjoWPwAhU https://www.youtube.com/watch?v=gVVDo2h6DwA