In our team, we often call downstream services to fetch data. Most of these services control the call rate by limiting transactions per second (TPS) for each caller.
We had an allowed TPS of 1000 for one of our downstream services, but we started getting throttling exceptions. The on-call did some analysis and reported that we were making only about 150 calls per second, yet we were being throttled multiple times during that period. So the team cut a ticket to the service team asking for an explanation.
Let's first look at what is wrong with that report. The assumption was that since we are making 150 calls per second and our allowed TPS is 1000, we are definitely not hitting the limit... which is wrong. Bear with me while I explain why.
TPS is usually not calculated over a full second but over sub-second intervals. Let's say the evaluation window is 100 ms, and during one of those windows I make 101 calls. My calculated TPS for that window would then be (101 / 100) * 1000 = 1010. This means that if our allowed TPS is 1000, the 101st call in that 100 ms window would be throttled. (Well, not quite in the real world, since most TPS systems tolerate some spikes and are not that strict, but if we kept calling at a similar rate across multiple windows, we would be throttled.)
Let's look at the implication of this: “If the evaluation window is 100 ms, I can reach 1000 TPS with only 100 calls, as long as I make all of them within a single 100 ms window.”
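To make the arithmetic concrete, here is a minimal Python sketch that measures TPS per window rather than per second. It assumes fixed, non-overlapping 100 ms windows, which is a simplification; a real service may use sliding windows or a token bucket instead. The function name `peak_windowed_tps` and the sample timestamps are made up for illustration.

```python
from collections import Counter


def peak_windowed_tps(call_times_ms, window_ms=100):
    """Return the highest TPS observed in any fixed window of `window_ms` milliseconds.

    Assumes fixed, non-overlapping windows; this is an illustration,
    not how any particular service actually measures TPS.
    """
    # Count how many calls fall into each window.
    calls_per_window = Counter(int(t // window_ms) for t in call_times_ms)
    busiest = max(calls_per_window.values(), default=0)
    # Scale the busiest window's count up to a per-second rate.
    return busiest * 1000 / window_ms


# 150 calls within one second, but 101 of them land in the first 100 ms.
burst = [i * 0.9 for i in range(101)]      # timestamps from 0 ms to about 90 ms
rest = [100 + i * 18 for i in range(49)]   # spread over the remaining 900 ms
calls = burst + rest

print(len(calls))                # 150
print(peak_windowed_tps(calls))  # 1010.0
```

Averaged over the whole second this is only 150 calls, yet the busiest window is already over a limit of 1000 TPS. The same 150 calls spaced evenly would put 15 calls in each window, which works out to 150 TPS.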
So, how should one read a TPS value? “If the allowed TPS is T, then I am allowed to call the service T times in a second, provided my calls are uniformly distributed across that second.”
The solution to the problem is quite simple and widely known: implement retries instead of failing immediately when a call is throttled, so that the calls eventually get spread out over time.
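A retry wrapper could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `ThrottlingError` stands in for whatever exception your client raises when throttled, `call_with_retries` is a made-up helper, and the delay values are arbitrary. The post only calls for retries; the exponential backoff with random jitter is one common way to do them, chosen here so that retried calls do not all land in the same window again.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for the throttling exception a real client raises."""


def call_with_retries(call, max_attempts=5, base_delay_s=0.05,
                      max_delay_s=1.0, sleep=time.sleep):
    """Invoke `call()`, retrying on ThrottlingError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottlingError:
            # Out of attempts: surface the throttling error to the caller.
            if attempt == max_attempts - 1:
                raise
            # The delay cap doubles on every attempt, up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Random jitter spreads the retries of concurrent callers apart.
            sleep(random.uniform(0, cap))
```

You would wrap the downstream call in a zero-argument function, for example `call_with_retries(lambda: client.get_item(key))`, where `client.get_item` is again a placeholder. The `sleep` parameter exists only so that tests can replace the real sleep.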