We are all aware of the benefits of a power nap. But have you ever wondered how a power nap could resolve a major bug in production?
The Scenario
A bunch of orders need to be processed. The backend breaks the orders down into chunks of 10 and pushes them to an Amazon SQS queue, which triggers a Lambda function to process the orders. Simple, innit? Not really.
Say 26 orders needed to be bulk processed: only 2 or 4 would actually get processed, with no trace of the remaining orders. Lambda did not show any errors, so a Lambda timeout was ruled out. ⌛
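As a rough illustration of the "chunks of 10" step, here is a minimal sketch. It assumes one SQS message per order and uses boto3's `send_message_batch`, which accepts at most 10 entries per call; the function name `enqueue_orders`, the queue URL, and the order shape are all hypothetical, and the real backend may well do this differently.

```python
import json
from typing import Any, Dict, List

SQS_BATCH_LIMIT = 10  # SendMessageBatch accepts at most 10 entries per call


def enqueue_orders(sqs_client: Any, queue_url: str, orders: List[Dict[str, Any]]) -> int:
    """Send one SQS message per order, 10 orders per SendMessageBatch call.

    `sqs_client` is assumed to be a boto3 SQS client (or anything exposing
    the same send_message_batch method). Returns the number of batch calls made.
    """
    calls = 0
    for start in range(0, len(orders), SQS_BATCH_LIMIT):
        chunk = orders[start:start + SQS_BATCH_LIMIT]
        entries = [
            # Id only has to be unique within a single batch request.
            {"Id": str(index), "MessageBody": json.dumps(order)}
            for index, order in enumerate(chunk)
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        # A batch call can partially fail; surface that instead of ignoring it.
        if response.get("Failed"):
            raise RuntimeError(f"SQS rejected {len(response['Failed'])} message(s)")
        calls += 1
    return calls
```

Under these assumptions, 26 orders go out as three batch calls (10, 10 and 6) but still arrive in the queue as 26 individual messages.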
The Investigation
First Assumption: Queue is Dropping Requests
My immediate assumption was that the queue might be dropping requests. However, the "No. of messages Received" and "No. of messages Deleted" metrics on the queue were the same, i.e., 26. So all orders were coming in and all were going out. The queue was not the issue either.
Second Assumption: Batch Size Issues
Next, I changed the chunk size from 10 to 2, in case that was the problem. Still the same behaviour: no trace of where the other orders were getting dropped.
Running out of options, I decided to add a lot of logs on the Lambda side (in hindsight, this could have been done earlier) and then tested again. VOILA! 🚀
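For a sense of what "a lot of logs" might look like, here is a minimal sketch of per-record logging in a Python Lambda. It assumes the standard SQS event shape (`messageId` and `body` on each record); `process_records` and the `call_api` callable, which is taken to return the external API's HTTP status code, are hypothetical stand-ins for the real processing code.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("bulk_orders")
logger.setLevel(logging.INFO)


def process_records(records: List[Dict[str, Any]], call_api: Callable[[str], int]) -> List[str]:
    """Process SQS records, logging every step; returns IDs of rejected messages."""
    logger.info("Received batch of %d record(s)", len(records))
    rejected = []
    for record in records:
        message_id = record["messageId"]
        logger.info("Processing message %s", message_id)
        status = call_api(record["body"])  # hypothetical: returns the HTTP status code
        if status >= 400:
            # Without this line, a rejected request leaves no trace at all.
            logger.warning("External API rejected message %s with status %d", message_id, status)
            rejected.append(message_id)
        else:
            logger.info("Message %s processed with status %d", message_id, status)
    return rejected
```

Logging the response status for every record is the part that matters here: a request that the downstream service quietly refuses does not raise an error in Lambda on its own.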
The Culprit
The culprit was an external API called in the bulk order processing that had imposed a rate limit of only 4 requests per minute.
Now the scenario was clear. The batch size on the SQS trigger was already set to 2 anyway, so the only way left to solve this was to add a sleep() in the Lambda, making it wait a few seconds between requests. A POWER NAP! 😴
Pro tip: The Lambda configuration also needs its "Reserved concurrency" set to 1, so that multiple instances are not spawned in parallel and the rate limit is not breached again.
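The nap itself can be sketched in a few lines. This assumes the 4 requests per minute limit mentioned above, which works out to one request every 15 seconds; `process_with_nap` and `call_api` are hypothetical names, and the `sleep` parameter is only there so the pause can be swapped out in tests.

```python
import time
from typing import Any, Callable, Dict, List

RATE_LIMIT_PER_MINUTE = 4
NAP_SECONDS = 60 / RATE_LIMIT_PER_MINUTE  # 15 seconds between requests


def process_with_nap(
    records: List[Dict[str, Any]],
    call_api: Callable[[str], int],
    sleep: Callable[[float], None] = time.sleep,
    nap_seconds: float = NAP_SECONDS,
) -> List[int]:
    """Call the external API once per record, napping after each call."""
    statuses = []
    for record in records:
        statuses.append(call_api(record["body"]))
        # Nap after every call, including the last one, so the next
        # invocation's first request is also spaced out from this one.
        sleep(nap_seconds)
    return statuses
```

One consequence of this approach: the Lambda timeout has to cover the naps. With a batch size of 2 and a 15 second nap, each invocation runs for roughly 30 seconds plus the time the API calls take.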
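The same setting can also be applied from code rather than the console. This is a minimal sketch assuming a boto3 Lambda client, whose `put_function_concurrency` call sets reserved concurrency; the helper name `reserve_single_concurrency` and any function name passed to it are hypothetical.

```python
from typing import Any, Dict


def reserve_single_concurrency(lambda_client: Any, function_name: str) -> Dict[str, Any]:
    """Cap the function at one concurrent execution.

    `lambda_client` is assumed to be a boto3 Lambda client, e.g. the
    result of boto3.client("lambda").
    """
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=1,  # only one instance runs at a time
    )
```

With a single instance, the naps inside one invocation are enough to keep requests spaced out; with several instances running in parallel, each would nap on its own schedule and together they could still exceed the limit.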
Key Takeaways
- Always check for rate limits when integrating with external APIs
- Add comprehensive logging early in the debugging process
- Sometimes the simplest solution (like adding a sleep) is the right one
- Lambda concurrency settings are crucial when dealing with rate-limited resources
So next time you're debugging a production issue and nothing seems to work, remember: sometimes all you need is a good nap! 😴