Technical Weekly Report for the third week of February 2023

This week I focused on a risk item that was discovered before the holidays: a service that writes keys to Redis without setting a TTL and relies on the Redis eviction policy instead. The instance is configured with an LRU eviction policy. This may seem like a perfect fit, but there is a pitfall when write traffic spikes over a short period of time. Redis then has to evict keys to free up enough space, and while it puts its effort into eviction it cannot serve normal operations such as queries very well. The result is dramatic fluctuation in both read and write latency to Redis as seen from the business layer. I have run into this problem twice so far. Moreover, without TTLs, Redis always sits at 100% memory utilization, so we cannot tell from the usage how much purchased capacity the current business volume actually needs. It is also unclear whether the cost could be reduced by downsizing during off-peak periods. It is therefore very important to leave Redis sufficient free space.
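To make the "free space" point measurable, here is a minimal sketch of a utilization check. It assumes the memory section of INFO has already been fetched into a dict (for example with a client call such as `info("memory")`) and reads its `used_memory` and `maxmemory` fields. The function name `memory_utilization` is my own, and this is only an illustration, not what the service actually runs.

```python
from typing import Mapping, Optional


def memory_utilization(info: Mapping[str, object]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured.

    `info` is assumed to be the parsed memory section of Redis INFO.
    A maxmemory of 0 means Redis has no memory limit, so a ratio is undefined.
    """
    maxmemory = int(info.get("maxmemory", 0))
    if maxmemory <= 0:
        return None
    return int(info["used_memory"]) / maxmemory
```

With eviction as the only cleanup mechanism this ratio stays pinned near 1.0. Once keys expire on their own, it starts to reflect real demand and can inform sizing decisions.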

So I need to modify the service to set a TTL on every key it writes. By Redis semantics, a write to a key that already exists sets the TTL on it as well, so existing keys pick up a TTL as they are rewritten. For the sake of business stability, I don't want to put a TTL on all keys at once, so I discarded the option of traversing Redis to find keys without a TTL. I did research that option; the traversal has to be done incrementally with a cursor. The length of the TTL also needs care: to prevent a large number of keys from expiring at the same time and causing large latency fluctuations, the TTL is set to a base duration plus a random duration. If the base duration is T, the resulting TTL falls between T and 2T. This means that the longer a key needs to stay, the wider its expiration times are spread. This keeps business latency smoother and avoids overstressing the database.
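The jittered TTL described above can be sketched roughly as follows. The helper names `jittered_ttl` and `set_with_ttl` are hypothetical, and `client` is assumed to be any object exposing a `set(name, value, ex=seconds)` method in the style of the redis-py client; the real service code may look different.

```python
import random
from typing import Optional


def jittered_ttl(base_seconds: int, rng: Optional[random.Random] = None) -> int:
    """Return a TTL in [T, 2T] seconds: base duration T plus a random 0..T."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    rng = rng if rng is not None else random.Random()
    # randint is inclusive on both ends, so the result lies in [T, 2T].
    return base_seconds + rng.randint(0, base_seconds)


def set_with_ttl(client, key: str, value, base_seconds: int) -> int:
    """Write key with a jittered TTL and return the TTL that was used.

    Assumption: `client.set` accepts an `ex` keyword giving the expiry in
    seconds. Writing to an existing key replaces its TTL as well.
    """
    ttl = jittered_ttl(base_seconds)
    client.set(key, value, ex=ttl)
    return ttl
```

For a base of one hour, `jittered_ttl(3600)` gives a value between 3600 and 7200 seconds, so keys written in the same burst expire spread over an hour instead of all at once.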