Weekly Technical Report for the second week of January 2023

This week was mainly about ensuring the stability of various services in the run-up to Chinese New Year. I had recently noticed that a certain service often reported timeouts during peak traffic hours, and I flagged it to the service owner to handle. After a few days, however, the owner still could not explain the cause, and the alarms had become serious, with the timeout rate on some nodes reaching 20%, so I had to deal with the problem myself. Traffic had risen significantly during this period, presumably because of the approaching holiday, up about 100% compared with the end of December. The first suspicion was therefore that the service lacked capacity, so I scaled it out. The expansion did not solve the problem: the alarm frequency and timeout rate stayed basically unchanged. I then analyzed the source code of the service and found that its interface first calls a downstream service and then inserts data into the database asynchronously. Since the asynchronous operation does not block the worker thread, the downstream call should have been the first suspect (in practice I wrestled with the database side for a long time).
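
To make that request path concrete, here is a minimal Go sketch of the shape described above. The report does not show the service's actual code or language, so callDownstream and insertRecord are hypothetical stand-ins for the real downstream RPC and database write.

```go
package main

import (
	"context"
	"log"
	"time"
)

// callDownstream is a hypothetical stand-in for the real downstream RPC.
func callDownstream(ctx context.Context, req string) (string, error) {
	time.Sleep(10 * time.Millisecond) // placeholder for the network call
	return "resp for " + req, nil
}

// insertRecord is a hypothetical stand-in for the real database insert.
func insertRecord(rec string) error {
	time.Sleep(5 * time.Millisecond) // placeholder for the DB write
	return nil
}

func handle(ctx context.Context, req string) (string, error) {
	// Synchronous downstream call: the worker blocks here, so a slow
	// downstream shows up directly as a request timeout.
	resp, err := callDownstream(ctx, req)
	if err != nil {
		return "", err
	}

	// Asynchronous insert: it runs in its own goroutine and never holds up
	// the worker, which is why the database was an unlikely suspect.
	go func() {
		if err := insertRecord(resp); err != nil {
			log.Printf("async insert failed: %v", err)
		}
	}()

	return resp, nil
}

func main() {
	resp, err := handle(context.Background(), "req-1")
	log.Println(resp, err)
}
```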

I logged into a node from the terminal to check the logs and found that all the downstream calls were actually going to the same IP and port, a Scarecrow node. So I first migrated the service to the cloud to rule out any impact from the Scarecrow forwarding layer. Once the migration was done and I cut the traffic over, a large number of timeouts appeared in one particular region (not the original one); scaling out did not resolve it either, which was puzzling, so I temporarily cut the traffic back. Curious why the timeout rate had become worse after moving to the cloud, I analyzed the cloud-side monitoring and found that after the traffic switch, one node of the downstream service on the cloud had very high CPU usage while the others were almost idle. That pointed to a load balancing issue, so I went back to the source code of the service. Sure enough, after the service fetched the list of available nodes for the downstream service from the registry, it only ever called the first item in the list, so almost all the traffic in a given region was concentrated on a single downstream node.

When the calls were originally forwarded through the Scarecrow, the Scarecrow itself did load balancing when calling the nodes on the cloud. So the original problem was that traffic had risen sharply while the service did no load balancing of its own, which overloaded one of the Scarecrows and in turn caused the timeouts. The Scarecrows are proxy nodes with no business logic of their own, so they can handle far more concurrency, which is why the timeout problem was not very noticeable there. Once on the cloud, the traffic was concentrated directly on a single business node, and the sudden influx instantly caused that node to time out on a large scale.
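
For illustration, the following is a minimal Go sketch of the "always take the first node" behaviour described above; the service's actual registry client and types are not given in the report, so Node, fetchNodes, and pickNode are invented names.

```go
package main

import "fmt"

// Node is a hypothetical entry returned by the service registry.
type Node struct {
	Addr string
}

// fetchNodes stands in for the registry lookup that returns the available
// downstream nodes for the caller's region.
func fetchNodes() []Node {
	return []Node{
		{Addr: "10.0.0.1:8080"},
		{Addr: "10.0.0.2:8080"},
		{Addr: "10.0.0.3:8080"},
	}
}

// pickNode reproduces the problematic behaviour: it always returns the first
// entry, so every request from this caller lands on the same downstream node.
func pickNode(nodes []Node) Node {
	return nodes[0]
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println("calling", pickNode(fetchNodes()).Addr) // always 10.0.0.1:8080
	}
}
```

Always selecting index 0 is harmless at low traffic, but as soon as one node has to absorb an entire region's load it becomes a hot spot, which matches what the monitoring showed.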

With this hypothesis, I went back to the Scarecrow monitoring and found that one Scarecrow had indeed been overloaded, with CPU usage above 95%. Knowing the cause, the next step was to continue the cloud migration first, to eliminate the timeout alarms as much as possible, and only then change the code to add a random load balancing algorithm, because pushing a code change through the release process right before the holiday would be troublesome. So I first doubled the number of cores on a single node of the downstream service, which greatly increased its processing capacity. Then, when cutting the traffic over, the CPU usage of one downstream node rose sharply and eventually settled at a level much higher than the other nodes, which was expected. Once the whole service smoothed out, the migration was complete; by then the alarms had been eliminated and the timeout rate had dropped to zero. The remaining step was to modify the calling service's code to add the load balancing mechanism and release a new version after the various approval processes were completed.
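
As a sketch of that change, reusing the same invented Node and fetchNodes names from the earlier example, a uniformly random pick over the registry list is enough to spread the load; the implementation actually released is not shown in the report.

```go
package main

import (
	"fmt"
	"math/rand"
)

type Node struct {
	Addr string
}

func fetchNodes() []Node {
	return []Node{
		{Addr: "10.0.0.1:8080"},
		{Addr: "10.0.0.2:8080"},
		{Addr: "10.0.0.3:8080"},
	}
}

// pickNode now chooses uniformly at random, spreading requests across all
// downstream nodes instead of concentrating them on the first one.
func pickNode(nodes []Node) Node {
	return nodes[rand.Intn(len(nodes))]
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println("calling", pickNode(fetchNodes()).Addr)
	}
}
```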

After I released the new version, the CPU usage of each node of the downstream service was roughly even, and the problem was finally solved.