Weekly Technical Report for the fourth week of November 2022

This week I’ve mostly been optimizing a Java service. In the production environment, even under light traffic, batches of calls would occasionally time out, and while the timeouts were happening the CPU usage stayed very low and never climbed. That suggested the threads were all blocked on some operation. Most of the services I’ve worked with, including this one, are IO-intensive: they make a lot of RPC calls, and while an RPC call is in flight the worker thread blocks and cannot handle other requests. For this kind of service, the number of worker threads is usually set very high, so that spare threads remain available while most threads sit blocked on IO; otherwise subsequent requests cannot be processed and a large number of them end up timing out.

This model is actually problematic: although Java uses the NIO model for accepting sockets and parsing requests, the business logic of each request still runs on a pool of worker threads. Once a worker thread blocks on an IO call, it can only wait for that call to complete or time out. If the result of an IO call is not needed for the final response, the call can instead be submitted to another thread pool, making it asynchronous from the worker thread’s point of view.
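
As a rough illustration of that hand-off, here is a minimal sketch; the reportToAudit call, the dedicated ASYNC_POOL, and its size are hypothetical names for this example, not the actual API the service’s framework uses:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncOffloadExample {
    // Hypothetical pool dedicated to fire-and-forget downstream calls,
    // so the request's worker thread never blocks on them.
    private static final ExecutorService ASYNC_POOL = Executors.newFixedThreadPool(16);

    public String handle(String request) {
        String result = doCoreLogic(request); // the part the response actually depends on

        // The reporting call does not affect the response, so hand it off
        // and return immediately instead of blocking the worker thread.
        CompletableFuture.runAsync(() -> reportToAudit(request), ASYNC_POOL)
                .exceptionally(ex -> null); // swallow failures; they must not break the response

        return result;
    }

    private String doCoreLogic(String request) { return "ok"; }      // placeholder business logic

    private void reportToAudit(String request) { /* hypothetical downstream RPC */ }
}
```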

In the end, after troubleshooting, the root cause was that a configuration entry for the number of worker threads was not being read because of a framework issue. The service therefore started with the default worker-thread setting, which equals the number of CPU cores. With so few worker threads handling downstream RPC calls, any slight fluctuation in the downstream interface’s latency caused a large number of requests to time out, and the CPU usage stayed low because the worker threads spent most of their time blocked waiting for IO to return.
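
To get a rough sense of the ceiling (with made-up numbers): if each request spends about 100 ms blocked on a downstream RPC, four worker threads can complete at most roughly 4 × (1000 ms / 100 ms) = 40 requests per second, so as soon as the downstream slows even slightly, anything above that rate just queues up and times out while the CPU stays almost idle.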

How to troubleshoot this: first, configure the logging framework to print the current thread name; for logback this means adding %t (or %thread) to the log pattern. Then run a load test against a single node in the test environment and observe how the service handles requests at a given QPS; it is best to add a StopWatch to the request-handling logic to assist the analysis. The logs turned out to contain output from only a few threads, which did not look right, so I used jstack to dump and filter the threads and confirmed there were indeed only four worker threads.
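
For the timing part, the following is a minimal sketch of what "add a StopWatch" means here; it assumes Spring’s org.springframework.util.StopWatch is on the classpath, and the handler and method names are made up for the example:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.util.StopWatch;

public class OrderHandler {
    // With %t (%thread) in the logback pattern, every line below also shows
    // which worker thread produced it, e.g. nioEventLoopGroup-5-1.
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    public void handle(String request) {
        StopWatch watch = new StopWatch("handle");

        watch.start("rpcCall");
        callDownstreamRpc(request);   // the blocking RPC under suspicion
        watch.stop();

        watch.start("assembleResponse");
        assembleResponse(request);
        watch.stop();

        log.info("request done, timings:\n{}", watch.prettyPrint());
    }

    private void callDownstreamRpc(String request) { /* hypothetical */ }

    private void assembleResponse(String request) { /* hypothetical */ }
}
```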

The explanation is that the framework used by this Java service is built on netty for request handling, and the request logic is ultimately forwarded to the netty worker thread pool, whose threads carry the name prefix nioEventLoopGroup-5. The per-thread cumulative CPU time also confirmed that only these four threads were executing the main logic of all requests. For this kind of service in production, the number of worker threads is generally set to 800~1000, and RPC call timeouts are set strictly, to prevent a large number of worker threads from being blocked and the node’s throughput from dropping drastically.
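
The actual knob lives in the framework’s configuration, but roughly speaking the intended setup corresponds to something like the sketch below; the pool size, the thread names, and the callDownstreamAsync client are assumptions for illustration, not the framework’s real API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WorkerPoolSketch {
    // A large, named business pool so that threads blocked on IO do not
    // starve request handling; 800 matches the rough production setting above.
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(800, r -> {
        Thread t = new Thread(r);
        t.setName("biz-worker-" + t.getId());
        return t;
    });

    public Future<String> submit(String request) {
        return WORKERS.submit(() -> {
            Future<String> rpc = callDownstreamAsync(request); // hypothetical RPC client
            // Strict timeout on the RPC so one slow downstream cannot pin this thread for long.
            return rpc.get(200, TimeUnit.MILLISECONDS);
        });
    }

    private Future<String> callDownstreamAsync(String request) {
        return CompletableFuture.completedFuture("ok"); // stub for the sketch
    }
}
```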

After updating the framework, the problem was solved. In addition, I discovered that one of the service’s main interfaces makes a downstream call over HTTP via the OkHttp library, and that the call did not set any timeout. This is wrong: in an extreme case where the downstream is slow to respond, the worker thread will block for a long time. You must set a reasonable timeout on downstream calls to keep upstream and downstream running smoothly and to prevent avalanches. There are three timeouts in general, the connection timeout, the read timeout, and the write timeout, and all of them need to be set. Also, since this interface may issue a large number of HTTP calls, I enabled OkHttp’s ConnectionPool.
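
A minimal sketch of what setting those three timeouts looks like with OkHttp’s builder; the concrete values here are placeholders, not the ones actually used in the service:

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

public class HttpClientConfig {
    // One client instance reused across the service, with all three timeouts set explicitly.
    public static final OkHttpClient CLIENT = new OkHttpClient.Builder()
            .connectTimeout(500, TimeUnit.MILLISECONDS) // time allowed to establish the connection
            .readTimeout(1, TimeUnit.SECONDS)           // time allowed waiting for response data
            .writeTimeout(1, TimeUnit.SECONDS)          // time allowed for sending the request body
            .build();
}
```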

According to the documentation, the advantage of ConnectionPool is that multiple HTTP/1.1 or HTTP/2 requests to the same address can share the same underlying connection. Note, however, that this sharing only works if the server side supports persistent (keep-alive) connections.
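
Enabling the pool is just another option on the same builder; the pool size and keep-alive duration below are placeholders, and the sharing only pays off if the same OkHttpClient instance is reused across calls:

```java
import java.util.concurrent.TimeUnit;
import okhttp3.ConnectionPool;
import okhttp3.OkHttpClient;

public class PooledHttpClientConfig {
    // Keep up to 32 idle connections alive for 5 minutes; repeated requests to the
    // same host reuse a pooled connection instead of reconnecting, provided the
    // server keeps the connection open (keep-alive or HTTP/2).
    public static final OkHttpClient CLIENT = new OkHttpClient.Builder()
            .connectionPool(new ConnectionPool(32, 5, TimeUnit.MINUTES))
            .build();
}
```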

In addition, this week I also went through the material for Tencent Cloud’s TCP Advanced Architect certification, because the exam was scheduled for the weekend. This exam is somewhat harder than the Architect and Practitioner exams I took earlier and still requires solid preparation. I did not have much time to study during work, so I crammed the important material straight through until 5:00 a.m. on Saturday, and fortunately I passed in the end. I should write a dedicated article about this exam.