Weekly Technical Report for the first week of December 2022

To summarize, this week’s work was mainly about moving a core service onto the cloud and then gradually converting the off-cloud nodes into traffic-forwarding nodes. The first step of the migration is to deploy a service node in the cloud environment: migrate the configuration files and runtime environment, build an image for the cloud environment from the stable version of the code, and get the service running there.

Once the service is up and testing is complete, the on-cloud node still receives no traffic, so some of the off-cloud traffic has to be forwarded to it. This is done by replacing some of the off-cloud nodes with forwarding nodes, whose job is to forward traffic from the upstream callers to the on-cloud node. That slice of traffic can then be used to observe how the on-cloud node behaves and to check for anomalies; this is what we call “traffic grayscale”. Grayscale traffic is generally kept at 1%–5% of overall traffic, with the proportion adjusted according to how critical the service is; in the test environment it can be larger, say 25%–50%. The advantage of grayscale is that if a problem occurs, or a problem ticket comes in, you only need to shut down the off-cloud forwarding nodes to roll back. Of course, everything above must first be carried out in the test environment; only after it has been verified there do we carefully repeat the same procedure in the production environment.
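
To make the forwarding node concrete, below is a minimal sketch of how such a node could split traffic, assuming a plain HTTP service; the real service presumably sits behind an internal RPC framework, and the backend addresses and percentage here are made up for illustration. Roughly cloudPercent% of requests from the upstream caller are proxied to the on-cloud node, while the rest keep hitting the existing off-cloud process:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical addresses and ratio; the real values would come from the
// service's own configuration system.
const (
	cloudBackend = "http://cloud-node.internal:8080" // on-cloud service node
	localBackend = "http://127.0.0.1:8081"           // existing off-cloud service process
	cloudPercent = 5                                 // grayscale ratio, 1%-5% in production
)

// newProxy builds a reverse proxy to a single backend.
func newProxy(target string) *httputil.ReverseProxy {
	u, err := url.Parse(target)
	if err != nil {
		log.Fatalf("bad backend url %q: %v", target, err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	cloud := newProxy(cloudBackend)
	local := newProxy(localBackend)

	// For each request from the upstream caller, send roughly cloudPercent%
	// of the traffic to the on-cloud node; the rest stays off-cloud.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if rand.Intn(100) < cloudPercent {
			cloud.ServeHTTP(w, r)
		} else {
			local.ServeHTTP(w, r)
		}
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With a split like this, rolling back during grayscale amounts to setting the ratio to zero or simply stopping the forwarding node, which is the quick-rollback property described above.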

Some of the backend systems are intricate: a single call involves multiple services and the business logic is tangled. In that situation, try to control the variables by making only one change in a given period, observing for a while, and moving on to the next operation only after things have stabilized. Alternatively, split a big step into several small steps, do only one small step at a time, observe for a while, and then push forward to the next one. This makes errors much less likely.

Next, if the grayscale verification passes (it usually lasts about a week), the next step is to switch the routes. Switching routes means pointing this service’s routes directly at the on-cloud nodes, after which the upstream services on the cloud will access this service’s on-cloud nodes directly and will no longer visit the off-cloud nodes. Special care must be taken here, because a large amount of online traffic will then hit the on-cloud nodes directly. Before the operation, you first need to estimate, from historical data, how many nodes are needed in each geographic region on the cloud, generally sizing for peak traffic, and scale out in advance if capacity is insufficient, so that a shortage does not cause a large number of timeouts online. At this stage it is better to have too many on-cloud nodes than too few: the extras can be scaled back gradually later, and the cost of that is low. If, however, too few nodes cause a large number of timeouts, scaling out takes time (mainly resource scheduling and service startup), and that delay is likely to trigger user complaints. Worse, if the timeouts trigger a large number of client retries, they may bring down the whole service and cause an online incident. I therefore picked a low-traffic window for the switch, to minimize the impact of any jitter during the cutover.
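
As a rough illustration of the capacity estimate, the sketch below sizes each region from its peak traffic with some headroom; the per-region QPS figures, per-node capacity, and 30% margin are placeholder assumptions, not the numbers actually used:

```go
package main

import (
	"fmt"
	"math"
)

// Placeholder peak QPS per geographic region; in practice these come from
// monitoring data for the existing off-cloud deployment.
var peakQPS = map[string]float64{
	"east":  12000,
	"south": 8000,
	"north": 4500,
}

const (
	perNodeQPS = 800.0 // assumed sustainable QPS of a single on-cloud node
	headroom   = 1.3   // size ~30% above peak so jitter does not cause timeouts
)

func main() {
	for region, qps := range peakQPS {
		// Round up: it is cheaper to scale back later than to scale out
		// under a flood of timeouts and client retries.
		nodes := int(math.Ceil(qps * headroom / perNodeQPS))
		fmt.Printf("%-5s peak %.0f QPS -> %d nodes\n", region, qps, nodes)
	}
}
```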

After the routes are switched, the off-cloud nodes can gradually be converted into forwarding nodes, so that all remaining off-cloud traffic is forwarded to the cloud for processing. The owners of the upstream caller services are then notified and asked to move their services onto the cloud as well, because forwarding to the cloud has overhead and increases latency.