This week I caught COVID-19 and was home for a total of nine days. During that period, the most important thing at work was assessing the impact of a mini program's promotion and launch on the basic service system I am responsible for. The mini program hit a real need for users in China at the time, so a large influx of traffic was expected, and that traffic could affect the core services of the basic service system. They had already launched a feature with huge traffic earlier, and I had evaluated and expanded capacity for that. This time, however, after they pushed notifications to hundreds of millions of users, a large number of timeouts occurred.
At 8:00 am, I was lying in bed recuperating when I got a call saying the login interface was experiencing massive timeouts. I immediately took out my laptop, connected to the intranet, and saw that traffic to the interface had jumped to more than 20 times normal. I broke into a sweat. My first suspicion was that there were not enough business containers to handle that much traffic, but after going through the logs I found that was not the problem. The problem had two parts: one was that an internal interface was being rate limited, and the current traffic had exceeded the threshold; the other was that the database behind the cache database was overloaded.
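For context, that internal rate limit behaves roughly like a token bucket: once the request rate exceeds the configured threshold, the excess calls are rejected outright, which upstream callers see as failures or timeouts. The sketch below is only an illustration of the idea, not the actual limiter used internally; the threshold numbers and function names are made up.

```python
import time
import threading

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative only)."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it must be rejected."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at bucket capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Hypothetical threshold: 5,000 requests per second with a burst of 1,000.
limiter = TokenBucket(rate_per_sec=5000, burst=1000)

def handle_login(request):
    if not limiter.allow():
        # Rejected above the threshold; callers observe errors/timeouts.
        raise RuntimeError("rate limited")
    ...  # normal login logic
```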
The internal interface's rate limit was relatively easy to deal with: an emergency call to the responsible colleagues got the threshold raised. The cache database was harder. The influx of a huge number of returning old users forced the system to read their old data from the database back into the cache, and as traffic kept climbing, the database behind the cache eventually hit full I/O. Afterwards we did another round of batch expansion on several core cache databases, increasing cache capacity and the number of nodes to cope with even greater traffic and concurrency later on.
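The read path that caused the trouble is the usual cache-aside pattern: on a cache miss, the service falls back to the database and backfills the cache. With huge numbers of long-inactive users returning at once, almost every read was a miss and went straight to the database until its I/O saturated. A minimal sketch of that pattern, with made-up client names standing in for the real cache and database layers:

```python
# Hypothetical names; the real service talks to an internal cache database
# and a relational store behind it.
cache = {}   # stands in for the cache database

def db_load_user(user_id):
    """Stand-in for the read against the database behind the cache."""
    return {"user_id": user_id}   # placeholder row

def get_user(user_id):
    # Fast path: the data is already in the cache.
    value = cache.get(user_id)
    if value is not None:
        return value

    # Cache miss: fall back to the database and backfill the cache.
    # When hundreds of millions of long-inactive users come back at once,
    # almost every request takes this branch, and the database I/O
    # saturates long before the cache or the application tier does.
    value = db_load_user(user_id)
    cache[user_id] = value
    return value
```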
In terms of routine maintenance, the main change was adjusting the k8s scheduling policy of a gateway service's Pods so that they are scheduled onto compute nodes backed directly by CVM virtualization (one-tier scheduling). Originally, the service was scheduled onto compute nodes that were CVMs virtualized from physical servers our company had purchased. On such a node, an execution space is carved out on the CVM (think of Docker-style containerization), and Pods are then scheduled onto the node to run (two-tier scheduling). A single compute node may host several or even a dozen Pods, all running on the same operating system but in different execution spaces. This gives poor isolation of CPU, memory, network interfaces, and other resources, so the Pods often interfere with one another. When such a CVM compute node has to be rebooted or upgraded, every Pod on it is evicted, and any fault in the node's operating system affects all the Pods running on it. That is not acceptable for a gateway, which has demanding reliability and latency requirements.
With one-tier scheduling, by contrast, the cloud allocates the computing resources and virtualizes a CVM to run the workload on its own, drawing the CVM's resources from the entire cloud resource pool. This approach also works with k8s, because it is exposed in the cluster as a kind of “super node”: k8s can schedule Pods directly onto the super node, and each Pod then runs under one-tier scheduling, that is, directly on its own separate CVM. Isolation is much stronger, and Pods no longer affect one another. Since each CVM runs only one Pod, a problem on a single Pod can be fixed for that Pod alone without disturbing the others.
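Concretely, the change amounted to steering the gateway's Pods onto the super node instead of the ordinary CVM worker nodes, typically via a node selector plus a toleration for the super node's taint. The sketch below uses the official Kubernetes Python client to patch a Deployment that way; the label key, taint, Deployment name, and namespace are all placeholders, since the real values depend on the cloud provider's super-node integration.

```python
from kubernetes import client, config

def move_gateway_to_super_node():
    # Assumes kubeconfig access to the cluster (in-cluster config also works).
    config.load_kube_config()
    apps = client.AppsV1Api()

    # Placeholder label and taint: real super-node integrations expose their
    # own well-known keys, so these values are illustrative only.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {"node.example.com/type": "super-node"},
                    "tolerations": [
                        {
                            "key": "node.example.com/super-node",
                            "operator": "Exists",
                            "effect": "NoSchedule",
                        }
                    ],
                }
            }
        }
    }

    # Rolling update: each replaced Pod lands on the super node, i.e. on its
    # own dedicated CVM carved out of the cloud resource pool.
    apps.patch_namespaced_deployment(
        name="gateway",        # placeholder Deployment name
        namespace="default",   # placeholder namespace
        body=patch,
    )

if __name__ == "__main__":
    move_gateway_to_super_node()
```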
After modifying the scheduling policy and moving all of the service's Pods onto the super node, the whole service runs much more stably; the peak-hour pattern of simultaneously high timeout rate and high CPU utilization has rarely recurred.