Weekly report for the first week of March 2023

Recently, compliance requirements have been creeping into the technical side of things. The product side keeps coming to us asking to implement one compliance requirement or another, or to fill in some kind of compliance questionnaire. As I see it, compliance is mainly about how user information is stored, how access to it is standardized, and how users can gradually gain control over their own data. Personnel and organizational changes also have an impact at the technical level. For example, a business was moved to another department while we continued to share databases and other resources with it. At that point the question of cost comes to the fore, that is, which costs should be charged to the other department. Although …

Weekly Technical Report for the fourth week of February 2023

This week I realized that some service frameworks are not written very well, especially certain Java frameworks. When CPU usage reaches about 40%, a lot of timeouts occur. These services are not short of CPU cores or memory, and they are not short of worker threads either. Yet they cannot drive CPU usage any higher. Analysis with Java performance tools showed that most worker threads were idle or in the WAITING state. Even after weighing everything together, I am still puzzled. The services use NIO through the Netty framework, but throughput does not go up. Thread analysis found no particularly busy business threads. My inference is that the cause should be IO or …
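The thread-state observation above can be reproduced from a plain thread dump. The following is a minimal sketch, assuming the dump was taken with a tool such as jstack, which prints a `java.lang.Thread.State:` line under each thread. The function names and the sample dump are hypothetical and only illustrate the counting step, not the tools actually used in the post.

```python
import re
from collections import Counter

# Matches the state line printed under each thread header, e.g.
#    java.lang.Thread.State: WAITING (parking)
STATE_LINE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text: str) -> Counter:
    """Count how many threads are in each java.lang.Thread.State."""
    return Counter(STATE_LINE.findall(dump_text))


def runnable_ratio(states: Counter) -> float:
    """Share of threads that are RUNNABLE; 0.0 for an empty dump."""
    total = sum(states.values())
    return states["RUNNABLE"] / total if total else 0.0


# Hypothetical, shortened thread dump used only for illustration.
SAMPLE_DUMP = '''
"worker-1" #21 prio=5 os_prio=0 tid=0x01 nid=0x11 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"worker-2" #22 prio=5 os_prio=0 tid=0x02 nid=0x12 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"worker-3" #23 prio=5 os_prio=0 tid=0x03 nid=0x13 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"nio-boss-1" #24 prio=5 os_prio=0 tid=0x04 nid=0x14 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    states = count_thread_states(SAMPLE_DUMP)
    print(dict(states))
    print(f"runnable ratio: {runnable_ratio(states):.2f}")
```

A low RUNNABLE share while requests are timing out is consistent with threads waiting on something other than the CPU, which matches the suspicion in the post. It does not by itself say what they are waiting for.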

Technical Weekly Report for the third week of February 2023

This week was focused on a risk item that was discovered before the holidays. A service was using Redis without setting a TTL on its keys and was relying on the Redis eviction policy instead. The Redis instance used by this service is configured with an LRU eviction policy. This may seem perfect, but there are pitfalls when a lot of write traffic arrives within a short period. Redis then triggers eviction and puts much of its effort into it in order to free up enough space. As a result, Redis cannot serve normal operations such as queries very well. This causes dramatic fluctuations in both read and write latency to Redis from …
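One way to stop relying on eviction alone is to give every key an explicit TTL at write time. The following is a minimal sketch, not the fix actually applied in the post. It assumes a client object that exposes `set(name, value, ex=...)` in the style of the redis-py client; the helper names, the default TTL values, and the random jitter (added here so that keys written together do not all expire together) are assumptions for illustration.

```python
import random


def ttl_with_jitter(base_seconds: int, jitter_seconds: int, rng: random.Random) -> int:
    """Return a TTL in [base_seconds, base_seconds + jitter_seconds]."""
    if base_seconds <= 0 or jitter_seconds < 0:
        raise ValueError("base_seconds must be > 0 and jitter_seconds >= 0")
    return base_seconds + rng.randint(0, jitter_seconds)


def set_with_ttl(client, key, value, base_seconds=3600, jitter_seconds=300, rng=None):
    """Write a key with an explicit expiry instead of leaving it to eviction.

    `client` is assumed to offer set(name, value, ex=seconds), as redis-py does.
    """
    rng = rng or random.Random()
    ttl = ttl_with_jitter(base_seconds, jitter_seconds, rng)
    client.set(key, value, ex=ttl)
    return ttl


class RecordingClient:
    """Hypothetical stand-in for a Redis client; it only records calls."""

    def __init__(self):
        self.calls = []

    def set(self, name, value, ex=None):
        self.calls.append((name, value, ex))
        return True


if __name__ == "__main__":
    client = RecordingClient()
    ttl = set_with_ttl(client, "user:1", "cached-profile")
    print(f"wrote user:1 with ttl={ttl}s")
```

With TTLs in place, keys expire on their own schedule and the eviction policy becomes a safety net rather than the main cleanup path.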

Weekly Technical Report for February 2, 2023

The period from the end of January to the beginning of February falls within the Chinese New Year. During this time, whoever is responsible for keeping services running through the Spring Festival needs to be on call to deal with online issues. I was in a constant state of worry, but fortunately no online problems came looking for me. Leaving everything unchanged throughout the Chinese New Year is the best approach. This week I'm evaluating the impact of a major requirement. I believe that for a new business requirement, especially one applied to a complex business system, there are multiple impacts that need to be considered. If, at this point in time, one is not particularly familiar with the system and has little experience …

Weekly Technical Report for the second week of January 2023

This week was mainly about ensuring the stability of various services in the run-up to the Chinese New Year. Recently, I found that a certain service often reported timeouts during peak traffic hours, and I forwarded the alerts to the service owner to deal with. But after a few days, the service owner still couldn't explain the cause. I had to deal with the problem personally, because the alarms had become very serious and the timeout rate on some nodes reached 20%. Probably because of the approaching holidays, traffic had risen significantly during this period, by 100% compared with the end of December. So the first suspicion was that the carrying capacity of the service is …

Weekly Technical Report for January 1, 2023

As we move into 2023, this year is going to be a tough one. It brings several challenges. One is to migrate all the data previously deployed on physical servers to the cloud. Another is to accelerate the development of several new team members so that they can take over the services behind the current main business as soon as possible, and can independently resolve user issues and optimize those services. This will allow me to hand some of the work over to them and focus on important goals that are expected to take a long time this year. There is also the fact that I have reached a stage of personal learning in technical and other areas that will …

Technical Review for the fourth week of December 2022

This week I contracted COVID-19 and was home for a total of 9 days. During this period, the most important thing at work was to assess the impact that the promotion and launch of a mini program would have on the basic service system I am responsible for. This mini program met a need of people in China at that time, and it was expected to bring a large influx of traffic, which might affect the core services of the basic service system. Originally, they had a high-traffic feature that was about to go live, so I had already evaluated and expanded the capacity. However, this time, after they pushed hundreds of millions of notifications, a large number of …

Technical Review for the third week of December 2022

This week, the main work was optimizing a certain Java service. The service has long had the problem that its CPU usage could not be pushed up. The first question to consider was whether the service had too few worker threads. Later on, it turned out that it was not that CPU usage could not be raised, but that higher CPU usage led to more timeouts. There had been feedback long ago that the service's performance was insufficient and that continuing to use it was not recommended. So I have a feeling that the problem comes from the framework, not the business code. From reading and combing through the framework code, I found that the framework uses Netty as the NIO …

Weekly Technical Report for the second week of December 2022

This week's work was mainly a review of the area I am responsible for, and many problems have been identified so far. They mostly concern moving data to the cloud: how to migrate safely, how to transform the current single-region deployment scheme, and how to fix the inconsistency between the on-premises data and the cloud data. In addition, some services were found to be still using on-premises databases, which by rights should be retired. However, these are old services, and code changes carry some risk, so they need to be investigated before taking action. The aspects of the investigation include the basic principles of …
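Before fixing an inconsistency between the on-premises data and the cloud data, it helps to know what kind of inconsistency it is. The following is a minimal sketch, assuming both sides can be exported as key-to-value snapshots small enough to compare in memory. The function name, the record keys, and the sample values are hypothetical; the post does not describe how the comparison was actually done.

```python
from typing import Dict, List


def diff_snapshots(on_prem: Dict[str, str], cloud: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify keys by how the two snapshots disagree."""
    return {
        # Present on-premises but never copied to the cloud.
        "missing_in_cloud": sorted(k for k in on_prem if k not in cloud),
        # Present in the cloud only, e.g. written after the cutover.
        "missing_on_prem": sorted(k for k in cloud if k not in on_prem),
        # Present on both sides with different values.
        "mismatched": sorted(k for k in on_prem if k in cloud and on_prem[k] != cloud[k]),
    }


if __name__ == "__main__":
    # Hypothetical snapshots used only for illustration.
    on_prem = {"user:1": "a", "user:2": "b", "user:3": "c"}
    cloud = {"user:1": "a", "user:2": "B", "user:4": "d"}
    for kind, keys in diff_snapshots(on_prem, cloud).items():
        print(kind, keys)
```

Each category usually calls for a different repair: a backfill for missing keys, and a decision about which side is authoritative for mismatched ones.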

Weekly Technical Report for the first week of December 2022

This week's work, to summarize, was mainly to move a core service to the cloud and then progressively turn the on-premises nodes into traffic-forwarding nodes. The first step is to deploy the service node in the cloud environment: migrate the configuration files and environment, build an image for the cloud environment from the stable version of the code, and bring the service up in the cloud. Once the service is running and testing is complete, the cloud node is not yet receiving any traffic. At this point, some of the on-premises traffic needs to be forwarded to the cloud, and the first thing is to replace some of the on-premises nodes with forwarding …
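The node-by-node switch described above can be sketched as a small planning helper. This is only an illustration under stated assumptions: the node names are hypothetical, and the estimate of forwarded traffic assumes the upstream load balancer spreads requests evenly across all on-premises nodes, which the post does not state.

```python
from typing import Dict, List


def assign_roles(on_prem_nodes: List[str], forwarder_count: int) -> Dict[str, str]:
    """Mark the first `forwarder_count` nodes as forwarders, the rest as service nodes."""
    if not 0 <= forwarder_count <= len(on_prem_nodes):
        raise ValueError("forwarder_count must be between 0 and the number of nodes")
    return {
        node: ("forwarder" if index < forwarder_count else "service")
        for index, node in enumerate(on_prem_nodes)
    }


def forwarded_share(roles: Dict[str, str]) -> float:
    """Estimated share of traffic reaching the cloud, assuming even load balancing."""
    if not roles:
        return 0.0
    forwarders = sum(1 for role in roles.values() if role == "forwarder")
    return forwarders / len(roles)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
    # Convert one more node at each step and watch the cloud share grow.
    for count in range(len(nodes) + 1):
        roles = assign_roles(nodes, count)
        print(count, f"{forwarded_share(roles):.2f}")
```

Converting one node at a time keeps each step small and easy to roll back: switching a forwarder back to a service node returns its share of traffic to the on-premises deployment.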