Weekly report for the first week of March 2023

Lately, compliance requirements have been creeping into the technical side of things. There is always some product coming to the table that wants us to implement this or that compliance requirement, or to fill in some compliance questionnaire. My impression is that compliance is mainly about how user information is stored: access needs to be standardized, and users should gradually gain control over their own data. There are also personnel and organizational changes that ripple down to the technical level. For example, one business line was transferred to another department, but it still shares our databases and other resources. That is when cost becomes the issue: the money should now be counted on their side. We are one company and one business group, but probably because of the internal accounting mechanism, these cost questions are taken quite seriously, and there are endless arguments along the lines of "they use our database, so the bill should land over there." Sometimes this feels like pure internal friction rather than anything driven by a development perspective. On top of that, when other businesses need to adjust shared database structures or interfaces because of their own requirement changes, it gets awkward: we spend man-hours doing this work for them and get no recognition for it, yet their people chase and pressure us every day. Being caught in the middle is hard.

Then there is the PHP gateway migration I mentioned in my last weekly report, which has actually been in progress for a month or two. The new code was written long ago, but switching traffic over to the new service kept running into one problem after another. The pattern was always the same: after switching traffic in the test environment, nobody reported anything for a long time; as soon as the production environment was switched, people came one after another saying this interface suddenly stopped working or that interface was throwing errors. Since the problems appeared online, the only option was to cut the traffic back first and analyze afterwards. One round of this easily eats a week, and sometimes it took four or five days before anyone even reported a problem, so the whole process dragged on even longer. The leader kept asking why this was taking so long, but that was simply the reality. The subsequent coding and maintenance of the new service is not my job, but I do understand that replacing an existing service with a new one has to be a process: with no documentation and no test cases, you do not know what hidden mechanisms an interface has, and you cannot be entirely sure that the code you wrote is equivalent to the original. A lot of this implementation-level work was never explicitly reported, but I feel management has to be able to appreciate it anyway. The process was repetitive and difficult, but the refactoring of this service was finally completed this week and all traffic has been migrated to the new service. The life cycle of the old PHP gateway is over.

I have been in charge of the main backend business for almost half a year now. When I first took it over, my lack of experience made other people's technology seem unfathomable. After being in close contact with it for a long time, I realize it has plenty of imperfections of its own.
There are also design flaws, some of them quite serious. One example is using Redis as a database: to put it bluntly, no cache expiration time is set, yet the data inside that Redis is important and cannot be lost. There are also old services with no documentation whose maintainers left long ago; normally nobody can even begin to scrutinize them, yet there is always one or two of them still being accessed, or long since broken until a customer suddenly notices and presses us to fix them. Plenty of security issues need rectifying as well: a number of API accounts hold high-level permissions that need to be tightened, but I have no idea what those accounts are actually used for. Problems like these come up constantly in daily work, and they are hard to avoid. ...

November 15, 2023

Weekly Technical Report for the fourth week of February 2023

This week I realized that some of our service frameworks are not written very well, certain Java frameworks in particular. Once CPU usage reaches about 40%, a large number of timeouts appear. For these services it is not that CPU cores or memory are insufficient, nor that there are too few worker threads; the CPU simply cannot be driven any higher. Analyzing with Java performance tools showed that most of the worker threads are actually sitting in idle (WAITING or TIMED_WAITING) states. Putting all the evidence together, I am still puzzled: NIO is used, the Netty framework is used, yet the throughput will not go up. Thread analysis found no particularly busy business threads, so my inference is that I/O or some kind of waiting mechanism is behind the low processing efficiency. This time I had planned to cut costs by reducing the number of nodes, and I did not consider many factors before executing it; when shrinking capacity I only checked CPU load to judge whether the remaining nodes could cope. When the average CPU utilization of the Beijing workload rose to 35%-40%, the whole service saw massive timeouts, which shocked me at the time. Looking at the monitoring afterwards, almost every node in that backend region had timed out; the whole thing was in a state of "shock". This week I also continued refactoring the old PHP gateway service; the new gateway is written in Java. I am not really in favor of using Java for a gateway, since the execution characteristics of the language largely make it unsuitable for particularly high concurrency. We are currently on JDK 8, which has no lightweight threads, and each 4c8g container tops out at around 800 threads; beyond that, thread-switching overhead becomes particularly large. So single-container throughput is limited, and carrying the same amount of traffic requires more containers. I had originally rewritten it in Go, using a framework that is very good and has a dedicated team maintaining it, but the leader still wanted me to use the department's self-developed Java framework, probably mostly for personnel reasons. I had no choice but to write it that way first. ...
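For reference, the kind of thread-state check I am describing can be sketched with the JDK's own ThreadMXBean; this is an illustrative snippet rather than the exact tooling I used (in practice jstack and the usual profilers serve the same purpose):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump all live threads without lock/monitor details.
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);

        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) continue;
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }

        // If most worker threads are WAITING/TIMED_WAITING rather than RUNNABLE,
        // the bottleneck is more likely I/O or some waiting mechanism than CPU.
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```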

November 15, 2023

Technical Weekly Report for the third week of February 2023

This week was focused on dealing with a risk item discovered before the holidays: a service that uses Redis without setting a TTL on its keys, relying instead on Redis' eviction policy. The Redis instance in question is configured with LRU eviction. That may sound fine, but there is a pitfall when a burst of write traffic arrives within a short period: Redis triggers eviction and concentrates its effort on freeing enough space, which means it can no longer serve normal operations such as queries very well. The result is dramatic fluctuation in both read and write latency as seen from the business layer. I have hit this problem twice so far. On top of that, with no TTLs the instance always sits at 100% utilization, so we cannot tell from capacity alone how much purchased capacity the current business volume actually needs, nor whether we could scale down during off-peak hours to reduce cost. Leaving Redis enough free space is therefore very important. So I need to modify the service to set a TTL on every newly written key; by Redis' semantics, writing a key that already exists will set its TTL as well. For the sake of business stability I do not want every existing key to suddenly acquire a TTL, so I discarded the alternative of traversing Redis to look for keys without one (I did research that approach; the traversal has to be done with a cursor). The length of the TTL is also delicate: to prevent a large number of keys from expiring at the same time and causing big latency swings, the TTL is set to a base duration plus a random extra. If the base duration is T, the final TTL falls between T and 2T, so the longer a key is meant to stay, the wider the window over which its expiration is spread. This keeps business latency smoother and avoids putting sudden pressure on the database behind the cache. ...
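A minimal sketch of the jittered-TTL write, assuming the Jedis client; the key names and the base duration are placeholders, not the service's real values:

```java
import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class JitteredTtlWriter {
    // Base TTL T in seconds; the effective TTL is uniform in [T, 2T).
    private static final int BASE_TTL_SECONDS = 24 * 3600;

    private final Jedis jedis;

    public JitteredTtlWriter(Jedis jedis) {
        this.jedis = jedis;
    }

    public void put(String key, String value) {
        int ttl = BASE_TTL_SECONDS + ThreadLocalRandom.current().nextInt(BASE_TTL_SECONDS);
        // SETEX (re)applies the TTL even when the key already exists, so keys that
        // are rewritten gradually acquire an expiration without a separate scan.
        jedis.setex(key, ttl, value);
    }
}
```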

June 21, 2023

Weekly Technical Report for February 2, 2023

The end of January and the beginning of February fall during Chinese New Year. Over that period, whoever is responsible for keeping things running through the Spring Festival has to be on call to deal with online issues. I was in a constant state of worry, but fortunately no online problems came looking for me; keeping everything untouched through the holiday is the best outcome. This week I evaluated the impact of a major requirement. For a new business requirement, especially one applied to a complex business system, there are many kinds of impact to consider. If at that point you are not particularly familiar with the system and have little experience with it, the best choice is to make the smallest change possible. That is not conservatism; it is keeping the blast radius as small as you can imagine, because you do not know where some counterintuitive mechanism is quietly carrying important business logic. I did not arrive at this conclusion by imagination: this entry was written six months later, and by the time I wrote it I had already run into this at least twice. Back then I made a drastic adjustment to a service; when the change was done everything looked fine, and after release it still seemed normal. Only a number of weeks later did I stumble upon a mechanism that strung upstream and downstream together and had very nearly been broken by my change. A business system you take over has most likely passed through many hands and hides a great deal of history you know nothing about, so do not touch the framework or the core logic if you can possibly avoid it. This week I also completely solved a problem with a security encryption service whose encryption interface could not handle Chinese. The root cause is that it uses the plaintext directly as the Redis key; when the plaintext contains Chinese, the key is stored but can never be found again. My fix is that the stored key should never directly contain any business plaintext: hash it first. That avoids the encoding and compatibility problems, and also improves security considerably. ...
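A minimal sketch of the fix, assuming SHA-256 and a simple key prefix; both choices are mine for illustration and may differ from what the service actually uses:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class CacheKeys {
    private CacheKeys() {}

    /** Builds a Redis key that never embeds the raw business plaintext. */
    public static String forPlaintext(String prefix, String plaintext) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            // e.g. "enc:3a7bd3e2..." -- ASCII only, so Chinese text in the plaintext
            // can no longer break key lookup, and the original content is not
            // exposed in the key space.
            return prefix + ":" + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```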

June 21, 2023

Weekly Technical Report for the second week of January 2023

This week was mainly about ensuring the stability of various services in the run-up to Chinese New Year. I had recently noticed that a certain service often reported timeouts during peak traffic hours, and I forwarded a reminder to the service owner to deal with it. A few days later the owner still could not explain the cause, so I had to handle the problem myself, because the alarms had become serious: the timeout rate on some nodes reached 20%. Traffic had risen sharply as the holidays approached, up about 100% compared with the end of December, so the first suspicion was that the service simply lacked capacity, and I expanded it. The expansion did not help; alarm frequency and timeout rate stayed basically unchanged. I then read the service's source code and found that its interface first calls a downstream service and then inserts data into the database asynchronously. Since the asynchronous write does not block the worker thread, the downstream call should have been the first suspect (in reality I wrestled with the database side for a long time). Logging into a node to check the logs, I found that every downstream call was going to the same IP and port, through a Scarecrow node. So I first moved the service to the cloud to rule out any effect of the Scarecrow forwarding. After the migration, as soon as I cut traffic over, a large number of timeouts appeared in one region (not the original one); expanding capacity there did not resolve it, which puzzled me, so I temporarily cut the traffic back. I was very curious why the timeout rate got worse after moving to the cloud. Analyzing the cloud monitoring, I found that after the traffic switch one node of the downstream service had very high CPU usage while the others were nearly idle. That pointed to load balancing, so I went back to the source code. Sure enough, after the service fetched the list of available downstream nodes from the registry, it only ever called the first item in the list. So almost all the traffic from that region was concentrated on a single node of the downstream service. When the calls had gone through the Scarecrow, the Scarecrow itself did load balancing across the cloud nodes. So the original problem was that traffic had grown a lot and this service did no load balancing, which overloaded one of the Scarecrows and in turn caused the timeouts. The Scarecrows are proxy nodes with no business logic of their own, so they can handle far more concurrency and the timeouts were not very noticeable there. Once on the cloud, the traffic landed directly on one business node, and the sudden influx instantly produced a flood of timeouts on it. With that hypothesis, I went back to the Scarecrow monitoring and confirmed that one Scarecrow had indeed been overloaded, with CPU usage above 95%.
Knowing the cause, the next step was to continue the move to the cloud: first eliminate the timeout alarms as far as possible, then change the code to add a random load-balancing algorithm. I did it in that order because modifying code right before the holiday is troublesome process-wise. So I first doubled the number of cores on the single overloaded node of the downstream service, which greatly increased its processing capacity. Then, when cutting traffic over, the CPU usage of that one downstream node shot up and settled at a much higher level than the others, exactly as expected. Once the whole service smoothed out, the migration was complete; by then the alarms were gone and the timeout rate had dropped to zero. The next step was to modify the calling service's code to add the load-balancing mechanism and, after the various approvals, release a new version. After the release, CPU usage across the downstream nodes was nearly even, and the problem was finally solved. ...
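For reference, the random selection amounts to something like the following; the generic node type and the surrounding wiring are simplified placeholders rather than the framework's real classes:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomLoadBalancer {
    /**
     * Picks a random node from the registry's list of available nodes,
     * instead of always taking the first entry as the old code did.
     */
    public <T> T select(List<T> nodes) {
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalStateException("no available downstream nodes");
        }
        return nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
    }
}
```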

February 14, 2023

Weekly Technical Report for January 1, 2023

Moving into 2023, this is going to be a tough year, and it brings several challenges. One is migrating all the data previously deployed on physical servers to the cloud. Another is accelerating the growth of several new team members so they can take on the services behind the current main business as soon as possible, and eventually resolve user issues and optimize those services independently; that would let me hand over some of the work and focus on the important goals that are expected to take most of this year. And on a personal level, my technical and other learning has reached a stage that will determine my direction for the next 7-8 years. This week my main focus was the design and specification of the logging framework and log tracing for several services. The first problem is log tracing: to follow the logs generated across services along a single call chain, the TraceId has to be unified. However, the TraceId types of these services are not uniform today; some use a Long, some use a string, and their languages and technology stacks differ as well. Directly adopting the TraceId of a standard distributed tracing framework probably cannot cover all of the services as they stand, only the ones with relatively new stacks. So, for compatibility and ease of adoption, we are going to use a custom-generated Long value as the TraceId and limit it to 16 digits. The first four digits begin with 99, which marks it as a unified TraceId, and the remaining two of those four identify the service; the next four digits are the current microsecond component; and the last eight digits are two groups of four random digits spliced together. This TraceId does not guarantee uniqueness, but it is sufficient for the current situation. For a Java service, generation needs to take thread contention into account, so it is best to give each thread its own random number generator, or simply use a ThreadLocal. A Java service extracts the TraceId from the request it is handling: if it starts with 99, no new TraceId is generated; if not, a TraceId is generated as described above and stored in the MDC. When execution goes asynchronous, the MDC contents must be copied to the worker thread, otherwise the trace information is lost. When a downstream service needs to be invoked, the stored TraceId is passed along with the call. Finally, when the request has been processed, the MDC has to be cleared to avoid polluting the trace information of the next request. ...
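A minimal sketch of the scheme described above; the helper names are mine, and only the 16-digit layout ("99" + service id + microseconds + random digits) and the MDC handling follow the design:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.slf4j.MDC;

public final class TraceIds {
    private static final String TRACE_KEY = "traceId";
    private TraceIds() {}

    /** 16 digits: "99" + 2-digit service id + 4 digits of microseconds + 8 random digits. */
    public static long generate(int serviceId) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();                 // avoids thread contention
        long micros = (System.nanoTime() / 1_000) % 10_000;                  // 4-digit microsecond component
        long random = rnd.nextLong(10_000) * 10_000 + rnd.nextLong(10_000);  // two groups of 4 random digits
        return ((99L * 100 + serviceId) * 10_000 + micros) * 100_000_000L + random;
    }

    /** Reuse an incoming "99..." TraceId, otherwise mint a new one, and put it into the MDC. */
    public static long ensure(String incoming, int serviceId) {
        long traceId = (incoming != null && incoming.length() == 16 && incoming.startsWith("99"))
                ? Long.parseLong(incoming)
                : generate(serviceId);
        MDC.put(TRACE_KEY, Long.toString(traceId));
        return traceId;
    }

    /** Call when the request is finished so the next request is not polluted. */
    public static void clear() {
        MDC.clear();
    }
}
```

For the asynchronous case, the usual pattern is MDC.getCopyOfContextMap() on the calling thread and MDC.setContextMap(...) on the worker thread, so the trace context survives the hop.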

January 20, 2023

Technical Review for the fourth week of December 2022

This week I caught COVID-19 and was home for a total of nine days. During that time, the most important piece of work was assessing the impact of the promotion and launch of a mini program on the basic service system I am responsible for. The mini program hit a real need for people in China at the time, and it was expected to bring a large influx of traffic that might affect the core services of the system. They had already launched a high-traffic feature earlier, so I had evaluated and expanded capacity once. This time, however, after they pushed notifications to hundreds of millions of users, a large number of timeouts occurred. At 8:00 am, while lying in bed recuperating, I was called up and told that the login interface was timing out massively. I immediately took out my laptop, connected to the intranet, and saw that traffic on that interface had grown more than 20-fold. I broke into a sweat. My first suspicion was that there were too few business containers to handle that much traffic, but going through the logs showed that was not the problem. The problem actually had two parts: one was an internal interface whose rate limit had been exceeded by the current traffic; the other was that the database behind one of the caches had been driven to saturation. The rate limit was relatively easy to deal with: an emergency call to the responsible colleagues got the threshold raised. The cache database was worse: the influx of a large number of old users forced reads through to the backing database to load old data into the cache, and as traffic kept climbing, the backing database's I/O was eventually maxed out. Afterwards, several core cache databases were expanded once more, increasing cache capacity and node counts to cope with even larger traffic and concurrency later on. On the routine maintenance side, the main task was adjusting the k8s scheduling policy for a gateway service's Pods so that they are scheduled onto compute nodes backed directly by CVM virtualization (one-tier scheduling). Originally, this service was scheduled onto compute nodes virtualized from physical servers our company had purchased. On such a node, an execution space is virtualized on the CVM (think of Docker), and Pods are then scheduled into it to run (two-tier scheduling). A single compute node may carry several or even a dozen Pods, all running on the same operating system but in different execution spaces. The isolation of CPU, memory, network card, and other resources is poor, so the Pods often interfere with each other. When that CVM compute node needs a reboot or an upgrade, every Pod on it is evicted, and any problem in the node's operating system affects all of its Pods. None of this is suitable for a gateway, a service with demanding reliability and latency requirements. With one-tier scheduling, by contrast, the cloud allocates the compute resources and virtualizes a CVM to run the workload on its own, with CVM resources drawn from the entire cloud resource pool. This approach also works with k8s, because the new scheduling mode is exposed to k8s as a kind of "super node".
k8s can schedule Pods directly onto the super node, so each Pod runs on one-tier scheduling, that is, directly on a CVM of its own. Isolation is much stronger and Pods no longer affect each other. Since each CVM runs only one Pod, when a single Pod has a problem it can be fixed on its own without affecting the other Pods. After I modified the scheduling policy and moved all of the service's Pods onto the super node, the whole service has run much more stably, and the situation where both the timeout rate and CPU utilization spike during peak hours rarely occurs any more. ...

January 19, 2023

Technical Review for the third week of December 2022

This week the main work was optimizing a certain Java service. The service has long had the problem that its CPU usage cannot be pushed up. The first thing to consider was whether it has too few worker threads. Later I found that the real issue is not that CPU usage cannot be raised, but that raising it leads to more timeouts. There was feedback long ago that this framework's performance is insufficient and that it is not recommended for continued use, so my feeling is that the problem lies in the framework rather than the business code. After reading through the framework code: it uses Netty as the NIO server framework and dispatches business processing tasks to worker threads, which then execute the business logic. A previous problem with this framework was that the number of worker threads was too low to cope with IO-heavy situations, but this time it is not the same problem; the logs show the worker thread count is sufficient. Could it be a client-side issue? Within the overall microservice architecture this service also acts as a client calling other services' interfaces, so that is a possible entry point. Reading the code, I found that the framework generates a proxy class, ObjectProxy, by implementing Java's InvocationHandler interface, and this proxy takes over RPC calls to other services. When business code initiates an RPC call to another service's interface, ObjectProxy works with the ProtocolInvoker to obtain the target service's list of valid nodes (the list is refreshed every 30s), passes the list to the LoadBalancer to pick the target node for this call, and then invokes that node through the protocol-specific Invoker class. The Invoker manages the long-lived connections to the target service and, at call time, selects one connection to send the request and receive the response; the exact request path depends on whether the call is synchronous or asynchronous. Each target service the client needs to invoke consists of multiple nodes. For each node the framework creates two I/O threads by default for network I/O (NIO mode), and it creates a number of TCP connections per node equal to the number of processors. Each I/O thread holds a selector that polls events on its connections. A TCPSession is created for each TCP connection, and every time a request is sent a Ticket is created to track the request and its response. A synchronous request blocks after sending until the response arrives; an asynchronous request stores a callback in the Ticket, and when the response arrives, the TicketNumber (the Ticket's unique index) is used to locate the Ticket and invoke the pre-populated callback for further processing. For NIO the framework relies on the NIO library that Java provides; the TCP connections mentioned above are in fact the NIO library's SocketChannel.
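To make the proxy mechanism concrete, here is a stripped-down sketch of an InvocationHandler-based RPC proxy of this kind; the NodeDirectory/LoadBalancer/Invoker interfaces are illustrative stand-ins for the framework's ProtocolInvoker, LoadBalancer and Invoker roles, not its actual source:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

// Illustrative stand-ins for the roles described above.
interface NodeDirectory { List<String> validNodes(String service); }   // list refreshed periodically
interface LoadBalancer  { String pick(List<String> nodes); }
interface Invoker       { Object call(String node, Method method, Object[] args) throws Exception; }

public class ObjectProxySketch implements InvocationHandler {
    private final String service;
    private final NodeDirectory directory;
    private final LoadBalancer loadBalancer;
    private final Invoker invoker;

    public ObjectProxySketch(String service, NodeDirectory d, LoadBalancer lb, Invoker inv) {
        this.service = service; this.directory = d; this.loadBalancer = lb; this.invoker = inv;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        List<String> nodes = directory.validNodes(service);   // current valid node list
        String target = loadBalancer.pick(nodes);              // choose one node for this call
        return invoker.call(target, method, args);             // send over a pooled connection
    }

    @SuppressWarnings("unchecked")
    public static <T> T create(Class<T> api, String service, NodeDirectory d, LoadBalancer lb, Invoker inv) {
        return (T) Proxy.newProxyInstance(api.getClassLoader(), new Class<?>[] {api},
                new ObjectProxySketch(service, d, lb, inv));
    }
}
```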
As for how the framework splits packets: it keeps a Buffer holding the data read so far, and the RPC protocol we commonly use carries the packet's byte count in the packet header, so comparing the bytes accumulated in the Buffer with that count tells you whether a full packet has arrived. If the packet has not been fully read, the framework keeps waiting for the rest of the data; once it has, it slices off the number of bytes given in the header and hands that data over for processing (see the sketch below). At that point the framework takes an additional worker thread from a thread pool to do the subsequent processing of the Ticket. From the existing code, this worker pool is used only for that purpose; its default thread count equals the number of cores and its maximum is twice the number of cores. By the Java NIO library's definitions, a Channel declares several I/O operations, and the Selector polls to check whether those operations are ready; when one is, it returns a SelectionKey carrying the parameters needed to operate on the ready channel correctly. Having read this far, there is no obvious problem on the client side; the approach used is basically mature and stable. The problem may well be on the server side, which I will have to get familiar with and comb through next. ...
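The packet-splitting logic described above boils down to roughly the following; this sketch assumes a 4-byte big-endian length header that counts only the body, which may differ from the real protocol:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class LengthFieldFrameSplitter {
    private static final int HEADER_BYTES = 4;   // assumed: 4-byte big-endian body length

    // Accumulates whatever has been read from the channel so far.
    private ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

    /** Append newly read bytes and return every complete frame body found so far. */
    public List<byte[]> feed(byte[] chunk) {
        ensureCapacity(chunk.length);
        buffer.put(chunk);
        buffer.flip();                                    // switch to reading mode

        List<byte[]> frames = new ArrayList<>();
        while (buffer.remaining() >= HEADER_BYTES) {
            buffer.mark();
            int bodyLength = buffer.getInt();
            if (buffer.remaining() < bodyLength) {        // not fully read yet: wait for more data
                buffer.reset();
                break;
            }
            byte[] body = new byte[bodyLength];           // full frame available: slice it off
            buffer.get(body);
            frames.add(body);
        }
        buffer.compact();                                  // keep any trailing partial frame
        return frames;
    }

    private void ensureCapacity(int extra) {
        if (buffer.remaining() < extra) {
            ByteBuffer bigger = ByteBuffer.allocate((buffer.position() + extra) * 2);
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
    }
}
```

Netty's LengthFieldBasedFrameDecoder implements the same idea out of the box.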

January 19, 2023

Weekly Technical Report for the second week of December 2022

This week's work was mainly a sorting-out of the areas I am responsible for, and quite a few problems have surfaced so far. They mostly revolve around data on the cloud: how to migrate to the cloud safely, how to evolve the current single-region deployment scheme, and how to fix inconsistencies between the off-cloud and on-cloud data. In addition, some services are still using off-cloud databases that really ought to be retired. But these are old services, and code changes carry risk, so investigation has to come before action. The investigation covers the basic principles of the existing data-migration auxiliary services and the details of the related code logic; it is best to uncover problems as early as possible and fix them promptly. On top of that, the migration process needs real-time monitoring: as comprehensive a grasp as possible of interface call quality, timeout rate, write failure rate, inconsistency rate, and so on, best obtained from both report monitoring and log monitoring. For the multi-region deployment scheme, the current intention is to start from these principles: one master with multiple replicas, one-way master-to-replica replication, writes go only to the master, reads go only to the replicas. The main purpose of deploying in multiple regions is to improve service stability, reduce latency for most requests, improve service quality, and remove the impact of unstable cross-region links. The replication delay of a multi-region deployment cannot be ignored; there has to be an acceptable bound on it, understood both from theory and from monitoring. Separately, it really pays to master a scripting language, especially when there is a pile of repetitive work to process or some data to analyze before reaching a conclusion; being reasonably fluent in something like Python is a big advantage. That said, I do not think it would be wise to use Python to write a large program. Every programming language is like a different knife: all of them can cut vegetables, but some are better suited to cutting meat or bones. ...

December 13, 2022

Weekly Technical Report for the first week of December 2022

This week's work, in summary, was putting a core service on the cloud and then progressively turning the off-cloud nodes into traffic-forwarding nodes. The first step of going to the cloud is deploying the service in the cloud environment: migrate the configuration files and environment, build an image for the cloud environment from the stable version of the code, and get the service running there. Once it is running and tested, the on-cloud nodes still carry no traffic, so the next step is to forward part of the off-cloud traffic to the cloud. To do that, some off-cloud nodes are replaced with forwarding nodes, whose job is to relay traffic from the callers to the on-cloud nodes. This slice of traffic can then be used to observe how the on-cloud nodes behave and to check for anomalies, which can be called "traffic grayscale". The grayscale is generally kept at 1%-5% of overall traffic, adjusted according to how critical the service is; in the test environment it can be more aggressive, say 25%-50%. The point of the grayscale is that when a problem occurs, or a problem ticket comes in, you only need to shut down the off-cloud forwarding nodes. Naturally, all of the above has to be exercised in the test environment first, and only after it checks out should the same procedure be carefully applied to production. Some backend systems are intricate, with one call spanning multiple services and confusing business logic; in that case try to control the variables, make only one change at a time, observe for a while, and move on to the next operation only after things have stabilized. Alternatively, split the big step into several small ones, do one small step at a time, observe, then push on to the next; this is less error-prone.

Next, if the grayscale verification passes (it usually lasts a week), the step after that is switching routes. Switching routes means pointing the service's routes directly at the on-cloud nodes; after that, the other on-cloud caller services access this service's on-cloud nodes directly and no longer touch the off-cloud ones. Special care is needed while switching routes, because a large amount of online traffic will hit the on-cloud nodes directly. Before the operation, the number of nodes needed in each region on the cloud has to be calculated from historical data, generally sized for peak traffic, and capacity expanded if it falls short, so that insufficient capacity does not cause a flood of online timeouts. At this stage it is better to have too many on-cloud nodes than too few: extra nodes can be scaled back gradually later at little cost, whereas too few nodes means a pile of timeouts, and expansion takes time (mostly resource scheduling and service startup), which is likely to trigger user complaints. If the timeouts set off a wave of client retries, they may bring the whole service down and cause an online incident. At the time, I picked a low-traffic window so as to minimize the impact of any jitter from the switch.
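As a back-of-the-envelope illustration of the capacity calculation mentioned above; the traffic numbers and the headroom factor are assumptions, not the real service's figures:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        double peakQps = 60_000;        // assumed peak traffic for one region
        double perNodeQps = 2_500;      // assumed sustainable throughput of a single node
        double headroom = 1.5;          // keep ~50% spare so a surge or a lost node does not cause timeouts

        int nodes = (int) Math.ceil(peakQps * headroom / perNodeQps);
        System.out.println("nodes needed: " + nodes);   // 36 in this example
    }
}
```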
After the route switch, the off-cloud nodes can gradually be converted into forwarding nodes, sending all remaining off-cloud traffic to the cloud for processing. Then the owners of the calling services are notified and encouraged to move their services to the cloud as well, because forwarding from off-cloud nodes to the cloud has its own overhead and adds latency. ...

December 9, 2022