
A poorly designed degradation mechanism brought down our online system in an instant

2022-07-18 14:40:00 【Notes on Shi Shan's architecture】

V-xin：ruyuanhadeng get 600+ Page original boutique article summary PDF

Background

Here is the background: during a traffic peak, the MQ middleware behind one of our online systems failed and triggered the degradation mechanism. After running for a short while, the degradation mechanism caused the whole system to hang, and it could no longer respond to any requests.

First, a brief look at the overall architecture. The system has one core behavior: it writes data to MQ. The data it writes is critical and must never be lost.

So a degradation mechanism was designed from the start: if the MQ middleware fails, the system immediately switches to writing this core data to local disk files.

But under high concurrency at peak time, synchronously writing to a local disk file for every message received performs extremely poorly. System throughput would drop dramatically and the system would be overwhelmed by concurrent requests, so a degradation mechanism like that could never run in production.

So the degradation mechanism was designed with care.

The core idea: once the MQ middleware fails and degradation is triggered, the system does not write each request to local disk immediately. Instead, it uses in-memory double buffering plus batched flushing to disk.

Put simply, the system writes each message into a memory buffer as soon as it arrives, and a background thread flushes the buffered data to disk.

The figure below shows the whole process.
[Figure]

The memory buffer is actually split into two areas.

One is the current area, which the system writes data into. The other is the ready area, which the background thread flushes to disk.

Each area is 512 KB. When the system receives a request it writes to the current buffer, but since the current buffer has only 512 KB of memory, it will eventually fill up.

Again, the figure below illustrates this.
[Figure]

Once the current buffer is full, the current and ready buffers are swapped. After the swap, the ready buffer holds the 512 KB of data that was just filled.

The current buffer is now empty, so the system can continue writing new data into it.

The figure below shows the whole process:
[Figure]
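The swap described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the class name `DoubleBuffer` and the methods `tryWrite` and `drainReady` are hypothetical, and the sketch swaps when the next record no longer fits rather than at an exact 512 KB boundary.

```java
import java.nio.ByteBuffer;

/** Hypothetical double buffer: writers fill `current`, a flusher drains `ready`. */
final class DoubleBuffer {
    private ByteBuffer current;
    private ByteBuffer ready;
    private boolean readyPending = false; // true while `ready` holds unflushed data

    DoubleBuffer(int capacityBytes) {
        current = ByteBuffer.allocate(capacityBytes);
        ready = ByteBuffer.allocate(capacityBytes);
    }

    /**
     * Appends a record to `current`, swapping the buffers first if it does not fit.
     * Returns false when `current` is full and `ready` has not been flushed yet.
     */
    synchronized boolean tryWrite(byte[] record) {
        if (record.length > current.capacity()) {
            throw new IllegalArgumentException("record larger than buffer");
        }
        if (current.remaining() < record.length) {
            if (readyPending) {
                return false; // both buffers are occupied
            }
            swap();
        }
        current.put(record);
        return true;
    }

    /** Exchanges the two buffers: the full one becomes `ready`, the empty one `current`. */
    private void swap() {
        ByteBuffer full = current;
        current = ready;
        current.clear();  // reuse the already-flushed buffer for new writes
        ready = full;
        ready.flip();     // prepare the filled buffer for reading
        readyPending = true;
    }

    /** Called by the flusher: returns the pending batch, or null if there is none. */
    synchronized byte[] drainReady() {
        if (!readyPending) {
            return null;
        }
        byte[] batch = new byte[ready.remaining()];
        ready.get(batch);
        readyPending = false;
        return batch;
    }
}
```

A background thread would call `drainReady()` in a loop and write each returned batch to disk. What a writer should do when `tryWrite` returns false is exactly the design decision the rest of the article turns on.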

At this point, the background thread can use the Java NIO API to append the data in the ready buffer directly to a local disk file with high performance.

Of course, the background thread has a complete set of supporting mechanisms. For example, each disk file has a fixed maximum size; once a file reaches that size, the thread automatically opens a new file and continues writing.
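As a rough sketch of the append-and-roll behavior just described, the following uses `FileChannel` opened in append mode. The class name `RollingFileAppender`, the `degraded-N.log` file naming, and the choice to open the channel and call `force` on every batch are all assumptions made for illustration; a real flusher would more likely keep the channel open between batches.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical appender: writes batches to a file and rolls over at a size limit. */
final class RollingFileAppender {
    private final Path dir;
    private final long maxFileBytes;
    private int fileIndex = 0;

    RollingFileAppender(Path dir, long maxFileBytes) {
        this.dir = dir;
        this.maxFileBytes = maxFileBytes;
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one batch and returns the file it was written to. */
    Path append(byte[] batch) {
        try {
            Path file = currentFile();
            // Roll to a new file if this batch would push the file past the limit.
            if (Files.exists(file) && Files.size(file) + batch.length > maxFileBytes) {
                fileIndex++;
                file = currentFile();
            }
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                ByteBuffer buffer = ByteBuffer.wrap(batch);
                while (buffer.hasRemaining()) {
                    channel.write(buffer); // a single write may be partial
                }
                channel.force(false);      // push file content to the device
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path currentFile() {
        return dir.resolve("degraded-" + fileIndex + ".log");
    }
}
```

Writing one large batch per call, rather than one small write per message, is what lets the disk path keep up where per-request synchronous writes could not.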

The hidden danger

Good! With the mechanism above, the system can withstand high-concurrency requests even at peak time. Everything looks great!

However, the approach we took while developing the degradation mechanism planted a hidden danger.

The approach was this: once the current buffer is full, all threads enter a while loop and wait indefinitely.

Wait until when? Until the data in the ready buffer has been flushed to the disk file and the ready buffer has been emptied, so that it can be swapped with the current buffer.

Then the current buffer is empty again, and the worker threads can continue writing data.

But have you considered what might go wrong?

The background thread needs some time to flush the ready buffer to the disk file.

What if the current buffer suddenly fills up while that flush is still in progress?

At that point, no worker thread can write to the current buffer, and all of them are stuck.

The figure below illustrates the problem.
[Figure]
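The writer-side wait can be modelled with a small sketch. It tracks byte counts only, and the names `BlockingDoubleBuffer`, `write`, and `flushCompleted` are hypothetical. One deliberate difference from the design described above: the original waited with no limit, whereas this sketch takes a `maxWaitMillis` argument purely so the stall can be observed instead of hanging the caller.

```java
/** Hypothetical model of the writer-side wait, using byte counters instead of real buffers. */
final class BlockingDoubleBuffer {
    private final int capacity;
    private int currentUsed = 0;
    private boolean readyPending = false; // true while the ready buffer is being flushed

    BlockingDoubleBuffer(int capacity) {
        this.capacity = capacity;
    }

    /**
     * Writes `size` bytes into the current buffer. If current is full and ready is
     * still unflushed, the caller waits. Returns false if the wait times out.
     */
    synchronized boolean write(int size, long maxWaitMillis) {
        if (size > capacity) {
            throw new IllegalArgumentException("record larger than buffer");
        }
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (currentUsed + size > capacity) {
            if (!readyPending) {
                // Ready is empty: swap, so current starts empty again.
                readyPending = true;
                currentUsed = 0;
                continue;
            }
            // Current is full AND ready is still being flushed: this is the stall.
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                wait(Math.max(1L, remainingNanos / 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        currentUsed += size;
        return true;
    }

    /** Called by the flusher thread once the ready buffer is on disk. */
    synchronized void flushCompleted() {
        readyPending = false;
        notifyAll();
    }
}
```

With the timeout removed, every worker thread that reaches the wait stays parked until `flushCompleted()` runs, which is the situation the article goes on to describe.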

This is the fundamental problem with the double-buffer design of the degradation mechanism. After developing it, we tested it under normal request pressure and found that with both buffers set to 512 KB it worked well, with no problems.

Peak traffic exposes the problem

But the problem appeared at a peak. During one peak, the request pressure on the system reached more than 10 times the normal level.

In the normal flow, peak-time write requests go straight to the MQ middleware cluster, so even a 10x increase in traffic does not matter: the MQ cluster is built to handle high concurrency.

Unfortunately, during that peak the MQ middleware cluster suffered a temporary failure, something that happens only a few times a year at most.

This caused the system to trigger the degradation mechanism and start writing data into the in-memory double buffer.

Remember, this was peak time, with request volume at 10 times normal. That 10x pressure immediately caused a problem.

The sudden influx of high-concurrency requests filled the current buffer. The two buffers were swapped, and the background thread began flushing the data in the ready buffer to the disk file.

But requests were arriving so fast that the current buffer filled up again before the data in the ready buffer had been flushed to disk...

That is when things went wrong: the online system suddenly began to behave abnormally...

The typical symptom was that all threads of the instances deployed on every machine were stuck in a waiting state.

Locating the problem and applying the right fix

As a result, the system could not respond to any requests from the start of the peak. The problem was only resolved after emergency online troubleshooting, diagnosis, and repair.

The troubleshooting itself was simple. We took a JVM dump snapshot and analyzed it to see exactly where the system's threads were stuck, and found a large number of threads stuck waiting on the current buffer.
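The article does not say which tool produced the dump. As one possible in-process approximation, the sketch below uses the standard `Thread.getAllStackTraces()` call to count live threads by state; the class name `ThreadStateSummary` is hypothetical. A large count in `WAITING` or `TIMED_WAITING` would be the first hint of the symptom described here, after which the stack frames in a full dump show where the threads are parked.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: summarizes live JVM threads by their current state. */
final class ThreadStateSummary {
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `ThreadStateSummary.countByState()` from a diagnostic endpoint or a scheduled log line gives a cheap early signal, though it is not a substitute for inspecting the full stack traces.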

With that, the cause was obvious. The fix was to enlarge the double buffer for the online system, expanding each buffer from 512 KB to 10 MB.

With this change, the double-buffer mechanism runs smoothly even at online peak, and a sudden surge of requests no longer fills both buffers.

The larger the buffers, the longer the current buffer takes to fill, so it does not fill up before the ready buffer has been flushed to the disk file.
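A back-of-the-envelope model makes this reasoning concrete. Every number below is an illustrative assumption, not a measurement from the incident: a peak ingress of 50 MiB/s, a sequential disk write rate of 200 MiB/s, and a fixed cost of 20 ms per flush. Writers stall whenever the current buffer fills faster than the ready buffer can be flushed.

```java
/** Hypothetical sizing model: compares buffer fill time with flush time. */
final class BufferSizing {
    /** Milliseconds for writers to fill one buffer at the given ingress rate. */
    static double fillMillis(long bufferBytes, double ingressBytesPerSec) {
        return bufferBytes / ingressBytesPerSec * 1000.0;
    }

    /** Milliseconds to flush one buffer: a fixed per-flush cost plus a size-proportional cost. */
    static double flushMillis(long bufferBytes, double diskBytesPerSec, double fixedOverheadMillis) {
        return fixedOverheadMillis + bufferBytes / diskBytesPerSec * 1000.0;
    }

    /** True if the current buffer fills before the ready buffer finishes flushing. */
    static boolean writersStall(long bufferBytes, double ingressBytesPerSec,
                                double diskBytesPerSec, double fixedOverheadMillis) {
        return fillMillis(bufferBytes, ingressBytesPerSec)
                < flushMillis(bufferBytes, diskBytesPerSec, fixedOverheadMillis);
    }
}
```

Under these assumed numbers, a 512 KiB buffer fills in 10 ms at peak but takes 22.5 ms to flush, so writers stall. A 10 MiB buffer fills in 200 ms and flushes in 70 ms, so they do not. At one tenth of the peak rate, the 512 KiB buffer fills in 100 ms, which is consistent with the normal-load tests passing.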

The lesson from this online incident is that any reasonably complex mechanism we design and build into a system must be stress-tested against the maximum traffic seen at online peak. Only then can we be sure that it will withstand real peak traffic.


Original site

Copyright notice
This article was written by [Notes on Shi Shan's architecture]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/199/202207160539068603.html