A poorly designed degradation mechanism brought an online system down instantly
2022-07-18 14:40:00 【Notes on Shi Shan's architecture】
Background
Here is what happened: during a traffic peak, the MQ middleware behind an online system failed, which triggered the system's degradation mechanism. After the degradation mechanism had been running for a while, the system suddenly froze completely and could not respond to any requests.
First, a brief look at the overall architecture. The system has one core behavior: writing data to MQ. That data is critical and must never be lost.
So a degradation mechanism was designed from the start: if the MQ middleware fails, the system immediately switches to writing the core data to local disk files.
But under high concurrency at peak time, synchronously writing each record to a local disk file as soon as it arrives performs extremely poorly. System throughput would drop sharply, and such a degradation mechanism could never run in production because it would be overwhelmed by high-concurrency requests.
So the degradation mechanism was designed with some care.
The core idea: once the MQ middleware fails and the degradation mechanism is triggered, the system does not write each request to local disk immediately. Instead it uses in-memory double buffering plus batched disk flushing.
Put simply, the system writes each message to an in-memory buffer as soon as it is received, and a background thread flushes the buffered data to disk.
The figure below shows the whole process.
The memory buffer is divided into two areas.
One is the current area, which the system writes incoming data to. The other is the ready area, which the background thread flushes to disk.
Each area is 512 KB. When the system receives a request it writes to the current buffer, but since the current buffer has only 512 KB of space, it will eventually fill up.
Again, see the figure below.
Once the current buffer is full, the current and ready buffers are swapped. After the swap, the ready buffer holds the 512 KB of data that was just written.
The current buffer is now empty, so the system can keep writing new data to it.
The whole process is shown in the following figure:
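The swap described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the system's actual code: the class name `DoubleBuffer`, the methods `tryWrite`, `pollReady` and `markFlushed`, and the choice to return `false` when both buffers are occupied are all hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the current/ready double buffer; not the original system's code. */
final class DoubleBuffer {
    private ByteBuffer current;           // worker threads append here
    private ByteBuffer ready;             // the background thread flushes this one
    private boolean readyPending = false; // true while ready holds unflushed data

    DoubleBuffer(int capacityBytes) {
        current = ByteBuffer.allocate(capacityBytes);
        ready = ByteBuffer.allocate(capacityBytes);
    }

    /** Appends a record; swaps buffers first if current cannot hold it. */
    synchronized boolean tryWrite(byte[] record) {
        if (record.length > current.capacity()) {
            throw new IllegalArgumentException("record larger than buffer");
        }
        if (current.remaining() < record.length) {
            if (readyPending) {
                return false;             // both buffers occupied, caller must decide what to do
            }
            swap();
        }
        current.put(record);
        return true;
    }

    private void swap() {
        ByteBuffer full = current;
        current = ready;
        ready = full;
        current.clear();                  // empty again, accepts new writes
        ready.flip();                     // switch to read mode for the flusher
        readyPending = true;
    }

    /** Returns the ready buffer if it holds unflushed data, otherwise null. */
    synchronized ByteBuffer pollReady() {
        return readyPending ? ready : null;
    }

    /** Called by the flusher once the ready data is safely on disk. */
    synchronized void markFlushed() {
        ready.clear();
        readyPending = false;
    }
}
```

While `readyPending` is true, only the flusher touches the ready buffer, so it can read it outside the lock. What the caller does when `tryWrite` returns `false` is exactly the design decision the rest of this article is about.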
At this point the background thread can use the Java NIO API to write the data in the ready buffer to a local disk file as a high-performance append.
The background thread also has supporting mechanisms of its own. For example, each disk file has a fixed maximum size, and once a file reaches that size a new one is opened automatically for subsequent writes.
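As an illustration of the append-and-roll behavior just described, here is a small sketch built on `java.nio.channels.FileChannel`. The class name `RollingFileAppender`, the `degrade-N.log` file naming, and the decision to call `force(false)` after every batch are assumptions for the example; the article does not describe these details.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: appends batches to a file and rolls to a new file at a size limit. */
final class RollingFileAppender implements AutoCloseable {
    private final Path dir;
    private final long maxFileBytes;
    private int fileIndex = 0;
    private FileChannel channel;

    RollingFileAppender(Path dir, long maxFileBytes) throws IOException {
        this.dir = dir;
        this.maxFileBytes = maxFileBytes;
        Files.createDirectories(dir);
        this.channel = open(fileIndex);
    }

    private FileChannel open(int index) throws IOException {
        Path file = dir.resolve("degrade-" + index + ".log"); // assumed naming scheme
        return FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Appends one batch (for example the contents of the ready buffer). */
    void append(ByteBuffer batch) throws IOException {
        // Roll over if this batch would push a non-empty file past the limit.
        if (channel.size() > 0 && channel.size() + batch.remaining() > maxFileBytes) {
            channel.close();
            channel = open(++fileIndex);
        }
        while (batch.hasRemaining()) {
            channel.write(batch);         // a single write call may be partial
        }
        channel.force(false);             // ask the OS to persist the file content
    }

    int currentFileIndex() {
        return fileIndex;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

The flusher thread would pass the ready buffer to `append` and only then mark the buffer as flushed, so the time spent in `append` is the time during which the ready buffer is unavailable.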
The hidden danger
With this mechanism, the system should be able to withstand high-concurrency requests even at peak time. Everything looked fine.
However, the approach we took while developing the degradation mechanism planted a hidden danger.
The approach was: if the current buffer is full, all worker threads enter a while loop and wait indefinitely.
Wait for what? Until the data in the ready buffer has been flushed to the disk file and the ready buffer has been emptied, so that it can be swapped with the current buffer.
Then the current buffer is empty again and the worker threads can continue writing.
But consider what can go wrong.
Flushing the ready buffer to a disk file takes the background thread some time.
What if the current buffer fills up while that flush is still in progress?
Then no worker thread can write to the current buffer, and all of them are stuck.
The figure below illustrates the problem.
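The stall can be reproduced with a small model. The sketch below is an assumption-laden illustration, not the original code: buffer contents are reduced to a byte counter, the unbounded wait is modeled with `wait()`/`notifyAll()` rather than whatever loop the real system used, disk I/O is replaced by `Thread.sleep`, and all names (`StallingDoubleBuffer`, `simulate`) are hypothetical.

```java
/** Hypothetical model of the flawed design: workers wait without bound for the flusher. */
final class StallingDoubleBuffer {
    private final int capacity;
    private int currentUsed = 0;          // bytes written to the current buffer
    private boolean readyPending = false; // true while the ready buffer is being flushed
    private long stalledNanos = 0;        // total time workers spent inside the wait loop

    StallingDoubleBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Worker path: waits as long as both buffers are occupied. */
    synchronized void write(int bytes) throws InterruptedException {
        if (bytes > capacity) throw new IllegalArgumentException("record larger than buffer");
        long start = System.nanoTime();
        while (currentUsed + bytes > capacity) {
            if (!readyPending) {
                readyPending = true;      // swap: the full buffer becomes ready
                currentUsed = 0;
                notifyAll();              // wake the flusher
            } else {
                wait();                   // no timeout: stuck until the flush finishes
            }
        }
        stalledNanos += System.nanoTime() - start;
        currentUsed += bytes;
    }

    synchronized void awaitReady() throws InterruptedException {
        while (!readyPending) wait();
    }

    synchronized void flushDone() {
        readyPending = false;
        notifyAll();
    }

    synchronized long stalledMillis() {
        return stalledNanos / 1_000_000;
    }

    /** Fills the buffer `writes` times back to back; returns total worker stall in ms. */
    static long simulate(int capacity, int writes, long flushMillis) throws InterruptedException {
        StallingDoubleBuffer buf = new StallingDoubleBuffer(capacity);
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    buf.awaitReady();
                    Thread.sleep(flushMillis); // stands in for the disk write
                    buf.flushDone();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        flusher.setDaemon(true);
        flusher.start();
        for (int i = 0; i < writes; i++) buf.write(capacity);
        flusher.interrupt();
        return buf.stalledMillis();
    }
}
```

With two back-to-back fills the worker never waits, because the first swap finds the ready buffer free. With three, the third fill arrives while the flush is still running and the worker is stalled for roughly the full flush time. In the real system every worker thread would be stalled in the same place.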
This is the fundamental flaw in the double-buffer design of the degradation mechanism. After it was developed, it was tested only under normal request load, where two 512 KB buffers worked well and no problem showed up.
Peak traffic exposes the problem
The problem surfaced at a peak. During one peak, request pressure reached more than 10 times the normal level.
In the normal flow this is fine: at peak time, write requests go directly to the MQ middleware cluster, so a 10x increase does not matter because the MQ cluster is built to handle high concurrency.
Unfortunately, during that peak the MQ middleware cluster suffered a temporary failure, something that happens at most a few times a year.
The system therefore triggered the degradation mechanism and began writing data into the in-memory double buffer.
Remember that this was peak time, with 10 times the normal request volume. That load exposed the problem immediately.
The burst of high-concurrency requests filled the current buffer, the two buffers were swapped, and the background thread began flushing the ready buffer to the disk file.
But requests arrived so fast that the current buffer filled up again before the data in the ready buffer had been flushed to disk.
At that point the online system began to misbehave.
The typical symptom: all threads of the instances deployed on every machine were stuck in the wait state.
Locating the problem and fixing it
As a result, the system could not respond to any requests from the start of the peak. The problem was resolved only after emergency troubleshooting, diagnosis, and a hotfix.
Finding the cause was straightforward. We took a JVM dump snapshot, checked where the system's threads were stuck, and found a large number of threads waiting for the current buffer.
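The article does not say which tool produced the snapshot; `jstack <pid>` from the JDK is a common choice. As a rough in-process equivalent, the sketch below uses the standard `java.lang.management.ThreadMXBean` API to count threads by state and to group waiting threads by their first application-level stack frame. The class name `ThreadStateSummary` and the package-prefix filter are assumptions for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that summarizes where live threads are stuck. */
final class ThreadStateSummary {

    /** Counts live threads by state, for example WAITING or RUNNABLE. */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Groups waiting or blocked threads by their first frame outside the JDK packages. */
    static Map<String, Integer> waitingHotspots() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> hotspots = new TreeMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) continue;
            Thread.State state = info.getThreadState();
            if (state != Thread.State.WAITING
                    && state != Thread.State.TIMED_WAITING
                    && state != Thread.State.BLOCKED) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                String cls = frame.getClassName();
                if (cls.startsWith("java.") || cls.startsWith("jdk.") || cls.startsWith("sun.")) {
                    continue;             // skip JDK frames such as Object.wait
                }
                hotspots.merge(cls + "." + frame.getMethodName(), 1, Integer::sum);
                break;
            }
        }
        return hotspots;
    }
}
```

In an incident like this one, a summary of this kind would be expected to show most worker threads waiting with the buffer-write method as the top application frame, which matches what the dump revealed.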
With the cause clear, the fix was to enlarge the double buffer on the online system, from 512 KB to 10 MB per buffer.
That lets the double-buffer mechanism run smoothly even at online peak time, without a burst of requests filling both buffers.
With a larger buffer, the current buffer does not fill up as quickly while the ready buffer is being flushed to the disk file.
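One way to reason about the buffer size is a back-of-the-envelope model. The sketch below assumes that a flush costs a fixed overhead plus time proportional to the buffer size, and that requests arrive at a steady rate. Both the model and every number used with it are hypothetical; the article gives no measured rates.

```java
/** Hypothetical sizing model; the formula and all rates are assumptions, not measurements. */
final class BufferSizing {

    /**
     * Milliseconds workers stall per swap cycle under the assumed model:
     * fill time  = bufferBytes / arrivalBytesPerMs
     * flush time = flushOverheadMs + bufferBytes / diskBytesPerMs
     * Workers stall when the current buffer fills before the flush finishes.
     */
    static double stallMillisPerCycle(long bufferBytes,
                                      double arrivalBytesPerMs,
                                      double diskBytesPerMs,
                                      double flushOverheadMs) {
        double fillMs = bufferBytes / arrivalBytesPerMs;
        double flushMs = flushOverheadMs + bufferBytes / diskBytesPerMs;
        return Math.max(0.0, flushMs - fillMs);
    }

    public static void main(String[] args) {
        double kb = 1024.0;
        double disk = 200 * kb;   // assumed disk append rate per ms
        double overhead = 20.0;   // assumed fixed cost per flush in ms
        long small = 512 * 1024L;
        long large = 10 * 1024 * 1024L;
        System.out.println("512 KB, normal load: " + stallMillisPerCycle(small, 5 * kb, disk, overhead));
        System.out.println("512 KB, 10x load:    " + stallMillisPerCycle(small, 50 * kb, disk, overhead));
        System.out.println("10 MB,  10x load:    " + stallMillisPerCycle(large, 50 * kb, disk, overhead));
    }
}
```

Under these made-up numbers the 512 KB buffer is fine at normal load, stalls at 10x load, and the 10 MB buffer removes the stall, because the fixed per-flush cost is spread over much more data. The same model also shows a limit: if the arrival rate stays above the disk rate, no buffer size prevents the stall, so a larger buffer only helps when the disk can keep up on average.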
The lesson from this incident: any reasonably complex mechanism in system design and development must be load-tested against the maximum traffic seen at online peak. Only then can you be confident that the mechanism will hold up under real peak traffic.