Author: whooyun | Posted on: 2025-07-02 18:45
Problem Description:
The microservice dropped off Nacos. On the interface with the highest concurrency, some HTTP requests never reached the endpoint and triggered retry-related errors. RabbitMQ message processing became extremely slow (up to 8 minutes per message) and a large backlog of messages built up. Restarting the service did not resolve the issue.
On-site Handling Measures:
Increased the maximum number of messages the RabbitMQ queue was allowed to hold.
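The post does not say how the limit was raised. If it was done through the queue's "x-max-length" argument, a minimal Spring AMQP sketch would look like the following; the queue name and ceiling are assumptions, not the values actually used:

import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FailedMessageQueueConfig {

    // Hypothetical queue name and limit; "x-max-length" caps how many messages
    // RabbitMQ keeps in the queue, so raising it lets more messages accumulate
    // instead of being dropped from the head of the queue.
    @Bean
    public Queue failedMessageQueue() {
        return QueueBuilder.durable("order.failed.retry")
                .withArgument("x-max-length", 500_000)   // raised ceiling
                .build();
    }
}

Note that on a queue that already exists, redeclaring it with different arguments is rejected by RabbitMQ, so in practice this kind of limit is usually raised by applying a policy (e.g. rabbitmqctl set_policy) rather than by changing the declaration.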
Post-event Analysis:
A scheduled task was scanning the database for all failed messages from the past 7 days and pushing them back to the application server via MQ for reprocessing, specifically for order status updates and fee generation. Whenever the consumers (order status, fee generation) failed again, the messages were written back into the failed-message table in the database. This created an infinite loop between the message producer and consumers, with messages being written to the database, and therefore to disk, continuously.
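To make the loop concrete, here is a hypothetical reconstruction in Spring terms. Every class, queue, and repository name below is invented for illustration and is not the project's actual code:

import java.time.LocalDateTime;

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class FailedMessageRetryLoop {

    /** Simplified row in the failed-message table. */
    public record FailedMessage(String payload, LocalDateTime createdAt) {}

    /** Minimal view of the DAO over the failed-message table. */
    public interface FailedMessageRepository {
        java.util.List<FailedMessage> findCreatedAfter(LocalDateTime since);
        void save(FailedMessage msg);
    }

    private static final String RETRY_QUEUE = "order.retry.queue";  // assumed queue name

    private final FailedMessageRepository repository;
    private final RabbitTemplate rabbitTemplate;

    public FailedMessageRetryLoop(FailedMessageRepository repository, RabbitTemplate rabbitTemplate) {
        this.repository = repository;
        this.rabbitTemplate = rabbitTemplate;
    }

    // Step 1: the scheduled task republishes every failed message from the past 7 days.
    @Scheduled(fixedDelay = 60_000)
    public void republishFailedMessages() {
        for (FailedMessage msg : repository.findCreatedAfter(LocalDateTime.now().minusDays(7))) {
            rabbitTemplate.convertAndSend(RETRY_QUEUE, msg.payload());
        }
    }

    // Step 2: the consumer handles order-status updates and fee generation; when it
    // fails again, it writes the message straight back to the failed-message table,
    // so step 1 picks the same payload up on the next run. With no retry limit,
    // producer and consumer chase each other indefinitely, hammering the disk.
    @RabbitListener(queues = RETRY_QUEUE)
    public void handle(String payload) {
        try {
            processOrderStatusAndFees(payload);
        } catch (Exception e) {
            repository.save(new FailedMessage(payload, LocalDateTime.now()));
        }
    }

    private void processOrderStatusAndFees(String payload) {
        // assumed business handler for order status updates and fee generation
    }
}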
Monitoring and Analysis After the Incident:
At the time of the anomaly, CPU usage, memory usage, and JVM metrics remained normal — all below 50% utilization. However, disk IOPS reached around 6000, which is approximately 100 times higher than usual.
Root Cause:
The root cause lay in the business code: insufficient thought was given to which types of MQ messages should be persisted to the database. In addition, data behavior was not closely monitored after deployment.
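The post does not describe the eventual fix, but the usual safeguard against this kind of loop is to bound how many times a failed message may be re-persisted and requeued. The following is only a sketch of that general pattern; the class, field names, and limit are invented:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration, not the project's code: cap the number of times a
// failed message may be written back for retry, so the scheduler/consumer cycle terminates.
public class BoundedRetryPolicy {

    private static final int MAX_RETRIES = 5;   // assumed bound

    /** A failed message carrying how many times it has already been retried. */
    public record FailedMessage(String payload, int retryCount) {}

    private final Deque<FailedMessage> retryTable = new ArrayDeque<>();   // stands in for the failed-message table
    private final Deque<FailedMessage> parkingLot = new ArrayDeque<>();   // exhausted messages kept for manual review

    /** Called by the consumer when processing fails. */
    public void recordFailure(FailedMessage msg) {
        int attempts = msg.retryCount() + 1;
        if (attempts > MAX_RETRIES) {
            parkingLot.add(msg);   // stop the loop: do not requeue this message again
        } else {
            retryTable.add(new FailedMessage(msg.payload(), attempts));   // eligible for one more scheduled retry
        }
    }
}

With this kind of counter in place, the scheduled task only republishes messages that still have retries left, and everything else is parked for manual inspection instead of cycling through MQ and the database indefinitely.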