Author: whooyun | Posted on: 2025-07-02 18:45
Problem Description:
The microservice dropped off Nacos. On the interface with the highest concurrency, some HTTP requests never reached the endpoint and triggered retry-related errors. RabbitMQ message processing became extremely slow (up to 8 minutes per message) and a large backlog of messages built up. Restarting the service did not resolve the issue.
On-site Handling Measures:
Increased the maximum number of messages the RabbitMQ queue was allowed to hold.
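The post does not say how the limit was raised. If it was done through the queue's "x-max-length" argument, a minimal Spring AMQP sketch would look like the following; the queue name and ceiling are assumptions, not the values actually used:

import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FailedMessageQueueConfig {

    // Hypothetical queue name and limit; "x-max-length" caps how many messages
    // RabbitMQ keeps in the queue, so raising it lets more messages accumulate
    // instead of being dropped from the head of the queue.
    @Bean
    public Queue failedMessageQueue() {
        return QueueBuilder.durable("order.failed.retry")
                .withArgument("x-max-length", 500_000)   // raised ceiling
                .build();
    }
}

Note that on a queue that already exists, redeclaring it with different arguments is rejected by RabbitMQ, so in practice this kind of limit is usually raised by applying a policy (e.g. rabbitmqctl set_policy) rather than by changing the declaration.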
Post-event Analysis:
A scheduled task was scanning the database for all failed messages from the past 7 days and pushing them back to the application server via MQ for reprocessing, specifically for order status updates and fee generation. Whenever the consumers (order status, fee generation) failed again, the messages were written back into the failed-message table in the database. This created an infinite loop between the message producer and consumers, with messages being written to the database, and therefore to disk, continuously.
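To make the loop concrete, here is a hypothetical reconstruction in Spring terms. Every class, queue, and repository name below is invented for illustration and is not the project's actual code:

import java.time.LocalDateTime;

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class FailedMessageRetryLoop {

    /** Simplified row in the failed-message table. */
    public record FailedMessage(String payload, LocalDateTime createdAt) {}

    /** Minimal view of the DAO over the failed-message table. */
    public interface FailedMessageRepository {
        java.util.List<FailedMessage> findCreatedAfter(LocalDateTime since);
        void save(FailedMessage msg);
    }

    private static final String RETRY_QUEUE = "order.retry.queue";  // assumed queue name

    private final FailedMessageRepository repository;
    private final RabbitTemplate rabbitTemplate;

    public FailedMessageRetryLoop(FailedMessageRepository repository, RabbitTemplate rabbitTemplate) {
        this.repository = repository;
        this.rabbitTemplate = rabbitTemplate;
    }

    // Step 1: the scheduled task republishes every failed message from the past 7 days.
    @Scheduled(fixedDelay = 60_000)
    public void republishFailedMessages() {
        for (FailedMessage msg : repository.findCreatedAfter(LocalDateTime.now().minusDays(7))) {
            rabbitTemplate.convertAndSend(RETRY_QUEUE, msg.payload());
        }
    }

    // Step 2: the consumer handles order-status updates and fee generation; when it
    // fails again, it writes the message straight back to the failed-message table,
    // so step 1 picks the same payload up on the next run. With no retry limit,
    // producer and consumer chase each other indefinitely, hammering the disk.
    @RabbitListener(queues = RETRY_QUEUE)
    public void handle(String payload) {
        try {
            processOrderStatusAndFees(payload);
        } catch (Exception e) {
            repository.save(new FailedMessage(payload, LocalDateTime.now()));
        }
    }

    private void processOrderStatusAndFees(String payload) {
        // assumed business handler for order status updates and fee generation
    }
}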
Monitoring and Analysis After the Incident:
At the time of the anomaly, CPU usage, memory usage, and JVM metrics remained normal — all below 50% utilization. However, disk IOPS reached around 6000, which is approximately 100 times higher than usual.
Root Cause:
The root cause lay in the business code: insufficient thought was given to which types of MQ messages should be persisted to the database. In addition, data behavior was not closely monitored after deployment.
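The post does not describe the eventual fix, but the usual safeguard against this kind of loop is to bound how many times a failed message may be re-persisted and requeued. The following is only a sketch of that general pattern; the class, field names, and limit are invented:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration, not the project's code: cap the number of times a
// failed message may be written back for retry, so the scheduler/consumer cycle terminates.
public class BoundedRetryPolicy {

    private static final int MAX_RETRIES = 5;   // assumed bound

    /** A failed message carrying how many times it has already been retried. */
    public record FailedMessage(String payload, int retryCount) {}

    private final Deque<FailedMessage> retryTable = new ArrayDeque<>();   // stands in for the failed-message table
    private final Deque<FailedMessage> parkingLot = new ArrayDeque<>();   // exhausted messages kept for manual review

    /** Called by the consumer when processing fails. */
    public void recordFailure(FailedMessage msg) {
        int attempts = msg.retryCount() + 1;
        if (attempts > MAX_RETRIES) {
            parkingLot.add(msg);   // stop the loop: do not requeue this message again
        } else {
            retryTable.add(new FailedMessage(msg.payload(), attempts));   // eligible for one more scheduled retry
        }
    }
}

With this kind of counter in place, the scheduled task only republishes messages that still have retries left, and everything else is parked for manual inspection instead of cycling through MQ and the database indefinitely.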