How to untangle scheduled tasks from Feign timeouts?

1 Background

The scheduled-task service often triggers alarm emails reporting a circuit-breaker ("fuse") exception in the middle of the night.

Grouping the alarms by the class named in each email gives the following table:

No. | Failing method | Application owning the interface | Scheduled task class
A | VipTradeReportFeignService#getShopTradeReportByDate | pinka-mod-stats | ShopOrderSturctureTask
B | VipMemberStatsFeignService#statMemberRecord | pinka-mod-stats | MemberStatTask
C | VipPartnerWalletFeignService.handlePartnerWithdraw | pinka-mod-customer | PartnerWithdrawCheckTask
D | VipWeixinBabyActivityFeignService.getBabyActivityNoticePage | pinka-mod-weixin | VipWeixinBabyNoticeTask

Errors A~D all come from external Feign microservice calls made by the distributed scheduled-task application (pinka-mod-scheduler). They correspond to four kinds of task. Each task calls an external Feign interface one or more times, and it is these A~D interface calls that fail.

A and B both fail with a HystrixTimeoutException.

C and D both fail with an exception of the following form:

feign.RetryableException: failed to respond executing POST http://pinka-mod-customer/vip/partner/wallet/handlePartnerWithdraw
at feign.FeignException.errorExecuting(
at feign.SynchronousMethodHandler.executeAndDecode(
at feign.SynchronousMethodHandler.invoke(
at feign.hystrix.HystrixInvocationHandler$
Caused by: org.apache.http.NoHttpResponseException: failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(

2 Tracing

2.1 HystrixTimeoutException (timeout)

The A and B exceptions occur almost every day, and the message is clear: execution took longer than the timeout set in Hystrix (currently 10s). Why did it time out? Reading the interface implementation shows time-consuming logic running inside a for loop.

Historical execution times in the Kibana log system confirm this: these calls generally take more than 13s, so this explanation is essentially correct.

2.1.1 Solutions and thoughts

This is a typical scenario: a scheduled task fires, but the processing logic lives in another microservice and is complex and slow. What can be done?

A. Increasing the timeout is a crude fix. A timeout exists to fail fast, so making it too long can cause bigger problems, and after raising it to 20s you may still hit calls that take 30s or longer. This approach should therefore not be applied to the shared default timeout for all calls.

It can, however, be applied to individual interfaces. For example, VipTradeReportFeignService#getShopTradeReportByDate normally takes more than 15s to execute, so give it its own timeout. The relevant configuration:

#Default public timeout
#Set timeout for a feign interface separately
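
As a rough sketch, the two settings named in the comments could look like this in a Spring Cloud application that wraps Feign calls in Hystrix. The 10s default is the value mentioned in this article. The 20s value and the parameter list (String,String) in the command key are assumptions, because the real method signature is not shown here. By default, Feign's Hystrix integration names each command Interface#method(ParamTypes).

```properties
# Default shared timeout for all Hystrix commands (10s, as stated in this article)
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=10000
# Separate timeout for one Feign method; "(String,String)" is a hypothetical parameter list
hystrix.command.VipTradeReportFeignService#getShopTradeReportByDate(String,String).execution.isolation.thread.timeoutInMilliseconds=20000
```

The HTTP client's own read timeout also has to be at least as long, otherwise the request is cut off before the Hystrix timeout is reached.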

B. Reduce the execution time on the provider side. For example, can the for loop in VipTradeReportFeignService#getShopTradeReportByDate be moved to the caller, so that each call to the provider performs only one iteration of the loop? Put simply, make sure the interface returns within the timeout, which is also in line with microservice interface design principles.
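
Option B can be sketched in plain Java as follows. The article does not say what the loop iterates over, so this sketch assumes it walks a date range one day at a time. PerDayCallSketch, daysBetween and callPerDay are hypothetical names, and the Function parameter stands in for the Feign client method.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class PerDayCallSketch {

    // Splits an inclusive date range into single days (empty if from is after to).
    static List<LocalDate> daysBetween(LocalDate from, LocalDate to) {
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            days.add(d);
        }
        return days;
    }

    // Caller-side loop: one short remote call per day instead of
    // one long call that loops over the whole range on the provider.
    static <R> List<R> callPerDay(LocalDate from, LocalDate to,
                                  Function<LocalDate, R> remoteCall) {
        List<R> results = new ArrayList<>();
        for (LocalDate day : daysBetween(from, to)) {
            results.add(remoteCall.apply(day));
        }
        return results;
    }

    public static void main(String[] args) {
        // The lambda is a stand-in for the real Feign call (assumption).
        List<String> reports = callPerDay(
                LocalDate.of(2022, 5, 1), LocalDate.of(2022, 5, 3),
                day -> "report for " + day);
        System.out.println(reports);
    }
}
```

Each remote call now only has to finish one unit of work within the timeout, and a failure affects a single day rather than the whole run.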

C. Another idea is to make the interface asynchronous: the provider returns immediately and finishes the real work on an asynchronous thread. On its own, though, this makes task execution unreliable, because a successful return from the interface no longer means the work actually succeeded. If the provider restarts or throws an exception halfway through the slow asynchronous logic, the work is interrupted, and the distributed scheduling mechanism cannot be used to retry it. So with this approach the interface returns immediately, but the task must not be marked successful at that point. It needs an asynchronous notification mechanism: once the provider has really finished the slow operation, it notifies the caller, and only then does the caller report the task as successful.
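
The notification idea in option C can be sketched with the standard library alone. This is a single-process model, not the real cross-service implementation: AsyncNotifySketch, acceptAndRunAsync, trigger and onFinished are hypothetical names, and the callback stands in for whatever channel (HTTP callback, message queue) the provider would use to notify the scheduler.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

final class AsyncNotifySketch {
    enum Status { RUNNING, SUCCESS, FAILED }

    // Caller (scheduler) side: the status it would report for each task run.
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();

    // Provider side (hypothetical): accept the request, return at once, and
    // invoke the callback only after the slow work has finished or failed.
    static CompletableFuture<Void> acceptAndRunAsync(String taskId, Runnable slowWork,
                                                     BiConsumer<String, Boolean> callback) {
        return CompletableFuture.runAsync(slowWork)
                .<Void>handle((ignored, error) -> {
                    callback.accept(taskId, error == null);
                    return null;
                });
    }

    // Caller side: trigger the task but keep it RUNNING until notified.
    CompletableFuture<Void> trigger(String taskId, Runnable slowWork) {
        statuses.put(taskId, Status.RUNNING);
        return acceptAndRunAsync(taskId, slowWork, this::onFinished);
    }

    // Caller side: the notification decides the final task status.
    void onFinished(String taskId, boolean success) {
        statuses.put(taskId, success ? Status.SUCCESS : Status.FAILED);
    }

    Status statusOf(String taskId) {
        return statuses.get(taskId);
    }

    public static void main(String[] args) {
        AsyncNotifySketch scheduler = new AsyncNotifySketch();
        scheduler.trigger("report-1", () -> { /* slow work would run here */ }).join();
        System.out.println(scheduler.statusOf("report-1"));
    }
}
```

If the provider dies before it can notify, the task simply stays RUNNING, so a real scheduler would presumably also need a deadline after which it treats the run as failed and retries.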

2.2 feign.RetryableException: failed to respond executing

This is the exception seen in C and D, a random, low-frequency alarm. Taken literally, it means the interface request got no response. Combined with the words "circuit breaking" ("fusing") in the email, the natural guess was a problem in the application providing the interface (as it turned out, that wording was misleading). So we checked the monitoring metrics of pinka-mod-customer before and after the alarm and found nothing abnormal in TCP connections, CPU, memory, or network traffic. Besides, a circuit breaker only opens after an interface has failed many times, whereas each scheduled task run calls the interface only once.

Next, checking the provider's controller-layer logs showed that no request reached the controller at the time of the alarm.

From this we can infer that the provider application itself is fine. The caller's application logs and performance metrics also show nothing abnormal at that time, and logs for calls to other applications kept being written. Together with the exception log, this suggests a momentary network glitch on one call between the caller and the provider, which would explain why the alarm is random and infrequent.

That still does not explain why the circuit breaker would have opened. So we traced the code that sends the alarm email. The alarm is implemented by overriding the getFallback method in the HystrixCommand that OpenFeign creates, so an email is sent whenever the fallback logic is entered.

That settles it. Entering the fallback (degradation) does not mean the circuit breaker is open. An exception thrown in the run method of a HystrixCommand enters the fallback, a run that times out enters the fallback, and an open circuit breaker enters the fallback too. So although the emails for A~D said the circuit breaker had tripped, it had not: the calls had merely entered the fallback.
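
The three routes into the fallback can be modelled with a small standard-library sketch. It only imitates the behaviour described here and is not Hystrix itself: FallbackReasonSketch, Reason and execute are hypothetical names, and the circuitOpen flag is passed in by hand instead of being computed from failure statistics.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FallbackReasonSketch {
    // NONE means the call succeeded; every other value means "fallback entered".
    enum Reason { NONE, FAILURE, TIMEOUT, SHORT_CIRCUITED }

    // Runs the call roughly the way a Hystrix-style command would and
    // reports which path, if any, led to the fallback.
    static Reason execute(Callable<String> run, long timeoutMillis, boolean circuitOpen) {
        if (circuitOpen) {
            return Reason.SHORT_CIRCUITED; // breaker open: run is never invoked
        }
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(run);
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Reason.NONE;
        } catch (TimeoutException e) {
            return Reason.TIMEOUT;         // run took longer than the timeout
        } catch (Exception e) {
            return Reason.FAILURE;         // run threw an exception
        } finally {
            pool.shutdownNow();            // interrupt a still-running call
        }
    }

    public static void main(String[] args) {
        System.out.println(execute(() -> "ok", 1000, false));
        System.out.println(execute(() -> { throw new IllegalStateException("boom"); }, 1000, false));
        System.out.println(execute(() -> { Thread.sleep(5000); return "late"; }, 50, false));
        System.out.println(execute(() -> "ok", 1000, true));
    }
}
```

Only the last case corresponds to an open circuit breaker. An alarm that fires on every fallback cannot tell it apart from the other two.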

So feign.RetryableException: failed to respond executing is just an occasional failed call followed by a fallback. It is not as complicated as first thought.

2.2.1 Solutions and thoughts

Naturally, the email alarm logic should be changed to distinguish circuit breaking from plain degradation. To detect an open circuit breaker, something like the following can be used:

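protected Object getFallback() {
    if (this.isCircuitBreakerOpen()) {
        // Circuit breaker is open: send the circuit-breaker alarm
    } else {
        // Plain fallback (exception or timeout): send a degradation alarm, or nothing if no alarm is needed
    }
    // ... the existing fallback logic and its return value continue here
}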

Tags: Spring

Posted by allex01 on Mon, 16 May 2022 01:32:34 +0300