Contents
  1. What is quarantine
  2. Solution

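I previously built a scheduling and load-balancing system on top of akka cluster. It ran fairly stably for half a year, but recently a node went down twice. Lacking experience, the first time I assumed out of habit that memory pressure had crashed the program (compute resources are tight, many services are co-located on the same machines, memory really is very tight, and similar failures happen often). The second time I went through the logs carefully and found the following entry: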

[WARN] [03/17/2018 21:51:38.769] [cluster-akka.remote.default-remote-dispatcher-91] [akka.remote.Remoting] Tried to associate with unreachable remote address [akka.tcp://cluster@10.104.3.35:7712]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

Meanwhile, the other, healthy nodes logged the following:

[INFO] [03/13/2018 21:36:12.659] [cluster-akka.remote.default-remote-dispatcher-35339] [akka.remote.Remoting] Quarantined address [akka.tcp://cluster@10.104.3.36:7712] is still unreachable or has not been restarted. Keeping it quarantined.

The master also recorded:

[ERROR] [03/17/2018 21:51:37.662] [cluster-akka.remote.default-remote-dispatcher-127527] [akka.remote.Remoting] Association to [akka.tcp://cluster@10.104.3.36:7712] with UID [1258718596] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Remote system has been silent for too long. (more than 48.0 hours)
    at akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:383)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[03/17/2018 21:51:37.749] [cluster-akka.actor.default-dispatcher-105] [akka.tcp://cluster@10.104.3.35:7712/system/cluster/core/daemon] Cluster Node [akka.tcp://cluster@10.104.3.35:7712] - Marking node as TERMINATED [akka.tcp://cluster@10.104.3.36:7712], due to quarantine. Node roles [dc-default]

What is quarantine

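Literally it means isolation (as an aside, there is a story behind how the word came to carry that meaning). My rough guess was that a GC pause or network jitter had made the cluster consider this node unhealthy and evict it, so I looked up some material.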

If akka cluster decides that a node would harm the health of the cluster, it quarantines that node. There are three possible causes:

  1. System message delivery failure

  2. Remote DeathWatch trigger

  3. Cluster MemberRemoved event

Solution

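According to the akka documentation, you can adjust akka.cluster.failure-detector.threshold to set the detection threshold and avoid false positives caused by occasional jitter. It should not be set too high either, because a higher threshold also means real failures take longer to detect.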
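As a rough sketch of what such tuning might look like: the values below are illustrative assumptions, not the settings used by the system described here. Both keys come from akka's reference configuration, where threshold defaults to 8.0 and acceptable-heartbeat-pause to 3 s. The akka documentation mentions raising the threshold to about 12 in cloud environments such as EC2.

```hocon
akka.cluster.failure-detector {
  # Assumed value for illustration; the default is 8.0.
  # Higher means fewer false positives but slower detection of real crashes.
  threshold = 12.0

  # Assumed value for illustration; the default is 3 s.
  # How long heartbeats may be missing (e.g. during a GC pause)
  # before the silence counts as an anomaly.
  acceptable-heartbeat-pause = 6 s
}
```

Whether these numbers suit a given deployment depends on how long its GC pauses and network hiccups actually last, so they are best chosen from observed behaviour rather than copied.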
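In addition, to keep the cluster subsystem from competing with business threads, you can give it a dedicated thread pool. Add the following to the configuration: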

akka.cluster.use-dispatcher = cluster-dispatcher
cluster-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-max = 4
  }
}

The default value of akka.cluster.use-dispatcher is empty.

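Finally, none of the measures above can guarantee that a node never goes down; the best approach is still to build in proper fault tolerance.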
