RabbitMQ fails to start and cannot rejoin the cluster

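Problem description

The existing rabbitmq cluster ran into problems and would not start. I tried deleting /var/lib/rabbitmq/.erlang.cookie and rebuilding the cluster, but the node still failed to start.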

# systemctl start rabbitmq-server.service
Job for rabbitmq-server.service failed because the control process exited with error code. See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.

Analysis

Check the error log:

# journalctl -xe
-- Subject: Unit rabbitmq-server.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit rabbitmq-server.service has begun starting up.
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: BOOT FAILED
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: ===========
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: Error description:
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: {error,{inconsistent_cluster,"Node rabbit@controller03 thinks it's clustered with node rabbit@controller02, but rabbit@controller02 disagrees"}}
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: Log files (may contain more information):
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: /var/log/rabbitmq/rabbit@controller03.log
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: /var/log/rabbitmq/rabbit@controller03-sasl.log
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: Stack trace:
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: [{rabbit_mnesia,check_cluster_consistency,0,
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: [{file,"src/rabbit_mnesia.erl"},{line,598}]},
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: {rabbit,'-boot/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,275}]},
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,296}]},
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: {init,start_it,1,[]},
Nov 24 14:26:20 controller03 rabbitmq-server[13522]: {init,start_em,1,[]}]
Nov 24 14:26:21 controller03 rabbitmq-server[13522]: {"init terminating in do_boot",{error,{inconsistent_cluster,"Node rabbit@controller03 thinks it's clustered with node rabbit@controller02, but rabbit@controller02 disagrees"}}}
Nov 24 14:26:21 controller03 rabbitmq-server[13522]: init terminating in do_boot ()
Nov 24 14:26:22 controller03 rabbitmq-server[13522]: Crash dump is being written to: erl_crash.dump...done
Nov 24 14:26:22 controller03 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE

The log contains this error description:

 {error,{inconsistent_cluster,"Node rabbit@controller03 thinks it's clustered with node rabbit@controller02, but rabbit@controller02 disagrees"}}

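controller03 believes controller02 is one of its cluster nodes, but controller02 does not agree.

My guess is that cluster information left over from the previous cluster makes the cluster consistency check fail at boot (the stack trace points at rabbit_mnesia:check_cluster_consistency). The official site says the same: stale data in the mnesia database causes this check to fail.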
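
When the journal is long, a small filter can pull out the two node names involved. This is only a sketch: `inconsistent_pair` is a name made up for this post, not a RabbitMQ tool, and it assumes the error message has exactly the wording shown above.

```shell
# Hypothetical helper (not part of RabbitMQ): read log text on stdin and print
# "<node that failed to boot> <peer that disagrees>" for the first
# inconsistent_cluster error found. Prints nothing if there is no such error.
inconsistent_pair() {
  sed -n 's/.*Node \(rabbit@[^ ]*\) thinks it.s clustered with node \(rabbit@[^ ,]*\), but.*/\1 \2/p' | head -n 1
}

# Usage (assumes the systemd unit name used in this post):
#   journalctl -u rabbitmq-server.service --no-pager | inconsistent_pair
```

Running `rabbitmqctl cluster_status` on the second node it prints (here rabbit@controller02) shows whether that node still lists the failing one as a member.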

Solution

1. Delete the existing mnesia data

# rm -rf /var/lib/rabbitmq/mnesia
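
Deleting this directory throws away everything the node stored locally, so a more cautious variant is to stop the service and move the directory aside instead. A minimal sketch, assuming the data path used in this post; `set_aside_dir` and the `.bak.<timestamp>` suffix are illustrative choices of mine, not anything RabbitMQ provides:

```shell
# Hypothetical helper: move a directory aside instead of deleting it,
# then print the path of the backup copy.
set_aside_dir() {
  dir="${1%/}"    # drop a trailing slash, if any
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
  mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}

# Usage (as root, with the service stopped first):
#   systemctl stop rabbitmq-server.service
#   set_aside_dir /var/lib/rabbitmq/mnesia
```

Once the node has rejoined the cluster and everything checks out, the backup directory can be removed.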

2. Restart the service; the node comes back up normally

# systemctl restart rabbitmq-server.service

# rabbitmqctl cluster_status
Cluster status of node rabbit@controller03 ...
[{nodes,[{disc,[rabbit@controller03]}]},
 {running_nodes,[rabbit@controller03]},
 {cluster_name,<<"rabbit@controller03">>},
 {partitions,[]},
 {alarms,[{rabbit@controller03,[]}]}]

3. Join the cluster and check its status

# rabbitmqctl stop_app
Stopping node rabbit@controller03 ...
# rabbitmqctl join_cluster --ram rabbit@controller01
Clustering node rabbit@controller03 with rabbit@controller01 ...


# rabbitmqctl start_app
Starting node rabbit@controller03 ...

# rabbitmqctl cluster_status
Cluster status of node rabbit@controller03 ...
[{nodes,[{disc,[rabbit@controller01]},
         {ram,[rabbit@controller03,rabbit@controller02]}]},
 {running_nodes,[rabbit@controller01,rabbit@controller03]},
 {cluster_name,<<"rabbit@controller01">>},
 {partitions,[]},
 {alarms,[{rabbit@controller01,[]},{rabbit@controller03,[]}]}]
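
One thing to notice in this output: rabbit@controller02 is listed as a ram member under nodes but does not appear in running_nodes, so it is clustered but not up. A check like that can be scripted. The sketch below assumes the Erlang-term output format shown above, with the running_nodes list on a single line; as far as I know newer RabbitMQ releases print a different format, so treat it as illustrative. `node_is_running` is a made-up name.

```shell
# Hypothetical helper: succeed if NODE appears in the {running_nodes,[...]}
# term read from stdin (output of `rabbitmqctl cluster_status` in the
# format shown in this post).
node_is_running() {
  node="$1"
  grep -F '{running_nodes,' \
    | tr -d ' ' \
    | sed 's/.*{running_nodes,\[\(.*\)\]}.*/\1/' \
    | tr ',' '\n' \
    | grep -qxF -- "$node"
}

# Usage:
#   rabbitmqctl cluster_status | node_is_running rabbit@controller02 \
#     || echo "rabbit@controller02 is not running"
```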