A Detailed Analysis of a Redis Cluster Failure
Symptom:
The application layer reported that queries against Redis were failing.
Cluster topology:
3 masters and 3 slaves; each node holds about 8 GB of data.
Machine placement:
All in the same rack:
xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201
redis-server process status:
Running ps -eo pid,lstart | grep $pid
showed the processes had been up continuously for 3 months.
Cluster node states before the failure:
xx.x.xxx.200:8371 (bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373 (792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375 (5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372 (826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370 (462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374 (1238085b578390f3c8efa30824fd9a4baba10ddf) slave
--------------------------------- Log analysis below --------------------------------------
Step 1:
Master 8371 lost its connection to slave 8373:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.
Step 2:
Masters 8370/8375 marked 8371 as failing:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).
Step 3:
Slaves 8372/8373/8374 received the FAIL message about 8371 from master 8375:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65
Step 4:
Masters 8370/8375 granted failover authorization to 8373, allowing its promotion to master:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16
Step 5:
The old master 8371 updated its own configuration to become a slave of 8373:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6
Step 6:
Masters 8370/8375/8373 cleared the FAIL state for 8371:
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.
Step 7:
The new slave 8371 began its first full data sync from the new master 8373:
8373 log:
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371 log:
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993
Step 8:
Masters 8370/8375 marked 8373 (the new master) as failing:
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).
Step 9:
Masters 8370/8375 then judged 8373 (the new master) reachable again:
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.
Step 10:
Master 8373 completed the BGSAVE needed for the full resync:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success
Step 11:
Slave 8371 started receiving data from master 8373:
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master
Step 12:
Master 8373 closed the connection to slave 8371 for exceeding the output buffer limit:
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.
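The omem field in the client line above records how many bytes were queued in that slave's output buffer. A small Python sketch (parsing logic and variable names are illustrative, not Redis internals) compares it against the slave-class limits this cluster was running with (client-output-buffer-limit slave 256mb 64mb 60):

```python
# Parse the "omem" (output buffer memory) field from the client info line in
# the log above, then compare it to the slave-class buffer limits.
log_fields = ("id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 "
              "idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 "
              "obl=16349 oll=4103 omem=95944066 events=rw cmd=psync")
fields = dict(f.split("=", 1) for f in log_fields.split() if "=" in f)

omem = int(fields["omem"])          # bytes buffered for this slave
hard_limit = 256 * 1024 * 1024      # 256mb: disconnect immediately
soft_limit = 64 * 1024 * 1024       # 64mb: disconnect if exceeded for 60 s

print(round(omem / 1024 / 1024, 1))  # ~91.5 MB buffered
print(omem >= hard_limit)            # False: hard limit never reached
print(omem >= soft_limit)            # True: soft limit exceeded
```

Since omem (~91.5 MB) sat above the 64 MB soft limit while the client age was 148 s, the 60-second soft-limit window is what killed the connection, not the 256 MB hard limit.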
Step 13:
Slave 8371's sync from master 8373 failed because the connection was lost; the first full resync had failed:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started
Step 14:
Slave 8371 retried the sync, but the connection failed: master 8373 had run out of client connections:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'
Step 15:
Slave 8371 reconnected to master 8373 and the second full resync began:
8371 log:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373 log:
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success
Step 16:
Slave 8371 received the data successfully and began loading it into memory:
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory
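From the two timestamps above, the transfer rate of the ~2.6 GB RDB payload can be estimated. A rough back-of-the-envelope sketch in Python:

```python
# Estimate RDB transfer throughput from the log timestamps above:
# receiving starts at 19:01:53.190, "Flushing old data" begins at 19:04:51.485.
from datetime import datetime

t = lambda s: datetime.strptime("09 Sep " + s, "%d %b %H:%M:%S.%f")
rdb_bytes = 2637183250               # bytes reported by the slave
start = t("19:01:53.190")            # transfer begins
done = t("19:04:51.485")             # transfer done, flush begins

secs = (done - start).total_seconds()
print(round(secs))                                # ~178 s for the transfer
print(round(rdb_bytes / secs / 1024 / 1024, 1))   # ~14.1 MB/s
```

At roughly 14 MB/s for a 2.6 GB payload, a full resync of an 8 GB node inherently takes minutes, which is why the article later recommends keeping instances small.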
Step 17:
The cluster returned to normal:
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.
Step 18:
Slave 8371 finished syncing successfully; the resync took about 7 minutes in total:
46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success
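The roughly 7-minute figure can be verified from the log timestamps of the second (successful) resync:

```python
# Reconstruct the timeline of the second full resync from the log timestamps:
# reconnect (Step 15), BGSAVE completion (Step 15), final success (Step 18).
from datetime import datetime

t = lambda s: datetime.strptime("09 Sep " + s, "%d %b %H:%M:%S.%f")
reconnect = t("19:00:49.175")    # slave reconnects, resync starts
bgsave_done = t("19:01:53.186")  # master's background save finishes
finished = t("19:08:19.303")     # "MASTER <-> SLAVE sync: Finished with success"

print(round((bgsave_done - reconnect).total_seconds()))  # ~64 s on BGSAVE alone
print(round((finished - reconnect).total_seconds()))     # ~450 s, about 7.5 min
```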
Why 8371 was judged lost:
Since the machines sit in the same rack, a network outage was unlikely. Checking the slow query log with SLOWLOG GET revealed that a KEYS command had been executed, taking 8.3 seconds; checking the cluster node timeout showed it was set to 5 s (cluster-node-timeout 5000).
Timeline of the disconnection:
The client executed a single command that took 8.3 s:
2016/9/9 18:57:43 the KEYS command started executing
2016/9/9 18:57:50 8371 was judged lost (per the Redis logs)
2016/9/9 18:57:51 the KEYS command finished
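The timeline above can be checked arithmetically: because Redis is single-threaded, the server could not answer cluster health checks while KEYS ran, and the blockage outlasted cluster-node-timeout before the command even finished.

```python
# Verify that the KEYS command blocked the node past cluster-node-timeout,
# using the three timestamps from the timeline above.
from datetime import datetime

t = lambda s: datetime.strptime("2016/9/9 " + s, "%Y/%m/%d %H:%M:%S")
keys_start = t("18:57:43")     # KEYS begins; the single-threaded server blocks
marked_failed = t("18:57:50")  # quorum marks 8371 as failing
keys_end = t("18:57:51")       # KEYS returns after ~8.3 s

timeout_s = 5                  # cluster-node-timeout 5000 (ms)
blocked_when_marked = (marked_failed - keys_start).total_seconds()

print(blocked_when_marked)                 # 7.0 s blocked when marked failing
print(blocked_when_marked > timeout_s)     # True: already past the 5 s timeout
```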
In summary, several problems combined:
1. Because cluster-node-timeout was set rather short, the slow KEYS query caused the cluster to judge node 8371 lost.
2. Because 8371 was judged lost, 8373 was promoted to master and a full master-slave sync began.
3. The client-output-buffer-limit setting caused the first full resync to fail.
4. A bug in the PHP client's connection pool made it reconnect to the server frantically, producing something like a SYN flood.
5. After the first full resync failed, the slave took 30 seconds to reconnect to the master (the max number of clients, 10,000, had been exceeded).
About the client-output-buffer-limit parameter:
# The syntax of every client-output-buffer-limit directive is the following:
#
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).
# So for instance if the hard limit is 32 megabytes and the soft limit is
# 16 megabytes / 10 seconds, the client will get disconnected immediately
# if the size of the output buffers reach 32 megabytes, but will also get
# disconnected if the client reaches 16 megabytes and continuously overcomes
# the limit for 10 seconds.
#
# By default normal clients are not limited because they don't receive data
# without asking (in a push way), but just after a request, so only
# asynchronous clients may create a scenario where data is requested faster
# than it can read.
#
# Instead there is a default limit for pubsub and slave clients, since
# subscribers and slaves receive data in a push fashion.
#
# Both the hard or the soft limit can be disabled by setting them to zero.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
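The hard/soft rule described in those config comments can be sketched in a few lines of Python (a minimal illustration of the rule; function and variable names are this sketch's own, not Redis internals):

```python
# Minimal sketch of the disconnect rule: drop a client immediately at the hard
# limit, or once it has stayed above the soft limit for soft_seconds in a row.
def should_disconnect(omem, secs_over_soft, hard, soft, soft_seconds):
    if hard and omem >= hard:
        return True            # hard limit reached: disconnect immediately
    if soft and omem >= soft and secs_over_soft >= soft_seconds:
        return True            # soft limit held continuously for too long
    return False               # a limit of zero disables that check

MB = 1024 * 1024
# slave class in this cluster: 256mb hard, 64mb soft, 60 seconds
print(should_disconnect(300 * MB, 0, 256 * MB, 64 * MB, 60))   # True: hard
print(should_disconnect(96 * MB, 61, 256 * MB, 64 * MB, 60))   # True: soft held
print(should_disconnect(96 * MB, 10, 256 * MB, 64 * MB, 60))   # False: too brief
```

This is why the slave connection in Step 12 died at ~91.5 MB of buffered output: well under the 256 MB hard cap, but over the 64 MB soft cap for longer than 60 seconds.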
Measures taken:
1. Split single instances down to under 4 GB of data; otherwise a master-slave failover takes a very long time.
2. Tune the client-output-buffer-limit parameter so a sync cannot fail partway through.
3. Raise cluster-node-timeout; it should be no less than 15 s.
4. Forbid any slow query that takes longer than cluster-node-timeout, since it will trigger a failover.
5. Fix the client's SYN-flood-like frantic reconnection behavior.
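Measures 2-4 translate into redis.conf directives along these lines. The specific values below are illustrative suggestions consistent with the article's recommendations, not settings the article states verbatim; tune them for your own workload:

```
# redis.conf sketch reflecting the measures above (illustrative values)
cluster-node-timeout 15000                        # no less than 15 s (measure 3)
client-output-buffer-limit slave 512mb 128mb 120  # roomier slave buffer (measure 2)
rename-command KEYS ""                            # disable KEYS outright (measure 4)
```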