Using Sentinel for Redis Cluster High Availability on CentOS

布满荆棘的人生 2022-02-19 14:45

### **I. Introduction to Redis Sentinel** ###

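The Sentinel process monitors the working state of the Master in a Redis deployment. When the Master fails, Sentinel switches roles between the Master and a Slave, keeping the system highly available. Sentinel is a distributed system: you can run multiple Sentinel processes in one architecture. These processes use **gossip protocols** to receive information about whether the Master is offline, and **agreement protocols** to decide whether to perform an automatic failover and which Slave to promote to the new Master.

Each Sentinel process periodically sends messages to the other Sentinels, the Master, and the Slaves to confirm that they are alive. If a peer does not reply within a configurable time, the Sentinel provisionally considers it offline. This is the **subjectively down** state (Subjective Down, SDOWN). Where there is a subjective down state, there is also an objective one. When at least the configured quorum of Sentinel processes have judged the Master to be SDOWN, and have exchanged that judgment through the SENTINEL is-master-down-by-addr command, the Master is considered **objectively down** (Objectively Down, ODOWN). The Sentinels then start a **failover**: through a vote, one of the remaining Slave nodes is chosen and promoted to Master, and the related configuration is updated automatically.

Although Sentinel has its own executable, redis-sentinel, it is really just a Redis server running in a special mode.

The Sentinels in a group communicate with one another, exchange the state of the Redis nodes, and act on it. The subjectively down and objectively down states matter most here, because they determine whether a failover takes place. By subscribing to specific channels, an administrator can be notified when a server fails. A client can treat Sentinel as a Redis server that offers only subscription: you cannot use the PUBLISH command to send messages to it, but you can use SUBSCRIBE or PSUBSCRIBE to receive event notifications from given channels. A channel receives the events whose name matches the channel name. For example, the channel named +sdown receives every event in which an instance enters the subjectively down (SDOWN) state.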
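
To make the event channels more concrete, here is a minimal Python sketch that parses the payload of an instance event such as +sdown. It assumes the payload layout described in the Redis Sentinel documentation, `<instance-type> <name> <ip> <port> @ <master-name> <master-ip> <master-port>`, where the part from `@` onward is omitted when the instance is itself a master. The names `SentinelEvent` and `parse_instance_event` are invented for this illustration and are not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentinelEvent:
    channel: str                       # e.g. "+sdown"
    instance_type: str                 # "master", "slave" or "sentinel"
    name: str
    ip: str
    port: int
    master_name: Optional[str] = None  # only set when the instance is not a master
    master_ip: Optional[str] = None
    master_port: Optional[int] = None


def parse_instance_event(channel: str, payload: str) -> SentinelEvent:
    """Parse the payload of an instance event (assumed layout, see lead-in)."""
    head, sep, tail = payload.partition(" @ ")
    fields = head.split()
    if len(fields) != 4:
        raise ValueError(f"unexpected instance part: {payload!r}")
    instance_type, name, ip, port = fields
    event = SentinelEvent(channel, instance_type, name, ip, int(port))
    if sep:
        master_fields = tail.split()
        if len(master_fields) != 3:
            raise ValueError(f"unexpected master part: {payload!r}")
        event.master_name = master_fields[0]
        event.master_ip = master_fields[1]
        event.master_port = int(master_fields[2])
    return event
```

Calling `parse_instance_event("+sdown", "master master1 192.168.98.105 7000")` would give an event whose `instance_type` is `"master"` and whose master fields stay `None`. In a real deployment the payload would come from a SUBSCRIBE or PSUBSCRIBE connection to a Sentinel.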

--------------------

**1. What the Sentinel process does**

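(1) Monitoring: Sentinel continuously checks whether your Master and Slaves are working properly.

(2) Notification: When a monitored Redis node has a problem, Sentinel can notify the administrator or other applications through an API.

(3) Automatic failover: When a Master stops working properly, Sentinel starts an automatic failover. It promotes one of the failed Master's Slaves to be the new Master and reconfigures the other Slaves to replicate from it. Clients that ask Sentinel for the Master's address are given the address of the new Master, so the new Master takes over from the failed one. After the switch, the Master's redis.conf, the Slaves' redis.conf, and sentinel.conf are all rewritten accordingly: the old Master's redis.conf gains a slaveof line, and the monitoring target in sentinel.conf changes to the new Master.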
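
How Sentinel chooses the Slave to promote can be sketched in a few lines. The following is a simplified Python illustration, assuming the ordering described in the Redis Sentinel documentation: Slaves with priority 0 are never promoted, a lower priority wins, then the larger replication offset, then the lexicographically smaller run ID. The `Replica` class and `pick_replica` function are hypothetical names, and the sketch leaves out the real filter that skips Slaves disconnected from the Master for too long.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    priority: int     # configured slave priority; 0 means "never promote"
    repl_offset: int  # how much of the Master's data stream was processed
    run_id: str


def pick_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Return the preferred promotion candidate, or None if there is none."""
    candidates = [r for r in replicas if r.priority != 0]
    if not candidates:
        return None
    # Lower priority first, then higher offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))
```

With two Slaves of equal priority, the one that has replicated more data is preferred, which limits how much data is lost in the switch.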

--------------------

**2. How the Sentinel process works**

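(1) Each Sentinel process sends a PING command **once per second** to the Master, the Slaves, and the other Sentinel processes.

(2) If the time since an instance's last valid reply to PING exceeds the value of the down-after-milliseconds option, the Sentinel process marks that instance as subjectively down (SDOWN).

(3) If a Master is marked SDOWN, all Sentinel processes monitoring that Master check once per second to confirm that it really has entered the subjectively down state.

(4) When enough Sentinel processes (at least the number set in the configuration file) confirm within the specified time range that the Master is SDOWN, the Master is marked as objectively down (ODOWN).

(5) Normally, each Sentinel process sends an INFO command to all Masters and Slaves once every 10 seconds.

(6) When a Master is marked ODOWN, the Sentinel processes send INFO to all Slaves of that Master once per second instead of once every 10 seconds.

(7) If not enough Sentinel processes agree that the Master is offline, its ODOWN state is removed. If the Master again returns a valid reply to a Sentinel's PING command, its SDOWN state is removed.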
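
The SDOWN and ODOWN rules above can be sketched as a small pure-Python model. It is only an illustration under simplifying assumptions: timestamps are passed in as plain millisecond numbers, each Sentinel is represented just by the time of its last valid PING reply from the Master, and the function names are invented for this example rather than taken from Redis.

```python
from typing import Dict


def is_sdown(now_ms: int, last_valid_reply_ms: int, down_after_ms: int) -> bool:
    """One Sentinel's local view: no valid PING reply for longer than
    down-after-milliseconds means subjectively down."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_odown(last_replies: Dict[str, int], now_ms: int,
             down_after_ms: int, quorum: int) -> bool:
    """Objectively down once at least `quorum` Sentinels see SDOWN.

    `last_replies` maps a Sentinel name to the time (ms) of the last valid
    PING reply that Sentinel received from the Master.
    """
    votes = sum(is_sdown(now_ms, t, down_after_ms) for t in last_replies.values())
    return votes >= quorum
```

With a quorum of 2 and three Sentinels, one Sentinel losing contact is not enough to trigger ODOWN; a second one has to reach the same conclusion.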

--------------------

**3. Configuring the Sentinel environment**

Set up Sentinel monitoring for each master and its slaves.

### **II. Configuration file parameters** ###

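First, edit the Sentinel configuration file.

(1) port: this must be changed, because we start three Sentinel nodes and each needs its own port.

(2) dir: the working directory Sentinel uses at runtime.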

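(3) sentinel monitor <master-name> <ip> <redis-port> <quorum>:

Monitors a master named <master-name>. There is no need to list the slaves: once the master is monitored, Sentinel discovers its slaves automatically. The trailing quorum is the minimum number of Sentinels that must agree the master is unreachable before it is marked as objectively down.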

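(4) sentinel down-after-milliseconds <master-name> <milliseconds>:

If the monitored node gives no valid reply for <milliseconds>, the Sentinel considers it subjectively down. Once quorum Sentinels consider the node down, it is considered objectively down.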

(5) sentinel parallel-syncs <master-name> <numslaves>: during a failover, at most numslaves slaves resynchronize with the new master at the same time.
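
Because several Sentinel files differ only in their port, they can be generated from one template. The sketch below is a hypothetical Python helper (`render_sentinel_conf` is not a Redis tool). The master name, address, and timing values follow the sentinel1.conf used in this article, while ports 27001 and 27002 for the second and third Sentinel are assumptions chosen only for illustration.

```python
def render_sentinel_conf(port: int, master_name: str, ip: str, redis_port: int,
                         quorum: int = 2, down_after_ms: int = 30000,
                         parallel_syncs: int = 1,
                         failover_timeout_ms: int = 180000) -> str:
    """Build the text of a minimal sentinel.conf for one monitored master."""
    lines = [
        f"port {port}",
        "daemonize yes",
        "dir /tmp",
        f"sentinel monitor {master_name} {ip} {redis_port} {quorum}",
        f"sentinel down-after-milliseconds {master_name} {down_after_ms}",
        f"sentinel parallel-syncs {master_name} {parallel_syncs}",
        f"sentinel failover-timeout {master_name} {failover_timeout_ms}",
    ]
    return "\n".join(lines) + "\n"


# Assumed port scheme: 27000, 27001, 27002 for the three Sentinels.
configs = {
    f"sentinel{i + 1}.conf": render_sentinel_conf(27000 + i, "master1",
                                                  "192.168.98.105", 7000)
    for i in range(3)
}
```

Each value in `configs` is the file content as a string; writing it to disk is left out so the sketch has no side effects.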

The three modified files are sentinel1.conf, sentinel2.conf, and sentinel3.conf. Their contents are as follows:

**sentinel1.conf:**

    # Example sentinel.conf

    # port <sentinel-port>
    # The port that this sentinel instance will run on
    port 27000

    # Run in the background
    daemonize yes

    # dir <working-directory>
    # Working directory
    dir /tmp

    # sentinel monitor <master-name> <ip> <redis-port> <quorum>
    # Note: master name should not include special characters or spaces.
    # The valid charset is A-z 0-9 and the three characters ".-_".
    #
    # Monitor the masters. With a quorum of 2, at least 2 Sentinels must
    # agree that a master is down before it is marked objectively down.
    # For master2 and master3, put each master's port in place of <redis-port>.
    sentinel monitor master1 192.168.98.105 7000 2
    sentinel monitor master2 192.168.98.105 <redis-port> 2
    sentinel monitor master3 192.168.98.105 <redis-port> 2

    # sentinel down-after-milliseconds <master-name> <milliseconds>
    # Number of milliseconds the master must be unreachable (no valid reply
    # to PING for the specified period) in order to consider it in S_DOWN
    # state (Subjectively Down).
    # Default is 30 seconds.
    #
    # A master that gives no valid reply for 30 seconds is marked
    # subjectively down by this Sentinel.
    sentinel down-after-milliseconds master1 30000
    sentinel down-after-milliseconds master2 30000
    sentinel down-after-milliseconds master3 30000

    # sentinel parallel-syncs <master-name> <numslaves>
    #
    # How many slaves we can reconfigure to point to the new slave simultaneously
    # during the failover. Use a low number if you use the slaves to serve query
    # to avoid that all the slaves will be unreachable at about the same
    # time while performing the synchronization with the master.
    #
    # 1 means at most one slave at a time resynchronizes with the new master.
    sentinel parallel-syncs master1 1
    sentinel parallel-syncs master2 1
    sentinel parallel-syncs master3 1

    # sentinel failover-timeout <master-name> <milliseconds>
    # Default is 3 minutes.
    #
    # Time allowed for a failover, in milliseconds. If it is exceeded, the
    # failover is considered failed.
    sentinel failover-timeout master1 180000
    sentinel failover-timeout master2 180000
    sentinel failover-timeout master3 180000
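
Two mistakes are easy to make in a file like this: leaving out an argument of `sentinel monitor`, and putting a comment at the end of a directive line, which the Redis configuration parser reads as extra arguments rather than as a comment. The following Python sketch checks for both. It is an illustrative linter under those two assumptions, not an official tool, and `check_sentinel_conf` is a name invented here.

```python
from typing import List


def check_sentinel_conf(text: str) -> List[str]:
    """Return a list of human-readable problems found in sentinel.conf text."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and whole-line comments are fine
        args = line.split()
        if any(arg.startswith("#") for arg in args):
            problems.append(f"line {lineno}: move the trailing comment to its own line")
            continue
        if args[:2] == ["sentinel", "monitor"]:
            # Expected: sentinel monitor <master-name> <ip> <redis-port> <quorum>
            if len(args) != 6:
                problems.append(f"line {lineno}: sentinel monitor needs 4 arguments")
            elif not (args[4].isdigit() and args[5].isdigit()):
                problems.append(f"line {lineno}: port and quorum must be numbers")
    return problems
```

Running it on a file's text before starting redis-sentinel returns an empty list when neither problem is present.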

### **III. Failure test** ###

1. After configuring sentinel1.conf, sentinel2.conf, and sentinel3.conf, start the Sentinels:

    redis-sentinel /usr/local/redis-3.0.7/sentinel1.conf
    redis-sentinel /usr/local/redis-3.0.7/sentinel2.conf
    redis-sentinel /usr/local/redis-3.0.7/sentinel3.conf

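2. Run the failure test

Shut down the master on port 7003 in the cluster; its slave is on port 7000.

After about 30 seconds, 7000 is promoted to master.

At this point, even if 7003 is started manually, it cannot rejoin the cluster.

Solutions:

Option 1: Shut down all nodes in the cluster, or wait until every node has gone down, then restart the services.

Option 2: (requires a reset; this wipes the existing data, so proceed with caution)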

> redis-trib.rb check ip:port
> 
> redis-trib.rb fix ip:port
> 
> Run flushdb

Log in to each node with redis-cli and run flushall and cluster reset.

Then run the cluster creation script again:

> ./redis-trib.rb create --replicas 1 192.168.\*.\*:7001 192.168.\*.\*:7002 192.168.\*.\*:7003 192.168.\*.\*:7004 192.168.\*.\*:7005  192.168.\*.\*:7006