| <!DOCTYPE html> |
| <html> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <title>Pegasus | Rolling Update</title> |
| <link rel="stylesheet" href="/zh/assets/css/app.css"> |
| <link rel="shortcut icon" href="/zh/assets/images/favicon.ico"> |
| <link rel="stylesheet" href="/zh/assets/css/utilities.min.css"> |
| <link rel="stylesheet" href="/zh/assets/css/docsearch.v3.css"> |
| <script src="/assets/js/jquery.min.js"></script> |
| <script src="/assets/js/all.min.js"></script> |
| <script src="/assets/js/docsearch.v3.js"></script> |
| <!-- Begin Jekyll SEO tag v2.8.0 --> |
| <title>Rolling Update | Pegasus</title> |
| <meta name="generator" content="Jekyll v4.3.3" /> |
| <meta property="og:title" content="Rolling Update" /> |
| <meta property="og:locale" content="en_US" /> |
| <meta name="description" content="Design goals" /> |
| <meta property="og:description" content="Design goals" /> |
| <meta property="og:site_name" content="Pegasus" /> |
| <meta property="og:type" content="article" /> |
| <meta property="article:published_time" content="2024-04-22T13:02:52+00:00" /> |
| <meta name="twitter:card" content="summary" /> |
| <meta property="twitter:title" content="Rolling Update" /> |
| <script type="application/ld+json"> |
| {"@context":"https://schema.org","@type":"BlogPosting","dateModified":"2024-04-22T13:02:52+00:00","datePublished":"2024-04-22T13:02:52+00:00","description":"Design goals","headline":"Rolling Update","mainEntityOfPage":{"@type":"WebPage","@id":"/administration/rolling-update"},"url":"/administration/rolling-update"}</script> |
| <!-- End Jekyll SEO tag --> |
| </head> |
| |
| |
| <body> |
| <div class="dashboard is-full-height"> |
| <!-- left panel --> |
| <div class="dashboard-panel is-medium is-hidden-mobile pl-0"> |
| <div class="dashboard-panel-header has-text-centered"> |
| <a href="/zh/"> |
| <img src="/assets/images/pegasus-logo-inv.png" style="width: 80%;"> |
| </a> |
| |
| </div> |
| <div class="dashboard-panel-main is-scrollable pl-6"> |
| |
| |
| <aside class="menu"> |
| |
| <p class="menu-label">Pegasus Product Documentation</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/docs/downloads" |
| class=""> |
| Downloads |
| </a> |
| </li> |
| |
| </ul> |
| |
| <p class="menu-label">Build</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/docs/build/compile-by-docker" |
| class=""> |
| Compile with Docker (recommended) |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/docs/build/compile-from-source" |
| class=""> |
| Compile from Source |
| </a> |
| </li> |
| |
| </ul> |
| |
| <p class="menu-label">Client Libraries</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/clients/java-client" |
| class=""> |
| Java Client |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/clients/cpp-client" |
| class=""> |
| C++ Client |
| </a> |
| </li> |
| |
| <li> |
| <a href="https://github.com/apache/incubator-pegasus/tree/master/go-client" |
| class=""> |
| Golang Client |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/clients/python-client" |
| class=""> |
| Python Client |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/clients/node-client" |
| class=""> |
| NodeJS Client |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/clients/scala-client" |
| class=""> |
| Scala Client |
| </a> |
| </li> |
| |
| </ul> |
| |
| <p class="menu-label">Ecosystem Tools</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/docs/tools/shell" |
| class=""> |
| Pegasus Shell Tools |
| </a> |
| </li> |
| |
| <li> |
| <a href="https://github.com/pegasus-kv/admin-cli" |
| class=""> |
| Cluster Admin CLI |
| </a> |
| </li> |
| |
| <li> |
| <a href="https://github.com/pegasus-kv/pegic" |
| class=""> |
| Data Access CLI |
| </a> |
| </li> |
| |
| </ul> |
| |
| <p class="menu-label">User APIs</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/api/ttl" |
| class=""> |
| TTL |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/api/single-atomic" |
| class=""> |
| Single-Row Atomic Operations |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/api/redis" |
| class=""> |
| Redis Adaptation |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/api/geo" |
| class=""> |
| GEO Support |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/api/http" |
| class=""> |
| HTTP API |
| </a> |
| </li> |
| |
| </ul> |
| |
| <p class="menu-label">Administration</p> |
| <ul class="menu-list"> |
| |
| <li> |
| <a href="/zh/administration/deployment" |
| class=""> |
| Deployment |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/config" |
| class=""> |
| Configuration |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/rebalance" |
| class=""> |
| Rebalance |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/monitoring" |
| class=""> |
| Monitoring |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/rolling-update" |
| class="is-active"> |
| Rolling Update |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/scale-in-out" |
| class=""> |
| Scale-in and Scale-out |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/resource-management" |
| class=""> |
| Resource Management |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/cold-backup" |
| class=""> |
| Cold Backup |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/meta-recovery" |
| class=""> |
| Meta Recovery |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/replica-recovery" |
| class=""> |
| Replica Recovery |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/zk-migration" |
| class=""> |
| Zookeeper Migration |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/table-migration" |
| class=""> |
| Table Migration |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/table-soft-delete" |
| class=""> |
| Table Soft-Delete |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/table-env" |
| class=""> |
| Table Environment Variables |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/remote-commands" |
| class=""> |
| Remote Commands |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/partition-split" |
| class=""> |
| Partition-Split |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/duplication" |
| class=""> |
| Cross-Datacenter Duplication |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/compression" |
| class=""> |
| Data Compression |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/throttling" |
| class=""> |
| Throttling |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/experiences" |
| class=""> |
| Operational Experience |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/manual-compact" |
| class=""> |
| Manual Compact |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/usage-scenario" |
| class=""> |
| Usage Scenario |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/bad-disk" |
| class=""> |
| Bad Disk Repair |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/whitelist" |
| class=""> |
| Replica Server Whitelist |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/backup-request" |
| class=""> |
| Backup Request |
| </a> |
| </li> |
| |
| <li> |
| <a href="/zh/administration/hotspot-detection" |
| class=""> |
| Hotspot Detection |
| </a> |
| </li> |
| |
| </ul> |
| |
| </aside> |
| </div> |
| </div> |
| |
| <!-- main section --> |
| <div class="dashboard-main is-scrollable"> |
| <nav class="navbar is-hidden-desktop"> |
| <div class="navbar-brand"> |
| <a href="/zh/" class="navbar-item"> |
| <!-- Pegasus Icon --> |
| <img src="/assets/images/pegasus-square.png"> |
| </a> |
| <div class="navbar-item"> |
| |
| |
| <!--A simple language switch button that only supports zh and en.--> |
| <!--If the language is zh, the button switches to en.--> |
| |
| <!--If you don't want a url to be relativized, you can explicitly add a space to the href to |
| prevent the url from being relativized by polyglot.--> |
| <a class="button is-light is-outlined is-inverted" href=" /administration/rolling-update"><strong>En</strong></a> |
| |
| </div> |
| <a role="button" class="navbar-burger burger" aria-label="menu" aria-expanded="false" data-target="navMenu"> |
| <!-- Appears in mobile mode only --> |
| <span aria-hidden="true"></span> |
| <span aria-hidden="true"></span> |
| <span aria-hidden="true"></span> |
| </a> |
| </div> |
| <div class="navbar-menu" id="navMenu"> |
| <div class="navbar-end"> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| Pegasus Product Documentation |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/docs/downloads" |
| class="navbar-item "> |
| Downloads |
| </a> |
| |
| </div> |
| </div> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| Build |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/docs/build/compile-by-docker" |
| class="navbar-item "> |
| Compile with Docker (recommended) |
| </a> |
| |
| <a href="/zh/docs/build/compile-from-source" |
| class="navbar-item "> |
| Compile from Source |
| </a> |
| |
| </div> |
| </div> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| Client Libraries |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/clients/java-client" |
| class="navbar-item "> |
| Java Client |
| </a> |
| |
| <a href="/zh/clients/cpp-client" |
| class="navbar-item "> |
| C++ Client |
| </a> |
| |
| <a href="https://github.com/apache/incubator-pegasus/tree/master/go-client" |
| class="navbar-item "> |
| Golang Client |
| </a> |
| |
| <a href="/zh/clients/python-client" |
| class="navbar-item "> |
| Python Client |
| </a> |
| |
| <a href="/zh/clients/node-client" |
| class="navbar-item "> |
| NodeJS Client |
| </a> |
| |
| <a href="/zh/clients/scala-client" |
| class="navbar-item "> |
| Scala Client |
| </a> |
| |
| </div> |
| </div> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| Ecosystem Tools |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/docs/tools/shell" |
| class="navbar-item "> |
| Pegasus Shell Tools |
| </a> |
| |
| <a href="https://github.com/pegasus-kv/admin-cli" |
| class="navbar-item "> |
| Cluster Admin CLI |
| </a> |
| |
| <a href="https://github.com/pegasus-kv/pegic" |
| class="navbar-item "> |
| Data Access CLI |
| </a> |
| |
| </div> |
| </div> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| User APIs |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/api/ttl" |
| class="navbar-item "> |
| TTL |
| </a> |
| |
| <a href="/zh/api/single-atomic" |
| class="navbar-item "> |
| Single-Row Atomic Operations |
| </a> |
| |
| <a href="/zh/api/redis" |
| class="navbar-item "> |
| Redis Adaptation |
| </a> |
| |
| <a href="/zh/api/geo" |
| class="navbar-item "> |
| GEO Support |
| </a> |
| |
| <a href="/zh/api/http" |
| class="navbar-item "> |
| HTTP API |
| </a> |
| |
| </div> |
| </div> |
| |
| <!--dropdown--> |
| <div class="navbar-item has-dropdown is-hoverable"> |
| <a href="" |
| class="navbar-link "> |
| <span> |
| Administration |
| </span> |
| </a> |
| <div class="navbar-dropdown"> |
| |
| <a href="/zh/administration/deployment" |
| class="navbar-item "> |
| Deployment |
| </a> |
| |
| <a href="/zh/administration/config" |
| class="navbar-item "> |
| Configuration |
| </a> |
| |
| <a href="/zh/administration/rebalance" |
| class="navbar-item "> |
| Rebalance |
| </a> |
| |
| <a href="/zh/administration/monitoring" |
| class="navbar-item "> |
| Monitoring |
| </a> |
| |
| <a href="/zh/administration/rolling-update" |
| class="navbar-item is-active"> |
| Rolling Update |
| </a> |
| |
| <a href="/zh/administration/scale-in-out" |
| class="navbar-item "> |
| Scale-in and Scale-out |
| </a> |
| |
| <a href="/zh/administration/resource-management" |
| class="navbar-item "> |
| Resource Management |
| </a> |
| |
| <a href="/zh/administration/cold-backup" |
| class="navbar-item "> |
| Cold Backup |
| </a> |
| |
| <a href="/zh/administration/meta-recovery" |
| class="navbar-item "> |
| Meta Recovery |
| </a> |
| |
| <a href="/zh/administration/replica-recovery" |
| class="navbar-item "> |
| Replica Recovery |
| </a> |
| |
| <a href="/zh/administration/zk-migration" |
| class="navbar-item "> |
| Zookeeper Migration |
| </a> |
| |
| <a href="/zh/administration/table-migration" |
| class="navbar-item "> |
| Table Migration |
| </a> |
| |
| <a href="/zh/administration/table-soft-delete" |
| class="navbar-item "> |
| Table Soft-Delete |
| </a> |
| |
| <a href="/zh/administration/table-env" |
| class="navbar-item "> |
| Table Environment Variables |
| </a> |
| |
| <a href="/zh/administration/remote-commands" |
| class="navbar-item "> |
| Remote Commands |
| </a> |
| |
| <a href="/zh/administration/partition-split" |
| class="navbar-item "> |
| Partition-Split |
| </a> |
| |
| <a href="/zh/administration/duplication" |
| class="navbar-item "> |
| Cross-Datacenter Duplication |
| </a> |
| |
| <a href="/zh/administration/compression" |
| class="navbar-item "> |
| Data Compression |
| </a> |
| |
| <a href="/zh/administration/throttling" |
| class="navbar-item "> |
| Throttling |
| </a> |
| |
| <a href="/zh/administration/experiences" |
| class="navbar-item "> |
| Operational Experience |
| </a> |
| |
| <a href="/zh/administration/manual-compact" |
| class="navbar-item "> |
| Manual Compact |
| </a> |
| |
| <a href="/zh/administration/usage-scenario" |
| class="navbar-item "> |
| Usage Scenario |
| </a> |
| |
| <a href="/zh/administration/bad-disk" |
| class="navbar-item "> |
| Bad Disk Repair |
| </a> |
| |
| <a href="/zh/administration/whitelist" |
| class="navbar-item "> |
| Replica Server Whitelist |
| </a> |
| |
| <a href="/zh/administration/backup-request" |
| class="navbar-item "> |
| Backup Request |
| </a> |
| |
| <a href="/zh/administration/hotspot-detection" |
| class="navbar-item "> |
| Hotspot Detection |
| </a> |
| |
| </div> |
| </div> |
| |
| </div> |
| </div> |
| </nav> |
| |
| <nav class="navbar is-hidden-mobile"> |
| <div class="navbar-start w-full"> |
| <div class="navbar-item pl-0 w-full"> |
| <!--TODO(wutao): Given the limitation of docsearch that couldn't handle multiple input, |
| I make searchbox only shown in desktop. Fix this issue when docsearch.js v3 released. |
| Related issue: https://github.com/algolia/docsearch/issues/230--> |
| <div id="docsearch"></div> |
| </div> |
| </div> |
| <div class="navbar-end"> |
| <div class="navbar-item"> |
| |
| |
| <!--A simple language switch button that only supports zh and en.--> |
| <!--If the language is zh, the button switches to en.--> |
| |
| <!--If you don't want a url to be relativized, you can explicitly add a space to the href to |
| prevent the url from being relativized by polyglot.--> |
| <a class="button is-light is-outlined is-inverted" href=" /administration/rolling-update"><strong>En</strong></a> |
| |
| </div> |
| </div> |
| </nav> |
| |
| <section class="hero is-info lg:mr-3"> |
| <div class="hero-body"> |
| |
| <p class="title is-size-2 is-centered">Rolling Update</p> |
| </div> |
| </section> |
| <section class="section" style="padding-top: 2rem;"> |
| <div class="content"> |
| <h1 id="功能目标">Design Goals</h1> |
| |
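| <p>When you upgrade the Pegasus server version or make a persistent configuration change, the cluster must be restarted. For a distributed cluster, the usual method is a rolling restart: servers are restarted one by one while the cluster as a whole keeps serving requests.</p> |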
| |
| <blockquote> |
| <p>This document assumes that every table in the Pegasus cluster has 3 replicas.</p> |
| </blockquote> |
| |
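| <p>The main goal of a cluster restart is to keep the service running and to minimize the impact on availability. During a restart, availability is affected in the following ways:</p> |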
| <ul> |
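| <li>After a Replica Server process is killed, the replicas hosted by that process cannot serve requests: |
| <ul> |
| <li>For a primary replica: the primary serves client reads and writes directly, so killing the process necessarily affects both until the Meta Server assigns a new primary replica. The Meta Server tracks the liveness of each Replica Server through heartbeats, and the Failure Detector latency depends on the configuration parameter <code class="language-plaintext highlighter-rouge">fd_grace_seconds</code>, which defaults to 10 seconds. It can therefore take up to 10 seconds before the Meta Server learns that the Replica Server is down and assigns a new primary replica.</li> |
| <li>For a secondary replica: secondaries do not serve reads, so in theory reads are unaffected. Writes are affected, however, because the PacificA consistency protocol requires a write to succeed on all replicas before it can be committed. After the process is killed, the primary replica notices during a write that the secondary is unreachable and asks the Meta Server to remove it. After a <code class="language-plaintext highlighter-rouge">reconfiguration</code> phase, the partition runs with one primary and one secondary and continues to serve writes. Writes that were still in flight during the switch are re-executed in the <code class="language-plaintext highlighter-rouge">reconciliation</code> phase, but the client may already have timed out, so availability is affected to some degree. The impact is relatively small, because <code class="language-plaintext highlighter-rouge">reconfiguration</code> is fast and usually completes within 1 second.</li> |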
| </ul> |
| </li> |
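| <li>Restarting the Meta Server: the impact on availability is almost negligible. After a client first obtains the serving node of each partition from the Meta Server, it caches that information locally and usually does not need to query the Meta Server again, so a brief disconnection while the Meta Server restarts has little effect on clients. However, the Meta Server must maintain heartbeats with the Replica Servers, so avoid stopping the Meta Server process for a long time, which would cause the Replica Servers to lose contact.</li> |
| <li>Restarting the Collector: this has no impact on availability. However, availability statistics are computed on the Collector, so the metrics data may be slightly affected.</li> |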
| </ul> |
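| <p>The secondary-replica case above can be illustrated with a toy model. This is a minimal Python sketch, not Pegasus code: the <code class="language-plaintext highlighter-rouge">Partition</code> class, its fields, and the node names are all hypothetical, and the only rule it encodes is the one stated above, that a write commits only when every replica in the current group acknowledges it.</p> |

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition's replica group (hypothetical, not Pegasus code)."""
    primary: str
    secondaries: list = field(default_factory=list)
    alive: set = field(default_factory=set)  # nodes whose process is running

    def try_write(self) -> bool:
        """A write commits only if the primary and every secondary ack it."""
        members = [self.primary] + self.secondaries
        return all(node in self.alive for node in members)

    def reconfigure(self) -> None:
        """Drop unreachable secondaries from the group (the reconfiguration step)."""
        self.secondaries = [n for n in self.secondaries if n in self.alive]


p = Partition(primary="node1", secondaries=["node2", "node3"],
              alive={"node1", "node2", "node3"})
before_kill = p.try_write()      # all three replicas ack

p.alive.discard("node3")         # kill the process hosting one secondary
during_switch = p.try_write()    # node3 cannot ack, so the write cannot commit

p.reconfigure()                  # group becomes one primary + one secondary
after_reconfig = p.try_write()   # writes commit again with two replicas

print(before_kill, during_switch, after_reconfig)
```

| <p>The window in which <code class="language-plaintext highlighter-rouge">try_write()</code> fails corresponds to the short period before reconfiguration completes. The model ignores timeouts and the reconciliation of in-flight writes.</p> |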
| |
| <p>The following measures therefore help preserve availability during a cluster restart:</p> |
| <ul> |
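| <li>Restart only one process at a time, and restart the next one only after the previous process has restarted and fully resumed service. The reasons are: |
| <ul> |
| <li>If the cluster has not returned to a fully healthy state after one restart, some partitions still have only one primary and one secondary. Killing another Replica Server process at that point is likely to leave those partitions with only a primary, which cannot serve writes.</li> |
| <li>Waiting until every partition is back to three replicas before restarting the next process also reduces the risk of data loss.</li> |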
| </ul> |
| </li> |
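| <li>Prefer migrating replicas proactively over letting them migrate passively, so that Failure Detector latency does not affect availability. The reason is: |
| <ul> |
| <li>Passive migration has to wait for the Failure Detector to notice that the node is unreachable. Proactive migration moves all primary replicas served by the process to other nodes before the Replica Server is killed. This <code class="language-plaintext highlighter-rouge">reconfiguration</code> is fast and generally completes within 1 second.</li> |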
| </ul> |
| </li> |
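| <li>Where possible, manually downgrade the secondary replicas served by the process before killing the Replica Server. The reason is: |
| <ul> |
| <li>This turns <code class="language-plaintext highlighter-rouge">reconfiguration</code> from something triggered passively by a failed write into something triggered proactively, which further reduces the impact on availability.</li> |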
| </ul> |
| </li> |
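| <li>Minimize the recovery work a process has to do on restart, to shorten its restart time. |
| <ul> |
| <li>A Replica Server replays the WAL log on restart to recover data. If the process is killed directly, the amount of data to replay can be large. If a memtable flush is triggered before the kill so that in-memory data is persisted to disk, much less data has to be replayed on restart. The process then restarts much faster, and the total time needed to restart the whole cluster drops accordingly.</li> |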
| </ul> |
| </li> |
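| <li>Avoid unnecessary data copying between nodes, because the extra CPU, network IO, and disk IO load affects availability. |
| <ul> |
| <li>After a Replica Server goes down, some partitions are left with one primary and one secondary. If the Meta Server immediately added replicas on other Replica Servers, it would trigger a large amount of cross-node data copying, increase CPU, network IO, and disk IO load, and hurt cluster stability. Pegasus addresses this by allowing the one-primary-one-secondary state to persist for a period of time, which gives the restarting Replica Server a maintenance window. Replicas are added on other Replica Servers only if the node does not recover for a long time. This balances data integrity against cluster stability. The wait time is controlled by the configuration parameter <code class="language-plaintext highlighter-rouge">replica_assign_delay_ms_for_dropouts</code>, which defaults to 5 minutes.</li> |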
| </ul> |
| </li> |
| </ul> |
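| <p>The one-process-at-a-time rule can be checked on a small example. The following Python sketch is an illustration under stated assumptions: the placement map and node names are made up, and the threshold of two live replicas for writes is read off the text above (one primary plus one secondary still serves writes, a lone primary does not).</p> |

```python
REPLICA_COUNT = 3            # replica count assumed throughout this document
MIN_REPLICAS_FOR_WRITE = 2   # one primary + one secondary can still serve writes


def alive_replicas(placement, down_nodes):
    """Return {partition id: number of replicas on nodes that are still up}."""
    return {pid: sum(1 for node in nodes if node not in down_nodes)
            for pid, nodes in placement.items()}


def unwritable_partitions(placement, down_nodes):
    """Partitions left with fewer than two live replicas."""
    counts = alive_replicas(placement, down_nodes)
    return sorted(pid for pid, n in counts.items() if n < MIN_REPLICAS_FOR_WRITE)


def safe_to_restart_next(placement, down_nodes):
    """Proceed only when every partition is back to three live replicas."""
    counts = alive_replicas(placement, down_nodes)
    return all(n == REPLICA_COUNT for n in counts.values())


# Hypothetical placement: 4 partitions on 4 nodes, 3 replicas each.
placement = {
    0: ["a", "b", "c"],
    1: ["b", "c", "d"],
    2: ["a", "c", "d"],
    3: ["a", "b", "d"],
}

one_down = unwritable_partitions(placement, {"a"})        # one node restarting
two_down = unwritable_partitions(placement, {"a", "b"})   # two nodes at once

print(one_down)
print(two_down)
print(safe_to_restart_next(placement, {"a"}), safe_to_restart_next(placement, set()))
```

| <p>With one node down, every partition keeps at least two replicas. With two nodes down at the same time, partitions 0 and 3 drop to a single replica in this placement, which is the situation the first rule is meant to prevent.</p> |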
| |
| <h1 id="重启流程">Restart Procedure</h1> |
| |
| <h2 id="高可用重启">High-Availability Restart</h2> |
| |
| <ul> |
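| <li>If this is an upgrade, prepare the new server package and configuration files first.</li> |
| <li>Use the shell tool to set the cluster's meta level to <code class="language-plaintext highlighter-rouge">steady</code>, which turns off <a href="rebalance">load balancing</a> and avoids unnecessary replica migration: |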
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> set_meta_level steady |
| </code></pre></div> </div> |
| </li> |
| <li>Use the shell tool to set the maintenance window for a single Replica Server: |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> remote_command -t meta-server meta.lb.assign_delay_ms $value |
| </code></pre></div> </div> |
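| <p>Here <code class="language-plaintext highlighter-rouge">value</code> is the delay, in milliseconds, between the Meta Server detecting that a Replica Server is unreachable and it starting to add replicas on other nodes. For example, set it to <code class="language-plaintext highlighter-rouge">3600000</code>.</p> |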
| </li> |
| <li>Restart the Replica Server processes one at a time. To restart a single Replica Server: |
| <ul> |
| <li>Use the shell tool to send a <a href="remote-commands#meta-server">remote command</a> to the Meta Server that temporarily disables <code class="language-plaintext highlighter-rouge">add_secondary</code> operations: |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 0 |
| </code></pre></div> </div> |
| </li> |
| <li>Use the migrate_node command to move all primary replicas on the Replica Server to other nodes: |
| <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./run.sh migrate_node <span class="nt">-c</span> <span class="nv">$meta_list</span> <span class="nt">-n</span> <span class="nv">$node</span> <span class="nt">-t</span> run |
| </code></pre></div> </div> |
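| <p>Use the shell tool's <code class="language-plaintext highlighter-rouge">nodes -d</code> command to check the replicas served by the node, and wait until the number of <strong>primary</strong> replicas drops to 0. If it stays above 0 for a long time, run the migrate_node command again.</p> |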
| </li> |
| <li>Use the downgrade_node command to downgrade all secondary replicas on the Replica Server to <code class="language-plaintext highlighter-rouge">INACTIVE</code>: |
| <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./run.sh downgrade_node <span class="nt">-c</span> <span class="nv">$meta_list</span> <span class="nt">-n</span> <span class="nv">$node</span> <span class="nt">-t</span> run |
| </code></pre></div> </div> |
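| <p>Use the shell tool's <code class="language-plaintext highlighter-rouge">nodes -d</code> command to check the replicas served by the node, and wait until the number of <strong>secondary</strong> replicas drops to 0. If it stays above 0 for a long time, run the downgrade_node command again.</p> |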
| </li> |
| <li>Use the shell tool to send a remote command to the Replica Server that closes all replicas, which triggers a flush: |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> remote_command -l $node replica.kill_partition |
| </code></pre></div> </div> |
| <p>Wait about 1 minute for the data to finish flushing to disk.</p> |
| </li> |
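| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Replica Server process.</li> |
| <li>Use the shell tool to send a <a href="remote-commands#meta-server">remote command</a> to the Meta Server that re-enables <code class="language-plaintext highlighter-rouge">add_secondary</code> operations so that replicas are replenished quickly: |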
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 100 |
| </code></pre></div> </div> |
| </li> |
| <li>Use the shell tool's <code class="language-plaintext highlighter-rouge">ls -d</code> command to check the cluster state, and wait until every partition is fully healthy again.</li> |
| <li>Move on to the next Replica Server.</li> |
| </ul> |
| </li> |
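| <li>Restart the Meta Server processes one at a time. To restart a single Meta Server: |
| <ul> |
| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Meta Server process.</li> |
| <li>Wait at least 30 seconds to keep the heartbeats between the Meta Server and the Replica Servers continuous.</li> |
| <li>Move on to the next Meta Server.</li> |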
| </ul> |
| </li> |
| <li>Restart the Collector process: |
| <ul> |
| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Collector process.</li> |
| </ul> |
| </li> |
| <li>Reset the parameters: |
| <ul> |
| <li>Use the shell tool to reset the parameters changed in the steps above: |
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node DEFAULT |
| >>> remote_command -t meta-server meta.lb.assign_delay_ms DEFAULT |
| </code></pre></div> </div> |
| </li> |
| </ul> |
| </li> |
| </ul> |
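| <p>The per-node steps above can be written down as an ordered checklist. The following Python sketch only builds the list of steps and executes nothing. The command strings are copied from the steps above. The function name <code class="language-plaintext highlighter-rouge">replica_restart_plan</code>, the prefix convention, and the example addresses are assumptions made for illustration.</p> |

```python
def replica_restart_plan(meta_list: str, node: str, upgrade: bool = False) -> list:
    """Return the ordered steps for restarting one Replica Server.

    Prefix convention (an assumption of this sketch):
      '>>>' Pegasus shell command, '$' bash command, '#' manual action.
    Nothing is executed here; the caller decides how to run each step.
    """
    steps = [
        ">>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 0",
        f"$ ./run.sh migrate_node -c {meta_list} -n {node} -t run",
        "# wait until `nodes -d` shows 0 primary replicas on the node",
        f"$ ./run.sh downgrade_node -c {meta_list} -n {node} -t run",
        "# wait until `nodes -d` shows 0 secondary replicas on the node",
        f">>> remote_command -l {node} replica.kill_partition",
        "# wait about 1 minute for the flush to finish",
    ]
    if upgrade:
        steps.append("# replace the server package and configuration files")
    steps += [
        "# restart the Replica Server process",
        ">>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 100",
        "# wait until `ls -d` shows every partition healthy",
    ]
    return steps


# Hypothetical addresses, for illustration only.
plan = replica_restart_plan("127.0.0.1:34601,127.0.0.1:34602",
                            "127.0.0.1:34801", upgrade=True)
for step in plan:
    print(step)
```

| <p>Keeping the plan as data makes it easy to review or log before anything touches the cluster. Setting the meta level, setting the assign delay, and resetting the parameters happen once per cluster rather than once per node, so they are left out of this per-node plan.</p> |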
| |
| <h2 id="简化版重启">Simplified Restart</h2> |
| |
| <p>If availability requirements are not strict, the restart procedure can be simplified as follows:</p> |
| <ul> |
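| <li>If this is an upgrade, prepare the new server package and configuration files first.</li> |
| <li>Use the shell tool to set the cluster's meta level to <code class="language-plaintext highlighter-rouge">steady</code>, which turns off <a href="rebalance">load balancing</a> and avoids unnecessary replica migration: |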
| <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> set_meta_level steady |
| </code></pre></div> </div> |
| </li> |
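| <li>Restart the Replica Server processes one at a time. To restart a single Replica Server: |
| <ul> |
| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Replica Server process.</li> |
| <li>Use the shell tool's <code class="language-plaintext highlighter-rouge">ls -d</code> command to check the cluster state, and wait until every partition is fully healthy again.</li> |
| <li>Move on to the next Replica Server.</li> |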
| </ul> |
| </li> |
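| <li>Restart the Meta Server processes one at a time. To restart a single Meta Server: |
| <ul> |
| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Meta Server process.</li> |
| <li>Wait at least 30 seconds to keep the heartbeats between the Meta Server and the Replica Servers continuous.</li> |
| <li>Move on to the next Meta Server.</li> |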
| </ul> |
| </li> |
| <li>Restart the Collector process: |
| <ul> |
| <li>If this is an upgrade, replace the package and configuration files.</li> |
| <li>Restart the Collector process.</li> |
| </ul> |
| </li> |
| </ul> |
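| <p>Both procedures include steps that wait for a condition, such as every partition becoming healthy again. A generic polling helper could look like the Python sketch below. How to derive the health condition from the <code class="language-plaintext highlighter-rouge">ls -d</code> output is not covered here, so the check is passed in as a function, and <code class="language-plaintext highlighter-rouge">fake_cluster_healthy</code> is a stand-in used only for the demo.</p> |

```python
import time


def wait_until(check, timeout_s, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s seconds have passed.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a stand-in check that reports healthy on the third poll.
polls = {"count": 0}


def fake_cluster_healthy():
    polls["count"] += 1
    return polls["count"] >= 3


ok = wait_until(fake_cluster_healthy, timeout_s=60, interval_s=0.01)
timed_out = wait_until(lambda: False, timeout_s=0, interval_s=0)
print(ok, polls["count"], timed_out)
```

| <p>Returning <code class="language-plaintext highlighter-rouge">False</code> on timeout lets the caller stop the rolling restart instead of moving on to the next node while the cluster is still degraded.</p> |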
| |
| <h1 id="重启脚本">Restart Script</h1> |
| |
| <p>For reference, see the script based on <a href="https://github.com/XiaoMi/minos">Minos</a> and the <a href="#高可用重启">high-availability restart</a> procedure: <a href="https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_rolling_update.sh">scripts/pegasus_rolling_update.sh</a>.</p> |
| |
| </div> |
| </section> |
| <footer class="footer"> |
| <div class="container"> |
| <div class="content is-small has-text-centered"> |
| <div style="margin-bottom: 20px;"> |
| <a href="http://incubator.apache.org"> |
| <img src="/assets/images/egg-logo.png" |
| width="15%" |
| alt="Apache Incubator"/> |
| </a> |
| </div> |
| Copyright © 2023 <a href="http://www.apache.org">The Apache Software Foundation</a>. |
| Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version |
| 2.0</a>. |
| <br><br> |
| |
| Apache Pegasus is an effort undergoing incubation at The Apache Software Foundation (ASF), |
| sponsored by the Apache Incubator. Incubation is required of all newly accepted projects |
| until a further review indicates that the infrastructure, communications, and decision making process |
| have stabilized in a manner consistent with other successful ASF projects. While incubation status is |
| not necessarily a reflection of the completeness or stability of the code, it does indicate that the |
| project has yet to be fully endorsed by the ASF. |
| |
| <br><br> |
| Apache Pegasus, Pegasus, Apache, the Apache feather logo, and the Apache Pegasus project logo are either |
| registered trademarks or trademarks of The Apache Software Foundation in the United States and other |
| countries. |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <!-- right panel --> |
| <div class="dashboard-panel is-small is-scrollable is-hidden-mobile"> |
| <p class="menu-label"> |
| <span class="icon"> |
| <i class="fa fa-bars" aria-hidden="true"></i> |
| </span> |
| On This Page |
| </p> |
| <ul class="menu-list"> |
| <li><a href="#功能目标">Design Goals</a></li> |
| <li><a href="#重启流程">Restart Procedure</a> |
| <ul> |
| <li><a href="#高可用重启">High-Availability Restart</a></li> |
| <li><a href="#简化版重启">Simplified Restart</a></li> |
| </ul> |
| </li> |
| <li><a href="#重启脚本">Restart Script</a></li> |
| </ul> |
| |
| </div> |
| </div> |
| |
| <script src="/assets/js/app.js" type="text/javascript"></script> |
| <script> |
| docsearch({ |
| container: '#docsearch', |
| appId: 'QRN30RBW0S', |
| indexName: 'pegasus-apache', |
| apiKey: 'd3a3252fa344359766707a106c4ed88f', |
| debug: true |
| }); |
| </script> |
| |
| </body> |
| |
| </html> |