
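Elasticsearch Production Best Practices

Version: v2.0
Target ES versions: 7.17.x / 8.x (7.17 is the final 7.x minor; use the latest patch release)
Date written: May 2026
Use cases: log analytics, full-text search, APM storage, vector search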


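Contents

Part I: Infrastructure

  1. Hardware Selection and Capacity Planning
  2. Cluster Topology Design
  3. Operating System Tuning

Part II: Deployment and Configuration

  4. Elasticsearch Core Configuration
  5. JVM Settings Best Practices
  6. Index Templates and Lifecycle Management

Part III: Performance Tuning

  7. Indexing Performance Tuning
  8. Query Performance Tuning
  9. Shard and Replica Strategy
  10. Cache Configuration

Part IV: Operations

  11. Cluster Monitoring and Alerting
  12. Backup and Restore
  13. Rolling Upgrades and Scaling Out
  14. Troubleshooting Handbook

Part V: Security and Compliance

  15. Security Configuration and Access Control
  16. Data Encryption and Compliance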

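Part I: Infrastructure


1. Hardware Selection and Capacity Planning

1.1 Hardware selection criteria

Component | Requirement | Notes
CPU | ≥ 16 physical cores, clock ≥ 2.5 GHz | ES mixes CPU-bound work (queries) with I/O-bound work (indexing)
Memory | 32-64 GB physical RAM | Keep the JVM heap ≤ 31 GB so compressed object pointers stay enabled
Disk | SSD or NVMe SSD; avoid HDD | Random read/write performance is critical; NVMe outperforms SATA SSD
Network | 10 GbE (10 Gbps) | Inter-node replication and query aggregation consume significant bandwidth
RAID | RAID 0 (when replicas exist) or RAID 10 | With ≥ 2 replicas RAID 0 gives maximum performance; a single disk failure then costs the whole node's data, which the replicas must cover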

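1.2 Hardware ratios by scenario

Scenario | CPU : memory : disk ratio | Typical configuration
Log analytics (write-heavy) | 1:2:20 | 16 cores / 32 GB / 6 TB NVMe
Full-text search (query-heavy) | 2:1:10 | 32 cores / 64 GB / 1 TB NVMe
APM monitoring (mixed) | 1:1.5:15 | 16 cores / 24 GB / 4 TB SSD
Vector search (compute-heavy) | 3:2:5 | 48 cores / 32 GB / 500 GB NVMe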

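1.3 Capacity planning method

[Total storage]
Storage requirement = raw data volume × (1 + replica count) × 1.5 (index expansion factor)

Example:
Daily ingest: 500 GB
Retention: 7 days
Replica count: 1
Storage requirement = 500 GB × 7 × (1 + 1) × 1.5 = 10,500 GB = 10.5 TB

[Node count]
Recommended load per node: 2-5 TB (hot data)
Node count = total storage ÷ per-node capacity = 10.5 TB ÷ 4 TB ≈ 2.6, rounded up to 3 nodes

[Memory]
JVM heap = min(physical RAM × 0.5, 31 GB)
Example: 64 GB physical RAM → 31 GB heap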
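
The storage and node-count arithmetic above can be sketched as a small helper. This is only an illustration of the formulas in this section: the function name `plan_storage` and its parameter names are made up for the example, and it assumes decimal units (1 TB = 1000 GB).

```python
import math

def plan_storage(daily_gb, retention_days, replicas,
                 expansion=1.5, node_capacity_tb=4.0):
    """Apply the sizing formulas from section 1.3 (assumes 1 TB = 1000 GB)."""
    # raw data x (1 + replicas) x index expansion factor
    storage_tb = daily_gb * retention_days * (1 + replicas) * expansion / 1000
    # round up: a fraction of a node still needs a whole node
    data_nodes = math.ceil(storage_tb / node_capacity_tb)
    return storage_tb, data_nodes

storage_tb, data_nodes = plan_storage(daily_gb=500, retention_days=7, replicas=1)
print(f"storage: {storage_tb} TB, data nodes: {data_nodes}")
```

Changing `node_capacity_tb` within the 2-5 TB range shows how sensitive the node count is to the per-node load you are willing to accept.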

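⚠️ Key principles

  • Never size the JVM heap past the point where compressed object pointers are disabled (just under 32 GB)
  • 31 GB is the recommended setting because it leaves a safety margin
  • Leave the remaining memory to the operating system page cache, which ES relies on heavily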

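2. Cluster Topology Design

2.1 Node roles

Role | Responsibilities | Dedicated? | Sizing guidance
Master | Cluster coordination, metadata management | ✅ Dedicated (3 nodes) | Low spec: 4 cores / 8 GB is enough; no large disks needed
Data | Data storage, indexing, search | ✅ Dedicated (scale as needed) | High spec: 16 cores / 64 GB, NVMe
Coordinating | Request routing, result merging | ⚠️ Dedicated in large clusters | Medium spec: 8 cores / 32 GB
Ingest | Pre-processing, pipeline execution | ⚠️ Dedicated for write-heavy workloads | Medium spec: 8 cores / 16 GB
ML node | Machine learning, anomaly detection | ❌ Optional | High spec: size for CPU cores and RAM (self-managed ES ML jobs run on CPU)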

2.2 Typical cluster topologies

Option A: small cluster (3 nodes, < 10 TB of data)

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Node 1     │  │  Node 2     │  │  Node 3     │
│  Master+Data│  │  Master+Data│  │  Master+Data│
│  16C/64GB   │  │  16C/64GB   │  │  16C/64GB   │
│  4TB NVMe   │  │  4TB NVMe   │  │  4TB NVMe   │
└─────────────┘  └─────────────┘  └─────────────┘
       │                │                │
       └────────────────┴────────────────┘

                ┌──────────────┐
                │  Kibana/APM  │
                └──────────────┘

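Best for: development, testing, and small production environments (logs/monitoring)
Pros: simple architecture, low cost
Caveat: master-eligible nodes also serve reads and writes, so they carry a heavier load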

Option B: medium cluster (5-9 nodes, 10-50 TB of data)

                  ┌──────────────┐
                  │  VIP/SLB     │
                  └──────┬───────┘

        ┌────────────────┼────────────────┐
        │                │                │
   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
   │ Master 1 │     │ Master 2 │     │ Master 3 │
   │  4C8GB   │     │  4C8GB   │     │  4C8GB   │
   │ (mgmt)   │     │ (mgmt)   │     │ (mgmt)   │
   └──────────┘     └──────────┘     └──────────┘
        │                │                │
        └────────────────┼────────────────┘

        ┌────────────────┼────────────────┐
        │                │                │
   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
   │ Data 1  │     │ Data 2  │     │ Data 3  │
   │ 16C64GB │     │ 16C64GB │     │ 16C64GB │
   │ 4TB NVMe│     │ 4TB NVMe│     │ 4TB NVMe│
   └──────────┘     └──────────┘     └──────────┘

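Best for: medium production environments and core business systems
Pros: master and data roles are separated, so the cluster is more stable
Scaling: data nodes can be scaled out horizontally to 6-9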

Option C: large cluster (10+ nodes, > 50 TB of data)

                  ┌──────────────┐
                  │  VIP/SLB     │
                  └──────┬───────┘

              ┌──────────┴──────────┐
              │  Coordinating (x3)  │
              │  routing + merging  │
              └──────────┬──────────┘

        ┌────────────────┼────────────────┐
        │                │                │
   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
   │ Master 1 │     │ Master 2 │     │ Master 3 │
   │ dedicated│     │ dedicated│     │ dedicated│
   └──────────┘     └──────────┘     └──────────┘
        │                │                │
        └────────────────┼────────────────┘

   ┌─────────────────────────────────────────────┐
   │           Data Node Pool (6-100+)            │
   │  Tiers: Hot(SSD) → Warm(HDD) → Frozen       │
   └─────────────────────────────────────────────┘

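2.3 Rules of thumb for cluster sizing

Scale | Node count | Master nodes | Size per shard
Small | 3 | 3 (mixed roles) | 10-30 GB
Medium | 5-9 | 3 (dedicated) | 20-50 GB
Large | 10+ | 3 (dedicated) + coordinating nodes | 30-70 GB

Always run an odd number of master-eligible nodes so that a majority can still be formed after a failure (this is what prevents split brain). Three is the right number: five or seven add no practical benefit.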
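
The reasoning behind "odd, and three is enough" comes down to majority arithmetic. The sketch below is illustrative only (the function names are not ES APIs): it computes how many votes a majority needs and how many master-eligible nodes can fail before that majority is lost.

```python
def quorum(master_eligible):
    """Votes needed for a strict majority of master-eligible nodes."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)

for n in (2, 3, 4, 5):
    print(f"{n} masters: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

A fourth node tolerates no more failures than three do, which is why even counts buy nothing; five tolerates two, which few clusters need.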


3. Operating System Tuning

3.1 Kernel parameters (/etc/sysctl.conf)

bash
# ================ Memory ================
# Minimize swapping (swappiness=1 only reduces swap use; section 3.4 disables swap entirely)
vm.swappiness = 1
# Required: the ES bootstrap check fails below this value
vm.max_map_count = 262144

# ================ Network ================
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 10000
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3

# ================ File system ================
# Raise the system-wide file handle limit
fs.file-max = 655360

# Dirty-page writeback tuning (smooths write bursts)
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
vm.dirty_writeback_centisecs = 3000

# ================ Apply the settings ================
# Run this in a shell after saving the file; it is a command, not a sysctl.conf entry
sysctl -p

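3.2 Resource limits (/etc/security/limits.conf)

bash
# limits.conf applies to login sessions. When ES runs under systemd, set the equivalents
# in a unit override (systemctl edit elasticsearch), for example LimitMEMLOCK=infinity
elasticsearch  soft  nofile  655360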
elasticsearch  hard  nofile  655360
elasticsearch  soft  nproc   655360
elasticsearch  hard  nproc   655360
elasticsearch  soft  memlock unlimited
elasticsearch  hard  memlock unlimited
elasticsearch  soft  as      unlimited
elasticsearch  hard  as      unlimited

3.3 Disk mount tuning

bash
# ================ Disable atime updates ================
# Mount options in /etc/fstab
UUID=xxxx-xxxx  /data  ext4  defaults,noatime,nodiratime  0  0

# ================ I/O scheduler ================
# SSD/NVMe: none (called noop on older kernels); HDD: mq-deadline (deadline on older kernels)
echo none > /sys/block/nvme0n1/queue/scheduler

# ================ Read-ahead ================
blockdev --setra 4096 /dev/nvme0n1

# ================ Apply the new mount options ================
mount -o remount /data
# Verify that noatime is active
mount | grep /data

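3.4 Disabling swap

bash
# Turn swap off for the running system
swapoff -a

# Turn it off permanently (comments out the swap entries in /etc/fstab)
sed -i '/ swap / s/^/#/' /etc/fstab

# Verify
free -h  # the Swap row should show all zeros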

3.5 System service tuning

bash
# Disable unneeded services
# (stop firewalld only if another layer, such as security groups or a network firewall,
#  already restricts access to ports 9200/9300)
systemctl stop firewalld
systemctl disable firewalld
systemctl stop postfix
systemctl disable postfix

# Set up NTP time synchronization
yum install -y ntp
systemctl enable ntpd
systemctl start ntpd

# Disable transparent huge pages (THP can degrade ES performance)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Keep THP disabled across reboots
cat >> /etc/rc.d/rc.local << EOF
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local

Part II: Deployment and Configuration


4. Elasticsearch Core Configuration

4.1 Complete elasticsearch.yml template

yaml
# ======================================
# Elasticsearch production configuration template
# ======================================

# ---------------- Cluster ----------------
cluster.name: my-production-cluster
# Rebalance/recovery concurrency. These values are aggressive (defaults: 2, 2 and 4);
# lower them if recoveries saturate disk or network.
cluster.routing.allocation.cluster_concurrent_rebalance: 30
cluster.routing.allocation.node_concurrent_recoveries: 30
cluster.routing.allocation.node_initial_primaries_recoveries: 30

# ---------------- Node -------------------
node.name: node-01
node.roles: [data, master]  # node.roles is available from 7.9
# Before 7.9: node.master: true, node.data: true

# Node attributes (used for hot/warm/cold tiering and rack awareness, see 9.4)
node.attr.rack: rack1
node.attr.temp: hot

# ---------------- Paths ------------------
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch

# ---------------- Memory -----------------
# Lock the process memory so the heap is never swapped out
bootstrap.memory_lock: true

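# ---------------- Network ----------------
network.host: 0.0.0.0
network.publish_host: 192.168.1.10  # internal IP

# HTTP port
http.port: 9200
http.cors.enabled: true
# "*" lets any web origin call the API; restrict it to known origins in production
http.cors.allow-origin: "*"
http.cors.allow-headers: X-Requested-With,Content-Type,Content-Length,Authorization

# Transport (node-to-node communication)
transport.port: 9300
# transport.compress / transport.connect_timeout replace the older transport.tcp.* names,
# which are deprecated in 7.x and removed in 8.0
transport.compress: true
transport.connect_timeout: 60s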

# ---------------- Discovery --------------
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300

cluster.initial_master_nodes:
  - node-01
  - node-02
  - node-03

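# Election timeouts: discovery.zen.ping_timeout and discovery.zen.join_timeout belong to the
# pre-7.0 Zen discovery and are removed in 8.0, so they are omitted from this template.

# ---------------- Gateway ----------------
# gateway.recover_after_nodes / gateway.expected_nodes are deprecated since 7.7 and removed in 8.0;
# the data-node variants below replace them.
gateway.recover_after_data_nodes: 2
gateway.expected_data_nodes: 3
gateway.recover_after_time: 5m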

# ---------------- X-Pack -----------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

xpack.security.http.ssl.enabled: false  # may be disabled on a trusted internal network; must be enabled when exposed publicly

# Monitoring
xpack.monitoring.collection.enabled: true
xpack.monitoring.history.duration: 7d

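# ---------------- Performance -------------
# Thread pool queues
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000
thread_pool.get.queue_size: 1000

# Circuit breakers (protect against OOM)
indices.breaker.total.limit: 95%
indices.breaker.fielddata.limit: 40%
indices.breaker.request.limit: 40%

# Segment merge settings (index.merge.*) are index-level settings and must not appear in
# elasticsearch.yml: the node refuses to start if they do. Set them in the index template:
#   index.merge.scheduler.max_thread_count: 4
#   index.merge.policy.max_merged_segment: 5gb
#   index.merge.policy.segments_per_tier: 10

# ---------------- Recovery ----------------
indices.recovery.max_bytes_per_sec: 100mb
# indices.recovery.concurrent_streams / concurrent_small_file_streams were removed in 5.0;
# per-recovery parallelism is now set with max_concurrent_file_chunks (default 2)
indices.recovery.max_concurrent_file_chunks: 5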

4.2 Key jvm.options settings

bash
# ======================================
# JVM configuration: 31 GB heap template
# On 7.7+ put overrides in a file under config/jvm.options.d/ instead of editing jvm.options
# ======================================

# Heap size (critical: do not exceed 31 GB)
-Xms31g
-Xmx31g

# GC settings (G1GC, recommended on JDK 11+)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=35
-XX:+ExplicitGCInvokesConcurrent
-XX:+ParallelRefProcEnabled
-XX:+AlwaysPreTouch

# G1GC tuning (JDK 11+)
-XX:+UseStringDeduplication
-XX:MaxTenuringThreshold=7

# GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

# Write a heap dump on OOM
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=data

# JVM fatal error log
-XX:ErrorFile=logs/hs_err_pid%p.log

# JDK module access
## Required by Lucene
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.math=ALL-UNNAMED

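5. JVM Settings Best Practices

5.1 Rules of thumb for memory sizing

Physical RAM | JVM heap (Xms/Xmx) | OS page cache | Notes
16 GB | 8 GB | 8 GB | Minimum production configuration
32 GB | 16 GB | 16 GB | Balanced configuration
64 GB | 31 GB | 33 GB | Recommended (makes full use of compressed oops)
128 GB | 31 GB | 97 GB | Query-heavy workloads; the extra RAM goes to the page cache

💡 Why 31 GB?

  • The JVM uses compressed ordinary object pointers (compressed oops) to shrink object references
  • Below roughly 32 GB of heap, each reference takes 4 bytes instead of 8, halving the memory spent on references
  • The cutoff sits at around 31.8 GB, so a conservative 31 GB keeps compressed oops enabled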
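
The sizing table above follows from one rule: heap = min(half of RAM, 31 GB), with the rest left to the page cache. A minimal sketch of that rule (the name `split_memory` is illustrative, not an ES tool):

```python
def split_memory(ram_gb, heap_cap_gb=31):
    """Return (heap_gb, page_cache_gb) using heap = min(50% of RAM, 31 GB)."""
    heap = min(ram_gb // 2, heap_cap_gb)
    return heap, ram_gb - heap

for ram in (16, 32, 64, 128):
    heap, cache = split_memory(ram)
    print(f"{ram} GB RAM -> heap {heap} GB, page cache {cache} GB")
```

The loop reproduces the four rows of the table, which makes it easy to check a host size that is not listed.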

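5.2 GC tuning strategy

GC algorithm | When to use | Strengths | Drawbacks
G1GC | General purpose, heap > 8 GB | Predictable pauses, handles large heaps well | Known bugs on JDK 8
CMS | Heap < 8 GB, latency-sensitive | Short pauses | Fragmentation; deprecated in JDK 9 and removed in JDK 14
ZGC | JDK 15+, very large heaps | Sub-millisecond pauses | Requires ES 8.x+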

5.3 Example production JVM configuration

bash
# 64 GB physical RAM, JDK 11+
-Xms31g
-Xmx31g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16m
-XX:G1NewSizePercent=15
-XX:G1MaxNewSizePercent=30
-XX:+ParallelRefProcEnabled
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

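5.4 JVM health checks

bash
# Verify that compressed oops are enabled
jcmd <pid> VM.flags | grep -o '[+-]UseCompressedOops'
# Expected when enabled: +UseCompressedOops
# ES also logs this at startup: "compressed ordinary object pointers [true]"

# Watch GC activity
jstat -gcutil <pid> 1000 10

# Inspect heap details (jmap -heap was removed in JDK 9; use jhsdb instead)
jhsdb jmap --heap --pid <pid>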

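6. Index Templates and Lifecycle Management

6.1 Index template best practices

json
# Prerequisites: the IK analysis plugin must be installed on every node (it provides
# ik_smart / ik_max_word), and the component template "common-settings" must already
# exist, otherwise this request is rejected.
PUT _index_template/logs-template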
{
  "index_patterns": ["logs-*"],
  "priority": 100,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.codec": "best_compression",
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s",
      "index.translog.flush_threshold_size": "512mb",
      "index.query.default_field": ["message"],
      "analysis": {
        "analyzer": {
          "ik_smart": {
            "type": "ik_smart"
          },
          "ik_max_word": {
            "type": "ik_max_word"
          }
        }
      }
    },
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        }
      ],
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "message": {
          "type": "text",
          "analyzer": "ik_max_word",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 2048
            }
          }
        },
        "level": {
          "type": "keyword"
        },
        "service.name": {
          "type": "keyword"
        },
        "host.ip": {
          "type": "ip"
        }
      }
    }
  },
  "composed_of": ["common-settings"],
  "version": 1
}

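6.2 ILM (Index Lifecycle Management) policy

json
# Note: the ILM freeze action is deprecated since 7.14 (frozen indices were superseded
# by the frozen data tier), so a cold phase should not depend on it.
PUT _ilm/policy/logs-lifecycle-policy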
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          },
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d",
            "max_docs": 100000000
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "require": {
              "temp": "warm"
            }
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "readonly": {}
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": {
            "priority": 0
          },
          "allocate": {
            "require": {
              "temp": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

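6.3 Attaching the ILM policy to the template

json
# PUT _index_template replaces the whole template, so in practice add these two settings
# to the "settings" object of the template in 6.1 rather than sending them on their own.
# They are shown standalone here, with the required index_patterns, for clarity.
# Rollover also needs a bootstrap index (for example logs-000001) that carries
# logs-alias as its write alias.
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-lifecycle-policy",
      "index.lifecycle.rollover_alias": "logs-alias"
    }
  }
}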

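Part III: Performance Tuning


7. Indexing Performance Tuning

7.1 Core write-path parameters

yaml
# ================ Index level (index settings or index template) ================
# Refresh interval (raised from the 1s default)
"refresh_interval": "30s"  # 60s is reasonable for logs; -1 disables refresh during bulk loads

# Translog (async fsync: trades a small window of possible data loss for throughput)
"index.translog.durability": "async"
"index.translog.sync_interval": "5s"
"index.translog.flush_threshold_size": "512mb"

# Segment merging
"index.merge.scheduler.max_thread_count": 4  # SSD: 4, HDD: 1
"index.merge.policy.max_merged_segment": "5gb"

# ================ Node level (elasticsearch.yml) ================
# Write thread pool (the separate bulk pool was merged into write in 6.3 and does not exist in 7.x)
thread_pool.write.queue_size: 1000

# Indexing buffer shared by all shards on the node (default: 10% of heap)
indices.memory.index_buffer_size: 30%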

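7.2 Bulk API best practices

Item | Recommended value | Notes
Bulk request size | 5-15 MB | Size batches by bytes, not by document count
Concurrency | CPU cores × 1.5 | Do not exceed CPU cores × 2
Timeout | 60s | Leaves enough time for merges
Retry | Exponential backoff | Handles HTTP 429 rejections (EsRejectedExecutionException)

Python example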
python
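from elasticsearch import helpers
import time

def bulk_with_backoff(es, actions, max_retries=5):
    # Materialize the actions so a retry can resend the same batch.
    # Give documents explicit _id values, or a retry after a partial success creates duplicates.
    actions = list(actions)
    for attempt in range(max_retries):
        try:
            # helpers.bulk returns (success_count, errors)
            success, errors = helpers.bulk(
                es, actions,
                chunk_size=1000,                   # max documents per request
                max_chunk_bytes=10 * 1024 * 1024,  # cap each request at about 10 MB
                request_timeout=60,
                raise_on_error=False,              # return per-document errors instead of raising
            )
            return success, errors
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8 ... seconds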
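
The table recommends sizing bulk requests by bytes rather than by document count. The sketch below shows the idea in isolation, assuming documents are plain dicts. `chunk_by_bytes` is a hypothetical helper written for this illustration; it measures only the serialized document, not the bulk action metadata line, so treat the limit as approximate.

```python
import json

def chunk_by_bytes(docs, max_bytes=10 * 1024 * 1024):
    """Yield lists of docs whose combined serialized size stays within max_bytes."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # +1 accounts for the newline that terminates each line of a bulk (NDJSON) body
        size = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)  # an oversized single doc still goes out, alone in its batch
        batch_bytes += size
    if batch:
        yield batch

docs = [{"message": "x" * 100} for _ in range(10)]
batches = list(chunk_by_bytes(docs, max_bytes=400))
print([len(b) for b in batches])
```

Byte-based batching keeps request sizes stable when document sizes vary widely, which a fixed document count cannot do.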

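7.3 Maximum-throughput settings for heavy ingest

bash
# 1. Temporarily drop replicas (during a bulk import)
PUT my-index/_settings
{ "number_of_replicas": 0 }

# 2. Disable refresh
PUT my-index/_settings
{ "refresh_interval": -1 }

# 3. Restore both once the import finishes
PUT my-index/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "30s"
}

# 4. Force-merge segments (optional; only on indices that no longer receive writes)
POST my-index/_forcemerge?max_num_segments=1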

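7.4 Indexing throughput reference points

Hardware | Single-node indexing rate (docs/s) | Notes
16 cores / 64 GB / NVMe | 20,000 - 40,000 | Document size ≈ 1 KB
8 cores / 32 GB / SSD | 10,000 - 20,000 | Document size ≈ 1 KB
8 cores / 32 GB / HDD | 2,000 - 5,000 | Not recommended for production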

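8. Query Performance Tuning

8.1 Slow query log configuration

json
# Slow log thresholds are dynamic index settings: apply them per index or in an index template.
# They cannot go in elasticsearch.yml (the node refuses to start with index-level settings there).
PUT my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.search.slowlog.threshold.fetch.trace": "200ms"
}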

8.2 Query analysis with the Profile API

json
# Use the Profile API to find query bottlenecks
GET my-index/_search
{
  "profile": true,
  "query": {
    "match": {
      "message": "error"
    }
  }
}

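8.3 Query optimization techniques

Optimization | Method | Effect
Avoid deep pagination | Use search_after (with a point in time); keep scroll for bulk exports | Avoids the from + size cost, which grows with page depth
Use keyword fields | Do not use text fields for exact matches | Saves memory and speeds up queries
Use routing | ?routing=user_id | The query hits one shard instead of all N
Filter context | Use filter instead of must when scoring is not needed | Cacheable, no scoring
Trim _source | Request only the fields you need when the full document is not required | Less I/O and deserialization
Filter on time range | Put the time range in the bool filter clause | Quickly narrows the set of documents scanned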

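8.4 Efficient query example

json
# ❌ Inefficient: the range clause is scored in must, size is huge, and the full _source is returned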
GET logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "error" }},
        { "range": { "@timestamp": { "gte": "now-7d" }}}
      ]
    }
  },
  "size": 10000
}

# ✅ Efficient:
#   - filter clauses skip scoring and can be cached
#   - the time range and the exact-match term both sit in filter context
#   - size is kept small and _source lists only the fields needed
GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-7d" }}},
        { "term": { "level": "ERROR" }}
      ],
      "must": [
        { "match": { "message": "error" }}
      ]
    }
  },
  "size": 100,
  "_source": ["@timestamp", "level", "service.name"]
}

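9. Shard and Replica Strategy

9.1 Shard count formula

Shard count = MAX(
  ceil(expected total data size ÷ 30 GB),   # by data volume
  ceil(data node count × 1.5)               # by node count
)

Example 1:
Total data: 100 GB, data nodes: 3
Shard count = MAX(ceil(100 ÷ 30) = 4, ceil(3 × 1.5) = 5) = 5

Example 2 (logging):
Daily data: 50 GB, retention 7 days, 1 replica, data nodes: 3
Primary shards per daily index = MAX(ceil(50 ÷ 30) = 2, ceil(3 × 1.5) = 5) = 5
(about 10 GB per shard, the low end of the range in 9.2)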
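
The formula can be written as a few lines of code, which makes it easy to try other data sizes and node counts. This is a sketch of the rule of thumb in this section only; `shard_count` and its parameter names are made up for the example.

```python
import math

def shard_count(total_gb, data_nodes, target_shard_gb=30, per_node_factor=1.5):
    """Primary shard count = max(by data volume, by node count), both rounded up."""
    by_size = math.ceil(total_gb / target_shard_gb)
    by_nodes = math.ceil(data_nodes * per_node_factor)
    return max(by_size, by_nodes)

print(shard_count(100, 3))   # Example 1 from the text: 100 GB on 3 data nodes
print(shard_count(600, 6))   # a larger index, where the data-volume term dominates
```

For small indices the node-count term wins and shards end up small; for large ones the 30 GB target takes over.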

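9.2 Shard size guidelines

Shard size | Verdict | Notes
< 10 GB | Too small | Many small shards add cluster overhead
10-50 GB | ✅ Optimal | Recommended range
50-70 GB | ⚠️ Acceptable | Upper bound for query-heavy workloads
> 100 GB | ❌ Too large | Relocation, recovery, and queries all slow down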

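9.3 Replica count strategy

Scenario | Replicas | Notes
Development and testing | 0 | Saves resources
General workloads | 1 | Balances performance and reliability
Core business | 2 | Availability first
Logs / monitoring | 1 | A small amount of data loss is acceptable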

9.4 Shard allocation awareness

json
# Rack awareness (survive the loss of a single rack)
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack",
    "cluster.routing.allocation.awareness.force.rack.values": ["rack1", "rack2", "rack3"]
  }
}

# Node configuration in elasticsearch.yml (the value must match one of the racks listed above):
# node.attr.rack: rack1

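10. Cache Configuration

10.1 The three ES caches and the OS page cache compared

Cache | Purpose | Size | Lifecycle
Page cache | OS file system cache | Physical RAM minus JVM heap | Managed by the OS
Query cache | Caches results of filter clauses | 10% of heap (default) | LRU, per cached query
Fielddata cache | Field data used for aggregations and sorting | Unbounded by default (guarded by the circuit breaker) | Loaded on first use
Request cache | Caches whole search responses | 1% of heap (default) | LRU, per shard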

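10.2 Tuning the caches

yaml
# Node-level cache sizes are static settings: set them in elasticsearch.yml on every
# data node and restart. They cannot be changed through the cluster settings API.
indices.queries.cache.size: 15%     # query cache (default 10%)
indices.requests.cache.size: 2%     # request cache (default 1%)

# Circuit breaker limits are dynamic and can be changed through the cluster settings API
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.total.limit": "95%"
  }
}

# Index level: turn off the request cache where it does not help (dynamic setting)
PUT my-index/_settings
{
  "index.requests.cache.enable": false
}
# index.queries.cache.enabled is a static index setting: set it to false at index creation
# (or in the index template) for time-series data that does not benefit from it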

10.3 Monitoring cache hit rates

json
# Inspect cache usage
GET _nodes/stats/indices?filter_path=**.query_cache,**.fielddata,**.request_cache

# Metrics to watch
# query_cache: hit_count / (hit_count + miss_count) → hit rate
# fielddata.memory_size_in_bytes → watch for growth toward the breaker limit
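
As a rough illustration of the hit-rate calculation, the sketch below walks a response body shaped like the output of the stats call above and computes a rate per node and cache. The function name and the sample numbers are invented for the example. Nodes are keyed by node ID because the `filter_path` in the request drops the node name.

```python
def cache_hit_rates(node_stats, caches=("query_cache", "request_cache")):
    """Return {node_id: {cache: hit_rate}}; rate is None when a cache has seen no lookups."""
    rates = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        per_cache = {}
        for cache in caches:
            stats = node.get("indices", {}).get(cache, {})
            hits = stats.get("hit_count", 0)
            total = hits + stats.get("miss_count", 0)
            per_cache[cache] = hits / total if total else None
        rates[node_id] = per_cache
    return rates

# Hypothetical response fragment with made-up counters
sample = {
    "nodes": {
        "node-id-1": {
            "indices": {
                "query_cache": {"hit_count": 750, "miss_count": 250},
                "request_cache": {"hit_count": 0, "miss_count": 0},
            }
        }
    }
}
rates = cache_hit_rates(sample)
print(rates)
```

The counters are cumulative since node start, so for alerting it is usually more useful to compare two samples taken some minutes apart than to read a single snapshot.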

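Part IV: Operations


11. Cluster Monitoring and Alerting

11.1 Core metrics

Category | Metric | Alert threshold
Cluster state | cluster.status | red: immediately; yellow: after 5 minutes
Cluster state | pending_tasks | > 100
Cluster state | unassigned_shards | > 0 for 5 minutes
Node health | jvm.mem.heap_used_percent | > 85%
Node health | jvm.gc.old.collection_time | > 5 s within 1 minute
Node health | process.cpu.percent | > 80% for 5 minutes
Disk | fs.total.available_in_bytes | < 10% or < 50 GB
Disk | disk.io.util | > 90% for 10 minutes
Performance | indexing_rate | Sudden rise or drop of 50%
Performance | search_rate | Same as above
Performance | search_query_time | > 500 ms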

11.2 Prometheus scrape configuration

yaml
# Scrape config for elasticsearch_exporter (prometheus-community). The exporter listens on
# port 9114 and serves /metrics; it connects to ES itself using the monitor user, so
# Prometheus does not need ES credentials. The hostname below is a placeholder.
scrape_configs:
  - job_name: 'elasticsearch'
    scrape_interval: 30s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['es-exporter:9114']

11.3 Key alert rules (Prometheus)

yaml
# Metric names below are those exposed by elasticsearch_exporter; adjust them if you use another exporter
groups:
  - name: elasticsearch
    rules:
      # Cluster is red
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"

      # JVM heap usage too high
      - alert: ElasticsearchHeapTooHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ES node {{ $labels.name }} heap usage is high: {{ $value }}%"

      # Old-generation GC takes too long (more than 5 s of GC time per minute)
      - alert: ElasticsearchOldGCTimeHigh
        expr: increase(elasticsearch_jvm_gc_collection_seconds_sum{gc="old"}[1m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "ES node {{ $labels.name }} spent more than 5s in old GC during the last minute"

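12. Backup and Restore

12.1 Snapshot repository configuration

json
# 1. Shared file system repository
#    (the location must be listed under path.repo in elasticsearch.yml on every node)
PUT _snapshot/my-fs-repo
{
  "type": "fs",
  "settings": {
    "location": "/backup/elasticsearch",
    "compress": true,
    "max_snapshot_bytes_per_sec": "50mb",
    "max_restore_bytes_per_sec": "50mb"
  }
}

# 2. S3-compatible repository (recommended for production)
#    Requires the repository-s3 plugin on 7.x (built in from 8.0). Credentials must not be
#    passed in the repository settings; store them in the keystore on every node:
#      bin/elasticsearch-keystore add s3.client.default.access_key
#      bin/elasticsearch-keystore add s3.client.default.secret_key
#    then apply them with POST _nodes/reload_secure_settings
PUT _snapshot/my-s3-repo
{
  "type": "s3",
  "settings": {
    "bucket": "es-backup",
    "region": "cn-north-1",
    "compress": true
  }
}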

12.2 Automated snapshot policy

json
# Create an SLM (Snapshot Lifecycle Management) policy; the schedule below runs daily at 01:30
PUT _slm/policy/daily-snapshot
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my-s3-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Run the policy immediately
POST _slm/policy/daily-snapshot/_execute

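12.3 Restore procedure

json
# 1. List available snapshots
GET _snapshot/my-s3-repo/_all

# 2. Restore an index, with replicas off to speed up the restore.
#    SLM appends a unique suffix to each snapshot name, so copy the full name from step 1.
POST _snapshot/my-s3-repo/daily-snap-2026.05.22/_restore
{
  "indices": "logs-2026.05.22",
  "ignore_unavailable": true,
  "include_global_state": false,
  "index_settings": {
    "index.number_of_replicas": 0
  }
}

# 3. Track restore progress
GET _recovery

# 4. Re-enable replicas after the restore
PUT logs-2026.05.22/_settings
{ "number_of_replicas": 1 }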

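13. Rolling Upgrades and Scaling Out

13.1 Rolling upgrade steps (7.x → 7.x)

bash
# ================ Once, before starting ================
# 1. Put ML into upgrade mode (if ML is in use)
POST _ml/set_upgrade_mode?enabled=true

# ================ Repeat steps 2-8 for each node ================
# 2. Restrict shard allocation to primaries
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# 3. Stop non-essential indexing and flush
#    (synced flush, POST _flush/synced, is deprecated since 7.6 and removed in 8.0)
POST _flush

# 4. Stop Elasticsearch
systemctl stop elasticsearch

# 5. Upgrade the package (configuration is kept)
rpm -Uvh elasticsearch-7.17.10-x86_64.rpm
# or unpack the new archive and copy config/ and data/ across

# 6. Start the node
systemctl start elasticsearch

# 7. Wait for the node to rejoin (cluster yellow or green), then re-enable allocation
GET _cat/nodes?v
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

# 8. Wait for the cluster to turn green before upgrading the next node
GET _cat/health?v

# ================ After all nodes are upgraded ================
# 9. Leave ML upgrade mode
POST _ml/set_upgrade_mode?enabled=false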

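13.2 Adding data nodes without downtime

bash
# 1. Prepare the new node (same OS tuning, ES version, and configuration as the existing nodes)

# 2. Start the new node; it joins automatically when its discovery.seed_hosts lists the existing master-eligible nodes

# 3. Watch shard rebalancing
GET _cat/allocation?v&s=node

# 4. (Optional) Move a hot shard by hand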
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "hot-index",
        "shard": 0,
        "from_node": "old-node",
        "to_node": "new-node"
      }
    }
  ]
}

# 5. (Optional) Raise the rebalance and recovery limits to speed things up
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 50,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}

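14. Troubleshooting Handbook

14.1 Diagnosing a red cluster

bash
# Step 1: find the problem shards
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

# Step 2: get the reason a shard is unassigned
GET _cluster/allocation/explain

# Step 3: common causes and fixes
# ├─ NODE_LEFT → wait for the node to return, or allocate manually
# ├─ ALLOCATION_FAILED → check the node logs
# └─ DECIDERS_NO → not enough disk space, or blocked by allocation rules

# Step 4: retry shards whose allocation failed too many times (after fixing the root cause)
POST _cluster/reroute?retry_failed=true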

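14.2 OOM and memory problems

bash
# Symptom: the node exits and the log contains OutOfMemoryError

# Diagnosis:
# 1. Locate the heap dump (written because HeapDumpOnOutOfMemoryError is set)
ls -lh /var/lib/elasticsearch/java_pid*.hprof

# 2. Check which circuit breakers have tripped
GET _nodes/stats/breaker

# 3. Emergency measures:
# - Add nodes
# - Raise breaker limits (temporary; this relieves breaker rejections, not a real OOM)
# - Close the indices that consume the most memory
# - Clear fielddata
POST my-index/_cache/clear?fielddata=true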

14.3 Disk watermark problems

bash
# Default watermarks (low / high / flood stage): 85% / 90% / 95%
GET _cluster/settings?include_defaults=true&filter_path=**.disk

# Temporarily raise the watermarks (emergency only)
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

# Permanent fixes:
# 1. Add disk capacity
# 2. Delete old data
# 3. Add data nodes
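
To make the three thresholds concrete, the sketch below maps a disk usage percentage to the watermark it has crossed, using the default values quoted above. The function name and return labels are illustrative; the comments describe what ES does at each level.

```python
def watermark_state(used_pct, low=85.0, high=90.0, flood=95.0):
    """Classify disk usage against the low / high / flood-stage watermarks."""
    if used_pct >= flood:
        return "flood_stage"  # indices with shards on this node become read-only (deletes still allowed)
    if used_pct >= high:
        return "high"         # ES tries to relocate shards away from this node
    if used_pct >= low:
        return "low"          # ES stops allocating new shards to this node
    return "ok"

for pct in (70, 86, 92, 96):
    print(pct, watermark_state(pct))
```

A monitoring script could alert one step earlier than ES acts, for example at 80%, so that cleanup happens before allocation is affected.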

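14.4 Slow query troubleshooting

bash
# Step 1: read the search slow log (the file is named <cluster.name>_index_search_slowlog.log)
tail -f /var/log/elasticsearch/my-production-cluster_index_search_slowlog.log

# Step 2: analyze with the Profile API
GET my-index/_search
{ "profile": true, "query": { ... } }

# Step 3: inspect hot threads
GET _nodes/hot_threads

# Step 4: fix the usual suspects
# - Paging through large result sets → search_after
# - Leading-wildcard queries → wildcard field type or n-gram indexing
# - Deep aggregations → pre-aggregation, rollup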

Part V: Security and Compliance


15. Security Configuration and Access Control

15.1 Enabling X-Pack security

bash
# 1. Generate certificates
bin/elasticsearch-certutil ca
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12

# 2. Copy the certificate to config/certs/ on every node

# 3. Configure elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

# 4. Set passwords for the built-in users (7.x)
bin/elasticsearch-setup-passwords auto
# or use "interactive" to choose them by hand
# On 8.x this tool is deprecated; use bin/elasticsearch-reset-password -u <user> instead

15.2 RBAC roles and permissions

json
# Create a read-only role
PUT _security/role/read_only
{
  "cluster": ["monitor"],
  "indices": [
    {
      "names": ["logs-*", "metrics-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ]
}

# Create a user
PUT _security/user/analyst
{
  "password": "StrongPassword123!",
  "roles": ["read_only", "kibana_user"],
  "full_name": "Data Analyst",
  "email": "analyst@company.com"
}

# Field-level security (FLS): hide PII fields from this role
PUT _security/role/pii_masked
{
  "indices": [
    {
      "names": ["users-*"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["*"],
        "except": ["phone", "id_card", "email"]
      }
    }
  ]
}

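15.3 API access control best practices

✅ Do not use the elastic superuser for application access
✅ Create a separate user for each application
✅ Follow the principle of least privilege
✅ Enable audit logging (records all access)
✅ Deploy on an internal network; anything exposed publicly must use HTTPS
✅ Rotate passwords regularly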

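16. Data Encryption and Compliance

16.1 Encryption in transit (TLS)

yaml
# HTTPS (mandatory on public networks)
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12
xpack.security.http.ssl.client_authentication: optional

# Restrict TLS versions (secure versions only)
xpack.security.http.ssl.supported_protocols: ["TLSv1.3", "TLSv1.2"]
# List suites for both protocol versions: the TLS_AES_* suites apply to TLSv1.3 only,
# so TLSv1.2 clients need the ECDHE suites
xpack.security.http.ssl.cipher_suites: [
  "TLS_AES_256_GCM_SHA384",
  "TLS_AES_128_GCM_SHA256",
  "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
  "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
]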

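16.2 Encryption at rest (disk level)

yaml
# Elasticsearch has no built-in node or index setting that encrypts data at rest;
# settings such as xpack.security.encryption_key or index.encrypted do not exist.
# Encrypt the data path at the operating-system or storage layer instead:
#   - Linux: LUKS / dm-crypt on the volume that holds path.data
#   - Cloud: AWS EBS encryption, Azure Disk Encryption
# Secrets used by ES itself (for example repository credentials) belong in the ES keystore:
#   bin/elasticsearch-keystore add <setting-name>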

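16.3 GDPR / MLPS (China's Multi-Level Protection Scheme) compliance checklist

Requirement | How ES addresses it
Data encryption | TLS in transit + disk-level encryption at rest
Access auditing | xpack.security.audit.enabled
Data retention | Automatic deletion through ILM
Permission isolation | RBAC roles + field-level security
Right to erasure | Delete APIs followed by a force merge to purge deleted documents
Operation traceability | Audit logs + Kibana operation logs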

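Appendix: Quick Checklist

Pre-production deployment checklist

✅ Hardware
  [ ] SSD/NVMe disks, no HDD
  [ ] Physical RAM ≥ 32 GB; JVM heap = min(50% of RAM, 31 GB)
  [ ] 10 GbE network
  [ ] Clock synchronization (NTP)

✅ Operating system
  [ ] vm.swappiness = 1
  [ ] vm.max_map_count = 262144
  [ ] nofile = 655360
  [ ] Swap disabled
  [ ] Transparent huge pages (THP) disabled
  [ ] Data disk mounted with noatime

✅ Elasticsearch configuration
  [ ] bootstrap.memory_lock = true
  [ ] 3 dedicated master nodes
  [ ] JVM Xms = Xmx (31 GB on hosts with 64 GB RAM or more)
  [ ] X-Pack security enabled
  [ ] ILM policy configured
  [ ] Slow query log enabled

✅ Operational readiness
  [ ] Monitoring and alerting in place
  [ ] Snapshot backup policy configured
  [ ] Disaster recovery plan verified
  [ ] Operations staff trained

End of document

💡 Ongoing updates: this guide was written against Elasticsearch 7.17; 8.x users should also consult the official migration guide. Run thorough performance tests and disaster recovery drills before deploying to production.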

褚成志 · 简历中心