Configuring HBase Monitoring and Alerting on Debian
1. Monitoring Stack and Ports
2. Option 1: Alerting with the Prometheus JMX Exporter
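JMX Exporter configuration (/opt/prometheus/hbase_jmx_config.yml). The exporter matches each pattern against strings of the form domain<key=value, ...><>attribute, so the bean names are written in that form rather than in ObjectName syntax:

    rules:
      # HBase Master beans
      - pattern: 'Hadoop<service=HBase, name=Master, sub=Server>'
      - pattern: 'Hadoop<service=HBase, name=Master, sub=IPC>'
      - pattern: 'Hadoop<service=HBase, name=Master, sub=Balancer>'
      # HBase RegionServer beans
      - pattern: 'Hadoop<service=HBase, name=RegionServer, sub=Server>'
      - pattern: 'Hadoop<service=HBase, name=RegionServer, sub=IPC>'
      - pattern: 'Hadoop<service=HBase, name=RegionServer, sub=WAL>'
      # JVM heap and garbage collection
      - pattern: 'java.lang<type=Memory>'
      - pattern: 'java.lang<type=GarbageCollector, name=.*>'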
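Start the HBase Master with the JMX Exporter Java agent attached (the classpath is quoted so the shell does not expand the wildcard):

    java -javaagent:/opt/prometheus/jmx_prometheus_javaagent-<ver>.jar=9404:/opt/prometheus/hbase_jmx_config.yml \
      -cp "$HBASE_HOME/lib/*:$HBASE_HOME/conf" \
      org.apache.hadoop.hbase.master.HMaster start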
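On an installation started through the standard HBase scripts, the same agent flag is usually injected through environment variables instead of a hand-written java command. A minimal sketch, assuming the lines go into $HBASE_HOME/conf/hbase-env.sh and that the jar and config paths match the command above; AGENT_JAR and AGENT_CONF are illustrative names, and the <ver> placeholder still has to be filled in:

```shell
# Sketch for conf/hbase-env.sh; paths mirror the example command above.
AGENT_JAR="/opt/prometheus/jmx_prometheus_javaagent-<ver>.jar"   # <ver> is a placeholder
AGENT_CONF="/opt/prometheus/hbase_jmx_config.yml"

# Role-specific variables keep the agent out of short-lived commands such as
# "hbase shell", which would otherwise try to bind the same port.
export HBASE_MASTER_OPTS="${HBASE_MASTER_OPTS:-} -javaagent:${AGENT_JAR}=9404:${AGENT_CONF}"
export HBASE_REGIONSERVER_OPTS="${HBASE_REGIONSERVER_OPTS:-} -javaagent:${AGENT_JAR}=9404:${AGENT_CONF}"
```

If a Master and a RegionServer run on the same host, each needs its own exporter port, and the Prometheus targets have to be adjusted to match.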
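Note: in Java agent mode the exporter runs inside the HBase JVM and reads the MBeans in-process, so no remote JMX connection is required. Make sure the exporter port (9404 in this example) does not collide with ports HBase already uses, such as the Master web UI (16010) and the RegionServer RPC port (16020); neither of those is a JMX port.

Prometheus scrape configuration (prometheus.yml):

    scrape_configs:
      - job_name: 'hbase-master'
        static_configs:
          - targets: ['master.example.com:9404']
      - job_name: 'hbase-regionserver'
        static_configs:
          - targets: ['rs1.example.com:9404', 'rs2.example.com:9404']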
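Before relying on the scrape jobs, it helps to confirm that each exporter endpoint actually serves samples. A small sketch, assuming curl is installed on the checking host; count_samples is a hypothetical helper written for this example, not part of Prometheus or the exporter:

```shell
# Count sample lines in Prometheus text exposition format read from stdin.
# Lines starting with '#' are HELP/TYPE comments and are skipped, as are blank lines.
count_samples() {
  grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}

# Intended use against a live exporter (host name taken from the scrape config):
#   curl -s http://master.example.com:9404/metrics | count_samples
```

A result of 0 would suggest that the agent started but none of the rules in the exporter configuration matched any bean.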
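Alert rules (loaded through rule_files in prometheus.yml). The hadoop_hbase_* metric names depend on the exporter's naming rules, so check them against the exporter's /metrics output before deploying:

    groups:
      - name: hbase
        rules:
          - alert: HBaseMasterDown
            expr: up{job="hbase-master"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "HBase Master unavailable"
              description: "HBase Master {{ $labels.instance }} has been down for more than 1 minute."
          - alert: HBaseRegionServerDown
            expr: up{job="hbase-regionserver"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "HBase RegionServer unavailable"
              description: "RegionServer {{ $labels.instance }} has been down for more than 1 minute."
          - alert: HBaseMemstoreUsageHigh
            # Fires when less than 20% of the MemStore limit remains free.
            expr: (1 - (hadoop_hbase_RegionServer_MemStoreMemoryUsage / hadoop_hbase_RegionServer_MemStoreLimit)) < 0.2
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "MemStore usage too high"
              description: "RegionServer {{ $labels.instance }} has less than 20% of its MemStore limit free."
          - alert: HBaseRPCLatencyHigh
            # Average = increase in total queue time / increase in operation count.
            # HBase reports queue time in milliseconds, so 1000 corresponds to 1 s.
            expr: rate(hadoop_hbase_RegionServer_RPCQueueTime_sum[5m]) / rate(hadoop_hbase_RegionServer_RPCQueueTime_num_ops[5m]) > 1000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "RPC queue time too long"
              description: "RegionServer {{ $labels.instance }} 5-minute average RPC queue time is above 1 s."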
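The average queue time that the RPC latency rule is meant to capture is the increase of the cumulative time total divided by the increase of the operation count over the same window. The sketch below works through that arithmetic with made-up counter readings; avg_per_op is a hypothetical helper, not part of Prometheus or HBase:

```shell
# avg_per_op SUM_T0 COUNT_T0 SUM_T1 COUNT_T1
# Prints the average time per operation between two readings of a
# cumulative sum counter and its matching operation counter.
avg_per_op() {
  awk -v s0="$1" -v c0="$2" -v s1="$3" -v c1="$4" 'BEGIN {
    dc = c1 - c0
    if (dc <= 0) { print "NaN"; exit }   # no operations in the window
    printf "%.2f\n", (s1 - s0) / dc
  }'
}

# Made-up readings: total time grew from 12000 to 36000 while the
# operation count grew from 100 to 150.
avg_per_op 12000 100 36000 150   # (36000 - 12000) / (150 - 100) = 480.00
```

Dividing the other way round (count by sum) gives operations per unit of time, which is why the order of the two rate() terms matters.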
3. Option 2: Alerting with Zabbix
Install and start the Zabbix agent:

    wget https://repo.zabbix.com/zabbix/6.0/debian/pool/main/z/zabbix-release/zabbix-release_6.0-4+debian11_all.deb
    sudo dpkg -i zabbix-release_6.0-4+debian11_all.deb
    sudo apt update && sudo apt install -y zabbix-agent
    sudo systemctl enable --now zabbix-agent
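Status script (/usr/local/bin/hbase_status.sh), which prints the number of live RegionServers:

    #!/usr/bin/env bash
    # Run "status 'simple'" non-interactively and print the live server count.
    # [0-9] is used because POSIX extended regular expressions have no \d class.
    echo "status 'simple'" | hbase shell -n 2>/dev/null | grep -E '^[0-9]+ live servers' | awk '{print $1}'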
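The text filter in the status script can be tried offline before it is wired into Zabbix. A minimal sketch, assuming the shell output contains a line of the form "<N> live servers"; parse_live_servers is a hypothetical helper, and the sample text is illustrative rather than captured from a real cluster:

```shell
# Extract the count from the first "<N> live servers" line on stdin.
parse_live_servers() {
  awk '/^[0-9]+ live servers/ { print $1; exit }'
}

# Illustrative input only, not output from a real cluster.
sample='active master: master.example.com
0 backup masters
3 live servers
0 dead servers'

printf '%s\n' "$sample" | parse_live_servers
```

When no line matches, the function prints nothing, so Zabbix receives an empty value; that case is worth covering with a no-data trigger on the item.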
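Make the script executable (chmod +x /usr/local/bin/hbase_status.sh), then add the following line to /etc/zabbix/zabbix_agentd.conf:

    UserParameter=hbase.live_servers,/usr/local/bin/hbase_status.sh

Restart the Zabbix agent:

    sudo systemctl restart zabbix-agent

4. Key Alert Rules and Recommended Thresholds
5. Quick Verification and Troubleshooting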