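Prometheus Monitoring: Pushgateway

Introduction to Pushgateway

Pushgateway is an intermediary component in the Prometheus ecosystem: clients push metrics to it, and Prometheus scrapes them from it. It is mainly used in the following scenarios: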

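Use cases

  • Network isolation: when the monitored targets and the Prometheus server sit in different subnets or behind a firewall, so that Prometheus cannot pull from them directly, the targets push their metrics instead
  • Centralized collection: metrics from many scattered nodes need to be gathered in one place and scraped together
  • Custom monitoring: when the standard exporters do not cover a requirement, scripts or programs can collect and push custom metrics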

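Caveats

  • Single point of failure: once many nodes push through one Pushgateway, its failure affects the monitoring of all of them
  • Limited health visibility: Prometheus can only tell whether the Pushgateway itself is up; it cannot directly tell whether the original nodes are alive
  • Data retention: pushed metrics stay in the Pushgateway until they are deleted, so stale data must be cleaned up regularly
  • Rule of thumb: use it only for short-lived tasks and batch jobs; for long-running services, exporters with the pull model remain the recommended approach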
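
Because Prometheus only sees the Pushgateway, a dead pusher has to be inferred from the age of its last push. Pushgateway exposes a push_time_seconds gauge for every group; the sketch below assumes those timestamps have already been collected into a dict. The helper name find_stale_groups, the group names, and the 15-minute threshold are illustrative assumptions, not part of any library.

```python
import time


def find_stale_groups(push_times, max_age_seconds, now=None):
    """Return the groups whose last push is older than max_age_seconds.

    push_times maps a group name (hypothetical, e.g. "job/instance") to the
    Unix timestamp reported by that group's push_time_seconds metric.
    """
    if now is None:
        now = time.time()
    return sorted(
        group for group, pushed_at in push_times.items()
        if now - pushed_at > max_age_seconds
    )


# Made-up timestamps: server02 last pushed 30 minutes ago.
now = 1_700_000_000
push_times = {
    "data_monitor/server01": now - 60,
    "data_monitor/server02": now - 1800,
}
stale = find_stale_groups(push_times, max_age_seconds=900, now=now)
print(stale)
```

The same comparison can be expressed as a Prometheus alert on time() minus push_time_seconds; the Python form is only meant to make the logic explicit.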

Installation and deployment

Docker is the quickest way to deploy:

bash
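# Deployed on 192.168.56.102
# Basic run (metrics are lost when the container restarts)
docker run -d -p 9091:9091 --name pushgateway prom/pushgateway

# With persistence (recommended for production)
# Persistence is enabled with the --persistence.file flag, not an environment variable.
# The mounted directory must be writable by the user the container runs as.
docker run -d -p 9091:9091 \
  -v /path/to/persistence:/persistence \
  --name pushgateway \
  prom/pushgateway \
  --persistence.file=/persistence/data.store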

Prometheus configuration

Add the following scrape configuration to prometheus.yml:

yaml
  - job_name: "pushgateway"
    honor_labels: true  # keep the labels of pushed metrics (e.g. job, instance) instead of overwriting them with the scrape target's labels
    static_configs:
      - targets: ['192.168.56.102:9091']
        labels:
          instance: pushgateway
          env: production  # add an environment label
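
To make the effect of honor_labels concrete, here is a small model of how Prometheus resolves a conflict between labels found in the scraped data and labels it would attach from the target. It is a simplified sketch, not Prometheus code; merge_labels and the sample label sets are assumptions made for illustration.

```python
def merge_labels(scraped, target, honor_labels):
    """Model label conflict resolution at scrape time.

    scraped: labels carried by a sample exposed by the Pushgateway.
    target:  labels Prometheus attaches for the scrape target.
    """
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            # No conflict: the target label is simply added.
            merged[name] = value
        elif honor_labels:
            # Conflict, honor_labels: true -> the pushed value is kept.
            continue
        else:
            # Conflict, honor_labels: false -> the target label wins and the
            # pushed value is preserved under exported_<name>.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
    return merged


scraped = {"job": "data_monitor", "instance": "server01"}
target = {"job": "pushgateway", "instance": "pushgateway", "env": "production"}

honored = merge_labels(scraped, target, honor_labels=True)
renamed = merge_labels(scraped, target, honor_labels=False)
print(honored)
print(renamed)
```

With honor_labels enabled, queries can filter on the job and instance values the pusher chose, which is usually what is wanted for a Pushgateway scrape job.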

Apply the configuration (the reload endpoint only works when Prometheus is started with --web.enable-lifecycle):

bash
# Trigger a configuration reload
curl -X POST http://prometheus-host:9090/-/reload

# Verify the loaded configuration
curl http://prometheus-host:9090/api/v1/status/config | jq .

Pushing data

Pushing with curl

Basic push (without a TYPE line the metric is stored as untyped)

bash
echo "data_file_num 158" | curl --data-binary @- \
http://pushgateway-host:9091/metrics/job/data_monitor/instance/server01
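
The URL path of a push encodes the grouping key as /metrics/job/<job> followed by optional /<label>/<value> pairs. The sketch below builds such a URL; values that contain a slash are sent in the label@base64 form with URL-safe base64, which Pushgateway accepts. The function name push_url and the example values are assumptions for illustration, and empty label values or job names containing a slash are deliberately not handled.

```python
import base64
from urllib.parse import quote


def push_url(base_url, job, grouping=None):
    """Build a Pushgateway URL: /metrics/job/<job>{/<label>/<value>}."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        if "/" in value:
            # A "/" cannot appear in a path segment, so switch to the
            # <label>@base64/<url-safe base64 value> form.
            encoded = base64.urlsafe_b64encode(value.encode()).decode()
            parts += [label + "@base64", encoded]
        else:
            parts += [label, quote(value, safe="")]
    return "/".join(parts)


plain = push_url("http://pushgateway-host:9091", "data_monitor",
                 {"instance": "server01"})
encoded = push_url("http://pushgateway-host:9091/", "directory_cleaner",
                   {"path": "/var/tmp"})
print(plain)
print(encoded)
```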

Push with full metadata (HELP and TYPE)

bash
cat <<EOF | curl --data-binary @- http://pushgateway-host:9091/metrics/job/data_monitor
# HELP data_file_num Total files in data directory
# TYPE data_file_num gauge
data_file_num{instance="server01",path="/data"} 158
data_file_num{instance="server02",path="/backup"} 72
EOF

Deleting data

bash
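# Delete the group of a specific instance
curl -X DELETE http://pushgateway-host:9091/metrics/job/data_monitor/instance/server01

# Delete the group identified by job="data_monitor" alone
# (groups that also carry other grouping labels, such as instance, are not affected)
curl -X DELETE http://pushgateway-host:9091/metrics/job/data_monitor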
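
The same deletion can be issued from a script with the Python standard library. This sketch only builds the request so that it can be inspected without a running Pushgateway; delete_group_request is a hypothetical helper and the host name is the placeholder used throughout this article.

```python
import urllib.request


def delete_group_request(base_url, job, instance=None):
    """Build (but do not send) a DELETE request for one Pushgateway group."""
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    if instance is not None:
        url += f"/instance/{instance}"
    return urllib.request.Request(url, method="DELETE")


req = delete_group_request("http://pushgateway-host:9091", "data_monitor", "server01")
print(req.get_method(), req.full_url)

# To actually send it (requires a reachable Pushgateway):
# urllib.request.urlopen(req, timeout=5)
```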

Pushing with the Python SDK

Install the client library:

bash
pip install prometheus-client

Python code:

python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import os


def monitor_directory(path):
    registry = CollectorRegistry()

    # Create a gauge with labels
    file_gauge = Gauge(
        'directory_files_total',
        'Total files in directory',
        ['path', 'instance'],
        registry=registry
    )

    files_count = len(os.listdir(path))
    file_gauge.labels(path=path, instance='server01').set(files_count)

    # Push to the gateway
    push_to_gateway(
        'pushgateway-host:9091',
        job='directory_monitor',
        registry=registry
    )


if __name__ == "__main__":
    monitor_directory('/data')
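
One detail worth knowing: push_to_gateway sends an HTTP PUT, which replaces every metric in the group, whereas the curl examples earlier send a POST, which only replaces metrics of the same name (prometheus_client offers pushadd_to_gateway for the POST behavior). The toy model below illustrates the difference for a single group; apply_push and the stored values are assumptions for illustration and do not talk to a real Pushgateway.

```python
def apply_push(group, pushed, method):
    """Model how one push changes the metrics stored for a single group.

    group and pushed map metric names to values; method is "PUT" or "POST".
    """
    if method == "PUT":
        # PUT replaces every metric in the group with the pushed set.
        return dict(pushed)
    if method == "POST":
        # POST replaces metrics with the same name and keeps the others.
        return {**group, **pushed}
    raise ValueError(f"unsupported method: {method}")


stored = {"directory_files_total": 158, "last_success_timestamp": 1_700_000_000}
pushed = {"directory_files_total": 160}

after_put = apply_push(stored, pushed, "PUT")
after_post = apply_push(stored, pushed, "POST")
print(after_put)
print(after_post)
```

If several scripts push different metrics into the same group, PUT from one of them silently drops the metrics of the others, so either use POST or give each script its own grouping key.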

Hands-on example: monitoring the number of files in a directory

Shell implementation

/opt/file_monitor.sh:

bash
#!/bin/bash

DATA_DIR="/data"
PUSHGATEWAY_URL="http://pushgateway-host:9091"

# Count regular files in the directory, excluding hidden files
FILE_NUM=$(find "${DATA_DIR}" -maxdepth 1 -type f ! -name ".*" | wc -l)

# Push the metric
cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/directory_monitor/instance/$(hostname)"
# HELP directory_files_total Total non-hidden files in directory
# TYPE directory_files_total gauge
directory_files_total{path="${DATA_DIR}"} ${FILE_NUM}
EOF

Schedule it with cron:

bash
# Run every 5 minutes: open the crontab editor and add the entry below
crontab -e
*/5 * * * * /opt/file_monitor.sh

Python implementation

/opt/file_monitor.py:

python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import os
import sys


def main():
    registry = CollectorRegistry()
    data_dir = "/data"

    # Metric definition
    file_gauge = Gauge(
        'directory_files_total',
        'Total files in monitored directory',
        ['directory', 'instance'],
        registry=registry
    )

    try:
        # Count regular, non-hidden files
        file_count = len([
            f for f in os.listdir(data_dir)
            if os.path.isfile(os.path.join(data_dir, f)) and not f.startswith('.')
        ])

        file_gauge.labels(
            directory=data_dir,
            instance=os.uname().nodename
        ).set(file_count)

        # Record the time of the last successful collection
        success_timestamp = Gauge(
            'last_success_timestamp',
            'Last successful collection time',
            registry=registry
        )
        success_timestamp.set_to_current_time()

    except Exception as e:
        print(f"Monitoring failed: {e}", file=sys.stderr)
        sys.exit(1)

    push_to_gateway(
        'pushgateway-host:9091',
        job='directory_monitor',
        registry=registry
    )


if __name__ == "__main__":
    main()

Hands-on example: monitoring game server processes

bash
#!/bin/bash

# Configuration
PUSHGATEWAY_URL="http://101.133.144.129:9091"
JOB_NAME="cssx_process_monitor"

# List the process directories that match the naming pattern
get_process_dirs() {
    ls -d /home/wangxian/cssx_*_s* | awk -F '/' '{print $NF}'
}

# Count the running processes for a given process name
check_process_count() {
    local process_name="$1"
    ps aux | grep -E "[c]ssx_.*${process_name}" | wc -l
}

# Collect system metrics
get_system_metrics() {
    # Disk usage in percent (/data partition)
    cssx_disk_usage=$(df /data | awk 'NR==2 {print $5}' | sed 's/%//')

    # Memory usage in percent
    mem_usage=$(free | grep Mem | awk '{printf("%.0f", $3/$2 * 100)}')

    # CPU usage in percent (second vmstat sample, i.e. the average over a 1-second interval)
    cpu_idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
    cssx_cpu_usage=$(( 100 - cpu_idle ))

    # Total memory in MB
    cssx_total_memory=$(free -m | grep Mem | awk '{print $2}')

    echo "$cssx_disk_usage $mem_usage $cssx_cpu_usage $cssx_total_memory"
}

# Push metrics to the Pushgateway
push_metrics() {
    # Temporary file that holds the metrics payload
    metrics_file=$(mktemp)

    # Collect system metrics
    read -r disk mem cpu total <<< "$(get_system_metrics)"

    # Write the system metrics to the file
    echo "# HELP cssx_disk_usage Data disk usage" >> "$metrics_file"
    echo "# TYPE cssx_disk_usage gauge" >> "$metrics_file"
    echo "cssx_disk_usage $disk" >> "$metrics_file"

    echo "# HELP cssx_memory_usage Memory usage" >> "$metrics_file"
    echo "# TYPE cssx_memory_usage gauge" >> "$metrics_file"
    echo "cssx_memory_usage $mem" >> "$metrics_file"

    echo "# HELP cssx_cpu_usage CPU usage" >> "$metrics_file"
    echo "# TYPE cssx_cpu_usage gauge" >> "$metrics_file"
    echo "cssx_cpu_usage $cpu" >> "$metrics_file"

    echo "# HELP cssx_total_memory Total system memory in MB" >> "$metrics_file"
    echo "# TYPE cssx_total_memory gauge" >> "$metrics_file"
    echo "cssx_total_memory $total" >> "$metrics_file"

    # Write the process status header to the file
    echo "# HELP cssx_process_status Process running status" >> "$metrics_file"
    echo "# TYPE cssx_process_status gauge" >> "$metrics_file"

    # Determine the internal IP address
    int_ip=$(ip -4 addr show scope global | grep -Po 'inet \K[\d.]+' | grep -v '^127\.' | head -n1)
    if [[ -z "$int_ip" ]]; then
        int_ip="unknown"
    fi

    # Determine the external IP address (local file first, then a public lookup service)
    ext_ip=$(cat /data/ip 2>/dev/null)
    if [[ -z "$ext_ip" ]]; then
        ext_ip=$(curl -s https://api.ipify.org/ 2>/dev/null)
        if [[ $? -ne 0 || -z "$ext_ip" ]]; then
            ext_ip="unknown"
        fi
    fi

    # Write the IP metrics (the address is carried in the label, the value is always 1)
    echo "# HELP cssx_internal_ip Internal IP address" >> "$metrics_file"
    echo "# TYPE cssx_internal_ip gauge" >> "$metrics_file"
    echo "cssx_internal_ip{ip=\"$int_ip\"} 1" >> "$metrics_file"

    echo "# HELP cssx_external_ip External IP address" >> "$metrics_file"
    echo "# TYPE cssx_external_ip gauge" >> "$metrics_file"
    echo "cssx_external_ip{ip=\"$ext_ip\"} 1" >> "$metrics_file"

    # Process monitoring
    for dir_name in $(get_process_dirs); do
        # Extract the process name from the directory name
        process_name=$(echo "$dir_name" | grep -oP 'cssx_\K.+?(?=_s\d+)')
        count=$(check_process_count "$process_name")

        # Status: 1 = healthy (green), 0 = unhealthy (red, fewer than 7 matching processes)
        status=1
        [ "$count" -lt 7 ] && status=0

        # Write the process status to the file
        echo "cssx_process_status{process_name=\"$dir_name\"} $status" >> "$metrics_file"
    done

    # Push the metrics to the Pushgateway and clean up
    curl --data-binary "@$metrics_file" "$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
    rm -f "$metrics_file"
}

# Entry point
push_metrics
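
The name extraction and the status threshold are the two pieces of this script that are easiest to get wrong, so they are worth checking in isolation. Below is a Python rendering of the same regular expression and the same rule (fewer than 7 matching processes means unhealthy). The directory names used here are made up for illustration; the real naming scheme is only known from the cssx_*_s* glob in the script.

```python
import re

# Python equivalent of: grep -oP 'cssx_\K.+?(?=_s\d+)'
PROCESS_NAME = re.compile(r"cssx_(.+?)(?=_s\d+)")


def process_name(dir_name):
    """Extract the part between 'cssx_' and the '_s<number>' suffix."""
    match = PROCESS_NAME.search(dir_name)
    return match.group(1) if match else None


def process_status(count, expected=7):
    """Return 1 (healthy) if at least `expected` processes run, else 0."""
    return 1 if count >= expected else 0


for name in ["cssx_game_s1", "cssx_login_gate_s12", "unrelated_dir"]:
    print(name, "->", process_name(name))
print(process_status(7), process_status(6))
```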

Verification

Via the Pushgateway web UI: http://PushgatewayIP:9091/

Via Prometheus: http://PrometheusIP:9090/graph

Adding a Grafana dashboard

Reference: "Grafana basic usage tutorial" (grafana基础使用教程_grafana使用, CSDN blog)

Alerting configuration

Add the following to a Prometheus alerting rules file (e.g. rules/file_alert.yml):

yaml
groups:
  - name: directory_monitoring
    rules:
      - alert: ExcessiveFiles
        expr: directory_files_total{job="directory_monitor"} > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Too many files in directory ({{ $value }})"
          description: "Instance {{ $labels.instance }} directory {{ $labels.directory }} contains {{ $value }} files"
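
The for: 10m clause means the expression has to stay true across consecutive evaluations for ten minutes before the alert fires; a single dip below the threshold resets the timer. The following is a simplified model of that behavior, not Prometheus code; alert_fires and the sample series are assumptions for illustration.

```python
def alert_fires(samples, threshold, for_seconds):
    """Return True if value > threshold held continuously for for_seconds.

    samples is a time-ordered list of (unix_timestamp, value) evaluations.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                # Condition just became true: the alert enters "pending".
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return True
        else:
            # Condition is false again: the pending state is discarded.
            pending_since = None
    return False


steady = [(0, 1200), (300, 1500), (600, 1100)]
dipped = [(0, 1200), (300, 900), (600, 1200)]

print(alert_fires(steady, threshold=1000, for_seconds=600))
print(alert_fires(dipped, threshold=1000, for_seconds=600))
```

Keep in mind that the Pushgateway keeps exposing the last pushed value, so the condition can stay true long after the pushing script has stopped running.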

Best practices

  1. Data lifecycle management

Set up a periodic cleanup script:

bash
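# Remove persistence files that have not been modified for more than 2 hours
# (metrics held in memory by a running Pushgateway are not affected; delete those through the API)
find /persistence -name "*.store" -mmin +120 -exec rm -f {} \;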

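Pushgateway startup flags:

bash
# Pushgateway has no built-in expiry (TTL) for pushed metrics; stale groups have to be removed with DELETE requests.
# The admin API adds an endpoint that wipes all stored metrics:
#   curl -X PUT http://pushgateway-host:9091/api/v1/admin/wipe
--web.enable-admin-api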
  2. Monitoring Pushgateway itself
yaml
# prometheus.yml
  - job_name: pushgateway_monitor
    metrics_path: /metrics
    static_configs:
      - targets: ['pushgateway-host:9091']
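  3. Security hardening
  • Enable basic authentication
  • Configure firewall rules to restrict which IPs may connect
  • Use HTTPS for transport
  4. Performance tuning
  • Batch metrics into fewer push requests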
  • Keep the push interval in line with the Prometheus scrape_interval (Pushgateway itself has no grouping_interval setting)
  • Monitor the Pushgateway's memory usage