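1 Goal
Goal: visually monitor server resource usage (CPU, memory, disk, and GPU) as well as Docker container usage.
The setup has one monitoring node and three monitored GPU servers; the steps are identical on every monitored node.
2 Collecting data with Grafana + Prometheus
Grafana provides the dashboards; Prometheus scrapes and stores the metrics. The exporters are:
- CPU, memory, disk, etc.: node-exporter
- GPU: nvidia_gpu_exporter, dcgm-exporter
- Docker containers: cadvisor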
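For quick reference, the ports that each exporter is scraped on later in this article can be collected in one small helper. This is only a sketch: the function name `exporter_port` is hypothetical, and the port numbers are simply the ones used in the scrape configuration further down.

```shell
# Print the port each exporter is scraped on in this article.
# exporter_port is a hypothetical helper name, not part of any tool.
exporter_port() {
  case "$1" in
    node-exporter)       echo 9100 ;;
    nvidia_gpu_exporter) echo 9835 ;;
    dcgm-exporter)       echo 9400 ;;
    cadvisor)            echo 9080 ;;  # published on 9080 instead of cadvisor's default 8080
    *) echo "unknown exporter: $1" >&2; return 1 ;;
  esac
}
```

For example, `exporter_port cadvisor` prints the port to use when building a target address for the container job.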
2.1 Monitoring node
Install Grafana
cd /opt
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.5.1-1.x86_64.rpm
ls
yum install -y grafana-enterprise-9.5.1-1.x86_64.rpm
# Manage the service with systemctl
systemctl enable grafana-server.service
systemctl start grafana-server.service
systemctl status grafana-server.service
Open http://172.xx.xx.xx:3000/ to confirm the installation. The default username and password are both admin; you can change the password after logging in.
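Besides opening the page in a browser, Grafana can also be checked from the command line through its /api/health endpoint, which returns a small JSON document with a "database" field. The sketch below only inspects that field; the function name `grafana_health_ok` and the `GRAFANA_URL` variable are hypothetical names chosen for illustration.

```shell
# Read Grafana's /api/health JSON on stdin and succeed only if
# the "database" field is "ok". The function name is hypothetical.
grafana_health_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Example (GRAFANA_URL is a placeholder for your own address):
# curl -fsS "${GRAFANA_URL:-http://localhost:3000}/api/health" | grafana_health_ok && echo "Grafana is up"
```

Keeping the parsing separate from the HTTP call makes the check easy to reuse in a cron job or a deployment script.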
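Set the UI language to Chinese by editing the configuration file:
vim /etc/grafana/grafana.ini
# Change the following setting
default_language = zh-Hans
Editing the configuration file may not be necessary: in the web UI, go to Administration -> Default preferences, choose the language, and save. The screenshot below shows the result.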
Install Prometheus
Prometheus on the monitoring node collects the data from every host.
cd /opt
wget https://mirrors.tuna.tsinghua.edu.cn/github-release/prometheus/prometheus/LatestRelease/prometheus-2.37.8.linux-amd64.tar.gz
tar -zxvf prometheus-2.37.8.linux-amd64.tar.gz
mv prometheus-2.37.8.linux-amd64/ prometheus-linux-amd64/
cd prometheus-linux-amd64/
nohup ./prometheus &
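Verify at http://172.XX.xx.xx:9090/ (9090 is the default port). If the page loads, the installation works. Starting Prometheus with nohup is fragile, though, because nothing restarts it after a crash or reboot, so we manage it with systemd instead.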
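Prometheus also exposes a readiness endpoint at /-/ready, so the same check can be scripted. The following is a minimal sketch that polls it a few times; the function name `wait_for_prometheus` and its default arguments are assumptions made for this example.

```shell
# Poll the Prometheus readiness endpoint until it answers or the
# attempts run out. Function name and defaults are hypothetical.
wait_for_prometheus() {
  local base_url="${1:-http://localhost:9090}"
  local attempts="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors such as 503 while starting up
    if curl -fsS -o /dev/null "${base_url}/-/ready"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

Calling `wait_for_prometheus http://localhost:9090 30` right after starting the process gives it up to about 30 seconds to come up.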
Set up the systemd service
vim /lib/systemd/system/prometheus.service
File contents:
[Unit]
Description=Prometheus Server
After=network.target
[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/prometheus-linux-amd64/prometheus --config.file=/opt/prometheus-linux-amd64/prometheus.yml
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus
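Install node-exporter on the monitoring node
node-exporter collects host-level metrics. We install it first to check that data comes through.
# Check whether it can be installed directly from the yum repositories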
yum search node-export
yum install golang-github-prometheus-node-exporter
systemctl enable node_exporter.service
systemctl start node_exporter.service
systemctl status node_exporter.service
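Verify: curl http://172.xx.xx.xx:9100/metrics (node-exporter listens on port 9100 by default, which is also the port used in the scrape configuration below). If it returns metrics, or at least returns without an error, the installation succeeded.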
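The /metrics output is plain text: lines starting with # are HELP/TYPE comments and every other non-empty line is one sample. A rough way to tell "data" from "no data" is to count the sample lines. This is a sketch only; `count_samples` is a hypothetical helper name.

```shell
# Count sample lines (non-empty and not starting with '#') in
# Prometheus text-format output read from stdin.
# count_samples is a hypothetical helper name.
count_samples() {
  awk 'NF && $0 !~ /^#/ { n++ } END { print n + 0 }'
}

# Example:
# curl -fsS http://localhost:9100/metrics | count_samples
```

A result of 0 means the exporter answered but exposed nothing, which is worth investigating before adding the node to Prometheus.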
Install node-exporter on every other node in the same way.
Edit the prometheus.yml configuration file
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  # Add the following. Set each target's IP and port for the node you want to monitor; instance sets the display name.
  - job_name: linux
    static_configs:
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitoring node
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 1
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 2
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 3
systemctl restart prometheus
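A YAML mistake in prometheus.yml will stop the service from starting, so it can help to validate the file before restarting. The Prometheus tarball ships a promtool binary with a `check config` subcommand. The sketch below assumes promtool sits next to the prometheus binary under /opt/prometheus-linux-amd64; the function name `reload_prometheus` and the `PROMTOOL` variable are hypothetical.

```shell
# Restart Prometheus only if promtool accepts the configuration.
# reload_prometheus and PROMTOOL are hypothetical names; the default
# paths assume the install location used earlier in this article.
reload_prometheus() {
  local config="${1:-/opt/prometheus-linux-amd64/prometheus.yml}"
  local promtool="${PROMTOOL:-/opt/prometheus-linux-amd64/promtool}"
  if "$promtool" check config "$config"; then
    systemctl restart prometheus
  else
    echo "config check failed; not restarting" >&2
    return 1
  fi
}
```

With this in place, a broken edit leaves the running Prometheus untouched instead of taking monitoring down.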
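Open the Prometheus web UI:
Check that each node-exporter target is healthy:
2.2 Monitored nodes
Perform the following steps on each monitored node, installing only what you need.
Install node-exporter
Same as on the monitoring node, so no further explanation.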
yum install golang-github-prometheus-node-exporter
systemctl status node_exporter.service
systemctl enable node_exporter.service
systemctl start node_exporter.service
Install a GPU exporter: nvidia_gpu_exporter
cd /opt
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.2.0/nvidia_gpu_exporter_1.2.0_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_1.2.0_linux_x86_64.tar.gz
mv nvidia_gpu_exporter /usr/bin
Set up the systemd service: vim /lib/systemd/system/nvidia_gpu_exporter.service, with the following contents:
[Unit]
Description=Nvidia GPU Exporter
After=network-online.target
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/bin/nvidia_gpu_exporter
SyslogIdentifier=nvidia_gpu_exporter
Restart=always
RestartSec=1
NoNewPrivileges=yes
ProtectHome=yes
ProtectSystem=strict
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectProc=yes
[Install]
WantedBy=multi-user.target
Start the service and enable it at boot
systemctl daemon-reload
systemctl start nvidia_gpu_exporter.service
systemctl status nvidia_gpu_exporter.service
systemctl enable nvidia_gpu_exporter.service
Test: curl http://localhost:9835/metrics
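The exporter output is long, so it is often easier to look at one metric at a time. The sketch below filters the text output by exact metric name. `metric_samples` is a hypothetical helper, and the metric name in the example comment is an assumption; check your own exporter output for the names it actually exposes.

```shell
# Print only the sample lines whose metric name equals $1, reading
# Prometheus text-format output from stdin.
# metric_samples is a hypothetical helper name.
metric_samples() {
  awk -v name="$1" '
    $0 ~ /^#/ { next }              # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)      # strip the {label="..."} part
      if (metric == name) print $0
    }'
}

# Example (the metric name is an assumption, not confirmed by this article):
# curl -fsS http://localhost:9835/metrics | metric_samples nvidia_smi_temperature_gpu
```

Matching on the exact name avoids also printing metrics that merely share a prefix.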
Install another GPU exporter: dcgm-exporter
Here we run it with Docker; install Docker first if it is not already present.
docker pull nvidia/dcgm-exporter
docker run -d --gpus all --rm -p 9400:9400 nvidia/dcgm-exporter:latest
curl http://localhost:9400/metrics
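Since this exporter depends on Docker being installed, a quick prerequisite check before running the commands above can save some confusion. This is a minimal sketch; `missing_commands` is a hypothetical helper, and checking for nvidia-smi in the example is an assumption about what a GPU node should have.

```shell
# Print every command from the argument list that is not on PATH.
# Returns non-zero if at least one is missing.
# missing_commands is a hypothetical helper name.
missing_commands() {
  local cmd
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# missing_commands docker nvidia-smi || echo "install the tools listed above first"
```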
Install the container monitor cadvisor
cadvisor also runs as a container. Its default port is 8080, so we publish it on 9080 to avoid conflicts.
docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --volume=/dev/disk/:/dev/disk:ro --publish=9080:8080 --detach=true --name=cadvisor google/cadvisor:latest
docker ps -a
curl http://localhost:9080/metrics
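To confirm that cadvisor actually sees your containers, one option is to pull the distinct `name` label values out of the metrics output. The sketch below assumes container samples carry a non-empty `name="..."` label, which is how cadvisor usually labels Docker containers; `container_names` is a hypothetical helper.

```shell
# List the distinct non-empty name="..." label values found in
# Prometheus text-format output read from stdin.
# container_names is a hypothetical helper name.
container_names() {
  grep -o '[{,]name="[^"]\{1,\}"' \
    | sed 's/^.name="//; s/"$//' \
    | sort -u
}

# Example:
# curl -fsS http://localhost:9080/metrics | container_names
```

The output should roughly match the NAMES column of `docker ps` for running containers.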
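Update the prometheus.yml file
The exporters are now installed on every monitored node and their metrics are available at /metrics. Next, edit prometheus.yml and add a job for each exporter.
The complete configuration file: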
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: linux
    static_configs:
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitoring node
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 1
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 2
      - targets: ['10.0.xx.xx:9100']
        labels:
          instance: monitored GPU node 3
  - job_name: gpu # Job name; shown on the Prometheus targets page
    static_configs:
      - targets: ['10.0.xx.xx:9835']
        labels:
          instance: monitored GPU node 1
      - targets: ['10.0.xx.xx:9835']
        labels:
          instance: monitored GPU node 2
      - targets: ['10.0.xx.xx:9835']
        labels:
          instance: monitored GPU node 3
  - job_name: gpu_dcgm
    static_configs:
      - targets: ['10.0.xx.xx:9400']
        labels:
          instance: monitored GPU node 1
      - targets: ['10.0.xx.xx:9400']
        labels:
          instance: monitored GPU node 2
      - targets: ['10.0.xx.xx:9400']
        labels:
          instance: monitored GPU node 3
  - job_name: docker
    static_configs:
      - targets: ['10.0.xx.xx:9080']
        labels:
          instance: "monitored GPU node 1:9080"
      - targets: ['10.0.xx.xx:9080']
        labels:
          instance: "monitored GPU node 2:9080"
      - targets: ['10.0.xx.xx:9080']
        labels:
          instance: "monitored GPU node 3:9080"
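The job sections in this file are highly repetitive, differing only in job name, port, and node list. If you manage more nodes, a small generator can reduce copy-paste errors. The following is only a sketch: `scrape_job` is a hypothetical helper, the label names and IP addresses in the example are placeholders, and the output is meant to be pasted under `scrape_configs:` after review.

```shell
# Print one scrape job stanza for prometheus.yml.
# Usage: scrape_job JOB PORT LABEL=IP [LABEL=IP ...]
# scrape_job is a hypothetical helper name.
scrape_job() {
  local job="$1"
  local port="$2"
  local pair
  shift 2
  printf '  - job_name: %s\n    static_configs:\n' "$job"
  for pair in "$@"; do
    # ${pair#*=} is the IP after the first '=', ${pair%%=*} the label before it
    printf "      - targets: ['%s:%s']\n" "${pair#*=}" "$port"
    printf '        labels:\n          instance: %s\n' "${pair%%=*}"
  done
}

# Example (placeholder labels and addresses):
# scrape_job gpu 9835 gpu-node-1=10.0.0.11 gpu-node-2=10.0.0.12
```

Labels that contain a colon or other YAML-special characters would still need quoting by hand.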
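Configure the Grafana dashboards
Import four dashboards by ID: 13631, 12239, 11074, and 14574. Each can be customized, and more dashboards are available by searching the Grafana website.
Dashboard 14574 as an example: if a dashboard does not meet your needs, adjust it through its settings.
Dashboard 14574 shows only the GPU UUID by default, so I added two variables, instance and device, which make it easy to see the GPUs on each physical server.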
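Dashboard variables such as instance are filled from label values stored in Prometheus. Before wiring a label into a variable, it can be useful to see which values Prometheus actually holds for it, using the label-values endpoint of the Prometheus HTTP API. This is a sketch under assumptions: `label_values_url` is a hypothetical helper, and whether a `device` label exists depends on the exporter behind the dashboard.

```shell
# Build the Prometheus HTTP API URL that lists all values of one label.
# label_values_url is a hypothetical helper name.
label_values_url() {
  local base_url="${1%/}"   # drop one trailing slash, if any
  local label="$2"
  printf '%s/api/v1/label/%s/values\n' "$base_url" "$label"
}

# Example:
# curl -fsS "$(label_values_url http://localhost:9090 instance)"
```

If the response lists the instance names set in prometheus.yml, the same label should work as a dashboard variable.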