Setting up monitoring and alerting with Prometheus + Alertmanager + Grafana + Spring Boot

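Prometheus: collects metrics across all dimensions from the Spring Boot applications and the virtual machines they run on
Grafana: graphical UI that displays the data collected by Prometheus
Alertmanager: sends the alerts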


I. Spring Boot configuration

1. Add the dependencies to the project's pom.xml

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

2. Add the following to application.properties

management.endpoints.web.exposure.include=prometheus
management.metrics.tags.application=projectname

After configuring, restart the project and open the following URL in a browser:

http://ip:port/projectname/actuator/prometheus

If the page returns metric data in the Prometheus text format, the setup works.

Prometheus reads its data from this same endpoint.

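The same check can be done from code instead of a browser. The following is a minimal sketch using only the JDK (Java 11+), not part of the original setup: the class name MetricsCheck and the default address localhost:8080 are assumptions for illustration, and jvm_memory_used_bytes is used as the probe because the alert rules later in this post rely on it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class MetricsCheck {
    /** Returns true if the exposition text contains at least one sample of the given metric. */
    static boolean hasMetric(String body, String metricName) {
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // The metric name ends at the first '{' (labels) or space (value).
            int end = trimmed.length();
            int brace = trimmed.indexOf('{');
            int space = trimmed.indexOf(' ');
            if (brace >= 0) end = Math.min(end, brace);
            if (space >= 0) end = Math.min(end, space);
            if (trimmed.substring(0, end).equals(metricName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Assumed default address; pass the real URL as the first argument.
        String url = args.length > 0 ? args[0]
                : "http://localhost:8080/projectname/actuator/prometheus";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println("jvm_memory_used_bytes present: "
                + hasMetric(response.body(), "jvm_memory_used_bytes"));
    }
}
```

Saved as MetricsCheck.java, it can be run with `java MetricsCheck.java http://ip:port/projectname/actuator/prometheus`.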

II. Prometheus configuration

1. Install

# Download
$ wget https://github.com/prometheus/prometheus/releases/download/v2.15.0/prometheus-2.15.0.darwin-amd64.tar.gz

# Extract
$ tar -zxvf prometheus-2.15.0.darwin-amd64.tar.gz
$ cd prometheus-2.15.0.darwin-amd64

# List the directory
$ ls -ls
24 -rw-r--r--@ 1 yunai staff 11357 Dec 23 22:03 LICENSE
8 -rw-r--r--@ 1 yunai staff 3184 Dec 23 22:03 NOTICE
0 drwxr-xr-x@ 4 yunai staff 128 Dec 23 22:03 console_libraries
0 drwxr-xr-x@ 9 yunai staff 288 Dec 23 22:03 consoles
158776 -rwxr-xr-x@ 1 yunai staff 81289464 Dec 23 20:13 prometheus # Prometheus executable
8 -rw-r--r--@ 1 yunai staff 926 Dec 23 22:03 prometheus.yml # configuration file
92704 -rwxr-xr-x@ 1 yunai staff 47461216 Dec 23 20:15 promtool
26512 -rwxr-xr-x@ 1 yunai staff 13572848 Dec 23 20:16 tsdb

2. Configure prometheus.yml to scrape the target projects

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  scrape_timeout: 10s # Same as the global default (10s).
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
      scheme: http
      timeout: 10s
      api_version: v1

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /data/alertmanager/alert-rules.yml

# Scrape configurations: one job per Spring Boot project.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'projectname'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /projectname/actuator/prometheus
    scheme: http
    static_configs:
      - targets:
          - ip:port
  - job_name: 'projectname2'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /projectname2/actuator/prometheus
    scheme: http
    static_configs:
      - targets:
          - ip:port

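Multiple jobs can be configured to scrape data from multiple projects.

The alerting and rule_files sections are the configuration that ties Prometheus to Alertmanager alerting.

3. Start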

prometheus --web.enable-lifecycle --config.file=/data/prometheus/prometheus.yml > /data/prometheus/logs 2>&1 &

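Add the --web.enable-lifecycle flag to the Prometheus start command, as above.

With this flag, a changed configuration can be reloaded by sending an HTTP POST (or PUT) request to the reload endpoint: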

http://ip:port/-/reload
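
Because the reload endpoint expects POST rather than GET, opening the URL in a browser does not trigger a reload. The sketch below sends the request with the JDK's HttpClient (Java 11+). The class name ReloadPrometheus and the default base URL localhost:9090 are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ReloadPrometheus {
    /** Builds the POST request for the lifecycle reload endpoint. */
    static HttpRequest buildReloadRequest(String baseUrl) {
        // Drop a trailing slash so the path does not start with "//".
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return HttpRequest.newBuilder(URI.create(base + "/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default address; pass the real base URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9090";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildReloadRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        // A 200 status means the configuration was reloaded; other statuses carry an error message.
        System.out.println("Reload returned HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

The reload only works when Prometheus was started with --web.enable-lifecycle.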

The Prometheus web UI is available at IP:9090 (it is rather plain, which is why Grafana is used for the charts).


III. Grafana configuration

1. Install and start

# Install Homebrew (the installer is now a Bash script; the old Ruby one-liner is deprecated)
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Update Homebrew
$ brew update

# Install Grafana
$ brew install grafana

# Start
# To start Grafana using homebrew services first make sure homebrew/services is installed.
$ brew tap homebrew/services

# Then start Grafana using:
$ brew services start grafana
==> Successfully started `grafana` (label: homebrew.mxcl.grafana)

With the default configuration, Grafana listens on port 3000 and ships with a built-in "admin/admin" account.

Open IP:3000 to reach the Grafana UI.

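2. Add the Prometheus data source

In Grafana, add a new data source of type Prometheus and set its URL to the Prometheus address (for example http://IP:9090).

Click the green "Save & Test" button to finish adding the Prometheus data source.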

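3. Build the dashboard

For how to configure dashboards and their layout, refer to the official or community documentation. A simpler way is to copy an existing JSON model directly. Here is a detailed JSON that monitors JVM and HTTP endpoint data: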

JSONModel

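In the Grafana management UI, create a new dashboard, change its Name, and then open JSON Model.

In JSON Model, panels holds the actual dashboard configuration. Only panels, templating, and similar fields need to be changed. Do not copy the whole JSON, because fields such as gnetId are unique.

Save the changes and the dashboard is displayed.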

IV. Alertmanager configuration

1. Download and install

# Download
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.darwin-amd64.tar.gz

# Extract
$ tar -zxvf alertmanager-0.20.0.darwin-amd64.tar.gz
$ cd alertmanager-0.20.0.darwin-amd64

# List the directory
$ ls -ls
24 -rw-r--r--@ 1 yunai staff 11357 Dec 11 22:51 LICENSE
8 -rw-r--r--@ 1 yunai staff 457 Dec 11 22:51 NOTICE
52096 -rwxr-xr-x@ 1 yunai staff 26671536 Dec 11 22:16 alertmanager # Alertmanager executable
8 -rw-r--r--@ 1 yunai staff 380 Dec 11 22:51 alertmanager.yml # configuration file
43680 -rwxr-xr-x@ 1 yunai staff 22360744 Dec 11 22:17 amtool

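2. Edit alertmanager.yml

Here a webhook can be configured to call an HTTP endpoint in a separate project. That endpoint then decides how to deliver the alert (e-mail, SMS, and so on) and what content to send.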

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://ip:port/alert'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
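
Alertmanager delivers each notification to the webhook url as an HTTP POST with a JSON body. As a rough illustration of the receiving side, here is a minimal sketch built on the JDK's built-in HttpServer. It is not the author's project code: the class name AlertWebhook, the port 8081, and the regex-based extraction of alertname values are all assumptions made to keep the example dependency-free. A real Spring Boot project would more likely expose a controller and parse the body with a JSON library.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlertWebhook {
    // Crude pattern for "alertname":"..." pairs; it does not handle escaped quotes.
    private static final Pattern ALERT_NAME =
            Pattern.compile("\"alertname\"\\s*:\\s*\"([^\"]*)\"");

    /** Collects the distinct alert names mentioned anywhere in the webhook payload. */
    static Set<String> alertNames(String payload) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = ALERT_NAME.matcher(payload);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumed port; it must match the port in the webhook url of alertmanager.yml.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8081;
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/alert", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String payload = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Placeholder: send e-mail, SMS, etc. here instead of printing.
                System.out.println("Received alerts: " + alertNames(payload));
            }
            // Reply 200 with an empty body so Alertmanager treats the delivery as successful.
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        server.start();
        System.out.println("Listening on port " + port + ", path /alert");
    }
}
```

With this running on the host and port named in the webhook url, each firing or resolved alert group shows up as one printed line.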

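3. Configure the alert rules in alert-rules.yml

The path of alert-rules.yml was set earlier in the Prometheus configuration. Keep the two consistent.

More rules can be defined as needed.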

groups:
  - name: host_monitoring
    rules:
      - alert: HeapMemoryAlert
        # Aggregate by (application, instance) so the labels used in the annotations survive.
        expr: sum(jvm_memory_used_bytes{area="heap"}) by (application, instance) * 100 / sum(jvm_memory_max_bytes{area="heap"}) by (application, instance) > 90
        for: 5m
        labels:
          team: node
        annotations:
          alert_type: Heap memory alert
          application: '{{$labels.application}}'
          instance: '{{$labels.instance}}'
          explain: "Heap memory usage is above 90%, current usage: {{ $value }}%"
      - alert: NonHeapMemoryAlert
        # Note: jvm_memory_max_bytes can be -1 for pools without a configured limit, which skews this ratio.
        expr: sum(jvm_memory_used_bytes{area="nonheap"}) by (application, instance) * 100 / sum(jvm_memory_max_bytes{area="nonheap"}) by (application, instance) > 90
        for: 5m
        labels:
          team: node
        annotations:
          alert_type: Non-heap memory alert
          application: '{{$labels.application}}'
          instance: '{{$labels.instance}}'
          explain: "Non-heap memory usage is above 90%, current usage: {{ $value }}%"
      - alert: QpsAlert
        expr: sum(rate(http_server_requests_seconds_count[5m])) by (application, instance) > 1000
        for: 5m
        labels:
          team: node
        annotations:
          alert_type: QPS alert
          application: '{{$labels.application}}'
          instance: '{{$labels.instance}}'
          explain: "QPS is above 1000, current value: {{ $value }}"
      - alert: Http5xxAlert
        expr: (sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, instance))*100/(sum(rate(http_server_requests_seconds_count[5m])) by (application, instance)) > 5
        for: 5m
        labels:
          team: node
        annotations:
          alert_type: 5xx error alert
          application: '{{$labels.application}}'
          instance: '{{$labels.instance}}'
          explain: "The share of 5xx responses is above 5%, current value: {{ $value }}%"
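
Before relying on a rule, it can help to run its expression against the Prometheus HTTP API and look at the current values. The sketch below builds a URL for the instant-query endpoint /api/v1/query and prints the JSON response. The class name RuleExprCheck and the default base URL localhost:9090 are assumptions for illustration. The default expression is the QPS rule without its threshold, so that every series is returned rather than only those above the limit.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RuleExprCheck {
    /** Builds the instant-query URL for a PromQL expression. */
    static String queryUrl(String baseUrl, String expr) {
        // The expression must be URL-encoded because it contains spaces, braces and quotes.
        return baseUrl + "/api/v1/query?query="
                + URLEncoder.encode(expr, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults; pass the base URL and the expression as arguments to override.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9090";
        String expr = args.length > 1 ? args[1]
                : "sum(rate(http_server_requests_seconds_count[5m])) by (application, instance)";
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(queryUrl(baseUrl, expr)))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The response is JSON; an empty "result" array means no series matched.
        System.out.println(response.body());
    }
}
```

If the result carries no application or instance labels, the annotations that reference them will render empty in the alert.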

4. Start Alertmanager

# Start
alertmanager --config.file=/data/alertmanager/alertmanager.yml > /data/alertmanager/logs 2>&1 &

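Once it is running, the Alertmanager web UI is available at IP:9093 (its default port, the same one targeted in prometheus.yml).

That completes the whole setup.

If you have questions, feel free to contact the author.

Please credit the source when reposting. Thanks.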