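Prometheus: collects metrics across multiple dimensions from the Spring Boot application and the virtual machine it runs on
Grafana: graphical UI that displays the data collected by Prometheus
Alertmanager: alerting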

I. Spring Boot configuration

1. Add the dependencies to the project's pom

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

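2. Add the configuration to application.properties

management.endpoints.web.exposure.include=prometheus
management.metrics.tags.application=projectname

After configuring, restart the project and open the following URL in a browser:

http://ip:port/projectname/actuator/prometheus

If the page returns metric data, the setup works.

Prometheus scrapes its data from this same endpoint.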
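As a quick way to check what the endpoint returns, the sketch below parses a few lines of the Prometheus text exposition format and confirms that every sample carries the application tag set in application.properties. The sample lines and the helper name parse_samples are illustrative assumptions, not output captured from a real project.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of what /actuator/prometheus might return.
EXAMPLE = '''
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="projectname",area="heap",id="PS Eden Space",} 1.048576E7
jvm_threads_live_threads{application="projectname",} 42.0
'''

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels.get("application"), value)
```

If the application label is missing from the parsed samples, the management.metrics.tags.application property was most likely not picked up.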

II. Prometheus configuration

1. Installation

# Download
$ wget https://github.com/prometheus/prometheus/releases/download/v2.15.0/prometheus-2.15.0.darwin-amd64.tar.gz

# Extract
$ tar -zxvf prometheus-2.15.0.darwin-amd64.tar.gz
$ cd prometheus-2.15.0.darwin-amd64

# List the directory
$ ls -ls
    24 -rw-r--r--@ 1 yunai  staff     11357 Dec 23 22:03 LICENSE
     8 -rw-r--r--@ 1 yunai  staff      3184 Dec 23 22:03 NOTICE
     0 drwxr-xr-x@ 4 yunai  staff       128 Dec 23 22:03 console_libraries
     0 drwxr-xr-x@ 9 yunai  staff       288 Dec 23 22:03 consoles
158776 -rwxr-xr-x@ 1 yunai  staff  81289464 Dec 23 20:13 prometheus # Prometheus executable
     8 -rw-r--r--@ 1 yunai  staff       926 Dec 23 22:03 prometheus.yml # configuration file
 92704 -rwxr-xr-x@ 1 yunai  staff  47461216 Dec 23 20:15 promtool
 26512 -rwxr-xr-x@ 1 yunai  staff  13572848 Dec 23 20:16 tsdb

2. Configure prometheus.yml to scrape the target projects

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  scrape_timeout:      10s
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093
    scheme: http
    timeout: 10s
    api_version: v1

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
- /data/alertmanager/alert-rules.yml

# Scrape configurations: one job per Spring Boot project to monitor.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'projectname'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /projectname/actuator/prometheus
    scheme: http
    static_configs:
    - targets:
      - ip:port
  - job_name: 'projectname2'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /projectname2/actuator/prometheus
    scheme: http
    static_configs:
    - targets:
      - ip:port

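Multiple jobs can be configured to scrape data from multiple projects.

The alerting and rule_files sections are the settings that connect Prometheus to Alertmanager.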
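When many projects follow the same layout, the per-project jobs can be generated instead of written by hand. The following is a minimal sketch that renders scrape_configs entries as YAML text; the function names and the host:port values in the usage are hypothetical, and it assumes every project exposes its metrics under /<project>/actuator/prometheus as in the configuration above.

```python
def scrape_job(project, target, interval="15s", timeout="10s"):
    """Render one scrape_configs entry as YAML text for a Spring Boot project."""
    lines = [
        f"  - job_name: '{project}'",
        f"    scrape_interval: {interval}",
        f"    scrape_timeout: {timeout}",
        f"    metrics_path: /{project}/actuator/prometheus",
        "    scheme: http",
        "    static_configs:",
        "    - targets:",
        f"      - {target}",
    ]
    return "\n".join(lines)


def scrape_configs(projects):
    """projects maps a project name to its host:port target."""
    jobs = [scrape_job(name, target) for name, target in projects.items()]
    return "scrape_configs:\n" + "\n".join(jobs)


# Hypothetical targets; replace with the real host:port of each project.
print(scrape_configs({
    "projectname": "10.0.0.1:8080",
    "projectname2": "10.0.0.2:8080",
}))
```

The printed text can be pasted in place of the scrape_configs section of prometheus.yml.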

3. Start

prometheus --web.enable-lifecycle --config.file=/data/prometheus/prometheus.yml > /data/prometheus/logs 2>&1 &

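Add the --web.enable-lifecycle flag to the Prometheus startup command.

With it enabled, you can reload the configuration after editing it by sending an HTTP POST or PUT request to this endpoint: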

http://ip:port/-/reload
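Opening the reload URL in a browser sends a GET, which does not trigger a reload; the endpoint expects POST or PUT. A minimal sketch of the call using only the Python standard library follows. The function names are illustrative, and the base URL in the usage note assumes Prometheus is running locally on its default port 9090.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for Prometheus' /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # data=b"" gives the POST an empty body.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling reload_prometheus("http://localhost:9090") after editing prometheus.yml should return 200 when the new configuration loads; urlopen raises an HTTPError if Prometheus rejects the request, for example when the lifecycle flag is not enabled.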

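The Prometheus web UI is available at IP:9090 (it is rather plain, which is why Grafana is used for the charts)

III. Grafana configuration

1. Install and start

# Install brew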
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

# Update brew
$ brew update

# Install Grafana
$ brew install grafana

# Start
# To start Grafana using homebrew services first make sure homebrew/services is installed.
$ brew tap homebrew/services

# Then start Grafana using:
$ brew services start grafana
==> Successfully started `grafana` (label: homebrew.mxcl.grafana)

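With the default configuration, Grafana listens on port 3000 and ships with a built-in "admin/admin" account

Open IP:3000 to reach the admin UI

2. Add the Prometheus data source

Add a data source of type Prometheus that points at the Prometheus address (IP:9090), then click the green "Save & Test" button to finish adding it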
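The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request; it assumes the default admin/admin account and local default ports, and the function name is illustrative.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url,
                             user="admin", password="admin"):
    """Build a POST to Grafana's /api/datasources for a Prometheus source."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen, for example with build_datasource_request("http://localhost:3000", "http://localhost:9090"), sends the request; change the credentials if the default password has been replaced.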

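3. Build the dashboard

See the official or community documentation for how to configure dashboards and layouts. A simpler way is to copy an existing JSON model directly; here is a detailed JSON model that monitors JVM and HTTP endpoint metrics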

JSONModel

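In the Grafana admin UI, create a new dashboard, change its Name, then open JSON Model

In the JSON Model, panels holds the actual dashboard configuration. Only replace panels, templating, and similar fields; do not copy the whole document, because values such as gnetId are unique to each dashboard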
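The selective copy described above can be sketched as a small merge over the two JSON models: take panels and templating from the shared model and keep everything else from the newly created dashboard. The function name and the sample values are hypothetical.

```python
import copy
import json


def merge_dashboard(target, source, keys=("panels", "templating")):
    """Copy only the selected keys from source into a copy of target."""
    merged = copy.deepcopy(target)
    for key in keys:
        if key in source:
            merged[key] = copy.deepcopy(source[key])
    return merged


# Hypothetical models: "new" is the freshly created dashboard,
# "shared" is the JSON model being borrowed from.
new = {"id": 7, "uid": "abc", "gnetId": None, "title": "My project", "panels": []}
shared = {
    "id": 1,
    "gnetId": 12345,
    "title": "Shared JVM dashboard",
    "panels": [{"title": "Heap used"}],
    "templating": {"list": []},
}

print(json.dumps(merge_dashboard(new, shared), indent=2))
```

The printed JSON keeps the new dashboard's id, uid, gnetId, and title, so it can be pasted back into JSON Model without clashing with the source dashboard.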

Save the changes to see the resulting dashboard

IV. Alertmanager configuration

1. Download and install

# Download
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.darwin-amd64.tar.gz

# Extract
$ tar -zxvf alertmanager-0.20.0.darwin-amd64.tar.gz
$ cd alertmanager-0.20.0.darwin-amd64

# List the directory
$ ls -ls
   24 -rw-r--r--@ 1 yunai  staff     11357 Dec 11 22:51 LICENSE
    8 -rw-r--r--@ 1 yunai  staff       457 Dec 11 22:51 NOTICE
52096 -rwxr-xr-x@ 1 yunai  staff  26671536 Dec 11 22:16 alertmanager # Alertmanager executable
    8 -rw-r--r--@ 1 yunai  staff       380 Dec 11 22:51 alertmanager.yml # configuration file
43680 -rwxr-xr-x@ 1 yunai  staff  22360744 Dec 11 22:17 amtool

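2. Edit alertmanager.yml

Here you can configure a webhook that calls an HTTP endpoint in a separate project; that endpoint then decides how to deliver the alert (email, SMS, and so on) and what it says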

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://ip:port/alert'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
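On the receiving side, Alertmanager POSTs a JSON document whose alerts array holds one entry per alert, each with status, labels, and annotations. The following is a minimal sketch of an endpoint for the /alert path configured above, written with the Python standard library rather than a Spring Boot controller. The port, the handler and function names, and the print-based delivery are assumptions; the explain annotation is the one defined in the alert rules of this article.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} on {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
            annotations.get("explain", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alert":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)  # placeholder: send an email, SMS, etc. here
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Block forever, handling webhook calls on the given port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener; the url in webhook_configs would then point at that host and port followed by /alert.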

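3. Configure the alert rules in alert-rules.yml

The file must live at the same alert-rules.yml path that was set under rule_files in prometheus.yml earlier

You can define more rules of your own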

groups:
  - name: host_monitoring
    rules:
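    # Aggregating by (application, instance) keeps those labels available to the annotations.
    - alert: Heap memory alert
      expr: sum(jvm_memory_used_bytes{area="heap"}) by (application, instance) * 100 / sum(jvm_memory_max_bytes{area="heap"}) by (application, instance) > 90
      for: 5m
      labels:
        team: node
      annotations:
        alert_type: Heap memory alert
        application: '{{$labels.application}}'
        instance: '{{$labels.instance}}'
        explain: "Heap memory usage exceeds 90%, current usage: {{ $value }}%"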
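    # Aggregating by (application, instance) keeps those labels available to the annotations.
    - alert: Non-heap memory alert
      expr: sum(jvm_memory_used_bytes{area="nonheap"}) by (application, instance) * 100 / sum(jvm_memory_max_bytes{area="nonheap"}) by (application, instance) > 90
      for: 5m
      labels:
        team: node
      annotations:
        alert_type: Non-heap memory alert
        application: '{{$labels.application}}'
        instance: '{{$labels.instance}}'
        explain: "Non-heap memory usage exceeds 90%, current usage: {{ $value }}%"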
    - alert: QPS alert
      expr: sum(rate(http_server_requests_seconds_count[5m])) by (application, instance) > 1000
      for: 5m
      labels:
        team: node
      annotations:
        alert_type: QPS alert
        application: '{{$labels.application}}'
        instance: '{{$labels.instance}}'
        explain: "QPS exceeds 1000, current value: {{ $value }}"
    - alert: 5xx status code alert
      expr: (sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, instance))*100/(sum(rate(http_server_requests_seconds_count[5m])) by (application, instance)) > 5
      for: 5m
      labels:
        team: node
      annotations:
        alert_type: 5xx status code alert
        application: '{{$labels.application}}'
        instance: '{{$labels.instance}}'
        explain: "Share of 5xx responses exceeds 5%, current value: {{ $value }}%"
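To make the memory expressions concrete, the sketch below reproduces the sum(used) * 100 / sum(max) arithmetic on made-up per-pool values in MiB. It also illustrates a caveat to keep in mind: Micrometer can report jvm_memory_max_bytes as -1 for pools with no configured limit (commonly non-heap pools such as Metaspace), and those entries shrink the denominator. All numbers here are hypothetical.

```python
def usage_percent(used, maximum):
    """Mirror the rule arithmetic: sum(used) * 100 / sum(max)."""
    return sum(used) * 100 / sum(maximum)


# Hypothetical heap pools (MiB): every pool has a real limit.
heap_used = [400, 50, 500]
heap_max = [500, 100, 400]

# Hypothetical non-heap pools (MiB): two pools report -1, meaning "no limit".
nonheap_used = [60, 8, 40]
nonheap_max = [-1, -1, 240]

heap_pct = usage_percent(heap_used, heap_max)
nonheap_pct = usage_percent(nonheap_used, nonheap_max)

# Same calculation restricted to pools that actually have a limit.
limited = [(u, m) for u, m in zip(nonheap_used, nonheap_max) if m > 0]
limited_pct = usage_percent([u for u, _ in limited], [m for _, m in limited])

print("heap alert fires:", heap_pct > 90)
print("non-heap, all pools:", round(nonheap_pct, 2))
print("non-heap, limited pools only:", round(limited_pct, 2))
```

If the non-heap pools in your JVM report -1, one option is to restrict the rule to pools with a limit, or to alert on absolute usage instead of a percentage.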

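4. Start Alertmanager

# Start
alertmanager --config.file=/data/alertmanager/alertmanager.yml > /data/alertmanager/logs 2>&1 &

Once started, the Alertmanager web UI is available at IP:9093

With that, the whole pipeline is in place

Setting up monitoring and alerting with prometheus+alertmanager+grafana+springboot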