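Monitoring Kubernetes with Prometheus Operator

Introduction

Prometheus Operator is a Prometheus-based monitoring solution for Kubernetes developed by CoreOS, and is probably the most full-featured open-source option available today. For more information, see https://github.com/coreos/prometheus-operator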

Deploying Prometheus Operator

Preparation

1. Create a namespace

For easier management, create a dedicated namespace named monitoring. All Prometheus Operator components will be deployed into this namespace.

# kubectl create namespace monitoring

2. Import the required images

Load prometheus-operator.tar on every node. Download link: prometheus-operator.tar

# docker load -i prometheus-operator.tar

Installing Prometheus Operator

1. Install Prometheus Operator with Helm

All Prometheus Operator components are packaged as a Helm chart, which makes installation straightforward.

# helm install --name prometheus-operator --namespace=monitoring stable/prometheus-operator
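The command above uses Helm 2 syntax. Helm 3 dropped the --name flag and takes the release name as the first positional argument instead. The sketch below prints the matching command for either major version; the function name helm_install_cmd is made up for illustration, and the chart reference is simply the one used in this article.

```shell
# Hypothetical helper: print the install command for a Helm major version.
# $1 = Helm major version (2 or 3)
helm_install_cmd() {
  case "$1" in
    2) echo "helm install --name prometheus-operator --namespace=monitoring stable/prometheus-operator" ;;
    3) echo "helm install prometheus-operator stable/prometheus-operator --namespace monitoring" ;;
    *) echo "unsupported Helm major version: $1" >&2; return 1 ;;
  esac
}

# Print the Helm 3 form of the command used in this article.
helm_install_cmd 3
```

Either form targets the monitoring namespace created earlier.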

2. View the created resources

# kubectl get all -n monitoring 
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/alertmanager-prometheus-operator-alertmanager-0           2/2     Running   0          60s
pod/prometheus-operator-grafana-6c8f4bcfb4-jp5bh              3/3     Running   0          65s
pod/prometheus-operator-kube-state-metrics-6b6d6b8bbd-gff7j   1/1     Running   0          65s
pod/prometheus-operator-operator-76f78fd685-295rb             1/1     Running   0          65s
pod/prometheus-operator-prometheus-node-exporter-44tgz        1/1     Running   0          65s
pod/prometheus-operator-prometheus-node-exporter-6t4sc        1/1     Running   0          65s
pod/prometheus-operator-prometheus-node-exporter-vnwrv        1/1     Running   0          65s
pod/prometheus-prometheus-operator-prometheus-0               3/3     Running   1          54s

NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   60s
service/prometheus-operated                            ClusterIP   None             <none>        9090/TCP            54s
service/prometheus-operator-alertmanager               ClusterIP   10.105.62.219    <none>        9093/TCP            65s
service/prometheus-operator-grafana                    ClusterIP   10.103.30.59     <none>        80/TCP              65s
service/prometheus-operator-kube-state-metrics         ClusterIP   10.105.189.63    <none>        8080/TCP            65s
service/prometheus-operator-operator                   ClusterIP   10.105.212.90    <none>        8080/TCP            65s
service/prometheus-operator-prometheus                 ClusterIP   10.104.229.158   <none>        9090/TCP            65s
service/prometheus-operator-prometheus-node-exporter   ClusterIP   10.103.226.249   <none>        9100/TCP            65s

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-operator-prometheus-node-exporter   3         3         3       3            3           <none>          65s

NAME                                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-operator-grafana              1/1     1            1           65s
deployment.apps/prometheus-operator-kube-state-metrics   1/1     1            1           65s
deployment.apps/prometheus-operator-operator             1/1     1            1           65s

NAME                                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-operator-grafana-6c8f4bcfb4              1         1         1       65s
replicaset.apps/prometheus-operator-kube-state-metrics-6b6d6b8bbd   1         1         1       65s
replicaset.apps/prometheus-operator-operator-76f78fd685             1         1         1       65s

NAME                                                             READY   AGE
statefulset.apps/alertmanager-prometheus-operator-alertmanager   1/1     60s
statefulset.apps/prometheus-prometheus-operator-prometheus       1/1     54s

3. View the installed release

# helm list 
NAME               	REVISION	UPDATED                 	STATUS  	CHART                    	APP VERSION	NAMESPACE 
prometheus-operator	1       	Tue Jan  8 13:49:12 2019	DEPLOYED	prometheus-operator-1.5.1	0.26.0     	monitoring

The prometheus-operator chart automatically installs Prometheus, Alertmanager, and Grafana.

Changing the access mode

1. Check the current service types

# kubectl get svc -n monitoring 
NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   7m30s
prometheus-operated                            ClusterIP   None             <none>        9090/TCP            7m24s
prometheus-operator-alertmanager               ClusterIP   10.105.62.219    <none>        9093/TCP            7m35s
prometheus-operator-grafana                    ClusterIP   10.103.30.59     <none>        80/TCP              7m35s
prometheus-operator-kube-state-metrics         ClusterIP   10.105.189.63    <none>        8080/TCP            7m35s
prometheus-operator-operator                   ClusterIP   10.105.212.90    <none>        8080/TCP            7m35s
prometheus-operator-prometheus                 ClusterIP   10.104.229.158   <none>        9090/TCP            7m35s
prometheus-operator-prometheus-node-exporter   ClusterIP   10.103.226.249   <none>        9100/TCP            7m35s

The default service type is ClusterIP, which is reachable only from inside the cluster and not from outside.

2. Change the service type of alertmanager, prometheus, and grafana

grafana:

# kubectl edit svc prometheus-operator-grafana -n monitoring

……
spec:
  clusterIP: 10.103.30.59
  ports:
  - name: service
    port: 80
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
    release: prometheus-operator
  sessionAffinity: None
  type: NodePort        # change this line

alertmanager:

# kubectl edit svc prometheus-operator-alertmanager -n monitoring

……
spec:
  clusterIP: 10.105.62.219
  ports:
  - name: web
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    alertmanager: prometheus-operator-alertmanager
    app: alertmanager
  sessionAffinity: None
  type: NodePort       # change this line
status:
  loadBalancer: {}

prometheus:

# kubectl edit svc prometheus-operator-prometheus -n monitoring

……
spec:
  clusterIP: 10.104.229.158
  ports:
  - name: web
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    app: prometheus
    prometheus: prometheus-operator-prometheus
  sessionAffinity: None
  type: NodePort      # change this line
status:
  loadBalancer: {}
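The three interactive edits above can also be done non-interactively with kubectl patch. The sketch below only prints the commands by default and runs them when RUN=1 is set; the RUN switch and the patch_cmd helper are assumptions made for this example, not part of the original procedure.

```shell
# Namespace used throughout this article.
NAMESPACE="monitoring"

# Hypothetical helper: print the patch command for one service.
# $1 = service name
patch_cmd() {
  printf "kubectl patch svc %s -n %s -p '%s'\n" \
    "$1" "$NAMESPACE" '{"spec":{"type":"NodePort"}}'
}

for svc in prometheus-operator-grafana \
           prometheus-operator-alertmanager \
           prometheus-operator-prometheus; do
  cmd=$(patch_cmd "$svc")
  if [ "${RUN:-0}" = "1" ]; then
    eval "$cmd"      # apply the change to the cluster
  else
    echo "$cmd"      # dry run: show what would be executed
  fi
done
```

Printing first makes it easy to review the commands before touching the cluster; Kubernetes then assigns a random NodePort to each service, just as it does after a manual edit.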

3. Check the service types after the change

# kubectl get svc -n monitoring 
NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   23m
prometheus-operated                            ClusterIP   None             <none>        9090/TCP            23m
prometheus-operator-alertmanager               NodePort    10.105.62.219    <none>        9093:32645/TCP      23m
prometheus-operator-grafana                    NodePort    10.103.30.59     <none>        80:30043/TCP        23m
prometheus-operator-kube-state-metrics         ClusterIP   10.105.189.63    <none>        8080/TCP            23m
prometheus-operator-operator                   ClusterIP   10.105.212.90    <none>        8080/TCP            23m
prometheus-operator-prometheus                 NodePort    10.104.229.158   <none>        9090:32275/TCP      23m
prometheus-operator-prometheus-node-exporter   ClusterIP   10.103.226.249   <none>        9100/TCP            23m
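NodePort numbers are assigned at random, so the values 32645, 30043, and 32275 will differ on another cluster. As a small sketch, the assigned ports can be pulled out of the listing with awk; list_node_ports is a hypothetical helper name, and it reads only the first port of each service.

```shell
# Hypothetical helper: read `kubectl get svc` output on stdin and print
# "<service> <nodePort>" for every service of type NodePort.
list_node_ports() {
  # Column 5 looks like 9090:32275/TCP; split on ":" and "/" to get the port.
  awk '$2 == "NodePort" { split($5, p, "[:/]"); print $1, p[2] }'
}

# Sample rows copied from the listing above. On a live cluster use:
#   kubectl get svc -n monitoring | list_node_ports
list_node_ports <<'EOF'
NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
prometheus-operated                            ClusterIP   None             <none>        9090/TCP            23m
prometheus-operator-alertmanager               NodePort    10.105.62.219    <none>        9093:32645/TCP      23m
prometheus-operator-grafana                    NodePort    10.103.30.59     <none>        80:30043/TCP        23m
prometheus-operator-prometheus                 NodePort    10.104.229.158   <none>        9090:32275/TCP      23m
EOF
```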

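Enabling the kubelet read-only port

Prometheus scrapes kubelet metrics from port 10255. That port is not open by default, so the kubelet targets show up as unhealthy in Prometheus (screenshot: unhealthy).

To open the read-only port, edit /var/lib/kubelet/config.yaml on every node and add the following: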

# /var/lib/kubelet/config.yaml

……
oomScoreAdj: -999
podPidsLimit: -1
port: 10250
readOnlyPort: 10255          # add this line
registryBurst: 10
registryPullQPS: 5
resolvConf: /etc/resolv.conf

Restart the kubelet service:

# systemctl restart kubelet.service
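Because the same edit has to be made on every node, it may help to script it so that it is safe to run more than once. A minimal sketch, assuming readOnlyPort (if present) is a top-level key at the start of a line, as in the excerpt above; the function name enable_read_only_port is made up for illustration.

```shell
# Hypothetical helper: make sure a kubelet config file sets readOnlyPort: 10255.
# $1 = path to the kubelet config file
enable_read_only_port() {
  config="$1"
  if grep -q '^readOnlyPort:' "$config"; then
    # Key already present: rewrite its value through a temp file.
    tmp=$(mktemp) || return 1
    sed 's/^readOnlyPort:.*/readOnlyPort: 10255/' "$config" > "$tmp" &&
      cat "$tmp" > "$config"
    status=$?
    rm -f "$tmp"
    return "$status"
  fi
  # Key missing: append it as a new top-level entry.
  echo 'readOnlyPort: 10255' >> "$config"
}

# Usage on each node (as root):
#   enable_read_only_port /var/lib/kubelet/config.yaml
#   systemctl restart kubelet.service
```

Running it a second time leaves the file unchanged, which matters when the script is pushed to nodes that may already have been fixed by hand.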

Check the Prometheus targets again (screenshot: healthy).

Accessing the dashboards

  1. The Prometheus web UI is at http://nodeip:32275/targets (screenshot: prometheus).

  2. The Alertmanager web UI is at http://nodeip:32645/ (screenshot: alertmanager).

  3. The Grafana dashboard is at http://nodeip:30043/. The default username/password is admin/prom-operator. The dashboards appear after login (screenshots: grafana, grafana-1, grafana-2).
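The three endpoints can also be probed from the command line before opening a browser. The sketch below uses the health endpoints that Prometheus and Alertmanager (/-/healthy) and Grafana (/api/health) expose; the ports are the ones from this article's cluster, nodeip is the article's placeholder, and the helper names are hypothetical.

```shell
# Node address; "nodeip" is the placeholder used in this article.
NODE_IP="${NODE_IP:-nodeip}"

# Hypothetical helper: print the health-check URL for one component.
health_url() {
  case "$1" in
    prometheus)   echo "http://$NODE_IP:32275/-/healthy" ;;
    alertmanager) echo "http://$NODE_IP:32645/-/healthy" ;;
    grafana)      echo "http://$NODE_IP:30043/api/health" ;;
    *)            return 1 ;;
  esac
}

# Hypothetical helper: probe all three components with curl.
check_all() {
  for name in prometheus alertmanager grafana; do
    if curl -fsS -o /dev/null --max-time 5 "$(health_url "$name")"; then
      echo "$name: ok"
    else
      echo "$name: unreachable"
    fi
  done
}

# Usage: NODE_IP=<address of any node> check_all
```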

Troubleshooting

1. prometheus-operator-coredns shows no data

For details, see the issue "Don't scrape metrics from coreDNS". To fix it, change the selector of the prometheus-operator-coredns service to kube-dns:

# kubectl edit svc prometheus-operator-coredns  -n kube-system

……
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns         # change this line
  sessionAffinity: None
  type: ClusterIP

2. prometheus-operator-kube-etcd shows no data

Prometheus scrapes etcd metrics on port 4001, but etcd listens on 2379 by default. The fix is as follows:

# vim /etc/kubernetes/manifests/etcd.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: etcd-server                           # add this line
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.20.6.116:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.20.6.116:2380
    - --initial-cluster=k8s-master=https://172.20.6.116:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.20.6.116:2379,http://172.20.6.116:4001         # add an HTTP listener on port 4001
    - --listen-peer-urls=https://172.20.6.116:2380
    - --name=k8s-master
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Then restart the kubelet service:

# systemctl restart kubelet.service

3. prometheus-operator-kube-controller-manager and prometheus-operator-kube-scheduler show no data

kube-controller-manager and kube-scheduler listen on 127.0.0.1 by default, so Prometheus cannot reach their metrics from outside the loopback interface. The listen address of both components needs to be changed, as follows.

kube-controller-manager:

# vim /etc/kubernetes/manifests/kube-controller-manager.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: kube-controller-manager               # add this line
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=0.0.0.0                            # change the listen address
    - --allocate-node-cidrs=true

kube-scheduler:

# vim /etc/kubernetes/manifests/kube-scheduler.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: kube-scheduler                        # add this line
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0                            # change the listen address
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true

Then restart the kubelet service:

# systemctl restart kubelet.service
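The listen-address change in both manifests can be sketched as a small script. This is a rough sketch, not a definitive procedure: it assumes the manifest already contains an explicit --address=127.0.0.1 flag (if yours does not, edit it by hand as shown above), it does not add the k8s-app label, and open_listen_address is a hypothetical name.

```shell
# Hypothetical helper: rewrite --address=127.0.0.1 to --address=0.0.0.0
# in a static pod manifest. Returns 1 if the flag is not found.
# $1 = path to the manifest
open_listen_address() {
  manifest="$1"
  if ! grep -q -e '--address=127\.0\.0\.1' "$manifest"; then
    echo "no --address=127.0.0.1 flag in $manifest; edit it by hand" >&2
    return 1
  fi
  # Write through a temp file outside the manifests directory so the
  # kubelet never sees a half-written or stray file there.
  tmp=$(mktemp) || return 1
  sed 's/--address=127\.0\.0\.1/--address=0.0.0.0/' "$manifest" > "$tmp" &&
    cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage on the master node (as root):
#   open_listen_address /etc/kubernetes/manifests/kube-controller-manager.yaml
#   open_listen_address /etc/kubernetes/manifests/kube-scheduler.yaml
```

The pattern matches only --address= and leaves flags such as --bind-address= untouched.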
6 comments

Anonymous (Zhejiang):
Hello. After I set this up, etcd and proxy are unhealthy. How should I handle this, or how can I find the cause?

tanmx, replying to Anonymous:
Please post the error logs.

Anonymous (Zhejiang):
Do you have a post about custom monitoring for microservices? With this installation, how do I get my own microservices discovered under Prometheus Targets?

李阿斗:
😰

李阿斗:
What causes this problem?

李阿斗:
[root@k8s-master charts]# helm install --name prometheus-operator --namespace=monitoring stable/prometheus-operator
Error: found in requirements.yaml, but missing in charts/ directory: kube-state-metrics, prometheus-node-exporter, grafana

小苹果:
Hello, I installed etcd from binaries. How do I monitor it?

Anonymous, replying to 小苹果:
Thumbs up, very useful 😋