Kubernetes Monitoring with Prometheus + Grafana
What is Prometheus?
Prometheus is an open-source monitoring and alerting solution originally developed at SoundCloud. Work on it began in 2012, and since it was open-sourced on GitHub in 2015 it has attracted 9k+ stars and adoption by many large companies. In 2016 Prometheus became the second project, after Kubernetes, to join the CNCF. As a new generation of open-source monitoring, many of its ideas align closely with Google's SRE practices.
Architecture:
[Figure: Prometheus architecture diagram]
Installing Prometheus:
1. Create the ConfigMaps for Prometheus and Alertmanager (Prometheus's alerting component)
The YAML manifests used below have been published on GitHub; clone them if you need them.
cat prometheus-cm.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kube-ops
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 30s
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-node-exporter'
      scheme: http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:31672'
        target_label: __address__
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
  rules.yml: |
    groups:
    - name: test-rule
      rules:
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Filesystem usage detected"
          description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High CPU usage detected"
          description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-ops
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
    route:
      receiver: webhook
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [alertname]
      routes:
      - receiver: webhook
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: webhook
      webhook_configs:
      - url: 'http://apollo/hooks/dingtalk/'
        send_resolved: true
      - url: 'http://apollo/hooks/prome/'
        send_resolved: true
kubectl apply -f prometheus-cm.yaml
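A quick sanity check before moving on (a minimal sketch; the names match the manifests above): both ConfigMaps should now exist in the kube-ops namespace.

# Verify the ConfigMaps were created
kubectl get configmap -n kube-ops
# Inspect the rendered Prometheus config if anything looks off
kubectl describe configmap prometheus-config -n kube-ops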
2. Deploy node-exporter (it must run on every node, so it is deployed as a DaemonSet)
vim node-exporter.yaml
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-ops
  labels:
    k8s-app: node-exporter
spec:
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      containers:
      - image: prom/node-exporter
        name: node-exporter
        ports:
        - containerPort: 9100
          protocol: TCP
          name: http
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: kube-ops
spec:
  ports:
  - name: http
    port: 9100
    nodePort: 31672
    protocol: TCP
  type: NodePort
  selector:
    k8s-app: node-exporter
kubectl apply -f node-exporter.yaml
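To confirm the exporters are up, you can hit the NodePort (31672, per the Service above) on any node; <node-ip> below is a placeholder for one of your node IPs:

# One node-exporter pod should be scheduled per node
kubectl get pods -n kube-ops -o wide -l k8s-app=node-exporter
# Fetch a sample of metrics through the NodePort
curl -s http://<node-ip>:31672/metrics | head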
3. Install Prometheus + Alertmanager (note: the data volume here is mounted via a CephFS PV/PVC, which you must create yourself; PV/PVC templates can be found in the GitHub repo linked above)
vim prometheus-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus
  namespace: kube-ops
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsUser: 0
      containers:
      - image: prom/prometheus:v2.0.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 200m
            memory: 1Gi
      - image: quay.io/prometheus/alertmanager:v0.12.0
        name: alertmanager
        args:
        - "-config.file=/etc/alertmanager/config.yml"
        - "-storage.path=/alertmanager"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: http
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 200m
            memory: 200Mi
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-pvc
      - configMap:
          name: prometheus-config
        name: config-volume
      - name: alertmanager-config-volume
        configMap:
          name: alertmanager
kubectl apply -f prometheus-deployment.yaml
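The pod runs two containers (prometheus and alertmanager), so it should report READY 2/2. A minimal check, assuming default kubectl behavior:

# Both containers should be running
kubectl get pods -n kube-ops -l k8s-app=prometheus
# Look for config-load errors in the Prometheus container
kubectl logs -n kube-ops deployment/prometheus -c prometheus --tail=20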
4. Expose the Prometheus port (you could also expose it with an Ingress rule instead; the Grafana deployment below is exposed via Ingress)
vim prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-ops
  labels:
    k8s-app: prometheus
spec:
  selector:
    k8s-app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http
kubectl apply -f prometheus-svc.yaml
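Since the Service is type NodePort without a fixed nodePort, Kubernetes assigns one at random; look it up and then open the web UI (a sketch):

# Find the NodePort mapped to port 9090
kubectl get svc prometheus -n kube-ops
# Then browse to http://<node-ip>:<nodeport>/targets and check that all scrape jobs show UP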
5. Install Grafana (Prometheus's built-in graphing is not very satisfying, so pair it with Grafana, a powerful visualization tool)
The Grafana installation uses a CephFS PV/PVC and an Ingress.
vim grafana.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: kube-ops
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana-ingress
  namespace: kube-ops
spec:
  rules:
  - host: grafana.io
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv
  namespace: kube-ops
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteMany
  cephfs:
    monitors:
    - 192.168.0.231:6789
    - 192.168.0.242:6789
    - 192.168.0.211:6789
    path: /data/system/grafana
    user: admin
    secretRef:
      name: ceph-secret
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: kube-ops
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-ops
spec:
  replicas: 1
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: gcr.io/google_containers/heapster-grafana-amd64:v4.4.3
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var
          name: grafana
          subPath: grafana/data
        - mountPath: /ssl
          name: ssl
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        - name: INFLUXDB_HOST
          value: influxdb.kube-system
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        - name: GF_SERVER_ROOT_URL
          value: /
        - name: GF_SMTP_ENABLED
          value: "true"
        - name: GF_ALERTING_ENABLED
          value: "true"
        - name: GF_ALERTING_EXECUTE_ALERTS
          value: "true"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          initialDelaySeconds: 30
          timeoutSeconds: 2
      volumes:
      - name: ssl
        hostPath:
          path: /etc/ssl/certs
      - name: grafana
        persistentVolumeClaim:
          claimName: prometheus-pvc
kubectl apply -f grafana.yaml
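After applying, it is worth verifying that the Ingress and the Grafana pod came up (a sketch):

kubectl get ingress -n kube-ops
kubectl get pods -n kube-ops -l k8s-app=grafana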
The Ingress here uses grafana.io as its hostname, so (if you have no load balancer in front) you need a local hosts entry resolving it to a node IP.
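For example, a hosts entry like the following on the machine running your browser, where <node-ip> is a placeholder for a node on which the ingress controller listens:

# /etc/hosts
<node-ip>  grafana.io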
Open the following page and fill in the data source details.
[Figure: Grafana data source configuration]
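Since Grafana runs inside the same cluster, the data source can point straight at the Prometheus Service via cluster DNS. A sketch of the values to enter (the in-cluster DNS name below follows the standard <service>.<namespace> convention):

Name:   prometheus
Type:   Prometheus
Url:    http://prometheus.kube-ops.svc.cluster.local:9090
Access: proxy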
Next, set up dashboard pages; the two below are recommended:
https://grafana.com/dashboards/1621
https://grafana.com/dashboards/315
Here dashboard 315 is used; in Grafana it can be imported by entering that ID under Dashboards → Import.
The result is shown in the figure below.
[Figure: imported dashboard 315 displaying cluster metrics]
Author: jinnzy
Source: https://www.jianshu.com/p/46e7b67d5416