Monitoring Apache Kafka in Production: A Complete Stack with Prometheus and Grafana
Monitoring Apache Kafka is essential for maintaining the performance, reliability, and availability of your streaming systems in production. A poorly monitored Kafka cluster can lead to data loss, silent failures, and incidents that are hard to diagnose.
This guide presents a complete, production-ready monitoring stack, with concrete configurations, Grafana dashboards, and sensible alerting strategies.
Monitoring Architecture
The Complete Stack
┌─────────────────────────────────────────────────────────────┐
│                        KAFKA CLUSTER                        │
│   ┌──────────┐      ┌──────────┐      ┌──────────┐          │
│   │ Broker 1 │      │ Broker 2 │      │ Broker 3 │          │
│   │ JMX:7071 │      │ JMX:7071 │      │ JMX:7071 │          │
│   └────┬─────┘      └────┬─────┘      └────┬─────┘          │
└────────┼─────────────────┼─────────────────┼────────────────┘
         │                 │                 │
         │    (JMX Exporter exposes metrics on :7071)
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│                 PROMETHEUS (scrape metrics)                 │
│   - Scrape interval: 15s                                    │
│   - Retention: 30 days                                      │
│   - Alertmanager integration                                │
└───────────────┬──────────────────────────────┬──────────────┘
                │                              │
                ▼                              ▼
┌──────────────────────────────┐  ┌──────────────────────────────┐
│   GRAFANA (visualization)    │  │ ALERTMANAGER (notifications) │
│   - Kafka Cluster Overview   │  │ - Slack, PagerDuty, Email    │
│   - Broker Metrics           │  │ - Routing rules              │
│   - Consumer Lag             │  │ - Silences and inhibitions   │
│   - Topic Performance        │  │                              │
└──────────────────────────────┘  └──────────────────────────────┘
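For a quick start, the three monitoring components can be wired together with Docker Compose. A minimal sketch, assuming the Prometheus and Alertmanager configurations shown later in this guide live under ./prometheus/ and ./alertmanager/ (paths and image tags are illustrative, not prescriptive):

# docker-compose.monitoring.yml — sketch of the monitoring layer only
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus   # prometheus.yml + rules/
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager   # alertmanager.yml

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin   # change for anything beyond a local test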
Essential Metrics to Monitor
1. Broker Metrics (Kafka Server)
Throughput
# Incoming messages per second
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec

# Incoming bytes per second
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec

# Outgoing bytes per second
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec

# Failed requests
kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec
kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec
Latency and Performance
# Total time of Produce requests
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce

# Total time of Fetch requests
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer

# Latency broken down by component
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce
Replication and High Availability
# Under-replicated partitions (CRITICAL)
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

# Offline partitions (CRITICAL)
kafka.controller:type=KafkaController,name=OfflinePartitionsCount

# ISR shrinks/expands (a sign of trouble)
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

# Leader election rate
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
System Resources
# JVM heap memory
java.lang:type=Memory,attribute=HeapMemoryUsage

# GC time
java.lang:type=GarbageCollector,name=G1 Young Generation
java.lang:type=GarbageCollector,name=G1 Old Generation

# Thread count
java.lang:type=Threading,attribute=ThreadCount

# Open file descriptors (watch to avoid "Too many open files")
java.lang:type=OperatingSystem,attribute=OpenFileDescriptorCount

# Disk I/O
kafka.server:type=KafkaServer,name=linux-disk-read-bytes
kafka.server:type=KafkaServer,name=linux-disk-write-bytes
2. Producer Metrics
# Throughput
kafka.producer:type=producer-metrics,client-id=*,attribute=record-send-rate

# Latency (request round-trip time as seen by the producer)
kafka.producer:type=producer-metrics,client-id=*,attribute=request-latency-avg

# Retry rate
kafka.producer:type=producer-metrics,client-id=*,attribute=record-retry-rate

# Available buffer
kafka.producer:type=producer-metrics,client-id=*,attribute=buffer-available-bytes
3. Consumer Metrics
# Consumer lag (CRITICAL)
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,attribute=records-lag-max

# Records consumed per second
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,attribute=records-consumed-rate

# Fetch latency
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,attribute=fetch-latency-avg
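Producer and consumer metrics live in the client JVMs, not on the brokers, so each client application must run the JMX agent with its own config. A minimal sketch; the patterns below are illustrative rather than the official sample config, and they rely on the exporter's name sanitization turning dashed attribute names like record-send-rate into valid Prometheus names:

# client-jmx-config.yml — sketch for producer/consumer JVMs
lowercaseOutputName: true
whitelistObjectNames:
  - kafka.producer:type=producer-metrics,*
  - kafka.consumer:type=consumer-fetch-manager-metrics,*
rules:
  # e.g. kafka_producer_record_send_rate{client_id="order-service-producer"}
  - pattern: kafka.producer<type=producer-metrics, client-id=(.+)><>(.+)
    name: kafka_producer_$2
    type: GAUGE
    labels:
      client_id: "$1"
  - pattern: kafka.consumer<type=consumer-fetch-manager-metrics, client-id=(.+)><>(.+)
    name: kafka_consumer_$2
    type: GAUGE
    labels:
      client_id: "$1"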
JMX Exporter Configuration
Installation and Setup
# Download the JMX Prometheus Exporter
cd /opt/kafka
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.19.0/jmx_prometheus_javaagent-0.19.0.jar

# Create the configuration directory
mkdir -p /opt/kafka/prometheus
The Exporter Config File (kafka-jmx-config.yml)
# /opt/kafka/prometheus/kafka-jmx-config.yml
---
lowercaseOutputName: true
lowercaseOutputLabelNames: true

# Whitelist of metrics to expose (keeps scrape cost down)
whitelistObjectNames:
  - kafka.server:type=BrokerTopicMetrics,*
  - kafka.server:type=ReplicaManager,*
  - kafka.server:type=KafkaRequestHandlerPool,*
  - kafka.network:type=RequestMetrics,*
  - kafka.network:type=SocketServer,*
  - kafka.controller:type=KafkaController,*
  - kafka.controller:type=ControllerStats,*
  - kafka.log:type=LogFlushStats,*
  - java.lang:type=Memory
  - java.lang:type=GarbageCollector,*
  - java.lang:type=Threading
  - java.lang:type=Runtime

# Metric and label transformation rules
rules:
  # Broker metrics
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      topic: "$4"
      partition: "$5"
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      broker: "$4:$5"
  - pattern: kafka.server<type=(.+), name=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE

  # Network metrics — percentile attributes (50thPercentile, 99thPercentile, ...)
  # become a quantile label such as quantile="0.99"
  - pattern: kafka.network<type=(.+), name=(.+), request=(.+)><>(\d+)thPercentile
    name: kafka_network_$1_$2
    type: GAUGE
    labels:
      request: "$3"
      quantile: "0.$4"

  # Controller metrics
  - pattern: kafka.controller<type=(.+), name=(.+)><>Value
    name: kafka_controller_$1_$2
    type: GAUGE

  # Log metrics
  - pattern: kafka.log<type=(.+), name=(.+)><>Value
    name: kafka_log_$1_$2
    type: GAUGE

  # JVM metrics
  - pattern: 'java.lang<type=(.+), name=(.+)><(.+)>(\w+): (\d+)'
    name: java_lang_$1_$3_$4
    value: $5
    type: GAUGE
    labels:
      name: "$2"
  - pattern: java.lang<type=Memory><HeapMemoryUsage>(\w+)
    name: java_lang_memory_heapmemoryusage_$1
    type: GAUGE
  - pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w+)
    name: java_lang_memory_nonheapmemoryusage_$1
    type: GAUGE
  - pattern: java.lang<type=GarbageCollector, name=(.+)><>CollectionCount
    name: java_lang_garbagecollector_collectioncount
    type: COUNTER
    labels:
      name: "$1"
  - pattern: java.lang<type=GarbageCollector, name=(.+)><>CollectionTime
    name: java_lang_garbagecollector_collectiontime
    type: COUNTER
    labels:
      name: "$1"
Starting Kafka with the JMX Exporter
# Method 1: environment variable
export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.19.0.jar=7071:/opt/kafka/prometheus/kafka-jmx-config.yml"
./bin/kafka-server-start.sh config/server.properties

# Method 2: edit kafka-server-start.sh directly
# Add this before the exec line:
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/kafka/jmx_prometheus_javaagent-0.19.0.jar=7071:/opt/kafka/prometheus/kafka-jmx-config.yml"

# Method 3: Docker Compose
version: '3.8'
services:
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    environment:
      KAFKA_OPTS: "-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka-jmx-config.yml"
    volumes:
      - ./jmx-exporter:/opt/jmx-exporter
    ports:
      - "7071:7071"  # JMX metrics endpoint

# Check that the metrics are exposed
curl http://localhost:7071/metrics | head -20

# Expected output:
# kafka_server_brokertopicmetrics_messagesinpersec{...} 1234.5
# kafka_server_replicamanager_underreplicatedpartitions 0
# ...
Prometheus Configuration
prometheus.yml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'kafka-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alerting rules
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configurations
scrape_configs:
  # Kafka brokers
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
        labels:
          cluster: 'kafka-prod'

  # ZooKeeper (if used)
  - job_name: 'zookeeper'
    static_configs:
      - targets:
          - 'zookeeper-1:7072'
          - 'zookeeper-2:7072'
          - 'zookeeper-3:7072'

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'kafka-broker-1:9100'
          - 'kafka-broker-2:9100'
          - 'kafka-broker-3:9100'

  # Kafka Lag Exporter (consumer lag)
  - job_name: 'kafka-lag-exporter'
    static_configs:
      - targets:
          - 'kafka-lag-exporter:9999'
Alerting Rules (/etc/prometheus/rules/kafka-alerts.yml)
groups:
  - name: kafka-broker-alerts
    interval: 30s
    rules:
      # CRITICAL: under-replicated partitions
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
          component: kafka
        annotations:
          summary: "Kafka has under-replicated partitions"
          description: "Broker {{ $labels.instance }} has {{ $value }} under-replicated partitions"

      # CRITICAL: offline partitions
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: critical
          component: kafka
        annotations:
          summary: "Kafka has offline partitions"
          description: "Cluster has {{ $value }} offline partitions"

      # WARNING: frequent ISR shrinks
      - alert: KafkaISRShrinkRate
        expr: rate(kafka_server_replicamanager_isrshrinkspersec[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High ISR shrink rate detected"
          description: "Broker {{ $labels.instance }} ISR shrink rate: {{ $value }}"

      # WARNING: high produce latency
      - alert: KafkaHighProduceLatency
        expr: kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.99"} > 100
        for: 5m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High Kafka produce latency"
          description: "P99 produce latency on {{ $labels.instance }}: {{ $value }}ms"

      # WARNING: high fetch latency
      - alert: KafkaHighFetchLatency
        expr: kafka_network_requestmetrics_totaltimems{request="FetchConsumer",quantile="0.99"} > 100
        for: 5m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High Kafka fetch latency"
          description: "P99 fetch latency on {{ $labels.instance }}: {{ $value }}ms"

      # CRITICAL: broker down
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
          component: kafka
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is unreachable"

      # WARNING: high JVM heap usage
      - alert: KafkaHighHeapMemoryUsage
        expr: (java_lang_memory_heapmemoryusage_used / java_lang_memory_heapmemoryusage_max) > 0.8
        for: 5m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High JVM heap memory usage"
          description: "Broker {{ $labels.instance }} heap usage: {{ $value | humanizePercentage }}"

      # WARNING: high GC time (collectiontime is in ms, so divide by 1000 for s/s)
      - alert: KafkaHighGCTime
        expr: rate(java_lang_garbagecollector_collectiontime{name="G1 Old Generation"}[5m]) / 1000 > 0.1
        for: 5m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High GC time detected"
          description: "Broker {{ $labels.instance }} GC time rate: {{ $value }}s/s"

  - name: kafka-consumer-alerts
    interval: 30s
    rules:
      # CRITICAL: consumer lag too high
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumergroup_group_lag > 10000
        for: 5m
        labels:
          severity: critical
          component: kafka-consumer
        annotations:
          summary: "High consumer lag detected"
          description: "Consumer group {{ $labels.group }} on topic {{ $labels.topic }} has lag: {{ $value }}"

      # WARNING: consumer lag growing
      - alert: KafkaConsumerLagIncreasing
        expr: delta(kafka_consumergroup_group_lag[10m]) > 1000
        for: 5m
        labels:
          severity: warning
          component: kafka-consumer
        annotations:
          summary: "Consumer lag is increasing"
          description: "Consumer group {{ $labels.group }} lag increased by {{ $value }} in 10min"

  - name: kafka-topic-alerts
    interval: 30s
    rules:
      # WARNING: high error rate
      - alert: KafkaHighFailedProduceRate
        expr: rate(kafka_server_brokertopicmetrics_failedproducerequestspersec[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
          component: kafka
        annotations:
          summary: "High failed produce request rate"
          description: "Topic {{ $labels.topic }} failed produce rate: {{ $value }}/s"
Grafana Dashboards
Installing Grafana
# docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

volumes:
  grafana-storage:
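The ./grafana/provisioning mount above can also carry the Prometheus datasource, so the container comes up pre-wired. A sketch, assuming Prometheus is reachable inside the Compose network as http://prometheus:9090:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true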
Dashboard 1: Kafka Cluster Overview
{
  "dashboard": {
    "title": "Kafka Cluster Overview",
    "panels": [
      {
        "title": "Cluster Health",
        "targets": [
          {
            "expr": "up{job=\"kafka-brokers\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Under-Replicated Partitions",
        "targets": [
          {
            "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)",
            "legendFormat": "Under-Replicated"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": { "type": "gt", "params": [0] },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "last" },
              "type": "query"
            }
          ]
        }
      },
      {
        "title": "Messages In Per Second",
        "targets": [
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec[5m])) by (topic)",
            "legendFormat": "{{topic}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Bytes In/Out Per Second",
        "targets": [
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec[5m]))",
            "legendFormat": "Bytes In"
          },
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec[5m]))",
            "legendFormat": "Bytes Out"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency (p99)",
        "targets": [
          {
            "expr": "kafka_network_requestmetrics_totaltimems{quantile=\"0.99\",request=\"Produce\"}",
            "legendFormat": "Produce - {{instance}}"
          },
          {
            "expr": "kafka_network_requestmetrics_totaltimems{quantile=\"0.99\",request=\"FetchConsumer\"}",
            "legendFormat": "Fetch - {{instance}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Heap Usage",
        "targets": [
          {
            "expr": "(java_lang_memory_heapmemoryusage_used / java_lang_memory_heapmemoryusage_max) * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Useful Prometheus Queries for Grafana
# Message throughput per topic
sum(rate(kafka_server_brokertopicmetrics_messagesinpersec[5m])) by (topic)

# Byte throughput per broker
sum(rate(kafka_server_brokertopicmetrics_bytesinpersec[5m])) by (instance)

# P50, P95, P99 produce latency
kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.5"}
kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.95"}
kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.99"}

# Consumer lag per group
kafka_consumergroup_group_lag{group="my-consumer-group"}

# Partitions per broker
count(kafka_server_replicafetchermanager_maxlag) by (instance)

# Leader election rate
rate(kafka_controller_controllerstats_leaderelectionrateandtimems_count[5m])

# ISR shrinks/expands
rate(kafka_server_replicamanager_isrshrinkspersec[5m])
rate(kafka_server_replicamanager_isrexpandspersec[5m])

# Network request queue size
kafka_network_requestchannel_requestqueuesize

# Log flush latency
kafka_log_logflushstats_logflushrateandtimems

# GC time as a percentage (CollectionTime is in ms, so ms/s divided by 10 = %)
rate(java_lang_garbagecollector_collectiontime[5m]) / 10
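Dashboards re-evaluate several of these expressions on every refresh; Prometheus recording rules precompute them once per evaluation interval instead. A sketch built from the queries above (rule names are illustrative):

# /etc/prometheus/rules/kafka-recording-rules.yml
groups:
  - name: kafka-recording-rules
    interval: 30s
    rules:
      - record: kafka:messages_in_per_topic:rate5m
        expr: sum(rate(kafka_server_brokertopicmetrics_messagesinpersec[5m])) by (topic)
      - record: kafka:bytes_in_per_broker:rate5m
        expr: sum(rate(kafka_server_brokertopicmetrics_bytesinpersec[5m])) by (instance)
      - record: kafka:gc_time:percent
        expr: rate(java_lang_garbagecollector_collectiontime[5m]) / 10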
Consumer Lag Monitoring
Kafka Lag Exporter (Lightbend)
# docker-compose.yml
version: '3.8'
services:
  kafka-lag-exporter:
    image: seglo/kafka-lag-exporter:0.8.2
    container_name: kafka-lag-exporter
    ports:
      - "9999:9999"
    volumes:
      - ./kafka-lag-exporter.conf:/opt/docker/conf/application.conf
      - ./logback.xml:/opt/docker/conf/logback.xml
# kafka-lag-exporter.conf
kafka-lag-exporter {
  port = 9999
  poll-interval = 30 seconds
  lookup-table-size = 60

  clusters = [
    {
      name = "kafka-prod"
      bootstrap-brokers = "kafka-1:9092,kafka-2:9092,kafka-3:9092"

      # Consumer groups to monitor (regex whitelist)
      group-whitelist = [
        "user-service-group",
        "order-processing-group"
      ]

      # Or monitor every group:
      # group-whitelist = [".*"]
    }
  ]

  watchers = {
    strimzi = false
  }

  metric-whitelist = [
    "kafka_consumergroup_group_lag",
    "kafka_consumergroup_group_lag_seconds",
    "kafka_consumergroup_group_max_lag",
    "kafka_consumergroup_group_max_lag_seconds"
  ]
}
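The *_lag_seconds metrics above are one of the main reasons to run Kafka Lag Exporter: an alert expressed in seconds of lag is usually more meaningful to the business than a raw offset count. A sketch, in which the 300s threshold and the group label name are assumptions to adapt:

# kafka-lag-seconds-alerts.yml — time-based lag alert sketch
groups:
  - name: kafka-lag-seconds-alerts
    rules:
      - alert: KafkaConsumerLagSecondsHigh
        expr: kafka_consumergroup_group_max_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
          component: kafka-consumer
        annotations:
          summary: "Consumer group more than 5 minutes behind"
          description: "Group {{ $labels.group }} is about {{ $value }}s behind the log head."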
Complementary Tools
1. Kafka UI (Provectus)
version: '3.8'
services:
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "8080:8080"
    environment:
      KAFKA_CLUSTERS_0_NAME: production
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:9092,kafka-2:9092
      KAFKA_CLUSTERS_0_METRICS_PORT: 7071
      KAFKA_CLUSTERS_0_SCHEMAREGISTRY: http://schema-registry:8081
      DYNAMIC_CONFIG_ENABLED: 'true'
2. Confluent Control Center (Commercial)
Advanced monitoring features:
- Stream lineage visualization
- Connector management
- KSQL editor
- Built-in alerting
3. Burrow (LinkedIn)
Consumer lag monitoring without JMX:
docker run -d \
--name burrow \
-p 8000:8000 \
-v /path/to/burrow.toml:/etc/burrow/burrow.toml \
linkedin/burrow:latest
Troubleshooting with Metrics
Scenario 1: High Latency
# Check latency per component
curl -s http://kafka-broker:7071/metrics | grep -E "requestmetrics.*Produce"

# How to read it:
# - RequestQueueTimeMs: request queue saturated → increase num.io.threads
# - LocalTimeMs: slow disk → check I/O
# - RemoteTimeMs: slow replication → check the inter-broker network
# - ResponseQueueTimeMs: response queue saturated
Scenario 2: Growing Consumer Lag
# Identify the worst topics/groups
topk(5, sum by (topic, group) (kafka_consumergroup_group_lag))

# Is the producer outpacing the consumers?
rate(kafka_server_brokertopicmetrics_messagesinpersec[5m])

# Is the consumer too slow? (records-consumed-rate is already a rate; no rate() needed)
kafka_consumer_records_consumed_rate
Scenario 3: Under-Replicated Partitions
# Check ISR shrinks
curl -s http://kafka-broker:7071/metrics | grep isrshrinkspersec

# Common causes:
# - replica.lag.time.max.ms too short
# - saturated network
# - overloaded broker (CPU, I/O)
# - GC pauses
Best Practices
1. Metrics Retention
# Prometheus retention
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
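Note that both of these are command-line flags, not prometheus.yml settings. In Docker Compose they go under command:, for example (a sketch; the image tag is illustrative):

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'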
2. Scrape Granularity
# Brokers: 15s (critical metrics)
scrape_interval: 15s

# Consumer lag: 30s (acceptable)
scrape_interval: 30s

# JVM metrics: 30s
scrape_interval: 30s
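A job-level scrape_interval overrides the global default, so these granularities translate directly into the scrape config. A sketch:

scrape_configs:
  - job_name: 'kafka-brokers'
    scrape_interval: 15s   # critical broker metrics
    static_configs:
      - targets: ['kafka-broker-1:7071']

  - job_name: 'kafka-lag-exporter'
    scrape_interval: 30s   # 30s granularity is acceptable for lag
    static_configs:
      - targets: ['kafka-lag-exporter:9999']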
3. Smart Alerting
- Severity levels: critical, warning, info
- Templating: use templates to avoid repetition
- Throttling: group similar alerts together
- Escalation: route to different channels by severity, as in the sketch below
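A minimal Alertmanager routing sketch along those lines: critical alerts page on-call through PagerDuty, warnings go to Slack, and everything else falls back to email (all receiver credentials below are placeholders):

# alertmanager.yml — routing sketch
route:
  receiver: default-email
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
    - match:
        severity: warning
      receiver: slack-kafka

receivers:
  - name: default-email
    email_configs:
      - to: 'kafka-team@example.com'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
  - name: slack-kafka
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<webhook-path>'
        channel: '#kafka-alerts'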
4. Capacity Planning
# Project storage growth 30 days ahead
predict_linear(
  kafka_log_log_size[7d],
  86400 * 30
)

# Throughput trend one week ahead
predict_linear(
  kafka_server_brokertopicmetrics_bytesinpersec[7d],
  86400 * 7
)
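The same function makes a good early-warning alert: fire when the linear trend says a Kafka data disk will fill within a week. A sketch assuming Node Exporter is scraped as configured earlier and the Kafka log directory sits on its own mount (the mountpoint value is an assumption):

groups:
  - name: kafka-capacity-alerts
    rules:
      - alert: KafkaDiskFullIn7Days
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}[6h], 7 * 86400) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Kafka data disk projected to fill within 7 days"
          description: "{{ $labels.instance }}:{{ $labels.mountpoint }} is trending toward full."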
Conclusion
Effective Kafka monitoring rests on:
1. A Complete Technical Stack
- JMX Exporter to expose Kafka metrics
- Prometheus to collect and store them
- Grafana to visualize them
- Alertmanager to send notifications
2. The Essential Metrics
- Replication: under-replicated partitions, ISR
- Performance: latency, throughput
- Consumption: consumer lag
- Resources: JVM, CPU, disk
3. Smart Alerting
- Critical alerts vs. warnings
- Thresholds tuned to the business context
- Appropriate escalation
4. Complementary Tools
- Kafka UI for visual management
- Kafka Lag Exporter for detailed consumer lag
- Node Exporter for system metrics
Monitoring is not a one-off task but a continuous improvement process. Start with the business-facing metrics (messages/sec, lag), then refine progressively with the technical ones.
This article is part of a full series on Apache Kafka in production. See my other guides on Kafka's internal architecture, performance tuning, and advanced patterns.