Kafka on Kubernetes in Production
Introduction
Deploying Apache Kafka on Kubernetes combines two complex technologies, but it brings significant benefits: automation, self-healing, scaling, and portability. This guide walks through deploying and operating Kafka in production on Kubernetes, end to end.
We will cover:
- Architecture patterns for Kafka on K8s
- Deployment with the Strimzi operator
- Persistent storage management
- Scaling and high availability
- Monitoring and observability
- Troubleshooting strategies
Why Kubernetes for Kafka?
Advantages
┌─────────────────────────────────────────────────────────────┐
│                     KAFKA + KUBERNETES                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Automation                                                 │
│  ├─ Declarative deployment (GitOps)                         │
│  ├─ Automated rolling updates                               │
│  └─ Self-healing (failed pods restarted)                    │
│                                                             │
│  Scalability                                                │
│  ├─ Horizontal broker scaling                               │
│  ├─ Consumer auto-scaling                                   │
│  └─ Native resource management                              │
│                                                             │
│  Portability                                                │
│  ├─ Multi-cloud                                             │
│  ├─ On-premise / hybrid                                     │
│  └─ Identical environments (dev/staging/prod)               │
│                                                             │
│  Observability                                              │
│  ├─ Prometheus/Grafana integration                          │
│  ├─ Centralized logging                                     │
│  └─ Distributed tracing                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Challenges
┌─────────────────────────────────────────────────────────────┐
│                         CHALLENGES                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Storage                                                    │
│  ├─ Kafka is stateful (requires persistence)                │
│  ├─ I/O performance is critical                             │
│  └─ PVC provisioning and management                         │
│                                                             │
│  Network                                                    │
│  ├─ DNS and service discovery                               │
│  ├─ External exposure                                       │
│  └─ Inter-pod latency                                       │
│                                                             │
│  Operations                                                 │
│  ├─ Complex rolling restarts                                │
│  ├─ Partition rebalancing                                   │
│  └─ Backup and disaster recovery                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Deployment Options
Comparing the Approaches
┌────────────────────────────────────────────────────────────────┐
│ Method           │ Complexity │ Production │ Recommended for   │
├──────────────────┼────────────┼────────────┼───────────────────┤
│ Helm Charts      │ Medium     │ Yes        │ Small clusters    │
│ Strimzi Operator │ Low        │ Yes        │ Recommended       │
│ Confluent CFK    │ Low        │ Yes        │ Enterprise        │
│ StatefulSet DIY  │ High       │ Possible   │ Not recommended   │
└────────────────────────────────────────────────────────────────┘
Strimzi: The Kafka Operator for Kubernetes
Strimzi manages:
├─ Kafka clusters (brokers)
├─ ZooKeeper (or KRaft)
├─ Kafka Connect
├─ Kafka MirrorMaker
├─ Kafka Bridge (HTTP)
└─ Cruise Control (rebalancing)
CRDs (Custom Resource Definitions):
├─ Kafka
├─ KafkaTopic
├─ KafkaUser
├─ KafkaConnect
├─ KafkaMirrorMaker2
├─ KafkaBridge
├─ KafkaRebalance
└─ KafkaNodePool
Installing Strimzi
Prerequisites
# Kubernetes 1.21+
kubectl version
# Dedicated namespace
kubectl create namespace kafka
# Storage class with dynamic provisioning
kubectl get storageclass
Installation via Helm
# Add the Strimzi Helm repo
helm repo add strimzi https://strimzi.io/charts/
helm repo update
# Install the operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --set watchAnyNamespace=true \
  --version 0.38.0
# Verify the installation
kubectl get pods -n kafka
# NAME                               READY   STATUS    RESTARTS   AGE
# strimzi-cluster-operator-xxx-yyy   1/1     Running   0          30s
Installation via Manifests
# Download and apply the manifests
kubectl create namespace kafka
kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
# Verify
kubectl get pods -n kafka -w
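Whichever install method you use, a quick sanity check before deploying a cluster is to confirm the CRDs listed above are registered:
# The Strimzi CRDs should now exist in the cluster
kubectl get crds | grep strimzi.io
# kafkas.kafka.strimzi.io
# kafkatopics.kafka.strimzi.io
# kafkausers.kafka.strimzi.io
# ... (one CRD per resource listed above)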
Deploying a Kafka Cluster
Basic Configuration (Dev/Test)
# kafka-cluster-dev.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-dev
  namespace: kafka
spec:
  kafka:
    version: 3.6.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.6"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
    resources:
      requests:
        memory: 1Gi
        cpu: 250m
      limits:
        memory: 2Gi
        cpu: 500m
  entityOperator:
    topicOperator: {}
    userOperator: {}
# Deploy
kubectl apply -f kafka-cluster-dev.yaml
# Watch the deployment
kubectl get kafka -n kafka -w
# Check the pods
kubectl get pods -n kafka
# kafka-dev-zookeeper-0   1/1   Running
# kafka-dev-zookeeper-1   1/1   Running
# kafka-dev-zookeeper-2   1/1   Running
# kafka-dev-kafka-0       1/1   Running
# kafka-dev-kafka-1       1/1   Running
# kafka-dev-kafka-2       1/1   Running
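Before moving on, it's worth smoke-testing the cluster with throwaway producer and consumer pods. A minimal check, assuming the Strimzi 0.38 image for Kafka 3.6.0 (adjust the tag to your versions):
# Produce a few test messages (type some lines, then Ctrl+C)
kubectl run kafka-producer -n kafka -ti --rm --restart=Never \
  --image=quay.io/strimzi/kafka:0.38.0-kafka-3.6.0 -- \
  bin/kafka-console-producer.sh \
  --bootstrap-server kafka-dev-kafka-bootstrap:9092 --topic test
# Read them back
kubectl run kafka-consumer -n kafka -ti --rm --restart=Never \
  --image=quay.io/strimzi/kafka:0.38.0-kafka-3.6.0 -- \
  bin/kafka-console-consumer.sh \
  --bootstrap-server kafka-dev-kafka-bootstrap:9092 \
  --topic test --from-beginning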
Production Configuration
# kafka-cluster-prod.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-prod
  namespace: kafka
  labels:
    app: kafka
    environment: production
spec:
  kafka:
    version: 3.6.0
    replicas: 5
    # =============== LISTENERS ===============
    listeners:
      # Internal plaintext (in-cluster traffic)
      - name: plain
        port: 9092
        type: internal
        tls: false
      # Internal TLS
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      # External (LoadBalancer)
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: scram-sha-512
        configuration:
          bootstrap:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: nlb
          brokers:
            - broker: 0
              advertisedHost: kafka-0.example.com
            - broker: 1
              advertisedHost: kafka-1.example.com
            - broker: 2
              advertisedHost: kafka-2.example.com
            - broker: 3
              advertisedHost: kafka-3.example.com
            - broker: 4
              advertisedHost: kafka-4.example.com
    # =============== CONFIGURATION ===============
    config:
      # Replication
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      # Performance
      num.network.threads: 8
      num.io.threads: 16
      socket.send.buffer.bytes: 102400
      socket.receive.buffer.bytes: 102400
      socket.request.max.bytes: 104857600
      # Log retention
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      log.retention.check.interval.ms: 300000
      # Compression
      compression.type: lz4
      # Protocol versions
      inter.broker.protocol.version: "3.6"
      log.message.format.version: "3.6"
      # Note: do not set broker.rack here; "broker."-prefixed options are
      # managed by Strimzi, which injects the rack from the section below
    # =============== RACK AWARENESS ===============
    rack:
      topologyKey: topology.kubernetes.io/zone
    # =============== STORAGE ===============
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false
    # =============== RESOURCES ===============
    resources:
      requests:
        memory: 8Gi
        cpu: 2000m
      limits:
        memory: 16Gi
        cpu: 4000m
    # =============== JVM OPTIONS ===============
    jvmOptions:
      "-Xms": "6g"
      "-Xmx": "6g"
      gcLoggingEnabled: true
      javaSystemProperties:
        - name: com.sun.management.jmxremote.port
          value: "9999"
    # =============== METRICS ===============
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    # =============== LIVENESS & READINESS ===============
    livenessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5
    readinessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5
    # =============== TEMPLATE ===============
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-kafka
                topologyKey: kubernetes.io/hostname
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                strimzi.io/name: kafka-prod-kafka
  # =============== ZOOKEEPER ===============
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: fast-ssd
      deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    jvmOptions:
      "-Xms": "1g"
      "-Xmx": "1g"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-zookeeper
                topologyKey: kubernetes.io/hostname
  # =============== ENTITY OPERATOR ===============
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
    userOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
  # =============== CRUISE CONTROL ===============
  cruiseControl:
    config:
      goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
    resources:
      requests:
        memory: 512Mi
        cpu: 200m
      limits:
        memory: 2Gi
        cpu: 1000m
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: cruisecontrol-metrics-config.yml
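Apply the manifest, then let kubectl wait block until the operator reports the cluster Ready; the resource status also exposes the bootstrap addresses per listener:
# Deploy the production cluster
kubectl apply -f kafka-cluster-prod.yaml
# Wait until the operator reports Ready (a full rollout can take several minutes)
kubectl wait kafka/kafka-prod --for=condition=Ready --timeout=600s -n kafka
# Inspect the advertised bootstrap addresses
kubectl get kafka kafka-prod -n kafka -o jsonpath='{.status.listeners}'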
KRaft Configuration (Without ZooKeeper)
# kafka-kraft.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-kraft
spec:
  replicas: 3
  roles:
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-kraft
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.6.0
    metadataVersion: 3.6-IV2
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
  entityOperator:
    topicOperator: {}
    userOperator: {}
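Deployment works exactly like the ZooKeeper-based examples; the only new resource to watch is the node pool:
# Deploy the KRaft cluster and its node pool
kubectl apply -f kafka-kraft.yaml
# Check both resources
kubectl get kafkanodepool,kafka -n kafka
# No zookeeper pods: the three dual-role pods act as both controller and broker
kubectl get pods -n kafka -l strimzi.io/cluster=kafka-kraft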
Storage Management
Recommended Storage Classes
# storage-class-aws.yaml (AWS EBS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  fsType: xfs
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-gcp.yaml (GCP PD)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-azure.yaml (Azure Disk)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
JBOD (Just a Bunch Of Disks)
# JBOD configuration for high performance
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 1
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 2
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
Storage Expansion
# Increase the size in the Kafka CR
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 1000Gi  # increased from 500Gi to 1000Gi
      class: fast-ssd
      deleteClaim: false
# Apply and verify
kubectl apply -f kafka-cluster-prod.yaml
kubectl get pvc -n kafka
Topic Management
KafkaTopic CRD
# topic-orders.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000    # 7 days
    segment.bytes: 1073741824  # 1 GB
    min.insync.replicas: 2
    compression.type: lz4
    cleanup.policy: delete
---
# topic-orders-dlq.yaml (Dead Letter Queue)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-dlq
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 2592000000   # 30 days
    cleanup.policy: delete
---
# topic-compacted.yaml (compacted topic)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: user-profiles
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    cleanup.policy: compact
    min.cleanable.dirty.ratio: 0.5
    segment.ms: 3600000
# Deploy the topics
kubectl apply -f topics/
# List the topics
kubectl get kafkatopics -n kafka
# Describe a topic
kubectl describe kafkatopic orders -n kafka
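The Topic Operator writes the reconciliation result back into each KafkaTopic's status; checking the Ready condition is the quickest way to catch a config the cluster rejected:
# Prints "True" once the topic exists in Kafka with the requested config
kubectl get kafkatopic orders -n kafka \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'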
User Management and Security
KafkaUser with SCRAM
# user-producer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-producer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Producer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Write
          - Describe
          - Create
      # Idempotent producer
      - resource:
          type: cluster
        operations:
          - IdempotentWrite
---
# user-consumer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-consumer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Consumer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Read
          - Describe
      - resource:
          type: group
          name: orders-consumer-group
          patternType: literal
        operations:
          - Read
# Deploy the users
kubectl apply -f users/
# Retrieve the credentials
kubectl get secret orders-producer -n kafka -o jsonpath='{.data.password}' | base64 -d
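On the client side, these credentials plug into standard Kafka SASL properties. A minimal client.properties sketch for the external SCRAM listener on port 9094 (paths and passwords are placeholders to fill in):
# client.properties -- SCRAM-SHA-512 over TLS
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="orders-producer" \
  password="<password from the secret above>";
# Truststore built from the cluster CA (kafka-prod-cluster-ca-cert secret)
ssl.truststore.location=/path/to/truststore.p12
ssl.truststore.type=PKCS12
ssl.truststore.password=<truststore password>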
TLS Mutual Authentication
# user-tls.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-client
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operations:
          - Read
          - Write
          - Describe
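For mTLS clients, the User Operator stores the signed client certificate in a secret named after the KafkaUser, and the cluster CA lives in the <cluster>-cluster-ca-cert secret; extract both to build the client keystore and truststore:
# Client certificate and key (secret named after the KafkaUser)
kubectl get secret app-client -n kafka -o jsonpath='{.data.user\.crt}' | base64 -d > user.crt
kubectl get secret app-client -n kafka -o jsonpath='{.data.user\.key}' | base64 -d > user.key
# Cluster CA certificate (goes into the client truststore)
kubectl get secret kafka-prod-cluster-ca-cert -n kafka \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt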
Scaling and High Availability
Horizontal Broker Scaling
# Change the number of replicas
kubectl patch kafka kafka-prod -n kafka --type merge -p '{"spec":{"kafka":{"replicas":7}}}'
# Or edit the CR directly
kubectl edit kafka kafka-prod -n kafka
Rebalancing with Cruise Control
# rebalance.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: full-rebalance
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  mode: full
  goals:
    - RackAwareGoal
    - ReplicaCapacityGoal
    - DiskCapacityGoal
    - NetworkInboundCapacityGoal
    - NetworkOutboundCapacityGoal
  skipHardGoalCheck: false
# Start a rebalance
kubectl apply -f rebalance.yaml
# Watch the status
kubectl get kafkarebalance full-rebalance -n kafka -w
# Approve the generated plan
kubectl annotate kafkarebalance full-rebalance \
  strimzi.io/rebalance=approve -n kafka
# Verify completion
kubectl get kafkarebalance full-rebalance -n kafka
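Newly added brokers start out empty, so a scale-up is usually paired with a rebalance. Recent Strimzi versions support a targeted mode that only moves replicas onto (or off) specific brokers; a sketch for the two brokers added by the scale-up above:
# rebalance-add-brokers.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: add-brokers
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  mode: add-brokers   # only moves replicas onto the listed brokers
  brokers: [5, 6]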
Pod Disruption Budget
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  minAvailable: 3
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-kafka
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
  namespace: kafka
spec:
  minAvailable: 2
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-zookeeper
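Note that Strimzi itself creates a PodDisruptionBudget (maxUnavailable: 1 by default) for the pods it manages, so standalone PDBs like the ones above can end up competing with it. Depending on your Strimzi version, it may be cleaner to tune the operator-managed budget from the Kafka CR instead; a sketch:
# In the Kafka CR, under spec.kafka (and similarly under spec.zookeeper)
template:
  podDisruptionBudget:
    maxUnavailable: 1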
Monitoring and Observability
Prometheus Configuration
# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
      # Broker metrics
      - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
        name: kafka_server_$1_$2
        type: GAUGE
        labels:
          clientId: "$3"
          topic: "$4"
          partition: "$5"
      - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
        name: kafka_server_$1_$2
        type: GAUGE
        labels:
          clientId: "$3"
          broker: "$4:$5"
      - pattern: kafka.server<type=(.+), name=(.+)><>Value
        name: kafka_server_$1_$2
        type: GAUGE
      # Controller metrics
      - pattern: kafka.controller<type=(.+), name=(.+)><>Value
        name: kafka_controller_$1_$2
        type: GAUGE
      # Network metrics
      - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
        name: kafka_network_$1_$2_total
        type: COUNTER
        labels:
          request: "$3"
          error: "$4"
      # Log metrics
      - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
        name: kafka_log_$1_$2
        type: GAUGE
        labels:
          topic: "$3"
          partition: "$4"
  zookeeper-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      - pattern: "org.apache.ZooKeeperService<name0=(.+)><>(\\w+)"
        name: zookeeper_$2
        type: GAUGE
      - pattern: "org.apache.ZooKeeperService<name0=(.+), name1=(.+)><>(\\w+)"
        name: zookeeper_$3
        type: GAUGE
        labels:
          replicaId: "$2"
  cruisecontrol-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      - pattern: "kafka.cruisecontrol<name=(.+)><>(\\w+)"
        name: cruise_control_$1_$2
        type: GAUGE
ServiceMonitor for Prometheus Operator
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-monitor
  namespace: kafka
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  namespaceSelector:
    matchNames:
      - kafka
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-monitor
  namespace: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
      strimzi.io/name: kafka-prod-zookeeper
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
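Depending on the Strimzi version, the JMX exporter port (tcp-prometheus, 9404) is exposed on the pods rather than on a Service, which is why the upstream Strimzi examples scrape with a PodMonitor; a sketch:
# pod-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-pod-monitor
  namespace: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus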
Grafana Dashboard
# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-dashboard.json: |
    {
      "dashboard": {
        "title": "Kafka Overview",
        "panels": [
          {
            "title": "Messages In/Sec",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
                "legendFormat": "Messages/sec"
              }
            ]
          },
          {
            "title": "Bytes In/Out",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_count[5m]))",
                "legendFormat": "Bytes In/sec"
              },
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_count[5m]))",
                "legendFormat": "Bytes Out/sec"
              }
            ]
          },
          {
            "title": "Under Replicated Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
              }
            ]
          },
          {
            "title": "Offline Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_controller_kafkacontroller_offlinepartitionscount)"
              }
            ]
          }
        ]
      }
    }
Prometheus Alerts
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: kafka
spec:
  groups:
    - name: kafka.rules
      rules:
        # Under-replicated partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated"
        # Offline partitions (CRITICAL)
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline"
        # No active controller
        - alert: KafkaNoActiveController
          expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"
        # Broker down
        - alert: KafkaBrokerDown
          expr: count(up{job="kafka"} == 1) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka broker is down"
            description: "Less than 3 brokers are up"
        # Consumer lag
        - alert: KafkaConsumerLag
          expr: sum(kafka_consumer_fetch_manager_records_lag) by (group) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag"
            description: "Consumer group {{ $labels.group }} has lag > 10000"
        # Disk usage
        - alert: KafkaDiskUsageHigh
          expr: (sum(kafka_log_size) by (pod) / sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-.*kafka.*"}) by (pod)) > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Kafka disk usage > 80%"
Operations and Maintenance
Rolling Restart
# Trigger a rolling restart of a single broker
kubectl annotate pod kafka-prod-kafka-0 -n kafka \
  strimzi.io/manual-rolling-update=true
# Or for all brokers (recent Strimzi manages pods via a StrimziPodSet, not a StatefulSet)
kubectl annotate strimzipodset kafka-prod-kafka -n kafka \
  strimzi.io/manual-rolling-update=true
# Watch the restart progress
kubectl get pods -n kafka -l strimzi.io/name=kafka-prod-kafka -w
Version Upgrades
# Change the version in the Kafka CR
spec:
  kafka:
    version: 3.7.0  # new version
    config:
      inter.broker.protocol.version: "3.6"  # keep the old protocol during the rollout
      log.message.format.version: "3.6"
# Apply and watch
kubectl apply -f kafka-cluster-prod.yaml
kubectl get kafka kafka-prod -n kafka -w
# Once the rollout is complete, bump the protocol versions
spec:
  kafka:
    config:
      inter.broker.protocol.version: "3.7"
      log.message.format.version: "3.7"
Backup and Restore
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-backup
  namespace: kafka
spec:
  schedule: "0 2 * * *"  # every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kafka:3.6
              command:
                - /bin/bash
                - -c
                - |
                  # Export topic configurations
                  kafka-topics.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --describe > /backup/topics-$(date +%Y%m%d).txt
                  # Back up consumer group offsets
                  kafka-consumer-groups.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --all-groups --describe > /backup/consumer-groups-$(date +%Y%m%d).txt
                  # Upload to S3
                  aws s3 cp /backup/ s3://my-bucket/kafka-backups/ --recursive
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
          volumes:
            - name: backup-volume
              emptyDir: {}
          restartPolicy: OnFailure
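Keep in mind these dumps only capture metadata (topic configs and committed offsets), not the messages themselves. For data-level disaster recovery, the usual pattern is continuous replication to a second cluster with MirrorMaker 2, which Strimzi drives through its own CRD. A minimal sketch, assuming a kafka-dr target cluster already exists:
# mirror-maker2.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: kafka-dr-mirror
  namespace: kafka
spec:
  version: 3.6.0
  replicas: 2
  connectCluster: kafka-dr   # the underlying Connect cluster runs against the target
  clusters:
    - alias: kafka-prod
      bootstrapServers: kafka-prod-kafka-bootstrap:9092
    - alias: kafka-dr
      bootstrapServers: kafka-dr-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: kafka-prod
      targetCluster: kafka-dr
      sourceConnector:
        config:
          replication.factor: 3
      topicsPattern: "orders.*"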
Troubleshooting
Common Issues
Pod Pending (Storage)
# Check the PVCs
kubectl get pvc -n kafka
# Check the StorageClass
kubectl get sc
# Check the events
kubectl describe pod kafka-prod-kafka-0 -n kafka
Broker Not Ready
# Broker logs
kubectl logs kafka-prod-kafka-0 -n kafka -c kafka
# Check the generated config
kubectl exec kafka-prod-kafka-0 -n kafka -- cat /tmp/strimzi.properties
# Check that the broker answers API requests
kubectl exec kafka-prod-kafka-0 -n kafka -- \
  kafka-broker-api-versions.sh --bootstrap-server localhost:9092
Consumer Lag
# Check the lag
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
# Check the partitions
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic
Network Issues
# Connectivity test
kubectl run test-pod --rm -it --image=busybox -n kafka -- \
  nslookup kafka-prod-kafka-bootstrap
# Check the services
kubectl get svc -n kafka
# Check the endpoints
kubectl get endpoints -n kafka
Useful Commands
# List topics
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 --list
# Describe a topic
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic
# Consume messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning --max-messages 10
# Produce messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic my-topic
# Check the metadata quorum state (KRaft clusters only)
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# Rebalance status
kubectl get kafkarebalance -n kafka
Key Takeaways
- Strimzi is the recommended operator: it dramatically simplifies running Kafka on K8s
- SSD storage is a must: use SSD-backed StorageClasses for performance
- Rack awareness: spread brokers across availability zones for HA
- Monitoring is non-negotiable: Prometheus, Grafana, and alerting are essential
- KRaft is the future: migrate to KRaft to eliminate ZooKeeper
- GitOps friendly: every CRD can be versioned in Git
- PDBs for stability: configure PodDisruptionBudgets
Running Kafka on Kubernetes in production requires specific expertise. Contact me for guidance on designing and deploying your cloud-native Kafka infrastructure.