
Kafka on Kubernetes in Production

A complete guide to deploying and operating Apache Kafka on Kubernetes in production: the Strimzi operator, StatefulSets, storage configuration, scaling, monitoring, and troubleshooting strategies.

Florian Courouge


Introduction

Running Apache Kafka on Kubernetes combines two complex technologies, but the benefits are significant: automation, self-healing, scaling, and portability. This guide covers deploying and operating Kafka in production on Kubernetes, end to end.

We will cover:

  • Architecture patterns for Kafka on K8s
  • Deployment with the Strimzi Operator
  • Persistent storage management
  • Scaling and high availability
  • Monitoring and observability
  • Troubleshooting strategies

Kafka on Kubernetes Architecture

Why Kubernetes for Kafka?

Benefits

┌─────────────────────────────────────────────────────────────┐
│                 KAFKA + KUBERNETES                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Automation                                                 │
│  ├─ Declarative deployments (GitOps)                        │
│  ├─ Automatic rolling updates                               │
│  └─ Self-healing (failed pods restarted)                    │
│                                                             │
│  Scalability                                                │
│  ├─ Horizontal broker scaling                               │
│  ├─ Consumer auto-scaling                                   │
│  └─ Native resource management                              │
│                                                             │
│  Portability                                                │
│  ├─ Multi-cloud                                             │
│  ├─ On-premise / hybrid                                     │
│  └─ Identical environments (dev/staging/prod)               │
│                                                             │
│  Observability                                              │
│  ├─ Prometheus/Grafana integration                          │
│  ├─ Centralized logging                                     │
│  └─ Distributed tracing                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Challenges

┌─────────────────────────────────────────────────────────────┐
│                    CHALLENGES                               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Storage                                                    │
│  ├─ Kafka is stateful (requires persistence)                │
│  ├─ I/O performance is critical                             │
│  └─ PVC provisioning and management                         │
│                                                             │
│  Network                                                    │
│  ├─ DNS and service discovery                               │
│  ├─ External exposure                                       │
│  └─ Inter-pod latency                                       │
│                                                             │
│  Operations                                                 │
│  ├─ Complex rolling restarts                                │
│  ├─ Partition rebalancing                                   │
│  └─ Backup and disaster recovery                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Deployment Options

Comparing the Approaches

┌────────────────────────────────────────────────────────────────┐
│ Method           │ Complexity │ Production │ Recommended for   │
├──────────────────┼────────────┼────────────┼───────────────────┤
│ Helm charts      │ Medium     │ Yes        │ Small clusters    │
│ Strimzi Operator │ Low        │ Yes        │ Recommended       │
│ Confluent CFK    │ Low        │ Yes        │ Enterprise        │
│ DIY StatefulSet  │ High       │ Possible   │ Not recommended   │
└────────────────────────────────────────────────────────────────┘

Strimzi: The Kafka Operator for Kubernetes

Strimzi manages:
├─ Kafka clusters (brokers)
├─ ZooKeeper (or KRaft)
├─ Kafka Connect
├─ Kafka MirrorMaker
├─ Kafka Bridge (HTTP)
└─ Cruise Control (rebalancing)

CRDs (Custom Resource Definitions):
├─ Kafka
├─ KafkaTopic
├─ KafkaUser
├─ KafkaConnect
├─ KafkaMirrorMaker2
├─ KafkaBridge
└─ KafkaRebalance
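
Once the operator is installed (next section), a quick way to confirm these CRDs are registered is to query the API server. A minimal sketch using standard kubectl commands:

```shell
# List the CRDs registered by the Strimzi operator
kubectl get crds | grep strimzi.io

# Inspect the schema of the Kafka CRD before writing custom resources
kubectl explain kafka.spec.kafka
```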

Installing Strimzi

Prerequisites

# Kubernetes 1.21+
kubectl version

# Dedicated namespace
kubectl create namespace kafka

# Storage class with dynamic provisioning
kubectl get storageclass

Installation via Helm

# Add the Strimzi Helm repo
helm repo add strimzi https://strimzi.io/charts/
helm repo update

# Install the operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --set watchAnyNamespace=true \
  --version 0.38.0

# Verify the installation
kubectl get pods -n kafka
# NAME                                        READY   STATUS    RESTARTS   AGE
# strimzi-cluster-operator-xxx-yyy            1/1     Running   0          30s

Installation via Manifests

# Download and apply the manifests
kubectl create namespace kafka

kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Verify
kubectl get pods -n kafka -w

Deploying a Kafka Cluster

Base Configuration (Dev/Test)

# kafka-cluster-dev.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-dev
  namespace: kafka
spec:
  kafka:
    version: 3.6.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.6"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
    resources:
      requests:
        memory: 1Gi
        cpu: 250m
      limits:
        memory: 2Gi
        cpu: 500m
  entityOperator:
    topicOperator: {}
    userOperator: {}
# Deploy
kubectl apply -f kafka-cluster-dev.yaml

# Watch the deployment
kubectl get kafka -n kafka -w

# Check the pods
kubectl get pods -n kafka
# kafka-dev-zookeeper-0    1/1     Running
# kafka-dev-zookeeper-1    1/1     Running
# kafka-dev-zookeeper-2    1/1     Running
# kafka-dev-kafka-0        1/1     Running
# kafka-dev-kafka-1        1/1     Running
# kafka-dev-kafka-2        1/1     Running

Production Configuration

# kafka-cluster-prod.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-prod
  namespace: kafka
  labels:
    app: kafka
    environment: production
spec:
  kafka:
    version: 3.6.0
    replicas: 5

    # =============== LISTENERS ===============
    listeners:
      # Internal (in-cluster)
      - name: plain
        port: 9092
        type: internal
        tls: false
      # Internal TLS
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      # External (LoadBalancer)
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: scram-sha-512
        configuration:
          bootstrap:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: nlb
          brokers:
            - broker: 0
              advertisedHost: kafka-0.example.com
            - broker: 1
              advertisedHost: kafka-1.example.com
            - broker: 2
              advertisedHost: kafka-2.example.com
            - broker: 3
              advertisedHost: kafka-3.example.com
            - broker: 4
              advertisedHost: kafka-4.example.com

    # =============== CONFIGURATION ===============
    config:
      # Replication
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2

      # Performance
      num.network.threads: 8
      num.io.threads: 16
      socket.send.buffer.bytes: 102400
      socket.receive.buffer.bytes: 102400
      socket.request.max.bytes: 104857600

      # Log retention
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      log.retention.check.interval.ms: 300000

      # Compression
      compression.type: lz4

      # Protocol version
      inter.broker.protocol.version: "3.6"
      log.message.format.version: "3.6"

    # =============== RACK AWARENESS ===============
    rack:
      topologyKey: topology.kubernetes.io/zone

    # =============== STORAGE ===============
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false

    # =============== RESOURCES ===============
    resources:
      requests:
        memory: 8Gi
        cpu: 2000m
      limits:
        memory: 16Gi
        cpu: 4000m

    # =============== JVM OPTIONS ===============
    jvmOptions:
      -Xms: 6g
      -Xmx: 6g
      gcLoggingEnabled: true
      javaSystemProperties:
        - name: com.sun.management.jmxremote.port
          value: "9999"

    # =============== METRICS ===============
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml

    # =============== LIVENESS & READINESS ===============
    livenessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5
    readinessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5

    # =============== TEMPLATE ===============
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-kafka
                topologyKey: kubernetes.io/hostname
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                strimzi.io/name: kafka-prod-kafka

  # =============== ZOOKEEPER ===============
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: fast-ssd
      deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    jvmOptions:
      -Xms: 1g
      -Xmx: 1g
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-zookeeper
                topologyKey: kubernetes.io/hostname

  # =============== ENTITY OPERATOR ===============
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
    userOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m

  # =============== CRUISE CONTROL ===============
  cruiseControl:
    config:
      goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
    resources:
      requests:
        memory: 512Mi
        cpu: 200m
      limits:
        memory: 2Gi
        cpu: 1000m
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: cruisecontrol-metrics-config.yml

KRaft Configuration (Without ZooKeeper)

# kafka-kraft.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-kraft
spec:
  replicas: 3
  roles:
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-kraft
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.6.0
    metadataVersion: 3.6-IV2
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
  entityOperator:
    topicOperator: {}
    userOperator: {}
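
The KafkaNodePool and Kafka resources above can be applied together; with node pools, broker pods are named after the pool rather than the cluster. A minimal check, assuming the manifests were saved as kafka-kraft.yaml:

```shell
# Deploy the KafkaNodePool and Kafka resources together
kubectl apply -f kafka-kraft.yaml

# Watch the dual-role pods come up (named <cluster>-<pool>-<id>)
kubectl get pods -n kafka -l strimzi.io/cluster=kafka-kraft -w

# Confirm the cluster is Ready -- note there are no ZooKeeper pods
kubectl get kafka kafka-kraft -n kafka
```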

Storage Management

Recommended Storage Classes

# storage-class-aws.yaml (AWS EBS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  fsType: xfs
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-gcp.yaml (GCP PD)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-azure.yaml (Azure Disk)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

JBOD (Just a Bunch Of Disks)

# JBOD configuration for high performance
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 1
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 2
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false

Storage Expansion

# Change the size in the Kafka CR
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 1000Gi  # Increased from 500Gi to 1000Gi
      class: fast-ssd
      deleteClaim: false
# Apply and verify
kubectl apply -f kafka-cluster-prod.yaml
kubectl get pvc -n kafka
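
Online expansion only works if the StorageClass permits it. A quick sketch of the checks worth running around a resize (the label selector assumes Strimzi's default resource labels):

```shell
# The StorageClass must have allowVolumeExpansion: true (see the examples above)
kubectl get sc fast-ssd -o jsonpath='{.allowVolumeExpansion}'

# Watch the CAPACITY column update as the resize rolls through the brokers
kubectl get pvc -n kafka -l strimzi.io/cluster=kafka-prod -w
```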

Topic Management

KafkaTopic CRD

# topic-orders.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000    # 7 days
    segment.bytes: 1073741824  # 1 GB
    min.insync.replicas: 2
    compression.type: lz4
    cleanup.policy: delete
---
# topic-orders-dlq.yaml (Dead Letter Queue)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-dlq
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 2592000000   # 30 days
    cleanup.policy: delete
---
# topic-compacted.yaml (Compacted topic)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: user-profiles
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    cleanup.policy: compact
    min.cleanable.dirty.ratio: 0.5
    segment.ms: 3600000
# Deploy the topics
kubectl apply -f topics/

# List the topics
kubectl get kafkatopics -n kafka

# Describe a topic
kubectl describe kafkatopic orders -n kafka

User Management and Security

KafkaUser with SCRAM

# user-producer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-producer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Producer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Write
          - Describe
          - Create
      # Idempotent producer
      - resource:
          type: cluster
        operations:
          - IdempotentWrite
---
# user-consumer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-consumer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Consumer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Read
          - Describe
      - resource:
          type: group
          name: orders-consumer-group
          patternType: literal
        operations:
          - Read
# Deploy the users
kubectl apply -f users/

# Retrieve the credentials
kubectl get secret orders-producer -n kafka -o jsonpath='{.data.password}' | base64 -d
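
The retrieved password plugs into a standard Kafka client config. A minimal sketch for a SCRAM listener — the security.protocol and bootstrap address depend on which listener you target (the external listener in the production example uses TLS, hence SASL_SSL plus a truststore), and client.properties is an illustrative file name:

```shell
# Extract the SCRAM password generated by the User Operator
PASSWORD=$(kubectl get secret orders-producer -n kafka \
  -o jsonpath='{.data.password}' | base64 -d)

# Illustrative client config (use SASL_SSL + a truststore for a TLS listener)
cat > client.properties <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \\
  username="orders-producer" password="${PASSWORD}";
EOF
```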

TLS Mutual Authentication

# user-tls.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-client
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operations:
          - Read
          - Write
          - Describe

Scaling and High Availability

Horizontal Broker Scaling

# Change the number of replicas
kubectl patch kafka kafka-prod -n kafka --type merge -p '{"spec":{"kafka":{"replicas":7}}}'

# Or edit the CR
kubectl edit kafka kafka-prod -n kafka
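
New brokers come up empty: existing partitions stay where they are until a rebalance moves data onto them, which is what Cruise Control (next section) is for. To confirm the extra pods joined, using Strimzi's default name label:

```shell
# Broker pods carry the label strimzi.io/name=<cluster>-kafka
kubectl get pods -n kafka -l strimzi.io/name=kafka-prod-kafka
```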

Rebalancing avec Cruise Control

# rebalance.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: full-rebalance
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  mode: full
  goals:
    - RackAwareGoal
    - ReplicaCapacityGoal
    - DiskCapacityGoal
    - NetworkInboundCapacityGoal
    - NetworkOutboundCapacityGoal
  skipHardGoalCheck: false
# Start a rebalance
kubectl apply -f rebalance.yaml

# Watch the status
kubectl get kafkarebalance full-rebalance -n kafka -w

# Approve the proposal
kubectl annotate kafkarebalance full-rebalance \
  strimzi.io/rebalance=approve -n kafka

# Check for completion
kubectl get kafkarebalance full-rebalance -n kafka

Pod Disruption Budget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  minAvailable: 3
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-kafka
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
  namespace: kafka
spec:
  minAvailable: 2
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-zookeeper

Monitoring and Observability

Prometheus Configuration

# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
    # Broker metrics
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        topic: "$4"
        partition: "$5"
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        broker: "$4:$5"
    - pattern: kafka.server<type=(.+), name=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
    # Controller metrics
    - pattern: kafka.controller<type=(.+), name=(.+)><>Value
      name: kafka_controller_$1_$2
      type: GAUGE
    # Network metrics
    - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
      name: kafka_network_$1_$2_total
      type: COUNTER
      labels:
        request: "$3"
        error: "$4"
    # Log metrics
    - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
      name: kafka_log_$1_$2
      type: GAUGE
      labels:
        topic: "$3"
        partition: "$4"

  zookeeper-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "org.apache.ZooKeeperService<name0=(.+)><>(\\w+)"
      name: zookeeper_$2
      type: GAUGE
    - pattern: "org.apache.ZooKeeperService<name0=(.+), name1=(.+)><>(\\w+)"
      name: zookeeper_$3
      type: GAUGE
      labels:
        replicaId: "$2"

  cruisecontrol-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "kafka.cruisecontrol<name=(.+)><>(\\w+)"
      name: cruise_control_$1_$2
      type: GAUGE

ServiceMonitor for the Prometheus Operator

# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-monitor
  namespace: kafka
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  namespaceSelector:
    matchNames:
      - kafka
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-monitor
  namespace: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
      strimzi.io/name: kafka-prod-zookeeper
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
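
Before wiring dashboards, it is worth spot-checking that the exporter answers. A sketch assuming Strimzi's default metrics port (9404) on a broker pod of the kafka-prod cluster:

```shell
# Forward the Prometheus port of broker 0 and scrape it once
kubectl port-forward pod/kafka-prod-kafka-0 9404:9404 -n kafka &
sleep 2
curl -s localhost:9404/metrics | grep '^kafka_server' | head
kill %1
```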

Grafana Dashboard

# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-dashboard.json: |
    {
      "dashboard": {
        "title": "Kafka Overview",
        "panels": [
          {
            "title": "Messages In/Sec",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
                "legendFormat": "Messages/sec"
              }
            ]
          },
          {
            "title": "Bytes In/Out",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_count[5m]))",
                "legendFormat": "Bytes In/sec"
              },
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_count[5m]))",
                "legendFormat": "Bytes Out/sec"
              }
            ]
          },
          {
            "title": "Under Replicated Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
              }
            ]
          },
          {
            "title": "Offline Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_controller_kafkacontroller_offlinepartitionscount)"
              }
            ]
          }
        ]
      }
    }

Prometheus Alerts

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: kafka
spec:
  groups:
    - name: kafka.rules
      rules:
        # Under-replicated partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated"

        # Offline partitions (CRITICAL)
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline"

        # No active controller
        - alert: KafkaNoActiveController
          expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"

        # Broker down
        - alert: KafkaBrokerDown
          expr: count(up{job="kafka"} == 1) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka broker is down"
            description: "Less than 3 brokers are up"

        # Consumer lag
        - alert: KafkaConsumerLag
          expr: sum(kafka_consumer_fetch_manager_records_lag) by (group) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag"
            description: "Consumer group {{ $labels.group }} has lag > 10000"

        # Disk usage
        - alert: KafkaDiskUsageHigh
          expr: (sum(kafka_log_size) by (pod) / sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-.*kafka.*"}) by (pod)) > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Kafka disk usage > 80%"

Operations and Maintenance

Rolling Restart

# Trigger a rolling restart of a single broker
kubectl annotate pod kafka-prod-kafka-0 -n kafka \
  strimzi.io/manual-rolling-update=true

# Or for all brokers (recent Strimzi versions manage pods through
# StrimziPodSets rather than StatefulSets; annotate the StrimziPodSet there)
kubectl annotate statefulset kafka-prod-kafka -n kafka \
  strimzi.io/manual-rolling-update=true

# Watch the rollout
kubectl rollout status statefulset/kafka-prod-kafka -n kafka

Version Upgrades

# Change the version in the Kafka CR
spec:
  kafka:
    version: 3.7.0  # New version
    config:
      inter.broker.protocol.version: "3.6"  # Keep the old protocol version
      log.message.format.version: "3.6"
# Apply and watch
kubectl apply -f kafka-cluster-prod.yaml
kubectl get kafka kafka-prod -n kafka -w

# After the rollout completes, bump the protocol versions
spec:
  kafka:
    config:
      inter.broker.protocol.version: "3.7"
      log.message.format.version: "3.7"

Backup and Restore

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-backup
  namespace: kafka
spec:
  schedule: "0 2 * * *"  # Every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kafka:3.6
              command:
                - /bin/bash
                - -c
                - |
                  # Export the topic configurations
                  kafka-topics.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --describe > /backup/topics-$(date +%Y%m%d).txt

                  # Back up the consumer group offsets
                  kafka-consumer-groups.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --all-groups --describe > /backup/consumer-groups-$(date +%Y%m%d).txt

                  # Upload to S3
                  aws s3 cp /backup/ s3://my-bucket/kafka-backups/ --recursive
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
          volumes:
            - name: backup-volume
              emptyDir: {}
          restartPolicy: OnFailure

Troubleshooting

Common Issues

Pod Pending (Storage)

# Check the PVCs
kubectl get pvc -n kafka

# Check the StorageClass
kubectl get sc

# Check the events
kubectl describe pod kafka-prod-kafka-0 -n kafka

Broker Not Ready

# Broker logs
kubectl logs kafka-prod-kafka-0 -n kafka -c kafka

# Check the config
kubectl exec kafka-prod-kafka-0 -n kafka -- cat /tmp/strimzi.properties

# Check that the broker answers API requests
kubectl exec kafka-prod-kafka-0 -n kafka -- \
  kafka-broker-api-versions.sh --bootstrap-server localhost:9092

Consumer Lag

# Check the lag
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# Check the partitions
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic

Network Issues

# Connectivity test
kubectl run test-pod --rm -it --image=busybox -n kafka -- \
  nslookup kafka-prod-kafka-bootstrap

# Check the services
kubectl get svc -n kafka

# Check the endpoints
kubectl get endpoints -n kafka

Useful Commands

# List the topics
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 --list

# Describe a topic
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic

# Consume messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning --max-messages 10

# Produce messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic my-topic

# Inspect the cluster metadata (KRaft clusters only)
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-metadata-shell.sh --snapshot /var/lib/kafka/data-0/__cluster_metadata-0/00000000000000000000.log

# Rebalance status
kubectl get kafkarebalance -n kafka

Key Takeaways

  1. Strimzi is the recommended operator: it dramatically simplifies running Kafka on K8s
  2. SSD storage is a must: use SSD-backed StorageClasses for performance
  3. Rack awareness: spread brokers across multiple zones for HA
  4. Monitoring is non-negotiable: Prometheus + Grafana + alerts are essential
  5. KRaft is the future: migrate to KRaft to eliminate ZooKeeper
  6. GitOps friendly: every CRD can be versioned in Git
  7. PDBs for stability: configure PodDisruptionBudgets

Deploying Kafka on Kubernetes in production calls for specific expertise. Contact me for help designing and deploying your cloud-native Kafka infrastructure.

Florian Courouge

DevOps & Kafka expert | Freelance consultant specializing in distributed architectures and data streaming.
