
Kafka on Kubernetes in Production

A complete guide to deploying and operating Apache Kafka on Kubernetes in production: the Strimzi operator, StatefulSets, storage configuration, scaling, monitoring, and troubleshooting strategies.

Florian Courouge


Introduction

Running Apache Kafka on Kubernetes combines two complex technologies, but the benefits are significant: automation, self-healing, scaling, and portability. This guide covers deploying and operating Kafka in production on Kubernetes, end to end.

We will cover:

  • Architecture patterns for Kafka on K8s
  • Deployment with the Strimzi Operator
  • Persistent storage management
  • Scaling and high availability
  • Monitoring and observability
  • Troubleshooting strategies

Kafka on Kubernetes Architecture

Why Kubernetes for Kafka?

Benefits

┌─────────────────────────────────────────────────────────────┐
│                 KAFKA + KUBERNETES                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Automation                                                 │
│  ├─ Declarative deployments (GitOps)                        │
│  ├─ Automatic rolling updates                               │
│  └─ Self-healing (failed pods restarted)                    │
│                                                             │
│  Scalability                                                │
│  ├─ Horizontal broker scaling                               │
│  ├─ Consumer auto-scaling                                   │
│  └─ Native resource management                              │
│                                                             │
│  Portability                                                │
│  ├─ Multi-cloud                                             │
│  ├─ On-premise / hybrid                                     │
│  └─ Identical environments (dev/staging/prod)               │
│                                                             │
│  Observability                                              │
│  ├─ Prometheus/Grafana integration                          │
│  ├─ Centralized logging                                     │
│  └─ Distributed tracing                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Challenges

┌─────────────────────────────────────────────────────────────┐
│                    CHALLENGES                               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Storage                                                    │
│  ├─ Kafka is stateful (requires persistence)                │
│  ├─ I/O performance is critical                             │
│  └─ PVC provisioning and management                         │
│                                                             │
│  Network                                                    │
│  ├─ DNS and service discovery                               │
│  ├─ External exposure                                       │
│  └─ Inter-pod latency                                       │
│                                                             │
│  Operations                                                 │
│  ├─ Complex rolling restarts                                │
│  ├─ Partition rebalancing                                   │
│  └─ Backup and disaster recovery                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Deployment Options

Comparing the Approaches

┌────────────────────────────────────────────────────────────────┐
│ Method           │ Complexity │ Production │ Recommended for   │
├──────────────────┼────────────┼────────────┼───────────────────┤
│ Helm charts      │ Medium     │ Yes        │ Small clusters    │
│ Strimzi Operator │ Low        │ Yes        │ Recommended       │
│ Confluent CFK    │ Low        │ Yes        │ Enterprise        │
│ DIY StatefulSet  │ High       │ Possible   │ Not recommended   │
└────────────────────────────────────────────────────────────────┘

Strimzi: The Kafka Operator for Kubernetes

Strimzi manages:
├─ Kafka clusters (brokers)
├─ ZooKeeper (or KRaft)
├─ Kafka Connect
├─ Kafka MirrorMaker
├─ Kafka Bridge (HTTP)
└─ Cruise Control (rebalancing)

CRDs (Custom Resource Definitions):
├─ Kafka
├─ KafkaTopic
├─ KafkaUser
├─ KafkaConnect
├─ KafkaMirrorMaker2
├─ KafkaBridge
└─ KafkaRebalance
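
Once the operator is installed (next section), a quick way to confirm these CRDs are registered is to query the API server. A minimal sketch using standard kubectl commands:

```shell
# List the CRDs registered by the Strimzi operator
kubectl get crds | grep strimzi.io

# Inspect the schema of the Kafka CRD before writing custom resources
kubectl explain kafka.spec.kafka
```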

Installing Strimzi

Prerequisites

# Kubernetes 1.21+
kubectl version

# Dedicated namespace
kubectl create namespace kafka

# Storage class with dynamic provisioning
kubectl get storageclass

Installation via Helm

# Add the Strimzi Helm repo
helm repo add strimzi https://strimzi.io/charts/
helm repo update

# Install the operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --set watchAnyNamespace=true \
  --version 0.38.0

# Verify the installation
kubectl get pods -n kafka
# NAME                                        READY   STATUS    RESTARTS   AGE
# strimzi-cluster-operator-xxx-yyy            1/1     Running   0          30s

Installation via Manifests

# Download and apply the manifests
kubectl create namespace kafka

kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Verify
kubectl get pods -n kafka -w

Deploying a Kafka Cluster

Base Configuration (Dev/Test)

# kafka-cluster-dev.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-dev
  namespace: kafka
spec:
  kafka:
    version: 3.6.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.6"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
    resources:
      requests:
        memory: 1Gi
        cpu: 250m
      limits:
        memory: 2Gi
        cpu: 500m
  entityOperator:
    topicOperator: {}
    userOperator: {}
# Deploy
kubectl apply -f kafka-cluster-dev.yaml

# Watch the deployment
kubectl get kafka -n kafka -w

# Check the pods
kubectl get pods -n kafka
# kafka-dev-zookeeper-0    1/1     Running
# kafka-dev-zookeeper-1    1/1     Running
# kafka-dev-zookeeper-2    1/1     Running
# kafka-dev-kafka-0        1/1     Running
# kafka-dev-kafka-1        1/1     Running
# kafka-dev-kafka-2        1/1     Running

Production Configuration

# kafka-cluster-prod.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-prod
  namespace: kafka
  labels:
    app: kafka
    environment: production
spec:
  kafka:
    version: 3.6.0
    replicas: 5

    # =============== LISTENERS ===============
    listeners:
      # Internal (in-cluster)
      - name: plain
        port: 9092
        type: internal
        tls: false
      # Internal TLS
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      # External (LoadBalancer)
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: scram-sha-512
        configuration:
          bootstrap:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: nlb
          brokers:
            - broker: 0
              advertisedHost: kafka-0.example.com
            - broker: 1
              advertisedHost: kafka-1.example.com
            - broker: 2
              advertisedHost: kafka-2.example.com
            - broker: 3
              advertisedHost: kafka-3.example.com
            - broker: 4
              advertisedHost: kafka-4.example.com

    # =============== CONFIGURATION ===============
    config:
      # Replication
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2

      # Performance
      num.network.threads: 8
      num.io.threads: 16
      socket.send.buffer.bytes: 102400
      socket.receive.buffer.bytes: 102400
      socket.request.max.bytes: 104857600

      # Log retention
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      log.retention.check.interval.ms: 300000

      # Compression
      compression.type: lz4

      # Protocol version
      inter.broker.protocol.version: "3.6"
      log.message.format.version: "3.6"

    # =============== RACK AWARENESS ===============
    rack:
      topologyKey: topology.kubernetes.io/zone

    # =============== STORAGE ===============
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          class: fast-ssd
          deleteClaim: false

    # =============== RESOURCES ===============
    resources:
      requests:
        memory: 8Gi
        cpu: 2000m
      limits:
        memory: 16Gi
        cpu: 4000m

    # =============== JVM OPTIONS ===============
    jvmOptions:
      -Xms: 6g
      -Xmx: 6g
      gcLoggingEnabled: true
      javaSystemProperties:
        - name: com.sun.management.jmxremote.port
          value: "9999"

    # =============== METRICS ===============
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml

    # =============== LIVENESS & READINESS ===============
    livenessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5
    readinessProbe:
      initialDelaySeconds: 60
      timeoutSeconds: 5

    # =============== TEMPLATE ===============
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-kafka
                topologyKey: kubernetes.io/hostname
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                strimzi.io/name: kafka-prod-kafka

  # =============== ZOOKEEPER ===============
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: fast-ssd
      deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    jvmOptions:
      -Xms: 1g
      -Xmx: 1g
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - kafka-prod-zookeeper
                topologyKey: kubernetes.io/hostname

  # =============== ENTITY OPERATOR ===============
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
    userOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m

  # =============== CRUISE CONTROL ===============
  cruiseControl:
    config:
      goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
    resources:
      requests:
        memory: 512Mi
        cpu: 200m
      limits:
        memory: 2Gi
        cpu: 1000m
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: cruisecontrol-metrics-config.yml

KRaft Configuration (Without ZooKeeper)

# kafka-kraft.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-kraft
spec:
  replicas: 3
  roles:
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-kraft
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.6.0
    metadataVersion: 3.6-IV2
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
  entityOperator:
    topicOperator: {}
    userOperator: {}
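
The KafkaNodePool and Kafka resources above can be applied together; with node pools, broker pods are named after the pool rather than the cluster. A minimal check, assuming the manifests were saved as kafka-kraft.yaml:

```shell
# Deploy the KafkaNodePool and Kafka resources together
kubectl apply -f kafka-kraft.yaml

# Watch the dual-role pods come up (named <cluster>-<pool>-<id>)
kubectl get pods -n kafka -l strimzi.io/cluster=kafka-kraft -w

# Confirm the cluster is Ready -- note there are no ZooKeeper pods
kubectl get kafka kafka-kraft -n kafka
```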

Storage Management

Recommended Storage Classes

# storage-class-aws.yaml (AWS EBS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  fsType: xfs
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-gcp.yaml (GCP PD)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage-class-azure.yaml (Azure Disk)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

JBOD (Just a Bunch Of Disks)

# JBOD configuration for high performance
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 1
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false
    - id: 2
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false

Storage Expansion

# Change the size in the Kafka CR
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 1000Gi  # Increased from 500Gi to 1000Gi
      class: fast-ssd
      deleteClaim: false
# Apply and verify
kubectl apply -f kafka-cluster-prod.yaml
kubectl get pvc -n kafka
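
Online expansion only works if the StorageClass permits it. A quick sketch of the checks worth running around a resize (the label selector assumes Strimzi's default resource labels):

```shell
# The StorageClass must have allowVolumeExpansion: true (see the examples above)
kubectl get sc fast-ssd -o jsonpath='{.allowVolumeExpansion}'

# Watch the CAPACITY column update as the resize rolls through the brokers
kubectl get pvc -n kafka -l strimzi.io/cluster=kafka-prod -w
```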

Topic Management

KafkaTopic CRD

# topic-orders.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000    # 7 days
    segment.bytes: 1073741824  # 1 GB
    min.insync.replicas: 2
    compression.type: lz4
    cleanup.policy: delete
---
# topic-orders-dlq.yaml (Dead Letter Queue)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-dlq
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 2592000000   # 30 days
    cleanup.policy: delete
---
# topic-compacted.yaml (Compacted topic)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: user-profiles
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  partitions: 6
  replicas: 3
  config:
    cleanup.policy: compact
    min.cleanable.dirty.ratio: 0.5
    segment.ms: 3600000
# Deploy the topics
kubectl apply -f topics/

# List the topics
kubectl get kafkatopics -n kafka

# Describe a topic
kubectl describe kafkatopic orders -n kafka

User Management and Security

KafkaUser with SCRAM

# user-producer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-producer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Producer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Write
          - Describe
          - Create
      # Idempotent producer
      - resource:
          type: cluster
        operations:
          - IdempotentWrite
---
# user-consumer.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-consumer
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Consumer ACLs
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations:
          - Read
          - Describe
      - resource:
          type: group
          name: orders-consumer-group
          patternType: literal
        operations:
          - Read
# Deploy the users
kubectl apply -f users/

# Retrieve the credentials
kubectl get secret orders-producer -n kafka -o jsonpath='{.data.password}' | base64 -d
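
The retrieved password plugs into a standard Kafka client config. A minimal sketch for a SCRAM listener — the security.protocol and bootstrap address depend on which listener you target (the external listener in the production example uses TLS, hence SASL_SSL plus a truststore), and client.properties is an illustrative file name:

```shell
# Extract the SCRAM password generated by the User Operator
PASSWORD=$(kubectl get secret orders-producer -n kafka \
  -o jsonpath='{.data.password}' | base64 -d)

# Illustrative client config (use SASL_SSL + a truststore for a TLS listener)
cat > client.properties <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \\
  username="orders-producer" password="${PASSWORD}";
EOF
```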

TLS Mutual Authentication

# user-tls.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-client
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operations:
          - Read
          - Write
          - Describe

Scaling and High Availability

Horizontal Broker Scaling

# Change the number of replicas
kubectl patch kafka kafka-prod -n kafka --type merge -p '{"spec":{"kafka":{"replicas":7}}}'

# Or edit the CR
kubectl edit kafka kafka-prod -n kafka
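
New brokers come up empty: existing partitions stay where they are until a rebalance moves data onto them, which is what Cruise Control (next section) is for. To confirm the extra pods joined, using Strimzi's default name label:

```shell
# Broker pods carry the label strimzi.io/name=<cluster>-kafka
kubectl get pods -n kafka -l strimzi.io/name=kafka-prod-kafka
```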

Rebalancing avec Cruise Control

# rebalance.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: full-rebalance
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-prod
spec:
  mode: full
  goals:
    - RackAwareGoal
    - ReplicaCapacityGoal
    - DiskCapacityGoal
    - NetworkInboundCapacityGoal
    - NetworkOutboundCapacityGoal
  skipHardGoalCheck: false
# Start a rebalance
kubectl apply -f rebalance.yaml

# Watch the status
kubectl get kafkarebalance full-rebalance -n kafka -w

# Approve the proposal
kubectl annotate kafkarebalance full-rebalance \
  strimzi.io/rebalance=approve -n kafka

# Check for completion
kubectl get kafkarebalance full-rebalance -n kafka

Pod Disruption Budget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  minAvailable: 3
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-kafka
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
  namespace: kafka
spec:
  minAvailable: 2
  selector:
    matchLabels:
      strimzi.io/name: kafka-prod-zookeeper

Monitoring and Observability

Prometheus Configuration

# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
    # Broker metrics
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        topic: "$4"
        partition: "$5"
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        broker: "$4:$5"
    - pattern: kafka.server<type=(.+), name=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
    # Controller metrics
    - pattern: kafka.controller<type=(.+), name=(.+)><>Value
      name: kafka_controller_$1_$2
      type: GAUGE
    # Network metrics
    - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
      name: kafka_network_$1_$2_total
      type: COUNTER
      labels:
        request: "$3"
        error: "$4"
    # Log metrics
    - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
      name: kafka_log_$1_$2
      type: GAUGE
      labels:
        topic: "$3"
        partition: "$4"

  zookeeper-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "org.apache.ZooKeeperService<name0=(.+)><>(\\w+)"
      name: zookeeper_$2
      type: GAUGE
    - pattern: "org.apache.ZooKeeperService<name0=(.+), name1=(.+)><>(\\w+)"
      name: zookeeper_$3
      type: GAUGE
      labels:
        replicaId: "$2"

  cruisecontrol-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "kafka.cruisecontrol<name=(.+)><>(\\w+)"
      name: cruise_control_$1_$2
      type: GAUGE

ServiceMonitor for the Prometheus Operator

# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-monitor
  namespace: kafka
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  namespaceSelector:
    matchNames:
      - kafka
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-monitor
  namespace: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
      strimzi.io/name: kafka-prod-zookeeper
  endpoints:
    - port: tcp-prometheus
      path: /metrics
      interval: 15s
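
Before wiring dashboards, it is worth spot-checking that the exporter answers. A sketch assuming Strimzi's default metrics port (9404) on a broker pod of the kafka-prod cluster:

```shell
# Forward the Prometheus port of broker 0 and scrape it once
kubectl port-forward pod/kafka-prod-kafka-0 9404:9404 -n kafka &
sleep 2
curl -s localhost:9404/metrics | grep '^kafka_server' | head
kill %1
```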

Grafana Dashboard

# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-dashboard.json: |
    {
      "dashboard": {
        "title": "Kafka Overview",
        "panels": [
          {
            "title": "Messages In/Sec",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
                "legendFormat": "Messages/sec"
              }
            ]
          },
          {
            "title": "Bytes In/Out",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_count[5m]))",
                "legendFormat": "Bytes In/sec"
              },
              {
                "expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_count[5m]))",
                "legendFormat": "Bytes Out/sec"
              }
            ]
          },
          {
            "title": "Under Replicated Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
              }
            ]
          },
          {
            "title": "Offline Partitions",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kafka_controller_kafkacontroller_offlinepartitionscount)"
              }
            ]
          }
        ]
      }
    }

Prometheus Alerts

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: kafka
spec:
  groups:
    - name: kafka.rules
      rules:
        # Under-replicated partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated"

        # Offline partitions (CRITICAL)
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline"

        # No active controller
        - alert: KafkaNoActiveController
          expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"

        # Broker down
        - alert: KafkaBrokerDown
          expr: count(up{job="kafka"} == 1) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka broker is down"
            description: "Less than 3 brokers are up"

        # Consumer lag
        - alert: KafkaConsumerLag
          expr: sum(kafka_consumer_fetch_manager_records_lag) by (group) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag"
            description: "Consumer group {{ $labels.group }} has lag > 10000"

        # Disk usage
        - alert: KafkaDiskUsageHigh
          expr: (sum(kafka_log_size) by (pod) / sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-.*kafka.*"}) by (pod)) > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Kafka disk usage > 80%"

Operations and Maintenance

Rolling Restart

# Trigger a rolling restart of a single broker
kubectl annotate pod kafka-prod-kafka-0 -n kafka \
  strimzi.io/manual-rolling-update=true

# Or for all brokers (recent Strimzi versions manage pods through
# StrimziPodSets rather than StatefulSets; annotate the StrimziPodSet there)
kubectl annotate statefulset kafka-prod-kafka -n kafka \
  strimzi.io/manual-rolling-update=true

# Watch the rollout
kubectl rollout status statefulset/kafka-prod-kafka -n kafka

Version Upgrades

# Change the version in the Kafka CR
spec:
  kafka:
    version: 3.7.0  # New version
    config:
      inter.broker.protocol.version: "3.6"  # Keep the old protocol version
      log.message.format.version: "3.6"
# Apply and watch
kubectl apply -f kafka-cluster-prod.yaml
kubectl get kafka kafka-prod -n kafka -w

# After the rollout completes, bump the protocol versions
spec:
  kafka:
    config:
      inter.broker.protocol.version: "3.7"
      log.message.format.version: "3.7"

Backup and Restore

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-backup
  namespace: kafka
spec:
  schedule: "0 2 * * *"  # Every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kafka:3.6
              command:
                - /bin/bash
                - -c
                - |
                  # Export the topic configurations
                  kafka-topics.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --describe > /backup/topics-$(date +%Y%m%d).txt

                  # Back up the consumer group offsets
                  kafka-consumer-groups.sh --bootstrap-server kafka-prod-kafka-bootstrap:9092 \
                    --all-groups --describe > /backup/consumer-groups-$(date +%Y%m%d).txt

                  # Upload to S3
                  aws s3 cp /backup/ s3://my-bucket/kafka-backups/ --recursive
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
          volumes:
            - name: backup-volume
              emptyDir: {}
          restartPolicy: OnFailure

Troubleshooting

Common Issues

Pod Pending (Storage)

# Check the PVCs
kubectl get pvc -n kafka

# Check the StorageClass
kubectl get sc

# Check the events
kubectl describe pod kafka-prod-kafka-0 -n kafka

Broker Not Ready

# Broker logs
kubectl logs kafka-prod-kafka-0 -n kafka -c kafka

# Check the config
kubectl exec kafka-prod-kafka-0 -n kafka -- cat /tmp/strimzi.properties

# Check that the broker answers API requests
kubectl exec kafka-prod-kafka-0 -n kafka -- \
  kafka-broker-api-versions.sh --bootstrap-server localhost:9092

Consumer Lag

# Check the lag
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# Check the partitions
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic

Network Issues

# Connectivity test
kubectl run test-pod --rm -it --image=busybox -n kafka -- \
  nslookup kafka-prod-kafka-bootstrap

# Check the services
kubectl get svc -n kafka

# Check the endpoints
kubectl get endpoints -n kafka

Useful Commands

# List the topics
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 --list

# Describe a topic
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic my-topic

# Consume messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning --max-messages 10

# Produce messages
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic my-topic

# Inspect the cluster metadata (KRaft clusters only)
kubectl exec -it kafka-prod-kafka-0 -n kafka -- \
  kafka-metadata-shell.sh --snapshot /var/lib/kafka/data-0/__cluster_metadata-0/00000000000000000000.log

# Rebalance status
kubectl get kafkarebalance -n kafka

Key Takeaways

  1. Strimzi is the recommended operator: it dramatically simplifies running Kafka on K8s
  2. SSD storage is a must: use SSD-backed StorageClasses for performance
  3. Rack awareness: spread brokers across multiple zones for HA
  4. Monitoring is non-negotiable: Prometheus + Grafana + alerts are essential
  5. KRaft is the future: migrate to KRaft to eliminate ZooKeeper
  6. GitOps friendly: every CRD can be versioned in Git
  7. PDBs for stability: configure PodDisruptionBudgets

Deploying Kafka on Kubernetes in production calls for specific expertise. Contact me for help designing and deploying your cloud-native Kafka infrastructure.

Florian Courouge

DevOps & Kafka expert | Freelance consultant specializing in distributed architectures and data streaming.
