CLOUD
Advanced

Cloud Native Architecture: Complete Multi-Cloud Guide and Best Practices

Master Cloud Native architecture: Twelve-Factor principles, microservices, containers, orchestration, Infrastructure as Code, and multi-cloud strategies.

Florian Courouge
28 min read
5,095 words
Cloud Native
AWS
GCP
Azure
Terraform
Multi-Cloud
IaC
Microservices

Cloud Native architecture is a modern approach to building and deploying scalable, resilient, and easily maintainable applications. This guide covers the fundamental principles, architecture patterns, and multi-cloud strategies.

What is Cloud Native?

CNCF Definition

The Cloud Native Computing Foundation defines Cloud Native as follows:

"Les technologies cloud native permettent aux organisations de construire et d'exécuter des applications scalables dans des environnements modernes et dynamiques tels que les clouds publics, privés et hybrides."

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PILLARS OF CLOUD NATIVE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│   │  CONTAINERS     │  │  MICROSERVICES  │  │    DEVOPS       │            │
│   │                 │  │                 │  │                 │            │
│   │  • Docker       │  │  • Decoupling   │  │  • CI/CD        │            │
│   │  • OCI          │  │  • APIs         │  │  • GitOps       │            │
│   │  • Immutability │  │  • Domain-Driven│  │  • Automation   │            │
│   └────────┬────────┘  └────────┬────────┘  └────────┬────────┘            │
│            │                    │                    │                      │
│            └────────────────────┼────────────────────┘                      │
│                                 ▼                                           │
│            ┌─────────────────────────────────────────┐                      │
│            │           ORCHESTRATION                 │                      │
│            │                                         │                      │
│            │  • Kubernetes    • Service Mesh        │                      │
│            │  • Scheduling    • Auto-scaling        │                      │
│            │  • Self-healing  • Rolling updates     │                      │
│            └─────────────────────────────────────────┘                      │
│                                 │                                           │
│                                 ▼                                           │
│            ┌─────────────────────────────────────────┐                      │
│            │       INFRASTRUCTURE AS CODE            │                      │
│            │                                         │                      │
│            │  • Terraform     • Pulumi              │                      │
│            │  • CloudFormation • ARM Templates      │                      │
│            │  • Declarative   • Versioned           │                      │
│            └─────────────────────────────────────────┘                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

The Twelve Factors (Twelve-Factor App)

┌─────────────────────────────────────────────────────────────────────────────┐
│                      THE TWELVE-FACTOR APP                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   1. CODEBASE                    │  7. PORT BINDING                         │
│   One codebase, many deploys     │  Export services via port binding        │
│                                  │                                          │
│   2. DEPENDENCIES                │  8. CONCURRENCY                          │
│   Explicitly declared            │  Scale out via processes                 │
│                                  │                                          │
│   3. CONFIG                      │  9. DISPOSABILITY                        │
│   Stored in the environment      │  Fast startup, graceful shutdown         │
│                                  │                                          │
│   4. BACKING SERVICES            │  10. DEV/PROD PARITY                     │
│   Treated as attached resources  │  Keep dev and prod as similar as possible│
│                                  │                                          │
│   5. BUILD, RELEASE, RUN         │  11. LOGS                                │
│   Strictly separated stages      │  Treat logs as event streams             │
│                                  │                                          │
│   6. PROCESSES                   │  12. ADMIN PROCESSES                     │
│   Stateless, share-nothing       │  Run as one-off tasks                    │
│                                  │                                          │
└─────────────────────────────────────────────────────────────────────────────┘
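
Factor III (config in the environment) is the one teams most often get wrong. Below is a minimal, illustrative Python sketch, assuming a hypothetical service whose settings are injected by the platform (ConfigMap, Secret, or task definition):

# config.py - hypothetical sketch of Factor III (config in the environment)
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    log_level: str

def load_settings() -> Settings:
    # Fail fast at startup if a required variable is missing (see Factor IX)
    return Settings(
        database_url=os.environ["DATABASE_URL"],
        redis_url=os.environ.get("REDIS_URL", "redis://localhost:6379"),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )

settings = load_settings()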

Microservices Architecture

Cloud Native Design Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                   MICROSERVICES ARCHITECTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                        ┌─────────────────┐                                  │
│    Clients ──────────▶ │   API Gateway   │                                  │
│                        │   (Kong/Envoy)  │                                  │
│                        └────────┬────────┘                                  │
│                                 │                                           │
│         ┌───────────────────────┼───────────────────────┐                   │
│         │                       │                       │                   │
│         ▼                       ▼                       ▼                   │
│  ┌─────────────┐        ┌─────────────┐        ┌─────────────┐             │
│  │   User      │        │   Order     │        │   Product   │             │
│  │   Service   │        │   Service   │        │   Service   │             │
│  └──────┬──────┘        └──────┬──────┘        └──────┬──────┘             │
│         │                      │                      │                     │
│         ▼                      ▼                      ▼                     │
│  ┌─────────────┐        ┌─────────────┐        ┌─────────────┐             │
│  │  PostgreSQL │        │    Kafka    │        │    Redis    │             │
│  │    (User)   │        │   (Events)  │        │   (Cache)   │             │
│  └─────────────┘        └─────────────┘        └─────────────┘             │
│                                                                              │
│  Patterns used:                                                              │
│  • Service Discovery (Consul, etcd)                                         │
│  • Circuit Breaker (Resilience4j, Hystrix)                                  │
│  • Event Sourcing / CQRS                                                    │
│  • Saga pattern for distributed transactions                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
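
Of the patterns listed above, the circuit breaker is the easiest to misread from a diagram alone. Here is a deliberately minimal Python sketch of the idea; the tools named above (Resilience4j, or Istio's outlierDetection shown later) implement it far more completely:

# circuit_breaker.py - minimal sketch of the Circuit Breaker pattern
import time

class CircuitBreaker:
    """Opens after max_failures consecutive errors; retries after reset_timeout."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result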

API Gateway with Kong

# kong-config.yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 100
  hour: 10000
  policy: redis
  redis_host: redis.infrastructure.svc.cluster.local
  redis_port: 6379

---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: jwt-auth
plugin: jwt
config:
  claims_to_verify:
  - exp
  key_claim_name: kid
  secret_is_base64: false

---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: correlation-id
plugin: correlation-id
config:
  header_name: X-Request-ID
  generator: uuid
  echo_downstream: true

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    konghq.com/plugins: rate-limiting,jwt-auth,correlation-id
    konghq.com/strip-path: "true"
spec:
  ingressClassName: kong
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /users
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 8080
      - path: /orders
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 8080
      - path: /products
        pathType: Prefix
        backend:
          service:
            name: product-service
            port:
              number: 8080
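
With the Ingress above in place, a client request must carry a JWT whose kid claim matches a key registered in Kong, and the correlation-id plugin echoes the request ID back to the caller. A hedged sketch of what a call looks like (hostname and token are placeholders; exact rate-limit header names depend on the Kong version):

# kong_client.py - sketch of a call through the Kong gateway configured above
import httpx

# Placeholder: a real JWT signed with a key registered as a Kong consumer credential
token = "eyJhbGciOiJIUzI1NiIsImtpZCI6Im15LWtleSJ9..."

resp = httpx.get(
    "https://api.example.com/users/42",
    headers={"Authorization": f"Bearer {token}"},
)

print(resp.status_code)
# Echoed by the correlation-id plugin (echo_downstream: true)
print(resp.headers.get("X-Request-ID"))
# Remaining quota advertised by the rate-limiting plugin
print(resp.headers.get("X-RateLimit-Remaining-Minute"))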

Service Mesh with Istio

# istio-config.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
  namespace: production
spec:
  hosts:
  - user-service
  http:
  # Canary deployment: 90% v1, 10% v2
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: user-service
        subset: v2
      weight: 100
  - route:
    - destination:
        host: user-service
        subset: v1
      weight: 90
    - destination:
        host: user-service
        subset: v2
      weight: 10
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure
    timeout: 10s

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service
  namespace: production
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

---
# Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service-cb
spec:
  host: order-service
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 5
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
      minHealthPercent: 0
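
The VirtualService above sends any request carrying x-canary: true entirely to v2, while unpinned traffic is split 90/10. A quick, illustrative way to verify both rules from a pod inside the mesh (the /version endpoint is an assumption about the service):

# canary_check.py - sketch: probe the Istio canary routing defined above
import collections
import httpx

URL = "http://user-service.production.svc.cluster.local/version"  # assumed endpoint

# Forced canary: should always land on v2
r = httpx.get(URL, headers={"x-canary": "true"})
print("forced canary ->", r.text)

# Unpinned traffic: expect roughly a 90/10 split between v1 and v2
counts = collections.Counter(httpx.get(URL).text for _ in range(200))
print(counts)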

Infrastructure as Code (IaC)

Terraform Multi-Cloud

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRUCTURE PROJET TERRAFORM                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  terraform/                                                                  │
│  ├── modules/                    # Reusable modules                         │
│  │   ├── networking/                                                        │
│  │   │   ├── vpc/                                                           │
│  │   │   ├── subnets/                                                       │
│  │   │   └── security-groups/                                               │
│  │   ├── kubernetes/                                                        │
│  │   │   ├── eks/                                                           │
│  │   │   ├── gke/                                                           │
│  │   │   └── aks/                                                           │
│  │   ├── databases/                                                         │
│  │   │   ├── rds/                                                           │
│  │   │   ├── cloudsql/                                                      │
│  │   │   └── cosmosdb/                                                      │
│  │   └── monitoring/                                                        │
│  │       ├── prometheus/                                                    │
│  │       └── grafana/                                                       │
│  │                                                                          │
│  ├── environments/               # Per-environment configuration            │
│  │   ├── dev/                                                               │
│  │   │   ├── main.tf                                                        │
│  │   │   ├── variables.tf                                                   │
│  │   │   ├── terraform.tfvars                                               │
│  │   │   └── backend.tf                                                     │
│  │   ├── staging/                                                           │
│  │   └── production/                                                        │
│  │                                                                          │
│  └── shared/                     # Shared resources                         │
│      ├── dns/                                                               │
│      ├── certificates/                                                      │
│      └── secrets/                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Multi-Cloud Kubernetes Module

# modules/kubernetes/main.tf

variable "cloud_provider" {
  type        = string
  description = "Cloud provider: aws, gcp, or azure"
}

variable "cluster_name" {
  type = string
}

variable "kubernetes_version" {
  type    = string
  default = "1.29"
}

variable "node_pools" {
  type = map(object({
    instance_type = string
    min_size      = number
    max_size      = number
    disk_size_gb  = number
    labels        = map(string)
    taints        = list(object({
      key    = string
      value  = string
      effect = string
    }))
  }))
}

# AWS EKS
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"
  count   = var.cloud_provider == "aws" ? 1 : 0

  cluster_name    = var.cluster_name
  cluster_version = var.kubernetes_version

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  cluster_endpoint_public_access = true

  eks_managed_node_groups = {
    for name, pool in var.node_pools : name => {
      instance_types = [pool.instance_type]
      min_size       = pool.min_size
      max_size       = pool.max_size
      desired_size   = pool.min_size

      disk_size = pool.disk_size_gb

      labels = pool.labels

      taints = [
        for taint in pool.taints : {
          key    = taint.key
          value  = taint.value
          effect = taint.effect
        }
      ]
    }
  }

  # Addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }

  tags = var.tags
}

# GCP GKE
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "~> 29.0"
  count   = var.cloud_provider == "gcp" ? 1 : 0

  project_id = var.project_id
  name       = var.cluster_name
  region     = var.region

  network    = var.network
  subnetwork = var.subnetwork

  kubernetes_version = var.kubernetes_version

  ip_range_pods     = var.pods_ip_range
  ip_range_services = var.services_ip_range

  node_pools = [
    for name, pool in var.node_pools : {
      name           = name
      machine_type   = pool.instance_type
      min_count      = pool.min_size
      max_count      = pool.max_size
      disk_size_gb   = pool.disk_size_gb
      auto_repair    = true
      auto_upgrade   = true
      node_labels    = pool.labels
    }
  ]

  node_pools_taints = {
    for name, pool in var.node_pools : name => pool.taints
  }
}

# Azure AKS
module "aks" {
  source  = "Azure/aks/azurerm"
  version = "~> 7.0"
  count   = var.cloud_provider == "azure" ? 1 : 0

  cluster_name        = var.cluster_name
  resource_group_name = var.resource_group_name
  location            = var.location

  kubernetes_version = var.kubernetes_version

  vnet_subnet_id = var.subnet_id

  agents_pools = [
    for name, pool in var.node_pools : {
      name            = name
      vm_size         = pool.instance_type
      min_count       = pool.min_size
      max_count       = pool.max_size
      os_disk_size_gb = pool.disk_size_gb
      node_labels     = pool.labels
      node_taints     = [for t in pool.taints : "${t.key}=${t.value}:${t.effect}"]
    }
  ]

  tags = var.tags
}

# Outputs
output "cluster_endpoint" {
  value = coalesce(
    try(module.eks[0].cluster_endpoint, ""),
    try(module.gke[0].endpoint, ""),
    try(module.aks[0].host, "")
  )
}

output "cluster_ca_certificate" {
  value = coalesce(
    try(module.eks[0].cluster_certificate_authority_data, ""),
    try(module.gke[0].ca_certificate, ""),
    try(module.aks[0].cluster_ca_certificate, "")
  )
  sensitive = true
}

Production Environment

# environments/production/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.25"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12"
    }
  }

  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

# VPC
module "vpc" {
  source = "../../modules/networking/vpc"

  name               = "${var.project_name}-production"
  cidr               = "10.0.0.0/16"
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true

  enable_dns_hostnames = true
  enable_dns_support   = true

  # Tags required by EKS
  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }
}

# Kubernetes Cluster
module "kubernetes" {
  source = "../../modules/kubernetes"

  cloud_provider     = "aws"
  cluster_name       = var.cluster_name
  kubernetes_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  node_pools = {
    system = {
      instance_type = "m6i.large"
      min_size      = 2
      max_size      = 4
      disk_size_gb  = 100
      labels = {
        "node-type" = "system"
      }
      taints = []
    }

    application = {
      instance_type = "m6i.xlarge"
      min_size      = 3
      max_size      = 20
      disk_size_gb  = 200
      labels = {
        "node-type" = "application"
      }
      taints = []
    }

    kafka = {
      instance_type = "r6i.2xlarge"
      min_size      = 3
      max_size      = 9
      disk_size_gb  = 500
      labels = {
        "node-type" = "kafka"
        "workload"  = "data-intensive"
      }
      taints = [{
        key    = "dedicated"
        value  = "kafka"
        effect = "NoSchedule"
      }]
    }

    gpu = {
      instance_type = "p3.2xlarge"
      min_size      = 0
      max_size      = 10
      disk_size_gb  = 200
      labels = {
        "node-type"                       = "gpu"
        "nvidia.com/gpu.present"          = "true"
      }
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NoSchedule"
      }]
    }
  }

  tags = var.tags
}

# Kubernetes provider, configured after cluster creation
provider "kubernetes" {
  host                   = module.kubernetes.cluster_endpoint
  cluster_ca_certificate = base64decode(module.kubernetes.cluster_ca_certificate)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
  }
}

provider "helm" {
  kubernetes {
    host                   = module.kubernetes.cluster_endpoint
    cluster_ca_certificate = base64decode(module.kubernetes.cluster_ca_certificate)

    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    }
  }
}

# Observability Stack
module "monitoring" {
  source = "../../modules/monitoring"

  depends_on = [module.kubernetes]

  prometheus_enabled     = true
  grafana_enabled        = true
  alertmanager_enabled   = true
  loki_enabled           = true
  tempo_enabled          = true

  storage_class = "gp3"

  prometheus_retention    = "30d"
  prometheus_storage_size = "100Gi"

  grafana_admin_password = var.grafana_admin_password
}

# Databases
module "rds" {
  source = "../../modules/databases/rds"

  name                = "${var.project_name}-production"
  engine              = "postgres"
  engine_version      = "15.4"
  instance_class      = "db.r6g.xlarge"
  allocated_storage   = 500
  max_allocated_storage = 1000

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  multi_az               = true
  backup_retention_period = 30
  deletion_protection    = true

  performance_insights_enabled = true
  monitoring_interval         = 60

  master_username = var.db_username
  master_password = var.db_password

  tags = var.tags
}

Multi-Cloud Strategies

Multi-Cloud Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MULTI-CLOUD ARCHITECTURE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                         ┌─────────────────┐                                 │
│                         │   Global LB     │                                 │
│                         │  (Cloudflare/   │                                 │
│                         │   Route53)      │                                 │
│                         └────────┬────────┘                                 │
│                                  │                                          │
│          ┌───────────────────────┼───────────────────────┐                  │
│          │                       │                       │                  │
│          ▼                       ▼                       ▼                  │
│   ┌─────────────┐        ┌─────────────┐        ┌─────────────┐            │
│   │    AWS      │        │    GCP      │        │   Azure     │            │
│   │  (Primary)  │        │ (Secondary) │        │  (DR Site)  │            │
│   │             │        │             │        │             │            │
│   │  ┌───────┐  │        │  ┌───────┐  │        │  ┌───────┐  │            │
│   │  │  EKS  │  │◀──────▶│  │  GKE  │  │◀──────▶│  │  AKS  │  │            │
│   │  └───────┘  │        │  └───────┘  │        │  └───────┘  │            │
│   │             │        │             │        │             │            │
│   │ ┌─────────┐ │        │ ┌─────────┐ │        │ ┌─────────┐ │            │
│   │ │   RDS   │ │───────▶│ │CloudSQL │ │───────▶│ │CosmosDB │ │            │
│   │ │(Master) │ │ async  │ │(Replica)│ │ async  │ │(Replica)│ │            │
│   │ └─────────┘ │ repl.  │ └─────────┘ │ repl.  │ └─────────┘ │            │
│   └─────────────┘        └─────────────┘        └─────────────┘            │
│                                                                              │
│   Patterns:                                                                  │
│   • Active-Active for high availability                                     │
│   • Active-Passive for DR                                                   │
│   • Cross-cloud data replication                                            │
│   • Service mesh multi-cluster (Istio)                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Multi-Cloud DNS Configuration with Route53

# dns/multi-cloud.tf

# Health checks for each region
resource "aws_route53_health_check" "aws_primary" {
  fqdn              = "api-aws.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "aws-primary-health-check"
  }
}

resource "aws_route53_health_check" "gcp_secondary" {
  fqdn              = "api-gcp.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "gcp-secondary-health-check"
  }
}

# Failover routing
resource "aws_route53_record" "api_primary" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.aws_primary.id
}

resource "aws_route53_record" "api_secondary" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  # Route53 aliases can only target AWS resources; a plain A record is
  # needed to point at the GCP load balancer IP
  ttl     = 60
  records = [var.gcp_lb_ip]

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier  = "secondary"
  health_check_id = aws_route53_health_check.gcp_secondary.id
}

# Geolocation routing for optimal latency
resource "aws_route53_record" "api_geo_eu" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.eu.dns_name
    zone_id                = aws_lb.eu.zone_id
    evaluate_target_health = true
  }

  geolocation_routing_policy {
    continent = "EU"
  }

  set_identifier = "eu"
}

resource "aws_route53_record" "api_geo_na" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  # Same constraint as above: plain A record for the non-AWS endpoint
  ttl     = 60
  records = [var.gcp_us_lb_ip]

  geolocation_routing_policy {
    continent = "NA"
  }

  set_identifier = "na"
}
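
The health checks above probe /health over HTTPS every 10 seconds, so every regional deployment must expose that path. A minimal sketch of such an endpoint with FastAPI (the dependency probes are placeholders for real connectivity checks):

# health.py - sketch of the /health endpoint probed by the Route53 health checks
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
async def health(response: Response):
    # Verify the dependencies that make this region actually serviceable
    checks = {
        "database": True,  # placeholder, e.g. SELECT 1 against the regional replica
        "cache": True,     # placeholder, e.g. Redis PING
    }
    healthy = all(checks.values())
    # Route53 counts any non-2xx/3xx answer toward failure_threshold
    response.status_code = 200 if healthy else 503
    return {"status": "ok" if healthy else "degraded", "checks": checks}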

Cross-Cloud Data Replication

# Kafka MirrorMaker 2 for cross-cloud replication
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: cross-cloud-mirror
  namespace: kafka
spec:
  version: 3.6.0
  replicas: 3

  connectCluster: "target-gcp"

  clusters:
  - alias: "source-aws"
    bootstrapServers: kafka-aws.example.com:9092
    tls:
      trustedCertificates:
      - secretName: aws-kafka-ca
        certificate: ca.crt
    authentication:
      type: tls
      certificateAndKey:
        secretName: aws-kafka-user
        certificate: user.crt
        key: user.key

  - alias: "target-gcp"
    bootstrapServers: kafka-gcp.example.com:9092
    tls:
      trustedCertificates:
      - secretName: gcp-kafka-ca
        certificate: ca.crt
    authentication:
      type: tls
      certificateAndKey:
        secretName: gcp-kafka-user
        certificate: user.crt
        key: user.key
    config:
      config.storage.replication.factor: 3
      offset.storage.replication.factor: 3
      status.storage.replication.factor: 3

  mirrors:
  - sourceCluster: "source-aws"
    targetCluster: "target-gcp"
    sourceConnector:
      tasksMax: 10
      config:
        replication.factor: 3
        offset-syncs.topic.replication.factor: 3
        sync.topic.acls.enabled: "false"
        refresh.topics.interval.seconds: 60
    heartbeatConnector:
      config:
        heartbeats.topic.replication.factor: 3
    checkpointConnector:
      config:
        checkpoints.topic.replication.factor: 3
        sync.group.offsets.enabled: "true"
    topicsPattern: ".*"
    groupsPattern: ".*"

  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi

---
# PostgreSQL Logical Replication
# Source AWS
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-publisher-config
data:
  setup-publisher.sql: |
    -- Enable logical replication
    ALTER SYSTEM SET wal_level = logical;
    ALTER SYSTEM SET max_replication_slots = 10;
    ALTER SYSTEM SET max_wal_senders = 10;

    -- Create publication
    CREATE PUBLICATION cross_cloud_pub FOR ALL TABLES;

    -- Create replication user
    CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;

# Target GCP - setup subscription
# CREATE SUBSCRIPTION cross_cloud_sub
#   CONNECTION 'host=postgres-aws.example.com port=5432 dbname=mydb user=replicator password=secure_password'
#   PUBLICATION cross_cloud_pub;
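
Once the subscription exists, replication lag should be watched on the publisher side. A sketch using psycopg2 and the standard pg_stat_replication view (connection details are placeholders):

# replication_lag.py - sketch: monitor logical replication lag on the publisher
import psycopg2

conn = psycopg2.connect(
    host="postgres-aws.example.com", dbname="mydb",
    user="monitoring", password="***",  # placeholder credentials
)

with conn.cursor() as cur:
    # Bytes of WAL the subscriber has not yet replayed
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    for name, lag_bytes in cur.fetchall():
        print(f"{name}: {lag_bytes} bytes behind")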

Cloud Native Observability

Complete Observability Stack

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CLOUD NATIVE OBSERVABILITY                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                         GRAFANA                                      │   │
│   │              (Dashboards, Alerting, Exploration)                     │   │
│   └───────────────────────────────┬─────────────────────────────────────┘   │
│                                   │                                          │
│       ┌───────────────────────────┼───────────────────────────┐             │
│       │                           │                           │             │
│       ▼                           ▼                           ▼             │
│  ┌──────────┐              ┌──────────┐              ┌──────────┐           │
│  │PROMETHEUS│              │   LOKI   │              │  TEMPO   │           │
│  │ Metrics  │              │   Logs   │              │  Traces  │           │
│  └────┬─────┘              └────┬─────┘              └────┬─────┘           │
│       │                         │                         │                 │
│       ▼                         ▼                         ▼                 │
│  ┌──────────┐              ┌──────────┐              ┌──────────┐           │
│  │ Exporters│              │ Promtail │              │  OTEL    │           │
│  │ ServiceM.│              │ Fluentbit│              │ Collector│           │
│  └──────────┘              └──────────┘              └──────────┘           │
│       │                         │                         │                 │
│       └─────────────────────────┼─────────────────────────┘                 │
│                                 │                                            │
│                                 ▼                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                       APPLICATIONS                                   │   │
│    (Microservices instrumented with OpenTelemetry)                  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

OpenTelemetry Collector

# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: deployment
  replicas: 3

  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true

      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268

    processors:
      batch:
        timeout: 10s
        send_batch_size: 10000

      memory_limiter:
        check_interval: 1s
        limit_mib: 4000
        spike_limit_mib: 800

      resource:
        attributes:
          - key: environment
            value: production
            action: upsert
          - key: cluster
            value: ${CLUSTER_NAME}
            action: upsert

      tail_sampling:
        decision_wait: 10s
        num_traces: 100000
        policies:
          - name: errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: slow-traces
            type: latency
            latency:
              threshold_ms: 1000
          - name: probabilistic
            type: probabilistic
            probabilistic:
              sampling_percentage: 10

    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
        resource_to_telemetry_conversion:
          enabled: true

      loki:
        endpoint: http://loki:3100/loki/api/v1/push
        labels:
          attributes:
            service.name: service_name
            service.namespace: service_namespace

      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true

      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true

    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [memory_limiter, resource, tail_sampling, batch]
          exporters: [otlp/tempo]

        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, resource, batch]
          exporters: [prometheus]

        logs:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [loki]

  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi

Application Instrumentation

# Python FastAPI with OpenTelemetry
from fastapi import FastAPI, Request
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import structlog
import time

# Structured logging configuration
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

# OpenTelemetry setup
def setup_telemetry(app: FastAPI, service_name: str):
    # Traces
    trace_provider = TracerProvider()
    trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
        export_interval_millis=60000
    )
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    # Auto-instrumentation
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()

    return trace.get_tracer(service_name), metrics.get_meter(service_name)

app = FastAPI()
tracer, meter = setup_telemetry(app, "order-service")

# Custom metrics
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)

request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    response = await call_next(request)

    duration = time.time() - start_time

    # Record metrics
    request_counter.add(1, {
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code
    })

    request_duration.record(duration, {
        "method": request.method,
        "path": request.url.path
    })

    # Structured log
    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration=duration,
        # Format the trace ID as 32-char hex, matching what tracing backends display
        trace_id=format(trace.get_current_span().get_span_context().trace_id, "032x")
    )

    return response

@app.post("/orders")
async def create_order(order_data: dict):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order_data.get("customer_id"))

        # Business logic with child spans
        with tracer.start_as_current_span("validate_order"):
            # Validation
            pass

        with tracer.start_as_current_span("process_payment"):
            # Payment
            pass

        with tracer.start_as_current_span("send_notification"):
            # Notification
            pass

        logger.info("order_created", order_id=order_data.get("id"))

        return {"status": "created"}

Cloud Native Security

Zero Trust Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                     ZERO TRUST ARCHITECTURE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Principle: "Never trust, always verify"                                   │
│                                                                              │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │                          IDENTITY                                  │     │
│   │   • OIDC/SAML (Keycloak, Auth0)                                   │     │
│   │   • Service Accounts (SPIFFE/SPIRE)                               │     │
│   │   • Mandatory MFA                                                  │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                   │                                          │
│                                   ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │                          ACCESS                                    │     │
│   │   • RBAC/ABAC (OPA, Kubernetes RBAC)                              │     │
│   │   • Just-in-time access                                            │     │
│   │   • Least privilege                                                │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                   │                                          │
│                                   ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │                         NETWORK                                    │     │
│   │   • mTLS everywhere (Istio, Linkerd)                              │     │
│   │   • Network Policies                                               │     │
│   │   • Micro-segmentation                                             │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                   │                                          │
│                                   ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │                        WORKLOAD                                    │     │
│   │   • Container scanning (Trivy, Snyk)                              │     │
│   │   • Runtime protection (Falco)                                     │     │
│   │   • Pod Security Standards                                         │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                   │                                          │
│                                   ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │                          DATA                                      │     │
│   │   • Encryption at rest (KMS)                                       │     │
│   │   • Encryption in transit (TLS 1.3)                               │     │
│   │   • Secrets management (Vault, Sealed Secrets)                    │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

HashiCorp Vault for Secrets

# vault-config.yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
  name: kubernetes-auth
  namespace: vault
spec:
  method: kubernetes
  mount: kubernetes
  kubernetes:
    role: app-role
    serviceAccount: default
    audiences:
    - vault

---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  vaultAuthRef: kubernetes-auth
  mount: secret
  path: production/database
  type: kv-v2
  refreshAfter: 1h
  destination:
    name: db-credentials
    create: true
    labels:
      app: api

---
# Application consuming the secret
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "app-role"
        vault.hashicorp.com/agent-inject-secret-db: "secret/data/production/database"
        vault.hashicorp.com/agent-inject-template-db: |
          {{- with secret "secret/data/production/database" -}}
          export DB_HOST="{{ .Data.data.host }}"
          export DB_USER="{{ .Data.data.username }}"
          export DB_PASS="{{ .Data.data.password }}"
          {{- end }}
    spec:
      serviceAccountName: api-sa
      containers:
      - name: api
        image: myapp/api:latest
        command: ["/bin/sh", "-c"]
        args:
        - source /vault/secrets/db && ./app
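
The agent-injector pattern above keeps the application itself Vault-agnostic. The alternative is to fetch secrets directly with a client library; here is a sketch with the hvac Python client and Kubernetes auth, reusing the role and path from the manifests above (the in-cluster Vault address is an assumption):

# vault_client.py - sketch: read the database secret directly via hvac
import hvac

client = hvac.Client(url="http://vault.vault.svc:8200")  # assumed address

# Authenticate with the pod's projected service account token
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    client.auth.kubernetes.login(role="app-role", jwt=f.read())

# KV v2 read: mount "secret", path "production/database"
secret = client.secrets.kv.v2.read_secret_version(path="production/database")
db = secret["data"]["data"]
print(db["host"], db["username"])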

OPA Gatekeeper Policies

# Constraint: images must come from approved registries
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          satisfied := [good | repo = input.parameters.registries[_] ; good = startswith(container.image, repo)]
          not any(satisfied)
          msg := sprintf("container <%v> has an invalid image registry <%v>, allowed registries are %v", [container.name, container.image, input.parameters.registries])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
  name: allowed-registries
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces:
    - production
    - staging
  parameters:
    registries:
    - "gcr.io/myproject/"
    - "docker.io/mycompany/"
    - "harbor.internal.com/"

---
# Constraint: no privileged containers
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{"msg": msg, "details": {}}] {
          c := input.review.object.spec.containers[_]
          c.securityContext.privileged
          msg := sprintf("Privileged container is not allowed: %v", [c.name])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: no-privileged-containers
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    excludedNamespaces:
    - kube-system

FinOps and Cost Optimization

Optimization Strategies

┌─────────────────────────────────────────────────────────────────────────────┐
│                      FINOPS STRATEGIES                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   1. RIGHT-SIZING                                                           │
│   ────────────────                                                          │
│   • Compare actual usage against provisioned capacity                       │
│   • Shrink over-provisioned resources                                       │
│   • Use VPA for automatic adjustment                                        │
│                                                                              │
│   2. SPOT/PREEMPTIBLE INSTANCES                                             │
│   ──────────────────────────────                                            │
│   • 60-90% savings on compute                                               │
│   • For interruption-tolerant workloads                                     │
│   • Batch processing, CI/CD, dev/test                                       │
│                                                                              │
│   3. RESERVED CAPACITY                                                      │
│   ─────────────────────                                                     │
│   • 1-3 year commitments for stable workloads                               │
│   • Savings Plans (AWS) / CUDs (GCP)                                        │
│   • 30-75% savings                                                          │
│                                                                              │
│   4. AUTO-SCALING                                                           │
│   ──────────────────                                                        │
│   • Scale-to-zero for non-prod environments                                 │
│   • HPA driven by custom metrics                                            │
│   • Cluster Autoscaler                                                       │
│                                                                              │
│   5. STORAGE OPTIMIZATION                                                   │
│   ───────────────────────                                                   │
│   • Automatic tiering (S3 Intelligent-Tiering)                              │
│   • Lifecycle policies                                                       │
│   • Compression and deduplication                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
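
These percentages compound. A back-of-the-envelope sketch of the arithmetic, with made-up prices, showing why mixing spot capacity into a fault-tolerant node pool moves the bill so much:

# savings_estimate.py - illustrative arithmetic only; all prices are made up
on_demand_hourly = 0.192   # hypothetical m6i.xlarge on-demand price ($/h)
spot_discount = 0.70       # spot commonly runs 60-90% below on-demand
spot_hourly = on_demand_hourly * (1 - spot_discount)

nodes, hours_per_month = 10, 730

all_on_demand = nodes * on_demand_hourly * hours_per_month
# Keep 30% of capacity on-demand as a stable baseline, run 70% on spot
blended = nodes * hours_per_month * (0.3 * on_demand_hourly + 0.7 * spot_hourly)

print(f"all on-demand: ${all_on_demand:,.0f}/month")
print(f"70% spot blend: ${blended:,.0f}/month "
      f"({1 - blended / all_on_demand:.0%} saved)")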

Kubecost for Cost Tracking

# kubecost-values.yaml
global:
  prometheus:
    enabled: false
    fqdn: http://prometheus.monitoring:9090

kubecostModel:
  etlFileStoreEnabled: true
  allocation:
    # Share common costs across namespaces
    sharedNamespaces: "kube-system,monitoring,ingress-nginx"
    sharedOverhead: 0.10  # 10% overhead

  # Cost alerts
  alerts:
    enabled: true
    alertConfigs:
      budget:
        type: budget
        threshold: 1000  # $1000/day
        window: daily
        aggregation: namespace

      efficiency:
        type: efficiency
        threshold: 0.6  # Alert if efficiency < 60%
        window: weekly

      spendChange:
        type: spendChange
        relativeThreshold: 0.2  # Alert on a +20% spend change
        window: weekly

  # Savings recommendations
  savings:
    enabled: true

reporting:
  valuesFileConfigured: true
  productKey:
    enabled: false

# Dashboards
grafana:
  dashboards:
    enabled: true

---
# Budget CRD
apiVersion: budget.kubecost.io/v1alpha1
kind: Budget
metadata:
  name: production-budget
spec:
  namespace: production
  monthly: 5000  # $5000/month
  alerts:
  - threshold: 80
    notificationChannel: slack-alerts
  - threshold: 100
    notificationChannel: pagerduty
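
Beyond dashboards, the numbers behind a budget like this one can be pulled programmatically. A sketch against Kubecost's Allocation API; the in-cluster service address and the exact response shape are assumptions to verify against your install:

# kubecost_report.py - sketch: pull yesterday's cost per namespace from Kubecost
import httpx

BASE = "http://kubecost-cost-analyzer.kubecost.svc:9090"  # assumed address

resp = httpx.get(
    f"{BASE}/model/allocation",
    params={"window": "1d", "aggregate": "namespace"},
)
resp.raise_for_status()

for window in resp.json()["data"]:
    for namespace, alloc in sorted(window.items()):
        print(f"{namespace:30s} ${alloc['totalCost']:8.2f}")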

Conclusion

Cloud Native architecture is a complete paradigm that encompasses:

  1. Containers and orchestration: Kubernetes as the standard
  2. Microservices: decoupling and independent scalability
  3. Infrastructure as Code: Terraform and Pulumi for reproducibility
  4. Observability: metrics, logs, and traces with OpenTelemetry
  5. Zero Trust security: mTLS, RBAC, secrets management
  6. Multi-cloud: avoiding vendor lock-in, geographic resilience
  7. FinOps: continuous cost optimization

Cloud Native Maturity Checklist

Level          Characteristics
1 - Initial    VMs, manual deployments
2 - Managed    Containers, basic CI
3 - Defined    Kubernetes, IaC, CI/CD
4 - Measured   Observability, auto-scaling
5 - Optimized  GitOps, FinOps, Zero Trust

The transformation to Cloud Native is a continuous journey. Start with the fundamentals, iterate, and evolve gradually toward a mature and resilient architecture.


Florian Courouge

DevOps & Kafka expert | Freelance consultant specializing in distributed architectures and data streaming.
