Cloud Native Architecture: A Complete Multi-Cloud Guide and Best Practices
Cloud Native architecture is a modern approach to building and deploying applications that are scalable, resilient, and easy to maintain. This guide covers the core principles, the main architecture patterns, and multi-cloud strategies.
What Is Cloud Native?
CNCF Definition
The Cloud Native Computing Foundation defines Cloud Native as follows:
"Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds."
The main pillars of Cloud Native:

  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
  │   CONTAINERS    │   │  MICROSERVICES  │   │     DEVOPS      │
  │                 │   │                 │   │                 │
  │ • Docker        │   │ • Decoupling    │   │ • CI/CD         │
  │ • OCI           │   │ • APIs          │   │ • GitOps        │
  │ • Immutability  │   │ • Domain-Driven │   │ • Automation    │
  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
           │                     │                     │
           └─────────────────────┼─────────────────────┘
                                 ▼
            ┌─────────────────────────────────────────┐
            │              ORCHESTRATION              │
            │                                         │
            │  • Kubernetes        • Service Mesh     │
            │  • Scheduling        • Auto-scaling     │
            │  • Self-healing      • Rolling updates  │
            └────────────────────┬────────────────────┘
                                 ▼
            ┌─────────────────────────────────────────┐
            │          INFRASTRUCTURE AS CODE         │
            │                                         │
            │  • Terraform         • Pulumi           │
            │  • CloudFormation    • ARM Templates    │
            │  • Declarative       • Versioned        │
            └─────────────────────────────────────────┘
The Twelve Factors (Twelve-Factor App)

   1. CODEBASE              One codebase in version control, many deploys
   2. DEPENDENCIES          Explicitly declared and isolated
   3. CONFIG                Stored in the environment (see the sketch below)
   4. BACKING SERVICES      Treated as attached resources
   5. BUILD, RELEASE, RUN   Strictly separated stages
   6. PROCESSES             Stateless, share-nothing
   7. PORT BINDING          Export services via port binding
   8. CONCURRENCY           Scale out via the process model
   9. DISPOSABILITY         Fast startup, graceful shutdown
  10. DEV/PROD PARITY       Keep environments as similar as possible
  11. LOGS                  Treated as event streams
  12. ADMIN PROCESSES       Run as one-off tasks
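Two of these factors map directly onto a few lines of application code. Below is a minimal Python sketch of factor III (configuration read from the environment) and factor IX (fast startup, graceful shutdown on SIGTERM); the DATABASE_URL and PORT variable names are illustrative, not a fixed convention.
# Minimal sketch of factors III (config) and IX (disposability).
import os
import signal
import sys
import time

# Factor III: configuration lives in the environment, not in the codebase.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/app")
PORT = int(os.environ.get("PORT", "8080"))

shutting_down = False

def handle_sigterm(signum, frame):
    # Factor IX: the orchestrator sends SIGTERM before killing the pod.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    print(f"listening on port {PORT}, database at {DATABASE_URL}")
    while not shutting_down:
        time.sleep(1)  # placeholder for the real work loop
    print("draining in-flight work, then exiting cleanly")
    sys.exit(0)

if __name__ == "__main__":
    main()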
Microservices Architecture
Cloud Native Design Patterns

                        ┌─────────────────┐
   Clients ───────────▶ │   API Gateway   │
                        │  (Kong/Envoy)   │
                        └────────┬────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          │                      │                      │
          ▼                      ▼                      ▼
   ┌─────────────┐        ┌─────────────┐        ┌─────────────┐
   │    User     │        │    Order    │        │   Product   │
   │   Service   │        │   Service   │        │   Service   │
   └──────┬──────┘        └──────┬──────┘        └──────┬──────┘
          │                      │                      │
          ▼                      ▼                      ▼
   ┌─────────────┐        ┌─────────────┐        ┌─────────────┐
   │ PostgreSQL  │        │    Kafka    │        │    Redis    │
   │   (User)    │        │  (Events)   │        │   (Cache)   │
   └─────────────┘        └─────────────┘        └─────────────┘

  Patterns used:
  • Service Discovery (Consul, etcd)
  • Circuit Breaker (Resilience4j, Hystrix), minimal sketch below
  • Event Sourcing / CQRS
  • Saga pattern for distributed transactions
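Of the patterns listed above, the circuit breaker is worth spelling out. The sketch below is a minimal, library-free illustration of the idea: fail fast once a downstream dependency has returned a run of consecutive errors, then let a trial call through after a cool-down. In production you would rely on Resilience4j, Istio outlier detection, or an equivalent; the thresholds here are illustrative.
# Minimal circuit-breaker sketch (illustrative thresholds).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open, failing fast")
            # Cool-down elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return result

# Usage: breaker.call(order_client.get, "/orders/42"), where order_client is
# a hypothetical HTTP client for the Order Service.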
API Gateway with Kong
# kong-config.yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: rate-limiting
plugin: rate-limiting
config:
minute: 100
hour: 10000
policy: redis
redis_host: redis.infrastructure.svc.cluster.local
redis_port: 6379
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: jwt-auth
plugin: jwt
config:
claims_to_verify:
- exp
key_claim_name: kid
secret_is_base64: false
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: correlation-id
plugin: correlation-id
config:
header_name: X-Request-ID
generator: uuid
echo_downstream: true
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
annotations:
konghq.com/plugins: rate-limiting,jwt-auth,correlation-id
konghq.com/strip-path: "true"
spec:
ingressClassName: kong
rules:
- host: api.example.com
http:
paths:
- path: /users
pathType: Prefix
backend:
service:
name: user-service
port:
number: 8080
- path: /orders
pathType: Prefix
backend:
service:
name: order-service
port:
number: 8080
- path: /products
pathType: Prefix
backend:
service:
name: product-service
port:
number: 8080
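A quick smoke test of this gateway configuration is to call one of the routed paths with and without a JWT. The snippet below is a hedged sketch: api.example.com and the token are placeholders, and the exact rate-limit behaviour (429 responses, response headers) depends on the Kong version and plugin configuration.
# Client-side smoke test for the Kong ingress above (placeholders throughout).
import requests

TOKEN = "eyJ..."  # a JWT signed with a key/secret registered in Kong

def call_users(token=None):
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return requests.get("https://api.example.com/users", headers=headers, timeout=5)

# Without a token, the jwt plugin should reject the request (401).
print("no token:", call_users().status_code)

# With a valid token, the request is routed to user-service; the
# correlation-id plugin (echo_downstream: true) should return X-Request-ID.
resp = call_users(TOKEN)
print("with token:", resp.status_code, resp.headers.get("X-Request-ID"))

# Sustained traffic above 100 requests/minute should start returning 429.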
Service Mesh with Istio
# istio-config.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: user-service
namespace: production
spec:
hosts:
- user-service
http:
# Canary deployment: 90% v1, 10% v2
- match:
- headers:
x-canary:
exact: "true"
route:
- destination:
host: user-service
subset: v2
weight: 100
- route:
- destination:
host: user-service
subset: v1
weight: 90
- destination:
host: user-service
subset: v2
weight: 10
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
timeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: user-service
namespace: production
spec:
host: user-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: UPGRADE
http1MaxPendingRequests: 100
http2MaxRequests: 1000
loadBalancer:
simple: LEAST_REQUEST
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 60s
maxEjectionPercent: 50
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
---
# Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: order-service-cb
spec:
host: order-service
trafficPolicy:
outlierDetection:
consecutiveGatewayErrors: 5
consecutive5xxErrors: 5
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 100
minHealthPercent: 0
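To verify the canary split from inside the mesh, compare ordinary traffic with traffic carrying the x-canary header. The sketch below assumes the service listens on port 8080 and exposes a /version endpoint that reports its subset (v1 or v2); both are illustrative assumptions, not part of the Istio configuration above.
# In-mesh check of the 90/10 canary split (illustrative endpoint and port).
import collections
import requests

BASE = "http://user-service.production.svc.cluster.local:8080"

def sample(n=100, canary=False):
    headers = {"x-canary": "true"} if canary else {}
    counts = collections.Counter()
    for _ in range(n):
        r = requests.get(f"{BASE}/version", headers=headers, timeout=2)
        counts[r.text.strip()] += 1
    return counts

print("default traffic:", sample())            # expected roughly 90% v1 / 10% v2
print("canary traffic:", sample(canary=True))  # expected 100% v2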
Infrastructure as Code (IaC)
Multi-Cloud Terraform
A typical project layout:

  terraform/
  ├── modules/             # Reusable modules
  │   ├── networking/
  │   │   ├── vpc/
  │   │   ├── subnets/
  │   │   └── security-groups/
  │   ├── kubernetes/
  │   │   ├── eks/
  │   │   ├── gke/
  │   │   └── aks/
  │   ├── databases/
  │   │   ├── rds/
  │   │   ├── cloudsql/
  │   │   └── cosmosdb/
  │   └── monitoring/
  │       ├── prometheus/
  │       └── grafana/
  │
  ├── environments/        # Per-environment configuration
  │   ├── dev/
  │   │   ├── main.tf
  │   │   ├── variables.tf
  │   │   ├── terraform.tfvars
  │   │   └── backend.tf
  │   ├── staging/
  │   └── production/
  │
  └── shared/              # Shared resources
      ├── dns/
      ├── certificates/
      └── secrets/
Multi-Cloud Kubernetes Module
# modules/kubernetes/main.tf
variable "cloud_provider" {
type = string
description = "Cloud provider: aws, gcp, or azure"
}
variable "cluster_name" {
type = string
}
variable "kubernetes_version" {
type = string
default = "1.29"
}
variable "node_pools" {
type = map(object({
instance_type = string
min_size = number
max_size = number
disk_size_gb = number
labels = map(string)
taints = list(object({
key = string
value = string
effect = string
}))
}))
}
# AWS EKS
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
count = var.cloud_provider == "aws" ? 1 : 0
cluster_name = var.cluster_name
cluster_version = var.kubernetes_version
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
cluster_endpoint_public_access = true
eks_managed_node_groups = {
for name, pool in var.node_pools : name => {
instance_types = [pool.instance_type]
min_size = pool.min_size
max_size = pool.max_size
desired_size = pool.min_size
disk_size = pool.disk_size_gb
labels = pool.labels
taints = [
for taint in pool.taints : {
key = taint.key
value = taint.value
effect = taint.effect
}
]
}
}
# Addons
cluster_addons = {
coredns = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
vpc-cni = {
most_recent = true
}
aws-ebs-csi-driver = {
most_recent = true
}
}
tags = var.tags
}
# GCP GKE
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google"
version = "~> 29.0"
count = var.cloud_provider == "gcp" ? 1 : 0
project_id = var.project_id
name = var.cluster_name
region = var.region
network = var.network
subnetwork = var.subnetwork
kubernetes_version = var.kubernetes_version
ip_range_pods = var.pods_ip_range
ip_range_services = var.services_ip_range
node_pools = [
for name, pool in var.node_pools : {
name = name
machine_type = pool.instance_type
min_count = pool.min_size
max_count = pool.max_size
disk_size_gb = pool.disk_size_gb
auto_repair = true
auto_upgrade = true
node_labels = pool.labels
}
]
node_pools_taints = {
for name, pool in var.node_pools : name => pool.taints
}
}
# Azure AKS
module "aks" {
source = "Azure/aks/azurerm"
version = "~> 7.0"
count = var.cloud_provider == "azure" ? 1 : 0
cluster_name = var.cluster_name
resource_group_name = var.resource_group_name
location = var.location
kubernetes_version = var.kubernetes_version
vnet_subnet_id = var.subnet_id
agents_pools = [
for name, pool in var.node_pools : {
name = name
vm_size = pool.instance_type
min_count = pool.min_size
max_count = pool.max_size
os_disk_size_gb = pool.disk_size_gb
node_labels = pool.labels
node_taints = [for t in pool.taints : "${t.key}=${t.value}:${t.effect}"]
}
]
tags = var.tags
}
# Outputs
output "cluster_endpoint" {
value = coalesce(
try(module.eks[0].cluster_endpoint, ""),
try(module.gke[0].endpoint, ""),
try(module.aks[0].host, "")
)
}
output "cluster_ca_certificate" {
value = coalesce(
try(module.eks[0].cluster_certificate_authority_data, ""),
try(module.gke[0].ca_certificate, ""),
try(module.aks[0].cluster_ca_certificate, "")
)
sensitive = true
}
Production Environment
# environments/production/main.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.25"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.12"
}
}
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/terraform.tfstate"
region = "eu-west-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = "production"
ManagedBy = "terraform"
Project = var.project_name
}
}
}
# VPC
module "vpc" {
source = "../../modules/networking/vpc"
name = "${var.project_name}-production"
cidr = "10.0.0.0/16"
availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false
one_nat_gateway_per_az = true
enable_dns_hostnames = true
enable_dns_support = true
# Subnet tags required by EKS
private_subnet_tags = {
"kubernetes.io/cluster/${var.cluster_name}" = "shared"
"kubernetes.io/role/internal-elb" = "1"
}
public_subnet_tags = {
"kubernetes.io/cluster/${var.cluster_name}" = "shared"
"kubernetes.io/role/elb" = "1"
}
}
# Kubernetes Cluster
module "kubernetes" {
source = "../../modules/kubernetes"
cloud_provider = "aws"
cluster_name = var.cluster_name
kubernetes_version = "1.29"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
node_pools = {
system = {
instance_type = "m6i.large"
min_size = 2
max_size = 4
disk_size_gb = 100
labels = {
"node-type" = "system"
}
taints = []
}
application = {
instance_type = "m6i.xlarge"
min_size = 3
max_size = 20
disk_size_gb = 200
labels = {
"node-type" = "application"
}
taints = []
}
kafka = {
instance_type = "r6i.2xlarge"
min_size = 3
max_size = 9
disk_size_gb = 500
labels = {
"node-type" = "kafka"
"workload" = "data-intensive"
}
taints = [{
key = "dedicated"
value = "kafka"
effect = "NoSchedule"
}]
}
gpu = {
instance_type = "p3.2xlarge"
min_size = 0
max_size = 10
disk_size_gb = 200
labels = {
"node-type" = "gpu"
"nvidia.com/gpu.present" = "true"
}
taints = [{
key = "nvidia.com/gpu"
value = "true"
effect = "NoSchedule"
}]
}
}
tags = var.tags
}
# Kubernetes provider, configured after the cluster is created
provider "kubernetes" {
host = module.kubernetes.cluster_endpoint
cluster_ca_certificate = base64decode(module.kubernetes.cluster_ca_certificate)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", var.cluster_name]
}
}
provider "helm" {
kubernetes {
host = module.kubernetes.cluster_endpoint
cluster_ca_certificate = base64decode(module.kubernetes.cluster_ca_certificate)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", var.cluster_name]
}
}
}
# Observability Stack
module "monitoring" {
source = "../../modules/monitoring"
depends_on = [module.kubernetes]
prometheus_enabled = true
grafana_enabled = true
alertmanager_enabled = true
loki_enabled = true
tempo_enabled = true
storage_class = "gp3"
prometheus_retention = "30d"
prometheus_storage_size = "100Gi"
grafana_admin_password = var.grafana_admin_password
}
# Databases
module "rds" {
source = "../../modules/databases/rds"
name = "${var.project_name}-production"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r6g.xlarge"
allocated_storage = 500
max_allocated_storage = 1000
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
multi_az = true
backup_retention_period = 30
deletion_protection = true
performance_insights_enabled = true
monitoring_interval = 60
master_username = var.db_username
master_password = var.db_password
tags = var.tags
}
Multi-Cloud Strategies
Multi-Cloud Architecture

                         ┌─────────────────┐
                         │    Global LB    │
                         │  (Cloudflare/   │
                         │    Route53)     │
                         └────────┬────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
          ▼                       ▼                       ▼
   ┌─────────────┐         ┌─────────────┐         ┌─────────────┐
   │     AWS     │         │     GCP     │         │    Azure    │
   │  (Primary)  │         │ (Secondary) │         │  (DR site)  │
   │             │         │             │         │             │
   │ ┌─────────┐ │         │ ┌─────────┐ │         │ ┌─────────┐ │
   │ │   EKS   │ │◀───────▶│ │   GKE   │ │◀───────▶│ │   AKS   │ │
   │ └─────────┘ │         │ └─────────┘ │         │ └─────────┘ │
   │             │         │             │         │             │
   │ ┌─────────┐ │  async  │ ┌─────────┐ │  async  │ ┌─────────┐ │
   │ │   RDS   │ │────────▶│ │ CloudSQL│ │────────▶│ │ CosmosDB│ │
   │ │ (Master)│ │  repl.  │ │(Replica)│ │  repl.  │ │(Replica)│ │
   │ └─────────┘ │         │ └─────────┘ │         │ └─────────┘ │
   └─────────────┘         └─────────────┘         └─────────────┘

  Patterns:
  • Active-active for high availability
  • Active-passive for disaster recovery
  • Cross-cloud data replication
  • Multi-cluster service mesh (Istio)
Multi-Cloud DNS Configuration with Route53
# dns/multi-cloud.tf
# Health checks for each region
resource "aws_route53_health_check" "aws_primary" {
fqdn = "api-aws.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = {
Name = "aws-primary-health-check"
}
}
resource "aws_route53_health_check" "gcp_secondary" {
fqdn = "api-gcp.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = {
Name = "gcp-secondary-health-check"
}
}
# Failover routing
resource "aws_route53_record" "api_primary" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.aws_primary.id
}
resource "aws_route53_record" "api_secondary" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
alias {
name = var.gcp_lb_ip
zone_id = var.gcp_zone_id
evaluate_target_health = true
}
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "secondary"
health_check_id = aws_route53_health_check.gcp_secondary.id
}
# Geolocation routing for optimal latency
resource "aws_route53_record" "api_geo_eu" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
alias {
name = aws_lb.eu.dns_name
zone_id = aws_lb.eu.zone_id
evaluate_target_health = true
}
geolocation_routing_policy {
continent = "EU"
}
set_identifier = "eu"
}
resource "aws_route53_record" "api_geo_na" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
alias {
name = var.gcp_us_lb_ip
zone_id = var.gcp_us_zone_id
evaluate_target_health = true
}
geolocation_routing_policy {
continent = "NA"
}
set_identifier = "na"
}
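These health checks probe /health over HTTPS, so each regional deployment has to expose an endpoint that actually reflects the health of its critical dependencies; otherwise failover never triggers. A minimal FastAPI sketch follows, in which the database check is an illustrative stand-in.
# Minimal /health endpoint behind each regional load balancer (sketch).
from fastapi import FastAPI, Response

app = FastAPI()

def database_is_reachable() -> bool:
    # Replace with a real dependency check, e.g. SELECT 1 against the database.
    return True

@app.get("/health")
def health(response: Response):
    if database_is_reachable():
        return {"status": "ok"}
    # Any non-2xx response makes the Route53 health check mark this region unhealthy.
    response.status_code = 503
    return {"status": "degraded"}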
Cross-Cloud Data Replication
# Kafka MirrorMaker 2 for cross-cloud replication
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
name: cross-cloud-mirror
namespace: kafka
spec:
version: 3.6.0
replicas: 3
connectCluster: "target-gcp"
clusters:
- alias: "source-aws"
bootstrapServers: kafka-aws.example.com:9092
tls:
trustedCertificates:
- secretName: aws-kafka-ca
certificate: ca.crt
authentication:
type: tls
certificateAndKey:
secretName: aws-kafka-user
certificate: user.crt
key: user.key
- alias: "target-gcp"
bootstrapServers: kafka-gcp.example.com:9092
tls:
trustedCertificates:
- secretName: gcp-kafka-ca
certificate: ca.crt
authentication:
type: tls
certificateAndKey:
secretName: gcp-kafka-user
certificate: user.crt
key: user.key
config:
config.storage.replication.factor: 3
offset.storage.replication.factor: 3
status.storage.replication.factor: 3
mirrors:
- sourceCluster: "source-aws"
targetCluster: "target-gcp"
sourceConnector:
tasksMax: 10
config:
replication.factor: 3
offset-syncs.topic.replication.factor: 3
sync.topic.acls.enabled: "false"
refresh.topics.interval.seconds: 60
heartbeatConnector:
config:
heartbeats.topic.replication.factor: 3
checkpointConnector:
config:
checkpoints.topic.replication.factor: 3
sync.group.offsets.enabled: "true"
topicsPattern: ".*"
groupsPattern: ".*"
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
---
# PostgreSQL Logical Replication
# Source AWS
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-publisher-config
data:
setup-publisher.sql: |
-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- Create publication
CREATE PUBLICATION cross_cloud_pub FOR ALL TABLES;
-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;
# Target GCP - setup subscription
# CREATE SUBSCRIPTION cross_cloud_sub
# CONNECTION 'host=postgres-aws.example.com port=5432 dbname=mydb user=replicator password=secure_password'
# PUBLICATION cross_cloud_pub;
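Whichever replication mechanism is used, lag should be monitored from the publisher side so that a failover to the GCP replica does not silently lose recent writes. A hedged sketch using psycopg2 against pg_stat_replication (connection details are placeholders) follows.
# Monitor logical replication lag on the AWS publisher (placeholders throughout).
import psycopg2

conn = psycopg2.connect(
    host="postgres-aws.example.com", dbname="mydb",
    user="replicator", password="secure_password",
)
with conn, conn.cursor() as cur:
    # One row per WAL sender, including the cross-cloud subscription above.
    cur.execute("""
        SELECT application_name, state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"{name}: state={state} lag={lag_bytes} bytes")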
Cloud Native Observability
Complete Observability Stack

  ┌─────────────────────────────────────────────────────────────────┐
  │                             GRAFANA                             │
  │              (Dashboards, Alerting, Exploration)                │
  └───────────────────────────────┬─────────────────────────────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
          ▼                       ▼                       ▼
     ┌──────────┐            ┌──────────┐            ┌──────────┐
     │PROMETHEUS│            │   LOKI   │            │  TEMPO   │
     │ Metrics  │            │   Logs   │            │  Traces  │
     └────┬─────┘            └────┬─────┘            └────┬─────┘
          │                       │                       │
          ▼                       ▼                       ▼
     ┌──────────┐            ┌──────────┐            ┌──────────┐
     │ Exporters│            │ Promtail │            │   OTEL   │
     │ ServiceM.│            │ Fluentbit│            │ Collector│
     └────┬─────┘            └────┬─────┘            └────┬─────┘
          │                       │                       │
          └───────────────────────┼───────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │                          APPLICATIONS                           │
  │        (Microservices instrumented with OpenTelemetry)          │
  └─────────────────────────────────────────────────────────────────┘
OpenTelemetry Collector
# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: otel-collector
namespace: observability
spec:
mode: deployment
replicas: 3
config: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_http:
endpoint: 0.0.0.0:14268
processors:
batch:
timeout: 10s
send_batch_size: 10000
memory_limiter:
check_interval: 1s
limit_mib: 4000
spike_limit_mib: 800
resource:
attributes:
- key: environment
value: production
action: upsert
- key: cluster
value: ${CLUSTER_NAME}
action: upsert
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-traces
type: latency
latency:
threshold_ms: 1000
- name: probabilistic
type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
prometheus:
endpoint: 0.0.0.0:8889
resource_to_telemetry_conversion:
enabled: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
labels:
attributes:
service.name: service_name
service.namespace: service_namespace
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
extensions:
health_check:
endpoint: 0.0.0.0:13133
service:
extensions: [health_check]
pipelines:
traces:
receivers: [otlp, jaeger]
processors: [memory_limiter, resource, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, resource, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [loki]
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
Application Instrumentation
# Python FastAPI with OpenTelemetry
from fastapi import FastAPI, Request
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import structlog
import time
# Structured logging configuration
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer()
],
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
logger = structlog.get_logger()
# OpenTelemetry setup
def setup_telemetry(app: FastAPI, service_name: str):
# Traces
trace_provider = TracerProvider()
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)
# Metrics
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
export_interval_millis=60000
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
# Auto-instrumentation
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
return trace.get_tracer(service_name), metrics.get_meter(service_name)
app = FastAPI()
tracer, meter = setup_telemetry(app, "order-service")
# Custom metrics
request_counter = meter.create_counter(
"http_requests_total",
description="Total HTTP requests"
)
request_duration = meter.create_histogram(
"http_request_duration_seconds",
description="HTTP request duration"
)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
duration = time.time() - start_time
# Metrics
request_counter.add(1, {
"method": request.method,
"path": request.url.path,
"status": response.status_code
})
request_duration.record(duration, {
"method": request.method,
"path": request.url.path
})
# Structured log entry
logger.info(
"request_completed",
method=request.method,
path=request.url.path,
status=response.status_code,
duration=duration,
trace_id=trace.get_current_span().get_span_context().trace_id
)
return response
@app.post("/orders")
async def create_order(order_data: dict):
with tracer.start_as_current_span("create_order") as span:
span.set_attribute("order.customer_id", order_data.get("customer_id"))
# Business logic with child spans
with tracer.start_as_current_span("validate_order"):
# Validation
pass
with tracer.start_as_current_span("process_payment"):
# Payment
pass
with tracer.start_as_current_span("send_notification"):
# Notification
pass
logger.info("order_created", order_id=order_data.get("id"))
return {"status": "created"}
Cloud Native Security
Zero Trust Architecture

  Principle: "Never trust, always verify"

  IDENTITY
    • OIDC/SAML (Keycloak, Auth0), token verification sketched below
    • Service accounts (SPIFFE/SPIRE)
    • Mandatory MFA
        ▼
  ACCESS
    • RBAC/ABAC (OPA, Kubernetes RBAC)
    • Just-in-time access
    • Least privilege
        ▼
  NETWORK
    • mTLS everywhere (Istio, Linkerd)
    • Network Policies
    • Micro-segmentation
        ▼
  WORKLOAD
    • Container scanning (Trivy, Snyk)
    • Runtime protection (Falco)
    • Pod Security Standards
        ▼
  DATA
    • Encryption at rest (KMS)
    • Encryption in transit (TLS 1.3)
    • Secrets management (Vault, Sealed Secrets)
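At the identity layer, every service should validate tokens itself instead of trusting the network, as noted in the IDENTITY layer above. The sketch below verifies an OIDC access token against the provider's JWKS endpoint using PyJWT; the issuer, audience, and Keycloak-style JWKS path are placeholders for your own realm.
# OIDC access-token verification sketch (issuer and audience are placeholders).
import jwt
from jwt import PyJWKClient

ISSUER = "https://keycloak.example.com/realms/production"
AUDIENCE = "order-service"
jwks_client = PyJWKClient(f"{ISSUER}/protocol/openid-connect/certs")

def verify_token(token: str) -> dict:
    # Fetch the signing key matching the token's kid, then validate the claims.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )

# claims = verify_token(request_token)  # raises jwt.InvalidTokenError on failure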
HashiCorp Vault for Secrets
# vault-config.yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
name: kubernetes-auth
namespace: vault
spec:
method: kubernetes
mount: kubernetes
kubernetes:
role: app-role
serviceAccount: default
audiences:
- vault
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
name: db-credentials
namespace: production
spec:
vaultAuthRef: kubernetes-auth
mount: secret
path: production/database
type: kv-v2
refreshAfter: 1h
destination:
name: db-credentials
create: true
labels:
app: api
---
# Application consuming the secret
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: production
spec:
template:
metadata:
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "app-role"
vault.hashicorp.com/agent-inject-secret-db: "secret/data/production/database"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "secret/data/production/database" -}}
export DB_HOST="{{ .Data.data.host }}"
export DB_USER="{{ .Data.data.username }}"
export DB_PASS="{{ .Data.data.password }}"
{{- end }}
spec:
serviceAccountName: api-sa
containers:
- name: api
image: myapp/api:latest
command: ["/bin/sh", "-c"]
args:
- . /vault/secrets/db && ./app
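If the application is written in Python, an alternative to sourcing the rendered file in a shell wrapper is to parse it at startup. The sketch below reads the same /vault/secrets/db file produced by the injector template above; adjust the parsing if your template format differs.
# Parse the Vault-agent-rendered env file instead of sourcing it (sketch).
import os

def load_vault_env(path="/vault/secrets/db"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("export "):
                key, _, value = line[len("export "):].partition("=")
                os.environ[key] = value.strip('"')

load_vault_env()
DB_HOST = os.environ["DB_HOST"]
DB_USER = os.environ["DB_USER"]
DB_PASS = os.environ["DB_PASS"]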
OPA Gatekeeper Policies
# Constraint: images must come from approved registries
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8sallowedregistries
spec:
crd:
spec:
names:
kind: K8sAllowedRegistries
validation:
openAPIV3Schema:
type: object
properties:
registries:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8sallowedregistries
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
satisfied := [good | repo = input.parameters.registries[_] ; good = startswith(container.image, repo)]
not any(satisfied)
msg := sprintf("container <%v> has an invalid image registry <%v>, allowed registries are %v", [container.name, container.image, input.parameters.registries])
}
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
name: allowed-registries
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
namespaces:
- production
- staging
parameters:
registries:
- "gcr.io/myproject/"
- "docker.io/mycompany/"
- "harbor.internal.com/"
---
# Constraint: no privileged containers
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8spspprivilegedcontainer
spec:
crd:
spec:
names:
kind: K8sPSPPrivilegedContainer
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8spspprivileged
violation[{"msg": msg, "details": {}}] {
c := input.review.object.spec.containers[_]
c.securityContext.privileged
msg := sprintf("Privileged container is not allowed: %v", [c.name])
}
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
name: no-privileged-containers
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
excludedNamespaces:
- kube-system
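Gatekeeper enforces these policies at admission time, but the same checks are cheap to run in CI so that violations fail the pipeline before a deployment is attempted. Below is a hedged Python mirror of the registry constraint; the registry list matches the constraint above, the manifest path is a placeholder, and only bare Pod manifests are inspected.
# CI-side mirror of the K8sAllowedRegistries constraint (sketch).
import sys
import yaml

ALLOWED = ("gcr.io/myproject/", "docker.io/mycompany/", "harbor.internal.com/")

def check_manifest(path):
    violations = []
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc or doc.get("kind") != "Pod":
                continue  # this sketch only inspects bare Pod manifests
            for c in doc.get("spec", {}).get("containers", []):
                if not c["image"].startswith(ALLOWED):
                    violations.append(f'{c["name"]}: {c["image"]}')
    return violations

if __name__ == "__main__":
    bad = check_manifest(sys.argv[1] if len(sys.argv) > 1 else "manifest.yaml")
    if bad:
        print("images from disallowed registries:", *bad, sep="\n  ")
        sys.exit(1)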
FinOps and Cost Optimization
Optimization Strategies

  1. RIGHT-SIZING
     • Compare actual usage against provisioned capacity (see the sketch after this list)
     • Shrink over-provisioned resources
     • Use VPA for automatic adjustment

  2. SPOT / PREEMPTIBLE INSTANCES
     • 60-90% savings on compute
     • For interruption-tolerant workloads
     • Batch processing, CI/CD, dev/test

  3. RESERVED CAPACITY
     • 1-3 year commitments for stable workloads
     • Savings Plans (AWS) / CUDs (GCP)
     • 30-75% savings

  4. AUTO-SCALING
     • Scale-to-zero for non-production environments
     • HPA driven by custom metrics
     • Cluster Autoscaler

  5. STORAGE OPTIMIZATION
     • Automatic tiering (S3 Intelligent-Tiering)
     • Lifecycle policies
     • Compression and deduplication
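Right-sizing (strategy 1 above) usually starts from observed usage rather than from declared requests. A simple, illustrative policy is to set the CPU request at the 95th percentile of recent usage plus some headroom; both the percentile and the 20% headroom below are arbitrary choices to adapt, and the samples would normally come from Prometheus.
# Right-sizing sketch: derive a CPU request from observed usage (illustrative policy).
import statistics

def recommend_cpu_request(samples_millicores, headroom=1.2):
    # 95th percentile of observed usage, plus 20% headroom.
    p95 = statistics.quantiles(samples_millicores, n=20)[18]
    return int(p95 * headroom)

observed = [120, 150, 180, 140, 200, 170, 160, 190, 210, 155,
            165, 175, 185, 145, 130, 150, 220, 205, 160, 170]
print(f"recommended CPU request: {recommend_cpu_request(observed)}m")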
Kubecost for Cost Tracking
# kubecost-values.yaml
global:
prometheus:
enabled: false
fqdn: http://prometheus.monitoring:9090
kubecostModel:
etlFileStoreEnabled: true
allocation:
# Sharing of common costs
sharedNamespaces: "kube-system,monitoring,ingress-nginx"
sharedOverhead: 0.10 # 10% overhead
# Cost alerts
alerts:
enabled: true
alertConfigs:
budget:
type: budget
threshold: 1000 # $1,000/day
window: daily
aggregation: namespace
efficiency:
type: efficiency
threshold: 0.6 # Alert if efficiency < 60%
window: weekly
spendChange:
type: spendChange
relativeThreshold: 0.2 # Alert on a +20% spend change
window: weekly
# Recommendations
savings:
enabled: true
reporting:
valuesFileConfigured: true
productKey:
enabled: false
# Dashboards
grafana:
dashboards:
enabled: true
---
# Budget CRD
apiVersion: budget.kubecost.io/v1alpha1
kind: Budget
metadata:
name: production-budget
spec:
namespace: production
monthly: 5000 # $5,000/month
alerts:
- threshold: 80
notificationChannel: slack-alerts
- threshold: 100
notificationChannel: pagerduty
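Budgets and alerts can also be checked programmatically. The sketch below queries Kubecost's allocation API for yesterday's spend per namespace and compares the production namespace with the $5,000 monthly budget defined above; the service URL, the /model/allocation endpoint, and the response shape depend on your Kubecost version, so treat them as assumptions to verify.
# Hedged sketch: compare Kubecost allocation data against the monthly budget.
import requests

KUBECOST = "http://kubecost-cost-analyzer.kubecost.svc:9090"  # assumed service address
MONTHLY_BUDGET = 5000.0  # matches the Budget resource above

resp = requests.get(
    f"{KUBECOST}/model/allocation",
    params={"window": "yesterday", "aggregate": "namespace"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json().get("data") or [{}]
allocations = data[0]

daily_cost = allocations.get("production", {}).get("totalCost", 0.0)
projected = daily_cost * 30
print(f"production: ${daily_cost:.2f}/day, projected ${projected:.2f}/month")
if projected > 0.8 * MONTHLY_BUDGET:
    print("WARNING: projected spend exceeds 80% of the monthly budget")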
Conclusion
Cloud Native architecture is a comprehensive paradigm that encompasses:
- Containers and orchestration: Kubernetes as the de facto standard
- Microservices: decoupling and independently scalable services
- Infrastructure as Code: Terraform and Pulumi for reproducibility
- Observability: metrics, logs, and traces with OpenTelemetry
- Zero Trust security: mTLS, RBAC, secrets management
- Multi-cloud: avoiding vendor lock-in, geographic resilience
- FinOps: continuous cost optimization
Cloud Native Maturity Checklist
| Level | Characteristics |
|---|---|
| 1 - Initial | VMs, manual deployments |
| 2 - Managed | Containers, basic CI |
| 3 - Defined | Kubernetes, IaC, CI/CD |
| 4 - Measured | Observability, auto-scaling |
| 5 - Optimized | GitOps, FinOps, Zero Trust |
The move to Cloud Native is a continuous journey: start with the fundamentals, iterate, and evolve step by step toward a mature, resilient architecture.