- Jan 2025
-
-
How we migrated onto K8s in less than 12 months
-
Figma's Initial Infrastructure Challenges:
- Figma's monolithic architecture struggled with resource allocation inefficiencies and limited scalability.
- High traffic spikes from collaborative design workflows required more robust solutions for resource autoscaling and failover.
-
Why Kubernetes Was Chosen:
- Kubernetes' container orchestration capabilities promised better resource management and service isolation.
- Features like Horizontal Pod Autoscaling (HPA), robust networking via Kubernetes Services, and support for StatefulSets made it an ideal fit for Figma’s needs.
- Figma also wanted closer alignment with cloud-native practices and modern CI/CD workflows.
-
Incremental Migration Approach:
- Step 1: Non-Critical Services: Figma migrated stateless services first, allowing experimentation without risking core functionality.
- Step 2: Custom Tooling: Internal tooling was built to manage Kubernetes manifests and automate Helm chart creation for standardization.
- Step 3: Stateful Services: For databases and other stateful components, Figma relied on Kubernetes' StatefulSets and persistent volumes (PVs) to ensure data integrity during the migration.
- Step 4: Observability Enhancements: Kubernetes-native tools like Prometheus and Grafana were integrated to provide detailed metrics and system insights.
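
To make Step 2 more concrete, here is a minimal sketch of what manifest-generating tooling could look like. It is not Figma's internal tooling, which the notes do not describe in detail. The function name `build_deployment`, its parameters, the service name `thumbnailer`, and the image URL are all hypothetical. The output is JSON, which Kubernetes accepts in place of YAML.

```python
import json


def build_deployment(name, image, replicas=2, port=8080,
                     cpu_request="250m", memory_request="256Mi"):
    """Return an apps/v1 Deployment manifest as a plain dict.

    Hypothetical helper: one place to enforce conventions (labels,
    resource requests) so every service gets the same manifest shape.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            "resources": {
                                "requests": {
                                    "cpu": cpu_request,
                                    "memory": memory_request,
                                }
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder service name and image, for illustration only.
    manifest = build_deployment(
        "thumbnailer", "registry.example.com/thumbnailer:1.0.0"
    )
    print(json.dumps(manifest, indent=2))
```

Generating manifests from a small set of inputs is one way to get the standardization the step mentions; templating with Helm charts is another, and the two are often combined.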
-
Key Technical Adjustments During Migration:
- Service Discovery: Transitioned to Kubernetes-native DNS for internal service communication, replacing legacy methods.
- Load Balancing: Leveraged Kubernetes Ingress and external load balancers (e.g., NGINX or cloud-native solutions) for traffic routing.
- Networking Complexity: Resolved challenges around multi-cluster networking using Kubernetes CNI plugins like Calico.
- Resource Management: Used ResourceQuotas and resource limits to prevent nodes from being overcommitted and to optimize cluster utilization.
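
As an illustration of the Service Discovery point, Kubernetes DNS gives each Service a name of the form `<service>.<namespace>.svc.<cluster-domain>`. The sketch below builds that name. The helper `service_dns_name` and the example service and namespace are hypothetical, and `cluster.local` is only the common default cluster domain, so a given cluster may use a different one.

```python
def service_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Build the fully qualified in-cluster DNS name of a Service.

    Assumes the standard Kubernetes DNS schema for Services; the
    cluster domain is configurable per cluster.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Hypothetical service "api" in namespace "prod".
    print(service_dns_name("api", "prod"))
```

Pods in the same namespace can usually reach the Service by its short name alone; the fully qualified form matters mostly for cross-namespace calls.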
-
Challenges Faced:
- Stateful Services: Ensuring zero-downtime migration for databases required careful orchestration of PersistentVolumeClaims (PVCs) and StatefulSets.
- Networking: Handling cross-region traffic and external dependencies required tweaking Kubernetes Ingress configurations.
- Resource Constraints: Balancing costs and performance involved tuning cluster-autoscaler configurations and evaluating node pool setups.
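
The cost and performance balancing under Resource Constraints can be approximated with simple arithmetic before tuning anything. The sketch below is a rough sizing estimate, not the cluster-autoscaler's algorithm: it sums pod CPU requests and divides by the usable CPU per node. The function names, the 20% headroom default, and the example numbers are assumptions; memory, DaemonSet overhead, and bin-packing fragmentation are ignored.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def estimate_nodes(pod_cpu_requests, node_allocatable_cpu, headroom_percent=20):
    """Estimate how many nodes are needed to fit the given CPU requests.

    headroom_percent is the share of each node kept free as a buffer
    (an assumed policy, not a Kubernetes setting).
    """
    total = sum(cpu_to_millicores(q) for q in pod_cpu_requests)
    usable = cpu_to_millicores(node_allocatable_cpu) * (100 - headroom_percent) // 100
    # Ceiling division on integers; always report at least one node.
    return max(1, -(-total // usable))


if __name__ == "__main__":
    # Ten pods requesting 500m each on nodes with 4 allocatable CPUs.
    print(estimate_nodes(["500m"] * 10, "4"))
```

Comparing this estimate across candidate node sizes is one quick way to evaluate node pool setups before testing them for real.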
-
Benefits Realized Post-Migration:
- Scalability: Kubernetes' HPA allowed Figma to scale pods dynamically based on traffic patterns, ensuring consistent performance.
- Deployment Efficiency: CI/CD pipelines integrated seamlessly with Kubernetes, enabling faster and more reliable rollouts using tools like Argo CD.
- Reliability: Self-healing capabilities, such as pod restarts and node failover, reduced downtime during failures.
- Observability: Improved system monitoring with Kubernetes' native metrics server and integrations with Prometheus and Grafana.
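
The scaling behavior in the Scalability point follows a documented rule: the HPA computes `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)` and skips scaling when the ratio is close to 1. The sketch below implements only that core formula with replica bounds. The function name, the bounds, and the example numbers are hypothetical, and the real controller also handles stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the Horizontal Pod Autoscaler calculation.

    tolerance mirrors the HPA's default 10% band in which no scaling
    happens; min_replicas and max_replicas are illustrative defaults.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
```

With a 60% target, 4 pods averaging 90% CPU scale out to 6, while the same pods averaging 62% stay at 4 because the ratio is inside the tolerance band.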
-
Future Enhancements Planned:
- Service Mesh Integration: Adoption of Istio or Linkerd to enhance observability, security (e.g., mutual TLS), and traffic management.
- Cost Optimization: Further tuning autoscaling policies and resource limits to minimize waste.
- Edge Improvements: Deploying Kubernetes clusters closer to end-users for reduced latency, potentially using Kubernetes' Cluster Federation.
-
-
- Jan 2023
-
www.youtube.com
-
tl;dw (best DevOps tools in 2023)
- Low-budget cloud computing: Civo (close to Scaleway)
- Infrastructure and Service Management: Crossplane
- App Management - manifests: cdk8s (yes, not Kustomize or Helm)
- App Management - k8s operators: tie between Knative and Crossplane
- App Management - managed services: Google Cloud Run
- Dev Envs: Okteto (yep, not GitPod)
- CI/CD: GitHub Actions (as it's simplest to use)
- GitOps (CD): Argo CD (wins over Flux due to its adoption rate)
- Policy Management: Kyverno (simpler to use than the industry's most powerful tool, OPA / Gatekeeper)
- Observability: OpenTelemetry (instrumentation of apps), VictoriaMetrics (metrics; yes, not Prometheus), Grafana / Loki (logs), Grafana Tempo (tracing), Grafana (dashboards), Robusta (alerting), Komodor (troubleshooting)
-