Scaling a Manifests Repo Across Many Services, Environments, and Clusters
You have 15 microservices, 4 environments (dev, staging, preprod, prod), and prod runs in 3 regional clusters. That's potentially 90 Application CRs. How do you structure the manifests repo so it doesn't fall apart?
At this size, hand-written Application CRs are dead weight. Switch to ApplicationSets generated from the directory layout, push environment differences into structured parameter files, and split the repo into a workloads side and a platform side. Three changes get you there.

First, layout. Two axes: app and environment-or-cluster. I keep the app axis as the primary directory split because most changes happen inside one app.

    workloads/
      checkout/
        base/
        overlays/
          dev/
          staging/
          preprod/
          prod-us-east/
          prod-eu-west/
          prod-ap-south/
      ledger/
        base/
        overlays/
          ... same shape ...
    clusters/
      dev/
        config.yaml        # cluster metadata: name, server, region, labels
      staging/
      preprod/
      prod-us-east/
      prod-eu-west/
      prod-ap-south/
    argocd/
      applicationsets/
        workloads.yaml     # one ApplicationSet that fans out across apps x envs
      projects/

Second, one ApplicationSet does the fan-out. A matrix generator combines a list of apps with the cluster list. The template produces one Application per (app, cluster) pair. Add a new app: drop a directory under workloads/. Add a new region: register the cluster and add an overlay. Both flows are pure Git, no YAML duplication.

Third, push the actual environment differences into well-named files, not into a wall of overlay patches. Each overlay holds a values.yaml or a config.yaml with the parameters that differ: replica count, resource requests, DNS suffix, feature flags. The overlay's kustomization.yaml is a thin shim that pulls in the base, applies a couple of patches, and merges the values file into a ConfigMap. When someone wants to know 'what is different about eu-west prod', they read one file, not five.

Things that go wrong at this scale if you do not plan for them:

1. Argo CD performance. 90 Applications all polling the same Git repo means your repo server becomes a bottleneck. Enable Git webhook integration so commits trigger immediate refresh instead of every Application polling on its own interval. Tune timeout.reconciliation in argocd-cm.
2. Repo server memory. Rendering Kustomize or Helm for 90 Applications eats RAM. Scale up the argocd-repo-server replicas, give them more memory, and consider running a separate Argo CD instance for production clusters so a dev repo storm cannot starve prod reconcile.
3. Blast radius of a base change. A change to workloads/checkout/base/deployment.yaml hits 6 environments at once. Use sync waves and sync windows so prod-eu-west does not sync at the same moment as prod-us-east. Or, for the riskiest changes, use a release branch that overlays target instead of main, so you can promote regions one at a time.
4. Discoverability. With 90 Applications, the Argo CD UI gets hard to navigate. Use labels on the ApplicationSet template (app, env, region, team) and rely on the UI's label filters instead of scrolling.

One pattern I avoid at this size: per-cluster manifests repos. It feels clean and it kills you the first time you need to roll out a security patch to 6 clusters at once. One repo, many Applications, generated from one template. Drift between regions is the enemy.
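As a rough illustration of the "thin shim" overlay described in the third change, here is what workloads/checkout/overlays/prod-eu-west/kustomization.yaml might look like. This is a sketch under assumptions, not the layout's required form: the patch file name, the ConfigMap name checkout-config, and the presence of a matching ConfigMap in the base are all hypothetical.

```yaml
# workloads/checkout/overlays/prod-eu-west/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Pull in the shared base for this app.
resources:
  - ../../base

# Keep patches few and small; the file name is hypothetical.
patches:
  - path: replicas-patch.yaml

# Merge this overlay's parameter file into a ConfigMap.
# Assumes the base already declares a ConfigMap named checkout-config;
# "behavior: merge" fails if there is nothing to merge into.
configMapGenerator:
  - name: checkout-config
    behavior: merge
    files:
      - config.yaml
```

With this shape, config.yaml in the overlay directory is the one file a reader opens to see what differs for that region.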
This is the senior question where you find out whether the candidate has actually operated Argo CD at scale or just configured it for one team. Strong candidates will name specific scaling pain points (repo server memory, polling vs webhook, sync windows) and reach for ApplicationSets and matrix generators without prompting. They will also call out drift between regions as a first-class concern, which most people learn the hard way. Weak candidates will describe the small-scale layout they used in a previous job and pretend it works at 10x scale.
Matrix ApplicationSet generating Applications across apps and clusters
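A minimal sketch of such an ApplicationSet, assuming the directory layout from the answer above. The repository URL, the project name workloads, the cluster label workloads: "enabled", and the convention that each registered cluster name equals its overlay directory name (for example prod-eu-west) are assumptions for illustration. Note that a matrix yields every (app, cluster) pair, so every app needs an overlay for every selected cluster, or the selector must be narrowed.

```yaml
# argocd/applicationsets/workloads.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: workloads
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          # One parameter set per app directory under workloads/.
          - git:
              repoURL: https://git.example.com/org/manifests.git  # hypothetical
              revision: main
              directories:
                - path: workloads/*
          # One parameter set per registered cluster carrying this label.
          - clusters:
              selector:
                matchLabels:
                  workloads: "enabled"  # hypothetical label
  template:
    metadata:
      # e.g. checkout-prod-eu-west
      name: '{{.path.basename}}-{{.name}}'
      labels:
        app: '{{.path.basename}}'
        # Assumes every selected cluster Secret carries an env label.
        env: '{{.metadata.labels.env}}'
    spec:
      project: workloads  # hypothetical AppProject
      source:
        repoURL: https://git.example.com/org/manifests.git  # hypothetical
        targetRevision: main
        # Assumes the cluster name matches the overlay directory name.
        path: 'workloads/{{.path.basename}}/overlays/{{.name}}'
      destination:
        server: '{{.server}}'
        namespace: '{{.path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```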
Cluster Secret with labels the ApplicationSet selector uses
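A sketch of a declaratively registered cluster, assuming the label names used for selection are env, region, and workloads; those three labels, the server URL, and the credential placeholders are hypothetical. Only the argocd.argoproj.io/secret-type: cluster label is required by Argo CD itself.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu-west
  namespace: argocd
  labels:
    # Marks this Secret as an Argo CD cluster definition.
    argocd.argoproj.io/secret-type: cluster
    # Labels below are illustrative; a cluster generator selector can match on them.
    env: prod
    region: eu-west
    workloads: "enabled"
type: Opaque
stringData:
  # Cluster name; in this sketch it matches the overlay directory name.
  name: prod-eu-west
  server: https://prod-eu-west.example.com:6443  # hypothetical API server
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca-cert>"
      }
    }
```

In practice the credentials would come from a secrets manager instead of being committed to Git in plain form.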
Sync windows that stagger prod region rollouts
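One way this could be wired, as a sketch: allow windows on the AppProject, one per prod region, with non-overlapping schedules. The project name, repository URL, times, and the Application naming pattern (app name followed by the cluster name) are assumptions. Once an Application matches an allow window, syncs outside that window are blocked, which is what produces the stagger.

```yaml
# argocd/projects/workloads.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: workloads
  namespace: argocd
spec:
  sourceRepos:
    - https://git.example.com/org/manifests.git  # hypothetical
  destinations:
    - server: '*'
      namespace: '*'
  syncWindows:
    # Each region gets its own two-hour window; windows do not overlap.
    - kind: allow
      schedule: '0 9 * * *'    # cron: daily at 09:00
      duration: 2h
      applications:
        - '*-prod-us-east'     # glob on Application names (assumed naming)
      manualSync: true         # still permit manual syncs outside the window
    - kind: allow
      schedule: '0 11 * * *'
      duration: 2h
      applications:
        - '*-prod-eu-west'
      manualSync: true
    - kind: allow
      schedule: '0 13 * * *'
      duration: 2h
      applications:
        - '*-prod-ap-south'
      manualSync: true
```

Dev, staging, and preprod Applications match no window here, so they continue to sync at any time.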
- Hand-writing 90 Application YAMLs instead of generating them, then never daring to refactor because every change touches every file
- One manifests repo per cluster, which feels organized until you need to roll a CVE fix to all of them at once
- Letting Argo CD poll Git on its default interval across 90 Applications instead of wiring up webhooks, then wondering why changes take 3 minutes to show up
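For the polling point, a sketch of the argocd-cm setting mentioned in the answer. The 600s value is an arbitrary illustration, not a recommendation; the idea is that once the Git provider posts to Argo CD's /api/webhook endpoint, polling only needs to act as a fallback and its interval can be lengthened.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # How often Argo CD re-checks Git on its own.
  # Illustrative value; webhooks handle the fast path.
  timeout.reconciliation: 600s
```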
- At what point would you split into multiple Argo CD instances instead of one? What is the line?
- Your repo server is OOMing. Walk me through how you would diagnose and fix that before throwing more memory at it.
- A base manifest change needs to roll to 6 prod clusters but you want to halt after 2 if anything looks wrong. How do you wire that?
- How would you keep regions from drifting when one region needs a temporary patch the others do not?