[Databricks on AWS #3] Compute Governance on Databricks: Instance Pools, Cluster Policies, and Shared Clusters
Instance Pools - Pre-warmed VMs
A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.
An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability - a pool doesn't restrict anything on its own.
For our workspace we defined six pools, split by CPU/GPU and by size:
| Pool | Instance type | Capacity | Tag |
|---|---|---|---|
| `ip_cpu_small` | `m6g.large` (2 vCPU / 8 GB) | 10 | cpu/small |
| `ip_cpu_medium` | `m6g.xlarge` (4 vCPU / 16 GB) | 15 | cpu/medium |
| `ip_cpu_large` | `r6i.2xlarge` (8 vCPU / 64 GB) | 15 | cpu/large |
| `ip_cpu_xlarge` | `r6i.4xlarge` (16 vCPU / 128 GB) | 20 | cpu/xlarge |
| `ip_gpu_small` | `g5.xlarge` (1× A10G) | 10 | gpu/small |
| `ip_gpu_large` | `g5.2xlarge` (1× A10G) | 20 | gpu/large |
The important knobs are shared across all of them:
- `min_idle_instances = 0` - we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
- `idle_instance_autotermination_minutes = 10` - idle VMs release themselves after 10 minutes.
- `availability = ON_DEMAND` - no spot reclaim surprises for interactive work.
The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
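For readers who want to see those knobs in one place, here is a minimal Terraform sketch using the `databricks_instance_pool` resource. It is an illustration rather than the module used for this workspace: the `local.pools` map, the `pool_class` tag key, the `pool_ids` output name, and the reading of "Capacity" as `max_capacity` are all assumptions.

```hcl
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Illustrative subset of the pool table; names of locals are hypothetical.
locals {
  pools = {
    ip_cpu_small = { node_type = "m6g.large", capacity = 10, tag = "cpu/small" }
    ip_gpu_large = { node_type = "g5.2xlarge", capacity = 20, tag = "gpu/large" }
  }
}

resource "databricks_instance_pool" "this" {
  for_each = local.pools

  instance_pool_name = each.key
  node_type_id       = each.value.node_type
  max_capacity       = each.value.capacity # assumption: "Capacity" means max_capacity

  # The shared knobs described above.
  min_idle_instances                    = 0
  idle_instance_autotermination_minutes = 10

  aws_attributes {
    availability = "ON_DEMAND"
  }

  # Tags flow through to the EC2 instances for cost allocation.
  custom_tags = {
    pool_class = each.value.tag # hypothetical tag key
  }
}

# Downstream modules (policies, clusters) consume these IDs.
output "pool_ids" {
  value = { for name, pool in databricks_instance_pool.this : name => pool.id }
}
```

Driving every pool from one map keeps the shared settings in a single resource block, so a change to the idle timeout cannot drift between pools.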
Cluster Policies - The Actual Governance
Instance pools make clusters fast. Cluster policies make clusters legal.
A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it - they can't override the instance type, can't disable autotermination, can't ask for 200 workers.
Our analytics workspace has eight policies. A representative slice:
| Policy | Pool (worker / driver) | Runtime | Workers | Autoterm (min) | Max clusters/user |
|---|---|---|---|---|---|
| `cp_cpu_small` | `ip_cpu_small` / `ip_cpu_small` | `14.3.x-scala2.12` | 1–4 | 10 | 2 |
| `cp_cpu_medium` | `ip_cpu_medium` / `ip_cpu_medium` | `14.3.x-scala2.12` | 0–8 | 10 | 2 |
| `cp_cpu_large` | `ip_cpu_large` / `ip_cpu_medium` | `14.3.x-scala2.12` | 0–12 | 10 | 1 |
| `cp_gpu_small` | `ip_gpu_small` / `ip_cpu_small` | `14.3.x-gpu-ml-scala2.12` | 0–8 | 10 | 1 |
| `cp_job_standard` | `ip_cpu_medium` / `ip_cpu_medium` | `14.3.x-scala2.12` | 0–16 | 30 | - |
A few design choices worth calling out:
- Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
- `data_security_mode = USER_ISOLATION` across the board - Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
- Driver often runs on a smaller pool than workers (e.g. `cp_cpu_large` uses `ip_cpu_large` workers but an `ip_cpu_medium` driver). The driver rarely needs to match worker muscle, and this trims cost.
- `max_clusters_per_user` caps sprawl. The big policies are limited to 1 cluster per person.
- A shared `cost_center` tag on every policy feeds the same cost-allocation story as the pools.
- Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.
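As a rough illustration of how those choices translate into a policy definition, here is a sketch of `cp_cpu_large` using the `databricks_cluster_policy` resource. The `pool_ids` variable, the use of `autoscale.*` keys for the worker range, and the `cost_center` value are assumptions for the example, not the workspace's actual definition.

```hcl
# Assumes the databricks/databricks provider is already configured.

# Hypothetical input: pool name => pool ID, exported by the pool module.
variable "pool_ids" {
  type = map(string)
}

resource "databricks_cluster_policy" "cp_cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1

  definition = jsonencode({
    # Workers and driver are pinned to pools; the driver uses the smaller one.
    "instance_pool_id"        = { type = "fixed", value = var.pool_ids["ip_cpu_large"], hidden = true }
    "driver_instance_pool_id" = { type = "fixed", value = var.pool_ids["ip_cpu_medium"], hidden = true }

    "spark_version" = { type = "fixed", value = "14.3.x-scala2.12" }

    # Worker ceiling from the table (0-12); assumes autoscaling clusters.
    "autoscale.min_workers" = { type = "range", minValue = 0, maxValue = 12, defaultValue = 0 }
    "autoscale.max_workers" = { type = "range", maxValue = 12 }

    # "fixed" means the user cannot change or remove the value.
    "autotermination_minutes" = { type = "fixed", value = 10 }
    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" }

    # Hypothetical tag value.
    "custom_tags.cost_center" = { type = "fixed", value = "analytics" }
  })
}
```

The `fixed` entries are what make autotermination and isolation non-negotiable, while the `range` entries leave the user some room to size the cluster inside the box.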
The Entitlement Gate - allow_cluster_create
Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.
Databricks has a workspace-level entitlement for exactly this: allow_cluster_create. Turn it off for a group, and members of that group physically cannot create clusters - the button is gone, the API call is rejected. It doesn't matter what policies exist; the door is locked before they reach it.
This is the gate that makes the whole role model coherent:
| Role | `allow_cluster_create` | Cluster policies | What they can do |
|---|---|---|---|
| Admin | on | all / unrestricted | Create anything. Break glass. |
| Engineer | on | assigned policies only | Create clusters - but only inside their policy's box |
| Analyst | off | none | No creation at all - only attach to pre-made shared clusters |
An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel - they use what's already there. And that "what's already there" is the last piece.
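In Terraform, the gate described above can be sketched with `databricks_entitlements`, plus a `CAN_USE` grant on a policy for the engineer tier. The group names and the `policy_id` variable are hypothetical, and it is worth checking the current Databricks documentation for how this entitlement interacts with policy permissions before relying on it.

```hcl
# Assumes the databricks/databricks provider is already configured.

# Hypothetical group names.
data "databricks_group" "analysts" {
  display_name = "analysts"
}

data "databricks_group" "engineers" {
  display_name = "engineers"
}

# Hypothetical input: ID of a policy the engineers are allowed to use.
variable "policy_id" {
  type = string
}

# Analysts: workspace access, but no cluster creation.
resource "databricks_entitlements" "analysts" {
  group_id             = data.databricks_group.analysts.id
  workspace_access     = true
  allow_cluster_create = false
}

# Engineers: cluster creation on, as in the role table above.
# Note: Databricks documents this entitlement as "unrestricted cluster
# creation", so verify the resulting behaviour in your own workspace.
resource "databricks_entitlements" "engineers" {
  group_id             = data.databricks_group.engineers.id
  workspace_access     = true
  allow_cluster_create = true
}

# Engineers may use the assigned policy; analysts get no policy grant.
resource "databricks_permissions" "engineers_policy" {
  cluster_policy_id = var.policy_id

  access_control {
    group_name       = data.databricks_group.engineers.display_name
    permission_level = "CAN_USE"
  }
}
```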
Shared Clusters - For the People Who Can't Create
If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:
| Cluster | Policy | Runtime | Shape |
|---|---|---|---|
| `cp_shared_small` | `cp_cpu_small` | `14.3.x-scala2.12` | `m6g.large`, 1 driver + 1 worker |
| `cp_shared_medium` | `cp_cpu_medium` | `14.3.x-scala2.12` | `m6g.xlarge`, 1 driver + 0 min workers |
These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs (CAN_USE, CAN_ATTACH_TO) - an analyst attaches, runs their notebook, and never touches a provisioning decision.
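A shared cluster of this kind, together with its attach permission, might look roughly like the following. This is a sketch under assumptions: the `pool_ids` and `policy_ids` variables and the `analysts` group name are hypothetical, a fixed `num_workers = 1` is one reading of "1 driver + 1 worker", and only the `CAN_ATTACH_TO` grant on the cluster is shown.

```hcl
# Assumes the databricks/databricks provider is already configured.

# Hypothetical inputs exported by the pool and policy modules.
variable "pool_ids" {
  type = map(string)
}

variable "policy_ids" {
  type = map(string)
}

resource "databricks_cluster" "shared_small" {
  cluster_name = "cp_shared_small"
  policy_id    = var.policy_ids["cp_cpu_small"]

  spark_version           = "14.3.x-scala2.12"
  instance_pool_id        = var.pool_ids["ip_cpu_small"]
  driver_instance_pool_id = var.pool_ids["ip_cpu_small"]
  num_workers             = 1 # assumption: fixed size, 1 driver + 1 worker

  # These must agree with the values the policy fixes.
  autotermination_minutes = 10
  data_security_mode      = "USER_ISOLATION"

  # Let the policy fill in any fixed attributes not set here (e.g. tags).
  apply_policy_default_values = true
}

# Analysts can attach notebooks but cannot restart or reconfigure.
resource "databricks_permissions" "shared_small" {
  cluster_id = databricks_cluster.shared_small.id

  access_control {
    group_name       = "analysts" # hypothetical group
    permission_level = "CAN_ATTACH_TO"
  }
}
```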
The pipeline workspace, by contrast, has no shared clusters - it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.
Apply Order: Pool → Policy → Compute (and the Terragrunt Trap)
Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain: instance-pool → cluster-policy → compute (clusters). Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:
# 1. pools first
atlantis apply -d .../ws-landing/instance-pool
# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy
# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute
And now the gotcha that cost us a confused afternoon. On the very first plan - before pools are applied - the policy and compute plans fail with:
Error: ... instance-pool ... is a dependency of ... cluster-policy ... but detected no outputs.
...
This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied - but only for the Terraform commands you allowlist:
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced - those pools were already applied, so real outputs existed.
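For context, a Terragrunt `dependency` block with mock outputs generally has the shape below. This is a sketch of what the policy module's `terragrunt.hcl` could look like, not the actual file: the relative `config_path`, the `pool_ids` output name, and the mock values are assumptions.

```hcl
# cluster-policy/terragrunt.hcl (illustrative)

dependency "instance_pool" {
  # Assumed sibling-directory layout.
  config_path = "../instance-pool"

  # Placeholder values returned when the upstream module has no real outputs yet.
  mock_outputs = {
    pool_ids = {
      ip_cpu_small  = "mock-pool-id"
      ip_cpu_medium = "mock-pool-id"
      ip_cpu_large  = "mock-pool-id"
    }
  }

  # Mocks are only used for the commands listed here; any other command
  # needs the real outputs of an applied upstream module.
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # Real IDs after the pool module is applied, mocks before that.
  pool_ids = dependency.instance_pool.outputs.pool_ids
}
```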
Two ways out:
- Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
- Add `"init"` to `mock_outputs_allowed_terraform_commands` on the policy/compute modules. The mock now covers the `init` phase too - but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.
The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."
Takeaways
- Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination - non-negotiable). Entitlement gate = who's even allowed to create.
- Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. `allow_cluster_create = off` is what makes the analyst tier real.
- Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
- Apply order is load-bearing. pool → policy → compute, and the first cold plan failing with "detected no outputs" is a Terragrunt mock/`init` quirk, not a bug in your code.
So we applied it. Pools came up - real IDs issued. Policies came up - real IDs issued. Then we ran the compute apply to create those two shared clusters, and...
databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
Self-bootstrap timed out during launch
...
BOOTSTRAP_TIMEOUT. The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway - 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.
Next: The BOOTSTRAP_TIMEOUT Mystery - tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.