DEV Community 2h ago

[Databricks on AWS #3] Compute Governance on Databricks: Instance Pools, Cluster Policies, and Shared Clusters

Instance Pools - Pre-warmed VMs

A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.

An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability - a pool doesn't restrict anything on its own.

For our workspace we defined six pools, split by CPU/GPU and by size:

Pool	Instance type	Capacity	Tag
`ip_cpu_small`	`m6g.large` (2 vCPU / 8 GB)	10	cpu/small
`ip_cpu_medium`	`m6g.xlarge` (4 vCPU / 16 GB)	15	cpu/medium
`ip_cpu_large`	`r6i.2xlarge` (8 vCPU / 64 GB)	15	cpu/large
`ip_cpu_xlarge`	`r6i.4xlarge` (16 vCPU / 128 GB)	20	cpu/xlarge
`ip_gpu_small`	`g5.xlarge` (1× A10G)	10	gpu/small
`ip_gpu_large`	`g5.2xlarge` (1× A10G)	20	gpu/large

The important knobs are shared across all of them:

`min_idle_instances = 0` - we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
`idle_instance_autotermination_minutes = 10` - idle VMs release themselves after 10 minutes.
`availability = ON_DEMAND` - no spot reclaim surprises for interactive work.

The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
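
As a rough sketch of what one of these pools could look like with the Databricks Terraform provider: the resource label, the tag keys, and the mapping of the table's "Capacity" column onto `max_capacity` are assumptions, not the actual module.

```hcl
# Sketch only: resource label, tag keys, and the max_capacity mapping are assumptions.
resource "databricks_instance_pool" "cpu_small" {
  instance_pool_name = "ip_cpu_small"
  node_type_id       = "m6g.large"

  min_idle_instances                    = 0  # nothing kept warm while idle
  max_capacity                          = 10 # assumed to be the "Capacity" column
  idle_instance_autotermination_minutes = 10 # idle VMs are released after 10 minutes

  aws_attributes {
    availability = "ON_DEMAND" # no spot reclaim for interactive work
  }

  # Hypothetical tag keys; the table above only shows values such as cpu/small.
  custom_tags = {
    pool_class = "cpu"
    pool_size  = "small"
  }
}
```

The other five pools would differ only in name, `node_type_id`, capacity, and tag values, which is a natural fit for a `for_each` over a map.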

Cluster Policies - The Actual Governance

Instance pools make clusters fast. Cluster policies make clusters legal.

A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it - they can't override the instance type, can't disable autotermination, can't ask for 200 workers.

Our analytics workspace has eight policies. A representative slice:

Policy	Pool (worker / driver)	Runtime	Workers	Autoterm (min)	Max clusters/user
`cp_cpu_small`	`ip_cpu_small` / `ip_cpu_small`	14.3.x-scala2.12	1–4	10	2
`cp_cpu_medium`	`ip_cpu_medium` / `ip_cpu_medium`	14.3.x-scala2.12	0–8	10	2
`cp_cpu_large`	`ip_cpu_large` / `ip_cpu_medium`	14.3.x-scala2.12	0–12	10	1
`cp_gpu_small`	`ip_gpu_small` / `ip_cpu_small`	14.3.x-gpu-ml-scala2.12	0–8	10	1
`cp_job_standard`	`ip_cpu_medium` / `ip_cpu_medium`	14.3.x-scala2.12	0–16	30	-

A few design choices worth calling out:

Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
`data_security_mode = USER_ISOLATION` across the board - Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
The driver often runs on a smaller pool than the workers (e.g. `cp_cpu_large` uses `ip_cpu_large` workers but an `ip_cpu_medium` driver). The driver rarely needs to match worker muscle, and this trims cost.
`max_clusters_per_user` caps sprawl. The big policies are limited to 1 cluster per person.
A shared `cost_center` tag on every policy feeds the same cost-allocation story as the pools.
Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.
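
To make those choices concrete, here is a minimal sketch of how `cp_cpu_large` might be expressed. The variable names and the `cost_center` value are hypothetical, and enforcing the worker ceiling through `autoscale.max_workers` is an assumption about how the "Workers" column is implemented.

```hcl
# Hypothetical inputs: in practice these would come from the instance-pool module.
variable "worker_pool_id" { type = string }
variable "driver_pool_id" { type = string }

resource "databricks_cluster_policy" "cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1 # the big policies allow one cluster per person

  # The policy definition is a JSON document; "fixed" pins a value, "range" bounds it.
  definition = jsonencode({
    "instance_pool_id"        = { type = "fixed", value = var.worker_pool_id, hidden = true }
    "driver_instance_pool_id" = { type = "fixed", value = var.driver_pool_id, hidden = true }
    "spark_version"           = { type = "fixed", value = "14.3.x-scala2.12" }
    "autoscale.max_workers"   = { type = "range", maxValue = 12 }
    "autotermination_minutes" = { type = "fixed", value = 10, hidden = true }
    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" }
    "custom_tags.cost_center" = { type = "fixed", value = "analytics" } # assumed value
  })
}
```

Because the pool IDs and autotermination are `fixed`, a user creating a cluster under this policy has no field left to override them with.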

The Entitlement Gate - `allow_cluster_create`

Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.

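Databricks has a workspace-level entitlement for exactly this: `allow_cluster_create`. Turn it off for a group, grant that group no cluster policies, and its members cannot create clusters - the button is gone, the API call is rejected. Both halves matter: per the Databricks Terraform provider docs, a principal without the entitlement can still create clusters inside any policy it has been granted, so the door is only locked when there is no policy left to walk through.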

This is the gate that makes the whole role model coherent:

Role	`allow_cluster_create`	Cluster policies	What they can do
Admin	on	all / unrestricted	Create anything. Break glass.
Engineer	on	assigned policies only	Create clusters - but only inside their policy's box
Analyst	off	none	No creation at all - only attach to pre-made shared clusters

An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel - they use what's already there. And that "what's already there" is the last piece.
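
A minimal sketch of the analyst row in Terraform, assuming a workspace group named `analysts` already exists (the group name and the other two entitlement values are assumptions):

```hcl
# Hypothetical group name; assumes the group already exists in the workspace.
data "databricks_group" "analysts" {
  display_name = "analysts"
}

resource "databricks_entitlements" "analysts" {
  group_id = data.databricks_group.analysts.id

  allow_cluster_create  = false # the gate described above
  workspace_access      = true  # assumed: analysts still need the workspace UI
  databricks_sql_access = true  # assumed
}
```

`databricks_entitlements` manages the group's whole entitlement set, so anything not listed here is worth double-checking before applying it to an existing group. Since this group also holds no policy grants, it has no other path to creating a cluster.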

Shared Clusters - For the People Who Can't Create

If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:

Cluster	Policy	Runtime	Shape
`cp_shared_small`	`cp_cpu_small`	14.3.x-scala2.12	`m6g.large`, 1 driver + 1 worker
`cp_shared_medium`	`cp_cpu_medium`	14.3.x-scala2.12	`m6g.xlarge`, 1 driver + 0 min workers

These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs (CAN_USE, CAN_ATTACH_TO) - an analyst attaches, runs their notebook, and never touches a provisioning decision.
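
As an illustration, the small shared cluster and its ACL could be sketched like this. The variables and the `analysts` group name are hypothetical, and only the `CAN_ATTACH_TO` grant is shown.

```hcl
# Hypothetical inputs: IDs produced by the cluster-policy and instance-pool modules.
variable "policy_id" { type = string }
variable "pool_id" { type = string }

resource "databricks_cluster" "shared_small" {
  cluster_name            = "cp_shared_small"
  policy_id               = var.policy_id # cp_cpu_small
  instance_pool_id        = var.pool_id   # ip_cpu_small; the driver uses the same pool by default
  spark_version           = "14.3.x-scala2.12"
  num_workers             = 1
  autotermination_minutes = 10
  data_security_mode      = "USER_ISOLATION"
}

resource "databricks_permissions" "shared_small" {
  cluster_id = databricks_cluster.shared_small.id

  access_control {
    group_name       = "analysts" # hypothetical group
    permission_level = "CAN_ATTACH_TO"
  }
}
```

One thing to check in your own setup: `CAN_ATTACH_TO` does not let a user start a terminated cluster. With a 10-minute autotermination, analysts would need `CAN_RESTART` to wake the cluster up themselves.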

The pipeline workspace, by contrast, has no shared clusters - it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.

Apply Order: Pool → Policy → Compute (and the Terragrunt Trap)

Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain: instance-pool → cluster-policy → compute (clusters). Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:

# 1. pools first
atlantis apply -d .../ws-landing/instance-pool

# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy

# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute

And now the gotcha that cost us a confused afternoon. On the very first plan - before pools are applied - the policy and compute plans fail with:

Error: ... instance-pool ... is a dependency of ... cluster-policy ... but detected no outputs.
...

This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied - but only for the Terraform commands you allowlist:

mock_outputs_allowed_terraform_commands = ["validate", "plan"]

The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced - those pools were already applied, so real outputs existed.

Two ways out:

Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
Add "init" to mock_outputs_allowed_terraform_commands on the policy/compute modules. The mock now covers the init phase too - but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.
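
For the second option, the dependency block on the policy module might look roughly like this. The `config_path`, the `pool_ids` output name, and the mock value are assumptions about the repo layout, not the real config.

```hcl
# terragrunt.hcl of the cluster-policy module (path and output names are assumptions)
dependency "instance_pool" {
  config_path = "../instance-pool"

  # Placeholder values, used only while the real pool outputs do not exist yet.
  mock_outputs = {
    pool_ids = {
      ip_cpu_small = "mock-pool-id"
    }
  }

  # Adding "init" lets the mock cover the init phase on a cold bootstrap.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  pool_ids = dependency.instance_pool.outputs.pool_ids
}
```

Leaving `apply` out of the list is deliberate: a real apply should fail loudly rather than create policies that point at a mock pool ID.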

The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."

Takeaways

Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination - non-negotiable). Entitlement gate = who's even allowed to create.
Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. `allow_cluster_create` = off, combined with no policy grants, is what makes the analyst tier real.
Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
Apply order is load-bearing. pool → policy → compute, and the first cold plan failing with "detected no outputs" is a Terragrunt mock/init quirk, not a bug in your code.

So we applied it. Pools came up - real IDs issued. Policies came up - real IDs issued. Then we ran the compute apply to create those two shared clusters, and...

databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch
...

BOOTSTRAP_TIMEOUT. The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway - 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.

Next: The BOOTSTRAP_TIMEOUT Mystery - tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.
