[Databricks on AWS #5] Fixing Databricks BOOTSTRAP_TIMEOUT with AWS PrivateLink: Control Plane Over the Backbone, Zero New Subnets

The BOOTSTRAP_TIMEOUT Mystery

In Part 4 we traced a BOOTSTRAP_TIMEOUT all the way to a centralized egress firewall that silently dropped our new workspace CIDR. Here's the clean fix - take the control-plane traffic off the internet entirely, without touching the existing VPC.

In Part 4 we found the culprit: cluster nodes with no public IP were trying to phone home to the Databricks control plane and secure cluster connectivity (SCC) relay, and a centralized inspection firewall was dropping the packets because our new workspace CIDR wasn't in its allow-list.

We could have just added a firewall rule and moved on. Instead we did the thing our security team actually wanted from the start: get that traffic off the public internet completely. The tool for that is AWS PrivateLink. And the interesting part isn't PrivateLink itself - it's that we landed it with zero new subnets, zero new CIDR, and zero routing changes, in a VPC that was already full.

Why VPC endpoints alone don't fix it (the trap from Part 4)

The obvious instinct is: "no internet - just put everything on VPC endpoints." That instinct is correct, but only for AWS services. S3, STS, and Kinesis all publish com.amazonaws... endpoint services, so you can ride the AWS backbone to them with gateway or interface endpoints.

The thing that was actually blocked in Part 4 is different: it's the Databricks control plane and the SCC relay. That's Databricks-owned infrastructure, not an AWS service. There is no com.amazonaws... endpoint for it that you get for free.

Databricks does, however, publish its own PrivateLink endpoint services - you just have to explicitly wire them up. So the real fix splits in two:

| Traffic | Solution |
|---|---|
| S3 / STS / Kinesis | AWS VPC endpoints (optional, separate) |
| Control plane REST API + SCC relay | Databricks PrivateLink - this is the one that unblocks bootstrap |

This post is about that second row.

Back-end vs front-end PrivateLink

Databricks PrivateLink comes in two flavors, and you need to know which one you're solving for:

| Type | Connection | Our setup |
|---|---|---|
| Back-end (compute plane) | cluster → workspace core services: REST API + SCC relay | ✅ required |
| Front-end (inbound) | users → workspace over a private path | optional (we skipped it) |

The BOOTSTRAP_TIMEOUT is purely a back-end problem: the compute plane can't reach core services. Front-end PrivateLink is about how users reach the workspace UI - a different concern, and one we deliberately left alone so that people can still log in normally over the internet.

Back-end PrivateLink means two interface VPC endpoints:

  • one to the workspace / REST API endpoint service
  • one to the SCC relay endpoint service

Both are required. Skip one and you break either the REST channel or the relay channel - the cluster still won't come up.

The elegant part: no new subnets

Here's the constraint we walked into. The VPC was a /24 and it was full: the address space was already carved into subnets, with no room left for a fresh /28 for endpoints. On most PrivateLink walkthroughs step one is "create a subnet for your endpoints." We couldn't.

Then the realization: an interface VPC endpoint is just an ENI with a private IP. It doesn't need its own subnet. You can drop its ENI straight into an existing subnet - including the very cluster subnets your compute already runs in. Each endpoint creates one ENI, and so consumes one private IP, in each subnet you attach it to, and that's it.

So the plan became: put both endpoints' ENIs in the existing cluster subnets. No new subnet, no new CIDR. Just ~4 IPs out of the dozens the subnets had free.

Why there's no routing change either

This is the part that made the security review easy, so it's worth being precise about why it's true. An interface endpoint is reachable at a private IP inside your VPC. When a cluster node talks to that IP, it's talking to something in its own VPC - which every route table already handles via the implicit local route. There is nothing to add. No 0.0.0.0/0 change, no new prefix, no target rewrite.

Contrast this with an S3 gateway endpoint, which does modify routing - a gateway endpoint installs a prefix-list route (pl-xxxx → vpce-xxxx) into your route tables. That's a real routing change you have to review. Interface endpoints don't work that way. They work via DNS plus the local route, not route-table entries.
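The contrast shows up in how the two endpoint types are declared. Below is a minimal Terraform sketch, assuming the AWS provider; the variable names are hypothetical and not taken from our actual module, and STS stands in here as an example of an AWS-service interface endpoint.

```hcl
# Hypothetical inputs, named for illustration only.
variable "region" { type = string }
variable "vpc_id" { type = string }
variable "route_table_ids" { type = list(string) }
variable "cluster_subnet_ids" { type = list(string) }
variable "endpoint_sg_id" { type = string }

# Gateway endpoint: associated with route tables, so AWS adds a
# prefix-list route to each of them. This is a routing change.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.route_table_ids
}

# Interface endpoint: placed in subnets as ENIs and guarded by a
# security group. It takes no route tables; traffic reaches it over
# the VPC's implicit local route, steered there by private DNS.
resource "aws_vpc_endpoint" "sts" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.sts"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.cluster_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}
```

The Databricks endpoints are of the second kind, which is why the route tables stay untouched.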

Private DNS is what redirects the traffic

When you enable private DNS on the interface endpoints, AWS overrides DNS resolution inside your VPC so that the Databricks domains resolve to the endpoints' private ENI IPs instead of public addresses. So tunnel.<region>.cloud.databricks.com - the SCC relay host that was getting dropped at the firewall in Part 4 - now resolves to a private IP in your own cluster subnet.

The node connects to that private IP, hits the endpoint ENI, and the traffic rides the AWS backbone straight to Databricks' endpoint service. It never leaves AWS's network. It never touches the firewall.

And critically: private DNS only rewrites the Databricks domains. Every other service in the VPC keeps resolving exactly as before, because they don't use those hostnames. That's why the blast radius here is essentially zero - the existing (non-Databricks) workloads are completely untouched.

To summarize the "why nothing else breaks" story:

| Thing | Change? | Why |
|---|---|---|
| Route tables | none | interface endpoints use DNS + local, not routes |
| Existing non-Databricks services | none | private DNS only rewrites Databricks domains |
| Subnets / CIDR | none | ENIs go into existing subnets |
| Existing SGs / NACLs | none | a new SG is created just for the endpoints |
| The only real change | ~4 private IPs consumed + workspaces flipped to PrivateLink | non-disruptive |

The AWS side

Concretely, the infra team does three small things - and notably, no new subnet, no new CIDR, no route change:

  1. One security group for the endpoints, allowing inbound 443 (and the relay port) from the cluster security group.
  2. Two interface VPC endpoints, both placed in the existing cluster subnets (one per AZ), with private DNS enabled, pointed at the two Databricks endpoint services. In <region> those services are:

| Endpoint | Service name |
|---|---|
| Workspace (REST API) | com.amazonaws.vpce.<region>.vpce-svc-... |
| SCC relay | com.amazonaws.vpce.<region>.vpce-svc-... |

(The exact vpce-svc-... values come from the Databricks region docs - always re-verify them right before applying, and use the console's "Verify service" to confirm they resolve.)

  3. Hand back the two VPC endpoint IDs (vpce-...) for the Databricks-side registration.

That's the whole AWS footprint. A security group and two ENIs.
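The three steps above could be sketched in Terraform roughly as follows. This is an illustration under assumptions, not our actual module: all variable and resource names are hypothetical, the service names must be supplied from the Databricks region docs, and the relay port default of 6666 is what the Databricks documentation listed at the time of writing, so verify it before relying on it. Egress rules are left out of the sketch.

```hcl
# Hypothetical inputs, named for illustration only.
variable "vpc_id" { type = string }
variable "cluster_subnet_ids" { type = list(string) } # existing cluster subnets, one per AZ
variable "cluster_sg_id" { type = string }            # existing cluster security group
variable "workspace_service_name" { type = string }   # com.amazonaws.vpce.<region>.vpce-svc-...
variable "relay_service_name" { type = string }       # com.amazonaws.vpce.<region>.vpce-svc-...

variable "relay_port" {
  type    = number
  default = 6666 # assumption: confirm against the current Databricks docs
}

# Step 1: one security group dedicated to the endpoints.
resource "aws_security_group" "databricks_vpce" {
  name        = "databricks-privatelink-endpoints"
  description = "Databricks back-end PrivateLink endpoints"
  vpc_id      = var.vpc_id

  ingress {
    description     = "REST API from cluster nodes"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [var.cluster_sg_id]
  }

  ingress {
    description     = "SCC relay from cluster nodes"
    from_port       = var.relay_port
    to_port         = var.relay_port
    protocol        = "tcp"
    security_groups = [var.cluster_sg_id]
  }
}

# Step 2: two interface endpoints in the existing subnets, private DNS on.
resource "aws_vpc_endpoint" "workspace" {
  vpc_id              = var.vpc_id
  service_name        = var.workspace_service_name
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.cluster_subnet_ids
  security_group_ids  = [aws_security_group.databricks_vpce.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "relay" {
  vpc_id              = var.vpc_id
  service_name        = var.relay_service_name
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.cluster_subnet_ids
  security_group_ids  = [aws_security_group.databricks_vpce.id]
  private_dns_enabled = true
}

# Step 3: the IDs handed back for the Databricks-side registration.
output "workspace_vpce_id" { value = aws_vpc_endpoint.workspace.id }
output "relay_vpce_id" { value = aws_vpc_endpoint.relay.id }
```

Note that nothing here creates a subnet or touches a route table; the only network-level objects are the security group and the endpoint ENIs.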

The Databricks side (Terraform)

Now Databricks needs to know these endpoints exist and be told to route the workspace through them. Three resource types:

  1. Register both endpoints with the Databricks account - one registration for the relay, one for the REST API:

```hcl
# SCC relay endpoint, as created on the AWS side
resource "databricks_mws_vpc_endpoint" "relay" {
  provider            = databricks.mws
  account_id          = "<databricks-account-id>"
  aws_vpc_endpoint_id = "vpce-...relay"
  vpc_endpoint_name   = "our-dev-relay-vpce"
  region              = "<region>"
}

# Workspace (REST API) endpoint
resource "databricks_mws_vpc_endpoint" "rest" {
  provider            = databricks.mws
  account_id          = "<databricks-account-id>"
  aws_vpc_endpoint_id = "vpce-...workspace"
  vpc_endpoint_name   = "our-dev-workspace-vpce"
  region              = "<region>"
}
```

  2. Create private access settings - and this is the one setting people get wrong:

```hcl
resource "databricks_mws_private_access_settings" "pas" {
  provider                     = databricks.mws
  private_access_settings_name = "our-dev-pas"
  region                       = "<region>"
  public_access_enabled        = true # back-end only: users still log in over the internet
}
```

public_access_enabled = true is deliberate. We're doing back-end PrivateLink only - the compute plane goes private, but we want users to keep logging in over the internet. Set this to false and you've quietly switched on front-end lockdown too, and your users get locked out of the UI. Keep it true unless you're also doing front-end PrivateLink on purpose.

  3. Attach the endpoints to the network and the workspaces. The network config gets the two endpoints wired into vpc_endpoints { dataplane_relay = [...], rest_api = [...] }, and the workspaces get private_access_settings_id and the network_id. Note: subnet_ids stays exactly as it was - we are not changing the cluster subnets, only adding the endpoint references.
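As a rough sketch of that last step, the wiring might look like the following. It reuses the `relay`, `rest`, and `pas` resources shown earlier; everything else (the variables, the resource labels, the network and workspace names) is hypothetical and only there to make the fragment complete.

```hcl
# Hypothetical inputs, named for illustration only.
variable "databricks_account_id" { type = string }
variable "region" { type = string }
variable "vpc_id" { type = string }
variable "cluster_subnet_ids" { type = list(string) }
variable "cluster_sg_id" { type = string }
variable "credentials_id" { type = string }
variable "storage_configuration_id" { type = string }

resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = var.databricks_account_id
  network_name       = "our-dev-network" # hypothetical name
  vpc_id             = var.vpc_id
  subnet_ids         = var.cluster_subnet_ids # unchanged: same cluster subnets as before
  security_group_ids = [var.cluster_sg_id]

  # The only addition: references to the two registered endpoints.
  vpc_endpoints {
    dataplane_relay = [databricks_mws_vpc_endpoint.relay.vpc_endpoint_id]
    rest_api        = [databricks_mws_vpc_endpoint.rest.vpc_endpoint_id]
  }
}

resource "databricks_mws_workspaces" "this" {
  provider                 = databricks.mws
  account_id               = var.databricks_account_id
  workspace_name           = "our-dev" # hypothetical name
  aws_region               = var.region
  credentials_id           = var.credentials_id
  storage_configuration_id = var.storage_configuration_id

  # These two references are what flip the workspace to PrivateLink.
  network_id                 = databricks_mws_networks.this.network_id
  private_access_settings_id = databricks_mws_private_access_settings.pas.private_access_settings_id
}
```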

Gotchas (the ones that cost real time)

  • You need both endpoints. Workspace/REST and relay. One alone silently half-works and the cluster still fails.
  • Private DNS is mandatory. Forget to enable it and the Databricks domains keep resolving to public IPs - the endpoints exist but nothing uses them. This is the single most common "I set it up but it still times out" cause.
  • Keep public_access_enabled = true for back-end-only PrivateLink, or you'll lock users out of the console.
  • Wait ~20 minutes after the workspace flips to PrivateLink before you launch a test cluster. The registration and DNS propagation aren't instant, and an early test will look like a failure that isn't one.
  • Do not SSL-decrypt the relay. Same certificate-pinning gotcha from Part 4 - if any inspection sits in front of it and forward-proxy decrypts, the relay breaks. With PrivateLink the traffic bypasses the firewall entirely, but if you still have belt-and-suspenders inspection anywhere in the path, exclude the Databricks domains.

Verification

The proof is three checks, in order:

  1. Workspace shows RUNNING, then wait ~20 minutes.
  2. From inside the VPC, nslookup tunnel.<region>.cloud.databricks.com → it should resolve to a private endpoint IP in your cluster subnet, not a public address.
  3. Launch a cluster. It reaches RUNNING. BOOTSTRAP_TIMEOUT is gone.

That's it. The same cluster that spent 11 minutes stuck in INSTANCE_INITIALIZING in Part 4 now comes up cleanly - over the backbone, off the internet, with the rest of the VPC none the wiser.
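For check 2 it helps to know which private IPs the nslookup is expected to return. One possible way to list them, sketched here with AWS provider data sources, is to read the relay endpoint's ENIs back out of AWS. The variable name and output name are hypothetical, and the lookup assumes the endpoint already exists.

```hcl
# Hypothetical input: the relay endpoint ID (vpce-...) from the AWS side.
variable "relay_vpce_id" { type = string }

# Look up the existing interface endpoint by ID.
data "aws_vpc_endpoint" "relay" {
  id = var.relay_vpce_id
}

# One ENI per subnet the endpoint is attached to.
data "aws_network_interface" "relay" {
  for_each = toset(data.aws_vpc_endpoint.relay.network_interface_ids)
  id       = each.value
}

# The private IPs that tunnel.<region>.cloud.databricks.com should
# resolve to from inside the VPC once private DNS is in effect.
output "relay_endpoint_private_ips" {
  value = [for eni in data.aws_network_interface.relay : eni.private_ip]
}
```

If the nslookup answer is not one of these addresses, private DNS is the first thing to re-check.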

Takeaways

  • VPC endpoints solve S3/STS/Kinesis. The control plane + relay are Databricks-owned - for those you need PrivateLink, and that's specifically what fixes BOOTSTRAP_TIMEOUT.
  • Back-end PrivateLink = two interface endpoints (workspace/REST + SCC relay). Both required.
  • An interface endpoint is just an ENI with a private IP - it fits into an existing subnet, so no new subnet, no new CIDR, and no routing change. (Contrast S3 gateway endpoints, which do add a route.)
  • Private DNS is the actual redirect mechanism, and it only touches the Databricks domains - so existing services are untouched.
  • Keep public_access_enabled = true for back-end-only, wait 20 minutes, and never decrypt the relay.

Next up: how all of this is actually structured in Terraform - the module layout, the dual Databricks providers (mws vs workspace), and the appendix that ties the whole series together.

Next: Part 6 - The Terraform structure behind it all (and a series appendix).
