DEV Community

How to Use AI Coding Tools Without Leaking Source Code

What Your Coding Tools Actually Send

GitHub Copilot

When Copilot generates a suggestion (before you ever press Tab to accept it), the extension sends:

  • The file you're editing (or relevant context window)
  • The language detected
  • The cursor position
  • Recently opened files in your project

Microsoft's own documentation confirms: "Copilot may collect code snippets and context from your editor to generate suggestions." The data is transmitted over HTTPS and stored for telemetry and model improvement unless you explicitly opt out in your organization's settings.

Cursor

Cursor goes further. As an AI-first IDE, it sends:

  • Full file contents from your active context
  • Project structure
  • Terminal output (when using AI terminal features)
  • Embedded documentation and comments
  • Your custom instructions and rules files

Cursor's privacy policy notes that code is retained for up to 30 days. The team offers a "Privacy Mode" option - when enabled, code is not used for training. But it still traverses their servers.

Claude Code

Claude Code (the CLI agent) sends whatever it reads:

  • Files you explicitly ask it to read or edit
  • Git history and diffs
  • Directory listings and file structures
  • Environment variables (if you share them via commands)
  • Terminal output

Since Claude Code runs as a CLI tool, you control what you feed it - but the convenience of "fix this bug in my codebase" means entire files end up in the API request.

The Real Exposure Risks

Let's move past theory. Here's what actually leaks in practice:

1. API Keys in Test Fixtures

# test_fixtures.py - you ask Cursor to "refactor these tests"
def test_payment_api():
    client = PaymentClient(api_key="sk_test_4eC39HqLyjWDarjtT1zdp7dc")
    response = client.charge(amount=1000)
    assert response.status_code == 200

That key is low-risk because it only works in test mode. But the same file might import a production key:

from config import PROD_API_KEY  # This is in your env, not the file

The file itself is safe - but if you've ever accidentally included a .env file in a prompt, you've sent production credentials to the AI.
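One way to make that accident less likely is to filter file paths before they go anywhere near a prompt. The sketch below is a minimal illustration, not a feature of any of the tools above: the deny-list patterns and the function names `is_sensitive` and `safe_to_share` are assumptions you would adapt to your own project.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical deny-list; extend it to match your own project layout.
SENSITIVE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "id_rsa", "credentials*"]


def is_sensitive(path: str) -> bool:
    """Return True if the file name matches any deny-list pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in SENSITIVE_PATTERNS)


def safe_to_share(paths: list[str]) -> list[str]:
    """Keep only the paths that do not look like secret-bearing files."""
    return [p for p in paths if not is_sensitive(p)]
```

A name-based filter only catches files that are named like secrets. A key hard-coded in an ordinary source file still needs content scanning.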

2. Database Connection Strings in Config Files

# config/database.yml - sent to Copilot context
production:
  adapter: postgresql
  host: <%= ENV['DB_HOST'] %>
  username: <%= ENV['DB_USER'] %>
  password: <%= ENV['DB_PASSWORD'] %>

The ERB template is safe. But the resolved connection string? If you paste output from a Rails console session into Claude Code, the full resolved URL might end up in the conversation.
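If you do need to share a resolved URL, strip the password first. This is a minimal sketch using only the standard library; the function name `redact_db_url` and the `[PASSWORD]` placeholder are illustrative choices, and the example URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_db_url(url: str) -> str:
    """Replace the password in a connection URL with a placeholder."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # no password component, nothing to redact
    # Split "user:password@host:port" at the last "@" so the host part is kept as-is.
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user = userinfo.split(":", 1)[0]
    netloc = f"{user}:[PASSWORD]@{hostport}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(redact_db_url("postgresql://app_user:Sup3rS3cret@db.example.com:5432/app_production"))
```

The username and host are left intact here. If those are sensitive too, redact them the same way.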

3. Customer Data in Fixtures and Seeds

// seed.js - you ask the AI to "add validation to this user seeding script"
const users = [
  { name: "John Smith", email: "john.smith@gmail.com", ssn: "123-45-6789" },
  { name: "Jane Doe", email: "jane.doe@company.com", ssn: "987-65-4321" },
];

This is the most common leak pattern. Developers paste fixture files whose data looks fake but is partly real. The SSNs might be invented, but the email addresses might belong to real employees, and the data structure reveals your customer schema. And now all of it lives on an external server.
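A safer habit is to swap the personal fields for obviously synthetic values before sharing the fixture. The sketch below assumes the three field names from the seed script; `PII_FIELDS`, `synthesize`, and `sanitize_fixtures` are hypothetical names, not part of any library.

```python
import copy

# Hypothetical field names taken from the seed script; adjust to your own schema.
PII_FIELDS = {"name", "email", "ssn"}


def synthesize(field: str, index: int) -> str:
    """Build an obviously fake value for a PII field."""
    if field == "email":
        return f"user{index}@example.com"  # example.com is reserved for documentation
    if field == "ssn":
        return f"000-00-{index:04d}"  # area number 000 is never issued
    return f"Test User {index}"


def sanitize_fixtures(records: list[dict]) -> list[dict]:
    """Return a copy of the records with every PII field replaced."""
    cleaned = copy.deepcopy(records)  # leave the original fixtures untouched
    for index, record in enumerate(cleaned, start=1):
        for field in record:
            if field in PII_FIELDS:
                record[field] = synthesize(field, index)
    return cleaned
```

The field names still reveal the schema, so this reduces the exposure rather than removing it.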

4. Internal Hostnames and Architecture

# deployment script - sent to the AI for "review this deploy script"
def deploy():
    hosts = ["app-01.internal.prod", "app-02.internal.prod", "db-master.internal.prod"]
    run_ansible(hosts)

Your internal network topology, hostnames, and deployment patterns become part of the AI's context. These are gold for an attacker performing reconnaissance.
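Hostnames can be replaced with stable placeholders, so the AI can still tell the machines apart without learning their real names. This sketch assumes your internal hosts end in `.internal.prod`, as in the script above; the regex and the `host-N.internal.example.com` naming are illustrative assumptions.

```python
import re

# Assumption: internal hosts end in ".internal.prod", as in the deploy script.
INTERNAL_HOST = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.internal\.prod\b")


def pseudonymize_hosts(text: str) -> tuple[str, dict[str, str]]:
    """Replace each internal hostname with a stable placeholder.

    Returns the rewritten text and the real-to-placeholder mapping, which
    stays on your machine so you can translate the AI's answer back.
    """
    mapping: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        host = match.group(0)
        if host not in mapping:
            mapping[host] = f"host-{len(mapping) + 1}.internal.example.com"
        return mapping[host]

    return INTERNAL_HOST.sub(replace, text), mapping
```

Because the same real host always maps to the same placeholder, a review comment about `host-2` can be traced back to the right machine locally.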

The Practical 30-Second Fix

Here's what you can implement right now, without changing your workflow:

Option A: Use a Local Proxy (Recommended)

Run a lightweight proxy on localhost that intercepts API calls from your AI tools and automatically masks sensitive patterns:

# One-time setup
git clone https://github.com/gunxueqiu6/ai-privacy-gateway.git
cd ai-privacy-gateway
docker-compose up -d

# Point your AI tools to:
# OpenAI API → http://localhost:8080/v1
# Anthropic API → http://localhost:8081/v1

The proxy detects and masks these automatically:

| Before | After |
| --- | --- |
| "My database password is Sup3rS3cret!" | "My database password is [PASSWORD]" |
| "The server is at staging-3.internal.example.com" | "The server is at [HOSTNAME]" |
| "sk-proj-abc123def456..." | "[API_KEY]" |

The AI tool receives the question with the sensitive parts redacted. It can still help you - it just can't learn your secrets.
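To make the idea concrete, here is a rough sketch of pattern-based masking. It is not the gateway's actual implementation. The three regexes are assumptions written to match the examples above, and a real detector set would need to be much broader.

```python
import re

# Hypothetical rules, applied in order. Each pair is (pattern, replacement).
MASKING_RULES = [
    # Keys that start with "sk-" or "sk_" followed by a long token.
    (re.compile(r"\bsk[-_][A-Za-z0-9_-]{8,}"), "[API_KEY]"),
    # Hostnames that contain an ".internal." label.
    (re.compile(r"\b(?:[a-z0-9-]+\.)+internal\.[a-z0-9.-]+\b"), "[HOSTNAME]"),
    # The word right after "password is", keeping the phrase itself.
    (re.compile(r"(?i)(password\s+is\s+)\S+"), r"\1[PASSWORD]"),
]


def mask(prompt: str) -> str:
    """Apply every masking rule to the prompt and return the redacted text."""
    for pattern, replacement in MASKING_RULES:
        prompt = pattern.sub(replacement, prompt)
    return prompt


print(mask("My database password is Sup3rS3cret!"))
print(mask("The server is at staging-3.internal.example.com"))
```

Regex rules like these miss anything they were not written for, which is why the manual checklist below is still worth knowing.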

Option B: Manual Pre-Screening

If you can't use a proxy, build this mental checklist before every prompt:

  • Does this contain credentials? → Redact to [USERNAME] / [PASSWORD]
  • Does this contain internal hostnames? → Replace with internal.example.com
  • Does this contain customer data? → Replace with [CUSTOMER_REDACTED]
  • Does this contain business logic you'd rather keep secret? → Abstract it to pseudocode

Option C: Use API Keys with Zero-Data Retention

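For tools that support it, call the API directly from an account that has zero data retention or a training opt-out in place. These settings are agreed at the account or organization level rather than sent as a request header, so confirm them with each provider before relying on them:

import os
from openai import OpenAI
from anthropic import Anthropic

# OpenAI - send requests under the organization whose data controls you have verified
openai_client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    organization="your-org-id",  # placeholder; sent as the OpenAI-Organization header
)

# Anthropic - no training on API data by default
anthropic_client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"]
)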

If you're using Copilot, Cursor, or Claude Code through the CLI, check whether your organization allows configuring a custom API endpoint. If it does, route through a local proxy.

Which Protection Layer Is Right for You?

| Situation | Recommended Approach |
| --- | --- |
| Solo developer, personal projects | Manual redaction + basic caution |
| Small team, open-source code | Local proxy, Docker setup |
| Medium team, proprietary code | Proxy + org-wide policy + training |
| Enterprise, regulated industry | Proxy + DLP integration + audit logging |
| Working with PHI/PII data | Proxy + all traffic logged + quarterly review |

The Architecture in Practice

Here's a production setup I've seen work well for a 20-person engineering team:

Developer laptop → AI Privacy Gateway (localhost:8080) → Anthropic/OpenAI API
                        ↓↑
                  Masked logs ← Elasticsearch ←┘
                        ↓
                  Slack alert (if raw PII detected)

Every prompt is masked before leaving the developer's machine. Masked logs are stored for 30 days for audit. If raw PII somehow gets through (a new detector is needed), the team gets a Slack alert within seconds.
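The "alert if raw PII gets through" step can be sketched as a second pass over the already-masked prompt. This is an illustration under stated assumptions, not that team's code: the two detectors are hypothetical, and `alert` stands in for whatever posts to your Slack channel.

```python
import re
from typing import Callable

# Hypothetical detectors for data that should never survive masking.
RESIDUAL_DETECTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def audit_masked_prompt(masked: str, alert: Callable[[str], None]) -> list[str]:
    """Return the names of detectors that still fire on a masked prompt."""
    findings = [
        name for name, pattern in RESIDUAL_DETECTORS.items() if pattern.search(masked)
    ]
    if findings:
        # Report only the detector names, never the matched text itself.
        alert(f"Raw PII passed the masking layer: {', '.join(findings)}")
    return findings
```

Passing `alert` in as a callable keeps the check testable: in production it would wrap a webhook call, and in tests it can be `list.append`.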

The team's AI usage went up 3x after deploying this - because security concerns stopped being a reason to avoid AI tools.

What NOT to Do

A few approaches sound good but fall short in practice:

  • "I'll just use a local model" - Local models avoid the network issue, but running a capable model locally requires significant hardware (48GB+ VRAM for coding-grade models), and they're generally less capable than cloud models.

  • "I'll encrypt my prompts" - Encryption protects data in transit and at rest, but the AI needs to read the plaintext to process it. Encryption doesn't help at the inference endpoint.

  • "I'll just be careful" - Human vigilance fails. It fails in week 2 of a sprint, it fails at 2 AM during an incident, it fails when you're showing a coworker something and copy-paste without thinking.

The Bottom Line

AI coding tools are too useful to abandon over privacy concerns, and the data risks are too real to ignore. The solution is a middle path: use the tools, but route their traffic through a local privacy proxy that strips sensitive data before it leaves your network.

The AI Privacy Gateway on GitHub does exactly this in under 60 seconds of setup time. But even if you use a different proxy or just commit to better manual hygiene - start now, not after your first incident. Every paste is a risk. Every masked paste is a risk eliminated.
