DEV Community 1h ago

Log Management at Scale: How We Cut Costs 70% Without Losing Signal

$12,000/Month for Logs Nobody Reads

Our logging bill was $12,000/month, and we were ingesting 2 TB/day. When I asked the team what percentage of logs they actually looked at during incidents, the answer was embarrassing: about 5%. We were paying to ingest and store the other 95%, which was noise.

The Log Audit

First, I categorized all log sources by value:

High value (always need during incidents):

Application errors (stack traces)
Authentication events
Business transactions
External API calls with responses
Health check failures

Medium value (sometimes useful):

Request/response logs (sampled)
Performance metrics in logs
Deployment events
Configuration changes

Low value (almost never needed):

Debug/trace level logs
Health check successes
Static asset requests
Heartbeat messages
Verbose framework logs
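
The audit lists above can be captured as a small lookup so that pipelines and reviews share one definition. This is a minimal sketch: the category names and the classify_source helper are hypothetical labels for the items above, not names from any real pipeline.

```python
# Hypothetical category names mirroring the audit lists above.
VALUE_TIERS = {
    "high": {"app_error", "auth_event", "business_transaction",
             "external_api_call", "health_check_failure"},
    "medium": {"request_log", "performance_metric",
               "deployment_event", "config_change"},
    "low": {"debug", "health_check_success", "static_asset",
            "heartbeat", "framework_verbose"},
}


def classify_source(category: str) -> str:
    """Return the value tier for a log category, or 'unclassified'."""
    for tier, categories in VALUE_TIERS.items():
        if category in categories:
            return tier
    return "unclassified"


print(classify_source("auth_event"))
print(classify_source("heartbeat"))
```

Anything that comes back as unclassified is a source nobody has audited yet, which is a useful list to review on its own.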

Strategy 1: Log Levels as a Service

We made log levels dynamic. In production, the default is WARNING. During an incident, we flip the affected service to DEBUG:

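import logging
import os

# Assumes a FastAPI app; adapt the decorator and error handling to your framework
from fastapi import FastAPI, HTTPException

app = FastAPI()
VALID_LEVELS = {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'}

# Initial log level comes from the environment (read once at startup)
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'WARNING').upper()
if LOG_LEVEL not in VALID_LEVELS:
    LOG_LEVEL = 'WARNING'
logging.basicConfig(level=LOG_LEVEL)

# Endpoint to change the log level without a restart (put it behind auth)
@app.post('/admin/log-level')
async def set_log_level(level: str):
    level = level.upper()
    if level not in VALID_LEVELS:
        raise HTTPException(status_code=400, detail=f'invalid level: {level}')
    logging.getLogger().setLevel(level)
    return {'status': 'ok', 'level': level}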

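In Kubernetes, the environment variable sets the level that pods start with. Note that kubectl set env changes the pod template, so each command below triggers a rolling restart; use the endpoint above if the pods must keep running: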

# Normal operation
kubectl set env deployment/api LOG_LEVEL=WARNING

# During incident
kubectl set env deployment/api LOG_LEVEL=DEBUG

# After incident
kubectl set env deployment/api LOG_LEVEL=WARNING
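
To flip the level without touching the deployment, the /admin/log-level endpoint shown earlier can be called directly. The sketch below uses only the standard library and rests on a few assumptions: the service is reachable at a base URL you supply, the framework reads level from the query string (as FastAPI does for a bare str parameter), and build_log_level_request and set_remote_log_level are hypothetical helper names. Authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_log_level_request(base_url: str, level: str) -> Request:
    """Build a POST request for the /admin/log-level endpoint."""
    query = urlencode({"level": level.upper()})
    url = f"{base_url.rstrip('/')}/admin/log-level?{query}"
    # An empty body is enough; the level travels in the query string.
    return Request(url, data=b"", method="POST")


def set_remote_log_level(base_url: str, level: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_log_level_request(base_url, level)
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

The root logger is per process, so behind a load balancer one call changes one pod only; each replica needs its own call.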

Strategy 2: Tiered Retention

retention_policy:
  hot_storage:       # Fast search, expensive
    duration: 7 days
    filter: "level >= WARN OR tag:business_event"
  warm_storage:      # Slower search, cheaper
    duration: 30 days
    filter: "level >= INFO"
  cold_storage:      # Archive only, cheapest
    duration: 365 days
    filter: "tag:audit OR tag:compliance"
  drop:              # Don't store at all
    filter: "level = DEBUG OR (source:health_check AND level < WARN)"
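
The policy can also be read as a routing function. The sketch below is one possible interpretation in Python, not the filter syntax of any particular log platform: the record fields (level, tags, source) and the pick_tiers helper are assumptions. The drop rule is evaluated first, and health check failures are kept because the audit rates them as high value.

```python
import logging


def pick_tiers(level: int, tags: frozenset = frozenset(), source: str = "") -> list:
    """Return the storage tiers a record qualifies for; an empty list means drop."""
    healthy_check = source == "health_check" and level < logging.WARNING
    if level <= logging.DEBUG or healthy_check:
        return []
    tiers = []
    if level >= logging.WARNING or "business_event" in tags:
        tiers.append("hot")    # 7 days, fast search
    if level >= logging.INFO:
        tiers.append("warm")   # 30 days, slower search
    if tags & {"audit", "compliance"}:
        tiers.append("cold")   # 365 days, archive only
    return tiers


print(pick_tiers(logging.ERROR))
print(pick_tiers(logging.INFO, frozenset({"audit"})))
print(pick_tiers(logging.DEBUG))
```

A record can qualify for more than one tier; how that maps to actual storage (copying versus ageing out) depends on the platform.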

Strategy 3: Structured Logging

Unstructured logs are expensive to parse. Structured logs are cheap to query:

# Bad: Unstructured
logger.info(f"User {user_id} purchased {product_id} for ${amount}")
# Extracting fields from this requires regex at query time, which costs compute

# Good: Structured
logger.info("purchase_completed", extra={
    'user_id': user_id,
    'product_id': product_id,
    'amount': amount,
    'currency': 'USD'
})
# The standard library only attaches these fields to the record;
# with a JSON formatter on the handler, the output looks like:
# {"message": "purchase_completed", "user_id": "u123", ...}
# Fields are queryable without parsing
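
Python's standard logging module attaches extra fields to the record but does not emit JSON by default, so a formatter has to do that step. The following is a minimal sketch of one, assuming no third-party JSON logging library is in use; JsonFormatter and the "shop" logger name are hypothetical.

```python
import json
import logging

# Attribute names present on every LogRecord; anything else came from `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS |= {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line, including `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-JSON types (Decimal, datetime) from raising
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("shop")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("purchase_completed", extra={"user_id": "u123", "amount": 19.99})
```

Deriving the standard attribute set from a real LogRecord avoids hard-coding a list that differs between Python versions.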

Strategy 4: Sample Verbose Logs

import random

def should_log_request(request):
    # Always log errors
    if request.status_code >= 400:
        return True
    # Always log slow requests
    if request.duration_ms > 1000:
        return True
    # Sample 10% of successful requests
    return random.random() < 0.10
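
To check how the sampler behaves, here is a self-contained variant. RequestInfo is a hypothetical stand-in for whatever request/response object your framework provides, and the sample rate and random source are made parameters (an addition to the function above) so the result is reproducible.

```python
import random
from dataclasses import dataclass


@dataclass
class RequestInfo:
    """Hypothetical stand-in for a framework's request/response pair."""
    status_code: int
    duration_ms: float


def should_log_request(req, sample_rate=0.10, rng=random):
    # Errors and slow requests are always logged
    if req.status_code >= 400 or req.duration_ms > 1000:
        return True
    # Everything else is sampled at the given rate
    return rng.random() < sample_rate


rng = random.Random(42)  # seeded so repeated runs give the same count
requests = [RequestInfo(200, 35.0) for _ in range(10_000)]
kept = sum(should_log_request(r, rng=rng) for r in requests)
print(f"kept {kept} of {len(requests)} fast, successful requests")
```

Roughly one in ten fast, successful requests should survive, while every error and every slow request is kept.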

The Results

Before:

Daily ingestion: 2 TB
Monthly cost: $12,000
Useful data: ~5%

After:

Daily ingestion: 400 GB
Monthly cost: $3,600
Useful data: ~70%

We cut costs by 70% and ingestion by 80%, and signal quality improved: roughly 70% of what we store is now useful, up from about 5%. Searches are faster because there is less noise, and incidents resolve faster because the relevant logs surface immediately.
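
The percentages follow directly from the before and after figures. A quick check, assuming 1 TB = 1000 GB; percent_reduction is a hypothetical helper name.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


cost_cut = percent_reduction(12_000, 3_600)   # dollars per month
ingest_cut = percent_reduction(2_000, 400)    # GB per day

print(f"cost: -{cost_cut:.0f}%, ingestion: -{ingest_cut:.0f}%")
```

This works out to a 70% cost reduction on an 80% ingestion reduction.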

The Rule

Before adding a log statement, ask: "Will someone look at this during an incident?" If the answer is no, it's DEBUG level at most.

If you're spending too much on logs and want smarter log management, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops.
https://novaaiops.com
