Log Management at Scale: How We Cut Costs 70% Without Losing Signal

$12,000/Month for Logs Nobody Reads

Our logging bill was $12,000/month. We were ingesting 2 TB/day. When I asked the team what percentage of logs they actually looked at during incidents, the answer was embarrassing: about 5%. We were paying to store 95% noise.

The Log Audit

First, I categorized all log sources by value:

High value (always need during incidents):

  • Application errors (stack traces)
  • Authentication events
  • Business transactions
  • External API calls with responses
  • Health check failures

Medium value (sometimes useful):

  • Request/response logs (sampled)
  • Performance metrics in logs
  • Deployment events
  • Configuration changes

Low value (almost never needed):

  • Debug/trace level logs
  • Health check successes
  • Static asset requests
  • Heartbeat messages
  • Verbose framework logs
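
One way to ground a categorization like this is to measure how much volume each source and level actually contributes. The sketch below is illustrative only: it assumes the logs are available as JSON lines carrying hypothetical `source` and `level` fields, which may not match your pipeline.

```python
import json
from collections import Counter

def volume_by_source_and_level(lines):
    """Tally bytes per (source, level) pair from JSON log lines.

    Assumes each line is a JSON object with hypothetical 'source' and
    'level' fields; anything else is counted under ('unparsed', 'unknown').
    """
    totals = Counter()
    for line in lines:
        size = len(line.encode('utf-8'))
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            totals[('unparsed', 'unknown')] += size
            continue
        key = (entry.get('source', 'unknown'), entry.get('level', 'unknown'))
        totals[key] += size
    return totals
```

Calling `volume_by_source_and_level(lines).most_common(10)` lists the ten noisiest source/level pairs, which is a reasonable place to start deciding what is low value.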

Strategy 1: Log Levels as a Service

We made log levels dynamic. In production, the default is WARNING. During an incident, we flip the affected service to DEBUG:

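import logging
import os

# Assumes a FastAPI app; adapt the route decorator to your framework
from fastapi import FastAPI, HTTPException

app = FastAPI()

VALID_LEVELS = {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'}

# Initial log level comes from the environment; it is read once at startup
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'WARNING').upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.WARNING))

# Endpoint to change the log level without a restart.
# It only affects the process that receives the request, and it should sit behind auth.
@app.post('/admin/log-level')
async def set_log_level(level: str):
    level = level.upper()
    if level not in VALID_LEVELS:
        raise HTTPException(status_code=400, detail=f'unknown log level: {level}')
    logging.getLogger().setLevel(level)
    return {'status': 'ok', 'level': level}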

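In Kubernetes, the same switch can be made through the environment variable. Note that kubectl set env changes the pod template, so unlike the endpoint above it triggers a rolling restart of the deployment: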

# Normal operation
kubectl set env deployment/api LOG_LEVEL=WARNING

# During incident
kubectl set env deployment/api LOG_LEVEL=DEBUG

# After incident
kubectl set env deployment/api LOG_LEVEL=WARNING

Strategy 2: Tiered Retention

retention_policy:
  hot_storage:       # Fast search, expensive
    duration: 7 days
    filter: "level >= WARN OR tag:business_event"
  warm_storage:      # Slower search, cheaper
    duration: 30 days
    filter: "level >= INFO"
  cold_storage:      # Archive only, cheapest
    duration: 365 days
    filter: "tag:audit OR tag:compliance"
  drop:              # Don't store at all
    filter: "level = DEBUG OR source:health_check"
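
The policy above does not say how overlapping filters combine, so the following is only a sketch of one possible reading. It assumes that drop rules are checked first and that a record is written to every tier whose filter it matches. The record keys `level`, `tags`, and `source` are hypothetical names chosen for illustration.

```python
LEVELS = {'DEBUG': 10, 'INFO': 20, 'WARN': 30, 'ERROR': 40}

def retention_tiers(record):
    """Return the storage tiers a log record should be written to.

    Assumptions (not stated in the policy itself): drop rules win over
    everything else, and a record may land in more than one tier.
    `record` is a dict with hypothetical keys 'level', 'tags', 'source'.
    """
    level = LEVELS[record['level']]
    tags = set(record.get('tags', ()))
    source = record.get('source')

    # drop: "level = DEBUG OR source:health_check"
    if level == LEVELS['DEBUG'] or source == 'health_check':
        return []

    tiers = []
    # hot_storage: "level >= WARN OR tag:business_event"
    if level >= LEVELS['WARN'] or 'business_event' in tags:
        tiers.append('hot_storage')
    # warm_storage: "level >= INFO"
    if level >= LEVELS['INFO']:
        tiers.append('warm_storage')
    # cold_storage: "tag:audit OR tag:compliance"
    if tags & {'audit', 'compliance'}:
        tiers.append('cold_storage')
    return tiers
```

Under these assumptions an ERROR from the API goes to both hot and warm storage, while an INFO record tagged `audit` goes to warm and cold. If your platform evaluates rules differently (for example, first match wins), the order of the checks would need to change.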

Strategy 3: Structured Logging

Unstructured logs are expensive to parse. Structured logs are cheap to query:

# Bad: Unstructured
logger.info(f"User {user_id} purchased {product_id} for ${amount}")
# Parsing this requires regex, which costs compute

# Good: Structured
logger.info("purchase_completed", extra={
    'user_id': user_id,
    'product_id': product_id,
    'amount': amount,
    'currency': 'USD'
})
# With a JSON formatter configured on the handler, the output looks like:
# {"message": "purchase_completed", "user_id": "u123", ...}
# Queryable without parsing
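
Python's standard logging module does not emit JSON on its own: `extra` only attaches the fields to the log record, and a formatter has to serialize them. The sketch below shows one minimal way to do that with the standard library. The `JsonFormatter` class and the `shop` logger name are hypothetical, and a maintained library such as python-json-logger may be a better fit in production.

```python
import io
import json
import logging

# Attribute names present on every LogRecord; anything else came from `extra`
_STANDARD_ATTRS = set(vars(logging.LogRecord('', 0, '', 0, '', (), None))) | {'message', 'asctime'}

class JsonFormatter(logging.Formatter):
    """Hypothetical minimal formatter: one JSON object per log record."""

    def format(self, record):
        payload = {
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        }
        # Copy fields passed via `extra=` into the top-level object
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # Keep stack traces, since application errors are high-value logs
        if record.exc_info:
            payload['exc_info'] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger('shop')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info('purchase_completed', extra={
    'user_id': 'u123',
    'product_id': 'p9',
    'amount': 19.99,
    'currency': 'USD',
})
```

After the last call, `stream.getvalue()` holds a single JSON line whose keys include `message`, `user_id`, `product_id`, `amount`, and `currency`.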

Strategy 4: Sample Verbose Logs

import random

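def should_log_request(request):
    # `request` is a completed request record exposing status_code and duration_ms
    # Always log errors (4xx and 5xx responses)
    if request.status_code >= 400:
        return True
    # Always log slow requests (over 1 second)
    if request.duration_ms > 1000:
        return True
    # Sample 10% of the remaining fast, successful requests
    return random.random() < 0.10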
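
One way to apply a sampling rule like this without touching every call site is a `logging.Filter`. This is a sketch under assumptions, not the setup described in this post: the class name is hypothetical, and it expects request logs to carry `status_code` and `duration_ms` attributes (for example, passed via `extra`). The `rng` parameter is injectable so the sampling can be made deterministic in tests.

```python
import logging
import random

class RequestSamplingFilter(logging.Filter):
    """Hypothetical filter: keep errors and slow requests, sample the rest."""

    def __init__(self, sample_rate=0.10, slow_ms=1000, rng=random.random):
        super().__init__()
        self.sample_rate = sample_rate
        self.slow_ms = slow_ms
        self.rng = rng  # injectable so tests can make sampling deterministic

    def filter(self, record):
        status = getattr(record, 'status_code', None)
        duration = getattr(record, 'duration_ms', None)
        # Records without request fields are not request logs: pass them through
        if status is None or duration is None:
            return True
        # Always keep errors and slow requests
        if status >= 400 or duration > self.slow_ms:
            return True
        # Keep roughly `sample_rate` of everything else
        return self.rng() < self.sample_rate
```

Attaching it is one line, for example `logging.getLogger('access').addFilter(RequestSamplingFilter())`, where `access` is whatever logger your request logs go through.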

The Results

Before:

  • Daily ingestion: 2 TB
  • Monthly cost: $12,000
  • Useful data: ~5%

After:

  • Daily ingestion: 400 GB
  • Monthly cost: $3,600
  • Useful data: ~70%

We cut costs by 70% AND improved signal quality. Searches are faster because there's less noise. Incidents resolve quicker because relevant logs surface immediately.
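
The before and after figures can be checked with a few lines of arithmetic. The numbers come straight from the lists above; the only assumption is that 1 TB is counted as 1000 GB.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100

ingestion_cut = percent_reduction(2000, 400)   # GB/day, assuming 1 TB = 1000 GB
cost_cut = percent_reduction(12_000, 3_600)    # USD/month

print(f"ingestion down {ingestion_cut:.0f}%, cost down {cost_cut:.0f}%")
# prints: ingestion down 80%, cost down 70%
```

Ingestion dropped by 80% while the bill dropped by 70%.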

The Rule

Before adding a log statement, ask: "Will someone look at this during an incident?" If the answer is no, it's DEBUG level at most.

If you're spending too much on logs and want smarter log management, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops.
https://novaaiops.com
