PII-Shield: Cleaning PII From Logs Before It Reaches ELK
DEV Community Grade 8 3d ago



The first idea was simple. Take a log line. Look at suspicious parts. Count entropy. Hide anything that looks like a random secret.

PII means personally identifiable information. It includes emails, phone numbers, addresses, passport numbers, card numbers, access tokens, and other values that should not move freely through logs.

At first, entropy looked like a good signal. Many tokens, keys, and session values really look like noise:

    x9VdQp2Mz_La77kPq0
    sk_live_51Nx...
    eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

But entropy alone was not enough. Some values have low entropy and still must be hidden. For example: password=123, token=dev, cvv=000. Other values look random, but they are not secrets. Trace IDs, UUIDs, short commit hashes, request IDs, and path fragments can all look suspicious.

If the entropy threshold is too low, the filter breaks useful logs. If the threshold is too high, it misses weak secrets.

That is why PII-Shield grew beyond entropy. It added regex rules, sensitive keys, allow lists, and separate validators like the Luhn algorithm for payment card numbers.

I also did not like where PII is often cleaned today. Many teams clean logs at the Fluentd, Logstash, SIEM, or log pipeline level. This helps. But it is late. The raw data has already left the application. It may have passed through buffers, retries, temporary files, alerts, and dashboards.

PII-Shield tries to clean the data earlier. It is an open-source tool that removes PII and secrets from logs before they leave the pod.

Repository: https://github.com/pii-shield/pii-shield

The Basic Idea

The short version is:

    application writes a log
            |
            v
    PII-Shield reads the raw log near the app
            |
            v
    only the cleaned line goes out

The goal is not "clean it later". The goal is "do not let the raw value leave".
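
To make the entropy signal from the opening concrete, here is a minimal Go sketch of a Shannon entropy check. It is an illustration, not PII-Shield's own code: the function names shannonEntropy and looksRandom, the minimum length of 12, and the threshold of 3.5 bits per character are all assumptions chosen for this example.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	if total == 0 {
		return 0
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// looksRandom flags a token that is both long enough and random enough.
// minLen and minBits are hypothetical tuning knobs, not PII-Shield settings.
func looksRandom(token string, minLen int, minBits float64) bool {
	return len([]rune(token)) >= minLen && shannonEntropy(token) >= minBits
}

func main() {
	tokens := []string{
		"x9VdQp2Mz_La77kPq0",               // random-looking secret
		"123",                              // weak password value
		"dev",                              // weak token value
		"4bf92f3577b34da6a3ce929d0e0e4736", // hex trace ID, not a secret
	}
	for _, tok := range tokens {
		fmt.Printf("%-34s entropy=%.2f suspicious=%v\n",
			tok, shannonEntropy(tok), looksRandom(tok, 12, 3.5))
	}
}
```

With these assumed thresholds, the random-looking token is flagged, 123 and dev are not, and the hex trace ID is flagged too. That is the gap described in the opening: entropy misses weak secrets and catches harmless identifiers, so it cannot be the only layer.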
PII-Shield can be used in several ways:

- a CLI tool or container that filters standard input and output;
- a sidecar container in Kubernetes;
- a Kubernetes operator that injects the sidecar when a pod is created;
- Helm charts for installation;
- WASM SDKs for Node.js and Python, if you want to run the scanner inside the process.

The main Kubernetes path works like this. The application writes logs to a file on a shared volume. The sidecar reads that file. It scans each line. Then it writes the cleaned stream to standard output. A normal log collector can read it from there.

    ┌──────────────────── pod ────────────────────┐
    │                                             │
    │  app container                              │
    │       │                                     │
    │       │ /var/log/app/output.log             │
    │       v                                     │
    │  shared emptyDir volume                     │
    │       │                                     │
    │       v                                     │
    │  pii-shield sidecar -> sanitized stdout     │
    │                                             │
    └─────────────────────────────────────────────┘
                           │
                           v
                Loki / ELK / S3 / SIEM

This is not invisible interception of everything. The application must write to a known file. But this path is easy to test. It does not need a shell inside the sidecar image. It also does not change the application runtime.

What Counts As Sensitive

The scanner does not have one magic button for "find all private data". It uses several layers.

The first layer is sensitive keys. If a line has password=..., token=..., secret=..., or api_key=..., the value near that key should be hidden.

    input:  payment failed token=sk_live_51Nx...
    output: payment failed token=[HIDDEN:9b22c1]

The second layer is custom regex rules. Many companies have their own internal IDs. These can be ticket numbers, policy IDs, customer IDs, medical record numbers, legal case numbers, or contract numbers. These values are hard to guess from general signals.
It is better to say it directly:

    export PII_CUSTOM_REGEX_LIST='[
      {"pattern": "^MRN-[0-9]{8}$", "name": "MedicalRecord"},
      {"pattern": "^CASE-[0-9]{4}-[0-9]{6}$", "name": "CaseNumber"}
    ]'

There is a performance detail here. If a user adds ten rules, and the scanner runs ten separate regex checks on each token, line processing can become heavier. So the rules are first checked one by one when the config is loaded. Then they are joined into one larger regexp with |. Each rule is wrapped in a group. This lets the scanner know which name to put into the redaction marker. For example:

    (^MRN-[0-9]{8}$)|(^CASE-[0-9]{4}-[0-9]{6}$)

In the code this is stored as CombinedCustomRegex. The separate compiled rules are still kept in the config. But the main path uses the combined regexp.

This does not promise a speed win for every possible rule set. Regex performance depends on the rules. But it removes the need to try each custom regex one after another on every token.

The third layer is entropy. Many secrets look like random text. API keys, session tokens, and random passwords are common examples. PII-Shield uses Shannon entropy for this. If a token is long enough and random enough, the scanner treats it as suspicious.

    input:  Authorization failed: x9VdQp2Mz_La77kPq0
    output: Authorization failed: [HIDDEN:3e12aa]

But entropy can fight with real logs. Commit hashes, UUIDs, trace IDs, request IDs, and file paths can also look suspicious. So there is an allow list:

    export PII_SAFE_REGEX_LIST='[
      {"pattern": "^[a-f0-9]{7}$", "name": "GitShortSHA"}
    ]'

The allow list runs before other checks. If the rule is too broad, it can allow a value that should be hidden. This is not a bug in the idea. It is the cost of manual tuning.

The fourth layer is validation for specific data types. Payment card numbers are checked with the Luhn algorithm. This is not just a regex for 16 digits. The scanner looks for digit sequences from 13 to 19 digits. It checks boundaries. It drops weak digit sets.
Then it checks the Luhn checksum.

There is also a context check to reduce false positives. A number can pass Luhn and still be just a trace value:

    TraceId=4556737586899855

That should not always be treated as a card. But this line has card context:

    visa card 4556737586899855 provided

This one should be hidden.

Why Use A Hash Instead Of [REDACTED]

If every secret becomes [REDACTED], debugging gets harder. Sometimes you need to know that ten errors all had the same token or the same user ID. You still should not see the raw value.

So PII-Shield replaces a sensitive value with a short marker:

    [HIDDEN:a1b2c3]

The marker is based on a salt. If PII_SALT is stable, the same raw value gets the same marker. QA and SRE teams can compare events across logs. If the salt is random on every start, this link is lost after a restart.

For production, the salt should come from a secret:

    env:
      - name: PII_SALT
        valueFrom:
          secretKeyRef:
            name: pii-shield-secrets
            key: salt

It is also better to require a strong salt:

    PII_REQUIRE_STRONG_SALT=true

Quick Test Without Kubernetes

The fastest way to try the scanner is through the container:

    echo 'login failed email=ivan@example.com password=MySecretPass123!' \
      | docker run -i --rm ghcr.io/pii-shield/pii-shield:2.1.0

The output should look like this:

    login failed email=[HIDDEN:...] password=[HIDDEN:...]

The exact hash suffix depends on the salt.

Kubernetes: Operator And Policy

For Kubernetes, there is an operator.
It can be installed with Helm:

    helm repo add pii-shield https://pii-shield.github.io/pii-shield/
    helm repo update
    helm install pii-shield-operator pii-shield/pii-shield-operator \
      -n operator-system \
      --create-namespace

Then you create a PiiPolicy:

    apiVersion: core.pii-shield.io/v1alpha1
    kind: PiiPolicy
    metadata:
      name: strict-policy
      namespace: default
    spec:
      injectionMode: file
      logPath: /var/log/app/output.log
      failPolicy: open

Then you mark the deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: billing-api
    spec:
      template:
        metadata:
          labels:
            pii-shield.io/inject: "true"
          annotations:
            pii-shield.io/policy: "strict-policy"

The webhook adds the sidecar, the volume, and the settings. The application writes to /var/log/app/output.log. The sidecar reads this file and prints the cleaned stream.

The Helm chart has small resource limits: 30Mi of memory and 50m CPU for the sidecar. You still need to measure this on your own logs. This matters when logs contain a lot of JSON or very long lines.

Why The Scanner Does Not Use A Heavy JSON Parser

Logs often come as JSON. But using encoding/json for every line would add overhead during constant stream processing. The scanner only needs to find values and keep the structure valid enough for redaction.

So it uses a narrower parser. It can handle JSON-like text for this task, but it does not turn every line into a full object tree. This does not make the code prettier. It does reduce allocations and surprises under load.

There are tests for:

- nested JSON;
- broken lines;
- binary garbage;
- multilingual logs;
- false positives;
- custom regex rules;
- fuzz regressions.

For scanner-only benchmarks:

    go test -bench=. -benchmem ./pkg/scanner

For end-to-end CLI throughput:

    ./benchmark/run_benchmarks.sh

In the current project notes, the normal Go scanner was in the microsecond range per line on a synthetic corpus. A neural PII detector on CPU was about three orders of magnitude slower in my test.
So a model layer should not run on every line by default. If it is added, it should be an explicit mode. It may make sense for medical logs, legal logs, support chats, and similar domains.

Fail Open Or Fail Closed

A log filter has an unpleasant choice. What should happen if the scanner fails?

fail open means the line is allowed to pass. Logs keep flowing. The application and monitoring do not go blind. But there is a risk that a raw value escapes.

fail closed means the raw line is not released. The filter returns a drop marker instead. This is safer for privacy. It is worse for debugging.

PII-Shield makes this configurable:

    PII_FAIL_POLICY=open

or:

    PII_FAIL_POLICY=closed

The default is open. Losing logs in production can become a separate incident. For strict compliance workloads, the right choice may be different.

Current Limits

I do not want to hide this part at the end of the README. PII-Shield can be run and tested now. But not all modes have the same maturity:

- file-based sidecar mode is the main practical path;
- fully transparent protection for normal Kubernetes
