DEV Community 3h ago

SQL Meets AI - Inside the Rule Engine of an AI-Powered SQL Validation Solution

Why "AI validation" instead of just running the query?

A database will always tell you when a query is broken - after you run it. That has three problems in real projects:

The error arrives too late. In a CI pipeline or a code review, you want feedback before merge, not in production logs.
The error is cryptic. syntax error at or near "FORM" is accurate but unhelpful; a junior developer still has to figure out what went wrong.
The database has no policy opinion. It will happily execute DROP TABLE users; if the credentials allow it.

An AI-style validation agent flips the model: SQL is treated as untrusted input, analyzed statically, and returned with a structured, human-readable review - the same way a senior developer would comment on a pull request.

Developer / CI / LLM
        |
        v
REST API (Express + Zod)
        |
        v
Rule Engine          ← this article
        |
        v
Explanation Agent
        |
        v
Structured JSON review

The agent never connects to a database. That is a feature, not a limitation: it can run anywhere (CI runners, editors, serverless functions) with zero credentials and zero risk.

Step zero: never trust raw SQL text

Here is the most underrated part of the whole engine. Before any rule runs, the SQL string is normalized and its string literals are stripped:

const stringLiteralPattern = /'(?:''|[^'])*'|"(?:\\"|[^"])*"/g;

export function normalizeSql(sql: string): string {
  return sql.replace(/\s+/g, " ").trim();
}

export function stripStringLiterals(sql: string): string {
  return sql.replace(stringLiteralPattern, "''");
}

Why does this matter? Consider this perfectly valid query:

SELECT 'FORM is mentioned inside a string' AS note;

A naive validator that scans for the typo FORM would flag it as an error - a false positive. By replacing every literal with an empty '' first, keyword checks only ever see structural SQL, never user data. The regex even handles escaped quotes ('' inside single-quoted strings), which is the classic edge case that breaks homemade SQL scanners.

This is the same principle behind SQL injection defense: data and structure must never be confused. Any AI database tool - including LLM-backed ones - needs this separation, otherwise a malicious string literal could manipulate the analysis (a form of prompt injection).

The rule engine: deterministic checks, structured output

Every rule in src/validation/rules.ts follows the same contract - it either stays silent or emits a structured error object:

if (/\bFORM\b/i.test(literalSafeSql)) {
  errors.push({
    code: "TYPO_FORM",
    message: "Possible typo detected.",
    suggestion: "FROM",
    severity: "error"
  });
}

The interesting rules are the ones that need actual parsing logic, not just a regex. Two examples:

Detecting unclosed quotes with a state machine

You cannot detect an unclosed quote with a single regex, because single quotes, double quotes, escaped quotes and doubled quotes ('') all interact. The engine walks the string character by character, tracking quote state:

export function hasClosedQuotes(sql: string): boolean {
  let singleQuoteOpen = false;
  let doubleQuoteOpen = false;

  for (let index = 0; index < sql.length; index += 1) {
    const current = sql[index];
    const next = sql[index + 1];

    if (current === "'" && !doubleQuoteOpen) {
      if (next === "'") {
        index += 1;
        continue;
      } // escaped '' inside a literal
      singleQuoteOpen = !singleQuoteOpen;
    }

    if (current === '"' && !singleQuoteOpen) {
      doubleQuoteOpen = !doubleQuoteOpen;
    }
  }

  return !singleQuoteOpen && !doubleQuoteOpen;
}

This is a tiny state machine - the same technique real SQL lexers use, just scoped down to one job.

Guessing a missing comma without a full parser

SELECT id name email FROM users; is one of the most common beginner mistakes. The engine extracts the SELECT list and applies a heuristic: multiple bare identifiers, no commas, and no alias keywords like AS or DISTINCT that would legitimize the pattern:

function looksLikeMissingComma(sql: string): boolean {
  const match = /^\s*SELECT\s+(.+?)\s+FROM\s+/i.exec(sql);
  if (!match) return false;

  const selectList = match[1].trim();
  if (!selectList || selectList === "*" || selectList.includes(",")) return false;

  const parts = selectList.split(/\s+/).filter(Boolean);
  if (parts.length < 2) return false;

  const aliasMarkers = new Set(["AS", "DISTINCT"]);
  return !parts.some((part) => aliasMarkers.has(part.toUpperCase()));
}

Notice the design philosophy: the rule prefers staying silent over being noisy. SELECT id AS user_id FROM users passes untouched. A validator that cries wolf gets disabled within a week - precision matters more than recall for developer tools.

Dialects: the same SQL is valid and invalid at the same time

One of the most valuable things an AI database solution can teach (or enforce) is that SQL is not one language. The same pagination intent is written four different ways:

Engine	Syntax
PostgreSQL / MySQL	`SELECT * FROM users LIMIT 5;`
SQL Server	`SELECT TOP 5 * FROM users;`
Oracle	`SELECT * FROM users FETCH FIRST 5 ROWS ONLY;`
ANSI standard	`FETCH FIRST 5 ROWS ONLY`

The engine encodes these differences as engine-specific rules. For example:

if (engine === "postgresql" && /\bTOP\s+\d+\b/i.test(sql)) {
  errors.push({
    code: "ENGINE_SPECIFIC",
    message: "TOP is not valid PostgreSQL SELECT syntax.",
    suggestion: "Use LIMIT in PostgreSQL.",
    severity: "error"
  });
}

if (engine === "oracle" && /\bLIMIT\s+\d+\b/i.test(sql)) {
  errors.push({
    code: "ENGINE_SPECIFIC",
    message: "LIMIT is not valid Oracle syntax.",
    suggestion: "Use FETCH FIRST n ROWS ONLY.",
    severity: "error"
  });
}

There is a nice detail in ANSI mode: dialect-specific syntax is reported as a warning, not an error. The query might run fine on your engine - the agent is just telling you it is not portable. Severity levels turn a binary validator into a nuanced reviewer.

Governance: the read-only firewall

The agent also enforces policy. A tokenizer splits the (literal-stripped) SQL and checks every token against a denylist:

const forbiddenStatements = [
  "ALTER", "CREATE", "DELETE", "DROP",
  "EXEC", "EXECUTE", "INSERT", "MERGE",
  "TRUNCATE", "UPDATE"
];

const forbidden = upperTokens.find((token) => forbiddenStatements.includes(token));

Combined with the multiple-statement check, this blocks the classic piggyback attack pattern before it exists:

SELECT * FROM users; DROP TABLE users;
-- rejected: MULTIPLE_STATEMENTS + FORBIDDEN_STATEMENT

This is exactly the kind of layer you want in front of any LLM that generates SQL: even if the model hallucinates a destructive statement, the deterministic validator kills it before execution.

Real-world integration examples

1. CI/CD quality gate (GitHub Actions)

Fail the build if any .sql file in the repo has blocking errors:

- name: Validate SQL files
  run: |
    for f in $(git ls-files '*.sql'); do
      body=$(jq -n --arg q "$(cat $f)" '{engine:"postgresql", query:$q}')
      valid=$(curl -s -X POST http://localhost:3000/api/validate \
        -H "Content-Type: application/json" -d "$body" | jq '.valid')
      if [ "$valid" != "true" ]; then
        echo "❌ SQL validation failed in $f"; exit 1
      fi
    done

2. Pull request review bot

Because the API returns structured JSON (code, message, suggestion, explanation), a bot can post readable review comments with zero extra parsing:

🤖 SQL Review - migration_2026_07.sql

Issue: TOP is not valid PostgreSQL SELECT syntax.
Suggestion: Use LIMIT in PostgreSQL.

3. Tool for an LLM agent

This is where "AI database solutions" gets truly interesting. A text-to-SQL model should never self-certify its own output. Instead, register the validator as a tool:

{
  "name": "validate_sql",
  "description": "Validate a SQL statement before execution. Returns structured errors and a corrected version when possible.",
  "input_schema": {
    "type": "object",
    "properties": {
      "engine": {
        "enum": ["ansi", "mysql", "postgresql", "oracle", "sqlserver"]
      },
      "query": {
        "type": "string"
      }
    },
    "required": ["engine", "query"]
  }
}

The workflow becomes: LLM generates SQL → deterministic validator reviews it → only validated SQL is ever considered for execution. The probabilistic and deterministic layers check each other - which is the safest architecture we know today for AI + databases.

What I would improve next

A real parser (AST-based). Regex heuristics are perfect for teaching and for a first line of defense, but nested subqueries, CTEs and window functions need a genuine grammar. Libraries like node-sql-parser would be the natural next step.
Line/column positions. IDE diagnostics and PR comments become 10× more useful when they point at the exact character.
Configurable policy. INSERT should be blocked in a dashboard context but allowed in a migrations/ folder.
An optional LLM explanation layer - but always after the deterministic rules, explaining structured findings rather than inventing them.

Conclusion

The AI database conversation is dominated by generation, but validation and governance are where AI-style tooling delivers immediate, low-risk value: no credentials, no execution, deterministic and testable behavior, and output structured enough for humans, CI systems and LLMs to consume alike. If text-to-SQL is the accelerator, a validation agent is the brake and the seatbelt. Any serious AI database solution needs both.

Repository: github.com/cs2026086510-a11y/sql-ai-validator-agent
Companion article (API + Codex Skill): Building an AI SQL Validation Agent with Codex and a REST API by Cristian Soto

Read on DEV Community ↗ ← Back to News