LLM APIs as Infrastructure: Building Deterministic Systems Around Probabilistic AI
DEV Community

The nature of an AI API

Somebody built this thing before you ever touched it. A lab trained it on an enormous amount of text, aligned it, wrapped it behind an endpoint, and rented it to you. You don't own the model. You inherit the interface: its capabilities, its limits, its context window, and its cost.

When you send a prompt, the model has two sources of information: what it learned during training and the context you provide. Training is fixed. Context is yours. From those two inputs, it generates a continuation one token at a time, each token drawn from a probability distribution over what could come next. People hear "probable" and think "guessing." But it isn't guessing like a coin flip. Every token is weighted by everything the model has learned and everything you just told it - a structured output shaped by patterns and distributions, not chance alone.

Most of the time, that's useful. Sometimes, it's confidently wrong. A wrong answer is not a bug waiting to be patched. It's behavior you have to design around.
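
To make "probable, not a coin flip" concrete, here is a toy sketch of how a sampler might turn scores into a choice. The token names and scores are invented for illustration, and a real model works over a vocabulary of many thousands of tokens; only the shape of the computation is the point.

```typescript
// Toy illustration: turn raw scores (logits) into probabilities with a
// temperature-scaled softmax. All tokens and scores here are made up.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick an index according to the probabilities. `rand` is injectable so a
// draw can be made repeatable in tests.
function sample(probs: number[], rand: () => number = Math.random): number {
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

const tokens = ["collision", "accident", "banana"];
const logits = [2.0, 1.5, -3.0];

const sharp = softmax(logits, 0.1); // low temperature: the top token dominates
const soft = softmax(logits, 1.0); // higher temperature: the runner-up gets real weight

console.log(tokens[sample(sharp)], tokens[sample(soft)]);
```

Lowering the temperature concentrates probability on the top-scoring token, which is why low values tend to give more consistent output. It narrows the distribution; it does not make a wrong top choice right.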

Beyond the API call

Traditional APIs train us to think in predictable systems: send a request, expect a response. Even complex systems have repeatable behavior. There may be state management, caching, retries, and rate limits, but your code usually knows what success and failure look like.

AI systems break that expectation at the model layer. Same prompt, same context, but different output. Not because something failed. Because that's how the component works. The request can succeed, the response can be 200 OK, the JSON can validate, and your logs can look clean, yet the answer can still be wrong. Hallucinations, dropped instructions, valid JSON with incorrect data. Nothing technically failed. The model simply produced the wrong output. That shifts the focus from whether the system responded to whether the response is right.

Traditional APIs usually fail in ways your code already knows how to handle. The request either resolves or rejects, and your code has a defined path for each.

try {
  const response = await fetchClaim(id);
  renderClaim(response);
} catch (error) {
  showError(error);
}

Traditional software encourages a simple mental model: Request → Response → Done.

AI systems need more checkpoints between the user's input and the final output - and a separate quality gate before anything ships at all.

Runtime (every request):
User Input → Prompt/Context → LLM → Structured Output → Schema Validation → Business Rules → UI

Pre-deployment (CI/CD):
Test Dataset → Full Pipeline → Eval → Pass/Fail Gate → Deploy

The API call is only one step. The model is only one part. It is not the whole app.
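
One way to picture the runtime path is as a chain of small stages, each of which can stop the request. The sketch below is hypothetical: the `Result` type, the stage names, and the stubbed model call are assumptions made for illustration, not part of any SDK.

```typescript
// A stage either passes a value forward or stops the request with a reason.
type Result<T> = { ok: true; value: T } | { ok: false; reason: string };

interface Draft {
  incidentDate: string;
  injuries: string;
}

// Stand-in for the model call (hypothetical). A real one would be async.
function fakeModel(_userInput: string): string {
  return '{"incidentDate":"2026-07-01","injuries":"No"}';
}

// Checkpoint 1: the output must parse and have the expected fields.
function parseStage(raw: string): Result<Draft> {
  try {
    const data = JSON.parse(raw);
    if (typeof data?.incidentDate !== "string" || typeof data?.injuries !== "string") {
      return { ok: false, reason: "schema: missing or mistyped field" };
    }
    return { ok: true, value: { incidentDate: data.incidentDate, injuries: data.injuries } };
  } catch {
    return { ok: false, reason: "parse: output was not JSON" };
  }
}

// Checkpoint 2: a rule the application owns. YYYY-MM-DD strings sort
// chronologically, so a plain string comparison is enough here.
function businessRuleStage(draft: Draft, today: string): Result<Draft> {
  if (draft.incidentDate > today) {
    return { ok: false, reason: "rule: incident date is in the future" };
  }
  return { ok: true, value: draft };
}

// Run the stages in order and stop at the first failure.
function runPipeline(userInput: string, today: string): Result<Draft> {
  const parsed = parseStage(fakeModel(userInput));
  if (parsed.ok === true) return businessRuleStage(parsed.value, today);
  return parsed;
}

console.log(runPipeline("I got rear-ended yesterday.", "2026-07-02"));
```

Each failure carries a reason, so the UI can decide what to show and the logs can record which checkpoint stopped the request.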

Where determinism matters

In a probabilistic system, determinism matters when the output becomes data, triggers an action, or changes the state of the application. If a user says, "I got rear-ended yesterday," the assistant can explain the interpretation in different ways. But the submitted incidentDate cannot be different every time. The system needs one resolved value, one validation path, and one record of what was submitted. The wording around the field can be flexible - the value that enters the system cannot.

Lock these down:

  • Final form fields: incidentDate, injuries, accidentType
  • Required fields: whether something is complete or missing
  • Business rules: whether the form can be submitted
  • Actions: sending the form, saving a record, approving a claim
  • Audit/history: what value was used and why

Let these breathe:

  • The assistant's wording
  • The explanation shown to the user
  • Suggestions for what to clarify
  • Summaries or labels that do not trigger action
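
A minimal sketch of that split, assuming a hypothetical `ResolvedField` record: the value and its audit trail are frozen once resolved, while the wording shown to the user is free to vary between runs. The names and fields here are illustrative, not taken from any library.

```typescript
// Hypothetical shape: the locked part of a field plus the reason it was chosen.
interface ResolvedField<T> {
  readonly name: string;
  readonly value: T;
  readonly source: "model" | "user";
  readonly reason: string;
  readonly resolvedAt: string; // ISO timestamp for the audit trail
}

function resolveField<T>(
  name: string,
  value: T,
  source: "model" | "user",
  reason: string,
  now: Date = new Date()
): ResolvedField<T> {
  // Freeze so later code cannot quietly change what was submitted.
  return Object.freeze({ name, value, source, reason, resolvedAt: now.toISOString() });
}

// Flexible part: the wording may differ on every run, the value may not.
function explain(field: ResolvedField<string>, wording: string): string {
  return `${wording} (${field.name}: ${field.value})`;
}

const incidentDate = resolveField(
  "incidentDate",
  "2026-07-01",
  "model",
  'User said "yesterday"; today was 2026-07-02.'
);

console.log(explain(incidentDate, "I read that as July 1."));
console.log(explain(incidentDate, "Looks like this happened on July 1."));
```

Both explanations point at the same frozen value, which is the property the lists above describe.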

The boundary in practice

Every step between the model and your user is a layer you own and control. Here's what that looks like in a real integration.

In a form application, the model can help turn a user's plain language into structured data. A user might write: "I got rear-ended yesterday. My side mirror broke, but nobody was hurt."

The application expects this shape:

interface IncidentExtraction {
  formData: {
    incidentDate: string;
    accidentType: string;
    damageDescription: string;
    injuries: "Yes" | "No" | "Unclear";
    notes: string;
  };
  feedback: string[];
  confirmation: string;
}

But the model does not receive a TypeScript interface. It receives instructions and a strict schema.

import { GoogleGenAI, Type } from "@google/genai";

// With no apiKey passed, the SDK reads it from the GEMINI_API_KEY environment variable.
const ai = new GoogleGenAI({});

const incidentSchema = {
  type: Type.OBJECT,
  properties: {
    formData: {
      type: Type.OBJECT,
      properties: {
        incidentDate: {
          type: Type.STRING,
          description: "Resolved date in YYYY-MM-DD format."
        },
        accidentType: {
          type: Type.STRING,
          description: "Short classification of the incident."
        },
        damageDescription: {
          type: Type.STRING,
          description: "Brief description of the damage."
        },
        injuries: {
          type: Type.STRING,
          enum: ["Yes", "No", "Unclear"],
          description: "Whether injuries were mentioned."
        },
        notes: {
          type: Type.STRING,
          description: "Any additional relevant details."
        }
      },
      required: ["incidentDate", "accidentType", "damageDescription", "injuries", "notes"]
    },
    feedback: {
      type: Type.ARRAY,
      items: { type: Type.STRING },
      description: "User-facing notes about assumptions or missing details."
    },
    confirmation: {
      type: Type.STRING,
      description: "Short message asking the user to review the filled form."
    }
  },
  required: ["formData", "feedback", "confirmation"]
};

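// Hard-coded for the example. In production, derive this from a trusted clock in the user's time zone.
const today = "2026-07-02";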

const prompt = `Today is ${today}. Extract incident details from the user's description.
User description: "I got rear-ended yesterday. My side mirror broke, but nobody was hurt."
Rules:
- Resolve relative dates using today's date.
- If a field is unclear, use "Unclear" or an empty string.
- Do not submit the form.`;

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: prompt,
  config: {
    responseMimeType: "application/json",
    responseSchema: incidentSchema,
    temperature: 0.1 // Lower temperature for more consistent extractions
  }
});

The schema shapes the output. It does not guarantee it. Everything after this line is your code enforcing your rules:

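let result: IncidentExtraction;
try {
  // response.text can be undefined; the empty-string fallback makes
  // JSON.parse throw, so a missing body takes the same path as bad JSON.
  result = JSON.parse(response.text ?? "");
} catch {
  // Unparseable output. Do not guess. Fall back.
  return askUserToFillManually("I couldn't process that description. Please fill in the form directly.");
}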

const { formData } = result;
const issues: string[] = [];

// The schema said "YYYY-MM-DD." Verify it anyway.
if (!isValidDate(formData.incidentDate)) {
  issues.push("Incident date could not be resolved.");
} else if (formData.incidentDate > today) {
  // Business rule the model knows nothing about. The string comparison is
  // only meaningful because both values are known to be valid YYYY-MM-DD.
  issues.push("Incident date cannot be in the future.");
}

// Enums are enforced by the schema - trust, but verify.
if (!["Yes", "No", "Unclear"].includes(formData.injuries)) {
  issues.push("Injury status is unclear.");
}

if (issues.length > 0) {
  // Bad data never renders as truth. The user resolves it, not the model.
  return askUserToReview(formData, issues);
}

renderForm(formData, result.feedback, result.confirmation);
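
The checks above call `isValidDate`, which the article does not define. One possible implementation, offered as an assumption rather than the author's version, accepts only real calendar dates written as `YYYY-MM-DD`:

```typescript
// Hypothetical helper: true only for real calendar dates in YYYY-MM-DD form.
function isValidDate(value: unknown): boolean {
  if (typeof value !== "string") return false;

  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Build the date in UTC and confirm nothing rolled over
  // (for example, February 30 would silently become March 2).
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isValidDate("2026-07-01"), isValidDate("2026-02-30"), isValidDate("yesterday"));
```

A regular expression alone would accept `2026-02-30`, and `new Date(...)` alone would accept many loose formats, so this sketch uses both: the pattern pins the format and the round trip confirms the date exists.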

With the schema in place, the model typically returns a JSON string that matches your structure:

{
  "formData": {
    "incidentDate": "2026-07-01",
    "accidentType": "Rear-end collision",
    "damageDescription": "Broken side mirror",
    "injuries": "No",
    "notes": ""
  },
  "feedback": [
    "I interpreted \"yesterday\" as July 01, 2026.",
    "I marked injuries as No because the user said nobody was hurt."
  ],
  "confirmation": "I filled in the form based on your description. Please review before submitting."
}

Key practices for reliability:

  • Use structured output with a strict schema.
  • Always validate the output in your code - a schema makes well-formed output far more likely, but it guarantees neither the shape nor the values.
  • Include a feedback array for transparency.
  • Fall back gracefully when parsing or validation fails.
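
The fallback practice can be combined with a small, bounded number of retries before handing control back to the user. The wrapper below is a sketch under assumptions: `extractWithFallback`, the `Attempt` type, and the convention that `validate` returns a list of issues are all hypothetical, and the model call is passed in as a plain function so no SDK is involved.

```typescript
type Attempt<T> =
  | { status: "ok"; value: T; attempts: number }
  | { status: "fallback"; reasons: string[] };

async function extractWithFallback<T>(
  callModel: () => Promise<string>,
  validate: (parsed: unknown) => string[], // an empty array means valid
  maxAttempts = 2
): Promise<Attempt<T>> {
  const reasons: string[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const parsed: unknown = JSON.parse(await callModel());
      const issues = validate(parsed);
      if (issues.length === 0) {
        return { status: "ok", value: parsed as T, attempts: attempt };
      }
      reasons.push(`attempt ${attempt}: ${issues.join("; ")}`);
    } catch (error) {
      // Network errors and unparseable output both land here.
      const message = error instanceof Error ? error.message : String(error);
      reasons.push(`attempt ${attempt}: ${message}`);
    }
  }

  // Out of attempts: return control to the user instead of guessing.
  return { status: "fallback", reasons };
}
```

The caller decides what "fallback" means, for example showing the empty form. Keeping `maxAttempts` small matters, since every retry adds latency and cost.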

This is the boundary. The model interprets language. Your system owns execution and remains the source of truth.

Testing with Evals

The validation above protects each request. It says nothing about whether your pipeline is good overall - that's what evals are for. With normal software, you test whether something works or fails. With AI, you also need to test whether the answer is good enough.

Evals let you run a test dataset through your pipeline, measure quality against a clear bar, and catch weak outputs before changes reach users.

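// Run every test case through the full pipeline.
// Promise.all starts all requests at once and rejects if any one of them throws.
const results = await Promise.all(
  testDataset.map(test => extractIncidentFromLLM(test.input))
);

// A case passes when the injuries value matches and the date is well-formed.
const accuracy =
  results.filter((r, i) =>
    r.formData.injuries === testDataset[i].expectedInjuries &&
    isValidDate(r.formData.incidentDate)
  ).length / testDataset.length * 100;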

if (accuracy < 95) {
  throw new Error(`Eval failed: accuracy dropped to ${accuracy.toFixed(1)}% (threshold: 95%)`);
}

Best practices:

  • Include edge cases, ambiguous language, and adversarial inputs.
  • Track schema compliance, hallucination rate, and feedback quality.
  • Run evals in CI/CD.
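
Tracking several metrics separately might look like the sketch below. The `EvalCase` and `Scorecard` shapes and the choice of metrics are assumptions for illustration; a `null` output stands for a request that failed or returned something unusable. Unlike a single accuracy number, each field is compared against its own expected value.

```typescript
interface Extraction {
  incidentDate: string;
  injuries: string;
}

interface EvalCase {
  input: string;
  expected: Extraction;
  output: Extraction | null; // null: the pipeline produced nothing usable
}

interface Scorecard {
  total: number;
  schemaCompliance: number; // % of cases with usable output
  dateAccuracy: number; // % of all cases with the expected date
  injuriesAccuracy: number; // % of all cases with the expected injuries value
}

function percent(hits: number, total: number): number {
  return total === 0 ? 0 : (hits / total) * 100;
}

function score(cases: EvalCase[]): Scorecard {
  let usable = 0;
  let dateHits = 0;
  let injuryHits = 0;

  for (const c of cases) {
    if (c.output === null) continue; // counts as a miss on every metric
    usable++;
    if (c.output.incidentDate === c.expected.incidentDate) dateHits++;
    if (c.output.injuries === c.expected.injuries) injuryHits++;
  }

  return {
    total: cases.length,
    schemaCompliance: percent(usable, cases.length),
    dateAccuracy: percent(dateHits, cases.length),
    injuriesAccuracy: percent(injuryHits, cases.length),
  };
}
```

A CI gate could then apply a different threshold to each metric, so a drop in date accuracy is not hidden by a strong injuries score.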

In a real pipeline, you'd replace Promise.all with a concurrency-limited runner that handles retries and failed requests, so the eval tests the system without overwhelming the provider. A production eval keeps a separate score for each thing that can go wrong - like a report card instead of a GPA - so when the numbers drop, you know exactly what to fix.
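
A concurrency-limited runner of the kind described here can be sketched in a few lines. `mapWithLimit` and the `Settled` type are hypothetical names; the retry policy (immediate retry, fixed count) is a simplifying assumption, and a real runner would likely add backoff between attempts.

```typescript
type Settled<T> = { ok: true; value: T } | { ok: false; error: string };

async function mapWithLimit<I, O>(
  items: I[],
  limit: number,
  worker: (item: I) => Promise<O>,
  retries = 1
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims items one at a time until none are left.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no race: nothing else runs between the check and the claim
      let lastError = "unknown error";
      let done = false;

      for (let attempt = 0; attempt <= retries && !done; attempt++) {
        try {
          results[index] = { ok: true, value: await worker(items[index]) };
          done = true;
        } catch (error) {
          lastError = error instanceof Error ? error.message : String(error);
        }
      }

      // Record the failure instead of rejecting the whole run.
      if (!done) results[index] = { ok: false, error: lastError };
    }
  }

  const laneCount = Math.min(Math.max(1, limit), items.length);
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(runLane());
  await Promise.all(lanes);

  return results; // same order as `items`
}
```

Because failures are recorded per item rather than thrown, one bad request lowers the score instead of aborting the eval, and at most `limit` requests are in flight at any time.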

Closing

The mistake is treating the model as though its job is to know the answer. Knowing was never the point. Treat the LLM API as infrastructure. Like a database, a message queue, or an API gateway, it has capabilities, constraints, and mechanics. Your job is not to make the model deterministic. It's to design the architecture around it so that probabilistic outputs become safe, predictable, and useful.
