What Makes an AI System Adaptive in Practice
DEV Community

The Problem with Reacting to the Last Event

Say you're building a system that serves practice questions to someone studying for a test. The naive approach: track correct/incorrect on the last question, branch accordingly. This fails in a few specific ways.

First, it has no memory beyond one step - a lucky guess on a hard question sends you to an even harder one, and a careless mistake on an easy question drops you into material beneath your actual level.

Second, it can't express confidence. There's no way to know if the system's read on your ability is stable or if it's bouncing around on noise.

Third, it doesn't generalize: the branching logic is usually hard-coded per content type, so every new domain needs its own set of if-else rules.

None of these are implementation bugs. They're consequences of the underlying model being too thin: a single boolean, discarded after each step.

A truly adaptive system replaces that boolean with a persistent, continuously updated estimate of some hidden variable: ability, preference, risk tolerance, whatever the domain calls for. It then treats every new observation as evidence that updates the estimate, rather than as a standalone trigger for a branch.

Three Components, Not a Single Algorithm

Once you commit to maintaining a real hidden-state estimate, the system needs three distinct pieces working together. It's tempting to think the "adaptive" part is just the estimation algorithm - a good statistical model - but the estimation by itself does nothing without two other pieces wrapped around it.

Component one: estimating hidden state from behavior. You need a principled way to go from "here's what the user did" to "here's what we now believe about their underlying ability or preference." This has to handle noise gracefully - one anomalous data point shouldn't wildly swing the estimate - and it has to produce something with a notion of confidence.

Component two: selecting the next input calibrated to that estimate. Once you know (approximately) where someone stands, you need a policy for choosing what to show them next. The obvious move is to aim at the boundary of their current ability - not so easy it's wasted, not so hard it's demoralizing or uninformative. This is a search problem: find content matching the target zone, and have a fallback strategy for when nothing matches closely.

Component three: a pipeline that keeps producing new inputs. This is the piece people forget. Estimation and selection are useless if the pool of things to select from runs dry. Any adaptive system that runs long enough - like a student studying for months, a user coming back daily - will eventually exhaust a static content set. You need a way to generate new material, validate that it's actually usable, and make it available for selection without regenerating it redundantly for the next person who needs the same thing.

Miss any of these three and the system degrades:

  • No estimation: you're back to if-else.
  • No selection strategy: you have a great model of the user and nothing intelligent to do with it.
  • No content pipeline: the system works beautifully until the content runs out, then quietly stops being adaptive and starts being static.

Component One in Practice: Item Response Theory

I'll ground this in the system I actually built: an adaptive exam prep platform, because abstract components are easier to trust once you've seen them solve a real problem.

For estimating a student's ability from their answers, I used Item Response Theory (IRT), specifically the two-parameter logistic (2PL) model. This isn't a novel technique. It's the same psychometric framework behind standardized tests like the GRE and GMAT, but it's a clean, well-understood instance of "component one" that generalizes well beyond exams.

The model assigns each question two parameters:

  • Difficulty (b), placed on the same scale as student ability, ranging roughly from -3 to +3
  • Discrimination (a), which captures how sharply the question distinguishes between students at different ability levels. A question with high discrimination is more informative. A student's correctness on that question tells you a lot about where they actually stand. A question with low discrimination tells you almost nothing, regardless of who answers it.

The probability that a student with ability θ (theta) answers a given question correctly is:

P(correct | θ) = 1 / (1 + e^(-a(θ - b)))

This is a logistic curve. When a student's ability matches the question's difficulty exactly (θ = b), the probability of a correct answer is 50%. As ability rises above difficulty, probability climbs toward 1; as it falls below, probability drops toward 0. The discrimination parameter controls how steep that climb is. High discrimination means the curve is nearly a step function; low discrimination means it's a shallow slope.
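As a rough illustration of the formula above, here is a minimal JavaScript sketch. The field name `difficultyB` and the two example items are assumptions made for this example, not taken from the platform's actual schema.

```javascript
// 2PL item response function: P(correct | theta) = 1 / (1 + e^(-a(theta - b))).
// The field names discriminationA and difficultyB are assumed for this sketch.
function probability2PL(theta, item) {
  const exponent = -item.discriminationA * (theta - item.difficultyB);
  return 1 / (1 + Math.exp(exponent));
}

// Two hypothetical items with the same difficulty but different discrimination.
const sharp = { discriminationA: 2.5, difficultyB: 0 };
const shallow = { discriminationA: 0.5, difficultyB: 0 };

// Ability equal to difficulty gives 0.5 regardless of discrimination.
console.log(probability2PL(0, sharp));   // 0.5
console.log(probability2PL(0, shallow)); // 0.5

// One unit above difficulty, the high-discrimination curve has climbed much further.
console.log(probability2PL(1, sharp).toFixed(3));   // 0.924
console.log(probability2PL(1, shallow).toFixed(3)); // 0.622
```

The second pair of outputs is the "steepness" in concrete terms: the same one-unit ability gap moves the sharp item most of the way to certainty and the shallow item only a little.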

After each scored section, the system needs to update its estimate of θ. A simple, effective approach is a gradient-based update: for each answered question, compute the expected probability of a correct answer given the current θ, compare it to what actually happened, and nudge θ in the direction of the surprise, scaled by a learning rate.

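function updateTheta(theta, answers, items, learningRate = 0.15) {
  // answers[i] is assumed to be the response to items[i]
  answers.forEach((answer, i) => {
    const expected = probability2PL(theta, items[i]);
    const actualCorrect = answer.correct ? 1 : 0;
    const delta = learningRate * (actualCorrect - expected);
    theta += delta;
  });
  return Math.min(3, Math.max(-3, theta)); // clamp to [-3, 3]
}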

The learning rate is a real design decision, not a throwaway constant. Set it too high, and a single lucky guess or careless mistake swings the estimate wildly - you've reintroduced the noise problem the whole model was supposed to solve. Set it too low, and the system barely reacts to real changes in ability, which defeats the purpose of adapting at all. In practice, something around 0.15 balances responsiveness against stability, though the right value depends on how many data points you get per update cycle and how noisy those data points are.
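To make the trade-off concrete, the sketch below applies the same update rule to a single lucky guess with two different learning rates. The item fields (`discriminationA`, `difficultyB`), the `answer.correct` flag, and the value 1.5 used as the "too high" rate are assumptions chosen for illustration.

```javascript
function probability2PL(theta, item) {
  return 1 / (1 + Math.exp(-item.discriminationA * (theta - item.difficultyB)));
}

// Same rule as the update described above; answers[i] is assumed to pair with items[i].
function updateTheta(theta, answers, items, learningRate = 0.15) {
  answers.forEach((answer, i) => {
    const expected = probability2PL(theta, items[i]);
    const actualCorrect = answer.correct ? 1 : 0;
    theta += learningRate * (actualCorrect - expected);
  });
  return Math.min(3, Math.max(-3, theta)); // keep theta on the [-3, 3] scale
}

// A student estimated at 0 guesses correctly on a hard item (difficulty 2).
const hardItem = { discriminationA: 1, difficultyB: 2 };
const luckyGuess = [{ correct: true }];

const cautious = updateTheta(0, luckyGuess, [hardItem], 0.15);
const jumpy = updateTheta(0, luckyGuess, [hardItem], 1.5);

console.log(cautious.toFixed(3)); // 0.132
console.log(jumpy.toFixed(3));    // 1.321
```

With the lower rate, one surprising answer shifts the estimate by about a tenth of a unit; with the higher rate, the same single answer moves the student more than a full unit up the scale.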

One more concept worth carrying into any domain: Fisher Information. This tells you how much a given question, at a given ability estimate, would actually teach you about the student if they answered it.

function fisherInformation(theta, item) {
  const p = probability2PL(theta, item);
  const q = 1 - p;
  return item.discriminationA ** 2 * p * q;
}

This function is maximized at p = 0.5 - exactly the point where the outcome is most uncertain. That's intuitive once you sit with it: if you already know someone will almost certainly get a question right or wrong, their answer tells you very little you didn't already know. The questions that teach the system the most are the ones sitting right at the edge of what the student can do.
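A quick way to check that claim numerically is to scan ability values for one item and see where its information peaks. This is only a sketch: the item parameters and the `difficultyB` field name are made up for the example.

```javascript
function probability2PL(theta, item) {
  return 1 / (1 + Math.exp(-item.discriminationA * (theta - item.difficultyB)));
}

function fisherInformation(theta, item) {
  const p = probability2PL(theta, item);
  return item.discriminationA ** 2 * p * (1 - p);
}

// Hypothetical item: discrimination 1.2, difficulty 0.5.
const item = { discriminationA: 1.2, difficultyB: 0.5 };

// Scan theta from -3 to 3 in steps of 0.1 and keep the most informative point.
let best = { theta: NaN, info: -Infinity };
for (let step = 0; step <= 60; step++) {
  const theta = -3 + step * 0.1;
  const info = fisherInformation(theta, item);
  if (info > best.info) best = { theta, info };
}

// The peak sits where theta equals the item's difficulty, with value a^2 / 4.
console.log(best.theta.toFixed(1)); // 0.5
console.log(best.info.toFixed(3));  // 0.360
```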

This principle - target maximum uncertainty, not maximum confidence - generalizes far outside exams. Any system trying to learn about a user efficiently should be seeking out the inputs it's least certain about, not the ones it already has a confident read on.

The standard error of the ability estimate is approximately one over the square root of the total accumulated Fisher Information. Track this, and you get something most naive systems can't offer: a confidence interval on your own model of the user, not just a raw point estimate.
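One possible way to track this is sketched below. It uses the usual large-sample approximation SE ≈ 1 / sqrt(total information) and a normal-approximation interval with z = 1.96; the helper names and the `difficultyB` field are assumptions for illustration.

```javascript
function probability2PL(theta, item) {
  return 1 / (1 + Math.exp(-item.discriminationA * (theta - item.difficultyB)));
}

function fisherInformation(theta, item) {
  const p = probability2PL(theta, item);
  return item.discriminationA ** 2 * p * (1 - p);
}

// Standard error of theta given the items answered so far.
function standardError(theta, answeredItems) {
  const totalInfo = answeredItems.reduce(
    (sum, item) => sum + fisherInformation(theta, item), 0);
  return totalInfo > 0 ? 1 / Math.sqrt(totalInfo) : Infinity;
}

// Approximate 95% confidence interval around the point estimate.
function confidenceInterval(theta, answeredItems, z = 1.96) {
  const se = standardError(theta, answeredItems);
  return [theta - z * se, theta + z * se];
}

// Each of these items contributes 0.25 units of information at theta = 0.
const item = { discriminationA: 1, difficultyB: 0 };
const few = Array(4).fill(item);
const many = Array(16).fill(item);

console.log(standardError(0, few));  // 1
console.log(standardError(0, many)); // 0.5
console.log(confidenceInterval(0, many)); // [ -0.98, 0.98 ]
```

Quadrupling the number of well-targeted questions halves the standard error, which is one reason the system can reasonably report low confidence early in a session.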

Component Two in Practice: Progressive Difficulty Search

With an ability estimate in hand, the next problem is selection: given a student's current θ, what should they see next?

The natural instinct is to search for content matching θ as closely as possible. But a tight match isn't always available. Maybe the content pool at that exact difficulty is thin, or the student is in an unusual range with less material.

The fix is a progressive widening search: start with a narrow band around the target, and only widen it if nothing in that band is available.

const ranges = [
  [theta - 0.8, theta + 0.8],   // ideal range
  [theta - 1.5, theta + 1.5],   // widen
  [theta - 2.5, theta + 2.5],   // widen further
  [-3, 3],                      // full range as last resort
];

This means the student sees well-calibrated material whenever it exists, and the search degrades gracefully - rather than failing outright - when the ideal match doesn't. It's a simple pattern, but it's the kind of detail that separates a system that's adaptive in principle from one that's adaptive in practice, under the actual constraints of a finite content pool.
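The widening search can be sketched as a small function over those bands. The pool shape, the `id` and `difficultyB` fields, and the set of already-seen ids are assumptions for this example; the band widths are the ones listed above.

```javascript
// The four bands from the text: three widening windows, then the full scale.
function buildRanges(theta) {
  return [
    [theta - 0.8, theta + 0.8],
    [theta - 1.5, theta + 1.5],
    [theta - 2.5, theta + 2.5],
    [-3, 3],
  ];
}

// Returns unseen items from the narrowest band that has any, closest to theta first.
function selectCandidates(theta, pool, seenIds = new Set()) {
  const unseen = pool.filter((item) => !seenIds.has(item.id));
  const distance = (item) => Math.abs(item.difficultyB - theta);
  for (const [low, high] of buildRanges(theta)) {
    const inBand = unseen.filter(
      (item) => item.difficultyB >= low && item.difficultyB <= high);
    if (inBand.length > 0) {
      return inBand.sort((x, y) => distance(x) - distance(y));
    }
  }
  return []; // nothing left for this user in any band
}

const pool = [
  { id: "q1", difficultyB: -2.0 },
  { id: "q2", difficultyB: 0.3 },
  { id: "q3", difficultyB: 1.9 },
];

// The ideal band has a match, so only q2 comes back.
console.log(selectCandidates(0, pool).map((item) => item.id));
// With q2 already seen, the search widens to the third band: q3, then q1.
console.log(selectCandidates(0, pool, new Set(["q2"])).map((item) => item.id));
```

Returning an empty array when every band is exhausted gives the caller a clear signal that new content is needed, instead of silently repeating material.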

Component Three in Practice: On-Demand Generation with Write-Through Caching

This is the piece that's easiest to skip and most costly to skip. A static content bank, no matter how large, eventually runs out. Running out doesn't fail loudly. It fails quietly: users start seeing repeated material, memorizing answers instead of engaging with the underlying skill, and the system's carefully-tuned selection logic starts operating on an increasingly thin and stale pool.

The fix is to generate new content on demand and cache it for reuse, rather than pre-building an exhaustive static bank. The flow looks like this:

  1. Check the cache at the target difficulty band, using the same progressive-widening logic as selection
  2. On a cache miss, call a generation model - in my case, a large language model - with the target difficulty and content type
  3. Validate the output against a strict schema before it's trusted
  4. Run quality heuristics beyond structural validation: is this actually good content, not just correctly-shaped content?
  5. Insert into the database before returning to the requester: write-through, not write-behind
  6. Record that this particular user has now seen this content, so it won't repeat for them

function getOrGenerate(...) {
  cached = getFromCache(...)
  if (cached) return cached

  generated = callGenerationModel(...)
  validated = validateAndFilter(generated)

  for each item in validated:
    db.insert(contentTable, item)
    db.insert(userSeenTable, { userId, itemId })

  return validated
}
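For readers who want something runnable, here is one way the flow could look with in-memory stand-ins for the database and the generation model. Everything named here (the `Map` and `Set` stores, the stub generator, the fixed 0.8 band, the item fields) is an assumption for illustration. Unlike the pseudocode above, this sketch also records "seen" on cache hits and uses a single band instead of progressive widening, to keep it short.

```javascript
// In-memory stand-ins for the content table and the per-user "seen" table.
const contentTable = new Map();  // itemId -> item
const userSeenTable = new Set(); // "userId:itemId"
let generationCalls = 0;

// Hypothetical generator stub; a real system would call a language model here.
function callGenerationModel(targetDifficulty) {
  generationCalls += 1;
  return [{ id: `gen-${generationCalls}`, difficultyB: targetDifficulty, stem: "placeholder" }];
}

// Placeholder validation: keep only items with the fields this sketch relies on.
function validateAndFilter(items) {
  return items.filter(
    (item) => typeof item.id === "string" && Number.isFinite(item.difficultyB));
}

// Cached items near the target that this user has not seen yet.
function getFromCache(userId, targetDifficulty, halfWidth = 0.8) {
  return [...contentTable.values()].filter(
    (item) =>
      Math.abs(item.difficultyB - targetDifficulty) <= halfWidth &&
      !userSeenTable.has(`${userId}:${item.id}`));
}

function getOrGenerate(userId, targetDifficulty) {
  let items = getFromCache(userId, targetDifficulty);
  if (items.length === 0) {
    items = validateAndFilter(callGenerationModel(targetDifficulty));
    // Write-through: persist before returning to the requester.
    for (const item of items) contentTable.set(item.id, item);
  }
  // Record what this user has now seen, on hits and misses alike.
  for (const item of items) userSeenTable.add(`${userId}:${item.id}`);
  return items;
}

getOrGenerate("alice", 0.5);  // cache miss: generates and stores gen-1
getOrGenerate("bob", 0.5);    // cache hit: reuses gen-1
console.log(generationCalls); // 1
getOrGenerate("alice", 0.5);  // alice has already seen gen-1, so gen-2 is generated
console.log(generationCalls); // 2
```

The second user's request costs no generation call, which is the payoff of inserting before returning.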

The validation step matters more than it might seem. Generation models are unreliable by default. Structurally malformed output, subtly wrong content, or output that's correctly shaped but low quality all need to be caught before they enter the system. A discriminated union schema per content type, checked at the boundary, catches structural problems. Separate quality heuristics - plausible distractors, an unambiguous correct answer, difficulty ratings that actually match the content's real difficulty - catch problems that pass schema validation but are still bad.
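The two layers of checking might look roughly like this. It is a hand-rolled sketch: in practice a schema library would normally do the structural part, and the content types (`multipleChoice`, `fillInBlank`), their fields, and the single duplicate-option heuristic are all hypothetical.

```javascript
// Structural checks keyed by the discriminating "type" tag (hypothetical types).
const structuralChecks = new Map([
  ["multipleChoice", (c) =>
    typeof c.stem === "string" &&
    Array.isArray(c.options) && c.options.length >= 2 &&
    c.options.every((option) => typeof option === "string") &&
    Number.isInteger(c.correctIndex) &&
    c.correctIndex >= 0 && c.correctIndex < c.options.length],
  ["fillInBlank", (c) =>
    typeof c.sentence === "string" && c.sentence.includes("___") &&
    typeof c.answer === "string" && c.answer.length > 0],
]);

function isStructurallyValid(candidate) {
  if (candidate === null || typeof candidate !== "object") return false;
  const check = structuralChecks.get(candidate.type);
  if (check === undefined) return false; // unknown content type
  return Number.isFinite(candidate.difficultyB) &&
    candidate.difficultyB >= -3 && candidate.difficultyB <= 3 &&
    check(candidate);
}

// One example heuristic beyond shape: options must be non-empty and distinct.
function passesQualityHeuristics(candidate) {
  if (candidate.type !== "multipleChoice") return true;
  const normalized = candidate.options.map((option) => option.trim().toLowerCase());
  return normalized.every((option) => option.length > 0) &&
    new Set(normalized).size === normalized.length;
}

function validateAndFilter(candidates) {
  return candidates.filter(
    (c) => isStructurallyValid(c) && passesQualityHeuristics(c));
}

const candidates = [
  { type: "multipleChoice", difficultyB: 0.4, stem: "Choose the synonym of 'rapid'.",
    options: ["quick", "slow", "heavy", "late"], correctIndex: 0 },
  { type: "multipleChoice", difficultyB: 0.4, stem: "Choose the synonym of 'rapid'.",
    options: ["quick", "Quick ", "slow", "late"], correctIndex: 0 }, // duplicate option
  { type: "essay", difficultyB: 0, prompt: "Describe your day." },   // unknown type
  { type: "fillInBlank", difficultyB: -1.2, sentence: "She ___ to school every day.",
    answer: "walks" },
  { type: "multipleChoice", difficultyB: 1.0, stem: "Pick one.",
    options: ["a", "b"], correctIndex: 5 },                          // index out of range
];

console.log(validateAndFilter(candidates).length); // 2
```

Keeping the structural check and the heuristics as separate functions makes it easier to log which layer rejected an item, which helps when tuning the generation prompt.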

Write-through caching, specifically inserting before returning rather than after, is the detail that eliminates an entire category of bugs. The database becomes the single source of truth immediately, the next user at that difficulty level gets a cache hit instead of triggering another generation call, and there's no separate cache-invalidation logic to get wrong later. The cost is roughly the latency of one database insert per newly-generated item. In practice, it's on the order of tens of milliseconds, which is negligible against the cost of a redundant generation call.

Why All Three Components Have to Feed Each Other

Here's the part that's easy to miss if you build these three components as separate features rather than as one system: they only produce real adaptivity when they're wired into a loop, each one's output feeding the next one's input.

  • The ability estimate determines what difficulty band to request from generation.
  • Generation's output, once validated and cached, becomes part of the pool that selection draws from.
  • What the student does with the selected content (correct, incorrect, how long it took) feeds back into the next ability estimate update.
  • And downstream signals, like a pattern of wrong answers in a particular area, feed into what future generation and selection should prioritize.
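A toy version of the loop in the list above, compressed into one file, might look like this. It is a sketch under heavy assumptions: the generator is a stub that emits one item at the requested difficulty with discrimination 1, the student is a deterministic stand-in who answers correctly whenever an item's difficulty is at or below 1.0, and the fourth bullet (topic-level signals) is left out.

```javascript
function probability2PL(theta, item) {
  return 1 / (1 + Math.exp(-item.discriminationA * (theta - item.difficultyB)));
}

// Hypothetical generator stub: the ability estimate sets the requested difficulty.
function generateItem(targetDifficulty, pool) {
  const item = {
    id: `gen-${pool.length + 1}`,
    discriminationA: 1,
    difficultyB: targetDifficulty,
  };
  pool.push(item); // generated content joins the pool that selection draws from
  return item;
}

// Selection: nearest unseen item within the ideal band, else ask generation for one.
function selectNext(state, pool) {
  const distance = (item) => Math.abs(item.difficultyB - state.theta);
  const candidates = pool
    .filter((item) => !state.seen.has(item.id) && distance(item) <= 0.8)
    .sort((x, y) => distance(x) - distance(y));
  return candidates[0] ?? generateItem(state.theta, pool);
}

// One turn of the loop: select, observe the answer, update the estimate.
function step(state, pool, answersCorrectly, learningRate = 0.15) {
  const item = selectNext(state, pool);
  const correct = answersCorrectly(item) ? 1 : 0;
  state.seen.add(item.id);
  const updated = state.theta + learningRate * (correct - probability2PL(state.theta, item));
  state.theta = Math.min(3, Math.max(-3, updated));
  return item;
}

const pool = [{ id: "seed-1", discriminationA: 1, difficultyB: 0 }];
const state = { theta: 0, seen: new Set() };
const student = (item) => item.difficultyB <= 1.0; // deterministic stand-in

for (let turn = 0; turn < 20; turn++) step(state, pool, student);

// The estimate climbs and then hovers near the student's boundary at 1.0,
// while the pool grows because the single seed item was used up immediately.
console.log(state.theta.toFixed(2), pool.length);
```

Each function reads the state the previous one wrote, so removing any one of them breaks the behavior rather than just removing a feature.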

Build these as three independent features that happen to share a database, and you get something that looks adaptive on a feature list but doesn't behave like one system. The estimation module doesn't know what content the generation module has recently produced. The generation module doesn't know what the selection module actually needs. Nothing closes the loop, so the system's behavior over time is the sum of three components acting somewhat independently, rather than one coherent process getting smarter about the user as it goes.

Getting this loop to actually close - rather than just having three modules that technically interoperate - was, in my experience, harder than getting any single component right. IRT is well-documented math. Write-through caching is a known pattern. Progressive difficulty search is a straightforward algorithm. None of the individual pieces required much invention. The loop between them did.

What This Looks Like Applied

I built this into Xamio, an adaptive prep platform for AMIRNET, an English proficiency exam used for university admission in Israel. The IRT engine estimates a student's ability
