DEV Community

🎯 The AI Engineer 🤖 Interview Playbook 📖

⚡ TL;DR

The AI engineer role is software engineering with AI systems on top - you orchestrate models (LLMs, RAG, agents) into reliable products, not train models from scratch. Interviews test six things: ML/LLM fundamentals, applied ML, LLM/RAG engineering, coding, AI system design, and behavioral.

If you remember one thing: companies are hiring AI system builders, not people who can call an LLM API. The fastest way to stand out is to think like a product + system owner, be explicit about failure modes, and show evaluation rigor. Evaluation is the single biggest skill gap among candidates, so it's your biggest opportunity.

The rest is discipline: solid DSA + Python, 2–3 deployed end-to-end projects, and the ability to explain trade-offs (quality vs. latency vs. cost) out loud.

🧭 What an AI engineer actually is

The role is new and definitions are still settling, so the first job is knowing what you're being hired for. Core responsibility: integrate AI into a product. Work with LLM providers (OpenAI, Anthropic) through their APIs, partner with PMs to find real user problems AI can solve, and ship reliably. It starts from a real problem - not "AI is cool, let's use it."

🔀 AI engineer vs. ML engineer vs. data scientist

Role	Focus	Owns	Day-to-day
AI engineer	Building with models	Prompts, pipelines, integration	RAG, prompting, tools, agents, evals
ML engineer	Optimizing models	Model weights, training	Training, features, metrics
Data scientist	Creating models	Datasets, experiments	Requirements → ML, modeling

The lines are blurry and the industry treats them as a spectrum. In practice, most postings are "ML engineer" or "software engineer with an AI focus." The consistent message from hiring managers: "Companies are not hiring for titles - they want to know if you can build reliable AI systems." If you can only do modeling or only do systems, you're already behind.

📈 Progressive complexity (know where a problem sits)

Simple: user input → prompt + LLM API → response.
RAG (~5× harder): add data pipelines, a search engine (vector/text), retrieval, reliability.
Agents (~10× harder): add tool calls, multi-step loops, trace instrumentation, tool-rollout management.

🚫 What AI engineers usually don't do

Create models from scratch, build custom architectures, or do heavy feature engineering. What they do: engineering best practices for AI systems, prompt design + versioning, product integration, and evaluation + monitoring.

🗺️ The interview process (what to expect)

Based on an analysis of real job postings and candidate reports, the median process has 4 steps, most fall in the 3–5 range, and the whole thing runs 2–6 weeks.

Round	Typical length	What it tests
Recruiter / talent screen	15–30 min	Fit, salary expectations
Technical / coding	45–60 min	LeetCode-style, sometimes AI-flavored
AI/ML deep-dive	45–90 min	LLMs, RAG, hallucinations, fine-tuning vs. prompting
Take-home / project	1–7 days	Build a RAG or agent system
AI system design	60 min	Scale LLM apps, cost/latency optimization
Behavioral	30–60 min	STAR/SAIL, ownership in ambiguous work
Hiring manager / founder	15–60 min	Deep dive, motivation, values

🏢 Real loops (from candidate reports)

Mistral AI (Applied AI Engineer): LLM theory → coding → project deep-dive → tech manager → ML system design → take-home → values talk.
Amazon (GenAI, L6): LeetCode + practical ML coding (cosine similarity in NumPy) → SDE bar → GenAI depth (LLM/ViT architectures, fine-tuning, ROI estimation) → Leadership Principles throughout.
Eightfold.ai (Agentic AI): AI-agent-conducted coding round → 3-day take-home to build an agent → DSA interview with EM.
LangChain (AI Engineer): take-home (build an agent) → solution discussion → applied system design.
PostHog: talent call → 60-min technical → co-founder call → paid full-day SuperDay (compensated real work).
Microsoft (Applied AI/ML intern): AI-assisted coding (use ChatGPT, then re-prompt on a modified problem) → raw coding, no AI tools → behavioral.

Two trends to know: (1) in-person rounds are back (up from ~24% in 2022 to ~38% in 2025) to counter cheating; frontier labs increasingly require onsites. (2) References matter more - most top companies now want 2–3 references from recent managers.

🎯 The six question categories

Nearly every AI engineer loop draws from these six buckets. Prepare all six; weight by seniority and role.

ML & deep learning fundamentals - bias/variance, overfitting, precision/recall, ROC, gradient descent, CNNs, transformers, BERT/GANs.
Applied ML & infrastructure - pipelines, fine-tuning, transfer learning, FP32/FP16/BF16 trade-offs, sparse vs. dense, deployment.
LLM engineering & RAG - tokenization, context limits, cost/latency, hallucination, embeddings, vector search, chunking, grounding, re-ranking.
Coding / Python fundamentals - DSA (indexing/search/graph/tree/heap), Python internals (GIL, is vs ==, mutable/immutable, async), SQL.
AI system design - end-to-end pipelines, caching, cost, reliability, failure modes.
Behavioral - ambiguity, communication, influence, AI ethics, trade-off ownership.

📌 Focus by seniority

Level	Emphasis
Junior / Intern	Coding fundamentals, basic ML concepts, project enthusiasm, willingness to learn
Mid	End-to-end system knowledge, RAG pipelines, embeddings, production awareness
Senior	Trade-off fluency, system design at scale, failure-mode reasoning, cost optimization
Staff+	Technical leadership, cross-team influence, project presentations, org impact

At senior/staff levels, interviewers pick 3–5 topics and drill deep into failure modes and trade-offs rather than covering many topics superficially. Depth beats breadth.

🧠 Core knowledge checklist

The must-know surface area, grouped so you can self-audit. You don't need every advanced item, but you must be fluent in the basics and have opinions backed by trade-offs.

🔤 LLM fundamentals

Transformers: self-attention, Q/K/V, multi-head attention, positional encoding (RoPE), encoder vs. decoder vs. encoder-decoder.
Tokenization: BPE, WordPiece/SentencePiece, why domain terms get split badly.
Generation controls: temperature, top-p/top-k sampling, logits, context window, why the first token is slow (prefill vs. decode).
Efficiency: KV cache, quantization (INT8/INT4, FP16/BF16), distillation, MoE, Flash Attention, GQA.
Alignment: RLHF, DPO, instruction tuning, reward hacking, the "alignment tax."
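
To make the generation controls above concrete, here is a minimal sketch of temperature and top-p (nucleus) sampling over raw logits. The function name `sample_next_token` and its defaults are illustrative assumptions, not any provider's API; real inference stacks do this on the accelerator and usually combine it with top-k and repetition penalties.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one token id using temperature scaling and top-p filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Assumption for this sketch: temperature <= 0 means greedy decoding.
    if temperature <= 0:
        return int(np.argmax(logits))

    # Temperature rescales logits before the softmax:
    # below 1 sharpens the distribution, above 1 flattens it.
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))
```

A useful talking point: lowering `temperature` or `top_p` makes sampling less varied, but it does not by itself make the output more factual.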

📚 RAG (table stakes - expect deep questions)

Architecture: chunk → embed → index → retrieve → re-rank → generate.
Chunking strategies: fixed, recursive, semantic, parent-child. How to pick chunk size.
Retrieval: dense vs. sparse embeddings, cosine/dot/Euclidean, ANN, hybrid search, re-ranking.
Failure modes: hallucination despite good context, "lost in the middle," multi-hop questions, conflicting sources, stale data.
Query transforms: HyDE, decomposition, step-back prompting.
Citation/source attribution.
The key trade-off: RAG vs. fine-tuning vs. prompt engineering - and when you'd NOT use RAG.
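
The chunk → embed → retrieve → generate flow above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, retrieval is a linear scan rather than an ANN index, and all function names are hypothetical. It also skips re-ranking.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=40, overlap=10):
    """Fixed-size chunking by word count, with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]

def embed(text):
    """Toy stand-in for an embedding model: term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Number the sources so the answer can cite them."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )
```

Each function maps to one failure mode worth naming in an interview: chunk boundaries that split an answer, embeddings that miss domain terms, and prompts that do not force grounding or citation.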

🤖 Agents

ReAct, Plan-and-Execute, Reflection patterns; tool use / function calling; MCP.
Agent memory (short-term, long-term, episodic); the agent loop and stop conditions.
Failure handling: infinite loops, wrong tool selection, bad parameter extraction, token/budget blowups, guardrails against irreversible actions.
Single vs. multi-agent; orchestration; human-in-the-loop.
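
As a sketch of the agent loop and its stop conditions, the code below runs a tool-calling loop with a hard step budget and basic handling for unknown tools and bad arguments. The action format (`{"tool": ..., "args": ...}` or `{"final": ...}`) and the `scripted_model` stand-in are assumptions made for illustration; a real system would parse the provider's function-calling response instead.

```python
def run_agent(model, tools, task, max_steps=5):
    """Ask the model for an action, run the tool, feed the result back.

    Stops on a final answer or when the step budget runs out.
    """
    history = [("task", task)]
    for step in range(max_steps):
        action = model(history)
        if "final" in action:
            return {"answer": action["final"], "steps": step + 1, "stopped": "final"}

        tool = tools.get(action.get("tool"))
        if tool is None:
            # Wrong tool selection: report it back instead of crashing.
            history.append(("error", f"unknown tool: {action.get('tool')}"))
            continue
        try:
            observation = tool(**action.get("args", {}))
        except Exception as exc:
            # Bad parameter extraction surfaces as an observation.
            observation = f"tool error: {exc}"
        history.append(("observation", observation))

    # Budget exhausted: the guard against infinite loops.
    return {"answer": None, "steps": max_steps, "stopped": "max_steps"}

def scripted_model(history):
    """Hypothetical stand-in for an LLM: call a tool once, then answer."""
    if not any(kind == "observation" for kind, _ in history):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The sum is {history[-1][1]}"}

result = run_agent(scripted_model, {"add": lambda a, b: a + b}, "add 2 and 3")
```

Irreversible actions (payments, deletes, emails) would need an extra approval gate before the tool call; that is left out here.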

🎛️ Fine-tuning

Full vs. PEFT; LoRA / QLoRA; prefix/prompt tuning; adapters.
When to fine-tune (extreme specialization or latency) vs. default to prompt + RAG.
Catastrophic forgetting, dataset prep, key hyperparameters (LR, epochs, LoRA rank).
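
To show why LoRA is cheap, here is a minimal NumPy sketch of a LoRA-adapted linear layer: the pretrained weight stays frozen and only two low-rank matrices are trained. The dimensions, `rank`, and `alpha` values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

d_in, d_out, rank, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable, zero-initialized

def lora_forward(x):
    """Frozen path plus the scaled low-rank update (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

full_params = W.size           # 262,144 weights if fully fine-tuned
lora_params = A.size + B.size  # 8,192 trainable weights at rank 8
```

Because `B` starts at zero, the adapted layer initially matches the frozen layer exactly. In this example the adapter trains about 3% of the layer's parameters, and raising the rank buys capacity at the cost of more trainable weights.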

🚀 LLMOps / production

Serving (vLLM, continuous batching, speculative decoding, paged attention).
Prompt caching, semantic caching, streaming, structured output.
Observability: TTFT, inter-token latency, tokens/sec, per-user cost, tracing, drift.
Cost & reliability: model routing, fallbacks, rate limiting, graceful degradation, provider redundancy.
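
The fallback and provider-redundancy items above often come up as a short coding exercise. Below is a minimal sketch of retry with exponential backoff and jitter, wrapped in a provider fallback chain. The provider callables are hypothetical zero-argument functions, and catching bare `Exception` is a simplification; production code would retry only on timeouts, rate limits, and similar transient errors.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn, retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))

def call_with_fallback(providers, **retry_kwargs):
    """Try each provider in order and return the first success."""
    last_error = None
    for provider in providers:
        try:
            return call_with_retries(provider, **retry_kwargs)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

The injectable `sleep` parameter is a design choice that keeps the function testable without real waiting.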

🛡️ Safety

Prompt injection (direct/indirect), jailbreaks, data leakage, PII handling.
Input/output guardrails, content filtering, red teaming, hallucination detection.
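
As a deliberately naive illustration of input/output guardrails, the sketch below redacts email addresses and flags a couple of known injection phrases. The regex, the phrase list, and the function names are all assumptions for demonstration. Keyword matching like this is easy to bypass, so treat it as a starting point for discussion rather than a defense.

```python
import re

# Simplified email pattern; real PII detection covers many more entity types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical phrase list, for illustration only.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
)

def redact_pii(text):
    """Output guardrail: mask email addresses before text leaves the system."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def flag_possible_injection(text):
    """Input guardrail: return any suspicious phrases found in the text."""
    lowered = text.lower()
    return [phrase for phrase in SUSPICIOUS_PHRASES if phrase in lowered]
```

The same check applies to retrieved documents and tool outputs, since indirect injection arrives through content the model reads rather than through the user's message.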

💡 Depth test: interviewers value "when would you NOT use RAG?" over "what is RAG?" Every concept should come with a trade-off and a failure mode.

💻 The coding round

The role is still mostly software engineering, so DSA fundamentals are non-negotiable. Algorithm rounds appear at OpenAI, Anthropic (90-min CodeSignal requiring perfect correctness), xAI (LeetCode Hard), Eightfold, and more.

What to drill

DSA: NeetCode 150/250, focus on patterns (indexing/search/graph/tree/heap) - not memorization. Use spaced repetition.
Python depth: GIL, concurrency vs. parallelism, async patterns, race conditions, is vs ==, mutable vs. immutable, reproducible code.
SQL: for handling datasets.
Full-stack basics: many AI roles are "low-key full-stack" - expect JS event loop, database choices, message queues.

AI-flavored coding (common warm-ups)

Cosine similarity / dot product / Euclidean distance from scratch (NumPy).
A basic RAG pipeline; semantic search; chunking strategies.
A simple agent with tool use; a function-calling handler.
Retry with exponential backoff; token counting / context management; a semantic cache.
From-scratch ML (frontier labs): multi-head attention, a transformer layer, LoRA, KV cache from memory. Use shape suffixes (Noam Shazeer method) to track tensor dimensions.
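
The first warm-up in this list is worth having in muscle memory. Here is one possible NumPy version of cosine similarity and Euclidean distance, plus a brute-force top-k search built on them. The function names and the `eps` guard against zero vectors are choices made for this sketch.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(a @ b / denom)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.linalg.norm(diff))

def top_k(query, matrix, k=3, eps=1e-12):
    """Indices of the k rows of matrix most similar to query."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = matrix @ query / np.maximum(norms, eps)
    return np.argsort(-scores)[:k]
```

A common follow-up is to note that once vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is why many vector stores normalize at index time.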

Note: these rounds are often 25–35 min, no debugging.

⚠️ Modern interviewers may run AI-assisted coding rounds (solve with ChatGPT, then re-prompt when they change the problem). They're testing how you prompt, verify, and direct the tool - not whether you can code unaided.

🏗️ AI system design

This is where senior candidates win or lose. The bar isn't "name the tools" - it's end-to-end system thinking plus a clear grasp of how the system breaks.

🧱 The frame that works

Present every solution as a pipeline, then stress-test each stage:

Input → Retrieval → Generation → Verification → Feedback

For each stage, answer: how does it fail, and how would you fix it? "If you can't explain how your system breaks and how you'd fix it, you're not ready."

6 habits that impress

Lead with product & business metrics. Anchor on user value: task success, retention, latency, cost - before naming a model.
Think in lifecycles, not static pipelines. Start simple, measure, find bottlenecks, iterate. "Only add complexity where it moves metrics."
Be fluent in trade-offs. Quality vs. latency vs. cost; internal model vs. external API; retrieval depth vs. hallucination risk.
Call out failure modes proactively - hallucination, bad retrieval, prompt brittleness - and your mitigation.
Show evaluation rigor (see the Evaluation section below).
Demonstrate pragmatic judgment: "I wouldn't use an LLM here - it's overkill," "we can get 80% with a cheaper model + rules," "gate expensive calls behind a confidence threshold."

💵 Cost reasoning separates production thinkers from prototypers

Be ready to estimate on the whiteboard. Example: 100K daily users × 10 interactions × ~2K tokens = 2B tokens/day ≈ $13K/day on a premium model (which implies a blended price of roughly $6.50 per million tokens). Then talk mitigation: caching, batching, model routing, smaller models behind confidence gates.
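
The whiteboard estimate can be written as a few lines of arithmetic. In this sketch the $6.50 per million tokens is an assumed blended rate, back-solved from the article's $13K/day figure rather than taken from any provider's price list. The 30% cache hit rate is likewise a hypothetical input.

```python
def daily_token_cost(daily_users, interactions_per_user,
                     tokens_per_interaction, usd_per_million_tokens):
    """Return (tokens per day, cost per day in USD)."""
    tokens = daily_users * interactions_per_user * tokens_per_interaction
    return tokens, tokens / 1_000_000 * usd_per_million_tokens

tokens, cost = daily_token_cost(
    daily_users=100_000,
    interactions_per_user=10,
    tokens_per_interaction=2_000,
    usd_per_million_tokens=6.50,  # assumed blended input/output rate
)

# Hypothetical mitigation: a cache that serves 30% of requests
# without calling the model.
cache_hit_rate = 0.30
cost_with_cache = cost * (1 - cache_hit_rate)
```

Here `tokens` is 2,000,000,000 and `cost` is 13,000.0, which matches the figures in the example. Separating input and output token prices would be the natural next refinement, since output tokens usually cost more.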

Common prompts

Design a RAG "chat with your docs," a deep-research agent, a multi-agent support system, an LLM inference platform, a recommender, content moderation, or an AI email assistant. A good scenario starts from a real user need and leaves the solution open - practice extracting the problem and asking clarifying questions before designing.

🎣 If you get an outdated prompt (e.g., "design a fixed-context RAG chatbot" when an agentic search design fits better), it's a signal about the company - its engineers may not be current. Answer well, but read the signal.

📊 Evaluation - your biggest differentiator

Evaluation is the biggest skill gap among AI engineer candidates, which makes it your biggest edge. "Unsuccessful LLM products almost always share a common root cause: a failure to create robust evaluation systems."

What to be able to discuss

Metrics beyond accuracy: faithfulness (is it grounded?), usefulness (does it solve the user's problem?), safety (does it resist harmful inputs?).
Classic metrics & when they apply: BLEU, ROUGE, BERTScore - and their limits.
LLM-as-a-judge / G-Eval - how it works and its limitations (bias, self-preference).
RAG eval: faithfulness, answer relevance, context precision/recall (Ragas, DeepEval).
Offline vs. online: eval sets + regression suites vs. A/B tests + human-in-the-loop.
Golden datasets & continuous evaluation for catching regressions when a provider ships a new model.
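
A golden dataset plus a regression gate can be sketched in plain Python. This uses simple set-based precision and recall at k for retrieval, which is a simplification and not the Ragas or DeepEval definition of context precision/recall. The dataset shape, the `retriever` callable, and the `tolerance` value are assumptions for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Set-based precision and recall over the top-k retrieved ids."""
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def evaluate(golden_set, retriever, k=3):
    """Average retrieval metrics over a golden dataset."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    scores = [
        precision_recall_at_k(retriever(ex["query"]), set(ex["relevant"]), k)
        for ex in golden_set
    ]
    n = len(scores)
    return {
        "precision": sum(p for p, _ in scores) / n,
        "recall": sum(r for _, r in scores) / n,
    }

def passes_regression_gate(current, baseline, tolerance=0.02):
    """Fail when any metric drops more than tolerance below the baseline."""
    return all(current[m] >= baseline[m] - tolerance for m in baseline)
```

Running `evaluate` in CI against a stored baseline is one way to catch the regressions mentioned above, such as a provider model update or a changed chunking strategy.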

The "beyond just call the API" story

Professional AI engineering means building evaluation into every stage of development - from prompt iteration through production monitoring. Candidates who can articulate a complete evaluation strategy (metrics, datasets, automation, drift detection) consistently outperform those who can only describe model capabilities.
