The AI Engineer Interview Playbook
TL;DR
The AI engineer role is software engineering with AI systems on top - you orchestrate models (LLMs, RAG, agents) into reliable products, not train models from scratch. Interviews test six things: ML/LLM fundamentals, applied ML, LLM/RAG engineering, coding, AI system design, and behavioral.
If you remember one thing: companies are hiring AI system builders, not people who can call an LLM API. The fastest way to stand out - think like a product + system owner, be explicit about failure modes, and show evaluation rigor. Evaluation is the single biggest skill gap among candidates, so it's your biggest opportunity.
The rest is discipline: solid DSA + Python, 2-3 deployed end-to-end projects, and the ability to explain trade-offs (quality vs. latency vs. cost) out loud.
What an AI engineer actually is
The role is new and definitions are still settling, so the first job is knowing what you're being hired for. Core responsibility: integrate AI into a product. Work with LLM providers (OpenAI, Anthropic) through their APIs, partner with PMs to find real user problems AI can solve, and ship reliably. It starts from a real problem - not "AI is cool, let's use it."
AI engineer vs. ML engineer vs. data scientist
| Role | Focus | Owns | Day-to-day |
|---|---|---|---|
| AI engineer | Building with models | Prompts, pipelines, integration | RAG, prompting, tools, agents, evals |
| ML engineer | Optimizing models | Model weights, training | Training, features, metrics |
| Data scientist | Creating models | Datasets, experiments | Requirements → ML, modeling |
The lines are blurry and the industry treats them as a spectrum. In practice, most postings are "ML engineer" or "software engineer with an AI focus." The consistent message from hiring managers: "Companies are not hiring for titles - they want to know if you can build reliable AI systems." If you can only do modeling or only do systems, you're already behind.
Progressive complexity (know where a problem sits)
- Simple: user input → prompt + LLM API → response.
- RAG (~5× harder): add data pipelines, a search engine (vector/text), retrieval, reliability.
- Agents (~10× harder): add tool calls, multi-step loops, trace instrumentation, tool-rollout management.
What AI engineers usually don't do
Create models from scratch, build custom architectures, or do heavy feature engineering. What they do: engineering best practices for AI systems, prompt design + versioning, product integration, and evaluation + monitoring.
The interview process (what to expect)
Based on analysis of real job postings and candidate reports: the median process is 4 steps, most fall in the 3-5 range, and the whole thing runs 2-6 weeks.
| Round | Typical length | What it tests |
|---|---|---|
| Recruiter / talent screen | 15-30 min | Fit, salary expectations |
| Technical / coding | 45-60 min | LeetCode-style, sometimes AI-flavored |
| AI/ML deep-dive | 45-90 min | LLMs, RAG, hallucinations, fine-tuning vs. prompting |
| Take-home / project | 1-7 days | Build a RAG or agent system |
| AI system design | 60 min | Scale LLM apps, cost/latency optimization |
| Behavioral | 30-60 min | STAR/SAIL, ownership in ambiguous work |
| Hiring manager / founder | 15-60 min | Deep dive, motivation, values |
Real loops (from candidate reports)
- Mistral AI (Applied AI Engineer): LLM theory → coding → project deep-dive → tech manager → ML system design → take-home → values talk.
- Amazon (GenAI, L6): LeetCode + practical ML coding (cosine similarity in NumPy) → SDE bar → GenAI depth (LLM/ViT architectures, fine-tuning, ROI estimation) → Leadership Principles throughout.
- Eightfold.ai (Agentic AI): AI-agent-conducted coding round → 3-day take-home to build an agent → DSA interview with EM.
- LangChain (AI Engineer): take-home (build an agent) → solution discussion → applied system design.
- PostHog: talent call → 60-min technical → co-founder call → paid full-day SuperDay (compensated real work).
- Microsoft (Applied AI/ML intern): AI-assisted coding (use ChatGPT, then re-prompt on a modified problem) → raw coding, no AI tools → behavioral.
Two trends to know: (1) in-person rounds are back (up from ~24% in 2022 to ~38% in 2025) to counter cheating; frontier labs increasingly require onsites. (2) References matter more - most top companies now want 2-3 references from recent managers.
The six question categories
Nearly every AI engineer loop draws from these six buckets. Prepare all six; weight by seniority and role.
- ML & deep learning fundamentals - bias/variance, overfitting, precision/recall, ROC, gradient descent, CNNs, transformers, BERT/GANs.
- Applied ML & infrastructure - pipelines, fine-tuning, transfer learning, FP32/FP16/BF16 trade-offs, sparse vs. dense, deployment.
- LLM engineering & RAG - tokenization, context limits, cost/latency, hallucination, embeddings, vector search, chunking, grounding, re-ranking.
- Coding / Python fundamentals - DSA (indexing/search/graph/tree/heap), Python internals (GIL, `is` vs `==`, mutable/immutable, async), SQL.
- AI system design - end-to-end pipelines, caching, cost, reliability, failure modes.
- Behavioral - ambiguity, communication, influence, AI ethics, trade-off ownership.
Focus by seniority
| Level | Emphasis |
|---|---|
| Junior / Intern | Coding fundamentals, basic ML concepts, project enthusiasm, willingness to learn |
| Mid | End-to-end system knowledge, RAG pipelines, embeddings, production awareness |
| Senior | Trade-off fluency, system design at scale, failure-mode reasoning, cost optimization |
| Staff+ | Technical leadership, cross-team influence, project presentations, org impact |
At senior/staff levels, interviewers pick 3-5 topics and drill deep into failure modes and trade-offs rather than covering many topics superficially. Depth beats breadth.
Core knowledge checklist
The must-know surface area, grouped so you can self-audit. You don't need every advanced item, but you must be fluent in the basics and have opinions backed by trade-offs.
LLM fundamentals
- Transformers: self-attention, Q/K/V, multi-head attention, positional encoding (RoPE), encoder vs. decoder vs. encoder-decoder.
- Tokenization: BPE, WordPiece/SentencePiece, why domain terms get split badly.
- Generation controls: temperature, top-p/top-k sampling, logits, context window, why the first token is slow (prefill vs. decode).
- Efficiency: KV cache, quantization (INT8/INT4, FP16/BF16), distillation, MoE, Flash Attention, GQA.
- Alignment: RLHF, DPO, instruction tuning, reward hacking, the "alignment tax."
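The generation controls above are easier to explain once you have written them out. The following is a minimal sketch of temperature scaling plus top-p (nucleus) sampling over raw logits, using only the standard library. The function names are hypothetical, and real inference stacks do this on tensors, not Python lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens.

    temperature must be > 0 (greedy decoding is usually handled as a special case).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token id from logits after temperature and top-p filtering."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A useful talking point: temperature reshapes the whole distribution, while top-p truncates its tail, so the two are complementary and not interchangeable.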
RAG (table stakes - expect deep questions)
- Architecture: chunk → embed → index → retrieve → re-rank → generate.
- Chunking strategies: fixed, recursive, semantic, parent-child. How to pick chunk size.
- Retrieval: dense vs. sparse embeddings, cosine/dot/Euclidean, ANN, hybrid search, re-ranking.
- Failure modes: hallucination despite good context, "lost in the middle," multi-hop questions, conflicting sources, stale data.
- Query transforms: HyDE, decomposition, step-back prompting.
- Citation/source attribution.
- The key trade-off: RAG vs. fine-tuning vs. prompt engineering - and when you'd NOT use RAG.
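To make the chunk → retrieve part of the pipeline concrete, here is a small sketch of fixed-size chunking with overlap and brute-force cosine-similarity retrieval. It assumes embeddings already exist as plain lists of floats (the toy vectors in the test are made up), and the helper names are illustrative, not from any library. A production system would use an embedding model and an ANN index instead of a linear scan.

```python
import math

def chunk_fixed(tokens, size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start, step = [], 0, size - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the input
        start += step
    return chunks

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the k highest-scoring (score, text) pairs.

    `index` is a list of (chunk_text, embedding) pairs, scanned linearly.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The overlap parameter is where the chunk-size trade-off shows up: more overlap protects sentences that straddle a boundary, but it inflates the index and the tokens you pay for at generation time.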
Agents
- ReAct, Plan-and-Execute, Reflection patterns; tool use / function calling; MCP.
- Agent memory (short-term, long-term, episodic); the agent loop and stop conditions.
- Failure handling: infinite loops, wrong tool selection, bad parameter extraction, token/budget blowups, guardrails against irreversible actions.
- Single vs. multi-agent; orchestration; human-in-the-loop.
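The agent loop and its stop conditions can be sketched in a few lines. This is an illustration under stated assumptions, not a framework API: `decide` stands in for the LLM call and is assumed to return either `("call", tool_name, args)` or `("final", answer)`, and `run_agent` is a hypothetical name. The sketch covers three of the failure modes listed above: a step budget, repeated-call (loop) detection, and rejection of tools that are not registered.

```python
def run_agent(decide, tools, task, max_steps=5):
    """Run a minimal tool-using loop with explicit stop conditions."""
    history = []          # (tool_name, args, result) tuples fed back to the model
    seen_calls = set()    # signatures of calls already made, for loop detection
    for step in range(max_steps):
        action = decide(task, history)
        if action[0] == "final":
            return {"status": "done", "answer": action[1], "steps": step}
        _, name, args = action
        if name not in tools:
            # Wrong tool selection: report the error and let the model retry.
            history.append((name, args, "error: unknown tool"))
            continue
        signature = (name, repr(sorted(args.items())))
        if signature in seen_calls:
            # The same call with the same arguments suggests an infinite loop.
            return {"status": "stopped", "reason": "repeated tool call", "steps": step}
        seen_calls.add(signature)
        try:
            result = tools[name](**args)
        except Exception as exc:  # bad parameters surface as tool errors
            result = f"error: {exc}"
        history.append((name, args, result))
    return {"status": "stopped", "reason": "step budget exhausted", "steps": max_steps}
```

In an interview, the point to make is that every exit path is explicit: the loop ends because the model finished, because it repeated itself, or because it ran out of budget, and each case is observable in the return value.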
Fine-tuning
- Full vs. PEFT; LoRA / QLoRA; prefix/prompt tuning; adapters.
- When to fine-tune (extreme specialization or latency) vs. default to prompt + RAG.
- Catastrophic forgetting, dataset prep, key hyperparameters (LR, epochs, LoRA rank).
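A quick way to show you understand why LoRA is cheap is to count trainable parameters. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, B (d_out × r) and A (r × d_in). The sketch below only does that arithmetic; the 4096 × 4096 layer size and rank 8 are example values, not taken from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Doubling the rank doubles the trainable parameters, which is why LoRA rank sits next to learning rate and epochs in the list of hyperparameters worth discussing.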
LLMOps / production
- Serving (vLLM, continuous batching, speculative decoding, paged attention).
- Prompt caching, semantic caching, streaming, structured output.
- Observability: TTFT, inter-token latency, tokens/sec, per-user cost, tracing, drift.
- Cost & reliability: model routing, fallbacks, rate limiting, graceful degradation, provider redundancy.
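Fallbacks and graceful degradation are simple to sketch. The example below assumes each provider is wrapped in a plain callable that takes a prompt and returns text or raises; `call_with_fallback` and the provider names are hypothetical, and no real SDK is involved.

```python
def call_with_fallback(providers, prompt, degraded_reply="Sorry, please try again shortly."):
    """Try providers in priority order; return a canned reply if all of them fail.

    `providers` is a list of (name, callable) pairs. Errors are collected so
    they can be logged or attached to a trace.
    """
    errors = []
    for name, call in providers:
        try:
            return {"provider": name, "text": call(prompt), "errors": errors}
        except Exception as exc:
            errors.append((name, str(exc)))
    # Graceful degradation: the user gets a safe answer instead of a stack trace.
    return {"provider": None, "text": degraded_reply, "errors": errors}
```

The same shape extends to model routing: order the list by cost instead of priority and only escalate to the expensive model when the cheap one fails a confidence check.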
Safety
- Prompt injection (direct/indirect), jailbreaks, data leakage, PII handling.
- Input/output guardrails, content filtering, red teaming, hallucination detection.
Depth test: interviewers value "when would you NOT use RAG?" over "what is RAG?" Every concept should come with a trade-off and a failure mode.
The coding round
The role is still mostly software engineering, so DSA fundamentals are non-negotiable. Algorithm rounds appear at OpenAI, Anthropic (90-min CodeSignal requiring perfect correctness), xAI (LeetCode Hard), Eightfold, and more.
What to drill
- DSA: NeetCode 150/250, focus on patterns (indexing/search/graph/tree/heap) - not memorization. Use spaced repetition.
- Python depth: GIL, concurrency vs. parallelism, async patterns, race conditions, `is` vs `==`, mutable vs. immutable, reproducible code.
- SQL: for handling datasets.
- Full-stack basics: many AI roles are "low-key full-stack" - expect JS event loop, database choices, message queues.
AI-flavored coding (common warm-ups)
- Cosine similarity / dot product / Euclidean distance from scratch (NumPy).
- A basic RAG pipeline; semantic search; chunking strategies.
- A simple agent with tool use; a function-calling handler.
- Retry with exponential backoff; token counting / context management; a semantic cache.
- From-scratch ML (frontier labs): multi-head attention, a transformer layer, LoRA, KV cache from memory. Use shape suffixes (Noam Shazeer method) to track tensor dimensions.
Note: these rounds are often 25-35 min, no debugging.
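Of the warm-ups listed above, retry with exponential backoff is one you should be able to write quickly. Here is one possible version with full jitter, using only the standard library. The parameter names and defaults are assumptions for illustration; `sleep` and `rng` are injectable so the function can be tested without actually waiting.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (capped), then retry.

    The wait is multiplied by rng() in [0, 1) ("full jitter") so many clients
    do not retry in lockstep. The last failure is re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
```

In a real client you would narrow `retry_on` to transient errors (timeouts, rate limits) so that bad requests fail fast instead of being retried.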
Warning: modern interviewers may run AI-assisted coding rounds (solve with ChatGPT, then re-prompt when they change the problem). They're testing how you prompt, verify, and direct the tool - not whether you can code unaided.
AI system design
This is where senior candidates win or lose. The bar isn't "name the tools" - it's end-to-end system thinking plus a clear grasp of how the system breaks.
The frame that works
Present every solution as a pipeline, then stress-test each stage:
Input → Retrieval → Generation → Verification → Feedback
For each stage, answer: how does it fail, and how would you fix it? "If you can't explain how your system breaks and how you'd fix it, you're not ready."
6 habits that impress
- Lead with product & business metrics. Anchor on user value: task success, retention, latency, cost - before naming a model.
- Think in lifecycles, not static pipelines. Start simple, measure, find bottlenecks, iterate. "Only add complexity where it moves metrics."
- Be fluent in trade-offs. Quality vs. latency vs. cost; internal model vs. external API; retrieval depth vs. hallucination risk.
- Call out failure modes proactively - hallucination, bad retrieval, prompt brittleness - and your mitigation.
- Show evaluation rigor (see the Evaluation section below).
- Demonstrate pragmatic judgment: "I wouldn't use an LLM here - it's overkill," "we can get 80% with a cheaper model + rules," "gate expensive calls behind a confidence threshold."
Cost reasoning separates production thinkers from prototypers
Be ready to estimate on the whiteboard. Example: 100K daily users × 10 interactions × ~2K tokens = 2B tokens/day ≈ $13K/day on a premium model. Then talk mitigation: caching, batching, model routing, smaller models behind confidence gates.
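The whiteboard estimate is just multiplication, and writing it as a function makes the levers obvious. In the sketch below, the blended price of $6.50 per million tokens is an assumption back-solved from the ~$13K/day figure above, not a quoted provider price, and the 30% cache hit rate is an example value.

```python
def daily_token_cost(users, interactions_per_user, tokens_per_interaction,
                     usd_per_million_tokens):
    """Return (tokens per day, USD per day) for a flat per-token price."""
    tokens = users * interactions_per_user * tokens_per_interaction
    return tokens, tokens / 1_000_000 * usd_per_million_tokens

def with_cache(cost_usd, cache_hit_rate):
    """Cost after a cache that serves `cache_hit_rate` of requests for free."""
    return cost_usd * (1 - cache_hit_rate)

tokens, cost = daily_token_cost(100_000, 10, 2_000, usd_per_million_tokens=6.50)
print(f"{tokens:,} tokens/day, about ${cost:,.0f}/day")
print(f"with a 30% cache hit rate: about ${with_cache(cost, 0.30):,.0f}/day")
```

Real pricing usually differs between input and output tokens, so a more careful estimate would split `tokens_per_interaction` into the two and price each separately.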
Common prompts
Design a RAG "chat with your docs," a deep-research agent, a multi-agent support system, an LLM inference platform, a recommender, content moderation, or an AI email assistant. A good scenario starts from a real user need and leaves the solution open - practice extracting the problem and asking clarifying questions before designing.
If you get an outdated prompt (e.g., "design a fixed-context RAG chatbot" when an agentic search design fits better), it's a signal about the company - its engineers may not be current. Answer well, but read the signal.
Evaluation - your biggest differentiator
Evaluation is the biggest skill gap among AI engineer candidates, which makes it your biggest edge. "Unsuccessful LLM products almost always share a common root cause: a failure to create robust evaluation systems."
What to be able to discuss
- Metrics beyond accuracy: faithfulness (is it grounded?), usefulness (does it solve the user's problem?), safety (does it resist harmful inputs?).
- Classic metrics & when they apply: BLEU, ROUGE, BERTScore - and their limits.
- LLM-as-a-judge / G-Eval - how it works and its limitations (bias, self-preference).
- RAG eval: faithfulness, answer relevance, context precision/recall (Ragas, DeepEval).
- Offline vs. online: eval sets + regression suites vs. A/B tests + human-in-the-loop.
- Golden datasets & continuous evaluation for catching regressions when a provider ships a new model.
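Two of the ideas above, retrieval metrics and a golden-set regression suite, fit in a short sketch. Note the simplification: this is a set-based precision/recall over chunk IDs, not the rank-aware, LLM-judged definitions that Ragas or DeepEval implement. The function names, the 0.9 threshold, and the shape of the golden set are all assumptions for illustration.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved chunks against labeled relevant ones."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def regression_gate(golden_set, system, min_pass_rate=0.9):
    """Run `system` over a golden set and decide whether a change may ship.

    `golden_set` is a non-empty list of (input, check) pairs, where
    check(output) returns True when the output is acceptable.
    Returns (passed, pass_rate, failing_inputs).
    """
    failures = [inp for inp, check in golden_set if not check(system(inp))]
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

Run the gate in CI and again whenever a provider ships a new model version; the list of failing inputs is what turns "quality dropped" into something you can debug.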
The "beyond just call the API" story
Professional AI engineering means building evaluation into every stage of development - from prompt iteration through production monitoring. Candidates who can articulate a complete evaluation strategy (metrics, datasets, automation, drift detection) consistently outperform those who can only describe model capabilities.