System Design From Zero: An Engineering Head Teaches His Nephew
DEV Community

Part 0: Why Smart Engineers Freeze in This Round

👦 Nephew: Uncle, I know Redis, MongoDB, Kafka, Docker, AWS - I've built real things. But the moment someone says "Design WhatsApp" or "Design a Notification System," my brain just... goes blank. I start talking about random technologies and it sounds unstructured.

👨‍🦳 Uncle: That's not a knowledge problem. That's a framework problem, and it's the single most common reason strong engineers fail this round. Let me show you the wrong instinct first, because you'll recognize it immediately.

The wrong mindset: the moment you hear "design X," your brain jumps straight to "which database? which cache? should I use Kafka?" - technology-first thinking.

The right mindset: "what problem are we actually solving? what's the scale? where will this break?" - architecture comes after you answer those, never before.

👦 Nephew: So the fix isn't learning more tools?

👨‍🦳 Uncle: Correct. You already know enough tools. What you're missing is a fixed sequence you run through every single time, regardless of whether it's a URL shortener, WhatsApp, Netflix, or a notification system. Master the sequence once, and you can walk into any "design X" question without ever feeling lost - because the shape of your answer never changes, only the specific boxes inside it. That's what we're building today, completely, in one sitting.

Part 1: The 12-Step Master Framework - The Spine of Everything

👨‍🦳 Uncle: Here it is. Memorize the flow, not just the numbers.

| Step | Phase | Activity |
|---|---|---|
| 1 | Understand | Understand the Problem |
| 2 | Understand | Gather Requirements |
| 3 | Plan | Estimate Scale |
| 4 | Design | Design APIs |
| 5 | Design | Design Database |
| 6 | Design | High Level Design (HLD) |
| 7 | Explain | Deep Dive Components |
| 8 | Scale | Scaling Strategy |
| 9 | Robust | Reliability |
| 10 | Observe | Monitoring |
| 11 | Protect | Security |
| 12 | Discuss | Bottlenecks & Tradeoffs |

👦 Nephew: Twelve steps feels like a lot to hold in my head under interview pressure.

👨‍🦳 Uncle: Don't hold twelve numbers. Hold three phases, each with a one-line purpose:

  • Phase 1 - UNDERSTAND (Steps 1-4): "Before you build, understand what you're building."
  • Phase 2 - DESIGN (Steps 5-8): "Now design the actual system."
  • Phase 3 - ROBUSTIFY (Steps 9-12): "Now make sure it doesn't fall apart."

👨‍🦳 Uncle: Three phases, and inside each phase, the steps follow naturally - once you understand the problem, requirements follow. Once you know requirements, scale estimation follows. Once you know scale, your API and database design write themselves, because now they're grounded in real numbers instead of guesses.

Never skip a step, and never change the order. If you skip scale estimation and jump straight to "I'll use Kafka," you've just guessed. A senior engineer never guesses when they could calculate.

Part 2: Phase 1 - Understand (Steps 1-4)

Step 1 - Understand the Problem

👦 Nephew: Interviewer says "Design a URL Shortener." What do I actually say first?

👨‍🦳 Uncle: You do not say "I'll use Redis." You ask questions, out loud, before touching architecture:

  • What's the main use case - is this for individual users, or a marketing team doing bulk campaigns?
  • What features are required versus optional?
  • Any constraints I should know about (must the short URL be a specific length, must it be guessable-resistant, etc.)?

This alone - pausing to ask instead of diving in - signals seniority before you've said a single technical word.

Step 2 - Requirements: Split Functional vs Non-Functional

👨‍🦳 Uncle: Always split these into two explicit buckets. Interviewers love this separation because it shows systematic thinking, not just listing random features.

Functional Requirements - WHAT the system does:

  • User submits a long URL
  • System generates a short URL
  • Redirecting the short URL takes you to the original
  • (Optional) Analytics on clicks
  • (Optional) Expiration after 30 days

Non-Functional Requirements - HOW WELL it should behave:

  • 99.99% availability
  • Low latency (<100ms)
  • Scalable to millions of users
  • Secure (no URL guessing)

👦 Nephew: Why does splitting them matter so much? Isn't it all just "requirements"?

👨‍🦳 Uncle: Because they drive completely different architecture decisions. Functional requirements shape your API and data model. Non-functional requirements shape your infrastructure choices - availability targets push you toward replication and multi-region thinking; latency targets push you toward caching; scale targets push you toward horizontal scaling and sharding.

Mixing them together is how candidates end up designing something that "does the right things" but "does them the wrong way" - technically correct features, but the wrong infrastructure underneath.

Step 3 - Non-Functional Requirements, One Level Deeper

👨‍🦳 Uncle: Let's go deeper here, because this is where seniority really shows. There isn't just one non-functional requirement - there's a whole family, and different systems prioritize different members of that family:

| NFR | Question It Answers | Typical Tools |
|---|---|---|
| Scalability | Can it grow 10x without breaking? | Horizontal scaling, sharding, caching |
| Availability | Is the system up? (99.99% = ~52 min downtime/year) | Replicas, failover, load balancing |
| Reliability | Can users trust the system? | Transactions, replication, retry logic |
| Consistency | Do all users see the same data? (Strong vs Eventual) | ACID DB, leader-based writes, quorum |
| Latency | How fast is the response? | Caching, CDN, database indexes |
| Throughput | How many requests/second? | Message queues, load balancing |
| Fault Tolerance | What happens if a component fails? | Replication, backup, failover |
| Durability | Does data survive crashes? | Persistent storage, backups, WAL |
| Security | Protected from attacks? | HTTPS, encryption, rate limiting, auth |
| Maintainability | Can engineers easily modify it? | Microservices, CI/CD, monitoring |
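
The "99.99% = ~52 min downtime/year" figure in the Availability row is easy to re-derive for any target. A minimal sketch, assuming a 365-day year; the function name is made up for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Each extra "nine" cuts the downtime budget by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% available -> {minutes:,.1f} minutes of downtime per year")
```

For 99.99% this gives about 52.6 minutes per year, which is the number quoted in the table.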

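Memory trick - S-A-C-L-R-F-D-S: Scalability, Availability, Consistency, Latency, Reliability, Fault tolerance, Durability, Security. The mnemonic covers eight of the ten rows; Throughput and Maintainability are the two you have to add on your own.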

👦 Nephew: Do I need all ten for every system?

👨‍🦳 Uncle: No - and knowing which ones matter most for this specific system is exactly what separates mid-level from senior. Look at this comparison:

| System | Most Important NFRs | Why |
|---|---|---|
| WhatsApp | Scalability, Availability, Latency | Billions of messages, can't afford downtime, must feel instant |
| Netflix | Availability, Scalability, Bandwidth | Millions streaming simultaneously, video needs huge bandwidth |
| Banking | Consistency, Durability, Security | Money is involved - zero tolerance for data loss or inconsistency |
| Trading Platform | Consistency, Latency, Reliability | Milliseconds matter, accuracy is everything |
| URL Shortener | Availability, Scalability, Latency | Simple functionality, but must handle massive, simple traffic |

👨‍🦳 Uncle: Saying "for a banking system, I'd prioritize strong consistency and durability over raw availability, because losing or double-counting money is unacceptable, even if it costs us a bit of uptime" - that one sentence, said out loud, tells the interviewer you actually understand tradeoffs, not just terminology.

Step 4 - Scale Estimation - The Step Everyone Skips (Don't)

👨‍🦳 Uncle: This is, without exaggeration, the single most important step in the entire framework. Most candidates skip straight from requirements to architecture. That's the mistake that separates a mid-level answer from a senior one. You cannot choose the right architecture without knowing the numbers first.

Ask these before calculating anything:

  • How many Daily Active Users (DAU)?
  • How many requests per day?
  • What's the read/write ratio?
  • How much storage per record?

Why this single step tells you almost everything:

  • Immediately tells you if you need caching
  • Tells you your database size, and whether one server can hold it
  • Tells you your read/write pattern (read-heavy? write-heavy?)
  • Tells you if a single server is even viable, or if you need to design for distribution from day one

We'll do the actual math in the next section - but understand why it comes first: everything after this step (API design, database choice, caching strategy, number of servers) is a direct consequence of these numbers, not a separate creative decision.

Part 3: Back-of-the-Envelope Calculations - The Skill That Makes You Sound Senior Instantly

👨‍🦳 Uncle: Before you ever say "Redis," "MongoDB," "Kafka," or "CDN" out loud, you should be able to answer: how many users? How many requests? How much storage? How much bandwidth? How much memory? Otherwise you're designing blind, guessing dressed up as confidence.

Memorize these cold - memory units:

| Unit | Bytes | Everyday Example |
|---|---|---|
| 1 KB | 10³ = 1,000 | Small text file |
| 1 MB | 10⁶ = 1,000,000 | A song, small image |
| 1 GB | 10⁹ = 1,000,000,000 | A movie, many songs |
| 1 TB | 10¹² = 1,000,000,000,000 | A large database |

And one time constant you'll use in nearly every calculation:

  • 1 Day = 86,400 seconds

👦 Nephew: Why 86,400 specifically? Where does that even come from?

👨‍🦳 Uncle: 24 hours × 60 minutes × 60 seconds = 86,400. Simple, but if you fumble this number live in an interview trying to do 24×60×60 in your head under pressure, it costs you composure at the worst possible moment. Just memorize it as a fact.

The Five Formulas You Must Know By Heart

Formula 1 - QPS (Queries Per Second), the most important one:

QPS = Total Requests Per Day / 86,400

Example: 100 million requests/day → 100,000,000 / 86,400 ≈ 1,157 QPS

Peak traffic rule - always design for peak, never average:

Peak QPS = Average QPS × 3 to 5

Example: 1,157 × 5 ≈ 5,785 QPS

👨‍🦳 Uncle: Why multiply by 3-5? Because average traffic hides the truth - real traffic isn't smooth, it spikes around specific hours, campaigns, or events. If you design a system that can only handle the average, the very first traffic spike takes it down. Designing for peak is designing for reality, not for a spreadsheet.
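
Formula 1 and the peak rule can be written down as two small functions. This is only a sketch: the function names are made up here, and the default peak factor of 5 is the upper end of the 3-5 rule of thumb above, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(requests_per_day: float) -> float:
    """Formula 1: spread the daily request count evenly over one day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 5.0) -> float:
    """Peak rule: scale the average by an assumed spike multiplier (3 to 5)."""
    return avg_qps * peak_factor


if __name__ == "__main__":
    avg = round(average_qps(100_000_000))  # rounded first, as in the example
    print(f"average QPS: {avg}")
    print(f"peak QPS at 3x: {peak_qps(avg, 3)}")
    print(f"peak QPS at 5x: {peak_qps(avg, 5)}")
```

With the example's 100 million requests/day this reproduces roughly 1,157 QPS on average and 5,785 QPS at a 5x peak.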

Formula 2 - Storage per day:

Storage = Records Per Day × Size Per Record

Example: 10 million new URLs/day × 500 bytes = 5 GB/day

Formula 3 - Yearly storage:

Yearly Storage = Daily Storage × 365

Example: 5 GB/day × 365 = 1.8 TB/year

👦 Nephew: Why does yearly storage matter so much? Isn't 5 GB/day tiny?

👨‍🦳 Uncle: Because a single day's number looks harmless, and that's exactly the trap. 1.8 TB/year tells you something concrete: somewhere around year 2-3, you'll likely need sharding or a distributed database, because a single machine's disk starts becoming a real constraint. This number is what lets you say, confidently, in an interview: "for year one, a single well-provisioned database is fine; by year three, we'll need to shard" - a genuinely senior-sounding sentence, backed by actual math, not vibes.

Formula 4 - Bandwidth:

Bandwidth = Peak QPS × Response Size

Example: 5,000 QPS × 2 KB = 10 MB/sec

👨‍🦳 Uncle: This tells you whether your network card, your load balancer, and your CDN strategy can actually keep up - especially critical for anything media-heavy (we'll see this explode in the Netflix example shortly).
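
Network links are usually quoted in bits per second, so it helps to convert the result of Formula 4 before comparing it to a network card. A small sketch; the 1 Gbps link is an assumed example, and the helper names are made up for illustration.

```python
def bandwidth_bytes_per_sec(peak_qps: float, response_bytes: float) -> float:
    """Formula 4: bytes leaving the system per second at peak."""
    return peak_qps * response_bytes


def to_megabits_per_sec(bytes_per_sec: float) -> float:
    """Convert bytes/sec to megabits/sec (8 bits per byte, decimal mega)."""
    return bytes_per_sec * 8 / 1_000_000


def link_utilization(bytes_per_sec: float, link_gbps: float) -> float:
    """Fraction of a link's capacity used by this traffic."""
    return to_megabits_per_sec(bytes_per_sec) / (link_gbps * 1000)


if __name__ == "__main__":
    bps = bandwidth_bytes_per_sec(5_000, 2_000)  # 5,000 QPS x 2 KB
    print(f"{bps / 1_000_000:.0f} MB/sec = {to_megabits_per_sec(bps):.0f} Mbps")
    # Assumed 1 Gbps network card, purely for comparison.
    print(f"share of a 1 Gbps link: {link_utilization(bps, 1.0):.0%}")
```

The example's 10 MB/sec is 80 Mbps, a small fraction of the assumed 1 Gbps link.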

Formula 5 - Cache memory:

Cache Memory = Hot Data Size × 1.2 (overhead factor)

Example: 10 million hot URLs × 500 bytes = 5 GB raw → × 1.2 ≈ 6 GB, so plan for a 6-8 GB Redis instance

👦 Nephew: Why the 1.2 overhead factor?

👨‍🦳 Uncle: Redis (and caches generally) don't store just your raw data - there's metadata, key overhead, internal data structure costs. The 1.2 multiplier is a practical safety margin so you don't under-provision and hit unexpected evictions the moment real traffic arrives.
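
Formula 5 as code, with a simple check against a candidate instance size. This is a sketch under the article's rule of thumb: the 1.2 factor is a safety margin, not a measured Redis overhead, and the function names and 8 GB instance size are assumptions for illustration.

```python
GB = 10**9


def cache_memory_bytes(hot_items: int, bytes_per_item: int,
                       overhead_factor: float = 1.2) -> float:
    """Formula 5: raw hot-data size scaled by an assumed overhead margin."""
    return hot_items * bytes_per_item * overhead_factor


def fits_in_instance(required_bytes: float, instance_gb: float) -> bool:
    """True if the estimate fits inside an instance of the given size."""
    return required_bytes <= instance_gb * GB


if __name__ == "__main__":
    needed = cache_memory_bytes(10_000_000, 500)
    print(f"estimated cache size: {needed / GB:.1f} GB")
    print(f"fits in a 4 GB instance: {fits_in_instance(needed, 4)}")
    print(f"fits in an 8 GB instance: {fits_in_instance(needed, 8)}")
```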

The 7 Numbers You Should Calculate, Every Single Time

  1. DAU (Daily Active Users)
  2. Total Requests/day
  3. Read Requests/day
  4. Write Requests/day
  5. Read QPS (average)
  6. Peak QPS (for capacity planning)
  7. Storage/year

👨‍🦳 Uncle: Build the habit of running through exactly these seven, in this order, for every design question, without exception - URL shortener, chat app, notification system, rate limiter, file upload, doesn't matter. This becomes muscle memory, and muscle memory is what survives interview nerves.
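
One way to build that habit is to keep the seven numbers in a single structure. The sketch below is illustrative only: the class and function names are invented, the peak factor is applied to read QPS, and yearly storage assumes every write creates exactly one stored record.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class SevenNumbers:
    dau: int
    total_requests_per_day: int
    read_requests_per_day: int
    write_requests_per_day: int
    read_qps: float
    peak_qps: float
    storage_bytes_per_year: int


def seven_numbers(dau: int, reads_per_day: int, writes_per_day: int,
                  bytes_per_record: int, peak_factor: float = 5.0) -> SevenNumbers:
    """Derive the seven planning numbers from four inputs and a peak factor."""
    read_qps = reads_per_day / SECONDS_PER_DAY
    return SevenNumbers(
        dau=dau,
        total_requests_per_day=reads_per_day + writes_per_day,
        read_requests_per_day=reads_per_day,
        write_requests_per_day=writes_per_day,
        read_qps=read_qps,
        peak_qps=read_qps * peak_factor,  # assumption: peak applied to reads
        storage_bytes_per_year=writes_per_day * bytes_per_record * 365,
    )


if __name__ == "__main__":
    # Inputs resembling a URL shortener: 10M users, 100M reads, 10M writes, 500 B.
    print(seven_numbers(10_000_000, 100_000_000, 10_000_000, 500))
```

Because the peak here is computed from the unrounded average, it comes out near 5,787 rather than the 5,785 you get by rounding the average to 1,157 first; at interview precision the two are the same answer.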

Seeing It All Together - Three Full Worked Examples

👨‍🦳 Uncle: Numbers in isolation don't teach you anything. Let's see how the same seven-number process produces completely different architectures depending on the system.

Example A - URL Shortener (a Read-Heavy System)

Assumptions: 10M users, 10M new URLs/day, 100M redirects/day, 500 bytes/record.

| Metric | Calculation | Result | Implication |
|---|---|---|---|
| Write QPS | 10M / 86,400 | ~116 QPS | One server handles this easily |
| Read QPS | 100M / 86,400 | ~1,157 QPS | Single server maxes around 1,000-5,000 QPS |
| Peak Read QPS | 1,157 × 5 | ~5,785 QPS | Need multiple servers + cache |
| Daily Storage | 10M × 500B | 5 GB/day | Small for one day |
| Yearly Storage | 5 GB × 365 | 1.8 TB/year | Sharding relevant around year 2-3 |
| Bandwidth | 5,785 × 500B | ~3 MB/sec | Trivial for a normal network |
| Cache Size | 10M × 500B (×1.2) | ~6 GB (provision 6-8 GB) | A standard Redis instance |

Architecture decision this produces: a cache-heavy system - Redis for hot URLs, PostgreSQL for reliable storage, 3-5 API servers for peak load, read replicas for scaling reads further.
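
The read path of such a cache-heavy design is commonly built as cache-aside: check the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses plain dictionaries as stand-ins for Redis and PostgreSQL so it runs without any services; the `UrlStore` class and its method names are hypothetical, not part of any library.

```python
from typing import Dict, Optional


class UrlStore:
    """Cache-aside lookup with in-memory stand-ins for the real stores."""

    def __init__(self) -> None:
        self.database: Dict[str, str] = {}  # stand-in for PostgreSQL
        self.cache: Dict[str, str] = {}     # stand-in for Redis
        self.db_reads = 0                   # counts trips to the database

    def shorten(self, code: str, long_url: str) -> None:
        """Write path: persist the mapping in the database only."""
        self.database[code] = long_url

    def resolve(self, code: str) -> Optional[str]:
        """Read path: serve from cache, fall back to the database on a miss."""
        cached = self.cache.get(code)
        if cached is not None:
            return cached
        self.db_reads += 1
        long_url = self.database.get(code)
        if long_url is not None:
            self.cache[code] = long_url  # populate so the next read is a hit
        return long_url


if __name__ == "__main__":
    store = UrlStore()
    store.shorten("abc123", "https://example.com/a/very/long/path")
    store.resolve("abc123")  # miss: reads the database, fills the cache
    store.resolve("abc123")  # hit: served from the cache
    print(f"database reads after two lookups: {store.db_reads}")
```

A real deployment would also need expiry and eviction, which this sketch leaves out; the point is only that repeated redirects for a hot URL stop reaching the database.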

Example B - WhatsApp (a Write-Heavy System)

Assumptions: 1 billion users, 1 billion messages/day, 1 KB/message.

| Metric | Calculation | Result | Implication |
|---|---|---|---|
| Write QPS | 1B / 86,400 | ~11,574 QPS | Huge - impossible for one server |
| Peak QPS | 11,574 × 5 | ~57,870 QPS | Need 10-50+ servers minimum |
| Daily Storage | 1B × 1KB | 1 TB/day | Massive |
| Yearly Storage | 1TB × 365 | 365 TB/year | A distributed database is mandatory, not optional |

Architecture decision this produces: Kafka to queue and decouple message writes, Cassandra for distributed, high-write-throughput storage, sharding by userId with consistent hashing, 50+ API servers, WebSockets for real-time delivery.
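
"Sharding by userId with consistent hashing" can be illustrated with a small hash ring. This is a teaching sketch, not how Cassandra or any particular database implements it: the class name, the shard names, and the choice of 100 virtual nodes per shard are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes: Iterable[str], virtual_nodes: int = 100) -> None:
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node, virtual_nodes)

    def add_node(self, node: str, virtual_nodes: int = 100) -> None:
        """Place several virtual points per node to even out the load."""
        for i in range(virtual_nodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, user_id: str) -> str:
        """Return the first node clockwise from the key's ring position."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        index = bisect.bisect_left(self._ring, (_hash(user_id), ""))
        if index == len(self._ring):
            index = 0  # wrap around the ring
        return self._ring[index][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
    users = [f"user-{i}" for i in range(1000)]
    before = {u: ring.node_for(u) for u in users}
    ring.add_node("shard-4")
    moved = sum(1 for u in users if ring.node_for(u) != before[u])
    print(f"users remapped after adding a shard: {moved} of {len(users)}")
```

The property worth stating in an interview is that adding a shard only moves the keys that now belong to the new shard, instead of reshuffling nearly everything as a plain `hash(userId) % N` scheme would.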

👦 Nephew: So the exact same 7-number process, on a different system, points you toward a totally different toolset - without me ever having to "guess" which technology sounds impressive?

👨‍🦳 Uncle: That's the entire point of this discipline. The numbers are the design decision. You're not choosing Kafka because it's trendy - you're choosing it because 57,870 peak QPS of writes, sustained, genuinely requires a queue-and-distribute approach; a single database simply cannot absorb that write rate directly.
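
The same peak figure also gives a rough server count. A minimal sketch, assuming the 1,000-5,000 QPS per-server range quoted in Example A; the function name is illustrative, and real per-server capacity depends heavily on the workload.

```python
import math


def servers_needed(peak_qps: float, per_server_qps: float) -> int:
    """Smallest whole number of servers that can absorb the peak."""
    return math.ceil(peak_qps / per_server_qps)


if __name__ == "__main__":
    peak = 57_870  # WhatsApp-style peak write QPS from Example B
    # Assumed capacity bounds per server, taken from Example A's rule of thumb.
    print(f"optimistic  (5,000 QPS/server): {servers_needed(peak, 5_000)} servers")
    print(f"pessimistic (1,000 QPS/server): {servers_needed(peak, 1_000)} servers")
```

That works out to somewhere between 12 and 58 servers before any redundancy, in line with the "10-50+ servers" implication in the table.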

Example C - Netflix (a Bandwidth-Heavy System)

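Assumptions: 200 million subscribers, 2 hours of streaming per user per day, 5 Mbps average bitrate for HD video (about 2.25 GB per hour of video), a catalog of 10,000 titles at roughly 5 GB each, and about 1 KB of log data per user per day.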

| Metric | Calculation | Result | Implication |
|---|---|---|---|
| Concurrent Streams | 200M × (2h / 24h) | ~16.7M concurrent | Massive CDN requirement |
| Bandwidth | 16.7M × 5 Mbps | ~83,500 Gbps | Impossible without CDN edge caching |
| Storage (catalog) | 10,000 titles × 5 GB | 50 TB | Manageable for central storage |
| Storage (daily logs) | 200M × 1 KB | 200 GB/day | Analytics pipeline needed |

Architecture decision this produces: CDN-first design with edge caching for popular content, regional caching clusters for medium-popularity content, and origin servers for long-tail content. The recommendation engine would typically run as offline batch processing of viewing patterns, for example on Spark or MapReduce.
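
The streaming numbers can be reproduced with three short functions. A sketch under the example's assumptions: viewing is spread evenly across the day (real traffic peaks in the evening, so this understates the peak), and the function names are made up for illustration.

```python
def concurrent_streams(subscribers: int, hours_per_day: float) -> float:
    """Average simultaneous viewers if viewing is spread evenly over 24 hours."""
    return subscribers * hours_per_day / 24


def total_bandwidth_gbps(streams: float, bitrate_mbps: float) -> float:
    """Aggregate egress in Gbps for a given number of streams."""
    return streams * bitrate_mbps / 1000


def gb_per_hour(bitrate_mbps: float) -> float:
    """Data volume of one hour of video at a given bitrate (decimal GB)."""
    return bitrate_mbps * 3600 / 8 / 1000


if __name__ == "__main__":
    streams = concurrent_streams(200_000_000, 2)
    print(f"concurrent streams: {streams / 1_000_000:.1f}M")
    print(f"aggregate bandwidth: {total_bandwidth_gbps(streams, 5):,.0f} Gbps")
    print(f"one hour at 5 Mbps: {gb_per_hour(5):.2f} GB")
```

Using the unrounded stream count gives about 83,300 Gbps; the table's 83,500 Gbps comes from rounding to 16.7M streams first. Either way the total is far beyond what a central origin could serve, which is what pushes the design toward CDN edge caching.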
