AWS for Newbies - Episode 2
Part 1 - S3: Storage That Isn't "A Folder on a Server"
👨‍🦳 Uncle: S3 stands for Simple Storage Service. It is object storage - a place to store files (images, PDFs, videos, backups, logs - anything as raw bytes) completely separate from any server. Unlike EC2's disk (EBS), which is tied to one specific virtual machine, S3 exists independently. Ten different servers, or zero servers, can all read and write to the same S3 storage. That's exactly the property we needed after last episode's problem - files that survive even if the server that received them is gone.
In S3's vocabulary:
- A bucket is the top-level container - think of it as one storage account/warehouse. Bucket names must be globally unique across all of AWS, not just your account.
- An object is a single stored file, along with its metadata (size, content type, upload date, permissions).
- A key is the object's full path-like name inside the bucket, e.g. images/profile-abc123.jpg.
👦 Nephew: So a "key" is basically the file path?
👨‍🦳 Uncle: Functionally, yes - but here's a fact that surprises almost everyone: S3 doesn't actually have real folders. It's a flat structure of objects, each with a long key string. When you see images/profile.jpg displayed with a little folder icon in the AWS console, that's the console being helpful - it's just splitting the key string on the / character and drawing a folder illustration for you. Underneath, there is no such thing as an "images folder" object. It's purely a naming convention, called a prefix.
👦 Nephew: Wait, that actually matters for what I want to do - separating PDFs, text files, and images into their own "folders."
👨‍🦳 Uncle: It matters a lot, and it's good news - it means organizing by file type costs you nothing extra. You just design your key naming convention deliberately, and S3 will happily group them for you visually and let you list/filter by prefix efficiently.
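To make the "flat keys, pretend folders" idea concrete, here is a minimal sketch that imitates prefix-and-delimiter listing over a plain array of keys. It is an in-memory illustration only, not an AWS SDK call; the function name listByPrefix and the sample keys are made up for this example. It behaves roughly the way S3's list operation does when given a prefix and a "/" delimiter.

```javascript
// In-memory imitation of prefix/delimiter listing over a flat key space.
// NOT an AWS SDK call; the keys below are made-up sample data.
function listByPrefix(keys, prefix = "", delimiter = "/") {
  const objects = [];
  const commonPrefixes = new Set();

  for (const key of keys) {
    if (!key.startsWith(prefix)) continue;

    // Look for the next delimiter after the prefix.
    const rest = key.slice(prefix.length);
    const cut = rest.indexOf(delimiter);

    if (cut === -1) {
      objects.push(key); // sits directly under the prefix
    } else {
      // Everything up to and including the delimiter is what gets drawn as a "folder".
      commonPrefixes.add(prefix + rest.slice(0, cut + delimiter.length));
    }
  }

  return { objects: objects.sort(), commonPrefixes: [...commonPrefixes].sort() };
}

const sampleKeys = [
  "images/a1.jpg",
  "images/b2.jpg",
  "documents/c3.pdf",
  "text-files/d4.txt",
  "readme.txt",
];

const root = listByPrefix(sampleKeys);
const images = listByPrefix(sampleKeys, "images/");
console.log(root);   // one object plus three "folders"
console.log(images); // the two image keys, no sub-"folders"
```

Nothing in the data changes between the two calls; the "folders" exist only in how the key strings are sliced.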
Part 2 - Designing the Key Structure (Folder-Wise Organization)
👨‍🦳 Uncle: Let's design it properly instead of winging it. A clean, type-separated key convention looks like this:
documents/{hash}.pdf
images/{hash}.jpg
text-files/{hash}.txt
Or, if you also want per-user isolation (very common in real apps):
users/{userId}/images/{hash}.jpg
users/{userId}/documents/{hash}.pdf
👦 Nephew: Why the hash instead of the original filename, like resume.pdf?
👨‍🦳 Uncle: Three solid reasons. First - two different users might both upload a file called resume.pdf; using the raw filename risks silent overwrites unless you're careful. Second - filenames can contain characters that misbehave in URLs. Third, and most important for today's topic: the hash is how we detect duplicate files. Which brings us to the real meat of today's lesson.
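As a rough sketch of that convention, the helper below builds a key from the content hash and never from the user's filename. The name buildObjectKey, its parameters, and the validation rules are assumptions made for illustration, not something prescribed by S3.

```javascript
// Hypothetical helper: builds an S3 key from the content hash, never from the
// user-supplied filename. The type-to-folder map mirrors the convention above.
const KEY_FOLDERS = { pdf: "documents", image: "images", text: "text-files" };

function buildObjectKey({ fileType, hash, extension, userId }) {
  const folder = KEY_FOLDERS[fileType];
  if (!folder) throw new Error(`Unsupported file type: ${fileType}`);

  // A SHA-256 digest written in hex is exactly 64 characters from [0-9a-f].
  if (!/^[0-9a-f]{64}$/.test(hash)) throw new Error("Invalid SHA-256 hash");

  // Keep extensions boring so nothing odd ends up in a URL.
  if (!/^[a-z0-9]{1,8}$/.test(extension)) throw new Error("Invalid extension");

  const base = `${folder}/${hash}.${extension}`;
  if (userId === undefined) return base;

  // Assumes numeric user ids; adjust if your ids are UUIDs or strings.
  if (!Number.isInteger(userId) || userId <= 0) throw new Error("Invalid user id");
  return `users/${userId}/${base}`;
}

const exampleHash = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08";
console.log(buildObjectKey({ fileType: "image", hash: exampleHash, extension: "jpg" }));
console.log(buildObjectKey({ fileType: "pdf", hash: exampleHash, extension: "pdf", userId: 42 }));
```

Because the key is derived only from validated pieces, a hostile filename never reaches S3 at all.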
Part 3 - Deduplication: Don't Store the Same File Twice
👨‍🦳 Uncle: Imagine 500 users all upload the exact same company logo, or the same PDF brochure gets re-uploaded across 50 different form submissions. Without deduplication, you're paying S3 storage costs for 500 identical copies of the same bytes. We fix this with content hashing.
👦 Nephew: Meaning?
👨‍🦳 Uncle: SHA-256 is a cryptographic hash function - an algorithm that takes any input (in our case, a file's raw bytes) and produces a fixed-length 256-bit value, usually written as a 64-character hexadecimal string (called a hash or digest), that works as a practically unique fingerprint of that exact content. Two important properties matter to us:
- The same file content always produces the exact same hash, no matter who uploads it or what they named it.
- Even a single-bit difference in the file produces a completely different hash. So it's not "similar files get similar hashes" - it's "identical content, identical hash; anything else, unrelated-looking hash."
The chance of two genuinely different files accidentally producing the same SHA-256 hash is astronomically small - small enough that content-addressed systems across the software industry rely on this property daily. Git is built on the same idea (historically with SHA-1, with SHA-256 available as a newer object format).
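Both properties are easy to check with Node's built-in crypto module. This is a small sketch; sha256Hex is a helper name chosen for the example, and the input strings stand in for file bytes.

```javascript
const crypto = require("crypto");

// Helper for this demo: SHA-256 of a string or Buffer, as lowercase hex.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

const first = sha256Hex("test");
const again = sha256Hex("test"); // same content, hashed a second time
const changed = sha256Hex("Test"); // one character differs

console.log(first.length);      // 64
console.log(first === again);   // true
console.log(first === changed); // false
```

The filename never enters the calculation, which is why two users uploading the same bytes under different names end up with the same hash.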
👦 Nephew: So the plan is: compute the hash, and if we've seen that hash before, don't store the file again?
👨‍🦳 Uncle: Exactly. Let's build it.
3.1 - Computing a SHA-256 hash in Node.js
Node has hashing built into its standard library - no extra package needed. Here's the core building block, written to handle large files efficiently by streaming the file instead of loading the whole thing into memory at once:
const crypto = require("crypto");
const fs = require("fs");

function hashFileStream(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash("sha256");
    const stream = fs.createReadStream(filePath);
    stream.on("data", (chunk) => hash.update(chunk)); // feed each chunk into the hash
    stream.on("end", () => resolve(hash.digest("hex"))); // 64 hex characters
    stream.on("error", reject);
  });
}

// Usage:
// const fileHash = await hashFileStream("/tmp/uploaded-file.pdf");
// e.g. "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
👦 Nephew: Why stream it instead of just crypto.createHash('sha256').update(buffer).digest('hex') on the whole file at once?
👨‍🦳 Uncle: Because if someone uploads a 200 MB video file, loading the entire thing into memory just to hash it can spike your server's RAM and slow everything else down - especially if ten uploads happen at once. Streaming reads the file in small chunks, feeds each chunk into the hash calculation, and never holds the whole file in memory. Small files barely notice the difference; large files, it's the difference between a smooth server and a crashed one.
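To see that chunked hashing gives the same answer as hashing everything at once, here is a sketch that reads a file through one fixed-size buffer. It is deliberately synchronous so it is easy to follow and run; it mimics what a read stream does internally and is not meant to replace the async stream version in a server. The function name hashFileInChunks, the 64 KB chunk size, and the temp file are all choices made for this example.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const os = require("os");
const path = require("path");

// Synchronous, chunk-at-a-time hashing: only one fixed-size buffer is ever
// held in memory, no matter how large the file is.
function hashFileInChunks(filePath, chunkSize = 64 * 1024) {
  const hash = crypto.createHash("sha256");
  const buffer = Buffer.alloc(chunkSize);
  const fd = fs.openSync(filePath, "r");
  try {
    let bytesRead;
    // position = null means "continue from the current file position"
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}

// Demo on a throwaway 1 MB temp file (the path is generated, not meaningful).
const tmpFile = path.join(os.tmpdir(), `hash-demo-${process.pid}.bin`);
fs.writeFileSync(tmpFile, Buffer.alloc(1024 * 1024, "x"));

const chunked = hashFileInChunks(tmpFile);
const oneShot = crypto.createHash("sha256").update(fs.readFileSync(tmpFile)).digest("hex");
fs.unlinkSync(tmpFile);

console.log(chunked === oneShot); // true
```

The digest depends only on the bytes and their order, not on how they were sliced into chunks, so chunk size is purely a memory and throughput knob.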
3.2 - Checking for duplicates before storing
Now, the hash alone is only useful if you remember which hashes you've already stored. That's a job for your database, not S3 itself.
CREATE TABLE files (
  id          SERIAL PRIMARY KEY,
  sha256_hash VARCHAR(64) NOT NULL UNIQUE,
  s3_key      TEXT NOT NULL,
  file_type   VARCHAR(20) NOT NULL, -- 'pdf' | 'image' | 'text'
  size_bytes  BIGINT NOT NULL,
  uploaded_by INTEGER REFERENCES users(id),
  created_at  TIMESTAMP DEFAULT now()
);
Notice sha256_hash has a UNIQUE constraint. That single line is doing a lot of work - even if two upload requests race each other at the exact same millisecond, the database itself will reject the second insert of the same hash, so you can't accidentally create a duplicate even under concurrent load.
The check-then-act flow in code:
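async function findOrRegisterFile(fileHash, fileType, sizeBytes, extension, userId) {
  // 1. Have we already stored this exact content?
  const existing = await db.query(
    "SELECT s3_key FROM files WHERE sha256_hash = $1",
    [fileHash]
  );
  if (existing.rows.length > 0) {
    // Duplicate! Don't upload again - just reuse the existing object.
    return { isDuplicate: true, s3Key: existing.rows[0].s3_key };
  }

  // 2. New file - decide its key, based on type, using the hash itself
  const folder = { pdf: "documents", image: "images", text: "text-files" }[fileType];
  if (!folder) throw new Error(`Unsupported file type: ${fileType}`);
  const s3Key = `${folder}/${fileHash}.${extension}`;

  // 3. Insert, letting the UNIQUE constraint settle any race with a concurrent upload.
  //    With PostgreSQL's ON CONFLICT DO NOTHING, zero rows come back if another
  //    request inserted the same hash first, instead of an error being thrown.
  const inserted = await db.query(
    `INSERT INTO files (sha256_hash, s3_key, file_type, size_bytes, uploaded_by)
     VALUES ($1, $2, $3, $4, $5)
     ON CONFLICT (sha256_hash) DO NOTHING
     RETURNING s3_key`,
    [fileHash, s3Key, fileType, sizeBytes, userId]
  );
  if (inserted.rows.length === 0) {
    // Lost the race: the other request's row is the one that counts.
    const winner = await db.query("SELECT s3_key FROM files WHERE sha256_hash = $1", [fileHash]);
    return { isDuplicate: true, s3Key: winner.rows[0].s3_key };
  }
  return { isDuplicate: false, s3Key };
}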
👦 Nephew: So if it's a duplicate, we just... don't touch S3 at all? We just point the new "upload" record at the old object?
👨‍🦳 Uncle: Exactly right. The user experience looks identical - "your file uploaded successfully" - but behind the scenes, you saved storage cost, saved upload bandwidth, and saved processing time (if you were going to compress/resize it), all because the bytes were already sitting in S3 from someone else's earlier upload.
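The whole idea can be tried without a database or S3. In the sketch below a Map stands in for the files table, purely for illustration; registerUpload and the sample byte strings are invented for this demo, and a real app still needs the database's UNIQUE constraint to stay correct under concurrency.

```javascript
const crypto = require("crypto");

// In-memory stand-in for the `files` table: hash -> S3 key.
// Illustration only; it has none of the database's concurrency guarantees.
const registry = new Map();
const TYPE_FOLDERS = { pdf: "documents", image: "images", text: "text-files" };

function registerUpload(bytes, fileType, extension) {
  const fileHash = crypto.createHash("sha256").update(bytes).digest("hex");

  if (registry.has(fileHash)) {
    // Same bytes seen before: hand back the existing key, store nothing new.
    return { isDuplicate: true, s3Key: registry.get(fileHash) };
  }

  const s3Key = `${TYPE_FOLDERS[fileType]}/${fileHash}.${extension}`;
  registry.set(fileHash, s3Key);
  return { isDuplicate: false, s3Key };
}

const logo = Buffer.from("pretend these are the logo's bytes");
const firstUpload = registerUpload(logo, "image", "png");
const secondUpload = registerUpload(Buffer.from(logo), "image", "png"); // same bytes, separate upload
const otherUpload = registerUpload(Buffer.from("different bytes"), "image", "png");

console.log(firstUpload.isDuplicate, secondUpload.isDuplicate, otherUpload.isDuplicate);
console.log(firstUpload.s3Key === secondUpload.s3Key);
```

Two uploads of identical bytes resolve to one key, while different bytes get their own entry.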
Part 4 - Enforcing File Size (At Every Layer, Not Just One)
👨‍🦳 Uncle: Here's a mistake I see constantly: a developer checks file size once, on the frontend, in JavaScript, and calls it done. That check is trivially bypassed - anyone can call your API directly with curl or Postman, skipping your frontend entirely. Real file-size enforcement is layered, the same "reject bad traffic as early as possible" principle from our security-groups discussion.
- Layer 1 - Client-side (UX only, not security): Reject obviously oversized files before even starting an upload, so the user gets instant feedback instead of waiting for a slow upload to fail.
- Layer 2 - Backend validation, before generating any upload permission:
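const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5 MB

function validateFileSize(sizeBytes) {
  // Reject missing, non-numeric, or non-positive sizes before comparing to the limit
  if (!Number.isInteger(sizeBytes) || sizeBytes <= 0) {
    const err = new Error("Invalid file size");
    err.statusCode = 400;
    throw err;
  }
  if (sizeBytes > MAX_SIZE_BYTES) {
    const err = new Error("File exceeds the 5MB limit");
    err.statusCode = 413; // "Payload Too Large" - the correct HTTP status for this
    throw err;
  }
}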
- Layer 3 - Enforced by S3 itself, at the moment of upload, via the presigned request's conditions. This is the layer most beginners don't know exists - and it's the one that actually matters for direct-to-S3 uploads, because your backend never sees the file bytes in that flow, so the size reported to layer 2 can simply be a lie. Note that a size-range condition belongs to S3's presigned POST policies; a plain presigned PUT URL cannot carry one. We'll wire this up properly in Part 7.
- Layer 4 - Load balancer / API Gateway request size limits, as a blunt outer boundary against absurdly oversized requests hitting your infrastructure at all.
👦 Nephew: So the frontend check is basically just "be nice to the user," and the real enforcement happens server-side and inside the S3 request itself.
👨‍🦳 Uncle: Correctly understood.
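For a feel of what Layer 3 looks like, here is a sketch of the conditions list that goes into an S3 POST policy. It only builds plain data and makes no AWS call. The function name buildUploadConditions and the bucket, key, and content type values are placeholders. One assumption worth stating: actually signing such a policy would typically be done with createPresignedPost from the @aws-sdk/s3-presigned-post package, which is a separate install.

```javascript
const MAX_UPLOAD_BYTES = 5 * 1024 * 1024; // 5 MB, same limit as Layer 2

// Builds the "conditions" list of an S3 POST policy. S3 evaluates these at
// upload time and rejects the request if any one of them fails.
function buildUploadConditions({ bucket, key, contentType, maxBytes = MAX_UPLOAD_BYTES }) {
  return [
    { bucket },                            // must target this bucket
    ["eq", "$key", key],                   // must use exactly this key
    ["eq", "$Content-Type", contentType],  // must declare this content type
    ["content-length-range", 1, maxBytes], // body must be 1..maxBytes bytes
  ];
}

// Placeholder values for illustration only.
const conditions = buildUploadConditions({
  bucket: "my-app-uploads",
  key: "images/example-hash.jpg",
  contentType: "image/jpeg",
});
console.log(JSON.stringify(conditions, null, 2));

// Assumed usage (not executed here): pass a list like this as the `Conditions`
// option of createPresignedPost, then have the browser submit the returned
// form fields together with the file.
```

The useful property is that the limit travels inside the signed policy, so a client that lied about sizeBytes to your API still gets rejected by S3 when the real bytes arrive.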
Part 5 - Permissions: Locking S3 Down Properly
👨‍🦳 Uncle: Now let's get the access model right, because this is where careless setups leak private user files to the entire internet - a mistake that's made the news more than once.
5.1 - Block Public Access (keep it ON)
When you create the bucket, AWS shows a setting called "Block all public access." Leave it enabled. With this on, no object in the bucket can be made public by accident - not through a misconfigured bucket policy, not through an ACL mistake, nothing. Any such attempt is rejected or ignored by S3 instead of quietly taking effect. Public exposure should always be an intentional, narrow exception (which we handle via presigned URLs, or a CDN in front - future episode), never the default state of the bucket.
5.2 - IAM Policies: deciding what your backend/app can do
Access to S3 is controlled through IAM policies - JSON documents describing which actions are allowed on which resources. The key actions you'll use:
| Action | What it allows |
|---|---|
| s3:PutObject | Uploading (writing) a new object |
| s3:GetObject | Downloading (reading) an object |
| s3:DeleteObject | Deleting an object |
| s3:ListBucket | Listing what objects exist in the bucket |
A properly scoped policy for your backend's role looks like this - notice it's restricted to one specific bucket, not "all S3 buckets everywhere":
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}
That Resource line, with the /* at the end, means "any object key inside the my-app-uploads bucket" - not the bucket's own settings, not other buckets. This is the Principle of Least Privilege again, applied to a service instead of a person.
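The difference between "objects in the bucket" and "the bucket itself" shows up as soon as you need listing. The sketch below generates the same policy for any bucket name and optionally adds s3:ListBucket, which is granted on the bucket ARN without the /* suffix. buildBackendPolicy and its allowList option are hypothetical names for this example; whether your backend needs listing at all is your call.

```javascript
// Hypothetical helper that builds a least-privilege policy for one bucket.
function buildBackendPolicy(bucketName, { allowList = false } = {}) {
  const statements = [
    {
      Effect: "Allow",
      Action: ["s3:PutObject", "s3:GetObject"],
      Resource: `arn:aws:s3:::${bucketName}/*`, // objects inside the bucket
    },
  ];

  if (allowList) {
    statements.push({
      Effect: "Allow",
      Action: ["s3:ListBucket"],
      Resource: `arn:aws:s3:::${bucketName}`, // the bucket itself, no /*
    });
  }

  return { Version: "2012-10-17", Statement: statements };
}

const basicPolicy = buildBackendPolicy("my-app-uploads");
const listingPolicy = buildBackendPolicy("my-app-uploads", { allowList: true });
console.log(JSON.stringify(listingPolicy, null, 2));
```

Keeping the listing permission in its own statement makes it easy to leave out, which is usually the safer default.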
5.3 - Attach this via an IAM Role, not access keys
If your backend runs on EC2 (from Episode 1) or Lambda, attach this permission through an IAM Role, exactly like we discussed last time - never paste an Access Key and Secret Key into your .env file for this. The role gives your running server temporary, auto-rotating credentials behind the scenes, with zero secrets to leak.
Part 6 - The Node.js Setup
👨‍🦳 Uncle: Let's install what we need. AWS's modern JavaScript SDK is modular - you install just the pieces you need, not one giant package.
npm install @aws-sdk/client-s3 @aws-sdk/s3-request-presigner
Set up the client once, and reuse it everywhere:
// s3Client.js
const { S3Client } = require("@aws-sdk/client-s3");

const s3 = new S3Client({
  region: "ap-south-1", // keep this the same region as your EC2/RDS - lower latency, lower cost
});

module.exports = s3;
Notice - no access key, no secret key in this code. If this runs on an EC2 instance with the IAM role from Part 5 attached, the SDK automatically discovers and uses those temporary credentials. This is the payoff of setting up roles properly back in Episode 1.
Part 7 - The Full End-to-End Upload Flow
👨‍🦳 Uncle: Now let's assemble everything - hashing, deduplication, size limits, folder-wise keys, and permissions - into one coherent request flow. This is the production pattern, not the toy version.
- Client picks a file → computes size (and optionally a hash) locally
- Client calls POST /api/uploads/request-url with { fileName, fileType, sizeBytes, sha256Hash }
- Backend:
  a. Validates the user is authenticated & rate-limit not exceeded
  b. Validates sizeBytes <= 5MB (Layer 2 size check)
  c. Checks sha256Hash against the database:
     - If it exists already: respond immediately, "already uploaded", return the existing s3Key. No S3 call needed at all.
     - If new: build the S3 key using the folder-by-type convention and the hash, e.g. images/9f86d0...jpg
  d. Generates a PRESIGNED URL for that exact key, with:
     - A Content-Length-Range condition (enforces size at the S3 level)
     - A Content-Type condition (enforces file type at the S3