Phase 1: Document Ingestion - The Hidden Complexity Before Embeddings
The Story Begins: Why Your Upload Button Is Just The Beginning
👦 Nephew: Uncle! I finally built my RAG system. User uploads a PDF, system finds answers. Simple, right?
👨‍🦳 Uncle: (smiles knowingly) You uploaded a PDF and got an answer?
👦 Nephew: Yes! It works!
👨‍🦳 Uncle: Did you get the right answer?
👦 Nephew: Well... sometimes. Why?
👨‍🦳 Uncle: Because between "user uploads PDF" and "system creates embeddings", there are 15 critical steps. Skip even one, and your system fails silently. You get wrong answers and don't know why.
👦 Nephew: 15 steps? I just embedded the text!
👨‍🦳 Uncle: Exactly. That's the problem. Come, let me show you what production engineers actually do.
The 15 Steps of Phase 1: Document Ingestion
👨‍🦳 Uncle: Think of it like cooking biryani. You don't just dump rice and meat, right?
👦 Nephew: No, you prepare everything first!
👨‍🦳 Uncle: Exactly. Before you cook, you:
- Wash rice
- Soak rice
- Prepare meat
- Marinate meat
- Chop onions
- ... and many more steps
Only THEN do you cook. It's the same with documents. Before embeddings, you must go through:
1. Document Upload (user action)
2. File Hashing (did we see this before?)
3. PDF Parsing (extract text)
4. Text Extraction (convert PDF → text)
5. Text Cleaning (remove junk)
6. Metadata Extraction (add context)
7. Chunking (split smartly)
8. Chunk Boundaries (don't break meaning)
9. Chunk Size (1000 tokens, not 100)
10. Overlap (context continuity)
11. Chunk Hashing (detect changes)
12. Deduplication (prevent duplicates)
13. Versioning (handle updates)
14. Incremental Ingestion (avoid re-embedding)
15. Cost Optimization (save money)
ONLY THEN:
→ Embeddings
→ Vector DB
👦 Nephew: That's a lot! Where do I start?
👨‍🦳 Uncle: With step 1. Let's go slow. The foundation must be solid.
Phase 1, Step 1-2: Document Upload & File Hashing
Why Does File Hashing Matter?
👦 Nephew: Why hash a file? Why not just upload it?
👨‍🦳 Uncle: Because humans are lazy. The HR person uploads the same PDF three times. Your system processes it three times. Three embeddings created. Three times the cost.
👦 Nephew: So hash prevents duplicates?
👨‍🦳 Uncle: Yes. But here's the trick: don't hash the filename.
👦 Nephew: Why not?
👨‍🦳 Uncle: Because someone could change the content and keep the same name. Look:
File: HR_Policy.pdf (Version 1)
Content: "30 days notice required"
Filename Hash: HR_Policy.pdf

File: HR_Policy.pdf (Version 2)
Content: "7 days notice required"
Filename Hash: HR_Policy.pdf (SAME!)
System thinks: Same file! Reality: Different policies! That's a disaster.
👦 Nephew: So we hash the content?
👨‍🦳 Uncle: Yes. The actual binary data.
HR_Policy.pdf (Version 1)
↓
[PDF binary bytes: 0xAA 0xBB 0xCC ...]
↓
SHA256
↓
A7B82C1F9D3E...

HR_Policy.pdf (Version 2)
↓
[PDF binary bytes: 0xAA 0xBB 0xDD ...] ← Different!
↓
SHA256
↓
3F9C47D1E2B0... ← Different hash!

(The digests shown are illustrative placeholders. A real SHA-256 hex digest is 64 characters long and uses only 0-9 and a-f.)

System detects: New file, process it.
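As a minimal sketch of that difference, the snippet below compares a filename-based key with a content hash using Node's built-in `crypto` module. The helper names `filenameKey` and `contentHash` and the sample byte strings are assumptions made for illustration; they are not part of the implementation that follows.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: a key derived from the filename ignores the file's content.
function filenameKey(fileName: string): string {
  return createHash('sha256').update(fileName).digest('hex');
}

// Hypothetical helper: a content hash depends only on the bytes, never on the name.
function contentHash(content: Buffer): string {
  return createHash('sha256').update(content).digest('hex');
}

// Stand-ins for the two versions of HR_Policy.pdf
const v1 = Buffer.from('30 days notice required');
const v2 = Buffer.from('7 days notice required');

// Same name, different content: the filename key cannot tell the versions apart.
const sameFilenameKey = filenameKey('HR_Policy.pdf') === filenameKey('HR_Policy.pdf');

// The content hash changes as soon as the bytes change.
const sameContentHash = contentHash(v1) === contentHash(v2);

// Identical bytes uploaded again (even under another name) still match.
const renamedCopyMatches =
  contentHash(v1) === contentHash(Buffer.from('30 days notice required'));

console.log({ sameFilenameKey, sameContentHash, renamedCopyMatches });
// prints { sameFilenameKey: true, sameContentHash: false, renamedCopyMatches: true }
```

The same property works in both directions: a changed policy is never mistaken for the old one, and a renamed copy of an old file is never processed twice.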
Step 1-2: Node.js Implementation
// src/ingestion/fileHasher.ts
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
import db from '../config/database';
import logger from '../utils/logger';

interface FileHashResult {
  fileName: string;
  fileHash: string;
  alreadyExists: boolean;
  fileSize: number;
}

/**
 * Step 1-2: Upload file and check if already processed
 *
 * ⚠️ CRITICAL POINTS:
 * 1. Hash FILE CONTENT, not filename
 * 2. Same content → same hash (deterministic)
 * 3. Different content → different hash (even if same filename)
 * 4. Check database before processing
 */
export async function handleFileUpload(
  filePath: string,
  tenantId: string
): Promise<FileHashResult> {
  try {
    const fileName = path.basename(filePath);
    const fileSize = fs.statSync(filePath).size;

    // Step 1: Read file and create hash from content
    logger.info('File upload started', { fileName, fileSize });
    const fileContent = fs.readFileSync(filePath);
    const fileHash = crypto
      .createHash('sha256')
      .update(fileContent)
      .digest('hex');
    logger.debug('File hashed', {
      fileName,
      fileHash,
      contentBytes: fileContent.length
    });

    // Step 2: Check if this exact file was already processed
    const existingFile = await db.oneOrNone(
      `SELECT id, created_at FROM documents WHERE file_hash = $1 AND tenant_id = $2`,
      [fileHash, tenantId]
    );
    if (existingFile) {
      logger.warn('Duplicate file detected', {
        fileName,
        fileHash,
        firstUploadedAt: existingFile.created_at
      });
      return { fileName, fileHash, alreadyExists: true, fileSize };
    }

    // Step 3: File is new, save metadata
    const documentRecord = await db.one(
      `INSERT INTO documents (tenant_id, file_name, file_hash, file_size, status)
       VALUES ($1, $2, $3, $4, $5) RETURNING id, created_at`,
      [tenantId, fileName, fileHash, fileSize, 'uploaded']
    );
    logger.info('New file registered', {
      documentId: documentRecord.id,
      fileName,
      fileHash: fileHash.substring(0, 12) + '...'
    });
    return { fileName, fileHash, alreadyExists: false, fileSize };
  } catch (error: any) {
    logger.error('File upload error', { error: error.message });
    throw error;
  }
}

/**
 * Verify file integrity (optional but recommended)
 * If file is corrupted, don't process it
 */
export function verifyFileIntegrity(
  filePath: string,
  expectedHash: string
): boolean {
  const fileContent = fs.readFileSync(filePath);
  const actualHash = crypto
    .createHash('sha256')
    .update(fileContent)
    .digest('hex');
  return actualHash === expectedHash;
}
// Database schema for documents table (PostgreSQL)
export const documentsTableSQL = `
  CREATE TABLE IF NOT EXISTS documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    file_name VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64) NOT NULL, -- SHA256 produces 64 char hex
    file_size BIGINT NOT NULL,
    status VARCHAR(50) DEFAULT 'uploaded', -- uploaded, parsing, parsed, chunking, chunked, embedding, complete
    error_message TEXT,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- Unique constraint: same file can't be uploaded twice for same tenant
    CONSTRAINT unique_file_per_tenant UNIQUE (tenant_id, file_hash)
  );

  -- PostgreSQL does not accept inline INDEX clauses inside CREATE TABLE,
  -- so the indexes are created as separate statements
  CREATE INDEX IF NOT EXISTS idx_file_hash ON documents (file_hash);
  CREATE INDEX IF NOT EXISTS idx_tenant_status ON documents (tenant_id, status);
`;
👦 Nephew: So if someone uploads the same PDF twice, we detect it and skip?
👨‍🦳 Uncle: Exactly. And we save the embedding cost, which is the most expensive step.
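A side note on large files: `handleFileUpload` above reads the whole file with `fs.readFileSync`, so every byte sits in memory at once. The sketch below shows one possible alternative that feeds the file to SHA-256 in fixed-size chunks and produces the same digest. The helper name `hashFileInChunks`, the chunk size, and the temporary sample file are assumptions for illustration only.

```typescript
import { createHash } from 'node:crypto';
import { closeSync, mkdtempSync, openSync, readSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: hash a file chunk by chunk so memory use stays
// bounded by chunkSize instead of growing with the file size.
function hashFileInChunks(filePath: string, chunkSize = 1024 * 1024): string {
  const hash = createHash('sha256');
  const buffer = Buffer.alloc(chunkSize);
  const fd = openSync(filePath, 'r');
  try {
    let bytesRead: number;
    // A null position reads from the current file offset; readSync returns 0 at end of file.
    while ((bytesRead = readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    closeSync(fd);
  }
  return hash.digest('hex');
}

// Usage: the chunked digest should equal the one-shot digest of the same bytes.
const demoDir = mkdtempSync(join(tmpdir(), 'hash-demo-'));
const samplePath = join(demoDir, 'sample.bin');
const sampleBytes = Buffer.from('30 days notice required'.repeat(1000));
writeFileSync(samplePath, sampleBytes);

const chunkedDigest = hashFileInChunks(samplePath, 4096);
const oneShotDigest = createHash('sha256').update(sampleBytes).digest('hex');
console.log(chunkedDigest === oneShotDigest);

rmSync(demoDir, { recursive: true, force: true });
```

Because both approaches hash the same byte sequence, the digests stored in `file_hash` stay comparable whichever one is used.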
Phase 1, Step 3: PDF Parsing - Choosing the Right Tool
👦 Nephew: Now we have the PDF. How do we extract text?
👨‍🦳 Uncle: This is where the real decision happens. There are five major tools. Each has tradeoffs.
👦 Nephew: Five?! Which one should I use?
👨‍🦳 Uncle: Depends on your documents. Let me show you.
The PDF Parsing Landscape
Simple Text PDFs
→ pdf-parse (cheap, simple)
↓
Mixed content (text + tables)
→ PDFPlumber (better)
↓
Complex documents
→ Unstructured (production-grade)
↓
Advanced documents
→ LlamaParse (state-of-art)
(tables, images, OCR)
↓
Enterprise documents
→ Azure Document Intelligence
(forms, invoices, scans)
Tool Comparison
| Tool | Best For | Cost | Speed | Table Support | OCR | Metadata | Production Ready |
|---|---|---|---|---|---|---|---|
| pdf-parse | Simple text | ₹0 (free) | ⚡⚡⚡ Fast | ❌ No | ❌ | ❌ | ⚠️ Hobby |
| PDFPlumber | Text + tables | ₹0 | ⚡⚡ Medium | ✅ Basic | ❌ | ⚠️ Limited | ⚠️ Small |
| Unstructured | Normal docs | ₹50-200/mo | ⚡ Slow | ✅ Good | ✅ Basic | ✅ Good | ✅ Yes |
| LlamaParse | Complex docs | ₹100-500/mo | ⚡ Slow | ✅ Excellent | ✅ Advanced | ✅ Excellent | ✅ Yes |
| Azure Doc Int. | Enterprise | ₹500-2000/mo | ⚡ Medium | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Enterprise |
👨‍🦳 Uncle: Let me explain each.
Tool 1: pdf-parse (Free, Simple)
// Simple approach - good for learning, bad for production
const pdf = require('pdf-parse');
const fs = require('fs');

async function extractTextFromPDF(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  console.log(data.text); // Raw text
  // Output: "Company Policy Leave Policy Employees receive..."
}
👨‍🦳 Uncle: Notice what we lost?
Original PDF:
───────────────────────────────────────
COMPANY POLICY
───────────────────────────────────────
Leave Policy (HEADING)
Employees are entitled to 24 paid leaves annually. (PARAGRAPH)
Leave Types:
- Annual Leave
- Casual Leave (LIST)
───────────────────────────────────────
pdf-parse Output:
"COMPANY POLICY Leave Policy Employees are entitled to 24 paid leaves annually. Leave Types: Annual Leave Casual Leave"
LOST:
❌ Heading level
❌ List structure
❌ Paragraph breaks
❌ Section organization
❌ Tables (if any)
👦 Nephew: So we just get text soup?
👨‍🦳 Uncle: Yes. And when you embed soup, you get soup answers.
Tool 2: PDFPlumber (Better, Still Python)
# PDFPlumber - extracts tables better
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    # start=1 so the printed page numbers match the PDF's own numbering
    for i, page in enumerate(pdf.pages, start=1):
        # Extract text
        text = page.extract_text()
        # Extract tables (if any)
        tables = page.extract_tables()
        print(f"Page {i}:")
        print(f"Text: {text}")
        print(f"Tables: {tables}")
👨‍🦳 Uncle: Better, but still loses structure. And it's Python, not Node.js.
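One way to keep row and column structure once tables have been extracted is to serialize each table as a Markdown table before it goes any further in the pipeline. The sketch below is written in TypeScript to match the rest of the pipeline and assumes the tables arrive as rows of cells where an empty cell may be `null`, which is the shape `extract_tables()` produces. The type `TableRows`, the function `tableToMarkdown`, and the sample data are hypothetical.

```typescript
// Hypothetical shape: one table is a list of rows, each row a list of cells.
type TableRows = (string | null)[][];

// Serialize a table as Markdown, treating the first row as the header.
function tableToMarkdown(rows: TableRows): string {
  if (rows.length === 0) return '';
  // Collapse whitespace, escape pipes, and turn missing cells into empty strings.
  const clean = (cell: string | null): string =>
    (cell ?? '').replace(/\s+/g, ' ').replace(/\|/g, '\\|').trim();
  const [header, ...body] = rows;
  const lines = [
    `| ${header.map(clean).join(' | ')} |`,
    `| ${header.map(() => '---').join(' | ')} |`,
    ...body.map((row) => `| ${row.map(clean).join(' | ')} |`),
  ];
  return lines.join('\n');
}

// Illustrative data only
const leaveTable: TableRows = [
  ['Leave Type', 'Days'],
  ['Annual Leave', '24'],
  ['Casual Leave', null],
];

const leaveTableMarkdown = tableToMarkdown(leaveTable);
console.log(leaveTableMarkdown);
```

Keeping the header row next to every value gives a later chunking step a chance to keep "Annual Leave" and "24" together instead of scattering them through flat text.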
Tool 3: Unstructured (Production Choice)
👨‍🦳 Uncle: This is what most companies use. It preserves structure.
// Using Unstructured via API (Node.js friendly)
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Step 3: Parse PDF using Unstructured
 *
 * IMPORTANT: Unstructured preserves document structure
 * Returns: Array of structured elements
 *
 * NOTE: The hosted API takes a multipart/form-data upload (field "files")
 * and an "unstructured-api-key" header rather than a JSON body with a
 * Bearer token. Confirm both against the current Unstructured API docs.
 */
export async function parsePDFWithUnstructured(
  filePath: string
): Promise<any[]> {
  try {
    const formData = new FormData();
    formData.append('files', fs.createReadStream(filePath));
    formData.append('strategy', 'hi_res'); // High resolution parsing
    formData.append('coordinates', 'true'); // Preserve coordinates (useful for tables)

    const response = await axios.post(
      'https://api.unstructuredapp.io/general/v0/general',
      formData,
      {
        headers: {
          'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY ?? '',
          'accept': 'application/json',
          ...formData.getHeaders()
        }
      }
    );

    // The response body is normally a JSON array of elements; fall back to
    // an "elements" property in case the payload is wrapped
    const elements: any[] = Array.isArray(response.data)
      ? response.data
      : response.data.elements;

    // Output: Structured elements (illustrative; actual type names come
    // from the API, e.g. Title, NarrativeText, ListItem)
    // [
    //   { type: "Title", text: "Leave Policy", metadata: {...} },
    //   { type: "Heading", text: "Annual Leave", metadata: {...} },
    //   { type: "Paragraph", text: "Employees are entitled...", metadata: {...} },
    //   { type: "ListItem", text: "Manager approval required", metadata: {...} }
    // ]
    logger.info('PDF parsed with structure preserved', {
      elementCount: elements.length,
      types: [...new Set(elements.map((e: any) => e.type))]
    });
    return elements;
  } catch (error: any) {
    logger.error('Unstructured parsing failed', {
      error: error.message
    });
    throw error;
  }
}
👦 Nephew: So it preserves structure?
👨‍🦳 Uncle: Yes. Look at the difference:
Unstructured Output:
[
  {
    type: "Title",
    text: "Leave Policy",
    metadata: { page_number: 1, section: "Policies" }
  },
  {
    type: "Paragraph",
    text: "Employees receive 24 leaves",
    metadata: { page_number: 1 }
  },
  {
    type: "List",
    text: "Annual Leave, Casual Leave",
    metadata: { page_number: 2, list_items: 2 }
  }
]
Now we know:
✅ What is a title
✅ What is body text
✅ What is a list
✅ Page numbers
✅ Sections
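To make the benefit concrete, here is a minimal sketch of what typed elements allow: grouping body text under the title it belongs to, so each section can later be chunked with its heading attached. The `ParsedElement` and `Section` shapes and the `groupByTitle` helper are assumptions modeled on the example output above, not the parser's actual schema.

```typescript
// Hypothetical element shape, modeled on the example output above
interface ParsedElement {
  type: string;
  text: string;
  metadata: { page_number?: number };
}

interface Section {
  title: string;
  startPage?: number;
  body: string[];
}

// Start a new section at every "Title" element and attach the elements
// that follow to that section.
function groupByTitle(elements: ParsedElement[]): Section[] {
  const sections: Section[] = [];
  let current: Section | null = null;
  for (const el of elements) {
    if (el.type === 'Title') {
      current = { title: el.text, startPage: el.metadata.page_number, body: [] };
      sections.push(current);
      continue;
    }
    if (current === null) {
      // Content that appears before the first title goes into an untitled section.
      current = { title: '(untitled)', startPage: el.metadata.page_number, body: [] };
      sections.push(current);
    }
    current.body.push(el.text);
  }
  return sections;
}

const sections = groupByTitle([
  { type: 'Title', text: 'Leave Policy', metadata: { page_number: 1 } },
  { type: 'Paragraph', text: 'Employees receive 24 leaves', metadata: { page_number: 1 } },
  { type: 'List', text: 'Annual Leave, Casual Leave', metadata: { page_number: 2 } },
]);
console.log(JSON.stringify(sections, null, 2));
```

With flat text from a simple extractor, this grouping is not possible because nothing marks where one section ends and the next begins.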
Tool 4: LlamaParse (State of the Art)
👨‍🦳 Uncle: For really complex documents, use LlamaParse.
// LlamaParse - best for complex PDFs
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse PDF with LlamaParse. Best for:
 * - Multi-column layouts
 * - Tables with merged cells
 * - Images with text
 * - Scanned documents (OCR)
 * - Footnotes and annotations
 */
export async function parsePDFWithLlamaParse(
  filePath: string
): Promise<any> {
  try {
    // Step 1: Upload file
    const formData = new FormData();
    formData.append('file', fs.createReadStream(filePath));
    formData.append(
      'parsing_instruction',
      `Extract all content including:
      - Tables with proper structure
      - Column layouts
      - Images with OCR
      - Headings and sections
      - Metadata like page numbers`
    );
    const uploadResponse = await axios.post(
      'https://api.llamaindex.ai/api/parsing/upload_file',
      formData,
      {
        headers: {
          'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`,
          ...formData.getHeaders()
        }
      }
    );
    const jobId = uploadResponse.data.id;

    // Step 2: Poll for results (up to 60 attempts, 2 seconds apart)
    let result = null;
    for (let i = 0; i < 60; i++) {
      const statusResponse = await axios.get(
        `https://api.llamaindex.ai/api/parsing/job/${jobId}/result/markdown`,
        {
          headers: {
            'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`
          },
          // axios rejects non-2xx responses by default; accept any status so
          // a "not ready yet" response leads to a retry, not an exception
          validateStatus: () => true
        }
      );
      if (statusResponse.status === 200) {
        result = statusResponse.data;
        break;
      }
      // Wait 2 seconds before retry
      await new Promise(resolve => setTimeout(resolve, 2000));
    }

    // Don't report success if the job never finished
    if (result === null) {
      throw new Error(`LlamaParse job ${jobId} did not finish within the polling window`);
    }

    logger.info('LlamaParse completed', { jobId });
    return {
      markdown: result,
      parsedAt: new Date()
    };
  } catch (error: any) {
    logger.error('LlamaParse failed', { error: error.message });
    throw error;
  }
}
👦 Nephew: When do I use LlamaParse vs Unstructured?
👨‍🦳 Uncle: Simple rule:
Document type?
├─ Simple text policies
│  └─→ pdf-parse (free)
│
├─ Text + basic tables
│  └─→ Unstructured (cheap, good)
│
├─ Complex tables, multi-column
│  └─→ LlamaParse (excellent)
│
└─ Enterprise documents, forms, invoices
   └─→ Azure Document Intelligence (best)
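The rule above can be sketched as a small routing function. The `DocumentTraits` fields and the `chooseParser` name are hypothetical, and how the traits get detected for a given upload is outside this sketch.

```typescript
// Hypothetical traits describing an uploaded document
interface DocumentTraits {
  hasTables: boolean;
  hasComplexLayout: boolean; // complex tables, multi-column pages
  isFormOrInvoice: boolean; // enterprise documents, forms, invoices
}

type Parser =
  | 'pdf-parse'
  | 'unstructured'
  | 'llamaparse'
  | 'azure-document-intelligence';

// Check the most demanding document type first, then fall through
// to the cheaper tools.
function chooseParser(traits: DocumentTraits): Parser {
  if (traits.isFormOrInvoice) return 'azure-document-intelligence';
  if (traits.hasComplexLayout) return 'llamaparse';
  if (traits.hasTables) return 'unstructured';
  return 'pdf-parse';
}

const simplePolicy = chooseParser({
  hasTables: false,
  hasComplexLayout: false,
  isFormOrInvoice: false,
});
const multiColumnReport = chooseParser({
  hasTables: true,
  hasComplexLayout: true,
  isFormOrInvoice: false,
});
console.log(simplePolicy, multiColumnReport);
// prints pdf-parse llamaparse
```

Ordering the checks from most to least demanding means a document that matches several branches is routed to the most capable tool rather than the cheapest one.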
Tool 5: Azure Document Intelligence (Enterprise)
// Azure Document Intelligence - for enterprise documents
import { DocumentAnalysisClient, AzureKeyCredential } from "@azure/ai-form-recognizer";
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse with Azure. Best for:
 * - Invoices
 * - Forms
 * - Bank documents
 * - Scanned PDFs with OCR
 */
export async function parseWithAzureDocumentIntelligence(
  filePath: string
) {
  try {
    // Fail early with a clear message if the credentials are missing
    const endpoint = process.env.AZURE_FORM_RECOGNIZER_ENDPOINT;
    const apiKey = process.env.AZURE_FORM_RECOGNIZER_KEY;
    if (!endpoint || !apiKey) {
      throw new Error('AZURE_FORM_RECOGNIZER_ENDPOINT and AZURE_FORM_RECOGNIZER_KEY must be set');
    }
    const client = new DocumentAnalysisClient(
      endpoint,
      new AzureKeyCredential(apiKey)
    );
    const fileContent = fs.readFileSync(filePath);

    // Choose model based on document type
    const poller = await client.beginAnalyzeDocument(
      "prebuilt-document",
      fileContent
    );
    const result = await poller.pollUntilDone();
    logger.info('Azure Document Intelligence parsing complete', {
      pages: result.pages?.length,
      tables: result.tables?.length,
      keyValuePairs: result.keyValuePairs?.length
    });
    return result;
  } catch (error: any) {
    logger.error('Azure Document Intelligence failed', {
      error: error.message
    });
    throw error;
  }
}