DEV Community 4h ago

Phase 1: Document Ingestion - The Hidden Complexity Before Embeddings

The Story Begins: Why Your Upload Button Is Just The Beginning

👦 Nephew: Uncle! I finally built my RAG system. User uploads a PDF, system finds answers. Simple, right?

👨‍🦳 Uncle: (smiles knowingly) You uploaded a PDF and got an answer?

👦 Nephew: Yes! It works!

👨‍🦳 Uncle: Did you get the right answer?

👦 Nephew: Well... sometimes. Why?

👨‍🦳 Uncle: Because between "user uploads PDF" and "system creates embeddings", there are 15 critical steps. Skip even one, and your system fails silently. You get wrong answers and don't know why.

👦 Nephew: 15 steps? I just embedded the text!

👨‍🦳 Uncle: Exactly. That's the problem. Come, let me show you what production engineers actually do.

The 15 Steps of Phase 1: Document Ingestion

👨‍🦳 Uncle: Think of it like cooking biryani. You don't just dump rice and meat, right?

👦 Nephew: No, you prepare everything first!

👨‍🦳 Uncle: Exactly. Before you cook, you:

Wash rice
Soak rice
Prepare meat
Marinate meat
Chop onions
... and many more steps

Only THEN do you cook. It's the same with documents. Before embeddings, you must:

Document Upload (user action)
File Hashing (did we see this before?)
PDF Parsing (extract text)
Text Extraction (convert PDF → text)
Text Cleaning (remove junk)
Metadata Extraction (add context)
Chunking (split smartly)
Chunk Boundaries (don't break meaning)
Chunk Size (1000 tokens, not 100)
Overlap (context continuity)
Chunk Hashing (detect changes)
Deduplication (prevent duplicates)
Versioning (handle updates)
Incremental Ingestion (avoid re-embedding)
Cost Optimization (save money)

ONLY THEN:
↓ Embeddings
↓ Vector DB

👦 Nephew: That's a lot! Where do I start?

👨‍🦳 Uncle: With step 1. Let's go slow. The foundation must be solid.

Phase 1, Step 1-2: Document Upload & File Hashing

Why Does File Hashing Matter?

👦 Nephew: Why hash a file? Why not just upload it?

👨‍🦳 Uncle: Because humans are lazy. The HR person uploads the same PDF three times. Your system processes it three times. Three embeddings created. Three times the cost.

👦 Nephew: So hash prevents duplicates?

👨‍🦳 Uncle: Yes. But here's the trick: don't hash the filename.

👦 Nephew: Why not?

👨‍🦳 Uncle: Because someone could change the content and keep the same name. Look:

File: HR_Policy.pdf (Version 1)
Content: "30 days notice required"
Filename Hash: HR_Policy.pdf
File: HR_Policy.pdf (Version 2)
Content: "7 days notice required"
Filename Hash: HR_Policy.pdf (SAME!)

System thinks: Same file! Reality: Different policies! That's a disaster.

👦 Nephew: So we hash the content?

👨‍🦳 Uncle: Yes. The actual binary data.

HR_Policy.pdf (Version 1)
↓
[PDF binary bytes: 0xAA 0xBB 0xCC ...]
↓
SHA256
↓
A7B82C1F9D3E...

HR_Policy.pdf (Version 2)
↓
[PDF binary bytes: 0xAA 0xBB 0xDD ...] ← Different!
↓
SHA256
↓
3F9C47D3E2B1... ← Different hash!

System detects: New file, process it.
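To make the idea concrete before the full implementation, here is a minimal sketch using Node's built-in crypto module. The helper name `hashContent` and the two sample buffers are made up for illustration; they stand in for two uploads that share a filename but carry different content.

```typescript
import { createHash } from 'crypto';

// Hash the bytes of a file's content, never its name
function hashContent(content: Buffer): string {
  return createHash('sha256').update(content).digest('hex');
}

// Two hypothetical uploads that share a filename but not content
const version1 = Buffer.from('30 days notice required');
const version2 = Buffer.from('7 days notice required');

const hash1 = hashContent(version1);
const hash2 = hashContent(version2);
const hash1Again = hashContent(Buffer.from('30 days notice required'));

console.log(hash1 === hash2);      // false: content changed, so the hash changed
console.log(hash1 === hash1Again); // true: same bytes always give the same hash
console.log(hash1.length);         // 64 hex characters
```

The filename never enters the function, so renaming a file cannot hide a change and cannot fake one either.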

Step 1-2: Node.js Implementation

// src/ingestion/fileHasher.ts
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
import db from '../config/database';
import logger from '../utils/logger';

interface FileHashResult {
  fileName: string;
  fileHash: string;
  alreadyExists: boolean;
  fileSize: number;
}

/**
 * Step 1-2: Upload file and check if already processed
 *
 * ⚠️ CRITICAL POINTS:
 * 1. Hash FILE CONTENT, not filename
 * 2. Same content → same hash (deterministic)
 * 3. Different content → different hash (even if same filename)
 * 4. Check database before processing
 */
export async function handleFileUpload(
  filePath: string,
  tenantId: string
): Promise<FileHashResult> {
  try {
    const fileName = path.basename(filePath);
    const fileSize = fs.statSync(filePath).size;

    // Step 1: Read file and create hash from content
    logger.info('File upload started', { fileName, fileSize });
    const fileContent = fs.readFileSync(filePath);
    const fileHash = crypto
      .createHash('sha256')
      .update(fileContent)
      .digest('hex');

    logger.debug('File hashed', {
      fileName,
      fileHash,
      contentBytes: fileContent.length
    });

    // Step 2: Check if this exact file was already processed
    const existingFile = await db.oneOrNone(
      `SELECT id, created_at FROM documents WHERE file_hash = $1 AND tenant_id = $2`,
      [fileHash, tenantId]
    );

    if (existingFile) {
      logger.warn('Duplicate file detected', {
        fileName,
        fileHash,
        firstUploadedAt: existingFile.created_at
      });
      return { fileName, fileHash, alreadyExists: true, fileSize };
    }

    // Step 3: File is new, save metadata
    const documentRecord = await db.one(
      `INSERT INTO documents (tenant_id, file_name, file_hash, file_size, status)
       VALUES ($1, $2, $3, $4, $5) RETURNING id, created_at`,
      [tenantId, fileName, fileHash, fileSize, 'uploaded']
    );

    logger.info('New file registered', {
      documentId: documentRecord.id,
      fileName,
      fileHash: fileHash.substring(0, 12) + '...'
    });

    return { fileName, fileHash, alreadyExists: false, fileSize };
  } catch (error: any) {
    logger.error('File upload error', { error: error.message });
    throw error;
  }
}

/**
 * Verify file integrity (optional but recommended)
 * If file is corrupted, don't process it
 */
export function verifyFileIntegrity(
  filePath: string,
  expectedHash: string
): boolean {
  const fileContent = fs.readFileSync(filePath);
  const actualHash = crypto
    .createHash('sha256')
    .update(fileContent)
    .digest('hex');
  return actualHash === expectedHash;
}

// Database schema for documents table
export const documentsTableSQL = `
  CREATE TABLE IF NOT EXISTS documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    file_name VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64) NOT NULL, -- SHA256 produces 64 char hex
    file_size BIGINT NOT NULL,
    status VARCHAR(50) DEFAULT 'uploaded', -- uploaded, parsing, parsed, chunking, chunked, embedding, complete
    error_message TEXT,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

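    -- Unique constraint: same file can't be uploaded twice for same tenant
    CONSTRAINT unique_file_per_tenant UNIQUE(tenant_id, file_hash)
  );

  -- PostgreSQL has no inline INDEX clause, so indexes are created separately
  CREATE INDEX IF NOT EXISTS idx_file_hash ON documents (file_hash);
  CREATE INDEX IF NOT EXISTS idx_tenant_status ON documents (tenant_id, status);
`;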

👦 Nephew: So if someone uploads the same PDF twice, we detect it and skip?

👨‍🦳 Uncle: Exactly. And we save the embedding cost, which is the most expensive step.
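One caveat on the hashing code: reading the whole file with `readFileSync` holds every byte in memory, which can hurt with very large PDFs. A possible alternative, sketched here under the assumption that a streamed hash is acceptable for your upload flow, is to feed the file to SHA-256 chunk by chunk. The names `hashFileStreaming` and `demo` are hypothetical and not part of the code above.

```typescript
import { createHash } from 'crypto';
import { createReadStream, mkdtempSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Hash a file chunk by chunk so memory use stays flat for large PDFs
function hashFileStreaming(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('sha256');
    const stream = createReadStream(filePath);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
    stream.on('error', reject);
  });
}

// Demo: the streamed hash matches the in-memory hash of the same bytes
async function demo(): Promise<boolean> {
  const dir = mkdtempSync(join(tmpdir(), 'hash-demo-'));
  const filePath = join(dir, 'sample.bin');
  const content = Buffer.alloc(1024 * 1024, 7); // 1 MiB of sample bytes
  writeFileSync(filePath, content);

  const streamed = await hashFileStreaming(filePath);
  const inMemory = createHash('sha256').update(content).digest('hex');
  return streamed === inMemory;
}
```

Both approaches produce the same digest for the same bytes, so the duplicate check in the database works unchanged.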

Phase 1, Step 3: PDF Parsing - Choosing the Right Tool

👦 Nephew: Now we have the PDF. How do we extract text?

👨‍🦳 Uncle: This is where the real decision happens. There are five major tools. Each has tradeoffs.

👦 Nephew: Five?! Which one should I use?

👨‍🦳 Uncle: Depends on your documents. Let me show you.

The PDF Parsing Landscape

Simple Text PDFs
→ pdf-parse (cheap, simple)
↓
Mixed content (text + tables)
→ PDFPlumber (better)
↓
Complex documents
→ Unstructured (production-grade)
↓
Advanced documents
→ LlamaParse (state-of-the-art)
  (tables, images, OCR)
↓
Enterprise documents
→ Azure Document Intelligence
  (forms, invoices, scans)

Tool Comparison

Tool	Best For	Cost	Speed	Table Support	OCR	Metadata	Production Ready
pdf-parse	Simple text	₹0 (free)	⚡⚡⚡ Fast	✗ No	✗	✗	⚠️ Hobby
PDFPlumber	Text + tables	₹0	⚡⚡ Medium	✓ Basic	✗	⚠️ Limited	⚠️ Small
Unstructured	Normal docs	₹50-200/mo	⚡ Slow	✓ Good	✓ Basic	✓ Good	✓ Yes
LlamaParse	Complex docs	₹100-500/mo	⚡ Slow	✓ Excellent	✓ Advanced	✓ Excellent	✓ Yes
Azure Doc Int.	Enterprise	₹500-2000/mo	⚡ Medium	✓ Perfect	✓ Perfect	✓ Perfect	✓ Enterprise

👨‍🦳 Uncle: Let me explain each.

Tool 1: pdf-parse (Free, Simple)

// Simple approach - good for learning, bad for production
const pdf = require('pdf-parse');
const fs = require('fs');

async function extractTextFromPDF(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  console.log(data.text); // Raw text

  // Output: "Company Policy Leave Policy Employees receive..."
}

👨‍🦳 Uncle: Notice what we lost?

Original PDF:
═══════════════════════════════════════
COMPANY POLICY
───────────────────────────────────────
Leave Policy (HEADING)
Employees are entitled to 24 paid leaves annually. (PARAGRAPH)

Leave Types:
- Annual Leave
- Casual Leave (LIST)
═══════════════════════════════════════

pdf-parse Output:
"COMPANY POLICY Leave Policy Employees are entitled to 24 paid leaves annually. Leave Types: Annual Leave Casual Leave"

LOST:
✗ Heading level
✗ List structure
✗ Paragraph breaks
✗ Section organization
✗ Tables (if any)

👦 Nephew: So we just get text soup?

👨‍🦳 Uncle: Yes. And when you embed soup, you get soup answers.

Tool 2: PDFPlumber (Better, Still Python)

# PDFPlumber - extracts tables better
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):  # number pages from 1
        # Extract text
        text = page.extract_text()

        # Extract tables (if any)
        tables = page.extract_tables()

        print(f"Page {i}:")
        print(f"Text: {text}")
        print(f"Tables: {tables}")

👨‍🦳 Uncle: Better, but still loses structure. And it's Python, not Node.js.

Tool 3: Unstructured (Production Choice)

👨‍🦳 Uncle: This is what most companies use. It preserves structure.

// Using Unstructured via API (Node.js friendly)
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Step 3: Parse PDF using Unstructured
 *
 * IMPORTANT: Unstructured preserves document structure
 * Returns: Array of structured elements
 */
export async function parsePDFWithUnstructured(
  filePath: string
): Promise<any[]> {
  try {
    // The hosted API takes a multipart upload in the `files` field rather than
    // base64 in a JSON body. Verify field and header names against the current docs.
    const formData = new FormData();
    formData.append('files', fs.createReadStream(filePath));
    formData.append('strategy', 'hi_res'); // High resolution parsing
    formData.append('coordinates', 'true'); // Preserve coordinates (useful for tables)

    const response = await axios.post(
      'https://api.unstructuredapp.io/general/v0/general',
      formData,
      {
        headers: {
          'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY ?? '',
          'Accept': 'application/json',
          ...formData.getHeaders()
        }
      }
    );

    // The response body is the array of elements itself
    const elements: any[] = response.data;

    // Output: Structured elements
    // [
    //   { type: "Title", text: "Leave Policy", metadata: {...} },
    //   { type: "Title", text: "Annual Leave", metadata: {...} },
    //   { type: "NarrativeText", text: "Employees are entitled...", metadata: {...} },
    //   { type: "ListItem", text: "Manager approval required", metadata: {...} }
    // ]

    logger.info('PDF parsed with structure preserved', {
      elementCount: elements.length,
      types: [...new Set(elements.map((e: any) => e.type))]
    });

    return elements;
  } catch (error: any) {
    logger.error('Unstructured parsing failed', {
      error: error.message
    });
    throw error;
  }
}

👦 Nephew: So it preserves structure?

👨‍🦳 Uncle: Yes. Look at the difference:

Unstructured Output:
[
  {
    type: "Title",
    text: "Leave Policy",
    metadata: { page_number: 1, section: "Policies" }
  },
  {
    type: "NarrativeText",
    text: "Employees receive 24 leaves",
    metadata: { page_number: 1 }
  },
  {
    type: "ListItem",
    text: "Annual Leave",
    metadata: { page_number: 2 }
  },
  {
    type: "ListItem",
    text: "Casual Leave",
    metadata: { page_number: 2 }
  }
]

Now we know:
✓ What is a title
✓ What is body text
✓ What is a list
✓ Page numbers
✓ Sections
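Once elements carry a type, you can rebuild sections from them. The sketch below is one possible way to do that, not something the parser provides: it assumes only that section headings arrive with type "Title" and treats every other element type as body text. The names `ParsedElement`, `Section`, and `groupIntoSections` are hypothetical.

```typescript
// Minimal shape of a parsed element; real parser output carries more fields
interface ParsedElement {
  type: string;
  text: string;
  metadata?: { page_number?: number };
}

interface Section {
  title: string;
  startPage?: number;
  body: string[];
}

// Start a new section at every Title and attach the following elements to it
function groupIntoSections(elements: ParsedElement[]): Section[] {
  const sections: Section[] = [];
  let current: Section | null = null;

  for (const el of elements) {
    if (el.type === 'Title') {
      current = { title: el.text, startPage: el.metadata?.page_number, body: [] };
      sections.push(current);
      continue;
    }
    if (!current) {
      // Content before the first title goes into an untitled section
      current = { title: '', startPage: el.metadata?.page_number, body: [] };
      sections.push(current);
    }
    current.body.push(el.text);
  }
  return sections;
}

const sample: ParsedElement[] = [
  { type: 'Title', text: 'Leave Policy', metadata: { page_number: 1 } },
  { type: 'NarrativeText', text: 'Employees receive 24 leaves', metadata: { page_number: 1 } },
  { type: 'ListItem', text: 'Annual Leave', metadata: { page_number: 2 } },
];

const sections = groupIntoSections(sample);
console.log(sections.length);         // 1
console.log(sections[0].title);       // Leave Policy
console.log(sections[0].body.length); // 2
```

With pdf-parse output this grouping is impossible, because the text soup has no element types to group on.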

Tool 4: LlamaParse (State of the Art)

👨‍🦳 Uncle: For really complex documents, use LlamaParse.

// LlamaParse - best for complex PDFs
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse PDF with LlamaParse. Best for:
 * - Multi-column layouts
 * - Tables with merged cells
 * - Images with text
 * - Scanned documents (OCR)
 * - Footnotes and annotations
 */
export async function parsePDFWithLlamaParse(
  filePath: string
): Promise<any> {
  try {
    // Step 1: Upload file
    const formData = new FormData();
    formData.append('file', fs.createReadStream(filePath));
    formData.append(
      'parsing_instruction',
      `Extract all content including:
       - Tables with proper structure
       - Column layouts
       - Images with OCR
       - Headings and sections
       - Metadata like page numbers`
    );

    const uploadResponse = await axios.post(
      'https://api.llamaindex.ai/api/parsing/upload_file',
      formData,
      {
        headers: {
          'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`,
          ...formData.getHeaders()
        }
      }
    );

    const jobId = uploadResponse.data.id;

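    // Step 2: Poll for results
    let result = null;
    for (let i = 0; i < 60; i++) {
      const statusResponse = await axios.get(
        `https://api.llamaindex.ai/api/parsing/job/${jobId}/result/markdown`,
        {
          headers: {
            'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`
          },
          // axios throws on non-2xx by default, which would abort the loop
          // while the job is still pending, so accept every status here
          validateStatus: () => true
        }
      );

      if (statusResponse.status === 200) {
        result = statusResponse.data;
        break;
      }

      // Wait 2 seconds before retry
      await new Promise(resolve => setTimeout(resolve, 2000));
    }

    // Fail loudly instead of returning null markdown after ~2 minutes
    if (result === null) {
      throw new Error(`LlamaParse job ${jobId} did not finish in time`);
    }

    logger.info('LlamaParse completed', { jobId });

    return {
      markdown: result,
      parsedAt: new Date()
    };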
  } catch (error: any) {
    logger.error('LlamaParse failed', { error: error.message });
    throw error;
  }
}
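The upload-then-poll pattern shows up with most asynchronous parsing services, so it can help to pull the loop into a reusable helper. This is a minimal sketch, assuming the caller supplies a `check` function that resolves to `null` while the job is pending; `pollUntilReady` and `fakeCheck` are hypothetical names, and the fake job below only simulates a parser.

```typescript
// Generic polling helper: retry `check` until it yields a value or attempts run out
async function pollUntilReady<T>(
  check: () => Promise<T | null>,
  maxAttempts = 60,
  delayMs = 2000
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const value = await check();
    if (value !== null) return value;
    if (attempt < maxAttempts) {
      // Wait before the next attempt, but not after the last one
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // A timeout is an error, never a silent empty result
  throw new Error(`Not ready after ${maxAttempts} attempts`);
}

// Demo with a fake job that becomes ready on the third check
let calls = 0;
const fakeCheck = async (): Promise<string | null> =>
  ++calls >= 3 ? '# parsed markdown' : null;

// Usage: const markdown = await pollUntilReady(fakeCheck, 5, 10);
```

Throwing on timeout matters here: a parser that quietly returns nothing is exactly the kind of silent failure that produces wrong answers later.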

👦 Nephew: When do I use LlamaParse vs Unstructured?

👨‍🦳 Uncle: Simple rule:

Document type?
├─ Simple text policies
│  └─→ pdf-parse (free)
│
├─ Text + basic tables
│  └─→ Unstructured (cheap, good)
│
├─ Complex tables, multi-column
│  └─→ LlamaParse (excellent)
│
└─ Enterprise documents, forms, invoices
   └─→ Azure Document Intelligence (best)
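The same rule can be written as a small routing function, which is handy when one ingestion service has to handle several document types. This is only a sketch of the decision tree above: the `DocumentTraits` fields are hypothetical flags that you would have to detect yourself or collect at upload time.

```typescript
type Parser =
  | 'pdf-parse'
  | 'Unstructured'
  | 'LlamaParse'
  | 'Azure Document Intelligence';

// Hypothetical traits you would detect or ask the uploader about
interface DocumentTraits {
  hasTables: boolean;
  hasComplexLayout: boolean; // merged cells, multi-column pages
  isFormOrInvoice: boolean;  // enterprise forms, invoices
}

// Check the most demanding case first, then fall back to cheaper tools
function chooseParser(traits: DocumentTraits): Parser {
  if (traits.isFormOrInvoice) return 'Azure Document Intelligence';
  if (traits.hasComplexLayout) return 'LlamaParse';
  if (traits.hasTables) return 'Unstructured';
  return 'pdf-parse';
}

console.log(
  chooseParser({ hasTables: true, hasComplexLayout: false, isFormOrInvoice: false })
); // Unstructured
```

Ordering the checks from most to least demanding means a document with several traits always gets the strongest tool it needs.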

Tool 5: Azure Document Intelligence (Enterprise)

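// Azure Document Intelligence - for enterprise documents
import { DocumentAnalysisClient, AzureKeyCredential } from "@azure/ai-form-recognizer";
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse with Azure. Best for:
 * - Invoices
 * - Forms
 * - Bank documents
 * - Scanned PDFs with OCR
 */
export async function parseWithAzureDocumentIntelligence(
  filePath: string
) {
  try {
    // Fail early with a clear message if configuration is missing
    const endpoint = process.env.AZURE_FORM_RECOGNIZER_ENDPOINT;
    const apiKey = process.env.AZURE_FORM_RECOGNIZER_KEY;
    if (!endpoint || !apiKey) {
      throw new Error('Azure endpoint and key must be set in the environment');
    }

    const client = new DocumentAnalysisClient(
      endpoint,
      new AzureKeyCredential(apiKey)
    );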

    const fileContent = fs.readFileSync(filePath);

    // Choose model based on document type
    const poller = await client.beginAnalyzeDocument(
      "prebuilt-document",
      fileContent
    );

    const result = await poller.pollUntilDone();

    logger.info('Azure Document Intelligence parsing complete', {
      pages: result.pages?.length,
      tables: result.tables?.length,
      keyValuePairs: result.keyValuePairs?.length
    });

    return result;
  } catch (error: any) {
    logger.error('Azure Document Intelligence failed', {
      error: error.message
    });
    throw error;
  }
}
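The Azure result reports tables as flat lists of cells, so a common next step is to rebuild rows before chunking. The sketch below assumes each table exposes `rowCount`, `columnCount`, and cells with `rowIndex`, `columnIndex`, and `content`, which is my understanding of the SDK's table shape; the local `Table` and `TableCell` interfaces and the `tableToRows` helper are hypothetical, so check them against the SDK types you have installed.

```typescript
// Local mirror of the table fields used here; the SDK's own types carry more
interface TableCell {
  rowIndex: number;
  columnIndex: number;
  content: string;
}

interface Table {
  rowCount: number;
  columnCount: number;
  cells: TableCell[];
}

// Place each cell into a rowCount x columnCount grid of strings
function tableToRows(table: Table): string[][] {
  const rows: string[][] = Array.from({ length: table.rowCount }, () =>
    Array.from({ length: table.columnCount }, () => '')
  );
  for (const cell of table.cells) {
    rows[cell.rowIndex][cell.columnIndex] = cell.content;
  }
  return rows;
}

// Sample data standing in for one parsed table
const sampleTable: Table = {
  rowCount: 2,
  columnCount: 2,
  cells: [
    { rowIndex: 0, columnIndex: 0, content: 'Leave Type' },
    { rowIndex: 0, columnIndex: 1, content: 'Days' },
    { rowIndex: 1, columnIndex: 0, content: 'Annual' },
    { rowIndex: 1, columnIndex: 1, content: '24' },
  ],
};

const tableRows = tableToRows(sampleTable);
console.log(tableRows.map((row) => row.join(' | ')).join('\n'));
```

Keeping each row together as one line of text gives the later chunking steps a unit that still means something on its own.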
