Phase 1: Document Ingestion - The Hidden Complexity Before Embeddings
DEV Community

The Story Begins: Why Your Upload Button Is Just The Beginning

👦 Nephew: Uncle! I finally built my RAG system. User uploads a PDF, system finds answers. Simple, right?

👨‍🦳 Uncle: (smiles knowingly) You uploaded a PDF and got an answer?

👦 Nephew: Yes! It works!

👨‍🦳 Uncle: Did you get the right answer?

👦 Nephew: Well... sometimes. Why?

👨‍🦳 Uncle: Because between "user uploads PDF" and "system creates embeddings", there are 15 critical steps. Skip even one, and your system fails silently. You get wrong answers and don't know why.

👦 Nephew: 15 steps? I just embedded the text!

👨‍🦳 Uncle: Exactly. That's the problem. Come, let me show you what production engineers actually do.

The 15 Steps of Phase 1: Document Ingestion

👨‍🦳 Uncle: Think of it like cooking biryani. You don't just dump rice and meat, right?

👦 Nephew: No, you prepare everything first!

👨‍🦳 Uncle: Exactly. Before you cook, you:

  • Wash rice
  • Soak rice
  • Prepare meat
  • Marinate meat
  • Chop onions
  • ... and many more steps

Only THEN do you cook. The same goes for documents. Before embeddings, you must:

  1. Document Upload (user action)
  2. File Hashing (did we see this before?)
  3. PDF Parsing (extract text)
  4. Text Extraction (convert PDF → text)
  5. Text Cleaning (remove junk)
  6. Metadata Extraction (add context)
  7. Chunking (split smartly)
  8. Chunk Boundaries (don't break meaning)
  9. Chunk Size (1000 tokens, not 100)
  10. Overlap (context continuity)
  11. Chunk Hashing (detect changes)
  12. Deduplication (prevent duplicates)
  13. Versioning (handle updates)
  14. Incremental Ingestion (avoid re-embedding)
  15. Cost Optimization (save money)

ONLY THEN:
↓ Embeddings
↓ Vector DB
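
One way to keep a skipped or failing step from going unnoticed is to run the steps as an explicit, ordered list of stages. The following is a minimal sketch under stated assumptions: `IngestionContext`, `Stage`, `runStages`, and the placeholder stage names are hypothetical and not part of the original pipeline.

```typescript
// Hypothetical stage runner: all names and shapes here are illustrative assumptions.
type IngestionContext = { documentId: string; completed: string[] };
type Stage = { name: string; run: (ctx: IngestionContext) => IngestionContext };

// Runs stages in order. A stage that throws stops the pipeline with an error
// that names the stage, so a failure cannot pass silently.
function runStages(stages: Stage[], initial: IngestionContext): IngestionContext {
  let ctx = initial;
  for (const stage of stages) {
    try {
      ctx = stage.run(ctx);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`Stage "${stage.name}" failed: ${reason}`);
    }
    // Record the stage as completed only after it returned successfully
    ctx = { ...ctx, completed: [...ctx.completed, stage.name] };
  }
  return ctx;
}

// Placeholder stages standing in for the real steps (hash, parse, clean, chunk)
const noop = (name: string): Stage => ({ name, run: (ctx) => ctx });
const stages: Stage[] = [noop('hash'), noop('parse'), noop('clean'), noop('chunk')];

const finished = runStages(stages, { documentId: 'doc-1', completed: [] });
console.log(finished.completed);
```

Recording completed stage names on the context makes it easy to see how far a document got before something went wrong.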

👦 Nephew: That's a lot! Where do I start?

👨‍🦳 Uncle: With step 1. Let's go slow. The foundation must be solid.

Phase 1, Step 1-2: Document Upload & File Hashing

Why Does File Hashing Matter?

👦 Nephew: Why hash a file? Why not just upload it?

👨‍🦳 Uncle: Because humans are lazy. The HR person uploads the same PDF three times. Your system processes it three times. Three embeddings created. Three times the cost.

👦 Nephew: So hash prevents duplicates?

👨‍🦳 Uncle: Yes. But here's the trick: don't hash the filename.

👦 Nephew: Why not?

👨‍🦳 Uncle: Because someone could change the content and keep the same name. Look:

  • File: HR_Policy.pdf (Version 1)

  • Content: "30 days notice required"

  • Filename Hash: HR_Policy.pdf

  • File: HR_Policy.pdf (Version 2)

  • Content: "7 days notice required"

  • Filename Hash: HR_Policy.pdf (SAME!)

System thinks: Same file! Reality: Different policies! That's a disaster.

👦 Nephew: So we hash the content?

👨‍🦳 Uncle: Yes. The actual binary data.

HR_Policy.pdf (Version 1)
↓
[PDF binary bytes: 0xAA 0xBB 0xCC ...]
↓
SHA256
↓
A7B82C1F9D3E...

HR_Policy.pdf (Version 2)
↓
[PDF binary bytes: 0xAA 0xBB 0xDD ...] ← Different!
↓
SHA256
↓
5E0D47C3F2A1... ← Different hash!

System detects: New file, process it.
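
The idea can be checked in a few lines before wiring it into an upload handler. This is a small sketch using Node's built-in `crypto` module; the helper `sha256Hex` and the sample byte contents are made up for illustration (real input would be the PDF's raw bytes).

```typescript
import { createHash } from 'crypto';

// Hex-encoded SHA-256 digest of raw bytes (helper name is illustrative)
function sha256Hex(data: Buffer): string {
  return createHash('sha256').update(data).digest('hex');
}

// Two uploads that share a filename but not their content (made-up sample bytes)
const v1 = { name: 'HR_Policy.pdf', bytes: Buffer.from('30 days notice required') };
const v2 = { name: 'HR_Policy.pdf', bytes: Buffer.from('7 days notice required') };

// The filename cannot tell the versions apart, the content hash can
const sameName = v1.name === v2.name;
const sameContent = sha256Hex(v1.bytes) === sha256Hex(v2.bytes);

// Hashing a byte-for-byte copy gives the same digest (deterministic)
const deterministic = sha256Hex(v1.bytes) === sha256Hex(Buffer.from(v1.bytes));

console.log({ sameName, sameContent, deterministic });
```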

Step 1-2: Node.js Implementation

// src/ingestion/fileHasher.ts
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
import db from '../config/database';
import logger from '../utils/logger';

interface FileHashResult {
  fileName: string;
  fileHash: string;
  alreadyExists: boolean;
  fileSize: number;
}

/**
 * Step 1-2: Upload file and check if already processed
 *
 * ⚠️ CRITICAL POINTS:
 * 1. Hash FILE CONTENT, not filename
 * 2. Same content → same hash (deterministic)
 * 3. Different content → different hash (even if same filename)
 * 4. Check database before processing
 */
export async function handleFileUpload(
  filePath: string,
  tenantId: string
): Promise<FileHashResult> {
  try {
    const fileName = path.basename(filePath);
    const fileSize = fs.statSync(filePath).size;

    // Step 1: Read file and create hash from content
    logger.info('File upload started', { fileName, fileSize });
    const fileContent = fs.readFileSync(filePath);
    const fileHash = crypto
      .createHash('sha256')
      .update(fileContent)
      .digest('hex');

    logger.debug('File hashed', {
      fileName,
      fileHash,
      contentBytes: fileContent.length
    });

    // Step 2: Check if this exact file was already processed
    const existingFile = await db.oneOrNone(
      `SELECT id, created_at FROM documents WHERE file_hash = $1 AND tenant_id = $2`,
      [fileHash, tenantId]
    );

    if (existingFile) {
      logger.warn('Duplicate file detected', {
        fileName,
        fileHash,
        firstUploadedAt: existingFile.created_at
      });
      return { fileName, fileHash, alreadyExists: true, fileSize };
    }

    // File is new: register its metadata (still part of Steps 1-2)
    const documentRecord = await db.one(
      `INSERT INTO documents (tenant_id, file_name, file_hash, file_size, status)
       VALUES ($1, $2, $3, $4, $5) RETURNING id, created_at`,
      [tenantId, fileName, fileHash, fileSize, 'uploaded']
    );

    logger.info('New file registered', {
      documentId: documentRecord.id,
      fileName,
      fileHash: fileHash.substring(0, 12) + '...'
    });

    return { fileName, fileHash, alreadyExists: false, fileSize };
  } catch (error: any) {
    logger.error('File upload error', { error: error.message });
    throw error;
  }
}

/**
 * Verify file integrity (optional but recommended)
 * If file is corrupted, don't process it
 */
export function verifyFileIntegrity(
  filePath: string,
  expectedHash: string
): boolean {
  const fileContent = fs.readFileSync(filePath);
  const actualHash = crypto
    .createHash('sha256')
    .update(fileContent)
    .digest('hex');
  return actualHash === expectedHash;
}

// Database schema for documents table (PostgreSQL)
export const documentsTableSQL = `
  CREATE TABLE IF NOT EXISTS documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    file_name VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64) NOT NULL, -- SHA256 produces 64 char hex
    file_size BIGINT NOT NULL,
    status VARCHAR(50) DEFAULT 'uploaded', -- uploaded, parsing, parsed, chunking, chunked, embedding, complete
    error_message TEXT,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Unique constraint: same file can't be uploaded twice for same tenant
    CONSTRAINT unique_file_per_tenant UNIQUE(tenant_id, file_hash)
  );

  -- PostgreSQL does not accept inline INDEX clauses, so indexes are created separately
  CREATE INDEX IF NOT EXISTS idx_file_hash ON documents (file_hash);
  CREATE INDEX IF NOT EXISTS idx_tenant_status ON documents (tenant_id, status);
`;

👦 Nephew: So if someone uploads the same PDF twice, we detect it and skip?

👨‍🦳 Uncle: Exactly. And we save the embedding cost, which is typically the most expensive step.
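
The handler above reads the whole file into memory with `fs.readFileSync` before hashing it. For large PDFs, a streaming variant may be gentler on memory. This is a sketch, not the article's implementation: `hashFileStreaming` and the demo path are hypothetical names, and only Node's built-in `crypto`, `fs`, `os`, and `path` modules are used.

```typescript
import { createHash } from 'crypto';
import { createReadStream, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Feeds the file through SHA-256 chunk by chunk, so memory use stays
// roughly constant regardless of file size.
function hashFileStreaming(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('sha256');
    createReadStream(filePath)
      .on('data', (chunk) => hash.update(chunk))
      .on('error', reject)
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// Usage: hash a small temporary file (the path is arbitrary)
const demoPath = join(tmpdir(), 'hash-demo.txt');
writeFileSync(demoPath, 'abc');
hashFileStreaming(demoPath).then((digest) => console.log(digest));
```

The digest is identical to the one produced by hashing the full buffer, so it can be checked against the same `file_hash` column.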

Phase 1, Step 3: PDF Parsing - Choosing the Right Tool

👦 Nephew: Now we have the PDF. How do we extract text?

👨‍🦳 Uncle: This is where the real decision happens. There are five major tools. Each has tradeoffs.

👦 Nephew: Five?! Which one should I use?

👨‍🦳 Uncle: Depends on your documents. Let me show you.

The PDF Parsing Landscape

Simple Text PDFs
→ pdf-parse (cheap, simple)
↓
Mixed content (text + tables)
→ PDFPlumber (better)
↓
Complex documents
→ Unstructured (production-grade)
↓
Advanced documents
→ LlamaParse (state-of-art)
  (tables, images, OCR)
↓
Enterprise documents
→ Azure Document Intelligence
  (forms, invoices, scans)

Tool Comparison

| Tool | Best For | Cost | Speed | Table Support | OCR | Metadata | Production Ready |
|---|---|---|---|---|---|---|---|
| pdf-parse | Simple text | ₹0 (free) | ⚡⚡⚡ Fast | ✗ No | ✗ | ✗ | ⚠️ Hobby |
| PDFPlumber | Text + tables | ₹0 | ⚡⚡ Medium | ✓ Basic | ✗ | ⚠️ Limited | ⚠️ Small |
| Unstructured | Normal docs | ₹50-200/mo | ⚡ Slow | ✓ Good | ✓ Basic | ✓ Good | ✓ Yes |
| LlamaParse | Complex docs | ₹100-500/mo | ⚡ Slow | ✓ Excellent | ✓ Advanced | ✓ Excellent | ✓ Yes |
| Azure Doc Int. | Enterprise | ₹500-2000/mo | ⚡ Medium | ✓ Perfect | ✓ Perfect | ✓ Perfect | ✓ Enterprise |

👨‍🦳 Uncle: Let me explain each.

Tool 1: pdf-parse (Free, Simple)

// Simple approach - good for learning, bad for production
const pdf = require('pdf-parse');
const fs = require('fs');

async function extractTextFromPDF(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  console.log(data.text); // Raw text

  // Output: "Company Policy Leave Policy Employees receive..."
}

👨‍🦳 Uncle: Notice what we lost?

Original PDF:
═══════════════════════════════════════
COMPANY POLICY
───────────────────────────────────────
Leave Policy (HEADING)
Employees are entitled to 24 paid leaves annually. (PARAGRAPH)

Leave Types:
- Annual Leave
- Casual Leave (LIST)
═══════════════════════════════════════

pdf-parse Output:
"COMPANY POLICY Leave Policy Employees are entitled to 24 paid leaves annually. Leave Types: Annual Leave Casual Leave"

LOST:
✗ Heading level
✗ List structure
✗ Paragraph breaks
✗ Section organization
✗ Tables (if any)

👦 Nephew: So we just get text soup?

👨‍🦳 Uncle: Yes. And when you embed soup, you get soup answers.

Tool 2: PDFPlumber (Better, but Python-only)

# PDFPlumber - extracts tables better
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):  # number pages from 1
        # Extract text
        text = page.extract_text()

        # Extract tables (if any)
        tables = page.extract_tables()

        print(f"Page {i}:")
        print(f"Text: {text}")
        print(f"Tables: {tables}")

👨‍🦳 Uncle: Better, but still loses structure. And it's Python, not Node.js.

Tool 3: Unstructured (Production Choice)

👨‍🦳 Uncle: This is a common choice for production systems. It preserves structure.

// Using Unstructured via API (Node.js friendly)
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Step 3: Parse PDF using Unstructured
 *
 * IMPORTANT: Unstructured preserves document structure
 * Returns: Array of structured elements
 */
export async function parsePDFWithUnstructured(
  filePath: string
): Promise<any[]> {
  try {
    // The partition endpoint expects multipart/form-data with the file in a `files` field
    const formData = new FormData();
    formData.append('files', fs.createReadStream(filePath));
    formData.append('strategy', 'hi_res'); // High resolution parsing
    formData.append('coordinates', 'true'); // Preserve coordinates (useful for tables)

    const response = await axios.post(
      'https://api.unstructuredapp.io/general/v0/general',
      formData,
      {
        headers: {
          // The API key is sent in its own header rather than as a Bearer token
          'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY,
          'accept': 'application/json',
          ...formData.getHeaders()
        }
      }
    );

    // The response body is itself the array of elements
    const elements = response.data;

    // Output: Structured elements
    // [
    //   { type: "Title", text: "Leave Policy", metadata: {...} },
    //   { type: "Title", text: "Annual Leave", metadata: {...} },
    //   { type: "NarrativeText", text: "Employees are entitled...", metadata: {...} },
    //   { type: "ListItem", text: "Manager approval required", metadata: {...} }
    // ]

    logger.info('PDF parsed with structure preserved', {
      elementCount: elements.length,
      types: [...new Set(elements.map((e: any) => e.type))]
    });

    return elements;
  } catch (error: any) {
    logger.error('Unstructured parsing failed', {
      error: error.message
    });
    throw error;
  }
}

👦 Nephew: So it preserves structure?

👨‍🦳 Uncle: Yes. Look at the difference:

Unstructured Output:
[
  {
    type: "Title",
    text: "Leave Policy",
    metadata: { page_number: 1, section: "Policies" }
  },
  {
    type: "NarrativeText",
    text: "Employees receive 24 leaves",
    metadata: { page_number: 1 }
  },
  {
    type: "ListItem",
    text: "Annual Leave",
    metadata: { page_number: 2 }
  },
  {
    type: "ListItem",
    text: "Casual Leave",
    metadata: { page_number: 2 }
  }
]

Now we know:
✓ What is a title
✓ What is body text
✓ What is a list
✓ Page numbers
✓ Sections
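
Typed elements make it possible to rebuild sections instead of embedding one flat string. The sketch below groups elements under the nearest preceding title. It assumes only the `type`, `text`, and `metadata.page_number` fields seen in the sample output; `ParsedElement`, `Section`, and `groupByTitle` are hypothetical names, and any element that is not a `Title` is treated as body text.

```typescript
// Minimal element shape, assumed from the sample output above
interface ParsedElement {
  type: string;
  text: string;
  metadata?: { page_number?: number };
}

interface Section {
  title: string;
  body: string[];
}

// Starts a new section at every Title and attaches the elements that follow to it.
function groupByTitle(elements: ParsedElement[]): Section[] {
  const sections: Section[] = [];
  let current: Section | undefined;
  for (const el of elements) {
    if (el.type === 'Title') {
      current = { title: el.text, body: [] };
      sections.push(current);
      continue;
    }
    // Body text that appears before any title goes into a placeholder section
    if (!current) {
      current = { title: '(untitled)', body: [] };
      sections.push(current);
    }
    current.body.push(el.text);
  }
  return sections;
}

// Made-up sample input in the same shape as the parser output
const sampleElements: ParsedElement[] = [
  { type: 'Title', text: 'Leave Policy', metadata: { page_number: 1 } },
  { type: 'NarrativeText', text: 'Employees receive 24 leaves', metadata: { page_number: 1 } },
  { type: 'Title', text: 'Leave Types', metadata: { page_number: 2 } },
  { type: 'ListItem', text: 'Annual Leave', metadata: { page_number: 2 } },
  { type: 'ListItem', text: 'Casual Leave', metadata: { page_number: 2 } }
];

console.log(JSON.stringify(groupByTitle(sampleElements), null, 2));
```

Sections built this way give later steps a natural boundary to respect, so a heading stays with the text it introduces.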

Tool 4: LlamaParse (State of the Art)

👨‍🦳 Uncle: For really complex documents, use LlamaParse.

// LlamaParse - best for complex PDFs
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse PDF with LlamaParse (best for:
 * - Multi-column layouts
 * - Tables with merged cells
 * - Images with text
 * - Scanned documents (OCR)
 * - Footnotes and annotations
 */
export async function parsePDFWithLlamaParse(
  filePath: string
): Promise<any> {
  try {
    // Step 1: Upload file
    const formData = new FormData();
    formData.append('file', fs.createReadStream(filePath));
    formData.append(
      'parsing_instruction',
      `Extract all content including:
       - Tables with proper structure
       - Column layouts
       - Images with OCR
       - Headings and sections
       - Metadata like page numbers`
    );

    const uploadResponse = await axios.post(
      'https://api.llamaindex.ai/api/parsing/upload_file',
      formData,
      {
        headers: {
          'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`,
          ...formData.getHeaders()
        }
      }
    );

    const jobId = uploadResponse.data.id;

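    // Step 2: Poll for results
    let result = null;
    for (let i = 0; i < 60; i++) {
      const statusResponse = await axios.get(
        `https://api.llamaindex.ai/api/parsing/job/${jobId}/result/markdown`,
        {
          headers: {
            'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`
          },
          // axios rejects on non-2xx responses by default, which would abort the
          // loop on the first "not ready" reply; accept any status and check it below
          validateStatus: () => true
        }
      );

      if (statusResponse.status === 200) {
        result = statusResponse.data;
        break;
      }

      // Wait 2 seconds before retry
      await new Promise(resolve => setTimeout(resolve, 2000));
    }

    // Fail loudly instead of returning an empty result
    if (result === null) {
      throw new Error(`LlamaParse job ${jobId} did not finish after 60 polls`);
    }

    logger.info('LlamaParse completed', { jobId });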

    return {
      markdown: result,
      parsedAt: new Date()
    };
  } catch (error: any) {
    logger.error('LlamaParse failed', { error: error.message });
    throw error;
  }
}
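
The polling loop in the function above can be pulled into a small reusable helper, which also makes the "gave up" case explicit. This is a sketch under stated assumptions: `pollUntil` and its parameters are hypothetical, and the fake job in the usage example stands in for a real status request.

```typescript
// Calls `attempt` until it returns a non-null value or the attempt budget runs out.
// `attempt` should resolve to null while the job is still running.
async function pollUntil<T>(
  attempt: () => Promise<T | null>,
  maxAttempts: number,
  delayMs: number
): Promise<T> {
  for (let i = 0; i < maxAttempts; i++) {
    const value = await attempt();
    if (value !== null) return value;
    // Sleep between attempts, but not after the last one
    if (i < maxAttempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Gave up after ${maxAttempts} attempts`);
}

// Usage with a fake job that becomes ready on the third check
let checks = 0;
const fakeJob = async (): Promise<string | null> => (++checks < 3 ? null : '# Parsed markdown');

pollUntil(fakeJob, 60, 10).then((markdown) => console.log(markdown));
```

Throwing on exhaustion keeps a stuck parsing job from being recorded as a successful, empty parse.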

👦 Nephew: When do I use LlamaParse vs Unstructured?

👨‍🦳 Uncle: Simple rule:

Document type?
├─ Simple text policies
│  └─→ pdf-parse (free)
│
├─ Text + basic tables
│  └─→ Unstructured (cheap, good)
│
├─ Complex tables, multi-column
│  └─→ LlamaParse (excellent)
│
└─ Enterprise documents, forms, invoices
   └─→ Azure Document Intelligence (best)
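
The rule above can be written as a small routing function. This is only a sketch of the decision tree: the `DocumentProfile` flags are hypothetical, and how they get detected (file inspection, user input, a first cheap parse) is left open.

```typescript
type Parser = 'pdf-parse' | 'Unstructured' | 'LlamaParse' | 'Azure Document Intelligence';

// Hypothetical document traits; detecting them is up to your pipeline
interface DocumentProfile {
  isFormOrInvoice: boolean;
  hasComplexTables: boolean;
  isMultiColumn: boolean;
  hasBasicTables: boolean;
}

// Checks the most demanding case first and falls back to the cheapest parser
function chooseParser(doc: DocumentProfile): Parser {
  if (doc.isFormOrInvoice) return 'Azure Document Intelligence';
  if (doc.hasComplexTables || doc.isMultiColumn) return 'LlamaParse';
  if (doc.hasBasicTables) return 'Unstructured';
  return 'pdf-parse';
}

// Usage: a plain-text policy with one simple table
const policy: DocumentProfile = {
  isFormOrInvoice: false,
  hasComplexTables: false,
  isMultiColumn: false,
  hasBasicTables: true
};
console.log(chooseParser(policy));
```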

Tool 5: Azure Document Intelligence (Enterprise)

// Azure Document Intelligence - for enterprise documents
import { DocumentAnalysisClient, AzureKeyCredential } from "@azure/ai-form-recognizer";
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse with Azure (best for:
 * - Invoices
 * - Forms
 * - Bank documents
 * - Scanned PDFs with OCR
 */
export async function parseWithAzureDocumentIntelligence(
  filePath: string
) {
  try {
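    // Fail early with a clear message if the credentials are not configured
    const endpoint = process.env.AZURE_FORM_RECOGNIZER_ENDPOINT;
    const apiKey = process.env.AZURE_FORM_RECOGNIZER_KEY;
    if (!endpoint || !apiKey) {
      throw new Error('AZURE_FORM_RECOGNIZER_ENDPOINT and AZURE_FORM_RECOGNIZER_KEY must be set');
    }

    const client = new DocumentAnalysisClient(
      endpoint,
      new AzureKeyCredential(apiKey)
    );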

    const fileContent = fs.readFileSync(filePath);

    // Choose model based on document type
    const poller = await client.beginAnalyzeDocument(
      "prebuilt-document",
      fileContent
    );

    const result = await poller.pollUntilDone();

    logger.info('Azure Document Intelligence parsing complete', {
      pages: result.pages?.length,
      tables: result.tables?.length,
      keyValuePairs: result.keyValuePairs?.length
    });

    return result;
  } catch (error: any) {
    logger.error('Azure Document Intelligence failed', {
      error: error.message
    });
    throw error;
  }
}
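
The tables in the result arrive as a flat list of cells, so a small conversion step is usually needed before chunking. The sketch below rebuilds a row-by-column grid. The `TableCell` and `Table` interfaces are a local subset assumed from the SDK's table type (`rowCount`, `columnCount`, and cells with `rowIndex`, `columnIndex`, `content`); `tableToRows` is a hypothetical helper, and merged cells are not handled.

```typescript
// Local subset of the table shape found in result.tables (assumed field names)
interface TableCell {
  rowIndex: number;
  columnIndex: number;
  content: string;
}

interface Table {
  rowCount: number;
  columnCount: number;
  cells: TableCell[];
}

// Places each cell's text at its row/column position; missing cells stay empty.
function tableToRows(table: Table): string[][] {
  const rows: string[][] = Array.from({ length: table.rowCount }, () =>
    new Array<string>(table.columnCount).fill('')
  );
  for (const cell of table.cells) {
    const row = rows[cell.rowIndex];
    // Ignore cells that point outside the declared grid
    if (row && cell.columnIndex >= 0 && cell.columnIndex < table.columnCount) {
      row[cell.columnIndex] = cell.content;
    }
  }
  return rows;
}

// Made-up sample table with one missing cell
const sampleTable: Table = {
  rowCount: 2,
  columnCount: 2,
  cells: [
    { rowIndex: 0, columnIndex: 0, content: 'Leave Type' },
    { rowIndex: 0, columnIndex: 1, content: 'Days' },
    { rowIndex: 1, columnIndex: 0, content: 'Annual Leave' }
  ]
};

console.log(tableToRows(sampleTable));
```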
