DEV Community

I Built an AI Pipeline for 10,000 Daily Listings. Here's What Broke at Scale.

I watched a pipeline I spent weeks building get shut down in one meeting. The AI rewrite engine for a job platform's listing descriptions was working. Output quality was solid. But at 10,000 listings a day, the API bill hit a number the client couldn't stomach. The pipeline went dark.

That moment taught me more about production AI than any tutorial ever did. If you're an engineering lead or founder evaluating whether to build AI into your product, most of what you'll read skips the hard part. The architecture patterns. The cost math that makes or breaks a feature. The failure modes that only surface at scale.

Here's what I learned from shipping an LLM scoring pipeline that processes 10,000+ items daily, and where I almost got it wrong.

The Day Function Calling Saved Us From Hallucinations

Raw prompts are fine for demos. They're dangerous in production. My first version of the listing scoring system used a GPT prompt with instructions like "extract the skills, location, and salary range from this job description." The output was a paragraph of text I had to regex-parse. It broke constantly. Missed fields. Fabricated data. The worst part: you couldn't tell when it was wrong.

I switched to OpenAI function calling with a strict JSON schema. Here's the pattern that made the difference:

const scoringSchema = {
  name: "score_listing",
  description: "Extract structured data and score a job listing",
  parameters: {
    type: "object",
    properties: {
      title: { type: "string" },
      company: { type: "string" },
      skills: { type: "array", items: { type: "string" } },
      location_type: { type: "string", enum: ["remote", "hybrid", "onsite"] },
      has_salary: { type: "boolean" },
      salary_min: { type: "number" },
      salary_max: { type: "number" },
      relevance_score: { type: "number", minimum: 0, maximum: 100 }
    },
    required: ["title", "company", "skills", "location_type", "has_salary", "relevance_score"]
  }
};

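The key insight: the has_salary boolean acts as a guard. salary_min and salary_max are deliberately left out of the required list, so when has_salary is false the model has no reason to fill them in, and downstream code can discard them if it does. This simple pattern eliminated fabricated salary data from our system. Before this, the model would guess a salary range even when none was listed.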
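To make the guard concrete, here is a minimal sketch of how the returned arguments could be checked before they reach the database. It assumes the function call arguments arrive as a JSON string; parseScoredListing and its error messages are hypothetical names for illustration, not code from the original pipeline.

```javascript
// Hypothetical helper: parse and sanity-check the arguments returned for score_listing.
const LOCATION_TYPES = ["remote", "hybrid", "onsite"];

function parseScoredListing(argumentsJson) {
  const data = JSON.parse(argumentsJson); // throws on malformed JSON

  for (const field of ["title", "company"]) {
    if (typeof data[field] !== "string" || data[field].length === 0) {
      throw new Error(`Missing or invalid field: ${field}`);
    }
  }
  if (!Array.isArray(data.skills)) {
    throw new Error("skills must be an array");
  }
  if (!LOCATION_TYPES.includes(data.location_type)) {
    throw new Error(`Unexpected location_type: ${data.location_type}`);
  }
  if (typeof data.has_salary !== "boolean") {
    throw new Error("has_salary must be a boolean");
  }
  const score = data.relevance_score;
  if (typeof score !== "number" || score < 0 || score > 100) {
    throw new Error("relevance_score must be a number between 0 and 100");
  }

  // The guard: salary numbers only survive when has_salary is true.
  const keepSalary = (value) =>
    data.has_salary && typeof value === "number" ? value : null;

  return {
    title: data.title,
    company: data.company,
    skills: data.skills,
    location_type: data.location_type,
    has_salary: data.has_salary,
    salary_min: keepSalary(data.salary_min),
    salary_max: keepSalary(data.salary_max),
    relevance_score: score,
  };
}
```

Checking the payload again in application code is a belt-and-braces choice: the schema shapes what the model returns, and this step refuses anything that still slips through.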

Function calling with typed, validated schemas turned LLM output from something you hope is correct into something you can trust programmatically. It's the single highest-leverage change you can make in an AI pipeline.

The 23x Cost Gap Nobody Talks About

The rewrite pipeline was shut down because GPT-4 class models cost too much at scale. $0.01 per listing sounds cheap. At 300,000 listings a month, it's $3,000. For one feature.

Here's what I learned about model selection:

  • Match the model to the task complexity. Classification and extraction tasks don't need the most expensive model. They need reliable structured output. I now use GPT-4o mini for extraction and batch processing through OpenAI's Batch API, which cuts cost by 50% compared to synchronous calls. The latency tradeoff is fine for batch workloads.
  • Evaluate aggressively. After the rewrite pipeline was blocked, I started testing DeepSeek V4 Flash as a replacement. Early results show comparable output quality at roughly 23x lower cost. That gap is the difference between a pipeline that ships and one that dies in a board meeting.
  • Batch everything that isn't real-time. The listing scoring pipeline runs on a schedule. There's zero reason to pay for synchronous latency. The Batch API processes the same work at half the cost, with results arriving asynchronously, typically within a few hours and at most within its 24-hour completion window. For non-interactive workloads, it's free money.
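The arithmetic behind these numbers is simple enough to keep in a script. This is a rough sketch using only the figures quoted above ($0.01 per listing, 10,000 listings a day, a 50% batch discount, and the roughly 23x gap from early testing); monthlyCost is a hypothetical helper, and a 30-day month is an assumption.

```javascript
// All inputs are illustrative; plug in your own measured per-listing cost.
function monthlyCost(costPerListing, listingsPerDay, daysPerMonth = 30) {
  return costPerListing * listingsPerDay * daysPerMonth;
}

const baseline = monthlyCost(0.01, 10000); // about $3,000 a month
const batched = baseline * 0.5;            // same model via batch pricing: about $1,500
const cheaperModel = baseline / 23;        // a model 23x cheaper: about $130

console.log({
  baseline: baseline.toFixed(2),
  batched: batched.toFixed(2),
  cheaperModel: cheaperModel.toFixed(2),
});
```

The two savings are shown separately on purpose: whether a batch discount applies to a given replacement model depends on that provider's pricing.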

The pipeline is still offline pending evaluation results. But the pattern is clear: model selection is a business decision, not just a technical one.

Retry Strategies That Don't Cascade

LLM APIs fail. Not rarely. Often enough that you need a strategy. In my first version, a failed API call would trigger an immediate retry. When the rate limit kicked in, every concurrent request would retry simultaneously, creating a thundering herd that made the problem worse.

The fix was a three-tier retry with steeply increasing backoff delays:

  • First retry: 1 second delay. Catches transient network issues.
  • Second retry: 5 second delay. Handles most rate limits.
  • Third retry: 30 second delay. Final attempt before failing to a dead letter queue.

The dead letter queue was critical. Failed items get logged with the error type and model response, stored for manual review or reprocessing after fixes. Without it, you lose visibility into systemic failures.
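As a rough illustration of what such a record might hold, here is a sketch of a dead letter entry and a small summary over a batch of them. The field names, the listing.id property, and both helper functions are assumptions made for this example, not the pipeline's actual format.

```javascript
// Hypothetical shape for a dead letter entry; field names are assumptions.
function toDeadLetterEntry(listing, error, attempt, modelResponse = null) {
  return {
    listingId: listing.id,
    errorType: error && error.name ? error.name : "UnknownError",
    errorMessage: error && error.message ? error.message : String(error),
    // HTTP status, if the client library attached one to the error (e.g. 429).
    status: error && typeof error.status === "number" ? error.status : null,
    modelResponse,
    attempts: attempt + 1, // attempt is zero-based
    failedAt: new Date().toISOString(),
  };
}

// Count entries per error type to spot systemic failures at a glance.
function countByErrorType(entries) {
  const counts = {};
  for (const entry of entries) {
    counts[entry.errorType] = (counts[entry.errorType] || 0) + 1;
  }
  return counts;
}
```

A sudden jump in one error type is usually the first sign that the failures are systemic rather than caused by bad input.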

Here's the pattern I use now:

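// Delays before each retry, matching the three tiers above (1s, 5s, 30s).
const RETRY_DELAYS_MS = [1000, 5000, 30000];

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processWithRetry(listing, delays = RETRY_DELAYS_MS) {
  // One initial attempt plus one retry per configured delay.
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await callLLMApi(listing);
    } catch (error) {
      if (attempt === delays.length) {
        // Retries exhausted: park the item for review instead of dropping it.
        await deadLetterQueue.push({ listing, error, attempt });
        return null;
      }
      // Random jitter keeps concurrent workers from retrying in lockstep.
      const jitter = Math.random() * 0.25 * delays[attempt];
      await sleep(delays[attempt] + jitter);
    }
  }
}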

This pattern processes 10,000 listings daily with fewer than 50 failures. Most are from genuinely bad input data, not API issues.

Monitoring Costs More Than You Expect

The biggest infrastructure cost of running this pipeline wasn't the LLM API calls. It was the database. The platform uses MongoDB Atlas. The scraping system used deep skip-based pagination to iterate through 1M+ listings. Because skip has to walk past every skipped document before it returns anything, CPU usage grew with the offset, and the cluster hit regular spikes during scraping runs.

The fix was pragmatic: limit scraping concurrency and document the long-term solution (cursor-based pagination). The CPU spikes stopped immediately.
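For reference, the long-term solution can be sketched in a few lines. This is a minimal example of cursor-based (keyset) pagination on _id, assuming a collection object that exposes the MongoDB Node.js driver's find().sort().limit().toArray() chain; iterateInBatches is a hypothetical name and the batch size is arbitrary.

```javascript
// Walks a collection in _id order, one batch at a time, without skip().
// Each query starts right after the last _id seen, so it can use the
// default _id index instead of scanning past earlier documents.
async function* iterateInBatches(collection, batchSize = 1000) {
  let lastId = null;
  while (true) {
    const filter = lastId === null ? {} : { _id: { $gt: lastId } };
    const batch = await collection
      .find(filter)
      .sort({ _id: 1 })
      .limit(batchSize)
      .toArray();
    if (batch.length === 0) return;
    yield batch;
    lastId = batch[batch.length - 1]._id;
  }
}
```

It would be consumed with a for await loop over iterateInBatches(collection, 500). The cost of each query then stays roughly flat no matter how deep into the collection the scraper is.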

The lesson: LLM pipelines don't exist in isolation. They interact with your database, your cache, your CDN, your rate limits. A bot crawling your sitemaps can cost you more than a week of AI inference. I once watched a single Meta crawler session pull 35GB of data before we blocked it at the Cloudflare edge.

Every AI pipeline needs observability across the full stack, not just the model calls. Sentry for errors. Cloudflare WAF for traffic filtering. Database monitoring for query performance. If you're only watching your LLM usage dashboard, you're missing the real cost.

When To Build and When To Buy

The honest answer: build when your pipeline needs to process proprietary data or run custom logic that no off-the-shelf tool handles. Buy when your use case is generic chat or document Q&A over public data.

My pipeline had to score listings against specific business rules that changed weekly. No SaaS tool could do that. Building was the right call. But if you're adding a chatbot to your documentation site, don't build a RAG pipeline from scratch. Use a service. The engineering time you save will pay for the subscription many times over.

The hard middle ground is where most teams get stuck. You have unique data but generic use cases. In that case, build a thin pipeline over an existing model API. Don't fine-tune. Don't train. Just wire up function calls with good schemas and let the API do the heavy lifting.

If your team is wrestling with the cost and reliability of an AI feature you're trying to ship, that's exactly the kind of bottleneck I help teams break through. Happy to compare notes on what's worked in production.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.
