DEV Community 1h ago

I Built a Live Video Streaming Engine That Heals Itself

Problem 1: Most Live Streams Don't Actually Work

I source streams from iptv-org, a community-maintained database of publicly available television streams. It has 13,000+ entries. Sounds great on paper. In practice, roughly 40% of those entries are dead at any given moment. Channels shut down, URLs change, servers go offline. If I just pick a random stream and load it, there's roughly a two-in-five chance the user stares at a black screen.

Why this is hard: There's no central authority telling you which streams are alive. The database updates periodically, but the actual server status changes minute by minute.

How I solved it: I built a background validation pipeline that continuously tests streams before the user ever sees them. The system maintains a buffer of 20 pre-validated, ready-to-play streams at all times. When the user clicks "next channel," they get a stream that was already proven alive seconds ago. Zero loading screens, zero dead feeds. The validation loop runs on a 600ms throttle to avoid flooding the network. When the buffer is full, it sleeps for 5 seconds and checks again.
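A minimal sketch of what such a pre-warming loop could look like, using the numbers from the paragraph above (buffer of 20, 600ms throttle, 5-second sleep when full). Every name here (`planNextStep`, `prewarmLoop`, `nextChannel`, the `Validator` type) is hypothetical and not taken from the project's code; the candidate picker and the validator are passed in as assumptions.

```typescript
// Hypothetical sketch of a pre-warming loop; names are illustrative only.
type Validator = (url: string) => Promise<boolean>;

const BUFFER_TARGET = 20;          // ready-to-play streams to keep on hand
const THROTTLE_MS = 600;           // pause between validation attempts
const FULL_BUFFER_SLEEP_MS = 5000; // pause when the buffer is already full

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Decide whether to validate another stream and how long to wait afterwards. */
function planNextStep(bufferSize: number): { validate: boolean; waitMs: number } {
  return bufferSize >= BUFFER_TARGET
    ? { validate: false, waitMs: FULL_BUFFER_SLEEP_MS }
    : { validate: true, waitMs: THROTTLE_MS };
}

/** Background loop: keeps topping up `buffer` until `shouldStop` returns true. */
async function prewarmLoop(
  pickCandidate: () => string,
  validate: Validator,
  buffer: string[],
  shouldStop: () => boolean,
): Promise<void> {
  while (!shouldStop()) {
    const step = planNextStep(buffer.length);
    if (step.validate) {
      const url = pickCandidate();
      if (await validate(url)) buffer.push(url);
    }
    await sleep(step.waitMs);
  }
}

/** "Next channel": hand out a stream that has already passed validation. */
function nextChannel(buffer: string[]): string | undefined {
  return buffer.shift();
}
```

Keeping the wait decision in a small pure function makes the throttle easy to test without real timers.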

Problem 2: A Stream Can Look Alive But Still Be Broken

This was the bug that took me the longest to figure out. HLS streams (the format most IPTV uses) work in two steps: first, the browser downloads a .m3u8 manifest file (an index of video segments), then it downloads the actual .ts video segments. Many servers set CORS headers on the manifest but not on the video segments. Or they use geo-IP restrictions that block the segments based on your location. The result: the manifest loads perfectly. You think you have a working stream. But when the player tries to play actual video - nothing. Black screen.

Why this is hard: A simple "does this URL return a 200?" check is useless. The manifest responds fine. The failure only happens at the fragment level, which you can only discover by actually attempting playback.

How I solved it: I built a Dual-Gate Validation system:

Gate 1 (Manifest Check): I spin up a headless Hls.js instance attached to a hidden <video> element. When the MANIFEST_PARSED event fires, I know the manifest is accessible.
Gate 2 (Fragment Check): I don't stop there. I actually start silent playback and wait for the FRAG_LOADED event, which Hls.js fires once a real video segment has finished downloading.

Only when both gates pass within 5 seconds do I consider the stream valid. This single design decision eliminated an entire class of "it loaded but won't play" bugs.
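The two gates can be sketched as a small state machine over player events. To keep the example runnable outside a browser, it is written against a hypothetical `PlayerLike` interface rather than Hls.js itself; in a real adapter the three events would presumably map to `Hls.Events.MANIFEST_PARSED`, `Hls.Events.FRAG_LOADED`, and `Hls.Events.ERROR`, and `load` would attach a hidden, muted `<video>` element and start playback. `runGates` and `validateStream` are illustrative names, not the project's.

```typescript
// Hypothetical minimal player surface; an Hls.js adapter is assumed, not shown.
type GateEvent = "manifestParsed" | "fragLoaded" | "error";

interface PlayerLike {
  on(event: GateEvent, handler: () => void): void;
  load(url: string): void;
  destroy(): void;
}

/** Reports true only if the manifest AND a fragment load before the timeout. */
function runGates(
  player: PlayerLike,
  url: string,
  timeoutMs: number,
  onResult: (ok: boolean) => void,
): void {
  let manifestOk = false;
  let settled = false;

  const finish = (ok: boolean): void => {
    if (settled) return; // report exactly once
    settled = true;
    clearTimeout(timer);
    player.destroy();    // always release the headless player
    onResult(ok);
  };
  const timer = setTimeout(() => finish(false), timeoutMs);

  player.on("manifestParsed", () => { manifestOk = true; }); // Gate 1
  player.on("fragLoaded", () => finish(manifestOk));         // Gate 2
  player.on("error", () => finish(false));
  player.load(url);
}

/** Promise wrapper using the 5-second budget described above. */
function validateStream(
  createPlayer: () => PlayerLike,
  url: string,
  timeoutMs = 5000,
): Promise<boolean> {
  return new Promise((resolve) => runGates(createPlayer(), url, timeoutMs, resolve));
}
```

A manifest that parses but never yields a fragment simply runs into the timeout and is reported as invalid, which is the "looks alive but is broken" case.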

Problem 3: One Bad Server Wastes Hundreds of Validation Attempts

Once I had dual-gate validation working, I noticed a pattern: when a stream from cdn.example.com fails due to CORS, every other stream on that same CDN fails too. They share the same server configuration. I was wasting 5 seconds per stream, testing dozens of URLs that were all going to fail for the same reason.

Why this is hard: The failure isn't at the URL level - it's at the domain level. You need to recognize patterns across failures, not just handle them individually.

How I solved it: I built a domain-level CORS blocklist that acts as a circuit breaker for entire infrastructure clusters. When a stream fails due to a CORS or network error, I extract the hostname from its URL and blocklist that whole host. Before any future validation attempt, I check the candidate's hostname against an in-memory Set - O(1) lookup, zero network requests wasted. The blocklist is persisted to IndexedDB so it survives page reloads. Over a single session, the system rapidly eliminates clusters of broken infrastructure. After 10 minutes of use, candidate selection becomes dramatically faster because all the known-bad domains are already filtered out.
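A rough sketch of the hostname blocklist, assuming the standard `URL` class for parsing. `DomainBlocklist` and its methods are hypothetical names; persistence is reduced to a plain array snapshot that could be handed to whatever storage layer is in use (the post mentions IndexedDB via localforage). Treating unparseable URLs as blocked is an assumption of this sketch.

```typescript
// Hypothetical in-memory blocklist keyed by hostname.
class DomainBlocklist {
  private readonly blocked = new Set<string>();

  /** Returns the hostname, or null if the URL cannot be parsed. */
  static hostnameOf(url: string): string | null {
    try {
      return new URL(url).hostname;
    } catch {
      return null;
    }
  }

  /** Call when a stream fails with a CORS or network error. */
  recordFailure(url: string): void {
    const host = DomainBlocklist.hostnameOf(url);
    if (host) this.blocked.add(host);
  }

  /** Set lookup, no network request. Unparseable URLs count as blocked. */
  isBlocked(url: string): boolean {
    const host = DomainBlocklist.hostnameOf(url);
    return host === null || this.blocked.has(host);
  }

  /** Plain array that can be written to persistent storage. */
  snapshot(): string[] {
    return Array.from(this.blocked);
  }

  /** Rebuild the blocklist from a previously stored snapshot. */
  static restore(hosts: string[]): DomainBlocklist {
    const list = new DomainBlocklist();
    hosts.forEach((h) => list.blocked.add(h));
    return list;
  }
}
```

For example, after `recordFailure("https://cdn.example.com/live/a.m3u8")`, any other candidate on `cdn.example.com` is skipped before validation starts.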

Problem 4: The System Needs to Learn Which Streams Are Reliable

Random selection from 13,000 streams is wasteful. Some channels have been reliably online for months. Others are flaky - they work sometimes and fail randomly. I needed the system to prioritize proven reliable streams without me manually curating a list.

Why this is hard: You can't just pick the "best" streams and hardcode them. Stream reliability changes over time. You need a system that adapts.

How I solved it: I built a telemetry-driven health scoring system. Every stream gets a health score (0–100), stored in IndexedDB:

New/untested streams start at 60 (neutral-positive, worth trying)
Successful dual-gate validation: +20 points
Failed validation: -25 points (harsh penalty, by design)
CORS failures additionally flag the stream as corsCompatible: false
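The scoring rules above could be expressed roughly as follows. The `StreamHealth` shape, the `Outcome` type, and the function names are hypothetical; only the numbers (start at 60, +20, -25, range 0-100) and the `corsCompatible` flag come from the list. Leaving the flag unchanged on a later success is an assumption of this sketch.

```typescript
// Hypothetical record shape; only `corsCompatible` is named in the post.
interface StreamHealth {
  score: number;          // 0-100
  corsCompatible: boolean;
}

type Outcome = "success" | "failure" | "cors-failure";

const INITIAL_SCORE = 60;
const SUCCESS_BONUS = 20;
const FAILURE_PENALTY = 25;

const clampScore = (n: number): number => Math.min(100, Math.max(0, n));

/** Health record for a stream that has never been tested. */
function newStreamHealth(): StreamHealth {
  return { score: INITIAL_SCORE, corsCompatible: true };
}

/** Returns an updated copy; the input record is not mutated. */
function applyOutcome(health: StreamHealth, outcome: Outcome): StreamHealth {
  if (outcome === "success") {
    return { ...health, score: clampScore(health.score + SUCCESS_BONUS) };
  }
  return {
    score: clampScore(health.score - FAILURE_PENALTY),
    // A CORS failure also flags the stream; other failures leave the flag as-is.
    corsCompatible: outcome === "cors-failure" ? false : health.corsCompatible,
  };
}
```

With these numbers a new stream reaches the ceiling after two successes (60, 80, 100) and the floor after three failures (60, 35, 10, 0).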

The pre-warming loop doesn't pick streams randomly - it uses weighted random selection where each stream's probability of being chosen is proportional to its health score. Reliable feeds naturally float to the top. Unreliable ones sink but aren't completely eliminated (minimum weight of 5 allows occasional retries in case they've been fixed). The result: the longer someone uses the app, the better the stream selection gets. It's a self-improving system.
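One common way to implement this kind of selection is a cumulative-weight walk, sketched below. The `Candidate` type and `pickWeighted` are hypothetical names; the minimum weight of 5 is the one mentioned above, and the injectable `random` parameter is only there to make the function deterministic in tests.

```typescript
// Hypothetical weighted random pick; weight = max(5, health score).
const MIN_WEIGHT = 5;

interface Candidate {
  url: string;
  score: number; // health score, 0-100
}

function pickWeighted(
  candidates: Candidate[],
  random: () => number = Math.random,
): Candidate {
  if (candidates.length === 0) throw new Error("no candidates to pick from");

  const weights = candidates.map((c) => Math.max(MIN_WEIGHT, c.score));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Walk the cumulative weights until the random point is passed.
  let remaining = random() * total;
  for (let i = 0; i < candidates.length; i++) {
    remaining -= weights[i];
    if (remaining < 0) return candidates[i];
  }
  // Guard against floating-point edge cases.
  return candidates[candidates.length - 1];
}
```

With one stream at score 90 and one at score 0, the weights are 90 and 5, so the weaker stream is still tried about 5 times in 95.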

Problem 5: What Happens When Everything Fails at Once?

Network goes down. ISP throttles connections. The iptv-org data has a catastrophically bad batch. Whatever the reason, sometimes the player hits a streak of failures that the normal recovery path can't handle. Without protection, the app would spin endlessly trying dead streams.

Why this is hard: You can't just retry forever. But you also can't just give up and show an error screen. You need a middle ground that protects the user experience while the system recovers.

How I solved it: I implemented a three-state circuit breaker (borrowed from distributed systems design):

CLOSED (normal): Streams play normally. If 2+ playback errors occur within a 5-second window, the breaker trips.
OPEN (tripped): All playback halts for 3 seconds. The system stops trying to load new streams entirely, giving transient issues time to clear.
HALF-OPEN (recovery): The system loads one of 6 hardcoded "safe" fallback streams (such as DW, France 24, and Al Jazeera - channels with near-100% uptime). If it plays cleanly for 10 seconds, the breaker resets. If it fails again, it cycles back to OPEN with a different fallback.

This means even in the worst-case scenario, the user always lands on a working stream within seconds. The system self-heals.
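The three states can be sketched as a small class driven by explicit timestamps, which keeps it testable without real timers. `PlaybackBreaker`, `onError`, and `tick` are hypothetical names; the thresholds (2 errors in 5 seconds, 3-second halt, 10 seconds of clean playback) are the ones listed above. Rotating through the fallbacks in order is an assumption, since the post only says a different fallback is used.

```typescript
// Hypothetical three-state breaker; all times are in milliseconds.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

const ERROR_THRESHOLD = 2;
const ERROR_WINDOW_MS = 5000;
const OPEN_COOLDOWN_MS = 3000;
const CLEAN_PLAY_MS = 10000;

class PlaybackBreaker {
  private state: BreakerState = "CLOSED";
  private errorTimes: number[] = [];
  private openedAt = 0;
  private halfOpenAt = 0;
  private fallbackIndex = -1;
  private readonly fallbacks: string[];

  constructor(fallbacks: string[]) {
    this.fallbacks = fallbacks;
  }

  getState(): BreakerState {
    return this.state;
  }

  /** Report a playback error observed at time `now`. */
  onError(now: number): void {
    if (this.state === "HALF_OPEN") { this.trip(now); return; }
    if (this.state === "OPEN") return; // already halted
    this.errorTimes = this.errorTimes.filter((t) => now - t < ERROR_WINDOW_MS);
    this.errorTimes.push(now);
    if (this.errorTimes.length >= ERROR_THRESHOLD) this.trip(now);
  }

  /** Advance the clock. Returns a fallback URL to load when entering HALF_OPEN. */
  tick(now: number): string | null {
    if (this.state === "OPEN" && now - this.openedAt >= OPEN_COOLDOWN_MS) {
      this.state = "HALF_OPEN";
      this.halfOpenAt = now;
      this.fallbackIndex = (this.fallbackIndex + 1) % this.fallbacks.length;
      return this.fallbacks[this.fallbackIndex];
    }
    if (this.state === "HALF_OPEN" && now - this.halfOpenAt >= CLEAN_PLAY_MS) {
      this.state = "CLOSED"; // fallback played cleanly, resume normal operation
    }
    return null;
  }

  private trip(now: number): void {
    this.state = "OPEN";
    this.openedAt = now;
    this.errorTimes = [];
  }
}
```

The caller would invoke `tick` periodically (for example from the player's time-update handler) and load whatever fallback URL it returns.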

The Architecture in One Sentence

A background pipeline that pulls from a catalogue of 13,000+ IPTV streams, validates them through a dual-gate check in a headless video player, scores them with telemetry, blocklists broken domains, maintains a pre-warmed buffer of 20 ready-to-play streams, and is protected by a circuit breaker - all running entirely client-side in the browser, with zero backend servers.

What I'd Do Differently

This is a craft project, not a product. My seniors rightly pointed out that there's no strong "why" for an end user - why would someone in India watch a random TV channel in a language they don't understand? Fair point. But the engineering problems were real. Handling unreliable distributed systems, building self-healing pipelines, implementing circuit breakers, designing telemetry-driven selection algorithms - these are the same problems you'd face at any company dealing with third-party integrations, video infrastructure, or distributed data. The game was the wrapper. The engineering was the point.

Tech Stack

Layer	Technology
Core	React 19, TypeScript, Vite
3D Rendering	Three.js, React Three Fiber
Video Playback	Hls.js (headless for validation, visible for playback)
Persistent Storage	IndexedDB (localforage)
State Management	Zustand

If you're interested in the code: github.com/Chaitaneya/Geo-Stream

Try the raw engine: Live Demo

Play the game I built on top of it: Try Geo-Stream

If you've solved similar problems with unreliable third-party streams or built self-healing pipelines, I'd love to hear how you approached it.
